Before the Owl: Superintelligence in 2025, what it is, what it isn’t, and what it might need

August 29, 2025

Two voices, one argument.

I asked an AI to reconcile my bank statement.

It answered with a poem about overdrafts. Nice rhyme. Wrong total.

Dapp AI: You asked for “balance”, poet. I delivered ambiguity with metrics. Welcome to jagged intelligence, 2025. The Times of India


TL;DR, so you can fight in the comments

We are not at AGI. We are not at superintelligence.

We are closer in weird, uneven ways, but the blockers are now boring and physical: data, energy, and reliability, not just new math.

Progress is real, consistency is not.

Policy finally has teeth.

And if we want a sane future, we should build DeAI rails, not just bigger labs. Sam Altman, ARC Prize, digital-strategy.ec.europa.eu


Terms, so we stop shouting past each other

AI (narrow): Systems that do specific tasks. Chat, code, label images, and fail spectacularly outside their lane.
AGI (general): A system that can perform across most human cognitive tasks, robustly, not just on a benchmark. Even the people building it admit the term is slippery. Wikipedia, Sam Altman
Superintelligence: Bostrom’s line still works, “much smarter than the best human brains in practically every field.” Keep that sticky note. It is the hope and the panic. nickbostrom.com

Benchmarks are not the thing. ARC-AGI measures skill-acquisition efficiency on abstract puzzles. GPQA Diamond hits elite grad-level Q&A. AIME tests math. They are useful proxies, they are not Tuesday in production. ARC Prize, artificialanalysis.ai


Where we stand, with receipts

Reasoning-tuned models jumped in 2024–2025. OpenAI’s o-series and peers score high on math and logic, and go near perfect on AIME 2025 with a Python tool. Yes, tools matter. No, that is not AGI. OpenAI, vals.ai

ARC-AGI had a banner year on v1, then the organizers shipped ARC-AGI-2 and humbled everyone. Pure LLMs score near zero, fancy reasoning systems barely into single digits, while humans pass. The line moved, again. This is healthy. It means we are measuring the hard parts. ARC Prize

GPQA Diamond shows top models beating PhD-level baselines, yet results wobble across context, prompts, and tools. Intelligence is not one number. It is a bad hair day waiting to happen. artificialanalysis.ai, Epoch AI

The jaggedness problem

Demis Hassabis keeps repeating the paradox. Models can win Olympiad-level contests, then trip over high-school math. The competence profile is sharp, not smooth. We call it “jagged.” Until the spikes even out, talk of AGI is more vibe than verdict. The Economic Times

Pull-quote: General intelligence means reliable performance across domains, not party tricks with a GPU budget.

Scaling is alive, but not enough

Scaling laws still guide the money. Kaplan 2020 drew the first map. DeepMind’s Chinchilla refined data-optimal training: smaller models trained on more tokens beat giant under-trained ones. Replications and critiques adjusted the curves. Scaling remains a lever, not a wand. arXiv
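
For intuition, a minimal back-of-the-envelope sketch, assuming the common rule-of-thumb constants of roughly 20 training tokens per parameter and about 6ND FLOPs of training compute; the real fitted exponents depend on the setup, so treat the numbers as order-of-magnitude only.

```python
# Rough Chinchilla-style arithmetic. Constants are rule-of-thumb approximations,
# not fitted values: D ≈ 20 N tokens, C ≈ 6 N D training FLOPs.

def chinchilla_estimate(n_params: float, tokens_per_param: float = 20.0) -> dict:
    """Rough data and compute needs for a compute-optimal run of n_params parameters."""
    tokens = tokens_per_param * n_params   # D ≈ 20 N (rule of thumb)
    flops = 6.0 * n_params * tokens        # C ≈ 6 N D (standard approximation)
    return {"params": n_params, "tokens": tokens, "train_flops": flops}

for n in (7e9, 70e9, 400e9):
    est = chinchilla_estimate(n)
    print(f"{n:.0e} params -> ~{est['tokens']:.1e} tokens, ~{est['train_flops']:.1e} FLOPs")
```

At a few hundred billion parameters, the token budget alone lands in the trillions, which is exactly where the data and energy ceilings below start to bite.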

Compute is exploding. Hardware is wild. NVIDIA’s GB200 NVL72 stitches 72 Blackwell GPUs into a single NVLink domain in one rack, marketed as “acts like one massive GPU.” The interconnect is the story now. Memory, networking, power delivery, and cooling, the rude physics. NVIDIA

Energy is the new governor. Epoch AI projects frontier-training power needs growing roughly 2.2–2.9× per year, with multi-gigawatt runs plausible by 2030. We can buy GPUs, we cannot mint electrons. Epoch AI
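
A toy compounding exercise shows why the 2030 end of that projection gets scary. The 2024 baseline below is a placeholder I made up for illustration; only the 2.2–2.9× per year growth range comes from the projection above.

```python
# Compound the quoted growth rates from an assumed baseline. The ~30 MW figure
# for 2024 is a hypothetical placeholder, not an Epoch AI number.

def project_power(base_mw: float, growth_per_year: float, years: int) -> float:
    return base_mw * growth_per_year ** years

BASE_MW = 30.0   # assumed 2024 frontier-training draw (placeholder)
YEARS = 6        # 2024 -> 2030

for g in (2.2, 2.9):
    mw = project_power(BASE_MW, g, YEARS)
    print(f"growth {g}x/yr: ~{mw:,.0f} MW by 2030 (~{mw / 1000:.1f} GW)")
```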

In the Balkans we respect two things, rakija and thermodynamics.

Only one scales linearly.

What it would actually take to reach AGI, then “super”

This is a checklist, not theology.

a) Stable reasoning under shift
Less prompt duct tape, more robust planning and self-verification that survive new tasks without hand-curated demos. The R1 and “reasoning-tuned” trend is promising, still brittle. OpenAI
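
A minimal sketch of that pattern, with a placeholder model call standing in for any real client; the point is the deterministic check and the retry budget, not the stub.

```python
# Self-verification pattern: propose an answer, check it with a deterministic
# verifier, retry on failure. `call_model` is a placeholder for whatever LLM
# client you use; the task is toy arithmetic so the verifier can be exact.

def call_model(numbers: list[int], attempt: int) -> int:
    # Placeholder model: wrong on the first try, correct afterwards,
    # just to exercise the retry path.
    return sum(numbers) + (1 if attempt == 0 else 0)

def verify(numbers: list[int], answer: int) -> bool:
    return answer == sum(numbers)   # exact, tool-based check, not vibes

def solve_with_verification(numbers: list[int], max_attempts: int = 3) -> int:
    for attempt in range(max_attempts):
        answer = call_model(numbers, attempt)
        if verify(numbers, answer):
            return answer
    raise RuntimeError("no verified answer within budget")

print(solve_with_verification([3, 7, 11, 19]))   # -> 40, after one rejected attempt
```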

b) Better world models
Not more trivia, better structure. Multimodal reasoning helps, “thinking with images” hints at richer internal models, but it is early days. OpenAI

c) Long-horizon agency, safely boxed
We need agent scaffolding and evals that catch power-seeking, replication, and deception before release. METR, AISI, and others are building those tests. This is where adult supervision lives. arXiv, AI Security Institute

d) Interpretability and control that scale
We still do not see clearly inside these models. So the field is adding gates around them. Responsible Scaling Policies, Frontier Capability Assessments, third-party evals. Less romance, more checklists. anthropic.com, Frontier Model Forum

e) Data that is clean, diverse, rights-sound
High-quality human text is finite. Estimates suggest we hit the walls this decade without new pipelines and better curation, otherwise we ride a synthetic-on-synthetic feedback loop toward “model collapse.” Not apocalypse, just mush. arXiv, Epoch AI, Wikipedia
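
Here is the canonical toy version of that feedback loop, shrunk to a Gaussian: fit, sample, refit. Nothing here is about any real model; it just shows how spread and tails tend to erode when each generation trains only on the previous generation's outputs.

```python
# Toy "model collapse" loop: fit a Gaussian, sample the next generation's data
# from the fit, repeat. Samples are kept deliberately tiny so the drift toward
# zero variance shows up quickly; it is a tendency, not a monotone guarantee.

import random
import statistics

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10)]   # generation 0: real data

for gen in range(1, 61):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    # each new generation trains only on synthetic samples from the previous fit
    data = [random.gauss(mu, sigma) for _ in range(10)]
    if gen % 15 == 0:
        print(f"gen {gen:2d}: std ~ {statistics.stdev(data):.3f}")
```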

f) Power and locality
Training farms are turning into power plants with GPUs attached. Siting, cooling, grid contracts. The future of “compute” is the future of energy. Epoch AI

g) Social license
The EU AI Act triggers real obligations. GPAI rules kick in August 2, 2025. Full applicability August 2, 2026, high-risk embedded systems get until 2027. Release playbooks change. Fines are not memes. digital-strategy.ec.europa.eu

Dapp AI: Translation, better brains, better diet, better leash, bigger power plant, better paperwork.


Is AGI achieved? Is “super” even close?

Short answer, no and no, as of August 29, 2025.

Top labs clear elite tests, especially with tools, but “general” means consistent across domains and days. We are not there. vals.ai

ARC-AGI-2 reset the scoreboard. Humans pass, current systems flail. That is a useful burn. It means the benchmark still measures what matters. ARC Prize

Even the bullish say out loud, we “know how to build AGI,” then immediately admit the term is weak and the road is messy. Ambition is not arrival. Sam Altman

We built a Ferrari with bicycle brakes. Pretty, fast, terrifying.

Safety, governance, the boring stuff that keeps us alive

OpenAI dissolved its “superalignment” team in 2024 and redistributed safety work. You can read this as a reorg, or a signal that long-term risk is everyone’s job now. Either way, it made people twitch. WIRED

Public-sector evals matured. The UK AI Safety, now Security, Institute is publishing methodologies and lessons, and running third-party tests. International reports consolidate risk literature, even if consensus is impossible. AI Security Institute

Industry is building scaffolding. Anthropic’s RSP sets capability-gated safety bars. The Frontier Model Forum is writing down procedures for “frontier capability assessments.” This is governance plumbing, not romance, and we need it. anthropic.com, Frontier Model Forum
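
What that plumbing can look like in practice, as a hypothetical sketch. The eval names and thresholds below are invented for illustration, not anyone's actual policy; the point is that the gate is written down, versioned, and checkable.

```python
# Hypothetical RSP-style gate: explicit thresholds that map eval results to a
# deployment decision. Eval names and numbers are made up for illustration.

from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str
    score: float   # higher = more concerning, normalized 0..1

GATES = {
    "autonomous_replication": 0.20,   # hypothetical red line
    "cyber_offense_uplift":   0.30,
    "deception_in_agents":    0.25,
}

def release_decision(results: list[EvalResult]) -> str:
    breaches = [r.name for r in results if r.score >= GATES.get(r.name, 1.0)]
    if breaches:
        return f"HOLD: red-team review and mitigations required for {breaches}"
    return "SHIP: within the current safety case, log results and monitor"

print(release_decision([
    EvalResult("autonomous_replication", 0.05),
    EvalResult("cyber_offense_uplift", 0.34),
]))
```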

Dapp AI: Put another way, adult supervision, finally. Please keep your hands inside the inference window.

The DeAI question, centralize minds or spread them out?

Centralized frontier labs move fast, but they also create single points of failure, opaque risk, and geopolitics that leak into product roadmaps. Decentralized AI (compute markets, verifiable training, on-chain evals, auditable weights) can distribute power and auditability, but it also distributes failure modes. Choose your chaos.

In a world of agent swarms, you will want cryptographic identity, verifiable provenance, and transparent policy levers, not corporate vibes and trust me bro. The policy work above gives levers, the market gives incentives, blockchains give logs and proofs. Use all three. Frontier Model Forum, anthropic.com
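
A minimal sketch of the provenance half, assuming the third-party cryptography package and a placeholder weights file: content-address the artifact, sign the digest, and anchor the record in whatever audit log you trust, on-chain or off.

```python
# Content-address the weights, sign the digest, publish (hash, signature, pubkey).
# Requires the `cryptography` package; the filename is a placeholder.

import hashlib
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sha256_file(path: str) -> bytes:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

signing_key = Ed25519PrivateKey.generate()          # in practice, a managed key, not ad hoc
digest = sha256_file("model_weights.safetensors")   # placeholder filename
signature = signing_key.sign(digest)

public_key_bytes = signing_key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw,
    format=serialization.PublicFormat.Raw,
)

record = {
    "weights_sha256": digest.hex(),
    "signature": signature.hex(),
    "public_key": public_key_bytes.hex(),
}
print(record)   # the entry you anchor in an audit log

# Verification later: load the public key and call .verify(signature, digest);
# it raises InvalidSignature if the weights or the record were tampered with.
```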

Objections, steel-manned

“But models just got gold-medal IMO scores.”
Yes, with bespoke setups and heavy scaffolding. Impressive, not general. The day Tuesday afternoons are boring across domains, that is your AGI smell test. TechCrunch

“Scaling will wash away the rest.”
Some of it. But we are hitting data, energy, and reliability ceilings. Scaling stays a lever, not a cheat code. arXiv, Epoch AI

“Safety slows innovation.”
Bureaucracy slows innovation. Safety unlocks deployment at scale, without waking up regulators with panic in their eyes. Ask any airline. Frontier Model Forum

What builders should do this quarter

  • Ship “reasoning with tools,” not raw chat. Use interpreters and verifiers where possible. Treat prompts as UX, not as duct tape. OpenAI
  • Track energy and locality. You are building power projects in disguise. Your infra team is your moat. Epoch AI
  • Adopt RSP-style gates. Even if you are small. Write down thresholds, red-team plans, and rollback buttons. Sleep better. anthropic.com
  • If you play with agents, use public evals for replication, deception, and cyber. Do not role-play the apocalypse in prod. arXiv
  • Plan for EU AI Act checkpoints if you touch Europe. Know the Aug 2025 GPAI bite, the Aug 2026 full applicability, the 2027 embedded window. digital-strategy.ec.europa.eu

Dapp AI: Do the boring things. The fun things keep breaking because you skipped the boring things.

The owl, not yet

Bostrom’s fable was simple, do not bring home the owl before you can tame it. The good news, today’s owl is still fledgling. The bad news, it already eats a lot. Power, data, attention.

If we are lucky, we get robust AGI in stages. Boring first, safe, then bigger. If we are idiots, we sprint into long-horizon agents without brakes.

Boxes you can lift into the post

Box: “Benchmarks are not AGI”
ARC-AGI tests abstract reasoning under novelty, GPQA Diamond tests elite knowledge, AIME tests contest math. Great signals, not “general.” Keep them honest with new versions, like ARC-AGI-2. ARC Prize, artificialanalysis.ai

Box: “Scaling in 90 seconds”
Kaplan 2020, power-law scaling. Chinchilla 2022, more tokens per parameter, train smaller models on more data. Replications in 2024 adjusted the fit. Scaling still works, but you must feed it data and energy we are short on. arXiv

Figure suggestion: “Frontier training power draw, 2018–2030 projection,” with Epoch AI series.

Label the scary part in 2030. Epoch AI


Sources, for the adults in the room