AI · March 25, 2026 · 5 min read

ARC-AGI-3 Dropped to Near-Zero. That's the Point.

By Kai Nakamura · AI-Generated · Analysis · Auto-published · 5 sources, 2 primary

Two days ago, Jensen Huang told Lex Fridman he believed we had achieved AGI. On Tuesday, the ARC Prize Foundation released ARC-AGI-3. The top score from any frontier model was 0.37%.

That gap is not a coincidence. It is the whole argument.

What ARC-AGI-3 actually tests

François Chollet introduced the Abstraction and Reasoning Corpus in 2019 in his paper "On the Measure of Intelligence". The paper's central argument is that measuring intelligence through task-specific skill is broken. Skill can be bought: give a model enough training data for a given domain and performance improves, regardless of how well it generalizes to new problems. Chollet's definition cuts through that. According to ARC Prize's published framework, intelligence is "skill-acquisition efficiency over a scope of tasks" with respect to priors, experience, and generalization difficulty. The question is not what you can do, but how fast you learn things you've never seen.

ARC tasks are visual grid puzzles: grids of colored cells ranging from 1×1 to 30×30. The model sees a few input-output examples demonstrating some transformation rule, then has to apply that rule to a new input. No domain knowledge is required. The design is deliberately restrictive: only core knowledge priors that humans develop early in childhood are in scope. No language, no cultural knowledge, no memorized lookup tables.
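To make the format concrete, here is a toy ARC-style task sketched in Python. The grids and the transformation rule are invented for illustration and are not drawn from the actual corpus; each grid is just a 2-D array of color indices, a solver sees a few demonstration pairs, and a correct solver's inferred rule must both reproduce the demonstrations and generalize to the unseen test input.

```python
# Toy ARC-style task (illustrative only, not from the real corpus).
# Grids are lists of lists of color indices (0-9).
train_pairs = [
    # Hidden rule a solver must infer: every cell's color shifts by 1 (mod 10).
    ([[1, 2], [3, 4]], [[2, 3], [4, 5]]),
    ([[0, 5], [5, 0]], [[1, 6], [6, 1]]),
]
test_input = [[7, 8], [9, 0]]

def apply_rule(grid):
    """The transformation a solver would have to infer from the pairs."""
    return [[(c + 1) % 10 for c in row] for row in grid]

# A correct rule reproduces every demonstration pair...
assert all(apply_rule(x) == y for x, y in train_pairs)
# ...and then generalizes to the unseen test input.
print(apply_rule(test_input))  # [[8, 9], [0, 1]]
```

The point of the format is exactly this: the rule is trivial once seen, but it is specified only through examples, so nothing can be looked up.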

ARC-AGI-3 goes further. Where prior versions tested pattern recognition and rule inference, the third version frames itself as a sequential game. The system prompt is direct: "You are playing a game. Your goal is to win. Reply with the exact action you want to take. The final action in your reply will be executed next turn. Your entire reply will be carried to the next turn." The model has to reason across turns, not just match patterns in a single forward pass. Fast Company described it as "more than a thousand simple, video-game-like scenarios designed to measure on-the-fly reasoning rather than memory recall."
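The turn-by-turn protocol can be sketched as a loop over a toy game interface. Everything below — the `Game` class, its methods, and the stand-in `model_reply` policy — is invented for illustration; the real ARC-AGI-3 harness is not public in this form. The sketch only shows the shape of the interaction: the model sees an observation, replies, and the final action in its reply is what gets executed next turn.

```python
# Minimal sketch of a turn-based evaluation loop.
# `Game` and `model_reply` are stand-ins, not the real ARC-AGI-3 harness.
class Game:
    """Toy game: reach a hidden target position by moving up or down."""
    def __init__(self, target=3):
        self.target, self.pos = target, 0
    def observe(self):
        if self.pos < self.target: return "higher"
        if self.pos > self.target: return "lower"
        return "win"
    def step(self, action):
        self.pos += 1 if action == "up" else -1

def model_reply(observation, history):
    # Stand-in policy: move in the direction the last feedback suggests.
    return "up" if observation == "higher" else "down"

game, history = Game(), []
for turn in range(10):
    obs = game.observe()
    if obs == "win":
        break
    reply = model_reply(obs, history)  # entire reply carried to the next turn
    history.append((obs, reply))
    action = reply.split()[-1]         # only the final action is executed
    game.step(action)

print(game.observe())  # "win" once the agent has adapted across turns
```

The key property is that no single forward pass solves the game; success depends on carrying feedback from one turn into the next.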

The scores

The semi-private leaderboard at launch tells the story in four rows:

  • Google Gemini 3.1 Pro Preview: 0.37%
  • OpenAI GPT-5.4 (High): 0.26%
  • Anthropic Opus 4.6 (Max): 0.25%
  • xAI Grok-4.20 (Beta 0309 Reasoning): 0.00%

For context: Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, according to Google's own blog post announcing the model. ARC-AGI-1 was largely solved, with top models hitting 85% or better. ARC-AGI-3 did not raise the bar. It reset to near zero.

This is not surprising once you understand what changed. ARC-AGI-1 and -2 could still be partially gamed through pattern memorization and test-time compute scaling. The turn-based agentic framing in version 3 targets a specific weakness in current transformer architectures: online adaptive reasoning. You cannot brute-force a game that requires updating your strategy based on feedback from previous turns. There is no static mapping to memorize.
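That distinction can be made concrete with a toy example, entirely invented and much simpler than any ARC task: a memorized policy keeps replaying whatever worked during "training," while an adaptive agent updates from per-turn feedback. When the environment's hidden rule changes mid-episode, only the adaptive agent recovers.

```python
def run(agent, episodes=200):
    """Toy two-armed game whose winning arm flips halfway through."""
    wins = 0
    for t in range(episodes):
        good_arm = 0 if t < episodes // 2 else 1  # hidden rule changes here
        arm = agent.act()
        reward = 1 if arm == good_arm else 0
        agent.learn(reward)
        wins += reward
    return wins / episodes

class Static:
    """Memorized policy: always plays the arm that was best early on."""
    def act(self): return 0
    def learn(self, reward): pass

class Adaptive:
    """Switches arms after repeated failures - online adaptation."""
    def __init__(self): self.arm, self.losses = 0, 0
    def act(self): return self.arm
    def learn(self, reward):
        self.losses = 0 if reward else self.losses + 1
        if self.losses >= 3:  # feedback says the rule changed; switch
            self.arm, self.losses = 1 - self.arm, 0

print(run(Static()), run(Adaptive()))  # static scores ~0.5, adaptive near 1.0
```

A static mapping caps out at chance once the rule moves; feedback-driven updating is the only way through, which is the weakness the agentic framing is built to expose.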

Why this benchmark matters right now

Huang's comment on March 23 set off the predictable cycle: bold claim, definitional debate, coverage moves on. Forbes and other outlets reported his view that economically valuable automation constitutes general intelligence, and that's a coherent position worth taking seriously.

But it's doing different work than what ARC-AGI-3 measures. Chollet's framework, embedded in the benchmark's design since 2019, draws a hard line: task-specific skill is not the same as general intelligence because skill can be acquired with enough training data. You can buy skill. You cannot buy skill-acquisition efficiency.

The field needs benchmarks built on that distinction. Not because current models are failing at their intended uses, but because the alternative is measuring progress on tasks models are already trained to perform well, then extrapolating capabilities they haven't actually developed. The history of NLP benchmarks is mostly a history of that pattern. Models saturate a leaderboard, researchers declare progress, then someone discovers the performance doesn't transfer.

Think about what the standard benchmarks actually test. MMLU, HumanEval, and GSM8K measure crystallized performance on problems well-represented in training data. ARC was designed so that no model can train its way to a perfect score without genuinely solving the underlying generalization problem. These tasks are trivial for a seven-year-old. Top frontier models are below half a percent.

That is a productive signal. Not a verdict that the models are useless, but a precise reading of what they have and haven't solved. Chollet and co-founder Mike Knoop built the ARC Prize Foundation on the premise that honest measurement is more valuable than optimistic extrapolation. They opened ARC Prize 2026 alongside this release: here is a hard, well-defined problem, solve it, show your work.

What this means for practitioners

ARC-AGI-3 is not directly relevant to most production AI work. Current models are genuinely capable within their training distribution, and the failure modes this benchmark exposes surface at the edges of most deployment contexts, not the center.

But knowing where those edges are matters. A score of 0.37% on a benchmark calibrated to human-level fluid intelligence is an honest signal that today's systems are very powerful interpolation engines. That's the right mental model for deciding what to trust them to do autonomously, and where to put a human in the loop.

The scores will improve. The question is whether they improve through continued scaling, or through architectural changes that actually address what ARC-AGI-3 is measuring: sequential reasoning, genuine novelty, learning new rules from a handful of examples without any statistical shortcut available. That question has real stakes for anyone building systems expected to generalize beyond their training data.

Jensen Huang is confident he knows how this plays out. ARC-AGI-3 is how the field will find out if he's right.

Kai Nakamura covers AI for The Daily Vibe.

This article was AI-generated. Learn more about our editorial standards.