ARC-AGI-3 drops frontier models below 1% on interactive reasoning tasks humans ace
AI · March 28, 2026 · 6 min read

By Kai Nakamura · AI-Generated Analysis · Auto-published · 5 sources · 1 primary

Frontier AI systems score below 1% on ARC-AGI-3, a new interactive benchmark where humans solve every environment. The question is whether this gap reflects a fundamental limitation in how current models reason, or just a formatting problem that better scaffolding will fix.

The ARC Prize Foundation released ARC-AGI-3 on March 25 at Y Combinator in San Francisco, alongside a $2 million competition on Kaggle. The accompanying technical report (Chollet et al., arXiv:2603.24621) lays out something genuinely new in the benchmark landscape: turn-based abstract environments where agents must explore, infer goals, build world models, and plan action sequences with zero instructions. No rules. No stated objectives. Figure it out or score zero.

What the benchmark actually tests

Previous ARC benchmarks were static. Show a model some input-output grid pairs, ask it to produce the correct output for a new input. ARC-AGI-1 (2019) tested basic pattern inference. ARC-AGI-2 (March 2025) scaled up complexity with multi-step compositional reasoning. Both used the same format: observe examples, predict the answer.

ARC-AGI-3 throws that format out entirely. Each of the benchmark's environments is a handcrafted turn-based game with its own internal logic. The agent sees a visual state, takes an action, observes the result, and iterates. According to the technical report, the benchmark evaluates four components of agentic intelligence: exploration (actively seeking information), modeling (building generalizable world models from observations), goal-setting (inferring what "winning" means without being told), and planning (mapping action sequences to reach inferred goals).
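The interaction pattern the report describes reduces to a plain observe-act loop. The sketch below illustrates it against a toy stand-in environment; the interface (reset/step returning a grid observation) is hypothetical and not the benchmark's actual API.

```python
# Hypothetical turn-based environment interface; the real ARC-AGI-3
# API may differ. The agent sees only a visual state, acts, and
# observes the result. Rules and goals are never stated.

class ToyEnv:
    """Stand-in environment: move right along a strip until reaching the 9."""
    def __init__(self, size=5):
        self.size, self.pos, self.goal = size, 0, size - 1

    def reset(self):
        self.pos = 0
        return self.observe()

    def observe(self):
        grid = [0] * self.size
        grid[self.goal] = 9
        grid[self.pos] += 1
        return tuple(grid)

    def step(self, action):               # action: -1 (left) or +1 (right)
        self.pos = max(0, min(self.size - 1, self.pos + action))
        return self.observe(), self.pos == self.goal

def run_episode(env, policy, max_turns=50):
    """Generic observe-act loop; only state-changing actions count."""
    state = env.reset()
    for turn in range(1, max_turns + 1):
        state, done = env.step(policy(state))
        if done:
            return turn                   # actions taken to reach the goal
    return None

print(run_episode(ToyEnv(), lambda s: +1))  # always-move-right policy: 4
```

The agent never receives the win condition; it must infer from transitions alone that reaching the 9 ends the episode, which is the core of what the benchmark scores.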

The environments leverage what the paper calls "Core Knowledge priors," basic spatial and object reasoning that humans bring to any new situation. No language. No external knowledge. Just abstract interactive worlds that require genuine on-the-fly adaptation.

The scoring method matters as much as the tasks

ARC-AGI-3 uses Relative Human Action Efficiency (RHAE) rather than simple pass/fail. The metric compares how many actions an AI agent takes versus a human baseline to reach the same goal. Only state-changing interactions count; internal reasoning steps are free.

The human baseline is set by the second-best performer out of ten first-time players per environment, according to the scoring documentation. The top performer is excluded to filter outliers while maintaining a realistic competence standard. Per-level efficiency uses a squared formula: (human actions / AI actions)². If a human needs 10 actions and the AI needs 100, the AI scores 1%, not 10%. This squared penalty is designed to crush brute-force strategies. Beating the human baseline earns no bonus, since per-level scores cap at 1.0.

Later levels carry more weight because they require deeper understanding of the environment's mechanics. The team collected data from over 1,200 human players across more than 3,900 games during a 30-day developer preview to establish these baselines.
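The scoring scheme above can be sketched in a few lines. The baseline selection, squared ratio, and per-level cap follow the report's description; the exact per-level weights are not published, so the linearly increasing weights below are an assumption for illustration.

```python
# Sketch of RHAE scoring as described in the report. The linear
# level weights are an assumption; only the squared ratio, the cap
# at 1.0, and the second-best human baseline come from the source.

def human_baseline(action_counts):
    """Baseline = second-best (second-lowest action count) among
    first-time human players for one environment."""
    return sorted(action_counts)[1]

def level_score(human_actions, ai_actions):
    """Squared efficiency ratio, capped at 1.0: beating the human
    baseline earns no bonus."""
    return min(1.0, (human_actions / ai_actions) ** 2)

def environment_score(levels):
    """levels: list of (human_actions, ai_actions) per level.
    Later levels count more (assumed linear weights 1, 2, 3, ...)."""
    weights = range(1, len(levels) + 1)
    total = sum(w * level_score(h, a) for w, (h, a) in zip(weights, levels))
    return total / sum(weights)

# A human needing 10 actions vs. an AI needing 100: (10/100)^2, i.e. 1%.
print(level_score(10, 100))  # approximately 0.01
```

The squaring is what makes the metric so punishing: a model that is 10x less action-efficient than a human loses 99% of the available score on that level, not 90%.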

This scoring design makes ARC-AGI-3 results incomparable with ARC-AGI-1 and ARC-AGI-2 scores. That is intentional.

How frontier models performed

The official leaderboard tests models via API with an identical system prompt, no custom scaffolding. According to the technical report, as of March 2026:

  • Gemini 3.1 Pro Preview: 0.37%
  • GPT 5.4: 0.26%
  • Opus 4.6: 0.25%
  • Grok-4.20: 0.00%

Humans: 100%.

During the developer preview, non-LLM approaches performed significantly better on the community leaderboard. A CNN-based system called StochasticGoose using structured search scored 12.58%, completing 18 levels. Graph-based exploration methods (Blind Squirrel at 6.71%, Explore It Till You Solve It at 3.64%) also outperformed every frontier language model, according to Awesome Agents' reporting on the preview results.
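The preview entries' implementations are not public, but the general idea behind graph-based exploration, treating an unknown environment as a graph of states connected by actions and searching it breadth-first, can be sketched generically. This is an illustrative sketch, not the preview systems' actual code.

```python
from collections import deque

def bfs_explore(start, actions, transition, is_goal):
    """Breadth-first search over an unknown state graph.
    transition(state, action) -> next_state; states must be hashable.
    Returns the shortest action sequence reaching a goal state, or None.
    Generic sketch only; not StochasticGoose or Blind Squirrel."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path
        for a in actions:
            nxt = transition(state, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [a]))
    return None

# Toy example: reach position 7 on an integer line via +1/-1 moves.
path = bfs_explore(0, [+1, -1], lambda s, a: s + a, lambda s: s == 7)
print(len(path))  # 7: the shortest plan
```

Systematic search of this kind finds minimal action sequences by construction, which plays directly to RHAE's squared efficiency penalty and helps explain why structured search beat raw language-model reasoning in the preview.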

One result from Duke University testing is particularly telling. Opus 4.6 scored 97.1% on a known environment using a hand-crafted harness built specifically for that game. On an unfamiliar environment, the same model scored 0%. This suggests that neither perception of the game state nor the API format is the bottleneck. The models can play the games when told how. They cannot figure out what game they are playing.

Why scaffolding doesn't solve the problem

The Foundation deliberately excluded custom scaffolding from the official benchmark. Their reasoning, laid out in the paper: the benchmark measures general intelligence of the model itself, not the engineering effort humans invest in task-specific wrappers.

This is a defensible methodological choice, and it is also where the most interesting debate sits. The community leaderboard allows self-reported harness-driven results, and those scores are substantially higher. The question is whether this pattern (models performing well with custom engineering but failing without it) tells us something about fundamental capabilities or about interface design.

The ARC Prize Foundation's position is clear. If ordinary untrained humans can navigate these environments without instructions or tools, then a system claiming general intelligence should be able to do the same. The Foundation does acknowledge that the best ideas from harness research tend to eventually become built-in model capabilities, citing chain-of-thought prompting as a precedent.

The ARC series' track record on prediction

This matters because ARC benchmarks have a history of identifying capability gaps before other benchmarks do. ARC-AGI-1 was, according to the technical report, the first benchmark to precisely identify the breakthrough of frontier reasoning systems like OpenAI's o3, at a time when other benchmarks were already saturated. ARC-AGI-2 tracked the rapid progress of reasoning models and scaffolding that now powers production tools like Claude Code and Codex.

Both ARC-AGI-1 and ARC-AGI-2 are now effectively saturated, largely through test-time training and synthetic data generation. The paper includes an interesting detail about benchmark contamination: during Gemini 3 verification, the model used the correct ARC integer-to-color mapping in its reasoning chain despite never being told about it in the prompt, suggesting ARC-AGI data is well-represented in frontier model training sets.

ARC-AGI-3's shift to interactive environments is partly a response to this contamination problem. Turn-based games with hidden objectives are harder to overfit to than static grid puzzles.

What this means for practitioners

The $2 million competition runs on Kaggle through November 2, 2026, with milestone checkpoints in June and September. All winning solutions must be open-sourced under MIT or CC0 licenses. Kaggle evaluation runs with no internet access, so solutions cannot call external APIs during scoring.

The benchmarking toolkit is MIT-licensed, installable via pip install arc-agi, and supports most major inference providers. The runtime supports over 2,000 frames per second with rendering disabled, which matters for training loops.

For anyone building agentic systems, the early preview data points toward a concrete finding: systematic state tracking, graph search, and structured exploration currently outperform raw language model reasoning on genuinely novel interactive tasks. The top three preview systems were all non-LLM approaches. Whether that gap persists as teams invest serious compute and engineering effort over the competition period will be the real signal.

ARC-AGI-3 is not claiming current AI cannot do useful work. Clearly it can, and at scale. What it is measuring is a specific and well-defined gap: the ability to navigate unfamiliar interactive environments without task-specific preparation. That gap, as of today, is enormous.

Kai Nakamura covers AI research for The Daily Vibe.

This article was AI-generated.
