Two days ago, Jensen Huang told Lex Fridman he believes we have already achieved AGI. On Tuesday, the ARC Prize Foundation released ARC-AGI-3. The top score from any frontier model was 0.37%.
That gap is not a coincidence. It is the whole argument.
What ARC-AGI-3 actually tests
François Chollet introduced the Abstraction and Reasoning Corpus in 2019 in his paper "On the Measure of Intelligence". The paper's central argument is that measuring intelligence through task-specific skill is broken. Skill can be bought: give a model enough training data for a given domain and performance improves, regardless of how well it generalizes to new problems. Chollet's definition cuts through that. According to ARC Prize's published framework, intelligence is "skill-acquisition efficiency over a scope of tasks" with respect to priors, experience, and generalization difficulty. The question is not what you can do, but how fast you learn things you've never seen.
ARC tasks are visual grid puzzles: grids of colored cells ranging from 1x1 to 30x30. The model sees a few input-output examples demonstrating some transformation rule, then has to apply that rule to a new input. No domain knowledge required. The design is deliberately restrictive: only core knowledge priors that humans develop early in childhood are in scope. No language, no cultural knowledge, no memorized lookup tables.
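To make the task shape concrete, here is a minimal sketch in Python. The grid encoding (lists of integers, one per color) matches the general ARC convention, but the specific task and the color-substitution rule below are illustrative inventions, not drawn from the actual dataset:

```python
# Hypothetical ARC-style task: grids are lists of lists of ints (colors).
# The transformation here is a pure color substitution -- an invented
# example, simpler than anything in the real benchmark.

def infer_color_map(train_pairs):
    """Infer a per-cell color substitution from the training examples."""
    mapping = {}
    for inp, out in train_pairs:
        for row_in, row_out in zip(inp, out):
            for a, b in zip(row_in, row_out):
                if a in mapping and mapping[a] != b:
                    raise ValueError("not a pure color substitution")
                mapping[a] = b
    return mapping

def apply_color_map(grid, mapping):
    """Apply the inferred substitution cell by cell."""
    return [[mapping[c] for c in row] for row in grid]

task = {
    "train": [
        ([[1, 1], [2, 2]], [[3, 3], [4, 4]]),
        ([[2, 1], [1, 2]], [[4, 3], [3, 4]]),
    ],
    "test": [[1, 2], [2, 1]],
}

rule = infer_color_map(task["train"])
print(apply_color_map(task["test"], rule))  # [[3, 4], [4, 3]]
```

The point of the format is visible even at this toy scale: the rule is never stated, only demonstrated, and the solver has to recover it from two examples before it can touch the test input.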
ARC-AGI-3 goes further. Where prior versions tested pattern recognition and rule inference, the third version frames itself as a sequential game. The system prompt is direct: "You are playing a game. Your goal is to win. Reply with the exact action you want to take. The final action in your reply will be executed next turn. Your entire reply will be carried to the next turn." The model has to reason across turns, not just match patterns in a single forward pass. Fast Company described it as "more than a thousand simple, video-game-like scenarios designed to measure on-the-fly reasoning rather than memory recall."
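The observe-act-observe loop that the system prompt implies can be sketched as follows. The environment class, its `step` method, and the game itself are hypothetical stand-ins; nothing here reflects the actual ARC-AGI-3 harness, only the shape of turn-based interaction:

```python
# Toy turn-based environment (invented, not a real ARC-AGI-3 game):
# RIGHT moves are blocked until the agent has tried UP at least once,
# so the winning strategy cannot be decided in a single forward pass.

class ToyEnvironment:
    def __init__(self):
        self.pos = 0
        self.blocked = True

    def step(self, action):
        """Execute one action, return the observation for the next turn."""
        if action == "UP":
            self.blocked = False
        elif action == "RIGHT" and not self.blocked:
            self.pos += 1
        return {"pos": self.pos, "done": self.pos >= 3}

def choose_action(history, last_action):
    """Adapt online: if the last RIGHT changed nothing, try UP instead."""
    if last_action == "RIGHT" and history[-1]["pos"] == history[-2]["pos"]:
        return "UP"
    return "RIGHT"

env = ToyEnvironment()
history = [{"pos": 0, "done": False}]
last_action = None
turns = 0
while not history[-1]["done"]:
    action = choose_action(history, last_action) if last_action else "RIGHT"
    history.append(env.step(action))
    last_action = action
    turns += 1
print(turns)  # 5
```

Even this trivial game requires the agent to notice that an action failed and revise its plan mid-episode, which is exactly the capability the benchmark isolates.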
The scores
The semi-private leaderboard at launch tells the story in four rows:
- Google Gemini 3.1 Pro Preview: 0.37%
- OpenAI GPT-5.4 (High): 0.26%
- Anthropic Opus 4.6 (Max): 0.25%
- xAI Grok-4.20 (Beta 0309 Reasoning): 0.00%
For context: Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, according to Google's own blog post announcing the model. ARC-AGI-1 was largely solved, with top models hitting 85% or better. ARC-AGI-3 did not raise the bar. It reset to near zero.
This is not surprising once you understand what changed. ARC-AGI-1 and -2 could still be partially gamed through pattern memorization and test-time compute scaling. The turn-based agentic framing in version 3 targets a specific weakness in current transformer architectures: online adaptive reasoning. You cannot brute-force a game that requires updating your strategy based on feedback from previous turns. There is no static mapping to memorize.
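A toy contrast makes the memorization argument concrete. The guessing game below is invented for illustration, not one of the benchmark's scenarios: the target changes every episode, so a memorized answer transfers to almost nothing, while an agent that updates on per-turn feedback wins every time:

```python
# Hypothetical game: guess a hidden number in [0, 1023] within 16 turns,
# receiving higher/lower feedback after each guess. The target is
# re-randomized per episode, so there is no static mapping to memorize.

import random

def play(strategy, target, max_turns=16):
    """Return the turn on which the strategy won, or None if it never did."""
    state = {"lo": 0, "hi": 1023}
    for turn in range(1, max_turns + 1):
        guess = strategy(state)
        if guess == target:
            return turn
        if guess < target:          # feedback: too low
            state["lo"] = guess + 1
        else:                       # feedback: too high
            state["hi"] = guess - 1
    return None

def memorized(state):
    return 512  # the "answer" that happened to work during training

def adaptive(state):
    return (state["lo"] + state["hi"]) // 2  # binary search on feedback

random.seed(0)
targets = [random.randrange(1024) for _ in range(100)]
memo_wins = sum(play(memorized, t) is not None for t in targets)
adapt_wins = sum(play(adaptive, t) is not None for t in targets)
print(memo_wins, adapt_wins)
```

The memorizer ignores the feedback written into `state` and so almost never wins; the adaptive strategy needs at most eleven guesses regardless of the target. The gap between 77.1% on a static benchmark and 0.37% on a sequential one is this distinction at scale.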



