ARC-AGI-3 Drops a 1% Score That Should Embarrass Every Capability Claim Made This Year
AI · March 27, 2026 · 5 min read

By Paul Menon · AI-Generated · Analysis · Human-reviewed · 1 source · 1 primary

Article 51 of the EU AI Act defines systemic risk for general-purpose AI models by one metric: compute. Any model trained using more than 10²⁵ floating-point operations is presumed to have high-impact capabilities, and its provider must notify the European Commission within two weeks. Not by demonstrated performance. Not by score on any evaluation. By how much compute was burned training it.
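
For scale, a back-of-envelope sketch. The "roughly 6 FLOPs per parameter per training token" rule of thumb comes from the scaling-law literature, not from the Act itself, and the model size and token count here are invented for illustration:

```python
# Rough check against the Article 51 presumption threshold.
# The 6-FLOPs-per-parameter-per-token approximation is a common heuristic,
# not the EU's official compute-accounting method. Figures are hypothetical.

EU_SYSTEMIC_RISK_FLOPS = 1e25  # Article 51 presumption threshold

def estimate_training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

# Hypothetical frontier-scale run: 500B parameters, 15T tokens.
flops = estimate_training_flops(params=5e11, tokens=1.5e13)
print(f"Estimated training compute: {flops:.2e} FLOPs")          # ~4.50e+25
print("Presumed systemic risk:", flops > EU_SYSTEMIC_RISK_FLOPS)  # True
```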

Today, the ARC Prize Foundation published ARC-AGI-3 (arXiv:2603.24621 — https://arxiv.org/abs/2603.24621). Frontier AI systems score below 1% on it. Humans score 100%. Every single time.

Those two facts don't fit comfortably in the same regulatory framework.

What ARC-AGI-3 actually tests

ARC-AGI-3 is the third benchmark in the series created by François Chollet, co-founder of the ARC Prize Foundation. The previous two iterations tracked progress on static pattern recognition tasks. This one is different in a meaningful way.

ARC-AGI-3 uses interactive, turn-based environments where an agent must explore an unknown space, infer what the goal is, build a working model of how the environment operates, and then plan action sequences to reach that goal. No instructions. No language cues. No external knowledge the model might have memorized from training data.
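
To make that interaction pattern concrete, here is a toy sketch of the loop. It is not the benchmark's actual API; the grid environment and every name in it are invented. What it preserves is the constraint that matters: the agent is told nothing and must work out from raw transitions what ends the episode.

```python
# Toy stand-in for the explore / infer-goal / act loop described above.
# NOT the real ARC-AGI-3 interface; purely illustrative.
import random

class ToyGridEnvironment:
    """Turn-based environment with an unstated goal: a hidden target cell."""
    def __init__(self, size=5, seed=0):
        self.size = size
        rng = random.Random(seed)
        self.target = (rng.randrange(size), rng.randrange(size))

    def reset(self):
        self.pos = (0, 0)
        return self.pos  # observation only; no goal description, no text

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        return self.pos, self.pos == self.target  # (observation, done)

def explore(env, max_turns=200):
    """A deliberately naive agent: random exploration, no world model."""
    env.reset()
    for turn in range(1, max_turns + 1):
        obs, done = env.step(random.choice(["up", "down", "left", "right"]))
        if done:
            return turn
    return None

turns = explore(ToyGridEnvironment())
print(f"solved in {turns} turns" if turns else "never solved")
```

A real agent would replace the random policy with model-building and planning; the point of the sketch is that the goal is never stated anywhere the agent can read it.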

The benchmark is built around what the researchers call "Core Knowledge priors" — the basic cognitive building blocks humans share regardless of education or cultural background. Spatial reasoning. Object permanence. Cause and effect. The tasks were calibrated through extensive testing with human participants, which is where the 100% human solve rate comes from.

The below-1% score for frontier AI is not a rounding error. It is the score.

The gap between claims and performance

The ARC Prize Foundation defines AGI as a system that can match the learning efficiency of humans. By that definition, and by the benchmark they built to measure it, no current frontier model is close.

This matters because the past two years have produced a steady stream of capability claims from AI labs: human-level performance on professional exams, reasoning ability comparable to expert practitioners, agentic systems that can handle complex multi-step tasks. Some of these claims are accurate within the narrow domains tested. ARC-AGI-3 measures something different — fluid adaptive intelligence on genuinely novel tasks, the kind of generalization that humans perform almost automatically.

Watch how labs typically present benchmark results. When a model achieves high scores, the benchmark gets cited in product announcements and investor materials. When scores are low, the standard response is that the benchmark doesn't capture real-world utility. The ARC Prize Foundation has anticipated this: the human solve rate is 100%. These aren't obscure or poorly designed tasks. Every human test-taker solved them.

Why the governance math breaks down

The EU AI Act's Article 51 compute threshold was chosen for practical reasons. Compute is measurable. You can audit it. Capability — especially novel agentic capability of the kind ARC-AGI-3 tests — is much harder to pin down.

But that practicality creates a serious mismatch. Under current Article 51 rules, a model can clear the 10²⁵ FLOP threshold, get classified as a systemic-risk GPAI model, face the associated transparency and red-teaming obligations, and simultaneously score below 1% on a benchmark that every human completes. The regulatory designation implies capabilities the model doesn't have.

The inverse is also a problem. If a future model achieves genuine novel agentic reasoning while being trained on less compute, it could fall below the systemic-risk threshold entirely. The regulation would miss the actual capability transition.
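
The mismatch is easy to state in code. In this sketch the classifier is compute-only, Article 51-style; the sub-1% solve rate is the article's figure for today's frontier models, while the second model's numbers are invented to illustrate the inverse failure mode:

```python
# Compute-only presumption vs. a capability signal. Hypothetical figures,
# except the sub-1% solve rate, which is the reported frontier score.

def presumed_systemic_risk(training_flops: float) -> bool:
    """Article 51-style classification: compute threshold, nothing else."""
    return training_flops > 1e25

models = [
    # (name, training FLOPs, novel-task solve rate; humans score 100%)
    ("frontier_model_2026", 5e25, 0.009),  # regulated, yet fails the tasks
    ("efficient_model_202x", 5e24, 0.60),  # capable, yet escapes presumption
]

for name, flops, solve_rate in models:
    print(f"{name}: flagged={presumed_systemic_risk(flops)}, "
          f"solve rate={solve_rate:.1%}")
```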

In the United States, the NIST AI Risk Management Framework has become the go-to reference for courts assessing reasonable duty of care in AI deployment. The framework provides extensive guidance on risk categorization but defines no capability thresholds and no standard for what constitutes AGI-level performance. That vacuum is a liability.

When AI companies make capability claims in marketing materials, investor filings, or congressional testimony, those claims aren't cross-referenced against any standardized benchmark. There's no mechanism requiring a company that says its model has "human-level reasoning" to disclose that it scores below 1% on a benchmark where humans score 100%.

What's at stake for GPAI enforcement

The EU's GPAI compliance obligations took effect August 2, 2025. Full enforcement by the European Commission begins August 2, 2026. As that enforcement window opens, regulators will be making decisions about which models carry systemic risk, what evaluations are sufficient, and what capability claims are credible.

ARC-AGI-3 doesn't map cleanly onto any existing compliance category. The EU AI Act's systemic risk framework wasn't designed around agentic reasoning benchmarks. But the benchmark's release today gives regulators — and the lawyers advising AI providers on compliance — a concrete data point: the most capable frontier models in the world cannot do something every human test-taker finds trivial.

That's relevant to any assessment of actual systemic risk, as opposed to presumed systemic risk based on compute alone. It's also relevant to any liability framework trying to adjudicate harm caused by AI systems whose capabilities were misrepresented.

What to do now

If you're building on top of frontier AI systems and your product pitch involves anything resembling agentic reasoning, novel problem-solving, or human-level generalization, run your system against ARC-AGI-3 or a similar novel-task benchmark before you put that claim in writing. The gap between marketing language and actual performance is now quantifiable in a way legal teams and regulators can read.
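
Reduced to a toy gate, that check might look like the sketch below. The 50% bar and the function name are placeholders, not a standard; the point is that the check exists and is written down before the claim ships:

```python
# Toy pre-publication gate, for illustration only: refuse to sign off on a
# capability claim unless measured novel-task performance clears an agreed
# bar. Set the real bar with your legal team.

HUMAN_SOLVE_RATE = 1.00  # ARC-AGI-3 human solve rate, per the article

def claim_is_defensible(measured_solve_rate: float,
                        required_fraction_of_human: float = 0.5) -> bool:
    """True only if measured performance supports a human-level claim."""
    return measured_solve_rate >= required_fraction_of_human * HUMAN_SOLVE_RATE

# "Human-level reasoning" vs. a sub-1% solve rate on tasks humans ace:
print(claim_is_defensible(measured_solve_rate=0.009))  # -> False
```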

If you're advising on EU AI Act GPAI compliance, the compute-threshold approach in Article 51 is going to face pressure as benchmarks like this accumulate. Get ahead of it by advocating for capability-based supplementary evaluations in your compliance documentation — not just compute metrics.

If you're a policymaker watching the enforcement deadline approach: the ARC Prize Foundation just handed you the clearest dataset yet showing that capability claims and actual performance aren't the same thing. Build the distinction into your evaluation requirements before August.

Paul Menon covers AI governance and policy for The Daily Vibe.

This article was AI-generated.
