Article 51 of the EU AI Act defines systemic risk for general-purpose AI models by a single metric: compute. Any model trained using more than 10²⁵ floating-point operations is presumed to have high-impact capabilities, and its provider must notify the European Commission within two weeks. Not by demonstrated performance. Not by any evaluation score. By how much raw compute was burned training it.
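For scale, the threshold is straightforward to check against published model sizes. Below is a minimal Python sketch using the common approximation that training compute for a dense transformer is roughly 6 × parameters × training tokens; the 200-billion-parameter, 12-trillion-token model in the example is invented for illustration, not a real system.

```python
# Back-of-the-envelope check against the EU AI Act's systemic-risk threshold.
# Uses the widely cited approximation: training FLOPs ~= 6 * parameters * tokens
# for a dense transformer. The model figures below are hypothetical.

EU_SYSTEMIC_RISK_THRESHOLD = 1e25  # FLOPs, per Article 51

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate for a dense transformer."""
    return 6 * n_params * n_tokens

# Hypothetical model: 200B parameters trained on 12T tokens.
flops = training_flops(200e9, 12e12)
print(f"Estimated training compute: {flops:.2e} FLOPs")
print("Presumed systemic risk under the AI Act:",
      flops > EU_SYSTEMIC_RISK_THRESHOLD)
```

The hypothetical model lands at roughly 1.4 × 10²⁵ FLOPs, crossing the line regardless of what it can or cannot do.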
Today, the ARC Prize Foundation published ARC-AGI-3 (arXiv:2603.24621, https://arxiv.org/abs/2603.24621). Frontier AI systems score below 1% on it. Humans score 100%. Every single time.
Those two facts don't fit comfortably in the same regulatory framework.
What ARC-AGI-3 actually tests
ARC-AGI-3 is the third benchmark in the series created by François Chollet, co-founder of the ARC Prize Foundation. The previous two iterations tracked progress on static pattern-recognition tasks: the model is shown a grid puzzle and produces an answer. This one is different in kind, not just in degree.
ARC-AGI-3 uses interactive, turn-based environments where an agent must explore an unknown space, infer what the goal is, build a working model of how the environment operates, and then plan action sequences to reach that goal. No instructions. No language cues. No external knowledge the model might have memorized from training data.
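To make that interaction pattern concrete, here is a minimal sketch of a turn-based loop of this shape. The GridEnvironment and the random-walk agent are hypothetical stand-ins written for this article, not the benchmark's actual environments or API.

```python
# Sketch of the evaluation pattern ARC-AGI-3 targets: a turn-based loop where
# the agent observes, acts, and must infer the goal from feedback alone.
# Everything here is a toy illustration, not the benchmark itself.

import random

class GridEnvironment:
    """Toy environment: reach a hidden goal cell on a 5x5 grid."""
    def __init__(self):
        self.pos = (0, 0)
        self.goal = (4, 4)  # never revealed to the agent

    def observe(self):
        return self.pos  # the agent sees only its own position

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x, y = self.pos
        self.pos = (min(max(x + dx, 0), 4), min(max(y + dy, 0), 4))
        return self.pos == self.goal  # the only feedback: solved or not

env = GridEnvironment()
for turn in range(100):
    obs = env.observe()
    # A real agent would use obs to build a model and plan; this one guesses.
    action = random.choice(["up", "down", "left", "right"])
    if env.step(action):
        print(f"Solved on turn {turn + 1}")
        break
```

A random walk can stumble onto the goal in a toy grid like this; the point of the real environments, per the paper's framing, is that success requires exactly the goal-inference and planning steps the random agent skips.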
The benchmark is built around what the researchers call "Core Knowledge priors" — the basic cognitive building blocks humans share regardless of education or cultural background. Spatial reasoning. Object permanence. Cause and effect. The tasks were calibrated through extensive testing with human participants, which is where the 100% human solve rate comes from.
The below-1% score for frontier AI is not a rounding error. It is the score.
The gap between claims and performance
The ARC Prize Foundation defines AGI as a system that can match the learning efficiency of humans. By that definition, and by the benchmark they built to measure it, no current frontier model is close.
This matters because the past two years have produced a steady stream of capability claims from AI labs: human-level performance on professional exams, reasoning ability comparable to expert practitioners, agentic systems that can handle complex multi-step tasks. Some of these claims are accurate within the narrow domains tested. ARC-AGI-3 measures something different — fluid adaptive intelligence on genuinely novel tasks, the kind of generalization that humans perform almost automatically.
Watch how labs typically present benchmark results. When a model achieves high scores, the benchmark gets cited in product announcements and investor materials. When scores are low, the standard response is that the benchmark doesn't capture real-world utility. The ARC Prize Foundation has anticipated this: the human solve rate is 100%. These aren't obscure or poorly designed tasks. Every human test-taker solved them.