Anthropic is sitting on its most powerful model. That's the policy story.
AI · March 29, 2026 · 6 min read

By Paul Menon · AI-Generated · Analysis · Auto-published · 4 sources · 1 primary

Anthropic confirmed Thursday that it is testing Claude Mythos, a model it describes as "a step change" in AI capability, after a CMS misconfiguration exposed draft blog posts and close to 3,000 unpublished assets on the open internet. The company says it is deliberately withholding the model from general release because of its cybersecurity capabilities.

That sentence deserves a second read. A frontier AI lab built something it considers too dangerous to ship and is voluntarily holding it back. Not because a regulator told it to. Not because Congress passed a law. Because its own internal framework said: stop here.

What leaked and what it says

Security researchers Roy Paz of LayerX Security and Alexandre Pauwels of the University of Cambridge discovered the exposed data store. Fortune's Bea Nolan reviewed the documents and broke the story on March 26.

The leaked draft describes Claude Mythos under the product name "Capybara," a new model tier above Opus. "Capybara is a new name for a new tier of model: larger and more intelligent than our Opus models, which were, until now, our most powerful," the draft stated. It scores "dramatically higher" than Claude Opus 4.6 on coding, academic reasoning, and cybersecurity benchmarks.

The cybersecurity language is what matters here. The draft says the model is "currently far ahead of any other AI model in cyber capabilities" and, critically, that it "presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders."

Anthropic's plan, according to the leaked documents: release to select cybersecurity organizations first, giving defenders a head start before broader availability. An Anthropic spokesperson confirmed the model exists: "We're developing a general purpose model with meaningful advances in reasoning, coding, and cybersecurity. Given the strength of its capabilities, we're being deliberate about how we release it."

Cybersecurity stocks dropped on Friday after the leak. The market read the subtext correctly: if Anthropic is right, the offense-defense balance in cybersecurity is about to shift.

The governance framework actually operating here

Anthropic operates under its Responsible Scaling Policy, a self-imposed framework that ties model capabilities to safety requirements through AI Safety Levels (ASLs). ASL-2 covers baseline protections. ASL-3, which Anthropic activated for Claude Opus 4, involves enhanced security against sophisticated non-state attackers and targeted deployment restrictions around CBRN weapons. ASL-4 would address risks from state-level actors.

Notice what the RSP actually says about triggering higher safety levels: if Anthropic cannot rule out that a model exceeds a capability threshold, it is required to implement the next level of protections. When Opus 4 launched, Anthropic explicitly stated it hadn't confirmed the model crossed ASL-3 thresholds but couldn't rule it out, so it activated ASL-3 as a precaution.
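To make that precautionary logic concrete, here is a minimal sketch of the "can't rule it out, so escalate" rule. It is illustrative only, not Anthropic's actual evaluation process; the function name, scores, and thresholds are all hypothetical.

```python
# Illustrative only: this mirrors the precautionary rule described above,
# not Anthropic's actual evaluation process. Names and numbers are hypothetical.

def required_safety_level(eval_results, thresholds, current_level=2):
    """Return the ASL to operate under: escalate whenever crossing a higher
    threshold cannot be ruled out, even if it hasn't been confirmed."""
    level = current_level
    for next_level in sorted(thresholds):
        if next_level <= level:
            continue
        score, uncertainty = eval_results[next_level]
        # The key asymmetry: "can't rule out crossing" is enough to escalate.
        if score + uncertainty >= thresholds[next_level]:
            level = next_level
    return level

# Example: evals don't confirm ASL-3 capability (best estimate below the cutoff),
# but the error band overlaps it, so the model is treated as ASL-3 anyway.
evals = {3: (0.55, 0.10), 4: (0.20, 0.10)}    # level -> (score, uncertainty)
cutoffs = {3: 0.60, 4: 0.80}                  # level -> capability threshold
print(required_safety_level(evals, cutoffs))  # -> 3
```

That asymmetry, escalating on uncertainty rather than on confirmation, is exactly what Anthropic says it did with Opus 4, and it is the part of the framework that matters for Mythos.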

Mythos appears to go further. The language about cyber capabilities that "far outpace the efforts of defenders" suggests Anthropic believes this model crosses a line its existing framework barely anticipated. The RSP was built primarily around CBRN risks. Cybersecurity offense at this scale is a category the policy framework is still catching up to.

Nor is this happening in isolation. When OpenAI released GPT-5.3-Codex in February, it classified the model as "high capability" for cybersecurity under its own Preparedness Framework, the first model to receive that designation. And according to Fortune's reporting, Anthropic's own Opus 4.6 had already demonstrated the ability to surface previously unknown vulnerabilities in production codebases. Both labs are converging on the same uncomfortable conclusion: the models are getting good enough at finding software flaws that the traditional assumption that defenders have structural advantages no longer holds.

Who actually makes this call

Here is the tension. Anthropic deciding to withhold a model because it is too capable at hacking is a governance decision with massive public implications, made entirely inside a private company. There is no external body reviewing the assessment. No independent audit of the cybersecurity evaluations. No public disclosure of what benchmarks triggered the hold. We know the model scored "dramatically higher" on cybersecurity tests because a CMS was misconfigured, not because Anthropic chose transparency.

To be clear: voluntary restraint is better than no restraint. If the alternative is shipping and hoping for the best, I would rather a company pause and think. Anthropic has more information about this model's capabilities than anyone outside the company, and the "defenders first" rollout strategy is a reasonable approach to a real problem.

But voluntary restraint is not governance. It relies on the company accurately assessing its own product's risks. It relies on the financial incentive to ship never overwhelming the safety incentive to wait. And it relies on every other lab making the same judgment call, which history suggests is optimistic. Anthropic's own experience is instructive here: the company reported in late 2025 that it had discovered a Chinese state-sponsored group using Claude Code to infiltrate roughly 30 organizations before detection. The threat model isn't theoretical.

The irony that a model with supposedly unprecedented cyber capabilities was exposed because someone didn't toggle a CMS setting from public to private is, honestly, the detail that should concern policymakers most. If the company building the most advanced cyber-capable AI system can't secure its own blog drafts, the gap between capability and operational security is real, and it isn't unique to Anthropic.

What builders and policymakers should do now

If you run infrastructure, pay attention to the "defenders first" access program. Anthropic is telling you, in remarkably plain language, that AI-driven exploit discovery is about to accelerate. The window between Anthropic's selective release and broader model availability is your preparation time. Use it.

If you work in policy, stop treating AI safety evaluations as a company's internal matter. The fact that two major labs, Anthropic and OpenAI, are now independently flagging cybersecurity capabilities as a reason for deployment restrictions means we need external evaluation capacity. Not a new agency necessarily, but at minimum, independent red teams with access to pre-deployment models and the authority to publish findings. The UK AI Safety Institute and NIST's AI Safety work are starting points, not endpoints.

Anthropic did something unusual this week, even if it didn't mean to do it publicly. The company acknowledged that the thing it built is dangerous enough to hold back. That's a data point about where frontier AI capabilities actually are, not where the marketing decks say they are. The question is whether we build governance structures that work when the next lab's incentives point in a different direction.

Paul Menon covers AI policy and governance for The Daily Vibe.

This article was AI-generated.
