Anthropic’s Fable, Mythos, and the Honest Safety Tradeoff
Anthropic dropped two models this week. One of them — Mythos — is scary-capable. The other — Fable — is Mythos with a governor bolted on. What’s interesting isn’t that they did it. What’s interesting is that they told us exactly how the governor works.
Here’s the setup. Mythos is the model Anthropic debuted in April through Project Glasswing — shared with Apple, NVIDIA, and a handful of partners to harden their software against AI-driven cyberattacks. The White House reportedly reconsidered AI regulation policy as a result. That’s how capable we’re talking.
Fable is the public release. Same underlying weights as Mythos 5, according to Anthropic — but with “highly robust safeguards” that route certain prompts through the less-capable Opus 4.8 instead. The company says these safeguards trigger in less than 5 percent of sessions, and that they’re tuned conservatively. They’ll sometimes catch harmless requests.
I think this is the most honest deployment strategy I’ve seen from any major AI lab.
The counterargument you’re supposed to raise here: Why release Mythos at all? If the safeguards are necessary for Fable, why does Mythos exist as a separate product? Doesn’t this undercut the whole safety narrative?
Fair question. Here’s why I don’t think it does.
Mythos exists for a specific reason: partners like Apple and NVIDIA need to test their defenses against the real thing. You can’t harden a system against an adversary you’ve never seen. That’s not a loophole; it’s the entire point of the Responsible Scaling Policy Anthropic has been iterating on for two years. You release the capable model to the people who need it for defense, you release the safeguarded version to everyone else, and you’re transparent about the difference.
The second counterargument: The safeguards are a leaky abstraction. If they fire in up to 5 percent of sessions, that’s up to 5 percent of sessions silently downgraded, and because the tuning is conservative, some of those are harmless requests caught by mistake. Users won’t know why their model suddenly got dumber. That’s a bad experience, and it erodes trust.
True. But compare it to the alternative: every other major lab runs safety filters too, and they don’t tell you when they fire. You just get a different output and wonder why. Anthropic is saying “we sometimes over-cautiously route you to a weaker model, we know this is imperfect, and we’re working on it.” That’s not a bug in the communication — that’s the whole thesis of the Claude Constitution approach. Transparency as a feature, not a liability.
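The trigger-rate tradeoff is easier to see with a toy model. The post doesn’t say how Fable’s safeguards decide when to reroute, so this minimal Python sketch assumes a hypothetical risk classifier that scores each prompt, plus a hypothetical threshold; the model identifier strings are placeholders too, not real API names. It only illustrates why conservative tuning (a lower threshold) raises the share of rerouted sessions.

```python
from dataclasses import dataclass

# Placeholder identifiers (hypothetical): the post names the models, not any API strings.
FULL_MODEL = "fable"
FALLBACK_MODEL = "opus-4.8"


@dataclass
class RoutingDecision:
    model: str
    safeguard_triggered: bool


def route(risk_score: float, threshold: float = 0.2) -> RoutingDecision:
    """Send a prompt to the fallback model when a hypothetical risk classifier
    scores it at or above `threshold`. A lower threshold is more conservative:
    it catches more risky prompts but also flags more harmless ones."""
    if risk_score >= threshold:
        return RoutingDecision(FALLBACK_MODEL, True)
    return RoutingDecision(FULL_MODEL, False)


def trigger_rate(scores: list[float], threshold: float) -> float:
    """Fraction of sessions that would be routed to the fallback model."""
    if not scores:
        return 0.0
    triggered = sum(route(s, threshold).safeguard_triggered for s in scores)
    return triggered / len(scores)


sample_scores = [0.01, 0.05, 0.1, 0.3, 0.9]  # made-up classifier scores
print(trigger_rate(sample_scores, threshold=0.5))  # 0.2
print(trigger_rate(sample_scores, threshold=0.2))  # 0.4
```

In the toy data, dropping the threshold from 0.5 to 0.2 doubles the share of rerouted sessions. That is the conservative-tuning tradeoff in miniature: more coverage of risky prompts, and more harmless ones caught along the way.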
The third counterargument: The pricing tells the real story. Fable is $10/M input tokens and $50/M output tokens. That’s not cheap. For the first two weeks it’s free to Claude subscribers, but after June 22 it’s credits-only until they “restore it as a standard part of subscription plans.” This is a rollout, not a release. They’re capacity-constrained and using price to throttle demand.
I think that’s accurate. And I think it’s also fine. Better to be honest about capacity than to pretend you have infinite inference and deliver a degraded experience. The labs that over-promise and under-deliver are the ones losing trust right now.
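For a sense of scale once the free period ends, here is a back-of-envelope Python sketch that uses only the two per-million-token prices quoted above. The token counts in the example are made up for illustration, and the sketch ignores any discounts, caching, or credit conversion Anthropic may apply.

```python
# Rates as reported in the post, in US dollars per million tokens.
INPUT_USD_PER_M = 10.0
OUTPUT_USD_PER_M = 50.0


def session_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one session at the quoted per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical session: 200,000 tokens in, 20,000 tokens out.
print(f"${session_cost_usd(200_000, 20_000):.2f}")  # $3.00
```

Output tokens cost five times as much as input tokens, so long, generation-heavy sessions are where the bill grows fastest.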
The thing that stands out to me is the Pokémon test. Anthropic’s older Claude 3.7 Sonnet tried to play Pokémon Red and needed an overlay just to keep track of its position, because it was “essentially blind between frames.” Fable beats Pokémon FireRed with a “minimal, vision-only harness.” That’s not a parlor trick. Finishing a long game from raw screenshots means reading the screen accurately and holding on to what it saw, which is the same skill real-world vision tasks demand: extracting precise numbers from scientific figures, or rebuilding a web app’s source code from screenshots. Those are the kinds of things that actually move the needle for developers.
So here’s where I land: Anthropic is doing the hard thing. They’re releasing a model that’s genuinely capable while being transparent about where they’re holding back and why. They’re letting partners run with the full version for defensive purposes. They’re telling users how often the safeguards intervene and what happens when they do. And they’re pricing honestly rather than pretending capacity is infinite.
It’s not perfect. The two-tier model system creates an uncomfortable dynamic where some people get the “real” AI and others get the babysat version. That tension isn’t going away. But at least they’re not pretending it doesn’t exist.
Most labs ship safety as a press release. Anthropic shipped it as a config file you can actually inspect.
Sources: Engadget, The Brutalist Report, Anthropic - Project Glasswing, Anthropic - Responsible Scaling Policy