devtake.dev

Anthropic admits three Claude Code bugs quietly tanked quality for six weeks

Anthropic's April 23 postmortem names three bugs that degraded Claude Code between March 4 and April 20. Usage limits are being reset for every subscriber.

Dieter Morelli · 4 min read · 3 sources
Anthropic Engineering postmortem cover image.
Image: Anthropic

Anthropic just published a postmortem admitting Claude Code quality slipped for weeks because of three separate bugs. The API was fine. The problem was the harness, and it shipped to every paying subscriber between March 4 and April 20.

If you felt Claude Code got dumber this month, you weren’t imagining it.

Three bugs, six weeks

The postmortem walks through them in order. All three affected only Claude Code, not direct API users.

The reasoning downgrade. On March 4, Anthropic flipped Claude Code’s default reasoning effort from high to medium on Sonnet 4.6 and Opus 4.6 to cut latency. Users noticed immediately. “This was the wrong tradeoff,” the company now says, according to The Register. It got reverted on April 7, and the current 2.1.118 build defaults to xhigh on Opus 4.7 and high elsewhere.

The cache-clearing bug. This is the most interesting of the three. On March 26, Anthropic shipped an optimization that cleared older thinking from sessions idle for more than an hour, to reduce resume latency. The intent was a one-shot cleanup. The actual behavior: the clear kept firing every single turn for the rest of the session. Claude looked “forgetful and repetitive,” prompt caches kept missing, and users burned through usage limits faster than usual. It slipped past code review, unit tests, end-to-end tests, and dogfooding because it only showed up in stale sessions. Fixed April 10 in v2.1.101.
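The failure mode is a classic one: a cleanup gated only on the trigger condition instead of on a "done" flag. Here is a minimal hypothetical reconstruction (the class and field names are illustrative, not Anthropic's actual code):

```python
# Hypothetical sketch of the one-shot-cleanup bug: a clear meant to run
# once on resuming a stale session instead re-fires on every turn.

class Session:
    def __init__(self, idle_hours: float):
        self.idle_hours = idle_hours
        self.thinking_blocks = ["t1", "t2", "t3"]
        self.cleanup_done = False  # the flag the buggy path never checks
        self.clears = 0

    def _clear_old_thinking(self):
        # Dropping older thinking also invalidates the prompt cache,
        # so every extra clear means another cache miss.
        self.thinking_blocks = self.thinking_blocks[-1:]
        self.clears += 1

    def handle_turn_buggy(self):
        # Bug: the idle check alone gates the cleanup. Once a session is
        # stale, that condition stays true, so the clear fires every turn.
        if self.idle_hours > 1:
            self._clear_old_thinking()

    def handle_turn_fixed(self):
        # Fix: run the cleanup exactly once per resumed session.
        if self.idle_hours > 1 and not self.cleanup_done:
            self._clear_old_thinking()
            self.cleanup_done = True

buggy = Session(idle_hours=2)
fixed = Session(idle_hours=2)
for _ in range(5):
    buggy.handle_turn_buggy()
    fixed.handle_turn_fixed()

print(buggy.clears)  # 5: clears on every turn of the stale session
print(fixed.clears)  # 1: one-shot cleanup, as intended
```

It also shows why dogfooding missed it: in a fresh session, `idle_hours` never crosses the threshold and both paths behave identically.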

The 100-word cap. On April 16, a system-prompt edit told Claude to keep text between tool calls to 25 words and final responses to 100. Internal tests looked fine. Ablation tests run after the complaints came in showed a 3% quality drop on both Opus 4.6 and 4.7. Reverted April 20 in v2.1.116.
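An ablation test of this kind boils down to scoring the same eval set with and without the prompt change and flagging the delta. A minimal sketch, with stubbed scores purely for illustration (not Anthropic's real numbers or harness):

```python
# Hypothetical ablation check: compare mean eval scores for a baseline
# system prompt against a variant, and flag drops past a threshold.
import statistics

def ablate(baseline_scores, variant_scores, threshold=0.02):
    """Flag a prompt change whose mean score drops by more than threshold."""
    base = statistics.mean(baseline_scores)
    var = statistics.mean(variant_scores)
    drop = base - var
    return {"baseline": base, "variant": var,
            "drop": drop, "regression": drop > threshold}

# Stubbed per-task scores for illustration only.
baseline = [0.82, 0.79, 0.85, 0.81]
capped   = [0.80, 0.76, 0.81, 0.78]  # variant with a word-cap instruction

result = ablate(baseline, capped, threshold=0.02)
print(result["regression"])  # True: the capped variant regressed
```

The catch, as the incident shows, is that this only works if the check runs before launch, not after the complaints arrive.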

Why users noticed the cache bug hardest

Simon Willison zeroed in on the second bug. “On March 26, we shipped a change to clear Claude’s older thinking from sessions that had been idle for over an hour,” the Anthropic post reads. “A bug caused this to keep happening every turn for the rest of the session instead of just once.”

Willison said he had 11 active Claude Code sessions open when he read the postmortem and spends more time prompting in stale sessions than fresh ones. That’s exactly the population that hit the bug. If you’re the kind of developer who leaves a Claude Code tab running across a whole workday or two, every resumed message was eating quota the cache should have absorbed.

It also explains a complaint pattern that didn’t line up with model benchmarks: people said Claude was dumber, Anthropic’s dashboards said the model was the same, and both were right. The scaffolding was the variable.

What Anthropic is actually doing about it

The company says it’s changed its internal review process and shipped a few follow-ups to its Claude Code stack:

  • Broader ablation testing on system-prompt changes.
  • Improved code review and better logging on the cache layer.
  • A new @ClaudeDevs account on X for product-communication transparency.
  • Usage limits reset for all paying subscribers, effective April 23.

Anthropic hasn’t said how many users were affected or what percentage of Code sessions touched each bug. The postmortem also doesn’t explain why the 100-word instruction passed internal evaluation before launch. Those are the open questions worth watching.

What this means for you

If you’ve been on Claude Code since March and something felt off, this postmortem is your receipt. Update to at least v2.1.118, because that’s the build that carries all three fixes.

Check your usage page: the quota reset lands automatically for paid plans, but verify it actually applied. If you burned through limits in March or the first half of April, you probably deserved it back.

The bigger lesson sits one level up. Anthropic had three separate regressions in an AI coding product in under two months, all invisible to the raw model and all painful in real sessions. That’s where you should expect risk on any agentic dev tool, yours included. When you evaluate Claude Code, Codex, or Cursor, benchmark the harness, not just the model card. The model isn’t usually the thing that breaks.
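One way to operationalize "benchmark the harness" is to score the same tasks twice, once through the raw API and once through the tool, and attribute any gap. A sketch under that assumption, with illustrative numbers:

```python
# Hypothetical attribution check: is a quality drop in the model itself,
# or in the harness wrapped around it? Scores and names are illustrative.
def attribute_regression(api_score, harness_score, baseline, tolerance=0.01):
    """Decide whether a quality drop lives in the model or the harness."""
    if baseline - api_score > tolerance:
        return "model"    # the raw API itself regressed
    if api_score - harness_score > tolerance:
        return "harness"  # API is fine; the scaffolding lost quality
    return "none"

# API matches the historical baseline, but the tool stack lags behind it.
verdict = attribute_regression(api_score=0.82, harness_score=0.74, baseline=0.82)
print(verdict)  # "harness"
```

That is exactly the pattern in this incident: model dashboards flat, harness sessions degraded.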
