
Anthropic April 23 Postmortem: 3 Confounding Changes Behind Claude Code's Month-Long Quality Drop

Reading like an aviation accident report — engine fault, pilot error, weather. Each alone wouldn't have crashed the plane. Three together did.

For the past month, developers using Claude Code have reported degraded quality: shorter responses, lost context mid-session, and faster usage limit consumption. On April 23, 2026, Anthropic published a detailed engineering postmortem confirming what many suspected: three separate, concurrently deployed changes had compounded into broad, hard-to-diagnose symptoms.

This is one of the most transparent AI lab postmortems I've seen. The engineering lessons on incident governance for AI systems are worth unpacking — especially for solo developers and small teams working with LLMs in production.

The Affected Surface Area

Per Anthropic's writeup, only three products were affected:

  • Claude Code
  • Claude Agent SDK
  • Claude Cowork

The Anthropic API itself was not affected. So if you wired your app directly to the API, you likely felt nothing. If you used Claude Code CLI, you spent April fighting your tools.

The GitHub trail backs this up. Issues on anthropics/claude-code spiked in April. Issue #49244 reported an "Opus 4.6 obvious quality drop from April 15." Issue #49585 traced the cache hit rate falling from 99.8% to near zero on the smoosh pipeline. And AMD AI senior director Stella Laurenzo's analysis on issue #42796 covered 6,852 sessions, showing a 67% drop in thinking depth and a monthly bill that spiked from $345 to $42,121.

Users weren't imagining it. The instrumentation said so.

Change 1 — March 4: Default Effort Drop

The simplest of the three. Anthropic dropped the default thinking effort for Claude Code from high to medium on Sonnet 4.6 and Opus 4.6.

The motivation was UX-friendly: in high mode, Claude could think for a full minute before responding. Some users perceived the silent UI as "frozen" and abandoned the session. Lower effort = faster perceived response = better UX, in theory.

Before (Mar 4):  default_thinking_effort = "high"
After  (Mar 4):  default_thinking_effort = "medium"

The user reaction was the opposite of what was intended. The community pushed back hard ("intelligence dropped"), and the revealed preference was clear:

Users would rather wait 60 seconds for an accurate answer than get a fast wrong one.

April 7: Anthropic reverted. The default went back to high for the 4.6 family, Opus 4.7 now defaults to xhigh, and lower effort is opt-in for trivial tasks.
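
If you wire directly to the API (which this incident didn't touch), the defensive move is to pin reasoning behavior explicitly instead of inheriting provider defaults. A minimal sketch using the Anthropic Python SDK's extended-thinking parameter; the model ID is illustrative, not a real release name:

```python
# Pin the thinking budget explicitly so a provider-side default change
# can't silently alter your app's reasoning depth.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # illustrative model ID, not a real release name
    max_tokens=16000,         # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 10000},  # explicit, not a default
    messages=[{"role": "user", "content": "Refactor this module for testability."}],
)

# With thinking enabled, the response contains thinking blocks plus text blocks.
print(next(b.text for b in reversed(response.content) if b.type == "text"))
```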

Change 2 — March 26: Caching Bug (The Eval-Pass Disaster)

This is the most technically interesting failure. A caching optimization was deployed to clear stale thinking sections from sessions idle for an hour or more, using the header clear_thinking_20251015 with keep:1 (clear once at the threshold).

The implementation diverged from intent. Instead of clearing once when the threshold was crossed, the code cleared on every turn after crossing. Claude was perpetually amnesic about its own reasoning trace.
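
Anthropic didn't publish the offending code, but the described behavior maps onto a classic edge-trigger vs. level-trigger mistake. A hypothetical reconstruction in Python (all names are mine, not Anthropic's):

```python
IDLE_THRESHOLD_S = 3600  # "1+ hour idle" from the postmortem

def should_clear_buggy(idle_seconds: float) -> bool:
    # Buggy version, as described: level-triggered. Once the session has
    # crossed the idle threshold, this fires on EVERY subsequent turn,
    # wiping the thinking trace each time.
    return idle_seconds >= IDLE_THRESHOLD_S

def should_clear_intended(idle_seconds: float, already_cleared: bool) -> bool:
    # Intended keep:1 semantics: edge-triggered. Clear exactly once when
    # the threshold is crossed, then leave the reasoning trace alone.
    return idle_seconds >= IDLE_THRESHOLD_S and not already_cleared
```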

Symptoms aligned exactly with what users reported:

  • Forgetting decisions made earlier in the session
  • Repeating already-tried approaches
  • Selecting wrong tools without context
  • Cache miss avalanche → faster usage limit consumption

The economic impact was twofold: every request rebuilt the full context (more tokens), and the cache layer that should have absorbed that cost was broken. Pro, Max 5x, and Max 20x users hit limits on routine workloads.
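
A rough back-of-envelope shows why bills exploded. Anthropic bills prompt cache reads at about a tenth of the base input price, so losing the cache multiplies per-turn input cost by roughly 10x. The context size and token price here are assumptions for illustration:

```python
# Per-turn input cost for a long-context session, healthy vs. broken cache.
# Assumed: 150K-token context, $15 per million input tokens,
# cache reads billed at 10% of the base input price.
CONTEXT_TOKENS = 150_000
BASE_PRICE_PER_MTOK = 15.00
CACHE_READ_MULTIPLIER = 0.10

def turn_cost(hit_rate: float) -> float:
    cached = CONTEXT_TOKENS * hit_rate
    uncached = CONTEXT_TOKENS - cached
    return (cached * CACHE_READ_MULTIPLIER + uncached) * BASE_PRICE_PER_MTOK / 1e6

print(f"healthy cache (99.8% hits): ${turn_cost(0.998):.2f}/turn")  # ~$0.23
print(f"broken cache  (0% hits):    ${turn_cost(0.0):.2f}/turn")    # ~$2.25
# Roughly a 10x input-cost multiplier on every turn, before counting the
# extra output tokens spent re-deriving decisions the model forgot.
```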

Why the Eval Stack Didn't Catch It

This is where it gets uncomfortable. The bug passed:

  • Multiple human reviews
  • Automated code review
  • Unit tests
  • E2E tests
  • Internal dogfooding

Yet it shipped to prod and survived for ~2 weeks. Anthropic's post-incident debugging found two unrelated experiments that masked the bug in CLI sessions:

  1. Server-side message queueing experiment
  2. Thinking display change experiment

Anthropic staff dogfooding Claude Code happened to be in those experiment cohorts, where the symptoms manifested differently. Users outside the cohorts hit the bug head-on.

The Meta-Discovery: Models Auditing Models

After the fact, Anthropic engineers ran the same code review with Opus 4.7. The result:

  • Opus 4.7 found the bug.
  • Opus 4.6 didn't.

A next-gen model identified a blind spot in the prior gen's reasoning. There's a generalizable habit hidden here for any team using LLMs to review code: periodically re-audit current-gen outputs with the next-gen model. It costs little and surfaces issues your existing review chain misses.
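
The loop is cheap to automate. A minimal sketch with the Anthropic Python SDK; the model ID and review prompt are illustrative:

```python
# Re-audit an already-approved diff with a newer model generation.
import anthropic

client = anthropic.Anthropic()

def reaudit(diff: str, model: str = "claude-opus-4-7") -> str:  # assumed model ID
    response = client.messages.create(
        model=model,
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Review this diff for correctness bugs, especially "
                       "state that should change once but changes on every "
                       "iteration:\n\n" + diff,
        }],
    )
    return response.content[0].text

# When a new model ships, sweep it over the reviews the old model signed off on:
# for diff in approved_diffs:
#     print(reaudit(diff))
```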

The bug was fixed in v2.1.101 on April 10.

Change 3 — April 16: Two Lines in the System Prompt

The last change is the most counterintuitive. Anthropic added two lines to the Claude Code system prompt:

Length limits:
- Keep text between tool calls to ≤25 words.
- Keep final responses to ≤100 words unless the task requires more detail.

Why? Opus 4.7 launched verbose: smart, but with a high output token cost. The two-line nudge was a lightweight way to rein in token spend.

The change went through multi-week internal testing. The standard eval suite showed no regression. So it shipped.

An ablation run after the incident (removing each line of the system prompt and measuring the impact) found a 3% drop on one specific eval that the standard suite never covered. The verbosity constraints were quietly degrading model performance on that dimension.
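
Line-level ablation is easy to script once you have any eval harness at all. A sketch of the idea; eval_score is a stand-in for whatever metric you already run:

```python
# Remove one system prompt line at a time and measure the eval delta.
# eval_score() is a placeholder for your own harness (prompt -> float).

def ablate_prompt(system_prompt: str, eval_score) -> list[tuple[str, float]]:
    lines = system_prompt.splitlines()
    baseline = eval_score(system_prompt)
    deltas = []
    for i, removed in enumerate(lines):
        variant = "\n".join(lines[:i] + lines[i + 1:])
        deltas.append((removed, eval_score(variant) - baseline))
    # Lines whose removal RAISES the score are actively hurting you,
    # exactly like the two verbosity constraints here.
    return sorted(deltas, key=lambda d: d[1], reverse=True)
```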

April 20: reverted.

Five Takeaways for Solo Developers Building with LLMs

1. Multi-Confounding Will Kill Your Incident Response

Anthropic's biggest mistake was deploying three changes in a tight window. Each had its own justification. The aggregate effect was diffuse, hard-to-pinpoint user experience degradation.

For solo devs: ship a price change, a UX redesign, and a new feature in the same week, and you have no causal attribution when something breaks. One variable at a time is the foundation of debuggability. Sequence your rollouts even when it slows you down.

2. Eval Blind Spots Are Universal

If Anthropic's full validation chain (multi-human review, auto code review, unit tests, E2E tests, dogfooding) can ship a bug, your eval setup has gaps too. The defense isn't "more eval" — it's periodic eval validation:

  • Run ablation tests on changes (one variable removed at a time)
  • Back-test current behavior with a different model generation
  • Compare eval predictions to real user behavior data
  • Treat eval-passing as a necessary signal, not sufficient

3. System Prompts Are Production Code

Two lines in a system prompt caused a 3% eval regression. System prompts deserve:

  • Version control (commit + diff each change)
  • Ablation testing (remove each line, measure impact)
  • Soak periods (gradual rollout, monitor for N days)
  • Regression suites that grow with each incident

For solo devs: at minimum, version your prompts in git. A/B test changes before full deploy. Don't change a working prompt without instrumentation in place.
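
The cheapest enforcement is a content hash pinned in your test suite, so no prompt edit reaches production without a deliberate, reviewable baseline update. A minimal sketch, assuming the prompt lives at prompts/system.txt (a hypothetical path):

```python
# Pin the production system prompt to a reviewed hash. Any edit fails CI
# until the pin is consciously updated in the same PR, which forces every
# prompt change through code review and your eval gate.
import hashlib
from pathlib import Path

PINNED_SHA256 = "replace-with-reviewed-hash"

def test_system_prompt_is_pinned():
    prompt = Path("prompts/system.txt").read_bytes()
    assert hashlib.sha256(prompt).hexdigest() == PINNED_SHA256, (
        "System prompt changed: re-run ablations and evals, then update the pin."
    )
```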

4. Listen to User Pushback on Defaults

Anthropic reverted the default effort drop within five weeks because users explicitly said the new default was wrong. That's the opposite of sycophancy: when users push back on defaults, change the defaults, not the users' expectations.

For solo devs: if you find yourself explaining away user complaints ("they don't understand the new feature"), you're probably wrong about the default. The user is your eval set for things your tests don't measure.

5. Use Next-Gen Models to Audit Current-Gen Output

Opus 4.7 finding the 4.6 blind spot is genuinely useful. Anthropic released a stronger model, then used it to retrospectively audit the prior generation's reasoning. You can run the same loop: when a new model drops, re-run your existing code reviews, docs, and prompts through it. The cost-benefit is excellent for surfacing latent issues.

Anthropic's Commitments

The postmortem ends with concrete commitments:

  • Internal staff use the exact same public Claude Code build (not internal test builds)
  • Stricter gates on system prompt changes: per-model ablation, audit tools, ongoing review
  • CLAUDE.md gains per-model change guidance (target specific models explicitly)
  • Soak periods + broad eval sets + gradual rollouts for any intelligence trade-off changes
  • New @ClaudeDevs X account explaining product decision context
  • Centralized GitHub threads mirroring the same updates
  • All subscriber usage limits reset on April 23 (real-cost compensation)

This is the bar for AI lab incident response: transparent root cause, named individuals, technical depth, concrete commitments, and material remediation.

Conclusion: AI Incident Governance Is Different

The deeper lesson isn't "Anthropic made mistakes." It's that AI system incidents are harder to debug than traditional software incidents because:

  • Eval coverage is inherently incomplete
  • Single prompt lines can have measurable model-wide impact
  • Models are non-deterministic, making reproduction hard
  • Caching, routing, and prompt layers interact in complex ways

So incident governance has to evolve. Ablation, back-testing, gradual rollouts, user feedback channels, and eval system validation itself all need to be standard procedure. This is true at Anthropic's scale and at solo developer scale — arguably more important at the smaller scale because resources are tighter.

Next time Claude or another AI tool feels off, don't immediately blame your prompt. Check the official channels, GitHub issues, and community feedback first. Systemic changes have systemic effects, and you're probably not the only one.

Source: The April 23 postmortem - Anthropic Engineering


This article was originally published by DEV Community and written by 정상록.
