Claude Opus 4.7 is here, and the long-context benchmarks got worse
Anthropic's Opus 4.7 is state-of-the-art on SWE-bench and CursorBench, but independent tests show regressions on long-context retrieval and thematic reasoning.
Anthropic released Claude Opus 4.7 on April 16, and the story is messier than the launch post suggests. The headline benchmarks are strong, but independent tests caught meaningful regressions in the first 24 hours.
What Anthropic is claiming
Opus 4.7 ships at the same price as 4.6: $5 per million input tokens, $25 per million output. The release post leads with SWE-bench Verified, where Anthropic says Opus 4.7 “resolves 3x more production tasks” than Opus 4.6. On CursorBench the score jumps from 58% to 70%. On Finance Agent and GDPval-AA, Anthropic claims state-of-the-art performance. Vision got the biggest structural lift: the model now accepts images up to 2,576 pixels on the long edge, roughly 3.75 megapixels, which is a large step up for reading dense screenshots and PDFs.
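At those rates, per-request cost is easy to estimate. A minimal sketch, using the prices from the release post (the token counts in the example are hypothetical):

```python
# Opus 4.7 list prices from the release post: $5 / 1M input tokens, $25 / 1M output tokens.
INPUT_PER_MTOK = 5.00
OUTPUT_PER_MTOK = 25.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at Opus 4.7 list prices."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# Example: a 200k-token context with a 4k-token reply (hypothetical workload)
# 200k * $5/M = $1.00 input, 4k * $25/M = $0.10 output
print(round(request_cost(200_000, 4_000), 2))  # → 1.1
```

Same arithmetic applies to 4.6, since the prices didn't move; the cost question in this release is capability, not rate.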
The pitch, straight from the announcement: “Users report being able to hand off their hardest coding work, the kind that previously needed close supervision, to Opus 4.7 with confidence.”
There’s one more Anthropic-issued caveat that deserves equal weight. Anthropic explicitly says Opus 4.7 is “less broadly capable than our most powerful model, Claude Mythos Preview,” with some cyber capabilities intentionally dialed down so the company is comfortable shipping 4.7 broadly. The frontier model is real. 4.7 is the version they’re willing to let out.
Where the numbers went the other way
Within hours of release, independent test results started landing, and several of them went backward.
- MRCR v2, Anthropic’s own long-context multi-needle retrieval test, dropped hard. At 256k context, Opus 4.7 scores 59.2%, down from 91.9% on 4.6. At 1M context, it falls to 32.2%, down from 78.3%. Constellation Research’s migration note is blunt about the implication: “RAG pipelines and deep-research agents should A/B before migrating,” and Opus 4.6’s 64k extended-thinking mode remains the better option for long-document retrieval.
- Thematic Generalization Benchmark, which probes how a model abstracts reasoning patterns across domain shifts, came in at 72.8 versus 4.6’s 80.6. That’s a 7.8-point drop. Anthropic has not publicly addressed it.
- Simon Willison’s pelican test. Willison wrote that Opus “managed to mess up the bicycle frame” on both attempts, including with maximum thinking enabled. Qwen3.6-35B-A3B, a 35B open-weights mixture-of-experts model with only 3B active parameters running on his laptop, got the frame right. His take: “right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7” for this specific creative task.
None of this invalidates the coding wins. It does mean the one-line story (“most powerful Opus yet”) is a half-truth.
The identity-verification change nobody expected
On the same release cycle, Anthropic updated its identity-verification policy. Users who access certain Claude capabilities will now see a prompt for “a valid government-issued photo ID: the physical document, in hand” (passport, driver’s license, or national identity card), plus a live selfie from a device with a camera. Scans, photocopies, and mobile driver’s licenses aren’t accepted.
Anthropic’s own scoping language: “We are rolling out identity verification for a few use cases, and you might see a verification prompt when accessing certain capabilities.” That’s not every user. But it is a real shift for a frontier LLM API, and it reads closer to how KYC works in finance than how it used to work in dev tools. It rhymes with OpenAI’s Trusted Access program for GPT-5.4-Cyber, which also gated capabilities behind verified identity.
What’s still unclear
A few questions the docs don’t answer yet.
- Which capabilities trigger the ID check, and at what usage threshold. Anthropic says “a few use cases” without naming them.
- Whether the MRCR regression is fixable in a point release or is a side effect of training choices that would take a full model refresh to undo.
- How Opus 4.7 compares head-to-head with Mythos on anything beyond Anthropic’s internal characterizations. No public eval from Mythos has been published.
What this means for you
If you’re running a Claude API workflow today, don’t auto-upgrade. Opus 4.7 is clearly stronger at agentic coding, and if you’re pushing long autonomous sessions, you’ll probably want it. But if your pipeline hinges on long-context retrieval, subtle cross-domain reasoning, or deep research with 256k+ contexts, pin to 4.6 and A/B test before you switch. The 3x SWE-bench headline isn’t wrong. It’s one axis of evaluation.
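The A/B step doesn’t need to be elaborate. A minimal harness sketch follows; the model IDs and the `ask` callable are placeholders for whatever client wrapper you already use, and the single needle case stands in for your real long-context eval set:

```python
from typing import Callable

# Hypothetical model identifiers -- substitute whatever IDs your API client uses.
CANDIDATES = ["claude-opus-4-6", "claude-opus-4-7"]

def needle_accuracy(ask: Callable[[str, str], str],
                    model: str,
                    cases: list[tuple[str, str]]) -> float:
    """Fraction of (long_context_prompt, expected_answer) cases the model gets right.

    `ask(model, prompt)` is a placeholder for your own client wrapper; swap in a
    real API call and your real retrieval eval set before trusting the number.
    """
    hits = sum(1 for prompt, expected in cases if expected in ask(model, prompt))
    return hits / len(cases)

# Stubbed "model" so the harness runs offline; a real run would hit the API.
def fake_ask(model: str, prompt: str) -> str:
    return "the needle is 7421" if model == "claude-opus-4-6" else "not sure"

cases = [("...long document... what number is the needle?", "7421")]
for model in CANDIDATES:
    print(model, needle_accuracy(fake_ask, model, cases))
```

Run the same cases against both pinned model versions and migrate only when 4.7 matches 4.6 on your own documents, not on headline benchmarks.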
Two other calls to make now. Treat Anthropic’s “less capable than Mythos” framing as a signal about roadmap pacing: they have something stronger in reserve, they’re choosing not to ship it broadly, and that shapes whether Opus 4.7 is the endpoint you plan around or the middle step. And if your usage pattern could plausibly trigger identity verification, decide how you’ll handle that with your customer and your compliance team now. Surprise face-scan prompts land badly in product, and Anthropic’s push toward cloud-run agents like Routines means more of your workflows are going to hit identity-gated surfaces over time, not fewer.
Sources
- Introducing Claude Opus 4.7 — Anthropic
- Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 — Simon Willison's Weblog
- Anthropic launches Claude Opus 4.7, migration advice — Constellation Research
- Identity verification on Claude — Anthropic Support