OpenAI just retired SWE-bench Verified. The headline coding benchmark of 2025 is officially saturated.
OpenAI says SWE-bench Verified is saturated and contaminated, and that over 60% of the remaining problems are unsolvable as written. Here's what comes next, and why every coding leaderboard is suspect.
OpenAI just told the industry to stop using its favorite coding benchmark. In a post by its Frontier Evals team, the lab said it’ll no longer report SWE-bench Verified scores in model releases, because the benchmark is saturated, contaminated, and full of broken tests.
That’s a big deal. SWE-bench Verified has been the coding metric for two years. Every Anthropic, Google, and OpenAI launch deck featured a Verified bar chart, and most of the “best coding model” headlines you’ve read since 2024 were anchored to it. If you’ve ever chosen between Claude Opus and GPT-5 based on a Verified score, you were almost certainly looking at a number that doesn’t mean what you thought it meant.
What “saturated and contaminated” actually means
Saturation is the easier half. When a benchmark first launches, the gap between models is meaningful: a 30% jump tells you something. Once frontier models cluster at the top, a 1-2 point delta is noise. SWE-bench Verified has been there for months. Per OpenAI, “most frontier models now consistently score around 80%,” and the curve has flattened.
Contamination is the harder half, and on Verified it runs two layers deep.
The first layer is straightforward training-data overlap. Verified pulls real GitHub issues from popular Python repos like Django and SymPy. Those issues, and the patches that fixed them, were on the public web before the benchmark existed. Frontier models trained on a 2024 web crawl have functionally memorized them. OpenAI’s audit found that “all frontier models” demonstrated familiarity with Verified content, and some surfaced test requirements that never appeared in the problem statement directly in their reasoning chains. The model wasn’t reasoning about the problem; it was recalling the answer.
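To make the first layer concrete, here's a minimal sketch of a verbatim-overlap check, the crude kind of test you could run yourself; it is not OpenAI's audit methodology, and the file names are placeholders. The idea: if long word n-grams from a benchmark issue show up verbatim in a crawl sample, the "problem" is probably recallable rather than solvable.

```python
# Minimal sketch of a verbatim-overlap contamination check (not OpenAI's method).
# If long word n-grams from a benchmark issue appear verbatim in a training
# crawl, the task is likely recallable. File names below are placeholders.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}

def overlap_fraction(issue_text: str, crawl_text: str, n: int = 13) -> float:
    """Fraction of the issue's n-grams that also occur in the crawl sample."""
    issue = ngrams(issue_text, n)
    if not issue:
        return 0.0
    return len(issue & ngrams(crawl_text, n)) / len(issue)

# Placeholder usage: a fraction near 1.0 means the issue text was almost
# certainly in the training data verbatim.
issue = open("benchmark_issue.txt").read()   # hypothetical issue statement
crawl = open("crawl_sample.txt").read()      # hypothetical crawl slice
print(f"verbatim 13-gram overlap: {overlap_fraction(issue, crawl):.0%}")
```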
The second layer is more damning: a chunk of Verified is just broken. Of 138 problematic test cases OpenAI dug into, 49 tests were too narrowly defined, rejecting functionally correct fixes because they didn’t match the original commit’s exact implementation. 26 tests required features never mentioned in the problem description, which means a model could only pass them by guessing or by remembering the original commit. OpenAI estimates over 60% of the remaining problems are effectively unsolvable as written. That isn’t a benchmark; it’s noise dressed up as a leaderboard.
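For a sense of what "too narrowly defined" looks like in practice, here's a hypothetical, simplified illustration (not an actual Verified task): the test pins the exact error message from the original commit, so a patch that fixes the bug with different wording is graded as a failure.

```python
# Hypothetical, simplified illustration of an over-specified test (not from
# the actual Verified set). The problem statement only asks that invalid
# ports be rejected, but the test pins the original commit's exact wording.

import pytest

def parse_port(value: str) -> int:
    """A functionally correct fix: rejects out-of-range ports with its own message."""
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

def test_rejects_invalid_port():
    # Over-specified: any patch that doesn't reproduce this exact string fails,
    # even though it raises the right exception for the right input.
    with pytest.raises(ValueError, match=r"^Port must be between 1 and 65535$"):
        parse_port("70000")
```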
Why this happened, and why now
SWE-bench Verified was OpenAI’s own attempt to clean up the original SWE-bench from Princeton. OpenAI published Verified in August 2024 specifically to fix the obvious mis-graded problems in the parent set. It worked, briefly. By late 2025, as Mia Glaese and Olivia Watkins described on the Latent Space podcast, the team already saw the wall coming: scores were ticking up by half a point at a time, contamination signals were growing, and post-training teams were pushing for a harder target.
The deprecation is overdue. Most of the leaderboard hopping in the past six months has been unfalsifiable, in the sense that you couldn’t tell whether a model “improved” or just memorized one more issue. When a Verified score moves from 73.4 to 73.9, the change tells you nothing about coding ability. It tells you a lab fine-tuned on a different sample of public Python.
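You can put a rough number on that. Treating Verified as roughly 500 independent pass/fail problems (a simplification; real tasks are neither independent nor equally hard), the sampling noise alone on a score near 73% is about two points:

```python
# Back-of-the-envelope: how much a benchmark score can move from sampling
# noise alone. Assumes ~500 independent pass/fail problems, which is a
# simplification of how SWE-bench Verified is actually scored.

import math

def score_noise(score_pct: float, n_problems: int) -> float:
    """Standard error of a pass-rate estimate, in percentage points."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_problems)

se = score_noise(73.4, 500)
print(f"standard error: ±{se:.1f} points")         # ≈ ±2.0 points
print(f"95% interval:   ±{1.96 * se:.1f} points")   # ≈ ±3.9 points
# A 73.4 → 73.9 move is 0.5 points: well inside a single standard error.
```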
That’s why our Anthropic Claude Code quality postmortem didn’t lean on Verified: the regression was real, but Verified couldn’t see it. And it’s why OpenAI’s GPT-5.5 launch leaned on Pro and internal evals instead of leading with the Verified score.
Enter SWE-bench Pro
The recommended replacement is SWE-bench Pro from Scale AI. It’s a different shape of benchmark: 1,865 tasks across 41 professional repositories, of which 731 are public, 276 come from proprietary codebases, and 858 are held out for evaluation runs. The proprietary and copyleft (GPL) chunks are deliberately chosen because they’re harder for a frontier model to have ingested.
The numbers tell the story. On SWE-bench Verified, frontier models cluster around 80%. On SWE-bench Pro, the current leader is GPT-5.4 (xHigh) at 59.10%, followed by Muse Spark at 55.00% and Claude Opus 4.6 (thinking) at 51.90%. Tasks require, on average, 107 lines of code edited across 4.1 files, and the language mix tilts toward Go and Python, with JavaScript tasks noticeably harder.
Two things to notice about Pro. First, there’s still differentiation between models. A 7-point gap between #1 and #3 means a developer choosing tools is reading a real signal, not noise. Second, performance varies by language and repository, which means a “best coding model” claim is now contingent: best at Python in this codebase, third-best at TypeScript in another. That’s closer to reality than a single Verified percentage ever was.
What OpenAI is actually building toward
Beyond Pro, OpenAI is investing in GDPVal, a benchmark built around “tasks privately authored by domain experts and graded holistically by trained reviewers.” That’s expensive to run and expensive to maintain, which is the trade-off you accept when you give up a static, pre-generated test set. The upside is contamination resistance: if a model has never seen the task, it can’t have memorized the answer.
Glaese and Watkins on Latent Space telegraph the broader direction: open-ended design tasks, code quality and maintainability evals, multi-hour autonomy challenges. Less “fix this Django ticket,” more “build this feature, then justify your design.” The cost goes up. The signal goes up too.
The leaderboard hopping problem isn’t fixed
Worth flagging the obvious follow-up: SWE-bench Pro will eventually saturate too, probably faster than Verified did. A top score still below 60% gives the field maybe 18 months of room before the leaders cluster again, and contamination pressure is rising. Scale’s choice to keep 858 of the 1,865 tasks held out is the safety valve, but every public benchmark tilts toward leakage over time. The held-out set has to be rotated, and rotation is a maintenance burden Scale will eventually have to fund or hand off.
There’s also the harder structural problem: the labs evaluate against the same benchmarks they’re optimizing toward. Even Pro’s “private” half can leak through aggregated stats, error analysis blog posts, or partial sample releases. The only durable answer is what GDPVal is gesturing at: human-graded, freshly-authored tasks, run sparsely so the cost stays manageable. Expect the next benchmark debate to be about whether human grading scales, not about whether the test set is contaminated.
What this means for you
If you’re a developer comparing models for daily work, stop reading SWE-bench Verified scores. They’re a 2024 artifact. Look at SWE-bench Pro’s leaderboard for a directional read, and weight your own internal evals heavier than any public number. If a model is 5 points better on Pro at the language and codebase you work in, that’s worth something. A two-point Verified delta is worth nothing.
If you’re a buyer or a CTO, ask vendors specifically how they evaluate. “We score X on SWE-bench Verified” is now a yellow flag, not a green one. The labs that have moved past Verified, and are publishing Pro, GDPVal, or held-out internal numbers, are the ones taking evaluation seriously. The labs still leading with Verified scores are either behind on infrastructure or hoping you won’t notice.
The deeper lesson is structural. Static benchmarks have a shelf life of about 18 months at the current pace of model improvement. The same fate awaits Pro, probably on a shorter clock than its predecessors. Plan for the next deprecation now: invest in your own held-out test set, on your own code, that no public model has seen. That’s the only benchmark that doesn’t expire.
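A minimal shape for that kind of private eval is sketched below. Every name in it (the Task fields, run_model, the test commands) is an invented placeholder, not an existing harness; the point is that the tasks live in your codebase, are graded on behavior, and never get published.

```python
# Minimal sketch of a private, held-out coding eval on your own repositories.
# Everything here (Task fields, run_model, test commands) is a placeholder;
# swap in your own agent, repos, and pass/fail criteria.

import shutil, subprocess, tempfile
from dataclasses import dataclass

@dataclass
class Task:
    repo_path: str       # a repo no public model has trained on
    issue: str           # problem statement written by your own engineers
    test_cmd: list[str]  # the behavioral check that defines "solved"

def run_model(issue: str, repo_dir: str) -> None:
    """Placeholder: call your model or agent and apply its patch inside repo_dir."""
    raise NotImplementedError

def evaluate(tasks: list[Task]) -> float:
    solved = 0
    for task in tasks:
        workdir = tempfile.mkdtemp()
        shutil.copytree(task.repo_path, workdir, dirs_exist_ok=True)
        try:
            run_model(task.issue, workdir)
            # Grade on behavior (tests pass), not on matching a reference patch.
            result = subprocess.run(task.test_cmd, cwd=workdir, timeout=600)
            solved += int(result.returncode == 0)
        finally:
            shutil.rmtree(workdir, ignore_errors=True)
    return solved / len(tasks)
```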
Frequently asked questions
- Is SWE-bench Verified completely useless now?
- Not useless, but no longer differentiating. Frontier models cluster near 80% and most remaining problems are unsolvable. It's a saturated benchmark, the same way GLUE was for NLP in 2019.
- What is SWE-bench Pro?
- A successor benchmark from Scale AI with 1,865 tasks across 41 professional repositories, including GPL-licensed and proprietary code that's harder to memorize. The current leader scores just under 60%, leaving real headroom.
- Should I trust any coding leaderboard right now?
- Treat any score from a saturated benchmark as marketing. Look for benchmarks where the top score is below 60% and where problems use code your model couldn't have memorized. SWE-bench Pro, GDPVal, and held-out repository sets are the closer-to-real options.
- Will Anthropic and Google stop reporting SWE-bench Verified too?
- Probably yes. Anthropic already publishes SWE-bench Pro numbers alongside Verified. Google's recent Gemini coding releases lean on internal evals plus Pro. Expect Verified to disappear from launch decks within two model cycles.