#benchmarks

Alibaba's Qwen3.8-Max beats Fable 5 on Terminal-Bench, and the weights go public next week

Qwen3.8-Max is a 2.4-trillion-parameter MoE that tops Claude Fable 5 on Terminal-Bench 2.1 and trails it badly on SWE-bench Pro. It's the first open Max-tier Qwen.

The GitHub social preview card for the openai/ten-proofs repository, described as Lean certificates accompanying proofs in mathematics and theoretical computer science, showing 386 stars and 35 forks.

AI·54 minutes ago

Ten decade-old math problems fell to an unreleased OpenAI model, for about $2,000 of tokens each

OpenAI published ten results in math and theoretical CS from an internal build of Astra, with Lean 4 certificates for every proof. What that verification does and doesn't settle.

AI·2 days ago

DeepSeek's new 304B agentic model now runs on a single 128GB workstation

Salvatore Sanfilippo repacked DeepSeek V4 Flash into a lossless MXFP4 GGUF that streams from SSD at over 20 tokens a second. The hardware bill, and where hosted still wins.

Anthropic's Claude Opus 5 announcement artwork: a large numeral 5 formed from an arrangement of vintage speckled bird-egg illustrations on a cream background.

AI·last week

Claude Opus 5 nears Fable 5's frontier intelligence at half the price

Anthropic shipped Claude Opus 5 at the same $5/$25 per million tokens as Opus 4.8. It nears Fable 5's intelligence at half the cost, with new effort and fallback controls.

Two pelicans face each other with crossed beaks on dark water, mirrored in the surface, a nod to the informal 'pelican on a bicycle' LLM benchmark

AI·2 weeks ago

Kimi K3 trades blows with Anthropic's Fable, and Moonshot is opening the weights

Kimi K3, GLM 5.2 and DeepSeek V4 put open-weight AI next to the frontier this month. What each model is good at, and why the benchmarks mislead.

AI·4 weeks ago

GitHub ran four frontier models through Copilot's harness. None won every task.

GitHub benchmarked Copilot's agent harness against Claude Code and Codex CLI on five tests. The token savings are real, and the best model depends on the task.

Anthropic Claude Sonnet 5 announcement graphic

AI·last month

Claude Sonnet 5: cheaper agents on paper, until you count the new tokenizer's tokens

Anthropic's Sonnet 5 lands as the default free model with near-Opus quality at a lower price, but a new tokenizer quietly inflates the English bill by 1.4x.

Google Gemini chat interface shown on screen

AI·last month

Google reportedly delays Gemini 3.5 Pro to July to keep tuning the model

Google has pushed its frontier Gemini 3.5 Pro to July while Flash already ships, according to Business Insider. Here's what slipped and why it matters.

Benchmark comparison card for GLM-5.2 showing it as the leading open weights model

AI·last month

GLM-5.2 was trained on Huawei chips, not Nvidia. The open weights beat GPT-5.5 on coding.

Zhipu AI's GLM-5.2 is a free-to-download model trained without Nvidia silicon. Here's what the benchmarks claim and why developers should care.

AI·2 months ago

Claude Fable 5 is Anthropic's first public Mythos-class model. It tops SWE-Bench Pro at 80.3%.

Claude Fable 5 hits 80.3% on SWE-Bench Pro and ships on Bedrock and Copilot at $10/$50 per million tokens, free on paid plans only through June 22.

The Stanford Law School building on Stanford University's campus

AI·2 months ago

Stanford tested AI against law professors. The pros picked the AI 75% of the time.

A blinded Stanford Law study had 16 professors grade AI tutoring answers against their own. Here's what the 75% win rate actually measures, and what it doesn't.

AI·2 months ago

Claude Opus 4.8 flags the bugs it writes four times more often than Opus 4.7

Anthropic's Opus 4.8 posts 69.2% on SWE-Bench Pro, lets code flaws slip 4x less often, and ships parallel subagents in Claude Code. Here's what matters.

The DELEGATE-52 project repository on GitHub, showing Microsoft's benchmark for testing LLM document editing fidelity

AI·3 months ago

Microsoft tested 19 LLMs as document editors. Even the best ones corrupted 25% of the content.

The DELEGATE-52 benchmark tests AI editing across 52 professional domains. Frontier models corrupt a quarter of document content over long workflows.

A mathematics lecture hall with equations on blackboards

AI·3 months ago

Timothy Gowers gave GPT-5.5 an open math problem. It returned a novel proof in 17 minutes.

The 1998 Fields Medal winner reports GPT-5.5 Pro produced a novel proof for an unsolved math problem in 17 minutes, and says the era of owning theorems is ending.

AI·3 months ago

OpenAI just retired SWE-bench Verified. The headline coding benchmark of 2025 is officially saturated.

OpenAI says SWE-bench Verified is saturated and contaminated, and 60% of remaining problems are unsolvable. Here's what comes next, and why every coding leaderboard is suspect.

DeepSeek social card from the V4 API documentation release post.

AI·3 months ago

DeepSeek V4 lands: 1.6T-param open MoE, 1M-token context, and SWE-bench within 0.2 of Opus 4.6

DeepSeek shipped V4-Pro and V4-Flash under MIT on April 24. V4-Pro hits 80.6% on SWE-bench Verified. V4-Flash is $0.14 in / $0.28 out per million tokens.

Claude Opus 4.7 launch artwork from the Anthropic news post

AI·4 months ago

Claude Opus 4.7 is here, and the long-context benchmarks got worse

Anthropic's Opus 4.7 is state-of-the-art on SWE-bench and CursorBench, but independent tests show regressions on long-context retrieval and thematic reasoning.