devtake.dev
AI

DeepSeek V4 lands: 1.6T-param open MoE, 1M-token context, and SWE-bench within 0.2 of Opus 4.6

DeepSeek shipped V4-Pro and V4-Flash under MIT on April 24. V4-Pro hits 80.6% on SWE-bench Verified. V4-Flash is $0.14 in / $0.28 out.

Dieter Morelli · 5 min read · 4 sources
DeepSeek social card from the V4 API documentation release post.
Image: DeepSeek

DeepSeek dropped two preview models on Hugging Face Friday under the MIT License: DeepSeek-V4-Pro (1.6 trillion parameters, 49B activated) and DeepSeek-V4-Flash (284B total, 13B activated). Both ship with a 1-million-token context window and, critically, open weights. V4-Pro’s download is 865 GB. Flash is a more civilian 160 GB.

The numbers that matter

  • SWE-bench Verified. V4-Pro reports 80.6%, within 0.2 points of Claude Opus 4.6, per DeepSeek’s own technical report on Hugging Face.
  • Math. V4-Flash-Max hits 81.0 on Putnam-200, a perfect 120/120 on Putnam-2025, and 95.2 on HMMT 2026. Coding + math is where DeepSeek has always over-indexed.
  • Pricing (per million tokens). V4-Flash: $0.14 input / $0.28 output. V4-Pro: $1.74 / $3.48. By comparison, Claude Sonnet 4.6 runs $3 / $15 and GPT-5.5 just launched at $5 / $30.
  • Compute efficiency. Per DeepSeek’s paper, V4-Pro uses 27% of the single-token FLOPs and 10% of the KV cache size of V3.2. That’s what makes the 1M context window viable on non-frontier hardware.
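Those list prices translate into very different per-call bills. A quick back-of-envelope in Python, using only the prices above; the token counts are an illustrative agent-loop step, not a measured workload:

```python
# Per-million-token list prices from this article (input, output), in USD.
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro":   (1.74, 3.48),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-5.5":           (5.00, 30.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the listed prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# One illustrative agent step: 200k tokens of context in, 4k tokens out.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 200_000, 4_000):.2f}")
```

At those numbers a single 200k-in / 4k-out step is about three cents on Flash versus over a dollar on GPT-5.5, which is the whole story of the "leave it on by default" argument in one loop.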

Simon Willison’s first-pass review calls it “almost on the frontier, a fraction of the price” and puts V4 “approximately 3 to 6 months” behind GPT-5.4 and Gemini-3.1-Pro on subjective tasks. His SVG-generation test found Flash solid (clean bicycle) and Pro oddly worse (malformed pelican), which is a reminder that benchmarks and vibe tests don’t always agree.

What’s new architecturally

DeepSeek calls the attention stack DSA (DeepSeek Sparse Attention) plus token-wise compression. It’s the continuation of the efficiency trend DeepSeek kicked off with V2’s MLA and V3’s native-sparse attention. The practical outcome: a 1M-token context window that serves in production, not just on benchmark runs.

The API exposes both a thinking and a non-thinking mode behind a single endpoint, and the service speaks both the OpenAI ChatCompletions and Anthropic Messages protocols natively. That second part matters more than it sounds. Drop-in compatibility with the Anthropic API means Claude Code, OpenCode, and other agent frameworks can target DeepSeek with nothing more than a base-URL swap. DeepSeek’s release notes call this out explicitly.
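In practice the swap is configuration, not code. The sketch below is illustrative only: the endpoint path, model id, and which environment variables each tool honors are assumptions here, not values confirmed by DeepSeek’s docs.

```shell
# Hypothetical drop-in: point an Anthropic-protocol tool at DeepSeek.
# Endpoint path and model id below are assumed, not confirmed.
export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
export ANTHROPIC_API_KEY="sk-your-deepseek-key"
export ANTHROPIC_MODEL="deepseek-v4-flash"   # hypothetical model id

claude   # Claude Code now sends its Messages-protocol traffic to DeepSeek
```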

The open-weights angle

MIT License. Full weights on Hugging Face, mirrored on ModelScope. No access gating, no RAIL-style use restrictions, no paid-tier API lock-in. That puts V4 in the same license tier as Qwen 3.6 (which shipped as an open MoE last week) and ahead of Llama, which still carries a monthly-active-user carveout.

For teams that care about running frontier-adjacent models on private infrastructure, V4-Flash is the more interesting of the two. At 160 GB it fits on a pair of 80 GB H100s, and Willison mentioned fitting a quantized build on an M5 MacBook Pro. V4-Pro’s 865 GB is a data-center ask; you’re renting at least an eight-GPU H200 node to serve it.
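The hardware math is easy to sketch. A weight-only footprint estimate in Python; this deliberately ignores KV cache and activation memory, which a 1M-token context will add plenty of, and note the article’s 865 GB download for V4-Pro implies more than raw 4-bit weights:

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only memory footprint in GB; ignores KV cache and activations."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# V4-Flash at 284B total parameters:
print(weight_footprint_gb(284, 4))   # 142.0 GB at 4-bit: borderline for 2x 80 GB H100
print(weight_footprint_gb(284, 8))   # 284.0 GB at FP8

# V4-Pro at 1.6T total parameters:
print(weight_footprint_gb(1600, 4))  # 800.0 GB even at 4-bit: a multi-node or H200 ask
```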

Who’s actually affected

  • Frontier-lab pricing. OpenAI doubled API prices in the last quarter when GPT-5.5 launched at $5 / $30, while Anthropic’s Opus 4.7 held the line. V4-Pro at $1.74 / $3.48 is roughly an order of magnitude cheaper than Opus and half the cost of Sonnet. Every enterprise with a not-absolutely-frontier workload now has a real negotiating lever.
  • Coding agents. An 80.6 on SWE-bench Verified is the territory of the model Claude Code runs on internally. Last week’s Claude Code quality postmortem highlighted how much of the product is the underlying model, not the CLI. A cheap Anthropic-compatible drop-in changes the economics of long-running agent loops dramatically.
  • The open-source-is-China story. DeepSeek, Qwen, GLM, and Yi now collectively carry the weight of “frontier-adjacent, truly open” in a way no US lab currently matches. That matters for export-control debates, Pentagon procurement rules, and the ongoing Anthropic-blacklist conversation (see NSA/DoD split).

What’s unclear

  • Whether “Preview” sticks for long. DeepSeek’s deprecation note says the existing deepseek-chat and deepseek-reasoner endpoints retire July 24. V4 is currently labeled “Preview.” The final release probably lands within weeks, not months.
  • Third-party reproducibility. DeepSeek’s own benchmark numbers are plausible given V3’s track record, but independent evals (Artificial Analysis, LMSYS, EQ-Bench) take days to weeks. Treat the 80.6% SWE-bench figure as vendor-reported until OpenRouter or Hugging Face’s leaderboard confirms it.
  • Inference cost outside the official API. The DeepSeek Platform price is heavily subsidized. Renting V4-Pro from Together, Fireworks, or Groq will land at a multiple of the $1.74 input rate. Factor that in before replacing Claude in production.

What this means for you

If you’re running Claude or GPT in a cost-sensitive coding loop, V4-Flash is worth a same-day eval. $0.14 input is in “leave it on by default” territory, and the Anthropic-API compatibility means swapping base URLs in a Claude Code config file is a ten-minute job. The downside risk is a model that’s subjectively worse at the last 10% of the task, and that’s exactly what Willison’s SVG test hinted at. Use V4 for volume, keep Claude for the hard calls, and run your own eval harness on the tasks that pay your bills.
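If you take the eval-harness advice literally, the harness doesn’t need to be fancy. A minimal A/B sketch, where `call_model` is a stand-in for whatever client wrapper you already use and the model names are placeholders:

```python
# Minimal A/B eval harness sketch: run your own tasks, count passes.
from typing import Callable

Task = tuple[str, Callable[[str], bool]]  # (prompt, pass/fail checker)

def eval_model(call_model: Callable[[str, str], str],
               model: str,
               tasks: list[Task]) -> float:
    """Fraction of tasks whose output passes its own checker."""
    passed = sum(check(call_model(model, prompt)) for prompt, check in tasks)
    return passed / len(tasks)

# Toy run with a stubbed model call and one task; in practice the tasks
# come from your own backlog and the stub is a real API client.
tasks: list[Task] = [("Reply with exactly: OK", lambda out: out.strip() == "OK")]
stub = lambda model, prompt: "OK"
print(eval_model(stub, "deepseek-v4-flash", tasks))   # 1.0
```

Swap the stub for real clients pointed at V4-Flash and your incumbent, run the same task list through both, and you have the per-task pass rates that matter more than any leaderboard.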

If you’re picking a model for an on-prem deployment: V4-Flash is the new floor. The old answer of “Llama or Qwen” is now “Qwen or V4-Flash.” Llama 4 is no longer in the conversation at this price-performance point.
