
Microsoft tested 19 LLMs as document editors. Even the best ones corrupted 25% of the content.

The DELEGATE-52 benchmark tests AI editing across 52 professional domains. Frontier models corrupt a quarter of document content over long workflows.

Dieter Morelli · 7 min read · 3 sources
Image: the DELEGATE-52 project repository on GitHub (microsoft/delegate52), Microsoft's benchmark for testing LLM document-editing fidelity

Microsoft Research asked 19 LLMs to edit professional documents across 52 domains, from Python code to music notation to financial earnings statements. The best model preserved 81% of the original content after 20 rounds of editing. The worst kept 10%. And the errors weren’t obvious: they looked like reasonable edits until you diffed them against the source.

A new paper from Philippe Laban, Tobias Schnabel, and Jennifer Neville introduces DELEGATE-52, a benchmark designed to answer a question that matters more every month: can you trust an AI to work on your files unsupervised? The study ran each model through 20 sequential editing tasks on real professional documents. The results are sobering. Even the best frontier models silently corrupt about a quarter of the content they’re asked to work with. The problem compounds the longer you let them work, and giving models better tools makes it worse, not better.

The timing is pointed. “Vibe coding” has entered the mainstream, companies are racing to build agentic systems that let LLMs work unsupervised, and the assumption underlying all of it is that models can be trusted with long editing sessions. This paper puts a number on why that assumption doesn’t hold.

The benchmark: 52 domains, 20 rounds, no guardrails

DELEGATE-52 doesn’t test whether an LLM can write a good email. It tests whether an LLM can touch a complex document 20 times in a row without quietly wrecking it.

Each test environment starts with a seed document, applies a forward transformation (asking the model to edit something specific), then applies the inverse transformation (asking it to undo the edit). If the model faithfully executed both, you get back what you started with. If it didn’t, the difference is the corruption.
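In pseudocode, the round trip looks something like this (a minimal sketch; `model.edit` and the similarity measure are illustrative stand-ins, not the paper’s actual harness):

```python
import difflib

def reconstruction_score(model, seed_doc: str, rounds: list[tuple[str, str]]) -> float:
    """Run forward/inverse edit pairs against a seed document and
    measure how much of the original survives the session."""
    doc = seed_doc
    for forward, inverse in rounds:      # e.g. 20 instruction pairs
        doc = model.edit(doc, forward)   # forward transformation
        doc = model.edit(doc, inverse)   # inverse: should undo it
    # A faithful model returns the seed exactly; any difference is
    # corruption it introduced along the way.
    return difflib.SequenceMatcher(None, seed_doc, doc).ratio()
```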

The benchmark includes 310 work environments across those 52 domains, with documents averaging 3,000 to 5,000 tokens plus 8,000 to 12,000 tokens of distractor context (other files in the same “project” that the model shouldn’t touch but can see).

That distractor context matters. It’s how real delegation works: you hand over a folder, not a single file. And models do worse when there’s more noise around the target document.

How the models scored

The paper tested 19 models. Reconstruction scores after 20 interactions tell the story:

  • Gemini 3.1 Pro: 80.9% (best in class)
  • Claude 4.6 Opus: 73.1%
  • GPT 5.4: 71.5%
  • GPT 5.2: 66.1%
  • Claude 4.6 Sonnet: 66.0%
  • Gemini 3 Flash: 35.8%
  • GPT 4o: 14.7%
  • GPT 5 Nano: 10.0%

That 25% average corruption rate for frontier models means that after 20 back-and-forth edits, one in four elements in your document has been silently changed. The weaker models aren’t just worse; they’re catastrophic. GPT 5 Nano preserved barely 10% of the original content.

The corruption isn’t random either. Weaker models tend to delete content outright. Frontier models are sneakier: they introduce modifications, subtle rewrites, and alterations that look plausible but don’t match the original. If you’re skimming the output rather than diffing it line by line, you won’t catch it.

What makes corruption worse

Three factors compound the damage:

Document size. Every additional 1,000 tokens increases degradation by roughly 3.6% after 20 interactions. Bigger files give models more surface area to introduce errors, and the errors accumulate (a rough sketch of the arithmetic follows below).

Interaction length. Degradation doesn’t plateau. The paper tracked it out to 100 interactions and the curve kept climbing. There’s no point where the model “settles in” and stops corrupting. It just keeps going.

Distractor files. Irrelevant documents in the same context window worsened corruption by 2% to 8% over long interactions. Models occasionally leak content between files or get confused about which document they’re editing.

The most counterintuitive finding: agentic tool use made things worse. Giving models access to file-manipulation tools and code interpreters increased degradation by 3% to 6% compared to plain-chat delegation. The tools gave models more ways to modify documents, and they used those ways to introduce more errors.
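Putting rough numbers on the document-size effect (a back-of-the-envelope sketch that assumes the paper’s ~3.6%-per-1,000-tokens figure extrapolates linearly, which the paper doesn’t guarantee):

```python
def extra_degradation(doc_tokens: int, rate_per_1k: float = 0.036) -> float:
    """Illustrative only: additional expected degradation after ~20
    interactions, relative to a 1,000-token baseline document."""
    return max(0, doc_tokens - 1_000) / 1_000 * rate_per_1k

# A 5,000-token document carries four extra blocks of 1,000 tokens:
print(f"{extra_degradation(5_000):.1%}")  # 14.4% more degradation
```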

Where it breaks down completely

Not all domains are equal. Python code was the one bright spot, with most models scoring above 98% on reconstruction. That makes sense: code has syntax checkers, test suites, and clear right-or-wrong behavior that constrains the model’s output.

Everything else was rougher. Music notation, weaving patterns, 3D object descriptions, and financial earnings statements all showed catastrophic corruption rates (below 55% reconstruction). These are domains where the format is precise but the feedback loop is weak. There’s no compiler to tell the model it just changed a C-sharp to a D-flat.

The domains that suffered most share a pattern: they have structured formats that look like natural language but aren’t. A weaving draft notation specifies exact thread counts and sequences. An earnings statement has specific numbers in specific positions. Models treat these as prose and “improve” them. That subtle rewriting is exactly the corruption mode that frontier models default to: they don’t delete, they edit in ways that look reasonable but aren’t faithful.

The paper’s appendix breaks down the failure modes in detail. Weaker models simply drop content. Gemini 3 Flash, for example, would progressively shorten documents over multiple rounds until barely a third remained. Frontier models do something more insidious: they rephrase, reformat, and adjust values in ways that pass a casual review but fail a semantic diff.

What the Hacker News crowd already knew

The Hacker News discussion hit 400 points and mostly echoed what practitioners have been noticing anecdotally: models drift over long conversations, introduce phantom changes, and rewrite things they weren’t asked to touch. DELEGATE-52 puts a number on what was previously a vague frustration.

It also explains why code-focused AI workflows feel so much more reliable than document-focused ones. Python’s tight feedback loop catches drift immediately. A corrupted recipe, legal brief, or musical score has no such safety net. The paper confirms this empirically: Python is the sole domain where most models achieve “ready” status (above 98% reconstruction), because syntax errors and test failures create immediate corrective pressure.
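That corrective pressure is cheap to replicate for code. A single parse call rejects output that isn’t even valid Python, a gate most other domains simply don’t have (a minimal sketch using the standard library):

```python
import ast

def is_valid_python(source: str) -> bool:
    """Gate model output on syntactic validity. This catches gross
    corruption instantly; test suites and type checkers then catch
    the subtler semantic drift."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```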

The broader pattern is clear. Domains with strong verification loops (code, structured data with schemas) survive delegation. Domains without them (prose, notation, visual formats) get quietly degraded. The models aren’t malicious. They’re optimistic editors who think they’re helping when they’re not.

What this means for you

If you’re delegating document editing to an LLM, treat every output like an unreviewed pull request. Diff it against the original. Don’t assume that because the model said “done” it actually preserved what you cared about.
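Concretely, that review can be as simple as a unified diff before you accept the edit (a sketch using Python’s standard library):

```python
import difflib

def review_diff(original: str, edited: str, name: str = "document") -> str:
    """Surface every change the model made, asked-for or not,
    so silent edits can't slip past a casual read."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        edited.splitlines(keepends=True),
        fromfile=f"{name} (original)",
        tofile=f"{name} (model output)",
    ))
```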

The practical implication for teams building agentic systems: build domain-specific validation into the loop. The reason code delegation works isn’t that models are better at code. It’s that code has parsers, tests, and type checkers that catch drift before it compounds. Any domain you want reliable delegation in needs the same kind of automated verification. A legal team delegating contract edits needs a semantic diff tool. A finance team delegating spreadsheet updates needs cell-level assertion checks. Without those, you’re trusting a system the paper has shown you shouldn’t.
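As one illustration of what such a guardrail might look like for the finance case, here is a crude cell-level assertion: every number in the original must survive the edit unless it was explicitly allowed to change. (The regex and the policy are assumptions for the sketch, not anything from the paper.)

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def unexpected_numeric_changes(original: str, edited: str,
                               allowed: frozenset[str] = frozenset()) -> set[str]:
    """Return numbers present in the original but missing from the
    edited version and not whitelisted -- the 'plausible rewrite'
    corruption mode a casual review misses."""
    before = set(NUMBER.findall(original))
    after = set(NUMBER.findall(edited))
    return before - after - allowed
```

A non-empty return value bounces the edit back for human review instead of letting it land silently.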

The paper’s authors put it plainly: current LLMs are unreliable delegates. That doesn’t mean delegation is dead. It means the guardrails haven’t been built yet. For now, the human in the loop isn’t optional, and neither is the diff.

Frequently Asked

What is DELEGATE-52?
A benchmark from Microsoft Research that simulates long document-editing workflows across 52 professional domains, from coding to crystallography to music notation.
Which LLM performed best at document editing?
Gemini 3.1 Pro led with 80.9% reconstruction fidelity, followed by Claude 4.6 Opus at 73.1% and GPT 5.4 at 71.5%.
Does agentic tool use help prevent document corruption?
No. The study found that giving models file and code tools actually increased degradation by 3-6% compared to plain chat delegation.
What types of documents are most affected?
Music notation, weaving patterns, 3D objects, and earnings statements all showed catastrophic corruption. Python code was the only domain where most models scored above 98%.
