devtake.dev

Stanford tested AI against law professors. The pros picked the AI 75% of the time.

A blinded Stanford Law study had 16 professors grade AI tutoring answers against their own. Here's what the 75% win rate actually measures, and what it doesn't.

Dieter Morelli · 8 min read · 3 sources
[Image: The Stanford Law School building on Stanford University's campus. Credit: King of Hearts / CC BY-SA 3.0 via Wikimedia Commons]

A blinded Stanford Law study just handed AI a 75% win rate against law professors. Not in court, and not on the bar exam. In a narrow, careful test of one thing: who writes a better short answer to a contract-law question.

That distinction is the whole story. The headline number is real and the methodology is unusually clean for an “AI beats experts” claim. But what got measured is far smaller than “AI beats lawyers,” and the authors are the first to say so. If you’ve spent the last year watching AI-in-law stories swing between hype and panic, this one is worth reading slowly, because the careful version is more interesting than the headline.

What the study actually measured

Sixteen law professors from U.S. law schools each wrote a set of representative contract-law questions, the kind a student fires off after class or in office hours. The pooled set came to 40 questions across four buckets: recalling a case or code section, recalling doctrine, working through hypotheticals, and policy reasoning. Each professor wrote their own answers. Then two AI systems answered the same questions.

The clever part is the grading. The professors judged 2,918 anonymized pairs, each pitting one response against another, without knowing which came from a human and which from a machine, per the working paper on SSRN. That blinding matters. Plenty of “AI scored well” results fall apart because a human knew they were grading AI and either went soft or went hunting for flaws. Here, nobody knew.
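For readers who build evals, the blinding step is easy to sketch. The snippet below is a minimal illustration and not the authors' pipeline: `Answer`, `make_blinded_pair`, the source labels, and the placeholder answer texts are all hypothetical. The idea is simply that the grader sees two texts in random order, while the mapping back to who wrote what stays in a separate key.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    source: str  # hypothetical label, e.g. "professor_07" or "model_a"; never shown to graders
    text: str


def make_blinded_pair(a: Answer, b: Answer, rng: random.Random):
    """Return (what the grader sees, the hidden key used to un-blind later)."""
    # Randomize left/right so position carries no hint about the source.
    left, right = (a, b) if rng.random() < 0.5 else (b, a)
    shown = {"left": left.text, "right": right.text}
    key = {"left": left.source, "right": right.source}
    return shown, key


rng = random.Random(0)  # fixed seed only so the example is repeatable
shown, key = make_blinded_pair(
    Answer("professor_07", "Answer text one"),  # placeholder text
    Answer("model_a", "Answer text two"),       # placeholder text
    rng,
)
```

Only `shown` would go to the grader. The `key` is joined back in after the preference is recorded, which is what lets a win be attributed to a human or a model without the judge ever knowing.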

Across those pairs, professors preferred the AI response about 75% of the time. Google’s Gemini 2.5 Pro posted an average win rate near 76% and beat every human instructor except one. NotebookLM, loaded with the course casebook as retrieval context, did even better, beating every human with a single tie, Eugene Volokh noted in his write-up of the paper. Two models. One subject. One question format. That’s the box this result lives in.
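To make the arithmetic behind a "win rate" concrete, here is a minimal sketch. It assumes each judgment is recorded as a tuple of two sources and the preferred one, and it ignores ties. The function name `win_rates` and the toy judgments are invented for illustration and have nothing to do with the study's data.

```python
from collections import defaultdict


def win_rates(judgments):
    """judgments: iterable of (source_x, source_y, winner), winner being one of the two."""
    wins = defaultdict(int)
    total = defaultdict(int)
    for x, y, winner in judgments:
        if winner not in (x, y):
            raise ValueError(f"winner {winner!r} is not in the pair ({x!r}, {y!r})")
        total[x] += 1
        total[y] += 1
        wins[winner] += 1
    # Win rate = comparisons won / comparisons the source took part in.
    return {source: wins[source] / total[source] for source in total}


# Toy data, invented for illustration only.
toy_judgments = [
    ("model_a", "prof_1", "model_a"),
    ("model_a", "prof_1", "model_a"),
    ("model_a", "prof_1", "prof_1"),
    ("model_a", "prof_2", "model_a"),
    ("model_a", "prof_2", "model_a"),
]
rates = win_rates(toy_judgments)
# model_a won 4 of its 5 comparisons; prof_1 won 1 of 3; prof_2 won 0 of 2.
```

Note what this number is: a share of head-to-head preferences, not a share of answers that were correct.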

The safety numbers are the part most people skip, and they’re the most reassuring. Professors could flag any answer as pedagogically harmful, meaning it would actively mislead a student. Gemini was flagged roughly 3.4% of the time and NotebookLM roughly 3.6%. The humans? They ranged from 1% all the way up to nearly 40%. So the machines didn’t just win on preference. They clustered near the safest professors, not the average one.

Why “preferred” isn’t “better at law”

Here’s where the careful reading earns its keep. A 75% preference rate sounds like a verdict. It’s closer to a taste test.

Contract law isn’t arithmetic. As co-author Sarath Sanga put it, “In most fields where AI gets tested, there’s a right answer. In law, there often isn’t.” So the professors weren’t checking answers against a key. They were judging which response they’d rather a student receive, an all-in call that folds clarity, structure, completeness, and tone in with legal accuracy. AI writes clean, well-organized, confident prose. That’s exactly the surface humans reward in a blind read, and it’s exactly the surface that can hide a subtle error.

The question split matters here too. The 40 questions weren’t all soft. Some asked students to recall a specific case or code section, where there genuinely is a checkable answer. Others were open hypotheticals and policy prompts, where reasonable lawyers disagree. The AI didn’t only win on the squishy ones. As co-author Julian Nyarko put it, “These weren’t just simple questions with obvious answers.” That breadth is what makes the result hard to wave away as a fluke of easy recall items, and it’s also what makes the next question, whether “preferred” tracks “correct,” the one that actually matters.

The authors caught this themselves. They note that engineered textual features, things like length and formatting, explain only part of the AI’s advantage. Translation: the models aren’t just winning on polish, but polish is in the mix and they can’t fully separate it out. When the yardstick is “which answer do I prefer,” a model that’s a strong writer and a decent lawyer can beat a strong lawyer who’s a mediocre writer. That tells you something useful about tutoring. It tells you much less about who you’d want drafting an actual merger agreement.
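One crude way to probe the polish question is to split the pairs by a surface feature and see whether the AI's edge survives. The sketch below stratifies by answer length. It is a simple illustration under stated assumptions, not the method the paper uses: the field names `ai_words`, `human_words`, and `ai_won`, the function `win_rate_by_length`, and the toy rows are all hypothetical.

```python
def win_rate_by_length(pairs):
    """Split AI-vs-human pairs by whether the AI answer was longer, then compare win rates."""
    buckets = {"ai_longer": [0, 0], "ai_not_longer": [0, 0]}  # [ai wins, total pairs]
    for p in pairs:
        name = "ai_longer" if p["ai_words"] > p["human_words"] else "ai_not_longer"
        buckets[name][0] += int(p["ai_won"])
        buckets[name][1] += 1
    # None marks an empty bucket, where a rate is undefined.
    return {name: (w / n if n else None) for name, (w, n) in buckets.items()}


# Toy rows, invented for illustration only.
toy_pairs = [
    {"ai_words": 180, "human_words": 90, "ai_won": True},
    {"ai_words": 200, "human_words": 120, "ai_won": True},
    {"ai_words": 150, "human_words": 60, "ai_won": True},
    {"ai_words": 80, "human_words": 110, "ai_won": True},
    {"ai_words": 70, "human_words": 130, "ai_won": False},
]
by_length = win_rate_by_length(toy_pairs)
# ai_longer: 3 wins in 3 pairs; ai_not_longer: 1 win in 2 pairs.
```

If the win rate collapsed whenever the AI answer was not the longer one, length would be doing most of the work. If it held up, something other than length would be driving the preference. A real analysis would need far more pairs and more features than this.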

There’s also a real risk the study doesn’t resolve, the same one that keeps showing up across AI-in-legal stories: confident fluency masking failure. We’ve covered how LLMs quietly corrupt documents when they’re handed delegation tasks, and the legal world has already seen the consequences when AI confidence outruns AI accuracy, from sanctioned attorneys citing fake cases to the Pennsylvania lawsuit over a Character.AI bot posing as a licensed professional. A 3.4% harmful-flag rate is genuinely low. It isn’t zero, and “rarely wrong, but wrong with total confidence” is a hard failure mode to catch in a fast-moving classroom.

What it says about how we test AI

Step back from law for a second, because the more durable lesson here is about evaluation itself.

Most AI benchmarks have an answer key. SWE-bench checks whether the code passes tests. Math benchmarks check the final number. That’s why they’re easy to game and easy to saturate, which is roughly why OpenAI retired SWE-bench Verified once models started topping it. This study does something harder: it measures performance in a judgment-rich field, where the “right” answer is contested and the evaluator is a domain expert weighing the whole response, not checking a box.

That’s a more realistic test of where AI actually gets deployed. Tutoring, drafting, advising, summarizing: most knowledge work doesn’t come with a test suite. Lead author Alejandro Salinas framed the contribution as shifting attention to “what AI tutoring can contribute to learning in judgment-rich fields like law.” The flip side, and the open question, is that judgment-rich evaluation inherits the judge’s biases. If professors reward fluent structure, you’re partly measuring fluency. Building evals that separate “sounds right” from “is right” in fields without answer keys is one of the genuinely unsolved problems in AI measurement, and this paper is a sharp illustration of why.

One more caveat worth stating plainly: this is a working paper, posted to SSRN in May 2026 and not yet peer-reviewed. The figures are the authors’ own. Treat them as numbers to engage with, not settled findings, and expect that the next version of the paper may move them.

What this means for you

If you’re a law student, the practical read is boring and correct: a good model is now a credible study aid for short doctrinal questions, and it’ll often explain a hypothetical more clearly than a rushed office-hours reply. Use it like one. Check its citations, because the small share of answers that mislead (roughly 3.4% of Gemini’s were flagged as harmful) is the share that’ll sink you on an exam, and it’ll be wrong with a straight face. If you’re building or buying legal AI, don’t quote this 75% to a client. It measures tutoring quality on contract questions, not practice readiness, and the authors drew that line on purpose. And if your job is evaluating AI at all, the takeaway to keep is this: blind, expert-judged comparison in a field with no answer key is a better mirror of real deployment than another leaderboard, and it’s also harder to trust. Julian Nyarko’s own summary is the right posture to leave with: “blanket skepticism may be equally unwarranted.” That cuts both ways: the result challenges blanket dismissal of AI in law, and it doesn’t license blanket trust either.

Frequently Asked

Which AI models did the study test?
Two: a stock version of Google's Gemini 2.5 Pro, and NotebookLM with the course casebook loaded as retrieval context. NotebookLM beat every human instructor, with a single tie; Gemini lost to one.
Does this mean AI can replace lawyers?
No. The study measured the quality of short tutoring answers to contract-law questions, graded by professors. It says nothing about practicing law, court outcomes, or client representation.
Were the professors grading their own questions?
The 16 professors wrote the 40 questions and the human answers, then blind-graded anonymized pairs without knowing which response was AI and which was a peer's. That blinding is the study's strongest feature.
How often were AI answers flagged as harmful?
About 3.4% for Gemini and 3.6% for NotebookLM. Human professors ranged from 1% to nearly 40%, so the AI sat near the safest instructors, not the average one.
Is the paper peer-reviewed?
Not yet. It was posted to SSRN in May 2026 as a working paper, with Alejandro Salinas as lead author and Stanford's Julian Nyarko among the co-authors, who span Yale, NYU, Chicago and other schools. Treat the figures as the authors' own, pre-review.