A Stanford-led team built MedHELM, a benchmark that tests medical AI on 121 real clinical tasks instead of licensing-exam trivia. The results reshuffle which models look best once cost is counted.

Large language models pass the US medical licensing exam with scores that would make most medical students weep. That number gets quoted a lot. It also means almost nothing about whether the same model can write a discharge summary a nurse would trust, or draft patient instructions a worried parent can actually follow.
A large team led by researchers at Stanford set out to close that gap. In Nature Medicine, they introduce MedHELM, a framework built to grade medical AI on the work clinicians actually do rather than on multiple-choice questions with tidy right answers. The starting point was not a dataset. It was a taxonomy, put together with clinicians, describing what medicine looks like when you break it into pieces.
That taxonomy sorts medical AI applications into five broad categories: clinical decision support, clinical note generation, patient communication, medical research, and administration. Underneath sit 22 subcategories and 121 specific tasks. Diagnostic decisions and treatment planning fall under decision support. Visit documentation and procedure reports sit under note generation. Scheduling and workflow coordination, the unglamorous machinery of a clinic, get their own place too.
To turn that map into a test, the authors assembled a benchmark suite of 37 evaluations spanning every subcategory. The design choice worth noticing is that these tasks reflect daily practice, not the rare zebra cases that make good exam questions. Writing an education handout is boring. It is also most of what a working clinical tool would be asked to do.
Grading open-ended clinical writing is hard, because there is rarely one correct output. The team leaned on an automated approach they call an LLM-jury: multiple AI evaluators scoring each response against criteria that human experts defined in advance. That lets the benchmark scale across dozens of tasks without a physician reading every answer. It also means the graders are the same kind of system being graded, a tension the authors are open about.
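To make the jury idea concrete, here is a minimal sketch of how several evaluators' ratings might be combined into one score for a single response. The juror names, the three criteria, the 1-to-5 scale, and the plain averaging are all assumptions made for illustration; the article does not describe the paper's actual rubric or aggregation rule.

```python
from statistics import mean

def jury_score(ratings_by_juror):
    """Combine several jurors' ratings of one response into a single score.

    Assumed scheme (not taken from the paper): average each juror's
    per-criterion ratings, then average those per-juror means.
    """
    per_juror = [mean(list(r.values())) for r in ratings_by_juror.values()]
    return mean(per_juror)

# Hypothetical ratings on an assumed 1-5 scale for one model response.
ratings = {
    "juror_a": {"accuracy": 5, "completeness": 4, "clarity": 4},
    "juror_b": {"accuracy": 4, "completeness": 4, "clarity": 5},
    "juror_c": {"accuracy": 4, "completeness": 3, "clarity": 4},
}

score = jury_score(ratings)
print(round(score, 2))
```

Averaging across several jurors is one simple way to dampen the quirks of any single evaluator, though it does nothing about biases the jurors share.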
They then ran nine frontier models through the gauntlet. The lineup included Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3, and o3-mini. The reasoning-heavy models came out ahead. DeepSeek R1 and o3-mini posted win rates of 66 percent, the strongest in the group.
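A win rate of this kind is usually read as the share of head-to-head comparisons a model wins against its rivals across the benchmark tasks. The sketch below computes it that way on invented numbers. The model names, task names, and scores are hypothetical, and treating a tie as a non-win is an assumption rather than the paper's stated rule.

```python
def mean_win_rate(scores, model):
    """Fraction of (task, rival) comparisons in which `model` scores strictly higher.

    `scores` maps task name -> {model name -> score}. Ties count as
    non-wins here, which is an assumption made for this sketch.
    """
    wins = 0
    comparisons = 0
    for task_scores in scores.values():
        for rival, rival_score in task_scores.items():
            if rival == model:
                continue
            comparisons += 1
            if task_scores[model] > rival_score:
                wins += 1
    return wins / comparisons

# Invented scores for three placeholder models on three placeholder tasks.
scores = {
    "note_generation":  {"model_x": 0.82, "model_y": 0.75, "model_z": 0.79},
    "patient_messages": {"model_x": 0.70, "model_y": 0.74, "model_z": 0.68},
    "decision_support": {"model_x": 0.66, "model_y": 0.61, "model_z": 0.69},
}

for name in ("model_x", "model_y", "model_z"):
    print(name, round(mean_win_rate(scores, name), 2))
```

Because the figure is relative, it says how often a model beats the others in the lineup, not how good its output is in absolute terms.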
The more interesting result is what happened once cost entered the picture. Claude 3.5 Sonnet matched those top scores while running at roughly 15 percent lower computational cost. On a leaderboard that ranks only by accuracy, that model looks like a runner-up. In a hospital deciding what it can afford to run at scale, it might be the obvious pick. MedHELM is built to surface exactly that kind of trade-off, so a health system can choose a tool on evidence instead of on a licensing-exam headline.
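One way to frame that trade-off is to keep only the models that no rival beats on both win rate and cost at once, sometimes called the Pareto front. The sketch below does this on made-up figures. The model names, win rates, and cost units are placeholders, not results from the paper.

```python
def pareto_front(models):
    """Return the names of models not dominated on (win rate, cost).

    `models` maps name -> (win_rate, cost). A model is dominated when some
    other model has a win rate at least as high and a cost at least as low,
    and is strictly better on at least one of the two.
    """
    front = []
    for name, (win_rate, cost) in models.items():
        dominated = any(
            o_win >= win_rate and o_cost <= cost
            and (o_win > win_rate or o_cost < cost)
            for other, (o_win, o_cost) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Placeholder figures: (win rate, relative cost per run). Not real results.
models = {
    "model_a": (0.66, 100),
    "model_b": (0.64, 60),
    "model_c": (0.60, 80),
    "model_d": (0.50, 40),
}

front = pareto_front(models)
print(front)
```

In this toy data, model_c drops out because model_b is both stronger and cheaper, while the other three each represent a defensible choice at a different budget.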
A few limits are worth stating plainly. The LLM-jury method is efficient but circular by nature, and the paper does not claim it substitutes for careful human review on high-stakes outputs. A high win rate on a documentation task is not a green light for a model to make diagnostic calls unsupervised. And model rankings from early 2026 will age fast, since the field ships new systems on a schedule that no benchmark keeps up with. The framework matters more than any single scoreboard it produced.
What MedHELM offers is a shared, extensible yardstick. New tasks can be added. New models can be dropped in. That turns a vague question ("Is this medical AI any good?") into a set of specific ones tied to specific jobs. For anyone weighing whether to put one of these systems in front of a patient or a physician, that shift from exam scores to task evidence is the point.