Biomedical Tools & Diagnostics

An AI That Shows Its Work Tackles the Rare-Disease Diagnostic Odyssey

A multi-agent AI system called DeepRare ranks likely rare-disease diagnoses and links each guess to verifiable medical evidence. Across 2,919 diseases it topped existing tools, and specialists signed off on more than 95 percent of its reasoning.

Abel Chen · March 12, 2026 · 4 min

A child with an undiagnosed genetic condition can spend more than five years bouncing between specialists. Blood draws, referrals, scans, second opinions. Families call it the diagnostic odyssey, and for the roughly 300 million people worldwide living with a rare disease it is often the hardest part. A team based mostly at Shanghai Jiao Tong University wants to shorten that journey with software that behaves less like a search engine and more like a careful clinician who explains every step.

Their system, described in Nature and named DeepRare, is built around large language models but does not rely on a single model guessing an answer. Instead it coordinates more than 40 specialized tools and current medical knowledge sources, then produces a ranked list of candidate diagnoses. Each hypothesis comes attached to a chain of reasoning that points back to specific, checkable medical evidence. That last part matters. A diagnosis you cannot audit is hard to trust, and clinicians have good reason to be wary of a black box.

Feeding it what doctors actually have

Real patient records are messy. DeepRare was designed to swallow that mess. It accepts free-text descriptions written in plain clinical language, structured Human Phenotype Ontology terms (a standardized vocabulary for symptoms), and results from genetic testing. Combining those inputs is closer to how a physician works than the approach of most diagnostic tools, which tend to demand tidy, pre-formatted data.

The evaluation was unusually broad. The authors tested the system on nine datasets drawn from published literature, case reports, and clinical centers across Asia, North America, and Europe, covering 14 medical specialties and 2,919 diseases. On tasks that used phenotype terms, DeepRare placed the correct diagnosis at the top of its list 57.18 percent of the time. That figure, called Recall@1, beat the next-best method by 23.79 percentage points. When genetic and clinical data were combined in multi-modal tests, it reached 69.1 percent on 168 cases, compared with 55.9 percent for Exomiser, a widely used gene-prioritization tool.
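For readers who want to see what a figure like Recall@1 measures, here is a minimal sketch in Python. It is not the authors' evaluation code. The function name recall_at_k and the four toy cases are invented for illustration, and the disease names are placeholders rather than data from the study. The last lines simply work out the runner-up's score implied by the two numbers quoted above.

```python
from typing import Sequence


def recall_at_k(
    ranked_lists: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of cases whose true diagnosis appears in the top k candidates.

    Hypothetical helper for illustration; not taken from the DeepRare paper.
    """
    if len(ranked_lists) != len(truths):
        raise ValueError("need exactly one confirmed diagnosis per ranked list")
    if not truths:
        raise ValueError("need at least one case")
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k]
    )
    return hits / len(truths)


# Toy data (invented): each inner list is a ranked differential for one patient.
ranked = [
    ["Marfan syndrome", "Loeys-Dietz syndrome"],  # correct at rank 1
    ["Pompe disease", "Fabry disease"],           # correct at rank 2
    ["Gaucher disease", "Niemann-Pick disease"],  # correct answer missing
    ["Rett syndrome", "Angelman syndrome"],       # correct at rank 1
]
truths = ["Marfan syndrome", "Fabry disease", "Wilson disease", "Rett syndrome"]

recall_1 = recall_at_k(ranked, truths, k=1)  # 2 of 4 cases
recall_2 = recall_at_k(ranked, truths, k=2)  # 3 of 4 cases

# Arithmetic on the article's figures: a 23.79-point lead over the next-best
# method implies that method scored about 33.39 percent at Recall@1.
runner_up_recall_1 = round(57.18 - 23.79, 2)

print(recall_1, recall_2, runner_up_recall_1)
```

On the toy data, widening the window from the top guess to the top two raises the score from 0.5 to 0.75, which is why a ranked list can be useful to a clinician even when the first entry is wrong.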

Reasoning that experts could check

Numbers on a benchmark only go so far. The more interesting test was whether human specialists agreed with how the system got to its answers. When experts reviewed DeepRare's reasoning chains, they agreed with them 95.4 percent of the time. That high agreement is the point of the whole design. Because every conclusion links to verifiable evidence, a doctor can follow the logic rather than take it on faith, and can spot where the machine went wrong when it does.

This is a shift in how AI diagnostic tools are pitched. For years the selling point was raw accuracy. Here the selling point is traceability, the idea that a system worth using in a clinic has to justify itself in terms a specialist can evaluate and, if needed, overrule.

What the study does not settle

DeepRare is described by its authors as a decision-support system, not a replacement for a physician, and the distinction is worth holding onto. Recall@1 near 57 percent means the top guess is still wrong in roughly four of every ten cases on the phenotype-only task, though the correct answer frequently appears lower in the ranked list. The evaluation ran on curated datasets and case reports, which are cleaner and better documented than a chaotic first visit to an emergency department. Expert agreement on reasoning is reassuring, but agreement is not the same as an independent trial showing that patients actually get diagnosed faster, and no such trial is in this paper. The work shows the approach is promising and measurable, not that it is ready to run unsupervised.

Still, the framing is a useful correction. A tool that reaches a defensible answer and can walk a clinician through why is a different kind of object than one that simply spits out a label. For families years into an odyssey with no name for what is wrong, cutting even a fraction of that time would be worth a great deal. Whether DeepRare delivers that in practice is the next thing to find out.
