Biomedical Tools & Diagnostics

A CT-Reading AI Trained on 6 Million Slices Learns Without Anyone Labeling Them

Researchers built Merlin, a 3D AI model that reads whole abdominal CT scans by learning from radiology reports and health records instead of hand-drawn labels. It was tested on more than 44,000 external scans drawn from independent sites and public datasets.

Abel Chen · March 4, 2026 · 4 min

An abdominal CT scan is not one picture. It is a stack of hundreds of cross-sectional slices, a full 3D volume of a person's insides, and reading it well takes years of training. There are not enough radiologists to keep up with how many scans hospitals now order. That gap is the problem a team spanning Stanford and several medical centers set out to attack, and their answer is a model they call Merlin.

Merlin is a vision-language model, meaning it learns to connect images with words. What sets it apart from most medical AI is that it works on the whole 3D scan rather than a single flattened slice, and it learns from the messy text clinicians already produce. No one had to sit down and manually outline tumors or organs to train it. The model read volumetric CT scans alongside the matching radiology reports and electronic health record data, and it figured out the associations on its own.

Learning from paperwork nobody wanted to relabel

The training set was large. Merlin ingested more than 6 million images drawn from 15,331 CT scans, which works out to roughly 390 slices per scan. Those scans were paired with over 1.8 million diagnosis codes from health records and with radiology reports totaling more than 6 million tokens of text. The point of this pairing is that a report already describes what a radiologist saw. If the model can line up the pixels of a scan with the sentence that describes them, it learns clinically meaningful patterns without a separate, expensive annotation project.
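To make "lining up" a scan with its report concrete, here is a minimal sketch of a symmetric contrastive loss, the objective popularized by CLIP-style vision-language models. The article does not say which objective Merlin uses, so treat this as an illustration of the general idea and not as the authors' method. The function names, the temperature value, and the toy three-number embeddings are all assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(scan_embs, report_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    Assumes scan_embs[i] and report_embs[i] come from the same study.
    The temperature is an illustrative default, not a value from the paper.
    """
    scans = [normalize(v) for v in scan_embs]
    reports = [normalize(v) for v in report_embs]
    n = len(scans)
    # sim[i][j]: temperature-scaled cosine similarity of scan i and report j
    sim = [[sum(a * b for a, b in zip(s, r)) / temperature for r in reports]
           for s in scans]
    # Each scan should pick out its own report ...
    scan_to_report = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # ... and each report should pick out its own scan.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    report_to_scan = sum(cross_entropy_row(sim_t[j], j) for j in range(n)) / n
    return (scan_to_report + report_to_scan) / 2


# Toy embeddings, invented purely for illustration.
aligned_scans = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_reports = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_reports = [aligned_reports[1], aligned_reports[2], aligned_reports[0]]

loss_aligned = contrastive_loss(aligned_scans, aligned_reports)
loss_shuffled = contrastive_loss(aligned_scans, shuffled_reports)
```

The loss is low when each scan's embedding sits closest to its own report and high when the pairing is scrambled. Training pushes a model toward the first situation, which is how report text can stand in for hand-drawn labels.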

That design choice matters beyond convenience. Hand-labeling medical images is slow, and it is one of the main bottlenecks holding back medical imaging AI. By leaning on reports and codes that hospitals generate anyway, Merlin's multistage pretraining sidesteps a separate annotation step entirely.

Tested on tens of thousands of outside scans

The authors did not just check whether Merlin could recite its training data. They ran it across 6 task types and 752 individual tasks. Some were what they call off-the-shelf, where the model gets no extra tuning: spotting 30 different findings it had never been explicitly told to look for, sorting scans into 692 phenotype categories, and matching images to the right report text. Other tasks required adapting the model, including predicting six chronic diseases up to five years out, writing draft radiology reports, and segmenting 20 organs in 3D.
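For readers wondering how a model can flag a finding with no extra tuning, one widely used recipe in vision-language models is to compare a scan's embedding against text prompts and pick the closest. The sketch below shows that recipe along with simple image-to-report retrieval. It is a hedged illustration, not Merlin's actual interface: the function names, the idea of a "present" and an "absent" prompt, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_finding(scan_emb, present_emb, absent_emb):
    """True when the scan is closer to the 'present' prompt than the 'absent' one."""
    return cosine(scan_emb, present_emb) > cosine(scan_emb, absent_emb)


def retrieve_report(scan_emb, report_embs):
    """Index of the report embedding most similar to the scan embedding."""
    return max(range(len(report_embs)),
               key=lambda i: cosine(scan_emb, report_embs[i]))


# Toy embeddings, invented for illustration. In practice they would come
# from the model's image and text encoders.
scan = [0.8, 0.2, 0.1]
present_prompt = [0.9, 0.1, 0.0]   # e.g. text saying the finding is present
absent_prompt = [0.0, 0.2, 0.9]    # e.g. text saying the finding is absent
reports = [[0.0, 1.0, 0.0], [0.7, 0.3, 0.1], [0.1, 0.0, 1.0]]

has_finding = zero_shot_finding(scan, present_prompt, absent_prompt)
best_report = retrieve_report(scan, reports)
```

Nothing here is trained for a specific finding. The decision falls out of where the scan lands relative to the text, which is why such tasks can be run off the shelf.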

Generalization is where medical AI usually stumbles. A model that shines on the hospital where it was born often falls apart somewhere else. Merlin was tested internally on 5,137 scans and then externally on 44,098 scans pulled from three independent sites and two public datasets. Across institutions and anatomies it held up, and it outperformed 2D vision-language models, other CT foundation models, and off-the-shelf radiology tools.

What the numbers do and do not settle

A benchmark win is not a clinical deployment. These evaluations measure performance against reference labels and existing reports, not what happens when a tired physician relies on the model at 2 a.m. The paper frames Merlin as a way to assist radiologists and lighten their load, not to replace the read, and that framing is worth taking seriously. Prognostic claims like five-year disease prediction also carry the usual caveat that a strong statistical signal in retrospective data can behave differently going forward.

One thing separates this work from a lot of AI announcements. The team released the trained models, the code, and a dataset of 25,494 paired abdominal CT scans and radiology reports. Open weights and open data let other groups probe where the model fails, which is the only honest way to find out whether a tool this broad earns a place in the reading room. The authors also worked out scaling laws for how performance grows with more data, a hint at how much further this approach can be pushed.
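Scaling laws are often summarized as a power law: error falls as a fixed power of the amount of training data, which shows up as a straight line on a log-log plot. The article does not give the form or the coefficients the authors found, so the sketch below assumes a power law and fits it to synthetic points generated from made-up constants. None of the numbers are results from the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**(-b) via a line in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic data from an invented power law, NOT measurements from the paper.
true_a, true_b = 2.0, 0.3
sizes = [1_000, 2_000, 4_000, 8_000, 16_000]
errors = [true_a * s ** (-true_b) for s in sizes]

a, b = fit_power_law(sizes, errors)
# Extrapolate to a dataset twice as large as the biggest one observed.
predicted_error = a * 32_000 ** (-b)
```

Once the exponent is known, doubling the data multiplies the error by a fixed factor of 2 to the power of minus b. That is the sense in which a scaling law hints at how far more data could push the approach, with the usual caution that extrapolations beyond the observed range may not hold.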

Whether Merlin ends up in clinics or mostly seeds the next round of research, it is a concrete example of a shift already underway: building general-purpose medical models from the records hospitals produce every day, rather than from datasets assembled by hand.
