BioDaily

A Protein Search Engine That Understands Plain English

Ask a biologist what a mystery protein does, and the honest answer is often a shrug. Sequencing has raced ahead of understanding. Databases now hold hundreds of millions of protein sequences, and a large fraction of them have no confidently assigned function. The usual move is to search for a lookalike: find a known protein with a similar sequence or a similar fold, then assume the unknown one behaves the same way. That works when a close relative exists. It fails when it does not.

A tool called ProTrek, described in Nature Biotechnology on October 2, 2025, takes a different route. Instead of matching proteins only to other proteins, it learns to line up three separate descriptions of the same molecule at once: its amino acid sequence, its three-dimensional structure, and a plain-English sentence about what it does. Once those three views live in a shared mathematical space, you can search from any one to any other. Type a sentence, get back candidate proteins. Feed in a structure, get back a written guess at its job.

Teaching a model that shape and meaning belong together

The team, led by Jin Su and colleagues at Westlake University, built ProTrek using contrastive learning. That is the same broad idea behind image models that pair photos with captions. Show the system millions of examples where a sequence, a structure, and a functional description all belong together, and it gradually learns to place matching trios near each other while pushing mismatches apart. Sequence, shape, and meaning stop being separate files and become points you can measure distances between.
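
For readers who want to see the idea in miniature, here is a rough sketch of a contrastive objective over three views of the same protein. It is an illustration under stated assumptions, not the authors' training code: the InfoNCE-style loss, the temperature value, the choice to sum over every pair of views, and all function names are hypothetical stand-ins, and the vectors are toy numbers rather than real embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Average InfoNCE-style loss: row i of `anchors` should match row i of `targets`.

    Every other row in the batch serves as a mismatch to push away.
    """
    anchors = [normalize(a) for a in anchors]
    targets = [normalize(t) for t in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def trimodal_loss(seq, struct, text, temperature=0.07):
    """Sum the loss over every ordered pair of views (an assumed recipe)."""
    views = [seq, struct, text]
    return sum(
        info_nce(a, b, temperature)
        for i, a in enumerate(views)
        for j, b in enumerate(views)
        if i != j
    )


# Toy batch of two proteins with 2-dimensional embeddings per view.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = trimodal_loss(aligned, aligned, aligned)
loss_mismatched = trimodal_loss(aligned, aligned, shuffled)
```

When the three views of each protein agree, `loss_matched` is close to zero. When the text view is paired with the wrong proteins, `loss_mismatched` is much larger. Training nudges the encoders toward the first situation.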

The point of all this is speed and reach. The authors report that ProTrek outperforms established alignment tools, including Foldseek and MMseqs2, on both speed and accuracy when the task is finding functionally related proteins. Those older tools are workhorses, but each leans on one kind of resemblance: MMseqs2 compares sequences, and Foldseek compares structures. ProTrek can connect proteins that share a function without sharing an obvious family resemblance, which is exactly the case where traditional search runs out of road.

Five billion proteins, already fingerprinted

The scale is the other headline. The public ProTrek server holds precomputed embeddings for more than five billion proteins. Embeddings are the compact numerical fingerprints the model assigns to each entry. Because the heavy computation is already done, a query does not have to grind through raw data every time. The authors say the server can process and analyze these large repositories efficiently, and they backed the approach with both computational tests and wet-lab experiments rather than benchmarks alone.
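
To make the fingerprint idea concrete, here is a minimal sketch of a lookup against precomputed embeddings. Everything in it is assumed for illustration: the protein identifiers, the three-dimensional vectors, and the brute-force cosine ranking are invented, and a server covering billions of entries would presumably rely on an approximate nearest-neighbor index rather than a full scan.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def top_k(query, index, k=2):
    """Return the k (score, protein_id) pairs most similar to the query."""
    scored = ((cosine(query, emb), pid) for pid, emb in index.items())
    return heapq.nlargest(k, scored)


# Hypothetical precomputed index: made-up IDs mapped to made-up embeddings.
index = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.0, 1.0, 0.0],
    "protein_C": [0.7, 0.7, 0.1],
}

# Stand-in for the embedding of a query, whether a sentence, sequence, or structure.
query = [1.0, 0.0, 0.0]
hits = top_k(query, index, k=2)
ranked_ids = [pid for _, pid in hits]
```

Because the stored vectors are computed ahead of time, the only work at query time is embedding the query once and comparing it against the index. The output is a ranked list, which fits the article's point that hits are suggestions rather than verdicts.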

Think about what a natural-language search unlocks. A researcher can describe the activity they are hunting for, an enzyme that cleaves a particular bond, say, and pull candidates straight out of billions of sequences, including ones no human has ever annotated. The description becomes the query. The protein is the answer.

Where the shortcuts run out

There are limits worth stating plainly. A model that learns from existing annotations inherits their blind spots. Whole swaths of protein space remain poorly described, and a tool trained on what we already know will be least reliable exactly where biology is strangest and least documented. Matches are ranked suggestions, not verdicts. A high-scoring hit still points to a hypothesis that has to be checked at the bench, which is why the paper pairs its computational claims with experimental validation. Speed and coverage do not remove the burden of proof.

Still, the framing feels useful. Function has always been the hard part of protein science, harder than reading a sequence or even solving a structure. By treating a sentence about what a protein does as a first-class search key, sitting alongside its sequence and its fold, ProTrek turns a question biologists usually answer with a shrug into one they can at least aim a query at. Whether the top hits pan out is a separate story, and it is the story that happens in the lab.
