Machine Learning and the New Enzyme Thrift Shop
Machine learning changes the way we search for new biological functions, revealing that biology is bigger and weirder than we thought. Biological product developers are entering a new world where the best enzyme for a given product is likely to be unfamiliar or even completely uncharacterized.
Transcript
As a biologist, I learned to mentally organize the living world in terms of sequence similarity. There are good reasons why we do this. For starters, that's how evolution does it. Evolution, without which "nothing in biology makes sense" (Dobzhansky), is a process in which DNA sequences start off identical and gradually diverge over time. Because sequence and function move together, this creates a pattern where similar sequences code for similar functions - usually.
Sequence-based thinking is also built into many of the tools that we use to read and write biology. Before we could synthesize DNA, changes had to be introduced one edit at a time. So new functional sequences were never that different from their parent. You might end up with something that was a little better: faster kinetics, better specificity, but not radically new.
Or think about BLAST searching. When BLAST came out in the 1990s, it was such a powerful tool for searching DNA based on sequence similarity that there was rarely a need to use anything else. BLAST spoiled us. It made us soft. If you discovered an unknown DNA sequence and you wanted to know the function, you would simply BLAST it to find similar sequences. It was usually safe to assume that the hits would have similar functions. And if there were no BLAST hits? That function was probably going to be a mystery forever.
The upshot of these historical constraints is that a certain mental model took hold. We envision "enzyme space" as a cloud of similar sequences. Usually the center of this cloud is a single enzyme that happens to have functional data associated with it.
You can see this mental model at work in the many biological patents that claim a single amino acid sequence and everything that is at least 95% similar to it, or above some other threshold. The assumption is that the function, the thing that matters, can be well approximated by sequence, the thing that can be easily edited and searched.
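To make the threshold idea concrete, here is a minimal sketch of what a "95% similar" check amounts to. It assumes the two sequences have already been aligned to the same length (gaps written as `-`) and that "similar" means simple percent identity over the aligned columns. Real patent claims spell out their own alignment method and definition, so treat the function names, the toy sequences, and the gap handling here as illustrative assumptions, not as how any particular claim is evaluated.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption for this sketch: gaps are written as '-', columns where both
    sequences have a gap are ignored, and a gap opposite a residue counts
    as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    columns = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # skip columns that are gaps in both sequences
        columns += 1
        if x == y:
            matches += 1
    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / columns


def within_similarity_claim(candidate: str, claimed: str,
                            threshold: float = 95.0) -> bool:
    """True if the candidate is at least `threshold` percent identical."""
    return percent_identity(candidate, claimed) >= threshold


# Toy 20-residue sequences, made up for illustration only.
claimed = "MKTAYIAKQRQISFVKSHFS"
one_change = "MKTAYIAKQRQISFVKSHFA"   # 19 of 20 positions match
two_changes = "MKTAYIAKQRQISFVKSHAA"  # 18 of 20 positions match

print(percent_identity(one_change, claimed))          # 95.0
print(within_similarity_claim(one_change, claimed))   # True
print(within_similarity_claim(two_changes, claimed))  # False
```

Notice what the check never looks at: structure or function. A sequence can pass or fail this test regardless of what chemistry the enzyme actually performs.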
Anyway, machine learning blows this up completely. The sequence-based mental model needs to go, at least when it comes to engineering biology with a function in mind. It is structure, not sequence, that determines function. Now that ML-generated structural models are always accessible, structure needs to be the primitive data type around which we organize our biological thinking.
Sequence is still important, of course, but secondary. Enzymes with similar sequences can perform different chemistry. Enzymes with different sequences can perform the same chemistry. Scientifically speaking, this is not a particularly hot take. Biologists have known for a while that "convergent evolution in enzyme active sites is not a rare phenomenon" [1]. But it is a case of we-saw-but-we-did-not-see. We all knew structure mattered, but without the tools to work on structure directly we fell into the habit of putting sequence first.
So what does it mean to organize your mental model of enzyme engineering around structure, rather than sequence? I'm struggling to wrap my mind around it, just like everybody else. We're going to need practice and a lot of new metaphors. Here's my attempt: enzyme engineering is like shopping.
The old way of engineering enzymes was like shopping for designer clothes. You start with a name brand. The store is well organized but offers only a few different styles and sizes. You come out looking like everyone else.
The new way is like thrift shopping. There are tons of options just lying out in piles - who knows what you might find? Every new DNA sequencing project is a new box of options to sift through. You should expect to be surprised and you should expect to come away with something that nobody else has.
1. P. F. Gherardini, M. N. Wass, M. Helmer-Citterich, M. J. E. Sternberg, Convergent Evolution of Enzyme Active Sites Is not a Rare Phenomenon. Journal of Molecular Biology. 372, 817–845 (2007). PMID: 17681532

For a more technical take on the topic of sequence-based versus structure-based search tools, check out this commentary in Nature Communications:
https://www.nature.com/articles/s41467-023-44082-5