Stamps, Fishing, and the New Alpha Nerds
Biology is like stamp collecting. But maybe stamp collecting has never been hotter? Today we’re talking about the power of data and which kind of scientist is the coolest.
Transcript
There’s a famous quote attributed to the physicist Ernest Rutherford: “All science is either physics or stamp collecting.” Now, as a biologist, I try not to take it personally. But I don’t think this is meant as a compliment to non-physicists or to stamp collectors.
Honestly, I’m not even mad. If you’ve studied biology, you probably get what he meant. In physics they have these big beautiful equations that provide unifying explanations: the ideal gas law, the Stokes-Einstein equation, the Maxwell–Boltzmann distribution.
Biology, in comparison, can feel like a bunch of observations that are collected but not really unified. The names of species. The parts of a frog. The kinases of the Ras-Raf-MEK-ERK pathway. Hell, I’ve been doing biotech for 15 years and that one still gives me nightmares.
What can I say? We're scientists. We take what nature gives us and we do our best. Maybe the complexity and diversity of biology just can’t be unified with simple math. It is what it is.
But of course, scientists can also get competitive. We want to know what kind of science is the best kind. Physics, I've got to admit, has had an awesome two or three centuries. Unifying theories expressed as closed-form mathematical expressions have been fucking crushing it.
So naturally, physics is the alpha nerd and we humble biologists look on with a kind of equation envy. This isn’t a bad thing. The drive to bring physics-like approaches to biology has been productive. We can write equations too.
Fields like systems biology, synthetic biology, and computational biology benefit from formalism, rigor, and a love of math. Engineering biology goes beyond mere stamp collecting.
What I’m on my soapbox about today is the “mere” part. Because maybe stamp collecting isn't so “mere” after all. Maybe it’s time to revisit who is really the alpha nerd around here in the era of big data and AI.
You see, when I look at a book of stamps, I see a dataset. A comprehensive, unbiased, structured, labeled dataset. I see exactly the kind of resource I need to train an AI model. If those stamps are blocks of text or digital images, you might use them to build the next big thing in tech.
And if those stamps are pieces of biological data, you might have the essential resource for taking on the hardest problems in biotech.
The key difference is scale. That’s what changes a book of stamps from a novelty in the curio cabinet of some 19th-century naturalist into a cutting-edge tool for gene therapy, biologics, or RNA therapeutics. How many stamps can you generate? At what price? Are they structured, labeled, interoperable, and relevant to the problem at hand? This is what biotech needs and what I think Ginkgo can deliver.
And as AI transforms biotech, I’m noticing an interesting mentality shift around the power of data and the social status of scientists who know where to get it. Here’s what I’m talking about.
Let’s say you were a physicist working in the 20th century. You were trying to solve a hard problem. Your approach probably involved making very precise quantitative observations and then structuring them into a mathematical equation. That approach doesn’t work for all problems. But you would have seen it drive breakthrough after breakthrough and it gave you confidence it was the smartest possible strategy.
The biologists who are winning at AI have developed a similar kind of confidence. The AI dataset belongs at the center of the R&D program. Not because it is guaranteed to work, but because it is the smartest possible strategy. In contrast, I see a lot of biologists trained in the older data strategies who hesitate. They’ll say, “Yes, I love the whole AI thing, but do I really want to go on a fishing expedition?”
The idea that building a large dataset for biotech is a “fishing expedition” has been around since at least the 1980s. That is when early lab automation first made it possible to generate data hundreds or thousands of samples at a time. High-throughput screening for drug discovery is probably the best-known example. Let’s test 10,000 drugs and see if one of them hits our target.
The data from a high-throughput screen, in those days, was fundamentally seen as a kind of stamp collecting. Each sample in the screen was an isolated, ununified observation. The goal was to produce one hit, one drug or one DNA mutation that was effective, out of the many hundreds that were not. If your screen didn’t produce a hit, the rest of the data had very little value.
It was risky and expensive, like a lot of biotech R&D. But it was also intellectually worse. Going on a fishing expedition made you the lesser nerd. All the cool scientists had pathway models and mechanisms of action. They probably didn’t have equations, like the super cool physicists had, but they were hypothesis-driven. You were just doing trial and error, hoping one of your little stamps turned out to be worth something.
My friends, those days are over. The term “fishing expedition” should be retired. In the AI era, there is no reason why anyone should be building a large dataset just for the purpose of getting one hit. A well-built AI model is a predictive and generative resource, as useful as a mathematical expression and as cool.
The power of an AI model is an emergent property of the training data as a whole. The art of building an AI dataset is as sophisticated as anything scientists do.
That’s why, in the AI era, the term “stamp collecting” deserves to take on a new meaning. If Ernest Rutherford were alive today, he’d be a stamp collector. Data generation at scale is the smartest strategy to take on the hardest problems in biotech R&D, and that’s kind of awesome.
We biologists should be proud of what we can do with biological data. We’re the alpha nerds now. The future belongs to the stamp collectors.