Cracking The Code of Life
The Evo 2 machine learning model enlists the power of AI in the fight against diseases

From tiny tree frogs to towering redwoods — to you and me — DNA drives all life on earth. Embedded in every cell in every organism, DNA acts as a kind of biological instruction manual, containing all the genetic information needed to make life.
That process begins with transcription: DNA makes a copy of part of its code to produce RNA, a type of molecule that can catalyze biological reactions that express the information embedded in the DNA. In those reactions, proteins are synthesized, and those proteins go on to build and run living cells. Altogether, this is known as the central dogma of molecular biology: DNA makes RNA, and RNA makes proteins.
A single strand of DNA can contain millions of pairs of nucleotides, the molecular building blocks that carry genetic information. And a single strand of RNA can contain tens of thousands of them. There are virtually countless ways nucleotides can coalesce to become life. And the combinatorial complexity is simply too much for a human mind to make sense of. But that’s where AI comes in.
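A back-of-the-envelope count gives a rough sense of that scale. The sketch below assumes each position in a sequence can independently hold any of four nucleotides, so a sequence of length n has 4^n possible spellings. The function names and the example lengths are illustrative choices, not anything taken from the Evo 2 work.

```python
import math

def count_sequences(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet_size ** length

def order_of_magnitude(length: int, alphabet_size: int = 4) -> int:
    """Approximate number of decimal digits in alphabet_size ** length."""
    return math.floor(length * math.log10(alphabet_size)) + 1

# Illustrative lengths: a short motif, a small gene fragment, a long stretch.
for n in (10, 100, 1_000_000):
    print(n, "nucleotides ->", order_of_magnitude(n), "digits of possibilities")
```

Under this simplified assumption, a sequence of only 100 nucleotides already has a count of possible spellings that is 61 digits long.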
“Machine learning can pull together higher order patterns from massive data sets,” says Patrick Hsu, assistant professor of bioengineering. “AI has already done this in natural language, vision and robotics. Now, we are doing this in biology.”
In February 2025, Hsu and his collaborators released a machine learning model trained on more than 9.3 trillion nucleotides. Hsu compares the model, called Evo 2, to a biological ChatGPT that can analyze genetic data at scale. It is already the largest AI model in biology, and one day, Evo 2 could engineer new biological tools and treatments.
“Right now, we have a lot of observational data,” he says. “We know of correlations between genes and disease, but we still don’t know much about causal relationships. Having something with the ability to predict cause and effect would be really powerful.”
This type of prediction is the near-term vision for Evo 2. Hsu gives the example of BRCA1 — a breast cancer gene. If a woman has a BRCA1 gene mutation, her lifetime risk of breast cancer increases dramatically. More than 60% of women with a BRCA1 gene mutation will develop breast cancer at some point in their lifetimes, compared to just 13% of women overall. Some BRCA1 mutations are known to be pathogenic, while others are known to be benign. But most mutations are variants of unknown significance — we just don’t know what they do.
“If you have a pathogenic mutation, you get a mastectomy. And if you have a benign mutation, you get an annual mammogram. But what do you do if you have a variant of unknown significance?” asks Hsu. “It turns out that Evo 2 has an opinion about this, and the model is state-of-the-art in classifying the pathogenicity of BRCA1 mutations. It achieved over 90% accuracy in predicting which mutations are benign over which are potentially pathogenic.”
Predicting biological properties
Evo 2 is a product of a Bay Area independent nonprofit called the Arc Institute, which Hsu co-founded with bioengineer and neuroscientist Silvana Konermann. The institute aims to accelerate scientific progress and deepen our understanding of the root causes of disease, and it brings together leading biomedical researchers from UC Berkeley, UCSF and Stanford.
The AI model builds on its predecessor Evo 1, which launched in 2024 and was trained entirely on single-celled organisms. Evo 2 takes it up several notches. The model was trained on a vast trove of biological information: more than 128,000 whole genomes and 9.3 trillion nucleotides from 100,000 species across the tree of life, including bacteria, plants and animals.
There are five base nucleotides that make up DNA and RNA: adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U). DNA contains A, C, G and T, while RNA contains A, C, G and U. Our genetic material is made from these nucleotides in countless different sequences, and Evo 2 uses this information to make probabilistic predictions about what is likeliest to come next within these sequences. The model uses principles similar to those that drive well-known large language models like OpenAI’s ChatGPT or Anthropic’s Claude. And to build this cutting-edge model, researchers collaborated with the industry-leading AI chipmaker NVIDIA.
“A machine learning model predicts the next token — a term for the fundamental unit of data that a model processes,” says Hsu. “ChatGPT predicts the next character and the next word. If you ask it to finish the sentence ‘to be or not to be’…there is a very high probability ‘that is the question’ will come next. Because Hamlet. But what comes next in a sequence of nucleotides is less clear. If I gave you a sequence like ‘G, T, G, C, A, T, C,’ would you predict the next one to be ‘C’ or ‘G’? You would have no idea, and I don’t either. But an AI model can capture complex biological properties based on sequence variation alone.”
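Next-token prediction on nucleotides can be illustrated with a far simpler stand-in than Evo 2. The sketch below is a count-based model that looks at the previous k nucleotides and estimates the probability of each possible next one. It is a toy under stated assumptions, not Evo 2's architecture: the training string, the context length k and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def train(sequence: str, k: int = 3) -> dict:
    """Count which nucleotide follows each k-length context in the sequence."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - k):
        context = sequence[i:i + k]
        counts[context][sequence[i + k]] += 1
    return counts

def next_token_probs(counts: dict, context: str, k: int = 3) -> dict:
    """Probability of each next nucleotide, with add-one smoothing."""
    seen = counts.get(context[-k:], Counter())
    total = sum(seen.values()) + len(ALPHABET)
    return {base: (seen[base] + 1) / total for base in ALPHABET}

# Hypothetical training data; a real model would see trillions of nucleotides.
toy_genome = "GTGCATCGTGCATCGTGCATGGTGCATC"
model = train(toy_genome)

# Ask the toy model what follows Hsu's example sequence.
probs = next_token_probs(model, "GTGCATC")
print(probs)
```

In this made-up data the toy model favors "G" after "GTGCATC", but that answer reflects nothing more than the handful of letters it was trained on. The point is only the mechanism: given enough sequence, a model can assign probabilities to what comes next.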
Evo 2 is a large language model for a language that is never spoken, only expressed in physical form — whether that expression is the growth of a cancerous tumor or the color of a baby’s eye. Evo 2 can process up to a million nucleotides at once, so it can pick out patterns in the data and identify relationships with other parts of a genome. That doesn’t just enable predictions about whether a gene mutation is likely to be pathogenic. It also makes it possible to predict therapeutics that could potentially treat a disease and provide insights into the biological mechanisms that cause it to progress. It could even help guide the direction biomedical research takes.
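One common way to turn a next-token model into a mutation scorer is to compare how probable the model finds the original sequence with how probable it finds the mutated one. The sketch below does this with a toy count-based model. Whether Evo 2's BRCA1 results were produced in exactly this way is not stated here, so the scoring rule, the sequences and every function name should be read as illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"
K = 2  # context length for the toy model (an arbitrary choice)

def train(sequences: list[str]) -> dict:
    """Count next-nucleotide occurrences for every K-length context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - K):
            counts[seq[i:i + K]][seq[i + K]] += 1
    return counts

def log_likelihood(counts: dict, seq: str) -> float:
    """Sum of log-probabilities of each nucleotide given its K predecessors."""
    total = 0.0
    for i in range(K, len(seq)):
        seen = counts.get(seq[i - K:i], Counter())
        # Add-one smoothing so unseen nucleotides never get probability zero.
        prob = (seen[seq[i]] + 1) / (sum(seen.values()) + len(ALPHABET))
        total += math.log(prob)
    return total

def variant_score(counts: dict, reference: str, position: int, alt: str) -> float:
    """Log-likelihood of the mutated sequence minus that of the reference.

    Negative scores mean the model finds the variant less plausible.
    """
    mutated = reference[:position] + alt + reference[position + 1:]
    return log_likelihood(counts, mutated) - log_likelihood(counts, reference)

# Hypothetical sequences standing in for real training genomes.
model = train(["ACGTACGTACGT", "ACGTACGTACGA", "TACGTACGTACG"])
reference = "ACGTACGT"
score = variant_score(model, reference, position=3, alt="C")
print(score)
```

Here the substitution breaks a pattern the toy model has seen repeatedly, so the score comes out negative. A real pipeline would rely on a trained model and on variants with known clinical labels to decide how such scores map to "benign" or "pathogenic".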
“Researchers are already able to generate bigger data sets than ever before — and do bigger experiments — but it is not clear this has led to more insights than ever before,” says Hsu. “Even the biggest data sets are very small relative to the complexity of biology. That’s where machine learning models come in. We can take large biological data sets and train the models to find higher order patterns in the data that are more complex than we could even imagine.”

‘Efficiency really matters’
For the most part, the science of biology developed through the process of trial and error. A researcher formulates a hypothesis, tests it in a scientific experiment and analyzes the results. Then, the researcher moves on to the next hypothesis. And so on and so forth.
The approach has yielded results: humans are living longer than ever before. But it is time-consuming. Clinical trials for new medical treatments take years to conduct, and the overwhelming majority of new treatments never make it to market. Hsu compares the process to a hike in California’s mountains.
“Being a biomedical researcher can feel like walking in the wilderness,” Hsu says. “You see a peak in the distance, and you walk toward it. Then, three hours into the walk, you realize you haven’t gotten much closer. And you need to make a decision about whether you’re walking in the right direction at all.”
In biology, experiments have tended to unfold at the time scale of life — in days, and weeks, and months, and years. And if you are headed in the wrong direction, you could be off course for quite some time.
“Efficiency really matters. You can spend years working on the wrong thing, and just be out of luck,” he says. “We have gone really far in biology with something close to guess and check.”
One of the main aims of the Evo 2 researchers is to use AI to accelerate the development of discoveries into actual therapies. The concept has roots in the COVID-19 pandemic, which saw mRNA vaccines deployed widely and rapidly.
“That breakthrough was 60 years in the making,” says Howard Chang, senior vice president of global research at the biotechnology firm Amgen and former Arc Institute researcher. “Messenger RNA was discovered as a fundamental biological entity back in 1961. It shouldn’t have taken so long.”
According to Chang, Evo 2 can already do things that should help speed the process. It is able to accurately predict which RNA genes are essential to cell function and which ones are dispensable. It can tell you which genes are involved in controlling cell behavior that leads to diseases. This can put researchers on the right path earlier on.
“If you track individual families prone to a particular disease, there are a lot of inherited differences that map to places on the genome where changes in information could be causing the disease, but we’re not sure what they are. Evo 2 allows us to pinpoint that,” Chang says.
“If Evo 2 can tell us that a disease occurs because a protein is too active, we know what the problem is, and we can try to make a drug that addresses it. These are the kind of possibilities you have with Evo 2,” he adds. “It is a new kind of oracle.”
Hsu argues this type of advancement will be especially transformative in molecular biology, where research can take many years to complete and the overwhelming majority of clinical trials fail.
“The clinical trial failure rate is 90%. So, a lot of the time, we are just working on the wrong drug target,” Hsu says. “AI can help us find the right target much more effectively.”
Toward a healthier future
For Hsu, the pursuit of cures for complex diseases is a deeply personal endeavor. When Hsu was a pre-teen, his grandfather was diagnosed with Alzheimer’s disease. His grandfather lived with the family, and Hsu witnessed his decline firsthand. Slowly, Hsu came to realize that there was no coming back. The neurodegenerative condition is incurable and ultimately fatal.
The experience was formative. As a teenager, Hsu worked in university neuroscience labs at Stanford. He researched Alzheimer’s during his graduate studies at Harvard, and the disease remains a focus of his work at Berkeley and the Arc Institute.
“If you look at a list of the top five killers in the United States from 30 years ago, you will see they are the same as they are today: heart disease, cancer, Alzheimer’s,” says Hsu. “This is a pretty dire situation. It implies that despite more and more biomedical research being done, and more and more money being spent, we are not making more and more progress at curing these diseases.”
AI is essential to improving things, Hsu argues. The complexity of biology is simply too much for the human mind to fully grapple with — and analyzing vast quantities of data is exactly what AI is great at. Hsu envisions a future where AI makes biomolecular research more efficient and enables treatments tailored to a patient’s likely health outcomes.
“We don’t just want to understand the effects of specific genetic mutations and whether they are pathways to disease,” Hsu says. “We want to use Evo 2 to conduct genome-wide association studies that sequence both healthy people and unhealthy people to determine which genetic mutations are associated with a disease and tell you something more specific about your own risk. We want to better understand genetic combinations and integrate this with your own health record and genome to make more accurate predictions about your health. And hopefully sooner, rather than later.”