Variations within our DNA sequence are what make each of us biologically unique, influencing everything from our appearance to our susceptibility to disease. For decades, one of biology’s greatest challenges has been to move from simply cataloging these genetic variations to predicting how they affect the function of our cells and tissues.
Recent advances in artificial intelligence have paved the way for forecasting the consequences of genetic variations, or variants. But because the effects of a variant can differ depending on the type of tissue it affects, the other genes and variants carried by an individual, and a person’s ancestry, these approaches have been limited in their ability to predict the impact variants have in each individual’s genome. Predicting these impacts can also be especially challenging for low-frequency variants, for which there is far less published data.
Developed by Biohub researchers, VariantFormer is the first sequence-based AI model to directly translate personal genetic variations into tissue-specific activity patterns at scale. Using an end-to-end approach, it predicts gene expression profiles directly from a person’s DNA sequence, integrating genetic information from both regulatory regions (such as promoters and enhancers) and transcribed gene sequences; in doing so, it also implicitly captures the genetic influence of the epigenome on gene expression. This offers a powerful new method for exploring how someone’s distinctive genetic makeup affects their health.
The model shows “what could be” based on genetics, but doesn’t account for lifestyle, environment, or other clinical factors essential for a complete understanding of an individual’s health. Consistent with Biohub’s commitment to responsible AI, VariantFormer and its outputs are designed to advance scientific research; they are not intended for clinical or diagnostic decision-making.
VariantFormer is an important step toward meeting our grand challenge of building an AI-based virtual cell model, providing a new platform that captures the complex genetic interactions that drive cells’ behavior. On a larger scale, it is a first step on a long journey toward using predictive models to study, treat and prevent even the most elusive genetic drivers of disease.
How VariantFormer works
VariantFormer interprets an individual’s genome to predict how their genetic makeup alters gene expression across the different tissues of the body, and it can also learn the functional effects of genomic variants across ancestral backgrounds. This makes it a powerful tool for understanding how variants affect individuals from populations that are underrepresented in genetics datasets.
Moving beyond reference genomes, VariantFormer takes an individual’s unique genetic variants as input and predicts the expression of genes in different tissues. It achieves this by focusing on genes and their surrounding regulatory elements, such as promoters and enhancers, and uses a 1.2-billion-parameter transformer with a two-stage hierarchical architecture to learn the complex grammar of gene regulation.
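To make the two-stage design concrete, here is a minimal PyTorch sketch of the general idea: a first stage encodes each sequence element (the gene plus its nearby promoters and enhancers, with the individual’s variants applied) into an embedding, and a second stage lets those element embeddings attend to one another before predicting expression per tissue. All class names, dimensions, and layer counts below are illustrative assumptions, not VariantFormer’s actual implementation.

```python
# Illustrative sketch of a two-stage hierarchical transformer for expression
# prediction. Names, sizes, and layer counts are assumptions for illustration.
import torch
import torch.nn as nn

class ElementEncoder(nn.Module):
    """Stage 1: encode each DNA element (gene body, promoter, enhancer) locally."""
    def __init__(self, d_model=256, n_layers=4, n_heads=8, vocab_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # A, C, G, T, N tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):           # tokens: (n_elements, seq_len)
        h = self.encoder(self.embed(tokens))
        return h.mean(dim=1)              # one summary vector per element

class RegulatoryContextModel(nn.Module):
    """Stage 2: let elements attend to one another, then predict expression per tissue."""
    def __init__(self, d_model=256, n_layers=4, n_heads=8, n_tissues=54):
        super().__init__()
        self.element_encoder = ElementEncoder(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_element = nn.TransformerEncoder(layer, n_layers)
        self.tissue_head = nn.Linear(d_model, n_tissues)

    def forward(self, element_tokens):   # personal sequences with variants applied
        elems = self.element_encoder(element_tokens).unsqueeze(0)
        ctx = self.cross_element(elems)   # model enhancer-promoter-gene interactions
        gene_repr = ctx[:, 0]             # assume element 0 is the gene/promoter
        return self.tissue_head(gene_repr)

# Example: 8 elements (gene + 7 regulatory regions), 512 bp each, random tokens
model = RegulatoryContextModel()
tokens = torch.randint(0, 5, (8, 512))
expression = model(tokens)                # shape (1, 54): one prediction per tissue
```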
To capture gene expression in the novel context of an individual’s genome, across different tissues and ancestries, VariantFormer was trained on data sourced from large-scale public and private consortia, including ENCODE, GTEx, ADNI and MAGE. This collection of datasets provides paired whole-genome sequencing and gene expression data across thousands of individuals, representing different tissues and ancestries. To our knowledge, this is the largest curated collection of paired whole-genome sequencing and bulk RNA-seq samples to date, spanning 21,000 samples from 2,330 donors across 54 tissues and seven cell lines.
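As an illustration of what “paired” means here, the sketch below assembles donor-level training records by joining a genome manifest with an expression manifest. The table contents and column names are hypothetical and do not reflect the actual curation pipeline.

```python
# Hypothetical manifests: one row per donor genome, one row per RNA-seq sample.
import pandas as pd

wgs = pd.DataFrame({"donor_id": ["D1", "D2", "D3"],
                    "vcf_path": ["D1.vcf.gz", "D2.vcf.gz", "D3.vcf.gz"]})
rna = pd.DataFrame({"donor_id": ["D1", "D1", "D2", "D3"],
                    "tissue":   ["liver", "frontal_cortex", "liver", "heart"],
                    "expr_path": ["D1_liver.tsv", "D1_fc.tsv", "D2_liver.tsv", "D3_heart.tsv"]})

# One training example per (donor, tissue): the donor's variants paired with
# that tissue's measured expression profile.
paired = rna.merge(wgs, on="donor_id", how="inner")
print(f"{len(paired)} paired samples from {paired.donor_id.nunique()} donors "
      f"across {paired.tissue.nunique()} tissues")
```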
By training directly on thousands of paired individual genomes and expression profiles, VariantFormer goes beyond single-variant analysis to learn the complex effects of combinations of variants. VariantFormer also learns how a variant’s effect will change depending on the tissue context and against different genetic backgrounds. This provides a more robust and generalizable understanding of a variant’s impact.
Notably, VariantFormer excels at predicting expression of non-protein-coding genes — including long non-coding RNAs and other regulatory RNAs that represent over 60% of annotated genes. These genes are notoriously difficult to predict because they exhibit highly tissue-specific and context-dependent expression patterns. VariantFormer’s 7-16% improvement over existing methods in this challenging regime demonstrates its ability to capture subtle regulatory signals that other approaches miss.
Furthermore, the model can generate rich “genetic embeddings”: numerical fingerprints of a gene’s activity in an individual that can be used to power downstream prediction tasks for complex diseases. While previous models act more like a paper map, providing a static view of streets and landmarks, VariantFormer adds the detail of GPS, accounting for the traffic patterns and road closures that make each trip unique.
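As a hedged illustration of how such embeddings might feed a downstream task, the snippet below treats per-individual embeddings as features for a generic case/control classifier. The arrays are random placeholders, and the classifier is a stand-in rather than Biohub’s disease-modeling workflow.

```python
# Using per-individual gene embeddings as features for a downstream disease model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))   # placeholder for model-derived embeddings
labels = rng.integers(0, 2, size=200)      # placeholder case/control labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, embeddings, labels, cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean())
```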
This approach enables researchers to generate and test hypotheses from a systems perspective, rather than creating isolated predictions that fail to consider the underlying and specific biology of an individual patient. Using VariantFormer, a scientist can take the DNA from a patient and create a “digital twin” of their gene expression profiles, even for tissues from organs that are difficult to biopsy, like the brain or heart.
By combining VariantFormer’s predictions with models that link gene expression to disease, we can now also create tissue-specific disease risk scores. We are utilizing this workflow as an extra layer on top of VariantFormer to predict disease risk by analyzing how a person’s unique genetic variants affect gene expression patterns across multiple genes and tissues.
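Conceptually, a tissue-specific risk score of this kind can be thought of as a weighted combination of predicted expression values, with weights coming from a separate expression-to-disease model. The sketch below shows that arithmetic with made-up numbers; the genes, weights, and expression values are purely illustrative.

```python
# Tissue-specific risk score as a weighted sum of predicted expression.
import numpy as np

genes = ["APOE", "TOMM40", "CLU"]
tissues = ["frontal_cortex", "cerebellum", "liver"]
predicted_expr = np.array([[2.1, 1.4, 0.3],   # predicted expression: genes (rows) x tissues (cols)
                           [1.8, 1.1, 0.2],
                           [0.9, 0.7, 0.8]])
disease_weights = np.array([0.9, 0.6, 0.2])   # per-gene effect sizes from a disease model (illustrative)

risk_by_tissue = disease_weights @ predicted_expr   # weighted sum over genes, one score per tissue
for tissue, risk in zip(tissues, risk_by_tissue):
    print(f"{tissue}: risk score {risk:.2f}")
```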
This capability has been validated in Alzheimer’s disease research, where VariantFormer was used to analyze patient genomes from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The model successfully prioritized disease-relevant genes and tissue contexts: the strongest signals emerged from known risk genes such as APOE and TOMM40, and the model pinpointed specific brain regions, including the frontal cortex, cerebellar hemispheres, and anterior cingulate cortex.
In a particularly striking validation, researchers performed “in silico gene editing,” computationally replacing disease-risk variants of APOE with protective variants in patient genomes. VariantFormer correctly predicted that a disease-risk variant (APOE-ε4) increased Alzheimer’s risk scores, while a protective variant (APOE-ε2) decreased them, recapitulating clinical observations while also enabling patient-specific predictions.
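The logic of the in silico edit can be sketched as follows: run the same risk-scoring pipeline on the observed genotype and on a genotype with the protective allele swapped in, then compare the two scores. The predict_risk function and allele effects below are placeholders standing in for the full model, not the published analysis.

```python
# Conceptual sketch of in silico gene editing: swap APOE alleles and compare risk.
def predict_risk(genotype: dict) -> float:
    # Placeholder scoring; in practice this would run the model end to end.
    allele_effect = {"e2": -0.5, "e3": 0.0, "e4": 0.8}   # illustrative effect sizes
    return 1.0 + sum(allele_effect[a] for a in genotype["APOE"])

patient = {"APOE": ("e3", "e4")}   # observed risk genotype
edited  = {"APOE": ("e3", "e2")}   # protective allele swapped in computationally

print("risk with APOE-e4:", predict_risk(patient))
print("risk with APOE-e2:", predict_risk(edited))
```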
Researchers could also use VariantFormer to score and rank genetic variants, identifying which ones are most likely to actually affect gene expression, especially low-frequency variants that are hard to study in large populations. Since most disease variants are of low to moderate frequency, researchers can use the model to confirm or generate hypotheses about potential targets associated with disease-related traits.
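A simple way to picture variant prioritization is to score each candidate variant by how much it shifts the model’s predicted expression relative to the unedited sequence, then rank by that shift. In the sketch below, predict_expression is a trivial stand-in for a real model call, and the sequences and variants are invented for illustration.

```python
# Rank candidate variants by their predicted effect on expression.
def predict_expression(sequence: str) -> float:
    # Placeholder: stands in for a model call; here, a trivial GC-content proxy.
    return sum(base in "GC" for base in sequence) / len(sequence)

def apply_variant(sequence: str, pos: int, alt: str) -> str:
    return sequence[:pos] + alt + sequence[pos + 1:]

reference = "ACGTACGTACGTACGT"
variants = [{"id": "var1", "pos": 3, "alt": "G"},
            {"id": "var2", "pos": 7, "alt": "A"},
            {"id": "var3", "pos": 12, "alt": "C"}]

baseline = predict_expression(reference)
scored = [(v["id"], abs(predict_expression(apply_variant(reference, v["pos"], v["alt"])) - baseline))
          for v in variants]
for vid, delta in sorted(scored, key=lambda x: x[1], reverse=True):
    print(f"{vid}: |delta predicted expression| = {delta:.3f}")
```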
How VariantFormer performs
VariantFormer achieves state-of-the-art performance in predicting individual-level gene expression variability. For protein-coding genes, VariantFormer achieved a 2.2% improvement over genotype-based models and 3.9% over Enformer. The advantage becomes even more pronounced for non-protein-coding genes, which make up the majority of annotated genes: VariantFormer showed a 16% improvement over genotype-based models and outperformed DNA foundation models like Enformer by 7.3%.
The performance gap widens dramatically on mutation-rich cancer cell lines, where VariantFormer achieved correlations ranging from 0.76-0.85 across different cell lines, compared to 0.55-0.75 for other DNA models. This demonstrates VariantFormer’s ability to handle the high mutation burden characteristic of cancer genomes — a scenario in which traditional population-based models cannot even make predictions because they rely on pre-computed weights from specific variant sets.
Not a “black box,” VariantFormer allows researchers to peer directly into the inner workings of the model, building confidence in its ability to accurately capture biological mechanisms. By inspecting the model’s internal attention mechanisms, Biohub researchers were able to visualize which regulatory elements influence a gene’s activity in a specific tissue. These learned interactions correlate strongly with experimental data, such as DNase-seq measurements of chromatin accessibility, supporting the model’s biological validity.
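The flavor of this interpretability check can be illustrated by correlating per-element attention weights with an orthogonal accessibility signal such as DNase-seq. The arrays below are synthetic placeholders rather than real model outputs.

```python
# Compare attention weights over regulatory elements with an accessibility signal.
import numpy as np

rng = np.random.default_rng(1)
attention_to_elements = rng.random(20)   # attention from the gene's query to 20 nearby elements
dnase_signal = attention_to_elements * 2 + rng.normal(scale=0.3, size=20)  # synthetic correlated signal

r = np.corrcoef(attention_to_elements, dnase_signal)[0, 1]
print(f"Pearson correlation between attention and DNase-seq signal: {r:.2f}")
```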
When developing VariantFormer, Biohub researchers made sure to prioritize the privacy and security considerations that are essential when working with genomics data. Crucially, because the model operates on personal genetic data, its development is guided by a strong commitment to responsible AI. The VariantFormer team adheres strictly to all data use agreements, such as those from the National Institutes of Health, and the model was designed to be non-generative, meaning it cannot be used to create new DNA sequences. Maintaining patient privacy and preventing the re-identification of individuals is a core principle guiding VariantFormer’s development.
What’s next for VariantFormer
VariantFormer represents a major step towards personalized genomics, allowing researchers to better understand the complex, multi-system consequences of genomic variation at an individual level and accelerating the discovery of new therapeutic strategies.
While VariantFormer already marks a significant step toward individualized genetics predictions, the model is still in its first iteration. Beyond Alzheimer’s, VariantFormer offers the potential to dive deeper into the genetic mechanisms of a broad range of diseases.
At Biohub, our ultimate mission is to cure and prevent all disease. This goal can’t be accomplished with a blanket approach to studying the genes driving disease. Models like VariantFormer demonstrate our commitment to work towards realizing this vision for each individual, accounting for the genetic differences that make both biology and human beings complex and beautiful.
Responsible use of VariantFormer
VariantFormer is designed to help researchers understand disease mechanisms and apply them in contexts like identifying potential drug targets, prioritizing genetic variants for study, and creating digital twins for research purposes — not to tell individuals whether they have or will develop a disease.
The model has not undergone clinical validation or regulatory approval, and it does not meet the standards required for medical diagnostics, so it should not be used for these purposes. Our intent is not to engage in diagnostic activity, but to share and describe what may be possible in therapeutic research and discovery.
The model generates gene expression predictions based on patterns in training data and layers disease risk scores on those predictions; it does not provide definitive diagnoses. At this early stage, these statistical estimates can be expected to be unreliable for any single individual, and sharing personal results poses a significant risk of misinterpretation.
For this reason, along with the model release, we have made certain generic and anonymized data available strictly for research purposes.
Researchers can access VariantFormer on the virtual cells platform, which includes a tutorial and quick start, along with the codebase on GitHub and the preprint on bioRxiv.
VariantFormer team: Led by Sayan Ghosal and Theofanis Karaletsos, along with Youssef Barhomi, Tejaswini Ganapathi, Amy Krystosik, Lakshmi Krishnan, Sashidhar Guntury, Donghui Li, and Francesco Paolo Casale.