The human body is made up of trillions of diverse cells. Even within a single tissue, cells exhibit vast differences in which genes they express and how they respond to the local environment. Single-cell RNA sequencing has revolutionized the ability to measure this diversity, revealing cellular states that were previously invisible. But scRNA-seq experiments remain expensive, time-consuming, and typically only capture snapshots of cells at single moments in time.
What if researchers could simulate cellular behavior by simply specifying the conditions they wanted to explore? Imagine providing inputs such as tissue type and experimental conditions, then receiving gene expression outputs that accurately simulate what would happen in the lab. This capability — conditional generation of cells — would enable researchers to run virtual experiments, predict how cells respond to drugs or genetic changes, and explore cellular behaviors that are difficult or impossible to observe directly.
Developed by Biohub’s AI team, scLDM (single-cell latent diffusion model) is a new AI model that generates realistic single-cell data at unprecedented fidelity. The model accepts features as input — tissue type, cell type, experimental conditions — and outputs gene expression vectors that realistically simulate experiments with those parameters. Think of it as “prompting” for cells: Researchers describe what they want to generate, and scLDM produces synthetic cell data matching those specifications with state-of-the-art accuracy.
How scLDM works
scLDM, a deep generative model of transcriptomics, innovates by creating rich tokenized embeddings of raw transcript data through a variational autoencoder, then modeling those embeddings in latent space using flow matching. This two-stage approach — first compressing high-dimensional gene expression into meaningful representations, then learning to generate those representations conditionally — enables both computational efficiency and biological fidelity.
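The two-stage recipe can be sketched in a few lines. This is a minimal illustration of the general idea (an encoder compressing expression into a latent, plus a flow-matching regression target in that latent space), not scLDM's actual code: the shapes, the random-projection stand-in for the trained VAE encoder, and all names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_latent = 2000, 64

# Stage 1: a trained VAE encoder would compress raw counts into a
# low-dimensional latent. A fixed random projection stands in here.
W_enc = rng.normal(size=(n_genes, n_latent)) / np.sqrt(n_genes)

def encode(x):
    return x @ W_enc

# Stage 2: flow matching in latent space. For each training example,
# draw a data latent z1, a noise sample z0, and a random time t; the
# generative network is trained to regress the constant velocity
# z1 - z0 at the interpolated point z_t = (1 - t) * z0 + t * z1.
x = rng.poisson(2.0, size=n_genes).astype(float)  # toy expression vector
z1 = encode(x)
z0 = rng.normal(size=n_latent)
t = rng.uniform()
z_t = (1 - t) * z0 + t * z1
v_target = z1 - z0  # regression target for the velocity network

print(z_t.shape, v_target.shape)
```

Compressing first means the expensive generative modeling happens in 64 dimensions rather than across thousands of genes, which is where the efficiency of the two-stage design comes from.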
The model adapts ideas from popular image generation models — like those behind DALL-E and Stable Diffusion — while accounting for a biological property that images lack: whereas pixel position matters in an image, genes have no inherent ordering. Just as the contents of a bag of marbles remain the same regardless of how the marbles are arranged, a cell’s identity is determined by which genes are expressed, not by any particular gene order. scLDM respects this “exchangeable” nature of gene expression data by using attention mechanisms that are invariant to the order of their inputs.
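The "bag of marbles" property is easy to demonstrate: attention without positional encodings treats its inputs as a set. The sketch below is a toy, not scLDM's architecture — the cross-attention pooling layer, dimensions, and names are assumptions — but it shows how an attention readout gives the same answer no matter how the gene tokens are ordered.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, q, Wk, Wv):
    """Cross-attention from a learned query to a set of gene tokens.
    With no positional encodings, the weighted sum over tokens is
    identical for any permutation of the rows of `tokens`."""
    k, v = tokens @ Wk, tokens @ Wv
    w = softmax(q @ k.T)
    return w @ v

rng = np.random.default_rng(1)
d = 16
tokens = rng.normal(size=(100, d))  # 100 hypothetical gene-token embeddings
q = rng.normal(size=(1, d))         # a learned pooling query
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

out = attention_pool(tokens, q, Wk, Wv)
shuffled = attention_pool(rng.permutation(tokens), q, Wk, Wv)
print(np.allclose(out, shuffled))  # True: gene order doesn't matter
```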
The model’s multi-conditional generation capabilities make it a versatile simulator. In our experiments, we evaluated the model’s ability to generate realistic single-cell profiles in both observational and perturbation scenarios, even for combinations never directly observed in the training data.
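Conditional generation in a latent flow-matching setup amounts to integrating a learned, condition-dependent velocity field from noise toward data. The sketch below shows only the sampling loop; the toy velocity field and the one-hot "condition embedding" are stand-ins invented for illustration, not scLDM's trained network.

```python
import numpy as np

rng = np.random.default_rng(2)
n_latent = 64

# Hypothetical "prompt": condition labels embedded as a latent-sized vector.
# (The one-hot scheme and index choices are purely illustrative.)
cond_vec = np.zeros(n_latent)
cond_vec[3] = 1.0    # e.g. a tissue label
cond_vec[17] = 1.0   # e.g. a perturbation label

def velocity(z, t, c):
    """Stand-in for a trained conditional velocity network v(z, t, c).
    This toy field simply drifts toward the condition embedding."""
    return c - z

# Sampling: Euler-integrate dz/dt = v(z, t, c) from noise (t=0) to t=1.
z0 = rng.normal(size=n_latent)
z = z0.copy()
steps = 100
dt = 1.0 / steps
for i in range(steps):
    z = z + dt * velocity(z, i * dt, cond_vec)

# z is now a generated latent; a VAE decoder would map it back to a gene
# expression vector matching the requested conditions.
d0 = np.linalg.norm(z0 - cond_vec)
d1 = np.linalg.norm(z - cond_vec)
print(d1 < d0)  # the sample has drifted toward the conditioned target
```

Because the conditions enter only through the velocity network's inputs, the same sampling loop serves any combination of tissue, cell type, and perturbation labels, including combinations never seen together during training.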

How scLDM performs
scLDM achieves state-of-the-art performance specifically in conditional generation — a strategic capability that enables virtual experiments. For cell reconstruction, the model achieves up to four times better correlation with ground truth compared to previous approaches, with particularly strong improvements on complex datasets like the Human Lung Cell Atlas.
For cell generation, scLDM surpasses all publicly available models, producing more realistic synthetic cells across numerous metrics. Whether generating cells of specified types or under specified perturbation conditions, the model captures the full diversity of cellular states while maintaining biological plausibility.
For downstream applications, the embeddings learned by scLDM prove valuable for tasks like classifying cell types, such as identifying COVID-infected cells or liver cancer cells. scLDM consistently matched or exceeded the performance of specialized foundation models, despite not being explicitly trained for these tasks.
An exciting property of generative models like scLDM is the ability to generate counterfactual cells — answering “what if” questions, such as how cells would respond to perturbations that would be expensive or impractical to test experimentally. Validated on two challenging datasets — Parse 1M, containing over 1.2 million immune cells treated with 90 different cytokines, and Replogle, featuring genetic knockout experiments across multiple cell lines — the model successfully learned to map input conditions to realistic gene expression outputs for previously unseen combinations of cell context and perturbation.
The ability to perform virtual experiments this way could transform how researchers approach experimental design. Before committing resources to expensive screens, they can conduct virtual gene expression experiments across thousands of conditions, identifying the most promising candidates for laboratory validation. For example, scLDM can generate realistic single-cell profiles from a sparse experimental design, filling in the gaps of missing phenotypes in silico.
What’s next
scLDM represents a foundational component of Biohub’s vision to create increasingly sophisticated virtual cell models that integrate multiple data modalities. The scLDM framework is designed to be extensible: while the initial focus is on gene expression, the same input-output mapping principles can be applied to other molecular data types like proteins, epigenetic marks and even images. As we continue to advance AI for biology, models like scLDM bring the scientific community closer to a future where computational models can reliably predict cellular behavior — transforming how we understand cells and how we prioritize experiments.
Researchers can access scLDM on the virtual cells platform, including the codebase on GitHub and the preprint.
scLDM team: Giovanni Palla, Sudarshan Babu, Payam Dibaeinia, James D. Pearce, Donghui Li, Aly A. Khan, Theofanis Karaletsos and Jakub M. Tomczak.