Machine learning isn’t new to biology; biologists have been applying it since the 1970s, with applications intensifying in the 1990s. What is new is the ability to train AI models at scale. Advances in transformer architectures now enable the field to train larger models and deploy far greater amounts of compute, creating an opportunity for a paradigm shift in biology.
Yet unlike the wealth of online data used to train large language models like ChatGPT, we don’t have an internet’s worth of data in biology. Humans are made up of trillions of cells, and each cell is made up of billions of molecules, all of which are constantly interacting. A brute-force approach, in which every possible measurement of every cell and tissue is collected, is likely beyond the scope of current technology, and certainly beyond the capability of any single entity. Instead, the scientific community needs to think strategically about which biological processes are most important to model, what type of data is needed, and how best to gather it.

At Biohub, we’ve been thinking deeply about that strategic path forward: Given the maturity, scalability and availability of today’s technologies, which datasets are the right starting point, and how is that data best gathered and shared? We explore these questions in a new preprint on arXiv, and, to propel the conversation forward, here are the four ways I believe the data community can produce AI models with true biological impact.
1. Gather the right kind of biological data
To build AI models that capture cell biology across molecular, cellular and systems levels, Biohub will produce vast amounts of data through four scientific grand challenges and the development of virtual cell models. Our data generation is structured around three key areas where I believe data collection efforts will be foundational.
The first is cellular diversity and evolution. Deeply understanding diversity within humans and across organisms will reveal a great deal about the different ways life functions and disease emerges. By collecting data from a vast range of cell types, both within and across organisms, the scientific community has an opportunity to understand the cell states that give rise to the enormous variety of living systems.
The second area is chemical and genetic perturbation. Scientists are no longer limited to observing the genetics of individual people; in the laboratory, we can perturb individual genes and cells to learn which changes are tolerated and which lead to phenotypes associated with disease. By creating a large dataset in which each cell type is perturbed, or each gene is perturbed in each cell type, it will be possible to gain a fundamentally causal understanding of how cells function.
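To get a rough sense of the scale such a perturbation atlas implies, here is a back-of-envelope sketch. The numbers (roughly 20,000 protein-coding genes, a few hundred cell types, and the cells profiled per condition) are illustrative assumptions, not Biohub targets.

```python
# Back-of-envelope sketch of the size of a systematic perturbation atlas.
# All quantities below are illustrative assumptions, not program targets.

protein_coding_genes = 20_000    # approximate human protein-coding gene count
cell_types = 400                 # assumed number of cell types to cover
cells_per_condition = 100        # assumed cells profiled per gene/cell-type pair

conditions = protein_coding_genes * cell_types
cells_profiled = conditions * cells_per_condition

print(f"{conditions:,} gene x cell-type conditions")    # 8,000,000
print(f"~{cells_profiled:,} single cells to profile")   # 800,000,000
```

Even single-gene perturbations run into millions of conditions, and combinatorial perturbations grow far faster still, which is why the question of focus in the third point below matters.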
The third area is multiscale imaging and dynamics. Cells are dynamic systems, so one cannot rely on a static snapshot of a cell-to-cell interaction or a biological system to understand how disease occurs; we need to observe how cells and tissues change over time. Large-scale, dynamic imaging is an emerging capability that gives the field the opportunity to do exactly that.
2. Build on existing ontologies and knowledge frameworks
The field is in a good position to create a large, interoperable data resource thanks to the scientific community’s past investment in robust ontologies and knowledge frameworks. Model organism communities have developed not only consistent ways of referring to their organisms, but also ways of mapping concepts across them. For example, while most people have a sense of what adolescence means in humans, how do we identify the corresponding developmental stages in mice and fish? Organizations such as the European Bioinformatics Institute have created ways of describing the relationships between different measurement technologies and different kinds of cells. This thoughtful, structured, consistent, machine-readable metadata, together with standard formats that encode decades of biological knowledge and experimentation, provides a foundation for bringing together large datasets generated by multiple institutions, which in turn will let us train models with superior performance.
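As an illustration of what ontology-backed, machine-readable metadata looks like in practice, here is a minimal sketch that annotates a toy single-cell table with terms from community ontologies (the Cell Ontology for cell types, UBERON for tissues, NCBITaxon for organisms). The records are invented, the column names loosely follow conventions used by resources such as CZ CELLxGENE, and the term IDs should always be checked against the ontology release in use.

```python
import pandas as pd

# A toy metadata table for three cells, annotated with ontology term IDs
# rather than free-text labels alone. The records are invented for
# illustration; verify term IDs against the ontology release you use.
cells = pd.DataFrame({
    "cell_id": ["cell_001", "cell_002", "cell_003"],
    "cell_type": ["B cell", "T cell", "B cell"],
    "cell_type_ontology_term_id": ["CL:0000236", "CL:0000084", "CL:0000236"],
    "tissue": ["blood", "blood", "lung"],
    "tissue_ontology_term_id": ["UBERON:0000178", "UBERON:0000178", "UBERON:0002048"],
    "organism": ["Homo sapiens"] * 3,
    "organism_ontology_term_id": ["NCBITaxon:9606"] * 3,
})

# Because labels resolve to shared identifiers, tables produced by different
# institutions can be concatenated and queried consistently.
print(cells.groupby("cell_type_ontology_term_id").size())
```

The payoff is that a "B cell" measured at one institution means exactly the same thing as a "B cell" in a dataset produced anywhere else.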
3. Approach biological data collection and management systematically
As a field, we must develop a shared, systematic approach to biological data management, consistency and accessibility. We need to focus on:
- Speed — To tackle biology’s most profound challenges, we need technologies that can reliably produce high-quality data at pace. This may mean using commercially available technologies, or focusing intensely on ensuring that technologies under development generate data quickly and reproducibly.
- Cost — Data generation already spans petabyte-scale repositories and will eventually exceed the exabyte scale, so the field will need to find ways to drive costs down while gathering data in iterative cycles.
- Focus — The space of all the ways one might perturb cells, or measure how cells interact with their environment, is too broad to be measured directly. We need to pick the right biological problems to focus on. At Biohub, we are focused on the three key areas mentioned in my first point.
- Interoperability — It will be necessary to ensure that data generated by one organization can be combined in future modeling iterations with data generated by other organizations across the community. Establishing a mechanism to share data in a consistent format is of paramount importance.
- Infrastructure — Data collection and management can be straightforward when the data fits on a laptop. But at petabyte scale, there are real, tangible costs associated with a lack of interoperability: it might cost $100,000 to move a copy of a dataset or to reprocess it into a new format (a rough sketch of that arithmetic follows this list). The field needs to think proactively about the right ways to house these data, so that when we want to create a new model tomorrow, we can successfully build on foundations created today.
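That $100,000 figure is easy to sanity-check. The sketch below uses an assumed per-gigabyte egress price in the range cloud providers typically charge; it is illustrative arithmetic, not a quote.

```python
# Rough sanity check on the cost of moving a petabyte-scale dataset.
# The egress price is an assumed round number; real pricing varies by
# provider, region and negotiated discounts.

dataset_size_tb = 1_000        # one petabyte, expressed in terabytes
egress_price_per_gb = 0.09     # assumed cloud egress price, USD per GB

transfer_cost = dataset_size_tb * 1_000 * egress_price_per_gb
print(f"~${transfer_cost:,.0f} to move one copy")   # ~$90,000
```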
4. Adopt a federated data architecture
If there are at least one or two dozen institutions interested in building virtual cell models at scale, there are two possible data-sharing models. Either each institution internalizes a copy of every other institution’s data — which means all of us pay to transfer the whole world’s data — or we come together to build a mechanism that moves data directly from its at-rest location to the compute needed to train a model. The former will cost an order of magnitude more, so I argue that the right, most cost-effective approach is a federated model for collaborative, AI-driven workflows.
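To make the comparison concrete, here is a rough sketch under stated assumptions: roughly 20 participating institutions, a shared corpus of about 5 petabytes, and the same assumed per-gigabyte transfer price as in the sketch above. The point is the multiplier, not the exact dollar figures.

```python
# Rough comparison of the two data-sharing models described above.
# All quantities are illustrative assumptions.

institutions = 20             # assumed number of participating institutions
corpus_size_gb = 5_000_000    # assumed shared corpus, roughly 5 PB in GB
price_per_gb = 0.09           # assumed transfer price, USD per GB

# Model 1: every institution keeps a full copy of everyone else's data,
# so the whole corpus is transferred once per institution.
replicate_everywhere = institutions * corpus_size_gb * price_per_gb

# Model 2: a federated setup in which data moves once from its at-rest
# location to the compute used for a given training run.
federated_single_run = corpus_size_gb * price_per_gb

print(f"replicate everywhere: ~${replicate_everywhere:,.0f}")   # ~$9,000,000
print(f"federated, per run:   ~${federated_single_run:,.0f}")   # ~$450,000
print(f"multiplier: ~{replicate_everywhere / federated_single_run:.0f}x")
```

With around 20 participants the multiplier is roughly 20x, consistent with the order-of-magnitude claim above, and the federated model additionally avoids storing 20 redundant copies of the corpus.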
At Biohub, we’ve developed a command-line interface aimed at streamlining access to our federated data collection, which is poised to grow into a resource of billions of cells and petabytes of imaging data. We invite those interested in joining the alpha testing cohort for our interface — where one can search, analyze and download multiple modalities of data in a single query — to sign up here.
Ambrose Carr is senior director of data at Biohub, where he leads the data team, which works at the intersection of cell biology, data science and software engineering. The data team creates datasets fit for AI training and builds applications to publish, aggregate and integrate high-dimensional cell biology data. It is responsible for some of the largest open-access public biological datasets, such as CZ CELLxGENE and the CryoET portal.