Virtual Cells for Microscopy Assays
Overcoming the physical constraints of the lab
Since the invention of the microscope by Antonie van Leeuwenhoek and Robert Hooke in the 17th century—a story beautifully told in “The Song of the Cell”— microscopy has been one of our main ways of observing cells and their states. It made cells directly observable and, over time, evolved from simple visual inspection to rich quantitative measurements of morphology, organization, and cellular response. Today, industrialized labs can run millions of microscopy experiments each week. And yet, even the most advanced labs can cover only a tiny fraction of all the experiments we may want to run. If we could simulate the outcomes of these microscopy-based assays, we would have a virtual assay that acts as an in silico proxy for a large fraction of these experiments.
Welcome back to “Inside Valence”, a series where you’ll get a behind-the-scenes look at our research, exploring new ways to predict, explain, and ultimately decode biology at Recursion.
In our previous post, we discussed how, for the first time in history, we have the right combination of compute and massive, fit-for-purpose datasets to begin simulating biology at scale. We also introduced our vision for Virtual Cells through three pillars: Predict, Explain, and Discover. Today, we’re zooming in on the first of these (Predict) through models that predict how assay readouts change under perturbation. We’ll call these perturbation effect prediction models virtual assays because they simulate the measurements we can collect in real experiments. Such models aim to predict assay outcomes under perturbation, even when the underlying mechanism is only partially understood. In this post, we’ll look at why they are needed, the core ideas behind them and how the latest research—including our own—fits into all of this.
Modeling Beyond Physical Constraints
While global collaborations like the Human Genome Project or Human Cell Atlas have shown the power of pooling resources for large, well-scoped observational datasets, drug discovery is fundamentally perturbational. We perturb diverse cell types to create various disease models, through gene knockouts or soluble signals such as cytokines, and measure how they respond to compounds drawn from a vast chemical space of drug-like molecules (theoretical estimates go as high as 10^60, with the commercial Enamine REAL Space library already spanning ~78B compounds). This complexity grows further when considering double perturbations, such as double gene knockouts or combination therapies. In short, the number of experiments one may want to run quickly outpaces global lab capacity.
To work within these physical constraints, drug discovery has long used computational models to prioritize what to test experimentally. Traditionally, those models were built to predict narrow endpoints, such as binding affinity, toxicity, or cell viability.
With the emergence of Virtual Cells, a new approach is taking hold: rather than predicting a single property, itself a function of the cell’s state, the aim is to directly model how cellular state shifts under intervention. That makes such models a more flexible tool for exploring biology and for supporting downstream decision-making. We refer to these as virtual assays, and they form the core intuition behind the “Predict” pillar in our Virtual Cell vision.
Virtual Assays Across Modalities
As we discussed in last week’s post, the research on Virtual Cells can be broadly divided into two directions: top-down and bottom-up. Bottom-up approaches, which are rooted in biomolecular simulation, are currently focused on protein–ligand and protein–protein systems, and there is promising work using ML to drastically increase the scale of the systems we can model. These approaches have already proven useful in predicting binding assays, and scaling them to the mesoscale could one day allow us to model complex ADMET properties or even whole-cell behavior. We will return to that direction in a future post. Here, we shift our attention to top-down approaches.
Thanks to the growing availability of public datasets in transcriptomics, such as Tahoe-100M, Xaira’s X-Atlas/Orion, and Ginkgo’s VCPI, initial top-down modeling efforts have primarily focused on predicting shifts in RNA expression under perturbations, such as gene knockouts or compound treatment. Arc released STATE, which uses a State Embedding (SE) and State Transition (ST) module to learn cell state transitions; Xaira released X-Cell, which uses a diffusion language model to generate the perturbed distribution from an initial control distribution; we released TxPert, which leverages knowledge graphs and prior knowledge to improve out-of-distribution (OOD) prediction.
Naturally, this first generation surfaced a key lesson: no single modality is enough to fully capture a cell’s state. Transcriptomics has been a natural and important starting point, but it has clear limits. RNA abundance is informative but it does not fully determine protein abundance or activity, which is a gap proteomics (see, for example, the Perturb-PBMC dataset) could help fill. Traditional transcriptomics also only captures a snapshot—not the dynamics of the cell’s transient processes—because the measurement itself destroys the cell. Spatial transcriptomics (as used by Noetik’s OCTO-vc) may partially bridge some of these gaps by adding tissue organization and local context, but it remains difficult to deploy at scale in large perturbational settings. Microscopy-based phenomics (see the JUMP-CP and rxrx.ai datasets) offers a complementary view: it captures morphology and other structural and spatial consequences of cellular response that transcriptomics alone cannot. This matters not only because it provides an additional modality, but because it offers a different biological perspective on cellular response. All in all, each modality comes with its own trade-offs.
Strong unimodal models remain essential on the path to multimodal Virtual Cells. Integrated systems are built on modality-specific foundations, and each modality forces us to solve a different part of the broader problem. In imaging, computer vision has shown that large-scale modeling can produce flexible and broadly useful representations, and biology is beginning to see the same with models such as Recursion’s Phenom series. But representation learning is not the same as simulating how cell states transition under perturbation. That generative step remains a major gap, and closing it means building virtual assays that can predict microscopy readouts under different perturbations.
Modeling Microscopy Readouts Under Perturbation and Experimental Variation
Virtual assays, including microscopy phenomics, are fundamentally a conditional generative modelling problem. If we could explicitly label all relevant sources of heterogeneity in our data, we could (in principle) reduce the problem to conditional prediction, and a sufficiently well-conditioned mean predictor might perform well. In practice, however, cellular responses often depend on latent or unobserved factors, such as cell-cycle stage, microenvironment, or pre-existing cell state. As a result, the response to a perturbation is not always concentrated around a single outcome: it may be multimodal, heavy-tailed, or contain rare but biologically important events. Generative models are useful precisely because they aim to represent this full conditional distribution, rather than only its average.
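To make the point about averages concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical: the function name `simulate_responses`, the 30% responder fraction, and the one-dimensional "readout" are assumptions chosen for illustration, not measurements from any assay. It simulates a population in which a hidden subpopulation responds strongly to a perturbation, then checks how many cells actually look like the mean prediction.

```python
import random
import statistics

def simulate_responses(n_cells, responder_fraction=0.3, seed=0):
    """Toy 1-D readout: a hidden subpopulation responds strongly, the rest barely move."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_cells):
        # Latent, unobserved factor (e.g. cell-cycle stage) decides who responds.
        if rng.random() < responder_fraction:
            responses.append(rng.gauss(5.0, 0.5))
        else:
            responses.append(rng.gauss(0.0, 0.5))
    return responses

responses = simulate_responses(10_000)

# A mean predictor collapses the whole population to a single number.
mean_prediction = statistics.fmean(responses)

# Fraction of cells whose readout lies within 1 unit of that prediction.
near_mean = sum(abs(x - mean_prediction) < 1.0 for x in responses) / len(responses)

print(f"mean prediction: {mean_prediction:.2f}")
print(f"fraction of cells within 1 unit of the mean: {near_mean:.3f}")
```

In this toy setting the mean falls between the two modes, where few cells actually sit. A model of the full conditional distribution would instead be asked to reproduce both modes and their proportions.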
When we apply a perturbation i (e.g., a small molecule, CRISPR guide, antibody, etc.), we intervene on cellular processes and induce a new distribution of cellular phenotypes, P(x|do(i)), where the “do” indicates that the cells have been experimentally perturbed rather than passively observed (for a reminder on do-calculus and causality, Pearl’s The Book of Why is great). To learn such interventional distributions, we need to account for potential confounding effects. In these assays, perturbations are experimentally assigned, so the main source of confounding comes from spurious dependencies induced by experimental conditions, such as batch effects. For example, if the cell population drifts or experimental conditions change over time, the measured effect of a perturbation may reflect both the perturbation itself and these underlying changes, i.e. P(x|i, b), where b denotes batch effects. Because we care about the perturbation effect rather than these nuisance factors, we need to account for this confounding.
To address this, one can model the difference between control cells, P(x|c, b), and perturbed cells, P(x|i, b), within each batch rather than comparing across batches. This follows the same intuition as in classical causal inference. If we looked at correlations between ice cream sales and sunburn, for example, we might be tempted to conclude that eating ice cream (our “perturbation”) makes you more likely to get sunburned. But this relationship is heavily confounded by the weather: we eat more ice cream and get more sunburned on sunny days. If we instead compare “within batch” by looking at the correlation only on hot days (and similarly for cold days), the apparent association might disappear. More generally, we reduce confounding by restricting the analysis to subsets of the data in which treated and control groups differ primarily by whether they received treatment, and then taking a weighted average across those subsets. Incidentally, this is the intuition behind Pearl’s backdoor adjustment formula:
P(x|do(i)) = Σ_b P(x|i, b) P(b)
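As a rough numerical illustration of the ice cream example, the sketch below applies the backdoor adjustment to a small table of made-up counts. The numbers and the function names `naive_rate` and `backdoor_rate` are hypothetical; weather plays the role of the batch b, ice cream the perturbation i, and sunburn the readout x.

```python
# Hypothetical counts: (weather, ate_ice_cream) -> (n_people, n_sunburned).
# Within each weather stratum the sunburn rate is the same with or without ice cream.
counts = {
    ("sunny", True): (800, 240),
    ("sunny", False): (200, 60),
    ("cloudy", True): (100, 5),
    ("cloudy", False): (900, 45),
}

def naive_rate(counts, treated):
    """P(sunburn | ice cream status), pooling over weather (confounded)."""
    n = sum(c[0] for (_, t), c in counts.items() if t == treated)
    k = sum(c[1] for (_, t), c in counts.items() if t == treated)
    return k / n

def backdoor_rate(counts, treated):
    """Backdoor adjustment: sum over strata of P(sunburn | treated, weather) * P(weather)."""
    total = sum(n for n, _ in counts.values())
    rate = 0.0
    for weather in sorted({w for w, _ in counts}):
        n_weather = sum(c[0] for (w, _), c in counts.items() if w == weather)
        n_wt, k_wt = counts[(weather, treated)]
        rate += (k_wt / n_wt) * (n_weather / total)
    return rate

naive_effect = naive_rate(counts, True) - naive_rate(counts, False)
adjusted_effect = backdoor_rate(counts, True) - backdoor_rate(counts, False)

print(f"naive difference in sunburn rate:    {naive_effect:.3f}")
print(f"adjusted difference in sunburn rate: {adjusted_effect:.3f}")
```

With these counts, the pooled comparison suggests that ice cream raises sunburn risk, while the stratified, reweighted comparison shows no effect. The same logic motivates comparing perturbed wells against controls from the same batch.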
To model the effects of perturbations on cells, we therefore need a way to model the conditional response of cells to a perturbation within each batch, P(x|i, b). Fortunately, modern generative modelling provides several tools for this (e.g. Diffusion Models and Flow Matching). We will focus on Flow Matching (Lipman et al.) and its minibatch variants (Tong et al.). These models are flexible enough to capture the biological variation that results from perturbations, while conditioning on a molecular representation (implemented, for example, via Feature-wise Linear Modulation, or FiLM) to learn perturbation-specific effects.
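For readers who have not seen these pieces before, here is a deliberately small sketch of the two ingredients in plain Python. It assumes the common straight-line interpolation path between a source sample x0 and a data sample x1, which may differ from the exact parameterization used in the papers cited above. The function `toy_velocity_model`, its `params` dictionary, and the two-dimensional perturbation embedding are hypothetical stand-ins for a learned network, and nothing is trained here.

```python
import random

def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))

def film(hidden, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each hidden feature."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)

def toy_velocity_model(x_t, t, perturbation_embedding, params):
    """Hypothetical model: the perturbation embedding sets the FiLM scale and shift."""
    gamma = [1.0 + dot(row, perturbation_embedding) for row in params["gamma_weights"]]
    beta = [dot(row, perturbation_embedding) for row in params["beta_weights"]]
    hidden = [x + t for x in x_t]  # stand-in for a learned feature extractor
    return film(hidden, gamma, beta)

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # source sample (Gaussian noise here)
x1 = [1.0, -0.5, 2.0, 0.3]                    # stand-in for an embedded perturbed cell
t = rng.random()

x_t, target = flow_matching_pair(x0, x1, t)
params = {
    "gamma_weights": [[0.1, 0.0] for _ in range(4)],
    "beta_weights": [[0.0, 0.2] for _ in range(4)],
}
pred = toy_velocity_model(x_t, t, [1.0, -1.0], params)
loss = flow_matching_loss(pred, target)
print(f"flow matching loss for the untrained toy model: {loss:.3f}")
```

Training would repeat this over many (x0, x1, t) triples and update the model so that its predicted velocity matches the target. Sampling then integrates the learned velocity field from t = 0 to t = 1.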
This is very similar to the methods that underlie text-conditioned generative models of images, but we have an additional design decision in biology. In perturbation experiments, we often observe control wells in every batch. So rather than learning the map from noise to data (as is typically done in image generators), we can instead learn to map from the distribution of control cells, P(x|c, b), to the distribution of perturbed cells, P(x|i, b). This is the approach taken by morphology image generative models like CellFlux and CellFlux v2, and by many recent approaches for transcriptomics data (CellFlow, the original minibatch OT paper from Tong et al., Meta Flow Matching, etc.). Conceptually, this assumes that the perturbed distribution can be viewed as a transformed version of the control distribution (a pushforward measure) via some function, f_i, that depends only on the perturbation (and not on the batch effects).
This is a meaningful restriction: not every target distribution can be written in this form, but it aligns well with the biological intuition that perturbations transform an existing cell state distribution. The alternative, of course, is just to learn P(x|i, b) directly, which is more flexible but may require more data to learn—an example of the classic bias-variance tradeoff.
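The pushforward assumption can be illustrated with a toy simulation. In the sketch below, two batches have different baseline readouts, and a hypothetical perturbation map `f_i` (a constant shift, chosen purely for simplicity) is applied to control-like cells in each batch. All names and numbers are made up for illustration.

```python
import random
from statistics import fmean

def f_i(x, shift=2.0):
    """Hypothetical perturbation map, shared across batches (a constant shift here)."""
    return x + shift

def simulate_batch(batch_offset, n_cells, rng):
    """Control cells for one batch, plus perturbed cells obtained by pushing
    fresh control-like cells through f_i."""
    control = [rng.gauss(batch_offset, 1.0) for _ in range(n_cells)]
    perturbed = [f_i(rng.gauss(batch_offset, 1.0)) for _ in range(n_cells)]
    return control, perturbed

rng = random.Random(0)
control_a, perturbed_a = simulate_batch(0.0, 5000, rng)  # batch A baseline
control_b, perturbed_b = simulate_batch(3.0, 5000, rng)  # batch B drifted by +3

# Within-batch comparisons isolate the perturbation effect.
within_a = fmean(perturbed_a) - fmean(control_a)
within_b = fmean(perturbed_b) - fmean(control_b)

# Comparing across batches mixes the perturbation effect with the batch shift.
across = fmean(perturbed_b) - fmean(control_a)

print(f"within batch A: {within_a:.2f}")
print(f"within batch B: {within_b:.2f}")
print(f"across batches: {across:.2f}")
```

When the assumption holds, the same map explains the perturbed cells in every batch, so a model that starts from same-batch controls only needs to learn f_i. When it does not hold, for instance if the effect itself varies by batch, modelling P(x|i, b) directly is the safer choice.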
A Simple, Stable, and Scalable Recipe for Phenomics

So which method works best in practice? Which architecture should one use (U-Nets, DiTs)? What about optimal transport maps and the other refinements of basic flow matching? The answers to these questions are inherently data-dependent, so in our recent preprint, Jones et al. 2026 (inspired by the excellent Karras et al. 2022), we set out to study them systematically.
We began by decomposing the perturbation effect prediction problem into two components, each of which affects the final performance of the model:
1. The generative modelling problem of how best to represent P(x|i,b), where we compared approaches to addressing batch effects and the various modelling decisions.
2. Learning how cells respond to unseen perturbations. This requires a perturbation representation that can capture a cell’s response, which takes us to the world of molecule and gene representations, a topic we will return to in future posts.
Intuitively, these can be thought of as (1) knowing what cellular responses “look” like and (2) knowing which responses are associated with which perturbations. If your model fails on either task, it will do a poor job modelling perturbations. For this paper, we could build on our past molecular encoder, MolGPS, to address the second problem, so we primarily focused on the first task.

We found that for microscopy data, a simple, relatively standard generative modelling approach (Gaussian noise-to-data via a modified DiT for stability) led to the best performance and significantly improved on the state of the art for perturbations already seen during training. Although seen perturbations are less interesting biologically, they provide a direct test of the first problem, namely the quality of the generative model itself, whereas unseen perturbations are harder because they depend on both components. The strength of the noise-to-data approach is somewhat surprising, since one might have expected that directly modelling the transformation from control to perturbed cells would perform better. But given that modelling P(x|i,b) directly makes fewer assumptions about the data, it is also consistent with the bitter lesson of the last decade: simple, flexible approaches tend to perform better at scale.
Using this recipe, we then evaluated performance on unseen perturbations using the BBBC021 dataset. This dataset from the Broad Institute tested 113 compound perturbations at different concentrations in the MCF7 cancer cell line. We considered both Morgan fingerprints and MolGPS for compound representation. This led to performance on unseen perturbations that was consistently as good as or better than the recent CellFlux v2 work.
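Evaluating on unseen perturbations requires that held-out compounds never appear in training, at any concentration. The sketch below shows one simple way to build such a split. It is an illustration only, not the split protocol used in the preprint or in the BBBC021 benchmark, and the record fields (`compound`, `concentration_um`) and the function `split_by_compound` are hypothetical.

```python
import random

def split_by_compound(records, held_out_fraction=0.2, seed=0):
    """Split records so that every held-out compound is absent from training."""
    compounds = sorted({r["compound"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(compounds)
    n_held_out = max(1, round(len(compounds) * held_out_fraction))
    held_out = set(compounds[:n_held_out])
    train = [r for r in records if r["compound"] not in held_out]
    test = [r for r in records if r["compound"] in held_out]
    return train, test

# Toy records: 10 made-up compounds, each at 3 concentrations.
records = [
    {"compound": f"cpd_{i}", "concentration_um": conc}
    for i in range(10)
    for conc in (0.1, 1.0, 10.0)
]

train, test = split_by_compound(records)
train_compounds = {r["compound"] for r in train}
test_compounds = {r["compound"] for r in test}

print(f"{len(train)} training records, {len(test)} held-out records")
print(f"held-out compounds: {sorted(test_compounds)}")
```

Splitting at the level of the compound, rather than the individual well or image, is what makes the held-out set a test of generalization to new perturbations instead of new replicates.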
All in all, this research proposes a framework for developing conditional generative models for perturbation effect prediction. In ongoing work, we’re applying these same learnings to other modalities and continuing to explore the design space.
Benchmarks that Validate the Biology
In the above research, we evaluate our generative models with the Fréchet Inception Distance (FID) and Kernel Inception Distance (KID). These are the current standard for evaluating generative models in imaging more generally, but they do not account for the specific applications of cell imaging in biology. Most glaringly, our unconditional model, which ignores the perturbation, outperforms everything else except CellFlux v2, showing that the generative modelling task dominates these metrics. We believe it’s time to rethink these benchmarks to go beyond image fidelity measures and instead track performance on the downstream applications we care about in drug discovery.
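To give a feel for why a fidelity metric can be blind to conditioning, here is a toy sketch of a Fréchet-style distance. It makes two loud simplifications: features are one-dimensional made-up numbers rather than Inception embeddings of images, and covariances are assumed diagonal, whereas real FID uses full covariance matrices. It also assumes the metric is computed on samples pooled across perturbations, which is an assumption for illustration and not a description of how the preprint computes FID.

```python
import random
import statistics

def frechet_distance_diagonal(features_a, features_b):
    """Fréchet distance between Gaussians fit to two feature sets,
    assuming diagonal covariance (a simplification of real FID)."""
    distance = 0.0
    for d in range(len(features_a[0])):
        col_a = [row[d] for row in features_a]
        col_b = [row[d] for row in features_b]
        mean_gap = statistics.fmean(col_a) - statistics.fmean(col_b)
        std_gap = statistics.pstdev(col_a) - statistics.pstdev(col_b)
        distance += mean_gap ** 2 + std_gap ** 2
    return distance

def sample_real(perturbation, n, rng):
    """Toy 'real' features: each perturbation has its own mode."""
    center = {"A": -2.0, "B": 2.0}[perturbation]
    return [[rng.gauss(center, 1.0)] for _ in range(n)]

def sample_unconditional(n, rng):
    """Toy generator that ignores the perturbation and draws from the pooled mixture."""
    return [[rng.gauss(rng.choice([-2.0, 2.0]), 1.0)] for _ in range(n)]

rng = random.Random(0)
real = {p: sample_real(p, 2000, rng) for p in ("A", "B")}
fake = {p: sample_unconditional(2000, rng) for p in ("A", "B")}

# Pooled over perturbations, the unconditional generator looks nearly perfect.
pooled = frechet_distance_diagonal(real["A"] + real["B"], fake["A"] + fake["B"])

# Scored per perturbation, its failure to condition becomes visible.
per_perturbation = frechet_distance_diagonal(real["A"], fake["A"])

print(f"pooled distance:           {pooled:.3f}")
print(f"distance for perturbation A: {per_perturbation:.3f}")
```

In this toy setting, the generator that ignores the perturbation scores well on the pooled metric and poorly once samples are compared per perturbation. Metrics tied to downstream tasks would need to be sensitive to exactly this kind of failure.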
Labs to Validate, not Search
Virtual assays represent the most immediate, high-impact application of the Predict pillar within our Virtual Cell vision. Simulating millions of perturbations computationally and reserving the lab for the most promising experiments enables us to fundamentally rethink how we discover new disease biology and therapies. The lab becomes a place to validate, not to search.
This may sound like a subtle shift, but it isn’t, because it changes what drives the entire system and, over time, reshapes the core question we ask. Instead of asking “what experiment should we run next?”, we begin to ask “what do we still not understand well enough to simulate?”, and that reframing moves us into a different frontier altogether: one where the bottleneck is no longer the throughput of our labs but the fidelity of our models, and how quickly we can close the gap between what we can simulate and what we still need to measure.
This post is part of “Inside Valence”, a series where you’ll get a behind-the-scenes look at our research, exploring new ways to predict, explain, and ultimately decode biology. If this resonates, consider subscribing!