Virtual Cell @ ICLR: A Sea of Silos
Reflections from ICLR 2026
From April 23 to 27, ICLR brought the AI community to Rio, a city that only enhanced an already rich annual experience. Conferences are not only paper lists and poster sessions; they are an opportunity to get together and learn from new people working in similar areas, accelerating the exchange of novel ideas. The weather was warm, the city was alive, the food was excellent, and the mood was open. Between talks, workshops, lunch lines, food trucks, and late-night conversations, Rio gave the conference a different rhythm: less sterile and more human.
For those of us working at the intersection of machine learning and biology, ICLR 2026 was especially interesting. Because ICLR is one of the biggest conferences in AI, its workshop topics are an important indicator of the areas receiving global attention. It was exciting to see the program include multiple biology-focused workshops, which underscored that this research is no longer a niche side conversation but one that has firmly entered the mainstream.
The workshops included Machine Learning for Genomics Explorations (MLGenX), Generative AI in Genomics (Gen^2): Barriers and Frontiers, and Learning Meaningful Representations of Life (LMRL). Unfortunately, several biology workshops happened on the same day, often in overlapping sessions. This made the day intellectually rich but logistically difficult. For students, researchers, and industry scientists interested in biological foundation models, perturbation effect prediction, multimodal cellular data, and drug discovery, it often meant choosing between sessions that should probably have been in dialogue with one another.
That tension is a good metaphor for the field itself. There is more work than ever. There is more investment than ever. There are more models than ever. But the pieces are not yet organized into a coherent whole.
And nowhere was that clearer than in the conversation around virtual cells.
What Do We Mean by a “Virtual Cell”?
The term “virtual cell” is now everywhere. It appears in papers, talks, workshops, company decks, and funding narratives. But the more often the phrase is used, the less meaningful it becomes.
A narrow definition would say that a virtual cell is a model that predicts how a cell responds to a perturbation. Knock out a gene, add a chemical compound, change an experimental condition, and the model predicts the resulting gene expression profile.
That is useful, but it is not enough.
A more helpful definition is broader: a virtual cell should help us predict, explain, and discover. It should predict how cellular systems respond to interventions. It should explain why those responses happen, at least enough to generate biological insight. And it should help us discover something new, whether that is a mechanism of action, a therapeutic hypothesis, a biomarker, or the most informative experiment to run next.
Under that definition, virtual cell modeling becomes broader than any single task, modality, dataset, or benchmark. It includes transcriptomics and phenomics perturbation effect prediction, but also goes beyond them towards building general, useful computational representations of cellular behaviour.
This distinction is important because much of the current discussion around virtual cells has narrowed toward perturbation effect prediction from single cell RNA-seq. This is an important task, and it may become useful for parts of drug discovery, but treating it as the full definition of a virtual cell makes the ambition (and potential) of the field feel much smaller than it should be.
A cell is not just a vector of gene expression counts. It has morphology, spatial context, lineage, state, history, environment, and function. A useful virtual cell will eventually need to take into account many of these layers. If the field defines the entire ambition around narrow benchmarking tasks, it risks building a very small cathedral and calling it a city (just as the Cathedral of St. Sebastian, though large, is not all there is to Rio).
The Field Is Moving Fast. Maybe Too Fast
The last few years have produced a wave of models for transcriptomics perturbation effect prediction: GEARS, CellFlow, TxPert, STATE, X-Cell, and plenty more. They mostly try to predict how gene expression shifts under perturbation. Some explore flow matching or conditional generation, some inject biological knowledge bases. There is a lot happening, and it’s sometimes hard to know what to make of it all.
The ICLR 2026 conference showed how much creativity is entering perturbation effect modeling. It featured models that update biological priors, models that introduce new training paradigms, papers that define new splits, and papers that introduce new metrics. This is what an emerging field would look like: many groups trying different ways of turning messy biological measurements into predictive systems.
But the same creativity also exposed a weakness. One recurring pattern at ICLR 2026 was that many papers introduced a new model, a new understudied benchmark, and a new set of understudied metrics, then claimed state-of-the-art performance without benchmarking against many of the models mentioned above. Without shared evaluation conventions, progress becomes hard to read. A model evaluated on one split with one metric and one set of baselines cannot easily be compared to another model evaluated under a different setup. This is especially problematic when simple baselines are sometimes surprisingly strong, and when recent models are not always included in the comparison.
The result was less a shared scientific conversation than a collection of parallel claims. Each model could look strong under its own evaluation setup, but it was often difficult to know whether the field had actually moved forward.
This is not only an academic complaint. In drug discovery, weak evaluation can be expensive. A model that looks impressive on a convenient split may fail when asked to generalize to a new cell type, a new perturbation, a new disease context, or a new experimental platform. If virtual cell models are going to influence real decisions, the field needs to know what they are, when they work, when they fail, and why.
Recent benchmarking work has underscored this concern. A Nature Methods study comparing several deep learning and foundation model approaches for genetic perturbation effect prediction found that none outperformed deliberately simple baselines, highlighting the importance of critical benchmarking in this area. Another paper, PertEval-scFM (ICLR 2025), similarly argued that zero-shot single cell foundation model embeddings provide limited improvement over simple baselines for perturbation effect prediction, especially under distribution shift.
This is the uncomfortable truth: in most of the settings studied so far, simple baselines still match or beat cutting-edge models.
That does not mean virtual cell modeling is doomed. It means the field is young. It also means we should be wary of hype that rests only on a model having a memorable name, running on thousands of GPUs, scaling to billions of parameters, or using a very large training set.
Defining Virtual Cell Beyond the Hype
One of the most interesting moments came during the LMRL workshop panel. The session included Daniel Burkhardt from NVIDIA, Anton Osokin from Isomorphic Labs, Sara-Jane Dunn from Relation Therapeutics, and Ben Kompa from Lila Sciences, moderated by Kristina Ulicna from Iambic Therapeutics.
What stood out was the frustration. Across very different industry vantage points, the panelists kept circling the same complaint: too many claims, too many models, too many fragile benchmarks, and not enough agreement on what should count as progress.
That frustration is healthy. It suggests the field is beginning to mature. The first phase of a new research area often rewards proof-of-concept models and anything that shows the idea is plausible. The next phase asks for something harder: standardized tasks, robust metrics, strong baselines, careful splits, and honest reporting.
Virtual cells are entering that second phase, yet the incentives haven’t caught up.
Academia rewards novelty. Companies reward momentum. Investors reward narratives. But biology rewards truth, and truth is slower. If we are serious about using virtual cell models for drug discovery, the field needs to make peace with less glamorous work: benchmark design, dataset auditing, baseline construction, metric analysis, and failure characterization.
The Benchmarking Problem Is Not a Small Detail
One of the strongest lessons from ICLR was that we’re doing ourselves a disservice by treating evaluation as a secondary concern.
Several papers still relied on understudied baselines. Some compared against GEARS while not comparing against newer or stronger models. Others did not include simple mean baselines or nearest neighbor style baselines that are known to be surprisingly competitive in certain perturbation settings. Some emphasized simple Pearson correlation, without exploring Pearson Delta, and without deeper analysis of differential effects or perturbation specific biological fidelity.
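To make the "simple mean baseline" concrete, here is a minimal sketch. It assumes each perturbation is summarized as a pseudobulk vector of expression changes relative to control; the perturbation names and numbers are hypothetical, and published benchmarks may define their mean baselines differently.

```python
from statistics import fmean

def mean_baseline(train_profiles):
    """Predict the gene-wise mean of all training perturbation profiles.

    The same vector is returned for every unseen perturbation, so the
    baseline ignores which gene or compound is being perturbed.
    """
    n_genes = len(next(iter(train_profiles.values())))
    return [fmean(p[g] for p in train_profiles.values()) for g in range(n_genes)]

# Hypothetical pseudobulk changes (perturbed minus control) for 3 genes
train = {
    "KO_A": [0.0, -1.0, 0.5],
    "KO_B": [0.2, -0.8, 0.5],
    "KO_C": [0.1, -1.2, 0.2],
}

baseline_prediction = mean_baseline(train)
print(baseline_prediction)
```

Any new model should be scored next to this vector on the same split and metric. If it cannot beat a prediction that never looks at the perturbation identity, the reported gain is probably not about perturbation biology.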
This matters because perturbation effect prediction can be deceptively easy to make look good. Many genes do not change much, quite a few perturbations share broad effects, and batch effects can leak information. Beyond that, dataset curation can inflate performance (a point discussed in an ICLR 2026 paper), and splits can accidentally reward memorization rather than generalization.
In essence, a model can appear to predict biology when it is really predicting the structure of the dataset.
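A small sketch shows how this happens with plain Pearson correlation versus Pearson Delta. It assumes Pearson Delta means the correlation between predicted and observed changes relative to control; the expression values are invented for illustration, and the helper names are our own.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation (assumes neither input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def pearson_delta(pred, true, control):
    """Correlate predicted and observed changes relative to control."""
    d_pred = [p - c for p, c in zip(pred, control)]
    d_true = [t - c for t, c in zip(true, control)]
    return pearson(d_pred, d_true)

# Hypothetical mean expression for 5 genes
control = [10.0, 0.5, 5.0, 2.0, 8.0]
true_perturbed = [10.2, 0.4, 3.0, 2.1, 8.1]  # third gene is strongly repressed
predicted = [10.1, 0.6, 5.5, 2.0, 7.9]       # stays near control, wrong direction

raw = pearson(predicted, true_perturbed)
delta = pearson_delta(predicted, true_perturbed, control)
print(f"Pearson on expression: {raw:.2f}")
print(f"Pearson on deltas:     {delta:.2f}")
```

In this toy case the raw correlation is high because both vectors inherit the baseline expression levels of the control, while the delta correlation is negative because the prediction gets the only large effect backwards.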
That is why metrics need to be stable, interpretable, and difficult to game. They should not be overly sensitive to arbitrary preprocessing choices. They should distinguish between predicting the average cell state and predicting the actual perturbation response. They should reveal whether a model captures meaningful differentially expressed genes, directionality, and relevant biological programs, but also whether these signals are aligned with the downstream use cases we actually care about, from mechanism discovery to experimental prioritization.
This is where some of the most valuable work at the workshops appeared. One paper by Qiyuan Liu, on the effects of distance metrics and scaling on the perturbation discrimination score (PDS), went after a small but consequential problem: one of the field’s go-to metrics shifts dramatically depending on preprocessing and scaling choices, which leaves different models effectively incomparable. This is exactly the sort of issue that can quietly distort a field. A metric that is not comparable across models is not a reliable metric.
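The sensitivity is easy to reproduce on toy data. The sketch below uses a simplified rank-based discrimination score in the spirit of PDS, not the official definition and not the analysis from the paper: for each perturbation, it checks how many other true profiles sit closer to the prediction than the matching one. The profiles, the Manhattan distance, and the range-based scaling are all illustrative assumptions.

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def discrimination_score(pred, true, dist=manhattan):
    """Mean of 1 - rank/(N-1), where rank counts the non-matching true
    profiles that are closer to the prediction than the matching one."""
    names = list(true)
    scores = []
    for name in names:
        d_match = dist(pred[name], true[name])
        rank = sum(dist(pred[name], true[other]) < d_match
                   for other in names if other != name)
        scores.append(1.0 - rank / (len(names) - 1))
    return sum(scores) / len(scores)

def scale_by_gene_range(profiles, reference):
    """Divide each gene by its range across the reference profiles."""
    n_genes = len(next(iter(reference.values())))
    ranges = []
    for g in range(n_genes):
        vals = [p[g] for p in reference.values()]
        ranges.append((max(vals) - min(vals)) or 1.0)
    return {k: [x / r for x, r in zip(v, ranges)] for k, v in profiles.items()}

# Hypothetical profiles: gene 0 has a large dynamic range, gene 1 a small one
true = {"A": [10.0, 0.1], "B": [12.0, 0.9], "C": [20.0, 0.5]}
pred = {"A": [11.5, 0.15], "B": [12.0, 0.9], "C": [20.0, 0.5]}

raw_score = discrimination_score(pred, true)
scaled_score = discrimination_score(scale_by_gene_range(pred, true),
                                    scale_by_gene_range(true, true))
print(raw_score, scaled_score)
```

The predictions are identical in both calls; only the preprocessing changes. On the raw scale the high-range gene dominates the distance and the prediction for A looks closer to B, while after scaling it is ranked correctly, so the same model receives two different scores.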
The broader lesson is simple: we need fewer leaderboard illusions and more precise science.
A good benchmark should make progress harder to fake. This can be achieved by including strong baselines, defining biologically meaningful splits, and testing generalization across perturbations, cell types, contexts, and datasets. Such a benchmark would make preprocessing conventions explicit, and would tell us not just which model wins but what kind of biological features the model has learned.
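As one example of a split that tests generalization rather than memorization, here is a minimal sketch of holding out entire perturbations. The record layout (dictionaries with `perturbation` and `cell_line` keys) and the function name are assumptions made for illustration; the same idea applies to holding out cell types or datasets.

```python
import random

def split_by_perturbation(records, test_fraction=0.25, seed=0):
    """Hold out whole perturbations so none is seen during training."""
    perturbations = sorted({r["perturbation"] for r in records})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(perturbations)
    n_test = max(1, round(len(perturbations) * test_fraction))
    held_out = set(perturbations[:n_test])
    train = [r for r in records if r["perturbation"] not in held_out]
    test = [r for r in records if r["perturbation"] in held_out]
    return train, test

# Hypothetical records: 4 perturbations measured in 2 cell lines
records = [
    {"perturbation": p, "cell_line": c}
    for p in ["KO_A", "KO_B", "KO_C", "KO_D"]
    for c in ["line_1", "line_2"]
]

train, test = split_by_perturbation(records)
train_perts = {r["perturbation"] for r in train}
test_perts = {r["perturbation"] for r in test}
print(sorted(test_perts), "held out;", len(train), "training records")
```

A random split over individual records would place the same perturbation on both sides, which is one way a split ends up rewarding memorization.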
Foundation Representation Models Are Not Yet Virtual Cells
Another confusion worth clearing up is the way we increasingly talk about transcriptomics foundation models as if they were already virtual representations of cellular biology. Most of these models are, first and foremost, representation learners. They compress gene expression data into embeddings that can support downstream tasks such as cell state characterization, perturbation analysis, disease stratification, experiment design, or retrieval.
This does not make them unimportant. Quite the opposite. In the short term, foundation models for transcriptomics may be among the most useful outputs of the whole field for drug discovery. A good representation model can help organize large transcriptomic datasets, reduce dimensionality, identify related biological states, compare experiments, and support better decisions about what to test next.
They can also become important components of future virtual cell systems. A useful virtual cell will likely need strong representations of cellular states, just as a good simulator needs a meaningful internal description of what it is simulating. In that sense, representation models and simulation models are not competing directions. They are complementary pieces of the same longer term ambition.
The problem appears when we start treating these representation models as virtual cells by default, or when we evaluate their usefulness almost entirely through perturbation effect prediction. Perturbation effect prediction matters, and pretrained representations may help with it. But if we make this the main proof that a foundation model is biologically useful, we end up narrowing both the model and the field around one task.
This is especially limiting because representation models may already be useful in many other ways. They can help with mechanism of action analysis, cross dataset alignment, disease state modeling, patient stratification, experiment retrieval, modality integration, cellular morphology prediction, response biomarkers, and treatment outcome prediction. Many of these tasks are closer to how drug discovery decisions are actually made today.
So the question should not only be whether a foundation model can improve post-perturbation expression prediction but whether it gives us better biological structure, experimental decisions, links across modalities, and representations of disease. If we want to get the most out of foundation models, we need to develop them better, evaluate them better, and judge them on downstream tasks that reflect their actual use cases in discovery.
A virtual cell should ultimately be judged by the biological and therapeutic questions it helps answer. Foundation models should be judged the same way, without forcing all of their value through the narrow lens of perturbation reconstruction.
The Most Interesting Work May Be Outside the Narrow Definition
Some of the most exciting work we saw did not fit the narrow “predict perturbed transcriptome” mold.
One example came from Maria Brbić’s group at EPFL, focused on connecting single-cell transcriptomics to morphology imaging. This direction feels closer to the broader promise of virtual cells. If a model can connect gene expression to visual phenotype, it can help us understand how molecular changes manifest as cellular behavior. That is deeply relevant to gene function, mechanism of action, and disease biology.
This kind of cross modality work matters because cells are not studied by biologists as transcriptomes alone. They are observed through a plethora of assays (e.g., images, sequencing, spatial maps, protein readouts, viability measurements, electrophysiology) that give biologists a better overall understanding of the cell state. A virtual cell that cannot connect across readouts will remain limited.
Another interesting direction was specialized foundation models, such as EVA from Scienta Lab, which focuses on immunology. The specialization is the point. Not every model needs to be a universal model of all cells. In fact, the universal ambition may be premature.
For therapeutic discovery, a model that is genuinely useful in immunology, oncology, neuroscience, or fibrosis may be more valuable than a general model that performs modestly everywhere. The question is not whether a model can classify cell types in a generic benchmark, but whether it can support decisions that matter in a specific biological domain.
And beyond this, the evaluation of general models should expand its horizons. Cell type classification tasks have been useful historically, but they should not dominate our imagination. While this task is quite important and relevant, it remains insufficient on its own. A model that classifies cell types well may still be weak at predicting drug response, discovering mechanisms, or identifying patient relevant biology.
Architecture Still Matters
A growing assumption in modern AI is that scale will solve most problems. Train a larger model on more data, and performance will eventually improve. This “bitter lesson” has shaped much of machine learning.
But biology may not be ready for that lesson yet.
In transcriptomics, we may still be in the architectural exploration phase. Many models borrow ideas from language or vision, but biological data comes with its own structure. Gene expression measurements are sparse, noisy, compositional, assay dependent, and shaped by both biology and experimental artifacts.
Different assays have different noise profiles. Different datasets have different preprocessing histories. Counts, normalized values, ranks, gene panels, batches, and perturbation designs all influence what the model sees.
That means architecture choices like decoder activation, loss functions, data normalization, and inductive biases matter. We should not assume that scaling a generic architecture will automatically produce a meaningful virtual cell, or even a useful model.
This was one of the intriguing points from newer transcriptomics focused models such as our TxFM model from Recursion, which emphasized representation learning and broad evaluation, thanks to a newly introduced activation function (Rectified Sigmoid) adapted to transcriptomics data distributions. The notable claim was not only performance but efficiency: strong results with far less data (up to 100x) than existing competing approaches. The direction is important. Better architecture and better data use may matter as much as, or even more than, raw scale in biology.
The field needs models designed from the ground up for biological measurements, not merely imported from other domains with a gene vocabulary attached.
The Missing Layer: Patients
Here is the hardest question: how does any of this help actual patients?
Most perturbation effect prediction work is still centered on in vitro systems. These systems are essential. They let us perturb, measure, and learn in ways that are impossible in humans. But drug discovery ultimately cares about patients, not plates.
The translation gap is large. A perturbation that looks promising in a cell line may fail in a tissue, and mechanisms observed in vitro may not dominate in disease. If virtual cell models cannot eventually connect to patient biology, their therapeutic impact will be limited.
This is why patient centered modeling deserves much more attention. We need ways to connect cellular perturbation models to patient data, clinical phenotypes, disease progression, treatment response, and outcomes. We might call this “virtual patient” modeling, though the term should be used carefully to avoid becoming as overloaded as virtual cells. The point is not to simulate a whole person tomorrow, but to make the bridge from simulation to patient outcome more explicit.
That bridge will be difficult because the data is hard to obtain, expensive, and incomplete. Patient data is also often observational, heterogeneous, sparse, and confounded. But that is exactly why the problem matters. Research is not only about solving the tasks with convenient data. It is about identifying the tasks that, if solved, would change what we can do. Virtual cells should not become an isolated benchmarking game. They should become part of a chain that connects perturbation, mechanism, tissue context, disease state, and patient outcome.
While most of the in silico perturbation effect prediction discussion at ICLR was still happening at the cellular level, there were also a few interesting attempts to move the conversation closer to patients. Rather than viewing the cell in isolation, several emerging frameworks attempt to anchor molecular data in clinical reality.
Two examples stood out in that direction.
The first was M-Optimus, a large multimodal foundation model that tries to bridge several biological scales at once. While it is a proprietary, closed source model, it is still a step in the right direction, as it brings together many patient data modalities (histology, bulk transcriptomics, spatial transcriptomics, and longitudinal clinical records), and tries to connect what we see in tissue images, what we measure molecularly, where these signals happen spatially, and how they relate to the patient over time.
The second was SCOPE, which approaches the problem from another angle: how do we represent patients using the cellular composition of their samples? Instead of treating a patient as one flat vector, SCOPE learns patient embeddings based on the distribution and organization of the cell types inside the patient cohort. This is interesting because it gives a way to move from single cell information to patient level similarity. And that is one of the missing bridges in the field right now: not only modeling cells, but understanding how collections of cells define disease states, treatment groups, and eventually clinical outcomes.
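To illustrate the general idea of moving from cells to patients, here is the simplest possible composition-based embedding. This is not SCOPE's method: it only counts cell type proportions per sample and ignores any notion of organization. The cell type labels, patient names, and the choice of total variation distance are all assumptions made for the example.

```python
from collections import Counter

def composition_embedding(cell_types, vocabulary):
    """Represent a sample as the fraction of cells of each type."""
    counts = Counter(cell_types)
    return [counts[t] / len(cell_types) for t in vocabulary]

def total_variation(p, q):
    """Distance between two compositions: 0 is identical, 1 is disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

vocabulary = ["B", "Myeloid", "T"]

# Hypothetical per-cell annotations for three patient samples
patients = {
    "patient_1": ["T", "T", "B", "Myeloid"],
    "patient_2": ["Myeloid", "T", "B", "T"],
    "patient_3": ["Myeloid", "Myeloid", "Myeloid", "T"],
}

embeddings = {name: composition_embedding(cells, vocabulary)
              for name, cells in patients.items()}

d_12 = total_variation(embeddings["patient_1"], embeddings["patient_2"])
d_13 = total_variation(embeddings["patient_1"], embeddings["patient_3"])
print(d_12, d_13)
```

Even this crude representation turns single cell annotations into a patient level similarity. Learned approaches aim to capture much more than proportions, but the output has the same shape: one vector per patient that can be compared, clustered, or linked to outcomes.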
These works are still early, and they are far from solving the translation problem on their own. But they point to the right question. While ICLR focused far more on virtual cells than on patients, these works represent an optimistic step towards placing cellular modelling inside a wider biological chain: from perturbation, to mechanism, to tissue context, to patient response.
What Should the Field Do Next?
After ICLR 2026, the main takeaway is that virtual cell research does not need another wave of lightly evaluated models making broad claims. It needs infrastructure, better benchmarks to evaluate different models, and a clearer path towards patients.
The most impactful project for a PhD student or research team right now may not be the 101st “virtual cell” model. It may be a rigorous benchmark suite.
That benchmark suite should gather existing datasets, audit their limitations, define meaningful splits, compare metrics, include simple and strong baselines, and evaluate models across tasks that matter. It should test perturbation effect prediction, representation learning, cross modality translation, mechanism discovery, and patient relevance where possible. The Arc Institute’s Virtual Cell Challenge was an important first step in this direction, because it helped push the benchmarking question to the front of the field rather than treating it as an afterthought. But this now needs to go further: the next generation of benchmarks should make it harder to win by exploiting preprocessing artifacts or dataset shortcuts, and easier to understand what kind of biological capability a model has actually learned.
The opportunity is significant. A reliable framework for in silico perturbation effect prediction could meaningfully accelerate experimental design, target discovery, mechanism of action studies, toxicity prediction, and therapeutic hypothesis generation. It could help us decide which experiments are worth doing and which biological stories are worth trusting. And to get there, we obviously need careful evaluation.
ICLR 2026 made the picture quite clear. There is excitement, talent, investment, and visible technical progress. But there is also fragmentation and a growing risk that the field mistakes activity for actual progress.
The next phase of virtual cell research should be less about claiming we have virtualized biology and more about proving, carefully and repeatedly, which pieces of cellular behaviour we can actually model consistently and in which contexts.
That may sound less glamorous, but it is exactly how the field becomes useful.
This post is part of “Inside Valence”, a series where you’ll get a behind-the-scenes look at our research, exploring new ways to predict, explain, and ultimately decode biology. If this resonates, consider subscribing!






