Inside Valence: Thoughts from Inside the Lab

Deep Learning for Microscopy: What Are We Really Learning?

Ihab — Wed, 03 Jun 2026 13:50:57 GMT

TL;DR: Microscopy AI models must be evaluated against simple baselines to understand when they are learning genuine biology and when they are exploiting technical shortcuts like cell density or pixel intensity. While the field is shifting toward massive foundation models, a few benchmarks reveal that some trained networks capture about as much biological signal as untrained models, even when the biological signal is significant. Massive in-domain models can successfully capture subtle phenotypes, but interpretable baselines are needed to contextualize their true capabilities. To further advance drug discovery, we must make performance interpretability a core design principle of our benchmarks.

Connect with us: Valence is constantly seeking talented individuals with diverse backgrounds and expertise to join our team. Explore open roles here.

The complexity of the image produced by a microscope is often underappreciated.

When taken out of context, it is just an image of cells. But, in the context of a perturbation experiment, it contains more information: it shows how cells transform when some gene is perturbed or a chemical compound is added. It captures changes to the cell’s features like its shape, texture, intensity, spatial distribution of organelles, and overall arrangement.

This explains the growing importance of microscopy as a major source of phenotypic data in drug research and development. This data allows the researcher to analyze large numbers of experiments and observe different properties of the cells: toxicity, mechanism of action, cellular state, response to stresses, and pathway activity.

Some of this information is invisible to a human analyst, and much of it is rather subtle: it becomes statistically significant only after comparing large numbers of samples and many microscopic fields of view.

This is precisely where deep learning comes into play.

Part of my own interest in this question comes from some of my prior work benchmarking single cell transcriptomics models. In that setting, I have repeatedly seen that impressive benchmark numbers can look less impressive once they are compared against simple but well-chosen baselines. A model may appear to understand biology, when it is partly exploiting batch effects, dataset structure, cell-state prevalence, or other shortcuts. That experience made me wonder how much of the same story applies to microscopy.

The power of neural networks for natural image analysis is well-understood. Neural networks are increasingly used in several fields of life science research, including transcriptional profiling and histological pathology analysis. Transcriptional profiling and histological pathology analysis share similarities with microscopy imaging: the former relies on matrices of gene expression data while the latter uses images of tissue sections. Further, like transcriptomics, microscopy can answer questions like: what happens to the biology upon perturbation?

We’ve moved beyond asking whether microscopy images contain biological signals; we know they do. The harder question, and the one I care about most as a benchmarker, is whether our models, and our benchmarks, can tell us what part of the biological signal they have learned.

The type of biological imaging we refer to

When “biological imaging” is mentioned, usually people think about histopathology with stained tissues, cancer diagnosis, and clinical pathology. This is an important branch in biology, and it has been a focus of the machine learning community.

However, the area we refer to is something quite different. We speak about preclinical microscopy imaging: Cell Painting, brightfield imaging, high-content screening, and large scale cell perturbations.

Figure 1: RxRx1 Cell Painting images

Cell Painting is perhaps the best example to explain our topic. Here, cells are treated with a limited number of fluorescent probes, which stain cellular components. As a result, we can identify the nuclei, mitochondria, endoplasmic reticulum, actin cytoskeleton, Golgi apparatus, ribonucleoprotein granules and other morphological landmarks. This technique does not help us analyze individual markers. Instead, it allows us to obtain a complete phenotypic profile for each cell.

For instance, in a typical high content screen, thousands of drugs or genetic modifications can be tested in various plate conditions, then imaged and computationally analyzed for their phenotypic similarity. Thus, if two drugs create a similar profile, we can infer that they work through a common molecular mechanism. Likewise, phenotypic alignment between a gene knockout and a drug can identify the drug’s potential target. Finally, any particularly robust or unique phenotypes uncovered by these screens serve as high-priority candidates for downstream investigation.

This is a simple idea with large implications: instead of measuring only whether a cell lives or dies, we can measure how the cell changes and use that information to infer potential targets and/or treatments.

From handcrafted morphology to foundation models

While the first generation of image-based profiling did not require deep learning, it required intensive image analysis.

Thanks to tools like CellProfiler, cells could be segmented and annotated based on image morphology, including size, shape, texture, intensity, granularity, radial distribution, and much more. This transformed the cell from an image into tabular data, allowing for applications of bioinformatic methods including: normalization, quality control, aggregation, similarity search, clustering, and statistics.

This was an important milestone. It enabled computability of microscopy images. At the same time, no deep learning was needed to analyze cells. The feature space was pre-defined, interpretable, and powerful enough to sustain years of morphological profiling.

Handcrafted features had their own limitations, though. They could only reflect patterns that we knew how to extract. There might exist biological patterns which are real, reproducible, and meaningful, yet impossible to encode using a conventional texture or shape descriptor.

That is where machine learning came into play. Instead of specifying the features that one must analyze, machine learning algorithms can be trained to learn the relevant representation of data automatically.

The evolution of our methods followed a predictable path. At first, there were specialized image-based CNNs, followed by models trained under weak supervision for perturbation or experimental label prediction. Most recently, the field moved towards self-supervised and foundation models: large-scale neural networks trained on vast microscopy datasets and applied to different downstream tasks.

The promise is attractive. In principle, a microscopy foundation model would allow us to overcome many limitations of handcrafted feature-based approaches. It could serve as a universal feature extractor for biological images, streamline profiling workflows, improve perturbation retrieval, facilitate MOA discovery, and open up image-based profiling to people without the ability to construct a computer vision pipeline.

On the whole, the ambition looks pretty similar to what happened in the field of natural image understanding. A trained model becomes useful due to having learned reusable patterns of data: starting from edges, textures, and object parts, moving up to higher-level scene semantics.

This ambition was ported over to microscopy data. A key advantage of microscopy is that it is scalable: once a method has been standardized, very large phenotypic datasets can be collected with relative ease. That made it prime ground for scaling deep learning applications in drug discovery.

However, scale also changes the practical shape of the field: who can train these models, who can evaluate them properly, and who can reproduce or challenge the conclusions.

Size of the Dataset: To begin, datasets in the microscopy field are gigantic. Transcriptomics datasets are often big as well, but they typically involve a relatively sparse matrix of genes and cells. Microscopy datasets consist of images in one or more channels across many plates, sites, and fields. Public Cell Painting resources can go as high as hundreds of terabytes, and industrial datasets can go far higher than that. This significantly changes who is allowed to join the party. One can just download a transcriptomics dataset onto a laptop computer and analyze it locally. Trying to pull a big imaging screen off the server and process it is a totally different beast. This is not an argument against scale. In fact, microscopy is one of the areas in biology where scale has clearly mattered. The issue is that the same scale that enables strong representations also makes careful benchmarking expensive, both at training time and at inference time. That creates a need for smaller but representative benchmarks, precomputed embeddings, and curated subsets that let more laboratories participate.
Experimental Protocol: Microscopy experiments depend quite a bit on the particular setup and conditions under which they are performed. The microscope, staining protocol, plate format, cell density, image focusing, illumination, batch, vendor, and acquisition settings all influence the way an image is collected. Some of these effects are biological, others technical, some both. It is easy for a model to latch onto characteristics specific to one collection that will not generalize to others.
Interpretability: Gene names are intuitive. There’s even literature associated with pathways and protein interactions. Image embeddings, on the other hand, are not. It could be true that a model knows something meaningful about the differences between two perturbations from their images, but the researcher still wants to know what that is.

There have been big changes in recent years that have helped us overcome all three of these challenges.

Publicly available data sources like JUMP-CP and the Cell Painting Gallery have allowed us to think more clearly about the scale needed for modern representation learning, with efforts to reduce dataset size while retaining the same number of samples and performance. The availability of large-scale industry datasets has accelerated this trend.

There have also been architectural advancements that have made microscopy-focused modeling possible. Channel-adaptive or channel-agnostic modeling is an essential example. Models for natural images assume three input channels: red, green, and blue. Microscopy works differently. Some assays use one brightfield channel. Other assays use five fluorescent channels. Yet another may use a different number or combination of these channels with a different order. The useful model would need to be flexible to different configurations of the channels.

That’s where channel-agnostic masked autoencoders and their DINO analogs come in. It’s not merely a matter of techniques but a matter of embedding the biology of microscopy and its experimental reality into foundation models.
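
To make the channel problem concrete, here is a minimal NumPy sketch of one way to be channel-agnostic: patchify every channel on its own and push all patches through a single shared projection, so the token count grows with the number of channels while the weights stay the same. The function name `channel_agnostic_tokens`, the patch size, and the random projection are illustrative assumptions, not the tokenizer of any particular published model.

```python
import numpy as np

def channel_agnostic_tokens(image, patch, proj):
    """Turn a (C, H, W) image into (C * n_patches, dim) tokens.

    Each channel is patchified on its own and passed through the same
    projection, so the function accepts any number of channels.
    """
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (C, gh, patch, gw, patch) -> (C, gh, gw, patch, patch)
    tiles = image.reshape(c, gh, patch, gw, patch).transpose(0, 1, 3, 2, 4)
    flat = tiles.reshape(c * gh * gw, patch * patch)
    return flat @ proj

rng = np.random.default_rng(0)
patch, dim = 8, 16
proj = rng.normal(size=(patch * patch, dim))  # hypothetical shared weights

brightfield = rng.random((1, 32, 32))  # toy one-channel assay
cell_paint = rng.random((5, 32, 32))   # toy five-channel assay

tokens_bf = channel_agnostic_tokens(brightfield, patch, proj)
tokens_cp = channel_agnostic_tokens(cell_paint, patch, proj)
print(tokens_bf.shape, tokens_cp.shape)  # (16, 16) (80, 16)
```

A real model would add channel and position information to each token before feeding a transformer; the sketch only shows why the input layer no longer cares how many channels an assay has.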

And of course, there’s scale. With larger models being trained on larger microscopy data collections, we can get better representations, particularly with training and evaluation data similar to our downstream biological task. It means that some of the scaling effects that have proven successful in natural image and language models are genuinely relevant in microscopy too.

But a key question has surfaced after all these advancements: What have the microscopy models actually learned?

Benchmarking what models really learn

A benchmark score is valuable only insofar as we understand what it proves.

On natural images, good performance on ImageNet typically indicates that the model has developed a representation of visual features suitable for object recognition. While this is by no means a perfect benchmark, well over a decade of scrutiny has helped to elucidate many of its failings.

Benchmarks in microscopy are far newer, though there have been many successful efforts improving these in different academic and industrial contexts.

One useful evaluation, proposed in recent microscopy benchmarks, is to ask whether a representation can recover known biological relationships. In practice, perturbations are embedded, ranked by similarity, and evaluated by how well they retrieve known relationships such as shared mechanism of action, gene function, pathway membership, or other curated biological links. This kind of recall is useful because it is closer to the scientific question than ordinary classification accuracy: do phenotypes that should be biologically related end up close to one another?

This gives us a concrete starting point for asking whether a model is organizing images according to biology, rather than merely exploiting nuisance variation.
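
As a rough illustration of this kind of recall, the sketch below embeds toy perturbations, computes all pairwise cosine similarities, and counts how many known pairs land in the top 5% of that distribution. The function name `relationship_recall`, the 5% cutoff, and the synthetic embeddings are assumptions made for the example; published benchmarks differ in their exact thresholds and in how relationships are curated.

```python
import numpy as np

def relationship_recall(emb, known_pairs, top_frac=0.05):
    """Fraction of known pairs whose cosine similarity lands in the
    top `top_frac` of all pairwise similarities."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    iu = np.triu_indices(len(emb), k=1)  # each unordered pair once
    threshold = np.quantile(sim[iu], 1.0 - top_frac)
    hits = [sim[i, j] >= threshold for i, j in known_pairs]
    return float(np.mean(hits))

rng = np.random.default_rng(0)

# Toy data: 20 perturbations forming 10 related pairs (hypothetical).
base = rng.normal(size=(10, 32))
emb = np.repeat(base, 2, axis=0) + 0.1 * rng.normal(size=(20, 32))
known_pairs = [(2 * k, 2 * k + 1) for k in range(10)]

structured = relationship_recall(emb, known_pairs)
shuffled = relationship_recall(rng.normal(size=(20, 32)), known_pairs)
print(structured, shuffled)
```

The second call scores embeddings that carry no relationship structure at all, which gives the chance-level reference any reported recall should be read against.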

The question now is what additional diagnostic baselines can tell us about what drives a score in these benchmarks.

A model may well have learned biological morphology, which is exactly the intent in the field. However, it may also have learned intensity statistics, cell density, plate layout, acquisition bias, staining differences, focus statistics, or dataset structure. All of these can contribute to the benchmark score without implying any biological abstraction.

A robust model ought to be compared not only against other trained models, but also against simpler baselines. How much information can be learned about biological samples from pixel intensities alone? What can be learned from cell counts alone? Can a biological signal be extracted from an untrained neural network? How much information about structure can be inferred independent of image intensity? Simple as they might seem at first glance, these baselines demonstrate how much of the task can be solved independent of the desired learning capability. They help distinguish performance that reflects reusable biological structure from performance that reflects shortcuts specific to the benchmark.
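
A pixel-intensity baseline can be as small as the following sketch, which summarizes every channel of an image with four numbers. The choice of statistics and the name `pixel_stat_features` are assumptions for illustration, and the toy "treated" images differ from "control" only by being dimmer, as a crude stand-in for a density or toxicity effect.

```python
import numpy as np

def pixel_stat_features(images):
    """Map (N, C, H, W) images to (N, 4 * C) per-channel statistics:
    mean, standard deviation, 5th and 95th percentile."""
    n, c = images.shape[:2]
    flat = images.reshape(n, c, -1)
    feats = [
        flat.mean(axis=2),
        flat.std(axis=2),
        np.percentile(flat, 5, axis=2),
        np.percentile(flat, 95, axis=2),
    ]
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
control = rng.random((8, 5, 16, 16))        # toy five-channel images
treated = 0.5 * rng.random((8, 5, 16, 16))  # same texture, half the intensity

feats = pixel_stat_features(np.concatenate([control, treated]))
print(feats.shape)  # (16, 20)

# The first five columns are per-channel means: the two groups separate
# perfectly without any learned representation.
print(feats[:8, :5].min() > feats[8:, :5].max())  # True
```

If features like these already score well on a benchmark task, that task is telling us more about intensity than about morphology.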

Comparing a trained model against weak baselines yields little insight. Against powerful diagnostic baselines, however, the model will reveal plenty.

That is the problem my collaborators and I examine in our new paper at ICML 2026, Deep Learning for BioImaging: What Are We Learning? I should be explicit about my position here: I work at Valence Labs @ Recursion, and one of the strongest-performing models discussed below comes from our own work. The point is not simply to show that a larger in-domain model performs well, but to use it as a reference for asking what kind of biological evidence this performance reflects.

The paper examines representation learning in bioimaging over two scales of imaging: cell culture microscopy and tissue imaging. Rather than simply asking which model scores highest on the given benchmark, the work asks what this score represents.

To do so, the paper re-examines existing benchmarks through the use of very simple yet insightful baselines:

In cell culture microscopy, trained microscopy models are compared with untrained neural networks and pixel-statistics features.
In tissue imaging, image foundation models are compared with structure-only representations derived from cellular organization.

These baselines are intentionally unambitious. They do not aim to propose novel foundation models. Rather, they aim to reveal the limitations of the benchmarks themselves.
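
For intuition about what an untrained network can extract, here is a deliberately tiny stand-in: one layer of random convolution filters, a ReLU, and global average pooling. It is not the untrained baseline used in the paper; the single layer, the filter count, and the name `random_conv_embedding` are all assumptions chosen to keep the sketch short.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def random_conv_embedding(image, filters):
    """Embed a (C, H, W) image with fixed random filters of shape (F, C, k, k)."""
    # windows has shape (C, H - k + 1, W - k + 1, k, k)
    windows = sliding_window_view(image, filters.shape[-2:], axis=(1, 2))
    response = np.einsum("chwij,fcij->fhw", windows, filters)
    # ReLU followed by global average pooling gives one value per filter.
    return np.maximum(response, 0.0).mean(axis=(1, 2))

rng = np.random.default_rng(0)
filters = rng.normal(size=(32, 5, 3, 3))  # never trained
image = rng.random((5, 24, 24))           # toy five-channel image

embedding = random_conv_embedding(image, filters)
print(embedding.shape)  # (32,)
```

Nothing in these weights has seen a cell, so whatever retrieval performance such an embedding achieves comes from architectural bias and low-level image statistics.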

The results are telling.

In a few cell-culture microscopy tasks, multiple trained models perform rather similarly to untrained and extremely simple baselines. The implication is clear: part of the benchmark signal does not require high-level biological representation learning. Architectural bias and low-level image statistics alone contain enough information in many instances.

At the same time, the largest in-domain model we evaluate, the MAE-G/8 model, is clearly above these baselines on the evaluated tasks. I think the cautious interpretation is this: scale, in-domain biological data, and the training recipe appear to add information beyond obvious low-level confounders, but the baselines help clarify where that extra information is actually needed.

In other tasks, especially those where the biological signal is either more subtle or better aligned with the pre-training data, in-domain foundation models show their benefits compared to simple baselines. The takeaway is that benchmark scores should be calibrated to the specific task by using biologically interpretable and informative baselines for comparison.

A model’s performance might incorporate various components: true biological information, useful visual priors, technical correlations, and domain-specific shortcuts where the presence of specific biological components can be correlated to the outcome of a specific downstream task and not another.

Figure 2: Average recall of known biological relationships on the rxrx3-core benchmark for different pretrained models vs. untrained model weights.

For me, as both a model builder and someone who cares about benchmarking, this says something simple: there is no roadblock in front of us. Models are succeeding on different downstream tasks. What is still missing is a sharper set of interpretable tools for measuring what they have learned, when they should be trusted, and which model is appropriate for which downstream task.

If a simple baseline performs well in solving a given task, it’s a piece of information about the task, its requirements, and what kind of evidence is required for claims that a given model learns reusable biological structures. And, more importantly, it’s a valuable insight on how to build a benchmark properly. Namely, in order to test deep representation learning ability, it should include measurements that separate correlations from the underlying biological signal.

At the core of our approach is a single concept: advances in microscopy representation learning need better interpretability, both at the model level and at the downstream task level.

Better interpretability tells us whether representations generalize between experiments, whether morphological information was captured independently of density features, and whether a foundation model really did learn subtle phenotypes.

Implications for the field

What’s next for deep learning in microscopy research? Scaling up has been shown to be a successful recipe, especially when trained with high quality data. What is missing is a more mature interpretability and explainability culture.

Several directions would accelerate practical, high-impact deep learning applications:

Accessible datasets are needed. Currently, many of the largest-scale imaging screens are simply too big to be handled by ordinary laboratories. Compressed benchmarks, curated subcollections, precomputed embeddings, and cloud-native tools are all necessary steps. Accessibility matters because it allows new participants to emerge. Whether it is through crops, smaller image sizes, or data deduplication, we need to find ways to preserve good model performance while avoiding the massive memory overhead of microscopy imaging data. Several works are going in this direction.
Baselines need to become more portable and more routine. The next step is to make such comparisons easier to run for new datasets and new tasks: using pixel statistics, cell counts, feature engineering, plate information, structural information, and precomputed embeddings where possible. This should not be seen as a secondary consideration. Baselines are integral to interpretation and adoption.
Interpretability needs improvement. Here, it should not be assumed that every learned feature corresponds to a well-defined biological concept. That is not always true for biology. However, biologists will still need ways of connecting image-based features to morphology, pathways, mechanisms, quality assessment, and experiments. A good microscopy model is more than an embedding.
Usable software is essential. What allowed CellProfiler to revolutionize microscopy was, in part, its accessibility. Deep learning applied to microscopy will also need that spirit: models that are easy to use as tools; embeddings that easily integrate into workflows; documentation written for biologists; and benchmarks accessible to graduate students without their own specialized infrastructure.

This may be one of the most profound takeaways from the history of bioinformatics. A technique is powerful if and only if it becomes integrated into normal scientific practice.

One may call microscopy an underdeveloped cousin of transcriptomics, and in a certain sense, it is true. Transcriptomics boasts standardized data formats, mature statistics, and straightforward transition from feature vectors to biological interpretations.

Yet microscopy has something else: phenotypes, spatial arrangements, scalability, industrial use, and potential to uncover phenomena not amenable to molecular interpretation.

The point is not to pit the two against each other. The most effective biology ML models will likely incorporate all types of input data: gene expression, morphology, chemical structures, proteomics, spatial relations, and existing biological knowledge. And all modalities are bound to have gaps in their coverage of biological phenomena. At the same time, they will show the same reality from different perspectives.

This makes microscopy essential for the future of biological ML models because of its ability to capture not what a cell holds but who a cell is.

Yet for that perspective to work well, we should understand what machine learning algorithms see when they analyze microscopic images.

This post is part of “Inside Valence”, a series where you’ll get a behind-the-scenes look at our research, exploring new ways to predict, explain, and ultimately decode biology. If this resonates, consider subscribing!

From Crystals to Drugs: What Drug Discovery Can Learn From Materials Science as Revealed at ICLR’s AI4Mat 2026

Cristian Gabellini — Tue, 12 May 2026 13:03:52 GMT

tl;dr: ICLR’s AI4Mat 2026 revealed that the tools built to simulate materials science are the same tools that could simulate biology. The problem is that a crystal structure has thousands of atoms whereas a realistic drug target has hundreds of thousands, moving and shifting over timescales as long as milliseconds. While the methods aren't yet scalable to drug discovery, this blog post describes what it would actually take to close that gap, and why that is important for anyone trying to build simulators of biological complexity.

Two fields, one problem, one solution

The boundary between materials science and drug discovery is dissolving faster than most practitioners in either field might have noticed. In many cases, the same tools are being built for different domains within chemical space: generative models to propose novel crystal structures, flow-matching frameworks to optimize electrolyte compositions, and agentic systems to reason about synthesis routes.

AI4Mat 2026, a workshop held at ICLR in Rio de Janeiro, brought together work spanning both worlds. Looking across the accepted posters, it became clear that the materials discovery field is, whether knowingly or not, building tools that benefit the drug discovery field as well. The question is whether it is building them at the right scale.

Several papers at this year’s workshop sit directly at the intersection of materials science and drug discovery. In drug discovery, both the spatial scale of the systems and the temporal scale of the relevant dynamics are usually much larger than in the materials settings on which many of these methods are developed. From that perspective, Synthesis-constrained molecular design with direct optimization of reaction conditions addresses one of the most persistent failure modes in computational drug discovery: generating molecules that are chemically interesting but experimentally inaccessible. By jointly optimizing molecular structure and the reaction conditions required to make it, the work highlights that no desirable drug-like property is meaningful if synthesis is inaccessible. Similarly, FragmentFlow: Scalable Transition State Generation for Large Molecules tackles reaction pathway modeling in a way that explicitly addresses distribution shifts across molecular size, a capability that matters enormously when you need to reason about metabolic stability or covalent warhead reactivity in a drug candidate. And When Does Context Help? A Systematic Study of Target-Conditional Molecular Property Prediction asks exactly the right question for structure-based drug design: how much does knowing the target actually improve your property predictions, and under what conditions?

Taken together, these papers point toward a broader issue that matters much more in drug discovery than in most current materials benchmarks. Scale has two dimensions, spatial and temporal:

Spatial: the size and complexity of the molecular system itself, from small, well-ordered structures to heterogeneous biological assemblies that contain tens or hundreds of thousands of atoms and whose behavior is shaped by their surrounding medium.
Temporal: the timescale over which the relevant phenomena unfold, from local rearrangements that happen quickly to conformational transitions, binding events, and allosteric effects that emerge only over much longer dynamics.

Most methods still operate comfortably only when at least one of those axes remains limited.

The timescale problem and the two ways to attack it

Pointing at scale in terms of atom count is only half the story. The other axis is time. Even if you could simulate a 100,000-atom protein-ligand system at the right level of theory, classical molecular dynamics would still be a bottleneck: the relevant conformational changes happen on timescales that are simply inaccessible to femtosecond integration steps, no matter how many GPUs you throw at the problem.
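
A back-of-the-envelope calculation shows the size of the gap. The numbers below are assumptions chosen for illustration: a 2 fs timestep, a target of one millisecond of dynamics, and a hypothetical throughput of 100 ns of simulated time per day for a large solvated system. Real figures vary with hardware, force field, and system size.

```python
timestep_fs = 2.0              # assumed integration timestep, in femtoseconds
target_ms = 1.0                # assumed target: one millisecond of dynamics
throughput_ns_per_day = 100.0  # hypothetical simulation throughput

FS_PER_MS = 1e12  # 1 ms = 1e-3 s and 1 fs = 1e-15 s
NS_PER_MS = 1e6   # 1 ms = 1e-3 s and 1 ns = 1e-9 s

steps = target_ms * FS_PER_MS / timestep_fs
days = target_ms * NS_PER_MS / throughput_ns_per_day

print(f"{steps:.1e} integration steps")                   # 5.0e+11 integration steps
print(f"{days:,.0f} days, about {days / 365:.0f} years")  # 10,000 days, about 27 years
```

Because the steps are sequential, adding hardware mostly makes each step faster rather than reducing how many steps are needed, which is why the two strategies below change the algorithm instead.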

The field has converged on two serious responses: either learn to sample the thermodynamic ensemble directly, bypassing dynamics altogether, or learn to take much larger steps through time without accumulating errors. Both strategies are represented at AI4Mat 2026, and together they paint a picture of how the field is beginning to engage with the timescale problem.

Sampling the ensemble and skipping frames

The first strategy is exemplified by Boltzmann Generators for Condensed Matter via Riemannian Flow Matching. The original Boltzmann Generator framework was conceived precisely for biomolecular systems: the goal was to train a generative model that could sample thermodynamic ensembles directly, bypassing the crippling timescale problem of classical molecular dynamics entirely. The core capability being developed here, learning to generate configurations that correctly represent a Boltzmann distribution over a complex energy landscape, is exactly what is required to model molecular properties that depend on ensembles rather than single structures. Such properties, whether equilibrium or non-equilibrium, are inherently dynamic: they emerge not from a single structure, but from the distribution of states a system explores and the transitions between them across time. A model that cannot sample that landscape is not modeling the physical property itself but instead just a snapshot of it.
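
The core mechanic of sampling an ensemble without running dynamics can be sketched in one dimension: draw samples from a tractable proposal, then reweight them by the ratio of the Boltzmann factor to the proposal density. In a Boltzmann Generator the proposal is a trained generative model with a tractable density; the fixed Gaussian, the double-well energy, and the sample count below are assumptions that keep the example self-contained.

```python
import numpy as np

def energy(x):
    """Toy double-well potential, expressed in units of kT."""
    return 4.0 * (x**2 - 1.0) ** 2

rng = np.random.default_rng(0)

# Proposal: a fixed Gaussian standing in for a learned generative model.
sigma = 1.2
x = rng.normal(0.0, sigma, size=200_000)
log_q = -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Self-normalized importance weights toward exp(-U) / Z.
log_w = -energy(x) - log_q
w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
w /= w.sum()

x2_reweighted = np.sum(w * x**2)  # ensemble average of x^2
left_well = np.sum(w[x < 0])      # population of the left well

# Reference value from direct numerical integration on a grid.
grid = np.linspace(-3.0, 3.0, 20001)
p = np.exp(-energy(grid))
x2_reference = np.sum(grid**2 * p) / np.sum(p)

print(x2_reweighted, x2_reference, left_well)
```

Both wells are populated in the correct proportion without ever simulating a barrier crossing. In high dimensions a fixed Gaussian would have vanishing overlap with the target, which is exactly the job the learned model takes over.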

The second strategy is the one that has received less attention but may be more immediately practical: instead of bypassing dynamics, learn to take much larger timesteps through them. Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics is the clearest expression of this idea. The key insight is that you do not need every femtosecond frame to understand what a molecule is doing. If you can learn a flow map that propagates the system accurately over timescales orders of magnitude larger than what classical integrators can handle, you collapse the computational cost of reaching biologically relevant timescales from intractable to feasible. This is not a new idea (coarse-graining and enhanced sampling have been around for decades) but framing it as a learned flow map with proper consistency guarantees is a meaningful advancement in the context of this field. For drug discovery, the implication is clear: conformational transitions that would require milliseconds of simulation to observe classically might become accessible in minutes.
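
A harmonic oscillator makes the flow-map idea tangible, because its exact flow map is known in closed form. The sketch compares three ways of reaching the same final time: many small velocity Verlet steps, a few large velocity Verlet steps, and one application of the flow map. For real molecular systems no closed form exists and the map has to be learned; the unit mass, unit frequency, and step sizes here are assumptions, and nothing below reproduces the paper's method.

```python
import numpy as np

def velocity_verlet(q, p, dt, n_steps):
    """Integrate a unit-mass, unit-frequency oscillator (force = -q)."""
    for _ in range(n_steps):
        p_half = p - 0.5 * dt * q
        q = q + dt * p_half
        p = p_half - 0.5 * dt * q
    return q, p

def exact_flow_map(q, p, t):
    """Closed-form map from the state at time 0 to the state at time t."""
    return q * np.cos(t) + p * np.sin(t), -q * np.sin(t) + p * np.cos(t)

def total_energy(state):
    q, p = state
    return 0.5 * (q**2 + p**2)

q0, p0, total_time = 1.0, 0.0, 10.0

fine = velocity_verlet(q0, p0, dt=0.01, n_steps=1000)  # 1000 small steps
coarse = velocity_verlet(q0, p0, dt=2.5, n_steps=4)    # 4 large steps
jump = exact_flow_map(q0, p0, total_time)              # 1 flow-map step

print("fine  ", fine, total_energy(fine))
print("coarse", coarse, total_energy(coarse))
print("jump  ", jump, total_energy(jump))
```

The large-step integrator blows up because velocity Verlet is unstable for this system once the step exceeds 2 in these units, while the flow map lands on the same state as the fine trajectory in a single evaluation and conserves energy.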

1,000 atoms is not 100,000 atoms: from materials to drug discovery

These two strategies share a common limitation that the field has not yet fully confronted. The systems featured across AI4Mat 2026, including the most technically sophisticated ones, are typically validated on structures of hundreds to a few thousand atoms, in clean, periodic, or otherwise idealized conditions. That is an appropriate regime for many materials problems. It is the wrong regime for drug discovery.

A realistic protein-ligand system does not look like a small organic crystal. It looks like a membrane-embedded GPCR with a ligand in its orthosteric pocket, surrounded by a lipid bilayer, explicit water, and physiological concentrations of ions, a system that routinely exceeds 100,000 atoms when set up for serious molecular dynamics. Until both the ensemble-sampling and the large-timestep approaches are shown to work reliably at that scale, they remain promising ideas rather than mature drug discovery tools.

But this scale gap also points to a deeper opportunity. The answer to the timescale problem is not simply to gather better data on slow biological dynamics. It is to use machine learning to build effective representations of those dynamics in the first place: models that can either sample the relevant thermodynamic ensemble directly or propagate a system across long stretches of time without resolving every microscopic step. In that sense, ML is not just an analysis layer placed on top of a simulation. It is a way of extending the simulation regime itself.

The benchmarks don’t match the problem

This is not a hardware problem that will quietly resolve itself as GPUs get faster. It is a distributional shift that runs through every layer of the stack: the training data, the model architecture, the evaluation protocol, and the scientific question being asked. A model trained on the Cambridge Structural Database and benchmarked on held-out crystals is not being tested on anything that resembles a flexible macromolecule in a heterogeneous biological environment. The community has become very good at building systems that score well on the benchmarks it has constructed for itself. Those benchmarks do not yet correspond to the problems that matter most.

What it would actually take

What would it mean to solve this? It would mean treating machine learning as part of the dynamical stack itself. We need training data that includes not just structures but thermodynamic ensembles, and conformational distributions rather than single geometries. But we also need models that can extend those data by learning effective samplers, coarse-grained dynamics, and large-timestep flow maps that reach regimes which brute-force simulation cannot. It would mean evaluation protocols that measure free energy accuracy, not just RMSD to a crystal structure. It would mean test systems where the ground truth is defined not only by some held-out split of synthetic data but also by experiment: isothermal titration calorimetry, surface plasmon resonance, or cryo-EM ensembles. Several threads at AI4Mat point toward this: the Boltzmann Generator work on thermodynamic sampling, the target-conditional property prediction study’s interrogation of when structural context actually helps, and the synthesis-constrained design work’s insistence that experimental realizability is part of the objective. These are the right instincts. The field needs to turn these ideas into a shared research objective.

Build for where it counts

AI4Mat 2026 demonstrates, convincingly, that machine learning for molecular and materials discovery is no longer a speculative endeavor. The methods are real, the benchmarks are improving, and the community is asking increasingly sharp scientific questions. What AI4Mat 2026 makes visible, however, is not only progress within materials discovery itself, but the emergence of a methodological toolkit that could matter far beyond it. The next step is to direct that toolkit toward the central bottleneck in drug discovery: accessing the relevant thermodynamic and dynamical regimes of biologically realistic systems. That will require better benchmarks and better experimental grounding, and also something more ambitious: using ML not only to score molecules or interpolate known data, but to learn the effective dynamics and ensemble structure and to scale to systems that brute-force simulation cannot reach on its own.

Drug discovery is one of the hardest test cases available because it demands exactly that combination of scale, physics, and experimental accountability. The Boltzmann Generator lineage shows what becomes possible when you insist on getting the physics right. That same insistence, applied at scale to biological realism and experimental accountability, should become the field’s north star.

Virtual Cell @ ICLR: A Sea of Silos

Ihab — Thu, 30 Apr 2026 21:21:51 GMT

From April 23 to 27, ICLR brought the AI community to Rio, a city that only enhanced an already rich annual experience. Conferences are not only paper lists and poster sessions; they are an opportunity to get together and learn from new people working in similar areas, accelerating the exchange of novel ideas. The weather was warm, the city was alive, the food was excellent, and the mood was open. Between talks, workshops, lunch lines, food trucks, and late-night conversations, Rio gave the conference a different rhythm: less sterile and more human.

For those of us working at the intersection of machine learning and biology, ICLR 2026 was especially interesting. Because ICLR is one of the biggest conferences in AI, its workshop topics are an important indicator of the areas receiving global attention. It was exciting to see the program include multiple biology-focused workshops, which underscored that this research is no longer a niche side conversation but one that has firmly entered the mainstream.

The workshops included Machine Learning for Genomics Explorations (MLGenX), Generative AI in Genomics (Gen^2): Barriers and Frontiers, and Learning Meaningful Representations of Life (LMRL). Unfortunately, several biology workshops happened on the same day, often in overlapping sessions. This made the day intellectually rich but logistically difficult. For students, researchers, and industry scientists interested in biological foundation models, perturbation effect prediction, multimodal cellular data, and drug discovery, it often meant choosing between sessions that should probably have been in dialogue with one another.

That tension is a good metaphor for the field itself. There is more work than ever. There is more investment than ever. There are more models than ever. But the pieces are not yet organized into a coherent whole.

And nowhere was that clearer than in the conversation around virtual cells.

What Do We Mean by a “Virtual Cell”?

The term “virtual cell” is now everywhere. It appears in papers, talks, workshops, company decks, and funding narratives. But the more often the phrase is used, the less meaningful it becomes.

A narrow definition would say that a virtual cell is a model that predicts how a cell responds to a perturbation. Knock out a gene, add a chemical compound, change an experimental condition, and the model predicts the resulting gene expression profile.

That is useful, but it is not enough.

A more helpful definition is broader: a virtual cell should help us predict, explain, and discover. It should predict how cellular systems respond to interventions. It should explain why those responses happen, at least enough to generate biological insight. And it should help us discover something new, whether that is a mechanism of action, a therapeutic hypothesis, a biomarker, or the most informative experiment to run next.

Under that definition, virtual cell modeling becomes broader than any single task, modality, dataset, or benchmark. It includes transcriptomics and phenomics perturbation effect prediction, but also goes beyond them towards building general, useful computational representations of cellular behaviour.

This distinction is important because much of the current discussion around virtual cells has narrowed toward perturbation effect prediction from single-cell RNA-seq. This is an important task, and it may become useful for parts of drug discovery, but treating it as the full definition of a virtual cell makes the ambition (and potential) of the field feel much smaller than it should be.

A cell is not just a vector of gene expression counts. It has morphology, spatial context, lineage, state, history, environment, and function. A useful virtual cell will eventually need to take into account many of these layers. If the field defines the entire ambition around narrow benchmarking tasks, it risks building a very small cathedral and calling it a city (just as the Cathedral of St. Sebastian, though large, is not all there is to Rio).

The Field Is Moving Fast. Maybe Too Fast

The last few years have produced a wave of models for transcriptomics perturbation effect prediction: GEARS, CellFlow, TxPert, STATE, X-Cell, and plenty more. They mostly try to predict how gene expression shifts under perturbation. Some explore flow matching or conditional generation, some inject biological knowledge bases. There is a lot happening, and it’s sometimes hard to know what to make of it all.

The ICLR 2026 conference showed how much creativity is entering perturbation effect modeling. It featured models that update biological priors, models that introduce new training paradigms, papers that define new splits, and papers that introduce new metrics. This is what an emerging field would look like: many groups trying different ways of turning messy biological measurements into predictive systems.

But the same creativity also exposed a weakness. One recurring pattern at ICLR 2026 was that many papers introduced a new model, a new understudied benchmark, and a new set of understudied metrics, then claimed state-of-the-art performance without comparing against many of the models mentioned above. Without shared evaluation conventions, progress becomes hard to read. A model evaluated on one split with one metric and one set of baselines cannot easily be compared to another model evaluated under a different setup. This is especially problematic when simple baselines are sometimes surprisingly strong, and when recent models are not always included in the comparison.

The result was less a shared scientific conversation than a collection of parallel claims. Each model could look strong under its own evaluation setup, but it was often difficult to know whether the field had actually moved forward.

This is not only an academic complaint. In drug discovery, weak evaluation can be expensive. A model that looks impressive on a convenient split may fail when asked to generalize to a new cell type, a new perturbation, a new disease context, or a new experimental platform. If virtual cell models are going to influence real decisions, the field needs to know what they are, when they work, when they fail, and why.

Recent benchmarking work has underscored this concern. A Nature Methods study comparing several deep learning and foundation model approaches for genetic perturbation effect prediction found that none outperformed deliberately simple baselines, highlighting the importance of critical benchmarking in this area. A separate ICLR 2025 paper, PertEval-scFM, similarly argued that zero-shot single-cell foundation model embeddings provide limited improvement over simple baselines for perturbation effect prediction, especially under distribution shift.

This is the uncomfortable truth: in most settings, simple baselines still match or beat cutting-edge models.

That does not mean virtual cell modeling is doomed. It means the field is young. It also means we should be wary of hype that rests solely on a model having a memorable name, running on thousands of GPUs, scaling to billions of parameters, or using a very large training set.

Defining Virtual Cell Beyond the Hype

One of the most interesting moments came during the LMRL workshop panel. The session included Daniel Burkhardt from NVIDIA, Anton Osokin from Isomorphic Labs, Sara-Jane Dunn from Relation Therapeutics, and Ben Kompa from Lila Sciences, moderated by Kristina Ulicna from Iambic Therapeutics.

What stood out was the frustration. Across very different industry vantage points, the panelists kept circling the same complaint: too many claims, too many models, too many fragile benchmarks, and not enough agreement on what should count as progress.

That frustration is healthy. It suggests the field is beginning to mature. The first phase of a new research area often rewards proof-of-concept models and anything that shows the idea is plausible. The next phase asks for something harder: standardized tasks, robust metrics, strong baselines, careful splits, and honest reporting.

Virtual cells are entering that second phase, yet the incentives haven’t caught up.

Academia rewards novelty. Companies reward momentum. Investors reward narratives. But biology rewards truth, and truth is slower. If we are serious about using virtual cell models for drug discovery, the field needs to make peace with less glamorous work: benchmark design, dataset auditing, baseline construction, metric analysis, and failure characterization.

The Benchmarking Problem Is Not a Small Detail

One of the strongest lessons from ICLR was that we’re doing ourselves a disservice by considering evaluation as a secondary concern.

Several papers still relied on understudied baselines. Some compared against GEARS while not comparing against newer or stronger models. Others did not include simple mean baselines or nearest-neighbor-style baselines that are known to be surprisingly competitive in certain perturbation settings. Some emphasized simple Pearson correlation, without exploring Pearson Delta (correlation computed on expression changes relative to control), and without deeper analysis of differential effects or perturbation-specific biological fidelity.

This matters because perturbation effect prediction can be deceptively easy to make look good. Many genes do not change much, quite a few perturbations share broad effects, and batch effects can leak information. Beyond that, dataset curation can inflate performance (a point discussed in an ICLR 2026 paper), and splits can accidentally reward memorization rather than generalization.

In essence, a model can appear to predict biology when it is really predicting the structure of the dataset.

That is why metrics need to be stable, interpretable, and difficult to game. They should not be overly sensitive to arbitrary preprocessing choices. They should distinguish between predicting the average cell state and predicting the actual perturbation response. They should reveal whether a model captures meaningful differentially expressed genes, directionality, and relevant biological programs, but also whether these signals are aligned with the downstream use cases we actually care about, from mechanism discovery to experimental prioritization.
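The gap between "predicting the average cell state" and "predicting the perturbation response" can be shown with a toy example. The sketch below uses purely synthetic data and assumes that "Pearson delta" means the correlation between predicted and observed changes relative to control. The mean baseline here knows nothing about the held-out perturbation, yet it scores very well on raw expression because the control profile dominates both vectors.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(0)
n_genes, n_perts = 2000, 20

# Synthetic control profile: baseline expression dominates the signal.
control = rng.gamma(shape=2.0, scale=2.0, size=n_genes)

# Toy assumption: every perturbation has its own independent effect.
effects = rng.normal(0.0, 0.5, size=(n_perts, n_genes))
perturbed = control + effects

# Hold out perturbation 0; the mean baseline predicts the average of the rest.
held_out = perturbed[0]
mean_baseline = perturbed[1:].mean(axis=0)

# Correlation on expression values vs. correlation on changes from control.
raw_r = pearson(mean_baseline, held_out)
delta_r = pearson(mean_baseline - control, held_out - control)

print(f"Pearson on expression: {raw_r:.3f}")
print(f"Pearson on deltas:     {delta_r:.3f}")
```

In real data perturbations often share broad effects, so a mean baseline can also do reasonably well on deltas. That is a reason to report the baseline next to every model, not a reason to skip the delta metric.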

This is where some of the most valuable work at the workshops appeared. One paper by Qiyuan Liu, on the effects of distance metrics and scaling on the perturbation discrimination score (PDS), went after a small but consequential problem: one of the field’s go-to metrics shifts dramatically depending on preprocessing and scaling choices, which leaves different models effectively incomparable. This is exactly the sort of issue that can quietly distort a field. A metric that is not comparable across models is not a reliable metric.
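A small sketch shows how this kind of sensitivity can arise. The function below is a simplified rank-based discrimination score written for illustration; it is an assumption of this post, not the official PDS definition or the paper's code. For each perturbation it ranks the matching true profile among all true profiles by distance to the prediction, and only the distance metric and the gene scaling vary.

```python
import numpy as np


def discrimination_score(pred, true, metric="l1"):
    """Simplified rank-based discrimination score (hypothetical helper).

    pred, true: arrays of shape (n_perturbations, n_genes).
    Returns the mean of 1 - rank / (n - 1), where rank counts how many true
    profiles are strictly closer to a prediction than its own true profile.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    n = pred.shape[0]
    if metric == "l1":
        dist = np.abs(pred[:, None, :] - true[None, :, :]).sum(axis=-1)
    elif metric == "cosine":
        pred_unit = pred / np.linalg.norm(pred, axis=1, keepdims=True)
        true_unit = true / np.linalg.norm(true, axis=1, keepdims=True)
        dist = 1.0 - pred_unit @ true_unit.T
    else:
        raise ValueError(f"unknown metric: {metric}")
    own = np.diag(dist)[:, None]          # distance to the matching profile
    ranks = (dist < own).sum(axis=1)      # profiles that beat the matching one
    return float(np.mean(1.0 - ranks / (n - 1)))


# Two perturbations, two genes. The first prediction has the right direction
# but a much smaller magnitude; the second is exact.
true = np.array([[10.0, 0.0],
                 [0.0, 1.0]])
pred = np.array([[1.0, 0.0],
                 [0.0, 1.0]])

score_l1 = discrimination_score(pred, true, metric="l1")
score_cos = discrimination_score(pred, true, metric="cosine")

# Same predictions, but each gene is rescaled by its spread across true profiles.
gene_scale = true.std(axis=0)
score_l1_scaled = discrimination_score(pred / gene_scale, true / gene_scale, metric="l1")

print(score_l1, score_cos, score_l1_scaled)
```

On this toy input the same predictions score 0.5 with L1 distance on raw values, 1.0 with cosine distance, and 1.0 with L1 distance after per-gene scaling. Nothing about the model changed between those three numbers, only the evaluation choices.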

The broader lesson is simple: we need fewer leaderboard illusions and more precise science.

A good benchmark should make progress harder to fake. This can be achieved by including strong baselines, defining biologically meaningful splits, and testing generalization across perturbations, cell types, contexts, and datasets. Such a benchmark would make preprocessing conventions explicit, and would tell us not just which model wins but what kind of biological features the model has learned.
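One of the simplest safeguards is to split by perturbation rather than by cell, so that no held-out perturbation is ever seen during training. The sketch below is a minimal illustration under assumed conventions: records are plain dictionaries with a "perturbation" key, controls carry the label "control" and always stay in the training set, and the function name is hypothetical.

```python
import random


def split_by_perturbation(records, test_fraction=0.2, control_label="control", seed=0):
    """Hold out whole perturbations so train and test never share one.

    records: list of dicts, each with a 'perturbation' key.
    Returns (train_records, test_records, held_out_perturbations).
    """
    perturbations = sorted({r["perturbation"] for r in records} - {control_label})
    rng = random.Random(seed)
    rng.shuffle(perturbations)
    n_test = max(1, round(len(perturbations) * test_fraction))
    held_out = set(perturbations[:n_test])
    train = [r for r in records if r["perturbation"] not in held_out]
    test = [r for r in records if r["perturbation"] in held_out]
    return train, test, held_out


# Toy dataset: 10 perturbations with 3 cells each, plus 3 control cells.
records = [
    {"perturbation": f"gene{i}", "cell_id": f"gene{i}_cell{j}"}
    for i in range(10)
    for j in range(3)
]
records += [{"perturbation": "control", "cell_id": f"control_cell{j}"} for j in range(3)]

train, test, held_out = split_by_perturbation(records)
train_perts = {r["perturbation"] for r in train}
test_perts = {r["perturbation"] for r in test}
print(f"held out: {sorted(held_out)}")
print(f"overlap between train and test perturbations: {train_perts & test_perts}")
```

The same grouping idea extends to cell types, contexts, or datasets by changing the key used for the hold-out.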

Foundation Representation Models Are Not Yet Virtual Cells

Another confusion worth clearing up is the way we increasingly talk about transcriptomics foundation models as if they were already virtual representations of cellular biology. Most of these models are, first and foremost, representation learners. They compress gene expression data into embeddings that can support downstream tasks such as cell state characterization, perturbation analysis, disease stratification, experiment design, or retrieval.

This does not make them unimportant. Quite the opposite. In the short term, foundation models for transcriptomics may be among the most useful outputs of the whole field for drug discovery. A good representation model can help organize large transcriptomic datasets, reduce dimensionality, identify related biological states, compare experiments, and support better decisions about what to test next.

They can also become important components of future virtual cell systems. A useful virtual cell will likely need strong representations of cellular states, just as a good simulator needs a meaningful internal description of what it is simulating. In that sense, representation models and simulation models are not competing directions. They are complementary pieces of the same longer term ambition.

The problem appears when we start treating these representation models as virtual cells by default, or when we evaluate their usefulness almost entirely through perturbation effect prediction. Perturbation effect prediction matters, and pretrained representations may help with it. But if we make this the main proof that a foundation model is biologically useful, we end up narrowing both the model and the field around one task.

This is especially limiting because representation models may already be useful in many other ways. They can help with mechanism of action analysis, cross dataset alignment, disease state modeling, patient stratification, experiment retrieval, modality integration, cellular morphology prediction, response biomarkers, and treatment outcome prediction. Many of these tasks are closer to how drug discovery decisions are actually made today.

So the question should not only be whether a foundation model can improve post-perturbation expression prediction but whether it gives us better biological structure, experimental decisions, links across modalities, and representations of disease. If we want to get the most out of foundation models, we need to develop them better, evaluate them better, and judge them on downstream tasks that reflect their actual use cases in discovery.

A virtual cell should ultimately be judged by the biological and therapeutic questions it helps answer. Foundation models should be judged the same way, without forcing all of their value through the narrow lens of perturbation reconstruction.

The Most Interesting Work May Be Outside the Narrow Definition

Some of the most exciting work we saw did not fit the narrow “predict perturbed transcriptome” mold.

One example came from Maria Brbić’s group at EPFL, focused on connecting single-cell transcriptomics to morphology imaging. This direction feels closer to the broader promise of virtual cells. If a model can connect gene expression to visual phenotype, it can help us understand how molecular changes manifest as cellular behavior. That is deeply relevant to gene function, mechanism of action, and disease biology.

This kind of cross modality work matters because cells are not studied by biologists as transcriptomes alone. They are observed through a plethora of assays (e.g., images, sequencing, spatial maps, protein readouts, viability measurements, electrophysiology) that give biologists a better overall understanding of the cell state. A virtual cell that cannot connect across readouts will remain limited.

Another interesting direction was specialized foundation models, such as EVA from Scienta Lab, which focuses on immunology. The specialization is the point. Not every model needs to be a universal model of all cells. In fact, the universal ambition may be premature.

For therapeutic discovery, a model that is genuinely useful in immunology, oncology, neuroscience, or fibrosis may be more valuable than a general model that performs modestly everywhere. The question is not whether a model can classify cell types in a generic benchmark, but whether it can support decisions that matter in a specific biological domain.

And beyond this, the evaluation of general models should expand its horizons. Cell type classification tasks have been useful historically, but they should not dominate our imagination. While this task is quite important and relevant, it remains insufficient on its own. A model that classifies cell types well may still be weak at predicting drug response, discovering mechanisms, or identifying patient relevant biology.

Architecture Still Matters

A growing assumption in modern AI is that scale will solve most problems. Train a larger model on more data, and performance will eventually improve. This “bitter lesson” has shaped much of machine learning.

But biology may not be ready for that lesson yet.

In transcriptomics, we may still be in the architectural exploration phase. Many models borrow ideas from language or vision, but biological data comes with its own structure. Gene expression measurements are sparse, noisy, compositional, assay dependent, and shaped by both biology and experimental artifacts.

Different assays have different noise profiles. Different datasets have different preprocessing histories. Counts, normalized values, ranks, gene panels, batches, and perturbation designs all influence what the model sees.

That means architecture choices like decoder activation, loss functions, data normalization, and inductive biases matter. We should not assume that scaling a generic architecture will automatically produce a meaningful virtual cell, or even a useful model.

This was one of the intriguing points from newer transcriptomics-focused models such as our TxFM model from Recursion, which emphasizes representation learning and broad evaluation and introduces an activation function (Rectified Sigmoid) adapted to transcriptomics data distributions. The notable claim was not only performance but efficiency: strong results with far less data (up to 100x less) than existing competing approaches. The direction is important. Better architecture and better data use may matter as much as, or even more than, raw scale in biology.

The field needs models designed from the ground up for biological measurements, not merely imported from other domains with a gene vocabulary attached.

The Missing Layer: Patients

Here is the hardest question: how does any of this help actual patients?

Most perturbation effect prediction work is still centered on in vitro systems. These systems are essential. They let us perturb, measure, and learn in ways that are impossible in humans. But drug discovery ultimately cares about patients, not plates.

The translation gap is large. A perturbation that looks promising in a cell line may fail in a tissue, and mechanisms observed in vitro may not dominate in disease. If virtual cell models cannot eventually connect to patient biology, their therapeutic impact will be limited.

This is why patient centered modeling deserves much more attention. We need ways to connect cellular perturbation models to patient data, clinical phenotypes, disease progression, treatment response, and outcomes. We might call this “virtual patient” modeling, though the term should be used carefully to avoid becoming as overloaded as virtual cells. The point is not to simulate a whole person tomorrow, but to make the bridge from simulation to patient outcome more explicit.

That bridge will be difficult because the data is hard to obtain, expensive, and incomplete. Patient data is also often observational, heterogeneous, sparse, and confounded. But that is exactly why the problem matters. Research is not only about solving the tasks with convenient data. It is about identifying the tasks that, if solved, would change what we can do. Virtual cells should not become an isolated benchmarking game. They should become part of a chain that connects perturbation, mechanism, tissue context, disease state, and patient outcome.

While most of the in silico perturbation effect prediction discussion at ICLR was still happening at the cellular level, there were also a few interesting attempts to move the conversation closer to patients. Rather than viewing the cell in isolation, several emerging frameworks attempt to anchor molecular data in clinical reality.

Two examples stood out in that direction.

The first was M-Optimus, a large multimodal foundation model that tries to bridge several biological scales at once. While it is a proprietary, closed source model, it is still a step in the right direction, as it brings together many patient data modalities (histology, bulk transcriptomics, spatial transcriptomics, and longitudinal clinical records), and tries to connect what we see in tissue images, what we measure molecularly, where these signals happen spatially, and how they relate to the patient over time.

The second was SCOPE, which approaches the problem from another angle: how do we represent patients using the cellular composition of their samples? Instead of treating a patient as one flat vector, SCOPE learns patient embeddings based on the distribution and organization of the cell types inside the patient cohort. This is interesting because it gives a way to move from single cell information to patient level similarity. And that is one of the missing bridges in the field right now: not only modeling cells, but understanding how collections of cells define disease states, treatment groups, and eventually clinical outcomes.
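To give a feel for what "representing a patient by cellular composition" means, here is a deliberately simple stand-in. It is not SCOPE's method, which learns its embeddings. This sketch only computes cell-type proportions per patient and compares them with cosine similarity, and the patient names, cell-type labels, and counts are all made up for illustration. A baseline of this kind is also a natural reference point for any learned patient embedding.

```python
from collections import Counter
import math


def composition_vector(cell_type_labels, vocabulary):
    """Fraction of a patient's cells that fall in each cell type of the vocabulary."""
    counts = Counter(cell_type_labels)
    total = sum(counts[c] for c in vocabulary)
    if total == 0:
        raise ValueError("no cells from the vocabulary found")
    return [counts[c] / total for c in vocabulary]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocabulary = ["T cell", "B cell", "myeloid"]

# Hypothetical per-cell annotations for three patients.
patients = {
    "patient_A": ["T cell"] * 60 + ["B cell"] * 30 + ["myeloid"] * 10,
    "patient_B": ["T cell"] * 55 + ["B cell"] * 35 + ["myeloid"] * 10,
    "patient_C": ["T cell"] * 10 + ["B cell"] * 10 + ["myeloid"] * 80,
}

vectors = {name: composition_vector(cells, vocabulary) for name, cells in patients.items()}
sim_ab = cosine_similarity(vectors["patient_A"], vectors["patient_B"])
sim_ac = cosine_similarity(vectors["patient_A"], vectors["patient_C"])
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

Proportions alone discard how cells are organized and what state they are in, which is the information a learned approach would be expected to add on top.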

These works are still early, and they are far from solving the translation problem on their own. But they point to the right question. While ICLR focused far more on virtual cells than on patients, these works represent an optimistic step towards placing cellular modelling inside a wider biological chain: from perturbation, to mechanism, to tissue context, to patient response.

What Should the Field Do Next?

After ICLR 2026, the main takeaway is that virtual cell research does not need another wave of lightly evaluated models making broad claims. It needs infrastructure, better benchmarks to evaluate different models, and a clearer path towards patients.

The most impactful project for a PhD student or research team right now may not be the 101st “virtual cell” model. It may be a rigorous benchmark suite.

That benchmark suite should gather existing datasets, audit their limitations, define meaningful splits, compare metrics, include simple and strong baselines, and evaluate models across tasks that matter. It should test perturbation effect prediction, representation learning, cross modality translation, mechanism discovery, and patient relevance where possible. The Arc Institute’s Virtual Cell Challenge was an important first step in this direction, because it helped push the benchmarking question to the front of the field rather than treating it as an afterthought. But this now needs to go further: the next generation of benchmarks should make it harder to win by exploiting preprocessing artifacts or dataset shortcuts, and easier to understand what kind of biological capability a model has actually learned.

The opportunity is significant. A reliable framework for in silico perturbation effect prediction could meaningfully accelerate experimental design, target discovery, mechanism of action studies, toxicity prediction, and therapeutic hypothesis generation. It could help us decide which experiments are worth doing and which biological stories are worth trusting. And to get there, we obviously need careful evaluation.

ICLR 2026 made the picture quite clear. There is excitement, talent, investment, and visible technical progress. But there is also fragmentation and a growing risk that the field mistakes activity for actual progress.

The next phase of virtual cell research should be less about claiming we have virtualized biology and more about proving, carefully and repeatedly, which pieces of cellular behaviour we can actually model consistently and in which contexts.

That may sound less glamorous, but it is exactly how the field becomes useful.
