Towards Reasoning in Virtual Cells
Bridging predict and discover with autonomous mechanistic explanations
tl;dr: VCR-Agent is a multi-agent system that can explain why a perturbation drives a particular cellular response, as opposed to simply describing the cellular response of a perturbation. This work falls under the “Explain” pillar of our Virtual Cell framework, a pillar that can unlock significant value in the drug discovery pipeline despite being underexplored in the research community. Explainability strengthens our ability to predict which compounds work, moving toward autonomous, interpretable virtual cell experimentation. Training with structured mechanistic traces improves differential expression prediction over both standard baselines and STATE, a recent SOTA perturbation response model. We are also releasing VC-TRACES, an initial collection of mechanistic traces on the Tahoe public dataset to encourage others to collaborate in this action space.
In our previous posts, we introduced our Predict-Explain-Discover framework for Virtual Cells and dove deep into the Predict pillar through virtual assays for microscopy. Today, we turn to the pillar that has received far less attention from the scientific community: Explain.
The bottleneck of understanding “why” in biology and drug discovery
Recent advances in virtual cell technology are poised to transform biomedical research, using AI foundation models to simulate cellular response. While not yet perfect, perturbation response prediction is becoming impressively accurate. Models like TxPert and X-Cell now predict transcriptomic shifts under genetic and chemical perturbations with increasing performance. TxPert, for instance, reaches performance competitive with experimental reproducibility on held-out perturbations in selected cell lines. Beyond transcriptomics, phenomics models are now generating realistic microscopy readouts under unseen compound perturbations, further expanding the frontier of what virtual assays can simulate. Yet predicting what happens without understanding how it happens leaves a critical gap. While useful, prediction alone cannot reveal mechanisms of action and cannot warn you when a prediction is right for the wrong reasons.
One could argue that a mechanistic explanation is a luxury, as not all drugs have a fully characterized mechanism, and phenotypic drug discovery has a long track record of producing effective therapies without one. If a model can either accurately predict which perturbations produce the desired phenotypic response or rank compounds by predicted efficacy, why do we need to know how or why it works?
But no predictor is perfect. In practice, models struggle to generalize beyond their training distribution, fail silently on novel scaffolds, and give no signal when they’re wrong. Smarter experiment selection helps: active learning and Bayesian optimization can prioritize which perturbations to test next, making design-make-test cycles more efficient. But while these approaches may accelerate cycle time and reduce costs, improving drug discovery outcomes requires more than optimization alone. For example, a medicinal chemist works under hard constraints: limited synthesis budget, a target mechanism of action, SAR hypotheses to validate, and off-target liabilities to rule out. Each of those decisions requires understanding why a compound drives a particular cellular response, and that response is rarely simple. A compound binding its target triggers downstream signaling changes, transcriptional reprogramming, and phenotypic transitions that vary across cell types, genetic backgrounds, and disease contexts. Knowing which pathway a compound engages lets you anticipate off-target toxicity, assess context-specific selectivity, and evaluate translatability from cell line to patient. As experimental throughput scales to millions of perturbations per week, the bottleneck shifts from generating cellular readouts to interpreting what they mean mechanistically.
The vision of interpretability of virtual cells
In our Predict-Explain-Discover framework, Explain is the bridge between prediction and discovery: a virtual cell that can articulate why a compound drives a particular cellular response can turn a prediction into a testable hypothesis, and a testable hypothesis into a designed experiment. Yet interpreting what a cellular readout means mechanistically is precisely where current approaches fall short, and where Explain remains the least developed of the three capabilities.
Part of the difficulty is definitional. In machine learning, explainability typically refers to understanding how a model reached a prediction: which features mattered, which attention heads activated. These are useful diagnostics for model developers, but they are not biological explanations. Knowing that a model assigned high importance to a particular gene expression feature does not tell you whether that gene is causally involved in the biological response, nor does it generate a hypothesis you can test in the lab. What we are after is different: a structured account of what happens inside the cell, expressed in terms a domain expert can evaluate and act on.
The natural language for such accounts already exists in biology. Pathway diagrams, from textbook signaling cascades to curated interaction networks, are directed graphs of mechanistic dependencies: node A activates node B, which phosphorylates node C, which drives expression of gene D. Each edge is a claim checkable against experimental evidence, making the whole structure interpretable and falsifiable. The computational analog often adopted is a directed acyclic graph (DAG) of biological actions: while signaling networks contain feedback loops, a DAG captures the primary propagation direction of a perturbation response from target engagement to downstream phenotypic outcome. Signals propagate directionally and causally: a binding event enables an activity change, an activity change enables a transcriptional response, a transcriptional response enables a phenotypic transition.
RAS/RAF/MEK/ERK pathway
Image from Bahar et al. 2023
Curated biological knowledge graphs like Reactome or STRING are a natural starting point to leverage. But static lookup is far from reasoning. These resources encode what the community has already established, and while they provide essential grounding, they were not designed to generate mechanistic accounts of how a novel compound behaves in a specific cellular context. The relevant knowledge is also distributed across literature, database entries, sequence annotations, and experimental results, in formats no single ontology captures. LLMs can synthesize across these sources and reason dynamically over them, but only if their outputs are sufficiently constrained and verifiable to produce explanations that are grounded and falsifiable.
Why standard LLMs fail at biology
Large language models have transformed reasoning in mathematics and programming. Chain-of-thought prompting, reinforcement learning from verifiable rewards, and supervised fine-tuning on curated reasoning traces have produced models that solve competition-level math problems and write complex code. The natural question is whether the same approach translates to biology.
It does not, for three reasons.
First, LLMs hallucinate, and in biology, those hallucinations are hard to catch. In math, a wrong intermediate step typically leads to a wrong final answer that a verifier can flag (e.g., “2+3=6” immediately invalidates). In biology, there exists no such clean check. Ground truth is scattered across incomplete databases, experimental results, and literature, so a plausible but incorrect claim can propagate through an entire reasoning chain without being flagged by any automated check.
Second, biology is deeply context-dependent. A drug’s mechanism of action can differ entirely depending on which dose is applied, which cell type is targeted, which experimental setting is used, or which disease state is being evaluated. An LLM that learned a pathway from literature has no reliable signal for when that signaling cascade breaks down in a novel context.
Third, biological knowledge is validated through experiment, not derivation. In mathematics, a proof is self-contained, and verification is cheap. In biology, establishing how a pathway operates in a specific cellular context requires wet lab experiments that take weeks and significant resources. Unlike math benchmarks such as GSM8K and MATH, biological reasoning has no cheap oracle, and scaling expert-annotated mechanistic explanations to millions of perturbation contexts through human annotation is simply not feasible.
The LLM reasoning landscape for biology
Several recent works have begun exploring LLM reasoning for biological perturbation response. PerturbQA contextualizes perturbation experiments through language and compresses gene-centric knowledge graphs at inference time. RBio1 introduces an RL framework with soft verifiers for gene-centric reasoning. SynthPert fine-tunes models on GPT-4o-generated reasoning traces for cellular perturbation prediction.
These efforts, while important steps in the right direction, share a common constraint: generated reasoning remains unstructured free-form text whose individual claims cannot be independently checked. Models reason from memorized training data without systematically retrieving from external knowledge bases, making hallucinations hard to prevent. And they focus almost exclusively on gene-centric perturbations, neglecting drug-induced perturbations.
VCR-Agent: Structured reasoning as the key design choice
VCR-Agent targets these gaps with per-action falsifiability, external knowledge grounding, and coverage across a broader hierarchy of cellular events. The core design decision follows directly from the problem: when an LLM is free to produce any text, there is no handle for a verifier. Constraining the output space is the natural fix, but the constraint must be biologically meaningful. Rather than asking an LLM to explain a perturbation in free-form prose, we ask it to construct a directed acyclic graph (DAG) of biological actions. This constraint is what makes verification tractable: each node is a discrete, typed claim that can be independently evaluated against external evidence.
Given a perturbation (e.g., a chemical compound or gene perturbation) and cellular context (e.g., a cell line with a specific background), VCR-Agent generates a reasoning graph where each node is a biological action selected from a predefined space of 20 primitives, and each directed edge encodes a mechanistic dependency between actions. These initial 20 primitives cover seven categories of cellular events: system initialization, metabolic, regulation, functional, interaction, phenotype, and proteostasis.
20 action primitives for VCR-Agent
Each primitive is parameterized by a specific argument schema. For example, binds_to(id, actor, target, {affinity, unit, residues_actor, residues_target, via, confidence}) specifies the interacting molecules, binding affinity, relevant residues, and the mechanism. This parameterization is what makes each action falsifiable, e.g., a verifier can check the plausibility of the claimed drug-target binding at the stated affinity by querying a binding predictive tool such as Boltz-2 or a comparable state-of-the-art alternative.
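As a rough illustration of what a typed argument schema buys a verifier, the binds_to signature above can be sketched as a Python dataclass. Only the field names come from the signature; the types, which fields are optional, and the sanity checks (positive affinity, confidence in [0, 1]) are assumptions made for this sketch, not the released schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BindsTo:
    """Sketch of one binds_to action. Field names follow the signature in the
    text; types and defaults are assumptions made for illustration."""
    id: str
    actor: str                              # e.g. the compound
    target: str                             # e.g. the protein it binds
    affinity: Optional[float] = None
    unit: Optional[str] = None
    residues_actor: tuple[str, ...] = ()
    residues_target: tuple[str, ...] = ()
    via: Optional[str] = None
    confidence: Optional[float] = None

    def problems(self) -> list[str]:
        """Return schema problems; an empty list means the action is well-formed."""
        issues = []
        if not self.actor or not self.target:
            issues.append("actor and target are required")
        if self.affinity is not None and self.unit is None:
            issues.append("affinity given without a unit")
        if self.affinity is not None and self.affinity <= 0:
            issues.append("affinity must be positive")  # assumed convention
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            issues.append("confidence must lie in [0, 1]")  # assumed range
        return issues


action = BindsTo(id="a1", actor="Binimetinib", target="MAP2K1")
print(action.problems())
```

Because every argument has a name and a type, a malformed claim can be rejected before any external tool is queried, and a well-formed one hands the verifier exactly the fields it needs.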
To illustrate concretely, take Binimetinib in C32 melanoma cells. VCR-Agent generates a trace that begins with set_context(cell_type="C32 melanoma", genotype="BRAF V600E GoF, ..."), proceeds through dual binds_to actions targeting MAP2K1 and MAP2K2, and chains through MEK1/2 allosteric inhibition → MAPK pathway blockade disrupting the BRAF→MEK→ERK cascade → loss of ERK1/2 phosphorylation → ERK1/2 cytoplasmic retention. From there it fans out across three parallel consequences: downregulation of proliferative transcription factors (MYC, CCND1, FOS, JUN); CDK4/6 activity reduction leading to RB1 hypophosphorylation and Rb-E2F1 complex formation, which suppresses S-phase genes (MCM7, PCNA); and upregulation of pro-apoptotic BH3-only genes (BCL2L11, BBC3) alongside suppression of pro-survival factors (MCL1, BCL2L1), which shifts the BCL-2 family balance toward mitochondrial apoptosis, cytochrome c release, and caspase-3 activation. The trace terminates with induces_phenotype(phenotype="G1/S cell cycle arrest and apoptotic cell death"). Every node is interpretable; every edge encodes a mechanistic dependency. The entire trace is a structured object that can be parsed and verified systematically.
An example explanation trace DAG for (Binimetinib, C32)
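To make "a structured object that can be parsed and verified" concrete, here is a minimal sketch of the Binimetinib trace as a dependency graph, using Python's standard graphlib module. The node ids and the exact edges are a simplified, hypothetical paraphrase of the trace described above, not the released VC-TRACES format.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is an action id; its value is the set of actions it depends on.
# Ids and edges are a simplified paraphrase of the trace, chosen for illustration.
trace = {
    "set_context": set(),
    "binds_MAP2K1": {"set_context"},
    "binds_MAP2K2": {"set_context"},
    "MEK_inhibition": {"binds_MAP2K1", "binds_MAP2K2"},
    "ERK_dephosphorylation": {"MEK_inhibition"},
    "TF_downregulation": {"ERK_dephosphorylation"},
    "RB1_hypophosphorylation": {"ERK_dephosphorylation"},
    "BH3_upregulation": {"ERK_dephosphorylation"},
    "induces_phenotype": {
        "TF_downregulation",
        "RB1_hypophosphorylation",
        "BH3_upregulation",
    },
}


def execution_order(graph):
    """Return the actions in a dependency-respecting order.

    Raises ValueError if the graph contains a cycle, i.e. is not a DAG.
    """
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # graphlib stores the offending cycle as the second element of args.
        raise ValueError(f"trace is not a DAG: {err.args[1]}") from err


order = execution_order(trace)
print(order[0], "->", order[-1])
```

The same check that orders the actions also rejects a malformed trace: any feedback edge the model introduces by mistake surfaces as an explicit error rather than silently passing through.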
Crucially, this is an extensible framework. The current 20 primitives reflect a practical trade-off between coverage and verifiability: broad enough to represent the range of cellular biology, constrained enough that specialized verifiers can be built for each type. New action types can be added (e.g., for epigenomic or metabolomic events), and the verifier suite can be improved and extended with new specialized tools. Realizing this extensibility requires community input: new primitives, new verifiers, and stress-testing across diverse biological domains. To that end, we are releasing VC-TRACES, an initial collection of mechanistic traces, and the full pipeline to invite the community to build on both.
The multi-agent pipeline: Ensuring biological reliability
Generating a well-formed, factually grounded DAG requires two distinct capabilities: retrieving and synthesizing biological knowledge, and organizing that knowledge into a typed, structured format. Our experiments validate that asking a single model to do both simultaneously degrades performance on each: generating structured output under strict format constraints competes with factual accuracy, and a model focused on knowledge retrieval is not optimally positioned to enforce the argument schemas that make each action verifiable. VCR-Agent, therefore, separates these into two specialized agents.
The Report Generator is responsible for information retrieval and synthesis. Given a perturbation-context pair, it first extracts biomedical entities using HunFlair2 (a biomedical NER model), then queries four external knowledge bases: StarkPrimeKG (a biomedical knowledge graph), Harmonizome (a gene-centric database), PubMed (biomedical literature), and Wikipedia. The retrieved information is synthesized into a natural-language report that describes the perturbation’s known pharmacology, the cellular context’s mutational landscape, and established mechanistic pathways. This report serves as the factual foundation that the next agent can draw upon without hallucinating facts from its parametric memory.
An example generated report for (Binimetinib, C32)
The Explanation Constructor then takes this knowledge-grounded report and translates it into the structured reasoning format: a sequence of action primitives connected as a DAG. With factual retrieval already handled, it can focus on the reasoning task: which events to include, how to order and connect them, and how to parameterize each action.
An example structured explanation for (Binimetinib, C32)
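The division of labor between the two agents can be sketched as two functions with a narrow interface between them. Everything here is hypothetical scaffolding: the function names, the callables standing in for the NER model, the knowledge-base clients, and the LLM are assumptions, and the real Report Generator synthesizes its report with a language model rather than concatenating snippets as this sketch does.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]  # entity name -> retrieved snippets


def generate_report(
    perturbation: str,
    context: str,
    extract_entities: Callable[[str], List[str]],
    retrievers: Dict[str, Retriever],
) -> str:
    """Stage 1: extract entities, query every source, collect what comes back."""
    entities = extract_entities(f"{perturbation} in {context}")
    sections = []
    for source, retrieve in retrievers.items():
        snippets = [s for entity in entities for s in retrieve(entity)]
        if snippets:  # sources that return nothing are left out of the report
            sections.append(f"[{source}] " + "; ".join(snippets))
    return "\n".join(sections)


def construct_explanation(report: str, llm: Callable[[str], str]) -> str:
    """Stage 2: the constructor sees only the report, never the raw sources."""
    prompt = (
        "Using only the facts in the report below, write the explanation "
        "as a DAG of typed action primitives.\n\n" + report
    )
    return llm(prompt)


# Stubs stand in for the NER model, a knowledge-base client, and the LLM.
report = generate_report(
    "Binimetinib",
    "C32",
    extract_entities=lambda text: ["Binimetinib", "C32"],
    retrievers={"PubMed": lambda entity: [f"<snippet about {entity}>"]},
)
trace_text = construct_explanation(report, llm=lambda prompt: "<model output>")
```

Keeping the report as the only channel between the stages is the point of the design: the second agent has a fixed body of retrieved evidence to draw on and can devote its capacity to structure.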
Even with grounded knowledge retrieval and structured generation, LLMs can still produce incorrect claims. VCR-Agent addresses this through a verifier-based filtering pipeline that evaluates each intermediate step in explanation traces and removes or corrects factually inaccurate components. For instance, the DTI verifier uses Boltz-2 to score binds_to actions: if the predicted binding probability falls below a threshold, the entire trace is discarded. The DE verifier checks regulates_expression actions against ground-truth differential expression data from Tahoe-100M, pruning genes that are incorrectly identified or directionally mismatched. These two verifiers are not exhaustive: any sufficiently curated experimental resource or computational tool can serve as a verification oracle, and each addition incrementally increases the reliability of the generated traces.
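The two filtering behaviors described above, discarding a whole trace versus pruning part of one action, can be sketched as follows. The dictionary layout of an action, the 0.5 threshold, and the two callables (one standing in for a Boltz-2 score, one for a Tahoe-100M lookup) are assumptions for illustration. Dropping an action whose genes are all pruned is also a choice made here, and a full implementation would need to rewire the edges around any removed node.

```python
from typing import Any, Callable, Dict, List, Optional

Action = Dict[str, Any]


def verify_trace(
    actions: List[Action],
    binding_score: Callable[[str, str], float],
    measured_direction: Callable[[str], Optional[str]],
    threshold: float = 0.5,  # assumed value; the text does not state one
) -> Optional[List[Action]]:
    """Return the filtered trace, or None if any binds_to claim fails."""
    kept: List[Action] = []
    for action in actions:
        if action["type"] == "binds_to":
            # DTI check: one implausible binding claim discards the whole trace.
            if binding_score(action["actor"], action["target"]) < threshold:
                return None
        elif action["type"] == "regulates_expression":
            # DE check: keep only genes whose claimed direction matches the data.
            supported = {
                gene: direction
                for gene, direction in action["genes"].items()
                if measured_direction(gene) == direction
            }
            if not supported:
                continue  # nothing in this action survived the check
            action = {**action, "genes": supported}
        kept.append(action)
    return kept
```

The asymmetry is deliberate in this reading of the pipeline: a wrong binding claim sits at the root of the trace and invalidates everything downstream, whereas a wrong gene inside one expression action can be removed locally.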
Empirical results: Structured explanation quality
We apply VCR-Agent to 18,950 compound perturbation-context pairs derived from the Tahoe-100M atlas, releasing the resulting verified traces as the VC-TRACES dataset. We evaluate the generated reasoning traces against the reasoning traces generated by both open-source LLMs (DeepSeek-R1-8B, Qwen3-30B, Llama3.3-70B) and a closed-source baseline (Claude-Sonnet-4, the same backbone model used within VCR-Agent).
Evaluating the quality of biological reasoning traces is inherently difficult, as there is no single ground-truth answer to check against, unlike math or code. We therefore assess quality along two complementary axes: format-based metrics that measure whether the output is structurally valid and its entities can be mapped to biomedical ontologies, and verifier-based metrics that measure whether specific claims align with experimental data. On format-based metrics, VCR-Agent achieves near-perfect validity and verifiability, while open-source baselines struggle significantly: some nearly fail to produce the structured format at all. On verifier-based metrics, VCR-Agent outperforms all baselines on both DTI and DE scores, including the identical backbone model (Claude-Sonnet-4) without the multi-agent pipeline. This gap is particularly telling: it demonstrates that knowledge retrieval and structured generation add genuine value beyond what prompting a strong LLM alone can achieve. Moreover, the significance of verification is underscored by the numbers: the verifier pipeline excluded 28.2% of faulty DTI claims and refined 87.3% of DE actions.
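For intuition about what a format-based validity metric measures, here is a minimal sketch. It assumes traces are serialized as a JSON list of actions with a "type" field, which is an assumption about the format rather than something stated in this post, and it lists only the four primitive names mentioned in the text instead of the full set of 20. The actual metric definitions may differ.

```python
import json
from typing import Iterable

# Only these four primitive names appear in the text; the full set of 20
# would replace this in a real check.
KNOWN_PRIMITIVES = frozenset(
    {"set_context", "binds_to", "regulates_expression", "induces_phenotype"}
)


def is_valid(raw: str, allowed: frozenset = KNOWN_PRIMITIVES) -> bool:
    """True if raw parses as a non-empty list of actions with allowed types."""
    try:
        actions = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(actions, list) or not actions:
        return False
    return all(isinstance(a, dict) and a.get("type") in allowed for a in actions)


def validity_rate(outputs: Iterable[str]) -> float:
    """Fraction of model outputs that pass the structural check."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_valid(o) for o in outputs) / len(outputs)
```

A model that answers in fluent prose scores zero on this kind of check no matter how accurate its biology is, which is one way to read the gap between VCR-Agent and the open-source baselines on format-based metrics.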
Does explanation actually help downstream tasks?
The ultimate test of a mechanistic explanation is whether it improves prediction. VCR-Agent is evaluated on TahoeQA, a gene expression prediction task inspired by PerturbQA: given a perturbation and cellular context, predict whether a target gene is differentially expressed and, if so, whether its expression increases or decreases.
Two supervised fine-tuning (SFT) training configurations leverage traces generated by VCR-Agent: SFT-Prompt, where the verified structured explanation is provided as additional input, and SFT-Generate, where the model learns to first generate the explanation and then predict the answer. Both configurations substantially outperform all baselines, including vanilla SFT, zero-shot prompting, strong statistical baselines, and even STATE, a transcriptomic foundation model trained directly on Tahoe-100M, on the differential expression prediction task. The SFT-Generate result is especially notable: the model constructs its own mechanistic reasoning at inference time and still outperforms baselines that have access to raw numerical representations. This suggests that structured biological reasoning serves as a strong inductive bias that generalizes to novel compounds more effectively than memorizing statistical patterns.
Improvement in the TahoeQA task
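The difference between the two training configurations comes down to which side of the example the explanation sits on. The sketch below shows one way to build such examples; the prompt wording, the label strings, and the input/target dictionary layout are all hypothetical and are not the templates used in the experiments.

```python
# Three outcomes follow from the task definition: not differentially
# expressed, or differentially expressed in one of two directions.
LABELS = ("not differentially expressed", "upregulated", "downregulated")


def question(perturbation: str, context: str, gene: str) -> str:
    return (
        f"Perturbation: {perturbation}\n"
        f"Context: {context}\n"
        f"Is {gene} differentially expressed, and in which direction?"
    )


def _check(label: str) -> None:
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")


def sft_prompt_example(perturbation, context, gene, explanation, label):
    """SFT-Prompt: the verified explanation is part of the input."""
    _check(label)
    prompt = question(perturbation, context, gene)
    return {"input": f"{prompt}\nExplanation:\n{explanation}", "target": label}


def sft_generate_example(perturbation, context, gene, explanation, label):
    """SFT-Generate: the model must write the explanation before the answer."""
    _check(label)
    prompt = question(perturbation, context, gene)
    return {"input": prompt, "target": f"{explanation}\nAnswer: {label}"}
```

In the first layout the model can rely on a verified explanation at inference time; in the second it has to produce its own, which is why the SFT-Generate result is the stronger test of whether the reasoning has been learned.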
Conclusion: Towards explaining the virtual cell
VCR-Agent contributes three things to the virtual cell ecosystem: a formalism (structured mechanistic action graphs with 20 biologically grounded primitives), a framework (a multi-agent pipeline that separates knowledge retrieval from structured reasoning and applies verifier-based filtering), and a dataset (VC-TRACES, 18,950 verified mechanistic explanation traces). Together, these contributions make the Explain pillar of the Predict-Explain-Discover framework tractable: not solved, but approachable in a principled way.
This work is a starting point, not a finished system. The action space needs stress-testing across diverse biological domains, the verifier suite needs to grow beyond two tools, and VC-TRACES needs broader coverage across compounds, cell types, and disease contexts. These are open problems, and progress on them requires the community to build on what we have released. Structured mechanistic reasoning is what makes autonomous, interpretable virtual cells for drug discovery tractable, and expanding its scope may accelerate how quickly that becomes a reality.
This post is part of “Inside Valence”, a series where you’ll get a behind-the-scenes look at our research, exploring new ways to predict, explain, and ultimately decode biology. If this resonates, consider subscribing!













