3 Comments
Nicky's avatar

Amazing paper, super motivating read :)

About your question: I'd put my money on the right grid being the real one (haha, fingers crossed). My reasoning comes from your own framing: a perturbation response is not concentrated on a single outcome; it can be multimodal, heavy-tailed, and contain rare but important events. To me, the right grid seems to capture more of that tail behavior. I could totally be wrong, though; I'm basically coin-flipping.

On metrics, I agree with you about FID/KID. They compare samples in a frozen ImageNet embedding space and essentially ask, "Do these look like real cells overall?" That's why even an unconditional model can score well, as you showed.
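
For reference, the standard FID definition makes this concrete. Here (mu_r, Sigma_r) and (mu_g, Sigma_g) are the mean and covariance of the real and generated embeddings; the symbols are the conventional ones, not taken from the preprint:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

If those statistics are pooled over all conditions, nothing in the formula depends on which perturbation a given sample belongs to.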

Two things I think could help:

1. Compare samples in an embedding space that actually understands cells rather than ImageNet features.

2. In a perturbation setting, I think of each condition as a cloud of cells: control is one cloud, the real drug shifts it to another, and the model predicts a third. What really matters is whether the predicted cloud lands on the real perturbed one. I'd therefore evaluate that shift directly using distributional metrics such as Energy Distance or MMD, along with a discrimination test: rank the real drugs by distance to the prediction and see whether the true drug comes out #1.

That second idea also captures a failure mode that FID might miss. A model that simply outputs the "average treated cell" may look realistic and achieve a great FID score, yet predict essentially the same response for every drug. In that case, discrimination performance would collapse to chance level (top-1 accuracy of roughly 1/N for N drugs), which gets at the core question of perturbation modeling: not just whether the images look realistic, but whether the correct response is associated with the correct perturbation.
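
A minimal sketch of that discrimination test, assuming each condition has already been turned into an array of per-cell embeddings. The function names, the synthetic clouds, and the choice of energy distance over MMD are all illustrative assumptions on my part, not anything from the preprint:

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_distance(x, y):
    """Energy distance estimate: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (
        2.0 * pairwise_dist(x, y).mean()
        - pairwise_dist(x, x).mean()
        - pairwise_dist(y, y).mean()
    )


def top1_accuracy(predicted, real):
    """Fraction of drugs whose predicted cloud is nearest to its own real cloud."""
    names = list(real)
    hits = 0
    for name in names:
        dists = [energy_distance(predicted[name], real[other]) for other in names]
        hits += names[int(np.argmin(dists))] == name
    return hits / len(names)


# Synthetic stand-in data (hypothetical): 5 drugs, 8-dim embeddings,
# each drug shifts the cloud along its own axis.
rng = np.random.default_rng(0)
n_drugs, dim, n_cells = 5, 8, 200
centers = 6.0 * np.eye(n_drugs, dim)
names = [f"drug_{i}" for i in range(n_drugs)]

real = {n: c + rng.normal(size=(n_cells, dim)) for n, c in zip(names, centers)}

# A model that tracks each drug's shift.
good = {n: c + rng.normal(size=(n_cells, dim)) for n, c in zip(names, centers)}

# A model that outputs the same "average treated cell" cloud for every drug.
average_cloud = centers.mean(axis=0) + rng.normal(size=(n_cells, dim))
collapsed = {n: average_cloud for n in names}

acc_good = top1_accuracy(good, real)
acc_collapsed = top1_accuracy(collapsed, real)
print(acc_good, acc_collapsed)
```

Because the collapsed model returns an identical cloud for every drug, its nearest real drug is always the same one, so it gets exactly one hit out of N by construction. The well-separated synthetic clouds should let the condition-aware model score at or near 1.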

And the closing line "what do we still not understand well enough to simulate?" really stuck with me. Shifting the bottleneck from lab throughput to model fidelity feels like it could genuinely expand the drug-discovery design space. Excited to see where these trends go.

Charles Jones's avatar

Hi Nicky, it's Charlie (lead author of the preprint) here!

Firstly, thanks a lot for your kind and thoughtful comments. We started this blog because we wanted to engage in discussions exactly like this.

The grid of images on the RHS is actually the generated one! I'm super happy that it's this tough to tell them apart. These are generated images of the AZ841 molecule perturbation, which the model had not seen at all during training. If you're interested, we have a more in-depth qualitative analysis of these images in section 5.3 of the preprint [1].

We agree that getting benchmarks right is an ever-present challenge in this line of research. FID and KID are certainly deeply flawed metrics. However, one thing I found inspiring from the wider generative modelling community is how, despite these flaws, we have seen surprisingly strong outcomes from work that narrowly focuses on improving FID scores. Indeed, when we did early experiments with our internal Phenom series of microscopy embedding models [2] (in place of ImageNet-trained Inception), we found that, while the absolute numbers were different from the Inception ones, the ranking of models was the same. This gave us some confidence that FID and KID are good enough to measure generation fidelity in microscopy, although we may start to see benefits from better evaluation encoders soon as FID begins to saturate.

All that being said, I have to admit that in this preprint, we were very focused on treating this as an ML problem and systematically improving these simple metrics as opposed to true biological validation. When evaluating generative models, all metrics measure a combination of (i) "Do these look like real cells overall?" and (ii) "Are we correctly capturing the perturbation effect?" (see the paper for a more formal version of this). Pure ML techniques allow us to make the error from (i) as small as possible, so that when we focus on the more biologically interesting question (ii), our conclusions are not as confounded by differences in the model quality. As we build on this work and focus more on (ii), we're tapping into our team's transcriptomics modelling and benchmarking expertise [3] to include conditional and ranking-style metrics just like the ones you suggest.

Keen to discuss further (or email me at charlie@valencelabs.com)!

[1] https://arxiv.org/abs/2603.26790

[2] https://arxiv.org/abs/2411.02572

[3] https://www.nature.com/articles/s41587-026-03113-4

Nicky's avatar

Thanks so much for taking the time to write such a thoughtful reply, Charlie, and for the reveal (I called it wrong, but I'm honestly thrilled the generated grid is that convincing!). I loved your framing of fidelity vs. capturing the perturbation effect; it's an exciting way to look at it. I will definitely read section 5.3. Really appreciate the engagement, and I'm looking forward to following along with where this goes!