New study proposes metrics to refine perturbation model benchmarking and accelerate rejuvenation target identification.
As machine learning begins to reshape how scientists model biology, a new study from Shift Bioscience has called time on a persistent flaw in the benchmarking of single-cell perturbation models – the overperformance of the dataset mean. This statistical average, often outperforming state-of-the-art predictors in common metrics, has become something of an uninvited guest in virtual screening, offering high scores without biological insight. Now, researchers at Shift propose a redesigned framework that addresses this issue head-on and promises to improve model selection for aging and rejuvenation research.
Virtual cell models, trained on large single-cell RNA sequencing (scRNA-seq) datasets, are emerging as a powerful tool to test how gene perturbations – up- or down-regulation – alter cell behavior. These models could, in theory, compress decades of wet-lab experimentation into months of in silico simulations. However, the field has hit a rather technical snag: benchmark results are often inflated by control bias and weak perturbations, leading to performance mirages. In some cases, the best-performing models aren’t models at all, but simple predictions based on the mean expression of all cells – informative statistically, perhaps, but not biologically.
Longevity.Technology: Virtual cell models have long promised a computational panacea for aging research – a way to trial, screen and iterate at machine speed – but their progress has been dogged by an inconvenient truth: that many models struggle to outperform a statistical stand-in. Shift’s decision to call this out is refreshing – the dataset mean may be a statistical footnote, but here it becomes a philosophical provocation, forcing us to confront what we think our models are doing versus what they are actually achieving.
By retooling the ranking system with a biologically literate compass – one that centers on differentially expressed genes, adjusts for control bias and discards false confidence – Shift is not merely tuning the metrics, but repositioning the entire framework toward models that matter. The inclusion of a ‘technical duplicate’ baseline – a clever means of simulating a performance ceiling using real data – provides a much-needed sanity check; when models begin to approach this line, we can be more confident they are capturing biology, not just statistical shortcuts. This is good news for rejuvenation pipelines, yes, but it’s also a quiet challenge to the broader machine learning community: are your models just good at data, or are they good at biology?
Control bias and collapsing modes
The paper, Diversity by Design, outlines a series of metric reforms intended to address what the authors describe as “mode collapse” – a tendency for models to default to uninformative, average-like predictions when confronted with sparse biological signals or biased control data. This issue is particularly acute in commonly used evaluation metrics like mean squared error (MSE) and Pearson correlation, which, the authors argue, reward safe bets rather than specific insights [1].
To address this, the team introduced biologically weighted alternatives that center evaluation on differentially expressed genes (DEGs), placing greater weight on the changes that matter from a gene regulation perspective. Under these refinements, a model is penalized for failing to capture real biological variation rather than rewarded for following the crowd.
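As a rough illustration of the idea, a DEG-weighted error can be as simple as up-weighting the squared error on DEGs. The function below is our own sketch in Python; the weighting scheme, names and default values are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def deg_weighted_mse(pred, true, deg_mask, deg_weight=10.0):
    """Mean squared error that up-weights differentially expressed genes.

    pred, true: (n_genes,) predicted and observed post-perturbation expression.
    deg_mask:   boolean (n_genes,) marking genes called as DEGs for this
                perturbation (e.g. from a control-vs-perturbed test).
    deg_weight: how much more an error on a DEG counts than one on a
                background gene (the value here is arbitrary).
    """
    weights = np.where(deg_mask, deg_weight, 1.0)
    return np.average((pred - true) ** 2, weights=weights)
```

Under a metric like this, a prediction that simply parrots the dataset mean picks up the full penalty on exactly the genes where the perturbation actually did something.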
In addition to the new metrics, the researchers implemented a set of baselines for calibration: a negative baseline based on control means, a null-performance baseline using the dataset mean, and a positive baseline derived from a 'technical duplicate' – essentially splitting real-world data in two and asking one half to predict the other. This triangulation provides a performance landscape with clear anchors, helping to separate genuinely informative models from statistical freeloaders.
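In rough terms, the three anchors can be computed as in the sketch below, a minimal illustration assuming cells-by-genes expression matrices for the control and perturbed conditions (the helper and its return structure are our own, not the paper's):

```python
import numpy as np

def calibration_baselines(control, perturbed, seed=0):
    """Three calibration anchors for a single perturbation.

    control:   (n_control_cells, n_genes) expression matrix
    perturbed: (n_perturbed_cells, n_genes) matrix for one perturbation
    """
    rng = np.random.default_rng(seed)

    # Negative baseline: predict "no change", i.e. the control mean profile.
    negative = control.mean(axis=0)

    # Null baseline: the dataset mean (sketched here as the mean over all
    # cells passed in; in practice it would span every perturbation).
    null = np.vstack([control, perturbed]).mean(axis=0)

    # Positive baseline ('technical duplicate'): split the perturbed cells
    # in half and let one half's mean predict the other's, giving an
    # empirical ceiling set by the measurement noise of the assay itself.
    idx = rng.permutation(len(perturbed))
    half = len(perturbed) // 2
    duplicate_pred = perturbed[idx[:half]].mean(axis=0)
    duplicate_target = perturbed[idx[half:]].mean(axis=0)

    return negative, null, (duplicate_pred, duplicate_target)
```

A useful model should clear the negative and null anchors and approach, but not implausibly exceed, the duplicate ceiling.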
Real-world and simulated testing
The team validated their proposals using both in silico simulations and two real-world datasets: Norman19 and Replogle22, both widely used in the field. Their simulations demonstrated that even modest levels of control bias could inflate benchmark scores for the dataset mean. When evaluated using the proposed DEG-weighted metrics, however, this effect all but vanished [1].
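The mechanics are easy to reproduce in a deliberately simplified toy (our own construction, not the paper's simulation code): give each perturbation a handful of genuine DEGs, nudge the control with a batch-style shift, and under plain MSE the dataset mean outscores a control-based prediction despite carrying no perturbation-specific signal:

```python
import numpy as np

rng = np.random.default_rng(42)
n_perts, n_genes, n_degs = 200, 1000, 20

# Each perturbation changes only a handful of genes (its DEGs).
control = rng.normal(0.0, 1.0, n_genes)            # true unperturbed profile
profiles = np.tile(control, (n_perts, 1))
for p in range(n_perts):
    degs = rng.choice(n_genes, n_degs, replace=False)
    profiles[p, degs] += rng.normal(0.0, 2.0, n_degs)

dataset_mean = profiles.mean(axis=0)               # the 'uninvited guest'
biased_control = control + rng.normal(0.5, 0.5, n_genes)  # batch-shifted control

def mse(pred, targets):
    return np.mean((targets - pred) ** 2)

print(f"plain MSE, dataset mean:   {mse(dataset_mean, profiles):.3f}")   # low
print(f"plain MSE, biased control: {mse(biased_control, profiles):.3f}") # high
```

Re-scoring the same predictions with a DEG-weighted metric like the one sketched earlier shrinks the dataset mean's advantage considerably, since it is precisely on the DEGs that the mean is wrong.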
One notable finding was that when models were trained using weighted mean squared error as the objective function – rather than the traditional MSE – they not only avoided mode collapse, but recovered more of the true biological variation present in the data. In the Norman19 dataset, this improved performance placed the model within reach of the technical duplicate baseline – a ceiling representing the noise limit of the dataset itself [1].
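Swapping the training objective is a small change in practice. A minimal sketch of such a weighted-MSE loss, assuming PyTorch and a per-gene weight vector derived from DEG calls (both assumptions on our part, not details taken from the paper), might look like this:

```python
import torch

def weighted_mse_loss(pred, target, gene_weights):
    """Weighted MSE training objective.

    pred, target: (batch, n_genes) predicted and observed expression.
    gene_weights: (n_genes,) tensor, e.g. larger for DEGs, so errors on
                  perturbation-specific genes dominate the gradient.
    """
    sq_err = (pred - target) ** 2        # (batch, n_genes)
    weighted = sq_err * gene_weights     # broadcast weights over the batch
    return weighted.sum() / (gene_weights.sum() * pred.shape[0])
```

The weights do at training time what the DEG-weighted metric does at evaluation time: predicting something close to the dataset mean stops being a safe bet, which is plausibly why the authors observe less mode collapse under this objective.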

“In this research, our team has shown that by focusing on the development of new metrics and baselines, we can more easily identify models that demonstrate strong predictability,” said Lucas Paulo de Lima Camillo, Head of Machine Learning at Shift Bioscience.
“The paper provides foundational data which will enable us to develop more powerful, biologically-useful perturbation models, ultimately accelerating our therapeutic pipeline and helping us to uncover new targets for rejuvenation therapeutics.”
Toward sharper models for aging biology
Although rooted in machine learning technicalities, and still awaiting peer review, this study makes a pointed contribution to the wider goals of rejuvenation research. With the biological relevance of computational models under increasing scrutiny, particularly in the longevity biotech sector, sharper tools for filtering signal from noise become critical. Shift's framework moves the conversation from generic accuracy to biological fidelity – a welcome development for anyone serious about targeting the mechanisms of aging with precision.