@alandenadel.bsky.social
Thank you again to all my collaborators for their contributions and thoughtful feedback.

Madeline Hughes
Akshaya Thoutam
Anay Gupta
Andrew Navia
@nfusi.bsky.social
Srivatsan Raghavan
Peter Winter
@avapamini.bsky.social
@lcrawford.bsky.social

I appreciate any feedback!
November 7, 2025 at 8:07 PM
Interestingly, we saw improved zero-shot performance when increasing model size (but still no scaling with pre-training dataset size) for both scVI and Geneformer.
November 7, 2025 at 8:07 PM
The Nicheformer authors observed a similar phenomenon: when Nicheformer was pre-trained on 1% of their 110M-cell dataset, performance did not decrease dramatically.
November 7, 2025 at 8:07 PM
There is an implicit assumption that scaling the pre-training dataset size is inherently better, but the only demonstrated scaling law we know of is in terms of data quality:
arxiv.org/abs/2503.02726
Measurement noise scaling laws for cellular representation learning
November 7, 2025 at 8:07 PM
This work addresses a critical consideration in training large-scale models: the size and diversity of the pre-training corpus.
November 7, 2025 at 8:07 PM
And for out-of-distribution perturbation response prediction.
November 7, 2025 at 8:07 PM
We also observed similar results for zero-shot batch integration.
November 7, 2025 at 8:07 PM
The learning saturation points were always 25% or less of the full pre-training dataset when evaluating the models on zero-shot classification, and 10% or less when evaluating them on fine-tuned classification.
November 7, 2025 at 8:07 PM
To assess the extent to which this plateauing generalized across datasets and tasks, we identified the "learning saturation point" for each model. This is the minimum pre-training dataset size for which a model surpassed 95% of the maximum performance observed.
November 7, 2025 at 8:07 PM
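(Not from the thread: a minimal sketch of how such a learning saturation point can be computed, assuming each model is evaluated as a list of (pre-training subset fraction, downstream score) pairs; the numbers below are hypothetical.)

```python
# Minimal sketch (hypothetical data, not the authors' code): the learning
# saturation point is the smallest pre-training subset fraction whose
# downstream score reaches 95% of the best score observed across all subsets.

def learning_saturation_point(results, threshold=0.95):
    """results: list of (subset_fraction, score) pairs, e.g. [(0.01, 0.62), ...]"""
    best = max(score for _, score in results)
    return min(frac for frac, score in results if score >= threshold * best)

# Hypothetical zero-shot classification scores for one model
scores = [(0.01, 0.62), (0.10, 0.78), (0.25, 0.80), (0.50, 0.81), (1.00, 0.81)]
print(learning_saturation_point(scores))  # -> 0.1
```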
Across all model architectures, performance on cell type classification (both zero-shot and fine-tuned) plateaued at a small fraction of the total pre-training dataset size, regardless of dataset diversity. When fine-tuning, pre-training had almost no impact on performance.
November 7, 2025 at 8:07 PM
We assessed five model architectures pre-trained to serve as single-cell foundation models (scFMs) for single-cell RNA-seq: PCA, scVI, SSL, Geneformer, and SCimilarity. We pre-trained these models on subsets of the scTab corpus using three downsampling schemes.
November 7, 2025 at 8:07 PM
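(The specific downsampling schemes aren't spelled out in this post; purely as a rough illustration, a uniform random cell-level downsample of a corpus stored as an AnnData object might look like this, with hypothetical subset fractions.)

```python
# Rough illustration only (the thread does not name the downsampling schemes):
# draw a uniform random cell-level subset of a corpus stored as an AnnData object.
import numpy as np
import anndata as ad

def downsample_cells(adata: ad.AnnData, fraction: float, seed: int = 0) -> ad.AnnData:
    rng = np.random.default_rng(seed)
    n_keep = int(adata.n_obs * fraction)
    idx = rng.choice(adata.n_obs, size=n_keep, replace=False)
    return adata[np.sort(idx)].copy()

# Hypothetical pre-training subsets:
# subsets = {f: downsample_cells(corpus, f) for f in (0.01, 0.10, 0.25, 0.50, 1.00)}
```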
In our expanded analysis, we show that single-cell foundation models tend to plateau in downstream task performance with pre-training subsets that are a small fraction of the size of current pre-training datasets.
November 7, 2025 at 8:07 PM
Thank you to all my collaborators for their contributions and thoughtful feedback.

Madeline Hughes
Akshaya Thoutam
@anaygupta.bsky.social
Andrew Navia
@nfusi.bsky.social
Srivatsan Raghavan
Peter Winter
@avapamini.bsky.social
@lcrawford.bsky.social

I welcome any comments!
December 18, 2024 at 6:48 PM
Our results highlight the need for a more nuanced approach, balancing dataset size and diversity with careful attention to model architectures and benchmarking.
December 18, 2024 at 6:48 PM
Our findings underscore the importance of prioritizing data quality and content over sheer size. Developers of scFMs and large databases should weigh data quality and content rather than simply scaling up models and databases, which we have shown is unlikely to meaningfully improve performance on its own.
December 18, 2024 at 6:48 PM
While neural scaling laws observed in other domains suggest that increasing dataset size leads to better performance, our findings show that, past a learning saturation point, simply increasing the pre-training dataset size doesn't necessarily improve performance on downstream tasks.
December 18, 2024 at 6:48 PM
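(Background sketch, not from the thread: the neural scaling laws referenced here are typically power laws, loss(D) ≈ a·D^(−α) in dataset size D; the numbers below are made up for illustration.)

```python
# Background sketch with made-up numbers: a power-law scaling curve keeps
# improving as dataset size D grows, whereas a saturating curve plateaus.
import numpy as np

D = np.array([1e5, 1e6, 1e7, 1e8])    # pre-training dataset sizes (cells)
loss = 2.0 * D ** -0.1                # idealized power law: loss = a * D**(-alpha)

# Recover alpha with a linear fit in log-log space: log(loss) = log(a) - alpha*log(D)
slope, _ = np.polyfit(np.log(D), np.log(loss), 1)
print(f"alpha ≈ {-slope:.2f}")        # -> alpha ≈ 0.10
```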