alandenadel.bsky.social
Interestingly, we saw improved zero-shot performance when increasing model size (but still no scaling with pre-training dataset size) for both scVI and Geneformer.
November 7, 2025 at 8:07 PM
The Nicheformer authors observed a similar phenomenon: when Nicheformer was pre-trained on only 1% of their 110M-cell dataset, performance did not decrease dramatically:
And for out-of-distribution perturbation response prediction.
We also observed similar results for zero-shot batch integration.
The learning saturation points were always 25% or less when evaluating the models on zero-shot classification and were always 10% or less when evaluating the models on fine-tuned classification.
To assess the extent to which this plateauing generalized across datasets and tasks, we identified the "learning saturation point" for each model. This is the minimum pre-training dataset size for which a model surpassed 95% of the maximum performance observed.
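As a concrete illustration, the saturation-point rule can be sketched in a few lines of Python. The fractions and scores below are made-up numbers (not results from the paper), and `learning_saturation_point` is a hypothetical helper name, not code from the authors:

```python
import numpy as np

def learning_saturation_point(fractions, scores, threshold=0.95):
    """Return the smallest pre-training fraction whose score reaches
    `threshold` times the maximum score observed across all fractions.

    `fractions` and `scores` are parallel sequences, e.g. fractions of a
    pre-training corpus and the corresponding downstream performance.
    """
    fractions = np.asarray(fractions, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(fractions)            # evaluate from smallest subset up
    fractions, scores = fractions[order], scores[order]
    cutoff = threshold * scores.max()
    for frac, score in zip(fractions, scores):
        if score >= cutoff:
            return frac
    return fractions[-1]                     # unreachable for threshold <= 1

# Toy example: performance plateaus early, as described in the thread.
fracs  = [0.01, 0.10, 0.25, 0.50, 0.75, 1.00]
scores = [0.70, 0.84, 0.85, 0.86, 0.86, 0.86]
print(learning_saturation_point(fracs, scores))  # 0.1, since 0.84 >= 0.95 * 0.86
```

With these toy numbers the saturation point is 10% of the corpus: the 10% model already achieves more than 95% of the best score seen at any size.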
Across all model architectures, performance at cell type classification (both zero-shot and fine-tuned) plateaued at a small fraction of the total pre-training dataset size, regardless of dataset diversity. When the models were fine-tuned, the amount of pre-training data had almost no impact on performance.
We assessed five model architectures pre-trained to perform as single-cell foundation models (scFMs) in the context of single-cell RNA-seq: PCA, scVI, SSL, Geneformer, and SCimilarity. We pre-trained these models on subsets of the scTab corpus using three downsampling schemes.
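For readers unfamiliar with zero-shot evaluation of a frozen embedding model, here is a minimal sketch of the general recipe on synthetic data, using a PCA "model" and nearest-neighbor cell type classification. It is illustrative only, on made-up data, and does not reproduce the paper's scTab setup or any of the five architectures beyond the spirit of the PCA baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for log-normalized scRNA-seq counts: 300 cells x 50 genes,
# three synthetic "cell types" with shifted means (not real data).
X = np.vstack([rng.normal(loc=mu, size=(100, 50)) for mu in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 100)

# "Pre-training": fit a PCA embedding on half the cells; this is the frozen model.
ref, ref_y = X[::2], y[::2]
mean = ref.mean(axis=0)
_, _, Vt = np.linalg.svd(ref - mean, full_matrices=False)

def embed(A):
    # Project cells onto the top 10 principal components of the reference.
    return (A - mean) @ Vt[:10].T

# Zero-shot evaluation: embed held-out cells with the frozen model and label
# each by its nearest reference cell (1-NN), with no fine-tuning of the model.
Z_ref, Z_test = embed(ref), embed(X[1::2])
d = ((Z_test[:, None, :] - Z_ref[None, :, :]) ** 2).sum(axis=-1)
pred = ref_y[d.argmin(axis=1)]
acc = (pred == y[1::2]).mean()
print(round(acc, 3))
```

The key point is that the embedding model itself is never updated during evaluation; only a simple classifier (here 1-NN) operates on top of it, which is what makes the comparison "zero-shot" with respect to pre-training dataset size.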
Model performance at cell type classification (both zero-shot and fine-tuned) tended to plateau at a small fraction of the total pre-training dataset size on a clonal hematopoiesis evaluation dataset, regardless of pre-training dataset diversity.
December 18, 2024 at 6:48 PM
The three downsampling schemes were: (1) random downsampling, (2) cell type re-weighting, and (3) geometric sketching. Scheme (1) conserves the diversity of the corpus, while (2) and (3) increase diversity relative to the full corpus. Downsampled datasets were generated at 1%, 10%, 25%, 50%, and 75% of the total corpus size.
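A minimal sketch of how schemes (1) and (2) differ, on a made-up imbalanced label vector. Geometric sketching, which selects cells to cover transcriptomic space evenly rather than by label frequency, needs its own machinery and is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: an imbalanced cell-type label vector (not real scTab labels).
labels = np.array(["T"] * 700 + ["B"] * 250 + ["NK"] * 50)
n_keep = 100  # e.g. a 10% subset

# (1) Random downsampling: preserves the corpus's cell-type proportions
# in expectation, so the subset is about as diverse as the full corpus.
random_idx = rng.choice(len(labels), size=n_keep, replace=False)

# (2) Cell-type re-weighting: sample inversely to type frequency, so rare
# types are over-represented and the subset is more diverse than the corpus.
types, counts = np.unique(labels, return_counts=True)
freq = dict(zip(types, counts / len(labels)))
w = np.array([1.0 / freq[t] for t in labels])
reweighted_idx = rng.choice(len(labels), size=n_keep, replace=False, p=w / w.sum())

for name, idx in [("random", random_idx), ("re-weighted", reweighted_idx)]:
    t, c = np.unique(labels[idx], return_counts=True)
    print(name, dict(zip(t, c)))
```

The random subset keeps roughly a 70/25/5 split, while the re-weighted subset pulls the rare NK cells up toward parity with the common types.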
Current methods in the field are trained on atlases ranging from 1 to 100 million cells. In our newest preprint, we show that these same approaches tend to plateau in performance with pre-training datasets that are only a small fraction of those sizes.