Yun S. Song
yun-s-song.bsky.social
Professor of EECS and Statistics at UC Berkeley. Mathematical and computational biologist.
Not yet, but we will surely generate bp-resolution genome-wide scores for all six species studied in the paper and make them publicly available. For now, we have predictions for ~10M variants used in the S-LDSC analysis in humans.
September 22, 2025 at 2:59 PM
All in all, we believe that GPN-Star offers a scalable & flexible approach for training effective gLMs.

This work was led by my talented students @czye.bsky.social and @gonzalobenegas.bsky.social, with contributions from other lab members, @peterdfields.bsky.social at Jax, & B. Clarke at DKFZ.
(n/n)
September 22, 2025 at 5:29 AM
Upon publication, we will release base-resolution predictions for the human genome and the five model organisms.
Code to train the model, run inference, and reproduce the analyses is available on GitHub (github.com/songlab-cal/...) and Hugging Face (tinyurl.com/nhhcppvm).
(9/n)
GitHub - songlab-cal/gpn: Genomic Pre-trained Network
September 22, 2025 at 5:29 AM
To show that GPN-Star is a robust and generalizable framework that can advance biology beyond human genetics, we apply it to train gLMs for five well-studied model organisms and demonstrate their effectiveness in assessing variant effects in these species.
(8/n)
September 22, 2025 at 5:29 AM
In addition, GPN-Star exhibits meaningful nucleotide dependencies that align with known functional dependencies, indicating its potential to help understand genomic syntax. This represents a notable advance over traditional conservation scores.
(7/n)
September 22, 2025 at 5:29 AM
By training GPN-Star on vertebrate, mammal, and primate alignments, we reveal task-dependent advantages of modeling deeper versus more recent evolution. These findings offer new biological insights and practical guidance for developing future gLMs and evolutionary models.
(6/n)
September 22, 2025 at 5:29 AM
GPN-Star achieves unprecedented SNP heritability enrichments across over 100 human complex traits. Moreover, we devise a simple approach to incorporate tissue-specificity into the model prediction and show that it further improves heritability enrichment.
(5/n)
September 22, 2025 at 5:29 AM
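For readers unfamiliar with the term in (5/n): heritability enrichment of a variant annotation is commonly defined, as in stratified LD score regression, as the share of SNP heritability attributed to the annotation divided by the share of SNPs it contains. The sketch below only illustrates that ratio. The function name and the input numbers are made up for illustration and are not taken from the paper or the GPN codebase.

```python
def heritability_enrichment(h2_annot, h2_total, n_snps_annot, n_snps_total):
    """Enrichment = (proportion of h2 in annotation) / (proportion of SNPs in annotation).

    All four inputs are hypothetical placeholders; in practice the
    heritability terms would come from an S-LDSC fit.
    """
    prop_h2 = h2_annot / h2_total            # share of SNP heritability
    prop_snps = n_snps_annot / n_snps_total  # share of SNPs in the annotation
    return prop_h2 / prop_snps


# Toy numbers: an annotation covering 1/8 of SNPs that explains half of h2.
example = heritability_enrichment(0.25, 0.5, 1, 8)
print(example)
```

A value above 1 means the annotated variants carry more heritability per SNP than the genome-wide average.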
We compare GPN-Star with several models, including the recent AlphaGenome and Evo2 models with up to 1Mb context size and 40B parameters, and observe that GPN-Star consistently ranks at the top across a wide range of human variant effect prediction tasks.
(4/n)
September 22, 2025 at 5:29 AM
We also introduce a calibration method that, for the first time, removes the confounding effect of mutation rate variation from gLM predictions. This improves downstream performance and enables a more direct interpretation of model scores as estimates of selective constraint.
(3/n)
September 22, 2025 at 5:29 AM
GPN-Star features a novel phylogeny-aware architecture that enables the model to explicitly capture evolutionary relationships encoded in whole-genome alignments and overcomes the key limitations of our earlier model GPN-MSA (doi.org/10.1038/s415...).
(2/n)
September 22, 2025 at 5:29 AM
Thanks, Josh. I wish you had been one of our reviewers—life would’ve been so much easier.
September 11, 2025 at 4:45 AM
This work was led by my talented student Milind Jagota @milindjagota.bsky.social in collaboration with colleagues at UC Berkeley, UCSF (the Ye Lab @yimmieg.bsky.social), and Fred Hutch (the Matsen Lab @matsen.bsky.social). We are grateful to all co-authors for their enthusiasm and hard work. (n/n)
August 15, 2025 at 1:17 PM
From a machine learning perspective, this work illustrates the value of high-quality negative examples. The paper is mostly focused on BCR light chains, but we are excited about extensions. (10/n)
August 15, 2025 at 1:17 PM
We interpret what sequence features the model associates with dysfunction. One example is shown below. For a specific light chain V- and J- gene, we observe sharp selection on CDRL3 length, and on certain amino acids. (9/n)
August 15, 2025 at 1:17 PM
In new data, we find that very low scores are associated with reduced surface expression in naive B cells. To our knowledge, this is the first time expression variation in naive B cells has been linked to the light chain. (8/n)
August 15, 2025 at 1:17 PM
B cells can further mutate antibodies to improve binding. We compare observed mutations to random control sets of mutations. Mutations that significantly decrease model scores appear to be selected out. However, this only works in a few positions. (7/n)
August 15, 2025 at 1:17 PM
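One way to picture the comparison in (7/n) is as an empirical test: count how often observed mutations sharply lower the model score, then ask how often random control sets show as few or fewer such mutations. This is a minimal sketch under assumptions, not the analysis from the paper. The function name, the score-change inputs, and the threshold are all hypothetical.

```python
def depletion_of_damaging(observed_deltas, control_sets, threshold=-1.0):
    """Compare observed mutations against random control sets.

    observed_deltas: score changes (mutant minus parent) for observed mutations.
    control_sets: list of lists of score changes for random control mutations.
    threshold: hypothetical cutoff; a change <= threshold counts as "damaging".
    Returns (observed fraction, mean control fraction, empirical one-sided p-value).
    """
    def damaging_fraction(deltas):
        return sum(d <= threshold for d in deltas) / len(deltas)

    observed = damaging_fraction(observed_deltas)
    controls = [damaging_fraction(c) for c in control_sets]
    # Fraction of control sets with as few or fewer damaging mutations,
    # with the usual +1 correction so the p-value is never exactly zero.
    p_value = (1 + sum(c <= observed for c in controls)) / (1 + len(controls))
    return observed, sum(controls) / len(controls), p_value


# Toy data, invented for illustration only.
obs_frac, ctrl_frac, p = depletion_of_damaging(
    [0.1, 0.2, -0.1, 0.0],
    [[-2.0, 0.0, -1.5, 0.3], [-1.2, 0.1, 0.2, 0.4], [0.0, 0.1, 0.2, 0.3]],
)
print(obs_frac, ctrl_frac, p)
```

An observed fraction well below the control mean, together with a small p-value, would be consistent with damaging mutations being selected out.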
Models trained on allelic inclusion generalize to predict antibody properties with no direct training. Here we apply models to independent data measuring polyreactivity of human antibodies and observe correlation with polyreactivity. Baselines don’t capture this signal. (6/n)
August 15, 2025 at 1:17 PM
We don’t know which sequence in each double-light B cell is “bad”, but we develop a training framework that doesn’t need this information. We compare with baseline approaches that don’t use the new allelic inclusion data. (5/n)
August 15, 2025 at 1:17 PM