Ian Shi
@heyitsmeianshi.bsky.social
PhD Student @ University of Toronto

Building foundation models for genomics!
And thanks also to our collaborators who provided datasets and thoughtful advice, including Simran, Andrew, Cyrus, Defne, Jessica, Kaitlin, Ilyes, @bowang87.bsky.social, and @quaidmorris.bsky.social! (+ MSK HPC for making this all possible)
July 15, 2025 at 6:41 PM
A huge thanks to @taykhoomdalal.bsky.social, @phil-fradkin.bsky.social, and Divya for pushing this work across the finish line!
July 15, 2025 at 6:41 PM
mRNABench is available on GitHub: github.com/morrislab/mR..., where we've made an effort to keep the codebase accessible, extensible, and reproducible!

A Colab notebook is available: colab.research.google.com/drive/1VZF5N...

Details on our findings are on BioRxiv: biorxiv.org/content/10.1...
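
If you want a feel for the intended workflow before diving into the docs, here's a minimal sketch. Everything below (`mrna_bench`, `load_task`, `load_model`) is an assumed, illustrative interface, so check the repo and Colab for the real one:

```python
# Hypothetical quickstart; the real API is documented in the repo and
# Colab above. Module and function names here are illustrative
# assumptions, not the actual mRNABench interface.
from mrna_bench import load_task, load_model  # assumed entry points

seqs, labels = load_task("mrna_stability")  # assumed task identifier
model = load_model("orthrus")               # a frozen foundation model
embeddings = model.embed(seqs)              # one vector per transcript
```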
July 15, 2025 at 6:41 PM
As these results show, most models struggle to compositionally generalize, revealing a significant gap in their ability to truly understand regulation.

We hope that this experimental setup and others like it can inform new directions for nucleotide foundation model development.
July 15, 2025 at 6:41 PM
(4) Together with Divya Koyyalagunta, we further assess the ability of foundation models to compositionally generalize from learned motifs.

Models are exposed to either of two sequence elements that promote translation, but never both together, and we task them with predicting the effect of the unseen combination.
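
Roughly, the split looks like the following sketch. It assumes each example is annotated with which elements it contains; the field names are mine, not the paper's:

```python
# Minimal sketch of a compositional-generalization split, assuming
# each example carries flags for two hypothetical translation-
# promoting elements ("A" and "B").
def compositional_split(examples):
    train, test = [], []
    for ex in examples:
        if ex["has_element_a"] and ex["has_element_b"]:
            test.append(ex)   # the unseen A+B combination is held out
        else:
            train.append(ex)  # A alone, B alone, or neither
    return train, test
```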
July 15, 2025 at 6:41 PM
(3) Finally, we assess the limitations of current benchmarking and modelling efforts.

A common source of data leakage is sequence homology, which inflates performance estimates unless data splits account for it. We demonstrate the impact of improper splitting on our tasks.
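
For readers unfamiliar with homology-aware splitting, the fix is conceptually simple: split by similarity cluster rather than by individual sequence, so homologs never straddle the train/test boundary. A minimal sketch (the clustering step, e.g. via an external tool like MMseqs2, and the `cluster_ids` input are assumptions, not the paper's exact pipeline):

```python
import random

# Assign whole similarity clusters to train or test.
# `cluster_ids[i]` is assumed to be the cluster label of sequence i.
def homology_split(cluster_ids, test_frac=0.2, seed=0):
    clusters = sorted(set(cluster_ids))
    random.Random(seed).shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    test_idx = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    train_idx = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    return train_idx, test_idx
```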
July 15, 2025 at 6:41 PM
@taykhoomdalal.bsky.social further explored this phenomenon and developed a joint CL + MLM objective, demonstrating that the joint loss results in superior downstream performance. Remarkably, adding an MLM objective to Orthrus achieves SOTA results with 700x fewer parameters.
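
In spirit, a joint objective can be as simple as a weighted sum of an InfoNCE-style contrastive term and a masked-token cross-entropy term. A hedged sketch, not the paper's actual formulation (`alpha`, the temperature, and the input conventions are my assumptions):

```python
import torch
import torch.nn.functional as F

def joint_loss(emb_a, emb_b, mlm_logits, mlm_targets, temp=0.07, alpha=0.5):
    # Contrastive term: paired views (emb_a[i], emb_b[i]) are positives.
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temp
    targets = torch.arange(len(emb_a), device=logits.device)
    cl = F.cross_entropy(logits, targets)
    # MLM term: predict identities of masked tokens (-100 = unmasked).
    mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_targets.view(-1),
        ignore_index=-100,
    )
    return alpha * cl + (1 - alpha) * mlm
```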
July 15, 2025 at 6:41 PM
(2) Choice of pre-training objective has a noticeable impact on downstream performance.

Orthrus, trained using contrastive learning (CL), performs better on "global" sequence-level property prediction than on finer-resolution tasks, consistent with known CL limitations.
July 15, 2025 at 6:41 PM
In one of the coolest analyses of this paper, @phil-fradkin.bsky.social quantified the distributional differences between mRNA, ncRNA, and genomic regions through their cross-compressibility under a Huffman encoding scheme, reinforcing their distinct regulatory grammars.
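
My loose reconstruction of the idea (the paper's exact procedure may differ; the k-mer size, unseen-k-mer fallback, and per-k-mer averaging are my assumptions): fit a Huffman code to k-mer frequencies from one sequence class, then measure how efficiently that code compresses another class. Poor cross-class compression suggests distinct sequence statistics:

```python
import heapq
from collections import Counter
from itertools import count

def kmer_counts(seqs, k=3):
    return Counter(kmer for s in seqs for kmer in
                   (s[i:i + k] for i in range(len(s) - k + 1)))

def huffman_code_lengths(freqs):
    # Standard Huffman construction; returns bits per symbol.
    tie = count()  # tie-breaker so heapq never compares dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, next(tie), merged))
    return heap[0][2]

def cross_bits(train_seqs, test_seqs, k=3):
    lengths = huffman_code_lengths(kmer_counts(train_seqs, k))
    test = kmer_counts(test_seqs, k)
    # K-mers unseen in training get a pessimistic fallback length.
    fallback = max(lengths.values()) + 1
    total = sum(n * lengths.get(km, fallback) for km, n in test.items())
    return total / sum(test.values())  # average bits per k-mer
```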
July 15, 2025 at 6:41 PM
(1) Unsurprisingly, we find that models pre-trained on mRNA sequences perform better on downstream mRNA tasks.

While that makes biological sense, the result might seem counterintuitive at first: since mRNAs arise from the genome, shouldn't genomic models be able to model mRNA?
July 15, 2025 at 6:41 PM
On these datasets, we assess the embedding quality of almost all existing nucleotide foundation models, including Evo2, RiNALMo, AIDO.RNA, Orthrus, SpliceBERT, and others.

Using linear probing, we conduct over 100K experiments, revealing several insights:
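
A single linear-probing run is lightweight, which is what makes 100K of them feasible. A minimal sketch with stand-in data (real runs would use frozen model embeddings in place of the random `X`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))   # stand-in for frozen embeddings
y = rng.integers(0, 2, size=500)  # stand-in binary task labels

# Fit a simple linear classifier on the frozen representations.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"linear probe accuracy: {probe.score(X_te, y_te):.3f}")
```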
July 15, 2025 at 6:41 PM
In contrast to existing benchmarks, 𝐦𝐑𝐍𝐀𝐁𝐞𝐧𝐜𝐡 focuses on mRNA biology, assessing prediction of:

- mRNA stability
- Mean ribosome loading
- mRNA sub-cellular localization
- RNA-Protein interaction
- Pathogenicity of variants
July 15, 2025 at 6:41 PM