Building foundation models for genomics!
A Colab notebook is available: colab.research.google.com/drive/1VZF5N...
Details on our findings are on BioRxiv: biorxiv.org/content/10.1...
We hope that this experimental setup and others like it can inform new directions for nucleotide foundation model development.
Models are exposed to either of two sequence elements that promote translation, but never both together, and we task them with predicting the unseen combination.
A common source of data leakage is sequence homology, leading to overestimation of performance without careful data splits. We demonstrate the impact of improper splitting in our tasks.
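One way to avoid this kind of leakage is to split by homology cluster rather than by individual sequence. The sketch below is a minimal illustration, not the split procedure used in the paper; it assumes every sequence already has a cluster label (for example from a sequence-clustering tool or a gene-family annotation). The function name `homology_aware_split`, the `cluster_of` mapping, and the toy IDs are all hypothetical.

```python
import random
from collections import defaultdict

def homology_aware_split(seq_ids, cluster_of, test_frac=0.2, seed=0):
    """Assign whole homology clusters to train or test, never splitting one."""
    clusters = defaultdict(list)
    for sid in seq_ids:
        clusters[cluster_of[sid]].append(sid)

    names = sorted(clusters)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)

    target = test_frac * len(seq_ids)     # approximate test-set size
    train, test = [], []
    for name in names:
        # Fill the test set cluster by cluster until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(clusters[name])
    return train, test

# Toy example: hypothetical transcript IDs mapped to hypothetical homology clusters.
cluster_of = {
    "tx1": "famA", "tx2": "famA",
    "tx3": "famB",
    "tx4": "famC", "tx5": "famC",
    "tx6": "famD",
}
train, test = homology_aware_split(list(cluster_of), cluster_of)

train_clusters = {cluster_of[s] for s in train}
test_clusters = {cluster_of[s] for s in test}
print("shared clusters:", train_clusters & test_clusters)
```

Because clusters are assigned as units, homologous sequences cannot sit on both sides of the split. A random per-sequence split gives no such guarantee.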
Orthrus, trained using contrastive learning (CL), performs better on "global" sequence-level property prediction than on finer-resolution tasks, consistent with known CL limitations.
While that makes biological sense, the result might seem counterintuitive at first -- since mRNAs arise from the genome, shouldn't genomic models be able to model mRNA?
Using linear probing, we conduct over 100K experiments, revealing several insights:
- mRNA stability
- Mean ribosome loading
- mRNA sub-cellular localization
- RNA-Protein interaction
- Pathogenicity of variants
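For readers unfamiliar with linear probing: the pretrained model is frozen, its embeddings are extracted once, and only a linear model is fit on top for each task. The sketch below is a minimal illustration using closed-form ridge regression in NumPy on synthetic data. The random "embeddings", the `fit_linear_probe` helper, and the regularisation strength are assumptions for illustration, not the setup used in the paper.

```python
import numpy as np

def fit_linear_probe(train_emb, train_y, l2=1e-3):
    """Fit ridge regression on frozen embeddings; returns (weights, bias)."""
    n = train_emb.shape[0]
    X = np.hstack([train_emb, np.ones((n, 1))])   # append a bias column
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0                             # do not penalise the bias term
    coef = np.linalg.solve(X.T @ X + reg, X.T @ train_y)
    return coef[:-1], coef[-1]

def predict(emb, weights, bias):
    """Apply the fitted linear probe to new embeddings."""
    return emb @ weights + bias

# Synthetic stand-in for embeddings produced by a frozen foundation model.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
y = emb @ true_w + 0.1 * rng.normal(size=200)     # e.g. a scalar property per sequence

w, b = fit_linear_probe(emb[:150], y[:150])
pred = predict(emb[150:], w, b)
r = np.corrcoef(pred, y[150:])[0, 1]              # Pearson correlation on held-out data
print(f"held-out Pearson r: {r:.3f}")
```

Since only the linear layer is trained, differences in probe performance reflect what the frozen embeddings already encode. That is also what makes a large number of such experiments affordable.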