eyes robson
eyesrobson.bsky.social
PhD Candidate at UC Berkeley // y = mx + biology // bioethics, algorithmic fairness, and genomic AI // they/she 🏳️‍⚧️
one quick thing - I saw GUANinE listed under "...benchmarks that do not fine-tune gLMs" and was a bit confused?

...a big chunk of that paper was about fine-tuning our hgT5 gLMs (it was actually the whole motivation for GUANinE -- tl;dr we saw strong gains in functional & conservation tasks)
September 10, 2025 at 2:43 AM
just getting to this, but it looks awesome! 💯
September 10, 2025 at 2:37 AM
nope lol 😆

using all the params in an LM is hard. In genomics I would expect it to conform to extracting features for augmentation (i.e. an LM feature in CADD), just like in protein LMs

www.nature.com/articles/s41...
Learning protein fitness models from evolutionary and assay-labeled data - Nature Biotechnology
A simple machine learning algorithm combines evolutionary and experimental data for improved protein fitness prediction.
www.nature.com
February 4, 2025 at 6:47 PM
the GPRA task was mostly for thoroughness & lack of alternatives at the time -- I designed the GUANinE benchmark with Nilah back in 2021 before lots of human large N, high-throughput methods emerged

however, our follow-up preprint correlates it with model "quality" as Basenji2 < Enformer < Borzoi
February 4, 2025 at 6:16 PM
a proper use for these models in genomics would more likely be preliminary exploration, annotation, and variant calling correction

(but a huge part of the funding & dev pipeline is for biopharma and variant interpretation, not basic science)
February 4, 2025 at 6:13 PM
can't disagree!

the original use case for ELMo and other NLP LMs was pretraining ultra-high parameter models in the absence of large-scale supervised data. genomics only has this absence for novel organisms in GenBank, not humans

www.ncbi.nlm.nih.gov/genbank/stat...
GenBank and WGS Statistics
www.ncbi.nlm.nih.gov
February 4, 2025 at 6:11 PM
I'm optimistic they'll find their niches... eventually... although I expect the field to take a while to figure out how to structure tasks in a scalable way that genomic LMs would succeed at

(e.g. Borzoi's 32 bp RNA-seq vs Xpresso's historical approach of one-gene-is-one-example)
February 4, 2025 at 4:32 AM
love seeing this critique of genomic LMs!

although I've seen pretty strong evidence to suggest they work well on certain tasks like conservation or cCRE recognition, e.g. ~ proceedings.mlr.press/v240/robson2...

(obviously depends on the model, the task... and how predictions are made :) )
GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous mode...
proceedings.mlr.press
February 4, 2025 at 4:27 AM
we also present an undervalued interpretability approach, which decomposes model variance explained into 'interpretable variables' like GC-content etc.

Borzoi and Enformer capture deeper features than the ones we test out, even surprisingly cryptic chromosomal features from sequence alone
January 31, 2025 at 9:35 PM
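A rough illustration of the variance-decomposition idea in the post above: take one 'interpretable variable' (GC-content here) and ask how much of the variance in a model's scores a simple linear fit on that variable accounts for. This is only a minimal sketch, not the preprint's actual procedure; the sequences, the scores, and the helper names `gc_content` and `r_squared` are all hypothetical and chosen for illustration.

```python
from statistics import mean


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def r_squared(x: list[float], y: list[float]) -> float:
    """R^2 of a one-variable least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        # No variance in x or y, so nothing can be explained.
        return 0.0
    return (sxy ** 2) / (sxx * syy)


# Hypothetical toy data: four short sequences and made-up model scores.
sequences = ["ATATATAT", "ATGCATGC", "GGCCATAT", "GCGCGCGC"]
model_scores = [0.10, 0.52, 0.48, 0.95]

gc = [gc_content(s) for s in sequences]
explained = r_squared(gc, model_scores)
print(f"share of score variance explained by GC-content: {explained:.2f}")
```

With several variables (GC-content, CpG density, and so on) the same question would need a multiple regression, and the leftover, unexplained share is where the "deeper features" would live.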
we also examined the accuracy of Borzoi at different bin averaging scales -- for the region-based tasks of GUANinE, more bins = better perf., aside from the accessibility task
January 31, 2025 at 9:26 PM
this just further emphasizes that the biggest opportunity for advancing public health is *increasing access to existing medicines*

be it through single-payer systems (e.g. medicare for all) or publicly developed and distributed medicines
May 23, 2024 at 6:27 PM