Lightnews — Scholar-powered news

eyes robson

@eyesrobson.bsky.social

89 followers 260 following 16 posts

PhD Candidate at UC Berkeley // y = mx + biology // bioethics, algorithmic fairness, and genomic AI // they/she 🏳️‍⚧️

eyes robson

@eyesrobson.bsky.social

one quick thing - I saw GUANinE listed under "...benchmarks that do not fine-tune gLMs" and was a bit confused?

...a big chunk of that paper was about fine-tuning our hgT5 gLMs (it was actually the whole motivation for GUANinE -- tl;dr we saw strong gains in functional & conservation tasks)

September 10, 2025 at 2:43 AM

eyes robson

@eyesrobson.bsky.social

just getting to this, but it looks awesome! 💯

September 10, 2025 at 2:37 AM

eyes robson

@eyesrobson.bsky.social

nope lol 😆

using all the params in an LM is hard. In genomics I would expect it to conform to extracting features for augmentation (i.e. an LM feature in CADD), just like in protein LMs

www.nature.com/articles/s41...

Learning protein fitness models from evolutionary and assay-labeled data - Nature Biotechnology

A simple machine learning algorithm combines evolutionary and experimental data for improved protein fitness prediction.

www.nature.com

February 4, 2025 at 6:47 PM

eyes robson

@eyesrobson.bsky.social

the GPRA task was mostly for thoroughness & lack of alternatives at the time -- I designed the GUANinE benchmark with Nilah back in 2021 before lots of human large N, high-throughput methods emerged

however, our follow-up preprint correlates it with model "quality" as Basenji2 < Enformer < Borzoi

February 4, 2025 at 6:16 PM

eyes robson

@eyesrobson.bsky.social

a proper use for these models in genomics would more likely be preliminary exploration, annotation, and variant calling correction

(but a huge part of the funding & dev pipeline is for biopharma and variant interpretation, not basic science)

February 4, 2025 at 6:13 PM

eyes robson

@eyesrobson.bsky.social

can't disagree!

the original use case for ELMo and other NLP LMs was pretraining ultra-high parameter models in the absence of large-scale supervised data. genomics only has this absence on novel organisms in GenBank, not humans

www.ncbi.nlm.nih.gov/genbank/stat...

GenBank and WGS Statistics

www.ncbi.nlm.nih.gov

February 4, 2025 at 6:11 PM

eyes robson

@eyesrobson.bsky.social

I'm optimistic they'll find their niches... eventually... although I expect the field to take a while to figure out how to structure tasks in a scalable way that genomic LMs would succeed at

(e.g. Borzoi's 32 bp RNA-seq vs Xpresso's historical approach of one-gene-is-one-example)

February 4, 2025 at 4:32 AM

eyes robson

@eyesrobson.bsky.social

love seeing this critique of genomic LMs!

although I've seen pretty strong evidence to suggest they work well on certain tasks like conservation or cCRE recognition, e.g. ~ proceedings.mlr.press/v240/robson2...

(obviously depends on the model, the task... and how predictions are made :) )

GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models

Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous mode...

proceedings.mlr.press

February 4, 2025 at 4:27 AM

eyes robson

@eyesrobson.bsky.social

we also present an undervalued interpretability approach, which decomposes model variance explained into 'interpretable variables' like GC-content etc.

Borzoi and Enformer capture deeper features than the ones we test out, even surprisingly cryptic chromosomal features from sequence alone

A figure of four stacked bar plots. Each bar plot sums to the R2 variance explained by a (model x task) combination, for the models Enformer and Borzoi. The two GUANinE tasks are dnase-propensity (accessibility, left two) and cons30 (conservation, right two). The blue portion of each bar is uniquely captured variance from the deep models, while the orange portion of each bar is shared variance (i.e. these models have absorbed useful signal from these factors, which include chromosomal variables like chromosome size). Nearly invisible, save for the rightmost (Borzoi, cons30) task, is the green, interpretable variable (IV) only portion of each bar -- this means Enformer and Borzoi have extracted all of the useful signal, subsuming these features entirely, despite never having seen them during sequence-only training.

January 31, 2025 at 9:35 PM
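
The post above splits each bar into model-only, shared, and IV-only variance. It does not say how the split was computed, so what follows is only a minimal sketch of one standard way to get such a split (two-set commonality analysis), assuming in-sample ordinary least squares. The names `r2_ols` and `partition_variance` and the feature arrays are hypothetical, not taken from the paper.

```python
import numpy as np

def r2_ols(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def partition_variance(model_feats, iv_feats, y):
    """Split R^2 into model-only, IV-only, and shared parts.

    model_feats: (n, p) features derived from a deep model (assumed input)
    iv_feats:    (n, q) interpretable variables such as GC-content (assumed input)
    y:           (n,)   task labels
    """
    r2_model = r2_ols(model_feats, y)
    r2_iv = r2_ols(iv_feats, y)
    r2_full = r2_ols(np.column_stack([model_feats, iv_feats]), y)
    return {
        "model_only": r2_full - r2_iv,         # gained by adding model features to IVs
        "iv_only": r2_full - r2_model,         # gained by adding IVs to model features
        "shared": r2_model + r2_iv - r2_full,  # explained by either set alone
        "total": r2_full,
    }
```

The three parts sum to `total` by construction. An `iv_only` value near zero corresponds to the nearly invisible green portion described in the figure, and `shared` can come out slightly negative when one feature set acts as a suppressor for the other.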

eyes robson

@eyesrobson.bsky.social

we also examined the accuracy of Borzoi at different bin averaging scales -- for the region-based tasks of GUANinE, more bins = better perf., aside from the accessibility task

A 3x2 table of six plots, each showing the L2-regularized few-shot performance of Borzoi on a different GUANinE task, with the x-axes showing different bin averaging sizes (corresponding to 32 to 416bp). Most plots show performance slightly up and to the right, except for the top-left plot for the dnase-propensity task (an accessibility task). Enformer's 3-bin performance is also present in each plot, usually below Borzoi's at the same x-axis resolution (384 bp), except for the two conservation tasks (center), where Enformer significantly outperforms Borzoi.

January 31, 2025 at 9:26 PM
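
For reference, 32 to 416 bp at Borzoi's 32 bp resolution corresponds to averaging 1 to 13 bins, and 384 bp is 12 bins. A rough sketch of that kind of evaluation follows, assuming the averaged bins are the central ones and that the few-shot model is a plain closed-form ridge regression; the post does not confirm either detail, and `center_bin_average` and `ridge_fit` are hypothetical helper names.

```python
import numpy as np

BIN_BP = 32  # Borzoi output resolution mentioned in the post

def center_bin_average(preds, n_bins):
    """Average the n_bins central bins of preds, shaped (examples, bins, tracks).

    Taking the *central* bins is an assumption made for this sketch.
    """
    total = preds.shape[1]
    if not 1 <= n_bins <= total:
        raise ValueError("n_bins must be between 1 and the number of bins")
    start = (total - n_bins) // 2
    return preds[:, start:start + n_bins, :].mean(axis=1)

def ridge_fit(X, y, alpha=1.0):
    """Closed-form L2-regularized regression; the intercept is left unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b
```

Sweeping `n_bins` from 1 to 13, fitting `ridge_fit` on the averaged features for each value, and scoring on held-out examples would give one curve per task, similar in spirit to the plots described above.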

eyes robson

@eyesrobson.bsky.social

this just further emphasizes that the biggest opportunity for advancing public health is *increasing access to existing medicines*

be it through single-payer systems (e.g. medicare for all) or publicly developed and distributed medicines

May 23, 2024 at 6:27 PM
