Lightnews — Scholar-powered news

Cas (Stephen Casper)

@scasper.bsky.social

See this paper for more of my thoughts.

papers.ssrn.com/sol3/papers...

November 26, 2025 at 4:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

By coincidence, I just stumbled into this paper today shortly after posting this.
arxiv.org/abs/2511.19299

Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning

Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language models (gLMs), have demonstrated impre...

November 25, 2025 at 9:30 PM

Cas (Stephen Casper)

@scasper.bsky.social

For more thoughts, see our agenda paper.

t.co/CVkAKNXZme

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

In general, it's still hard to study the impacts of data filtering because the experiments are expensive, & developers don't generally report much about what they do. For example, we found very limited/inconsistent reporting in some recent analysis.
t.co/CVkAKNXZme

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

Those are the key recent papers that I know of. Do you know of any others???

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

5. Biorisk evals paper (Nov 2025)

They tested filtration of species/genus data against adv. fine-tuning. It didn't work well. This suggests filtering may work better if applied to entire tasks/domains rather than specific instances.

arxiv.org/abs/2510.27629

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

4. Deep ignorance paper (August 2025) @kyletokens.bsky.social

We showed that filtering biothreat-related pretraining data is SOTA for making models resist adversarial fine-tuning. We proposed an amendment to the hypothesis from papers 1 and 2 above.

deepignorance.ai

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

3. Estimating worst-case open-weight risks paper (Aug 2025)

They reported an instance where filtering biothreat data didn't have a big impact. But without more info on how and how much they filtered, it's hard to draw strong conclusions.

arxiv.org/abs/2508.03153

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

2. Bad data --> good models paper (May 2025)

They found similar results to the safety pretraining paper -- that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.

arxiv.org/abs/2505.04741

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

1. Safety pretraining paper (Apr 2025)

The experiment of theirs that was most interesting to me found that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.

arxiv.org/abs/2504.16980

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

For example, imagine a person who has never heard toxic speech versus a person who has never studied virology. It would be much easier for the first person to learn to reliably say toxic things than for the second to learn to reliably say true things about virology.

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

My working hypothesis involves shameless anthropomorphization. Imagine a human who has never learned about X. If it would take them a long time to learn X given what they already know, I bet it'll be possible to make a model robustly resist knowledge extraction attacks for X.

November 25, 2025 at 8:00 PM

Cas (Stephen Casper)

@scasper.bsky.social

Just as building the science of open-weight model risk management will provide a collective good, it will also require collective effort.

November 12, 2025 at 2:17 PM

Cas (Stephen Casper)

@scasper.bsky.social

We also find that prominent open-weight model developers currently often either do not implement mitigations or do not report on them. So there is a lot of room for more innovation and information as the science grows.

November 12, 2025 at 2:17 PM

Cas (Stephen Casper)

@scasper.bsky.social

In response, we cover 16 open technical problems with *unique* implications for open-weight model safety. They span the model lifecycle across training data curation, training algorithms, evaluations, deployment, and ecosystem monitoring.
x.com/StephenLCasp...

November 12, 2025 at 2:17 PM

Cas (Stephen Casper)

@scasper.bsky.social

Taking AI safety seriously increasingly means taking open-weight models seriously.

November 12, 2025 at 2:17 PM

Cas (Stephen Casper)

@scasper.bsky.social

Empirical harms enabled by open models are also mounting. For example, the Internet Watch Foundation has found that they are the tools of choice for generating non-consensual AI deepfakes depicting children.
t.co/Ag4J6rrejz

November 12, 2025 at 2:17 PM

Cas (Stephen Casper)

@scasper.bsky.social

Most importantly, powerful open-weight models are probably inevitable. For example, in recent years, they have steadily grown in their prominence, capabilities, and influence. Here are two nice graphics I often point to.

Thx Epoch & Bhandari et al.

November 12, 2025 at 2:17 PM

Cas (Stephen Casper)

@scasper.bsky.social

Compared to proprietary models, open-weight models pose different opportunities and problems. I often say that they are simultaneously wonderful and terrible. For example, they allow for more open research and testing, but they can also be arbitrarily tampered with.

November 12, 2025 at 2:17 PM

Cas (Stephen Casper)

@scasper.bsky.social

Here's the paper:
t.co/CVkAKNXZme

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5705186

November 12, 2025 at 2:17 PM
