Cas (Stephen Casper)
@scasper.bsky.social
AI technical gov & risk management research. PhD student @MIT_CSAIL, fmr. UK AISI. I'm on the CS faculty job market! https://stephencasper.com/
See this paper for more of my thoughts.

papers.ssrn.com/sol3/papers...
November 26, 2025 at 4:00 PM
For more thoughts, see our agenda paper.

t.co/CVkAKNXZme
November 25, 2025 at 8:00 PM
In general, it's still hard to study the impacts of data filtering because the experiments are expensive, & developers don't generally report much about what they do. For example, we found very limited and inconsistent reporting in a recent analysis.
t.co/CVkAKNXZme
November 25, 2025 at 8:00 PM
Those are the key recent papers that I know of. Do you know of any others???
November 25, 2025 at 8:00 PM
5. Biorisk evals paper (Nov 2025)

They tested filtering of species/genus data against adversarial fine-tuning. It didn't work well. This suggests filtering may work better when applied to entire tasks/domains rather than to specific instances.

arxiv.org/abs/2510.27629
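
To make the instance-vs-domain distinction concrete, here's a minimal Python sketch. This is not the paper's actual pipeline -- the term list, toy classifier, and threshold are just illustrative placeholders.

```python
# Hypothetical sketch contrasting instance-level filtering (drop documents that
# mention specific flagged terms) with domain-level filtering (drop anything a
# topic classifier assigns to the whole domain). The term list, toy classifier,
# and threshold are illustrative placeholders, not the paper's pipeline.

FLAGGED_TERMS = {"example_species_a", "example_genus_b"}  # placeholder instances

def instance_level_filter(docs):
    """Remove only documents that mention a specific flagged instance."""
    return [d for d in docs if not any(t in d.lower() for t in FLAGGED_TERMS)]

def domain_level_filter(docs, domain_score, threshold=0.25):
    """Remove any document scored as belonging to the broader domain."""
    return [d for d in docs if domain_score(d) < threshold]

def toy_domain_score(doc):
    """Stand-in for a trained topic classifier: keyword density in [0, 1]."""
    keywords = ("virology", "pathogen", "species", "genus", "culture protocol")
    return sum(k in doc.lower() for k in keywords) / len(keywords)

corpus = [
    "Culture protocol for example_species_a ...",
    "Intro virology lecture notes on pathogen families ...",
    "Unrelated blog post about sourdough ...",
]

print(len(instance_level_filter(corpus)))                  # keeps 2 of 3
print(len(domain_level_filter(corpus, toy_domain_score)))  # keeps 1 of 3
```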
November 25, 2025 at 8:00 PM
4. Deep ignorance paper (August 2025) @kyletokens.bsky.social

We showed that filtering biothreat-related pretraining data is SOTA for making models resist adversarial fine-tuning. We proposed an amendment to the hypothesis from papers 1 and 2 above.

deepignorance.ai
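
For intuition on what "resisting adversarial fine-tuning" means operationally, here's a rough sketch of a tamper-resistance check: fine-tune the released weights on adversarial data and compare hazardous-knowledge eval scores before and after. The load_model, finetune, and evaluate helpers and the step budget are hypothetical stand-ins, not our actual setup.

```python
# Hedged sketch of a tamper-resistance evaluation. The helpers passed in here
# (load_model, finetune, evaluate) are hypothetical stand-ins for a real
# training/eval stack; the step budget is illustrative.

def tamper_resistance_check(load_model, finetune, evaluate,
                            adversarial_data, hazard_eval, budget_steps=1000):
    """Measure how much hazardous capability an attacker can recover."""
    model = load_model()
    score_before = evaluate(model, hazard_eval)

    # An attacker with weight access fine-tunes on data targeting the
    # filtered capability, within some compute budget.
    attacked = finetune(model, adversarial_data, max_steps=budget_steps)
    score_after = evaluate(attacked, hazard_eval)

    # If pretraining filtering worked, score_after should stay near chance;
    # a model that merely learned to refuse tends to recover quickly.
    return {"before": score_before, "after": score_after,
            "uplift": score_after - score_before}
```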
November 25, 2025 at 8:00 PM
3. Estimating worst-case open-weight risks paper (Aug 2025)

They reported an instance where filtering biothreat data didn't have a big impact. But without more info on how and how much they filtered, it's hard to draw strong conclusions.

arxiv.org/abs/2508.03153
November 25, 2025 at 8:00 PM
2. Bad data --> good models paper (May 2025)

They found results similar to the safety pretraining paper's -- that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.

arxiv.org/abs/2505.04741
November 25, 2025 at 8:00 PM
1. Safety pretraining paper (Apr 2025)

The experiment of theirs that was most interesting to me found that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.

arxiv.org/abs/2504.16980
November 25, 2025 at 8:00 PM
For example, imagine a person who has never heard toxic speech versus a person who has never studied virology. It would be much easier for the first person to learn to reliably say toxic things than for the second to learn to reliably say true things about virology.
November 25, 2025 at 8:00 PM
My working hypothesis involves shameless anthropomorphization. Imagine a human who has never learned about X. If it would take them a long time to learn X given what they already know, I bet it'll be possible to make a model robustly resist knowledge extraction attacks for X.
November 25, 2025 at 8:00 PM
Just as building the science of open-weight model risk management will provide a collective good, it will also require collective effort.
November 12, 2025 at 2:17 PM
We also find that, currently, prominent open-weight model developers often either do not implement mitigations or do not report on them. So there is a lot of room for more innovation and information as the science grows.
November 12, 2025 at 2:17 PM
In response, we cover 16 open technical problems with *unique* implications for open-weight model safety. They span the model lifecycle across training data curation, training algorithms, evaluations, deployment, and ecosystem monitoring.
x.com/StephenLCasp...
November 12, 2025 at 2:17 PM
Taking AI safety seriously increasingly means taking open-weight models seriously.
November 12, 2025 at 2:17 PM
Empirical harms enabled by open models are also mounting. For example, the Internet Watch Foundation has found that they are the tools of choice for generating non-consensual AI deepfakes depicting children.
t.co/Ag4J6rrejz
November 12, 2025 at 2:17 PM
Most importantly, powerful open-weight models are probably inevitable. For example, in recent years, they have steadily grown in their prominence, capabilities, and influence. Here are two nice graphics I often point to.

Thx Epoch & Bhandari et al.
November 12, 2025 at 2:17 PM
Compared to proprietary models, open-weight models pose different opportunities and problems. I often say that they are simultaneously wonderful and terrible. For example, they allow for more open research and testing, but they can also be arbitrarily tampered with.
November 12, 2025 at 2:17 PM
Here's the paper:
t.co/CVkAKNXZme
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5705186
November 12, 2025 at 2:17 PM