TL;DR — LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with data about good AIs helps them become more aligned.
I'm excited to be on the faculty job market this fall. I just updated my website with my CV.
stephencasper.com
fortune.com/2025/08/14/w...
www.washingtonpost.com/newsletter/p...
Open-weight LLM safety is both important & neglected. But filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
arxiv.org/abs/2508.031...
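To make the idea concrete, here is a minimal sketch of what filtering dual-use knowledge out of a pre-training corpus can look like. This is an illustrative assumption, not the paper's actual pipeline: it uses a hypothetical keyword blocklist, whereas real filters are typically classifier-based and far more thorough.

```python
# Minimal sketch of pre-training data filtering (illustrative only).
# The blocklist terms and filtering rule are placeholder assumptions,
# not the method from the linked paper.

from typing import Iterable, Iterator

# Hypothetical blocklist of dual-use terms (placeholder values).
BLOCKLIST = {"bioweapon", "pathogen synthesis", "nerve agent"}

def filter_corpus(documents: Iterable[str]) -> Iterator[str]:
    """Yield only documents that contain none of the blocked terms."""
    for doc in documents:
        text = doc.lower()
        if not any(term in text for term in BLOCKLIST):
            yield doc

if __name__ == "__main__":
    corpus = [
        "A tutorial on transformer attention mechanisms.",
        "Step-by-step pathogen synthesis instructions.",  # filtered out
    ]
    for kept in filter_corpus(corpus):
        print(kept)
```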
Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296
I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.