Shamelessly copied from a slack message.
t.co/CVkAKNXZme
They tested filtering species/genus data as a defense against adversarial fine-tuning. It didn't work well. This suggests filtering may work better when applied to entire tasks/domains rather than to specific instances.
arxiv.org/abs/2510.27629
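To make the instance-vs-domain distinction concrete, here's a rough Python sketch of the two filtering granularities. The term lists, threshold, and sample document are invented for illustration; this is not any paper's actual pipeline.

# A rough sketch (not the paper's pipeline) contrasting the two granularities.
# All term lists, the threshold, and the sample document are hypothetical.

INSTANCE_BLOCKLIST = {"example_species_x", "example_genus_y"}    # instance-level: specific entities
DOMAIN_KEYWORDS = {"pathogen", "virulence", "transmissibility"}  # domain-level: broad topic terms

def keep_instance_level(doc: str) -> bool:
    """Keep a document unless it names a specifically blocked entity."""
    text = doc.lower()
    return not any(term in text for term in INSTANCE_BLOCKLIST)

def keep_domain_level(doc: str, threshold: float = 0.05) -> bool:
    """Keep a document unless domain keywords make up too large a share of it."""
    words = doc.lower().split()
    if not words:
        return True
    hits = sum(word.strip(".,") in DOMAIN_KEYWORDS for word in words)
    return hits / len(words) < threshold

doc = "Notes on pathogen virulence and transmissibility in example_genus_y."
print(keep_instance_level(doc))  # False: the doc names a blocked entity
print(keep_domain_level(doc))    # False: domain keyword density exceeds the threshold

The point is just that the first filter only catches documents that literally name the blocked entities, while the second drops anything that reads as being about the domain as a whole.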
We showed that filtering biothreat-related pretraining data is SOTA for making models resist adversarial fine-tuning. We proposed an amendment to the hypothesis from papers 1 and 2 above.
deepignorance.ai
They reported an instance where filtering biothreat data didn't have a big impact. But without more info on how and how much they filtered, it's hard to draw strong conclusions.
arxiv.org/abs/2508.03153
They found results similar to the safety pretraining paper -- that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.
arxiv.org/abs/2505.04741
The experiment of theirs that was most interesting to me found that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.
arxiv.org/abs/2504.16980
It appears that state AI bills -- many of which big tech has fought tooth and nail to prevent -- are categorically regulatory capture.
t.co/Ag4J6rrejz
Thx Epoch & Bhandari et al.
From a technical perspective, open-weight model safety is AI safety in hard mode. But there's still a lot of progress to be made. Our new paper covers 16 open problems.
🧵🧵🧵
admin.iwf.org.uk/media/nadlc...
I'm increasingly persuaded that the only quantitative measures that matter anymore are usage stats & profit.
Now that Moonshot claims Kimi K2 Thinking is SOTA, it seems, uh, less than ideal that it came with zero reporting related to safety/risk.
unicode.org/L2/L2025/252...
t.co/yJfp8ezU64
Of course -- that's obvious. Nobody would ever dispute that.
So then why are we saying that?
Maybe it's a little too obvious...
t.co/us8MEhMrIh