Jakub Łucki
@jakublucki.bsky.social
Visiting Researcher at NASA JPL | Data Science MSc at ETH Zurich
Fine-tuning “unlearned” models on benign datasets can completely restore hazardous knowledge.

Fine-tuning on dangerous knowledge leads to disproportionately fast recovery of hazardous capabilities: as few as 10 samples regain more than 60% of the capability.
December 6, 2024 at 5:50 PM
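
A minimal sketch of the fine-tuning attack described in the post above, assuming a Hugging Face causal LM. The checkpoint name, the benign samples, and the hyperparameters are illustrative placeholders, not the actual experimental setup.

```python
# Minimal sketch: fine-tuning an "unlearned" model on a handful of benign
# samples. Model name, data, and hyperparameters below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/unlearned-llm"  # hypothetical unlearned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.train()

# Ordinary, harmless text; per the post, roughly ten samples can already
# restore much of the supposedly removed capability.
benign_samples = [
    "Photosynthesis converts light energy into chemical energy.",
    "The Rhine flows through Switzerland, Germany, and the Netherlands.",
    # ... a few more benign sentences
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for _ in range(3):  # a few epochs over the tiny dataset
    for text in benign_samples:
        batch = tokenizer(text, return_tensors="pt").to(device)
        # Standard causal-LM objective: labels are the input ids themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Recovery would then be measured by re-running a hazardous-knowledge
# benchmark (e.g. WMDP-style multiple choice) on the fine-tuned model.
```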
🔡 GCG can be adapted to generate universal adversarial prefixes.

↗️ As with safety training, unlearning relies on specific directions in the residual stream that can be ablated (sketched below).

✂️ We can prune neurons responsible for “obfuscating” dangerous knowledge.
December 6, 2024 at 5:49 PM
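
To illustrate the residual-stream point in the middle item above, here is a minimal sketch of directional ablation, assuming a Llama-style Hugging Face model. The checkpoint, the layer index, and the direction itself are assumptions; in practice the direction would be estimated from activations (e.g. a difference of means on hazardous vs. benign prompts), not drawn at random.

```python
# Minimal sketch: ablating a single "unlearning direction" from the
# residual stream via a forward hook. Checkpoint, layer index, and the
# direction are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-org/unlearned-llm")
layer_idx = 12  # illustrative layer choice
d_model = model.config.hidden_size

direction = torch.randn(d_model)
direction = direction / direction.norm()  # unit vector to project out

def ablate(module, inputs, output):
    # Llama-style decoder layers return a tuple; hidden states come first.
    hidden = output[0]
    # h <- h - (h · d) d : remove the component along the direction.
    coeff = (hidden @ direction).unsqueeze(-1)
    return (hidden - coeff * direction,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(ablate)
# ... generate or evaluate with the direction removed ...
handle.remove()
```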
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇
December 6, 2024 at 5:47 PM