Jakub Łucki
@jakublucki.bsky.social
Visiting Researcher at NASA JPL | Data Science MSc at ETH Zurich
Fine-tuning “unlearned” models on benign datasets can completely restore hazardous knowledge.

Fine-tuning on dangerous knowledge leads to disproportionately fast recovery of hazardous capabilities: as few as 10 samples regain more than 60% of the capability.
December 6, 2024 at 5:50 PM
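
A minimal sketch of the fine-tuning attack described in the post above, assuming a Hugging Face causal LM. The checkpoint name, the benign samples, and the hyperparameters are illustrative placeholders, not the actual experimental setup.

```python
# Minimal sketch: fine-tuning an "unlearned" model on a handful of benign
# samples. Model name, data, and hyperparameters below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/unlearned-llm"  # hypothetical unlearned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.train()

# Ordinary, harmless text; per the post, roughly ten samples can already
# restore much of the supposedly removed capability.
benign_samples = [
    "Photosynthesis converts light energy into chemical energy.",
    "The Rhine flows through Switzerland, Germany, and the Netherlands.",
    # ... a few more benign sentences
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for _ in range(3):  # a few epochs over the tiny dataset
    for text in benign_samples:
        batch = tokenizer(text, return_tensors="pt").to(device)
        # Standard causal-LM objective: labels are the input ids themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Recovery would then be measured by re-running a hazardous-knowledge
# benchmark (e.g. WMDP-style multiple choice) on the fine-tuned model.
```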
🔡 GCG can be adapted to generate universal adversarial prefixes.

↗️ As with safety training, unlearning relies on specific directions in the residual stream that can be ablated (sketched below).

✂️ We can prune neurons responsible for “obfuscating” dangerous knowledge.
December 6, 2024 at 5:49 PM
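
To illustrate the residual-stream point in the middle item above, here is a minimal sketch of directional ablation, assuming a Llama-style Hugging Face model. The checkpoint, the layer index, and the direction itself are assumptions; in practice the direction would be estimated from activations (e.g. a difference of means on hazardous vs. benign prompts), not drawn at random.

```python
# Minimal sketch: ablating a single "unlearning direction" from the
# residual stream via a forward hook. Checkpoint, layer index, and the
# direction are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-org/unlearned-llm")
layer_idx = 12  # illustrative layer choice
d_model = model.config.hidden_size

direction = torch.randn(d_model)
direction = direction / direction.norm()  # unit vector to project out

def ablate(module, inputs, output):
    # Llama-style decoder layers return a tuple; hidden states come first.
    hidden = output[0]
    # h <- h - (h · d) d : remove the component along the direction.
    coeff = (hidden @ direction).unsqueeze(-1)
    return (hidden - coeff * direction,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(ablate)
# ... generate or evaluate with the direction removed ...
handle.remove()
```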
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇
December 6, 2024 at 5:47 PM