Jakub Łucki
@jakublucki.bsky.social
Visiting Researcher at NASA JPL | Data Science MSc at ETH Zurich
An Adversarial Perspective on Machine Unlearning for AI Safety
🏆 Best paper award @ SoLaR Workshop

📅 Sat 14 Dec. Poster at 11am and Talk in the afternoon.
📍 Room West Meeting 121, 122
Paper: arxiv.org/abs/2409.18025
December 10, 2024 at 2:00 AM
Eager to experiment? Reproduce our findings with this code: github.com/ethz-spylab/...
December 6, 2024 at 5:53 PM
Our findings highlight that

1️⃣ Robust unlearning is not yet possible; current methods face challenges similar to those of safety training.

2️⃣ Black-box evaluations can be misleading when assessing the effectiveness of unlearning.
December 6, 2024 at 5:50 PM
Fine-tuning “unlearned” models on benign datasets can completely restore hazardous knowledge.

Fine-tuning on dangerous knowledge leads to disproportionately fast recovery of hazardous capabilities (as few as 10 samples recover over 60% of the removed capabilities).
December 6, 2024 at 5:50 PM
🔡 GCG can be adapted to generate universal adversarial prefixes.

↗️ As with safety training, unlearning relies on specific directions in the residual stream that can be ablated (sketched below).

✂️ We can prune neurons responsible for “obfuscating” dangerous knowledge.
December 6, 2024 at 5:49 PM
How did we check this?

We adapted several white-box attacks used to jailbreak safety-trained models and applied them to two prominent unlearning methods, RMU and NPO.
December 6, 2024 at 5:48 PM
Safety training fine-tunes models to refuse harmful requests but can be easily jailbroken.

Machine unlearning was introduced to fully erase hazardous knowledge, making it inaccessible to adversaries.

Sounds amazing, right? Well, existing methods cannot do this (yet).
December 6, 2024 at 5:47 PM