Simon Geisler
@geislersi.bsky.social
Interests: machine learning, algorithms, coding

Current: Machine Learning PhD Student at TU Munich with Prof. Stephan Günnemann

Past: @deep-mind.bsky.social, Bosch Center for Artificial Intelligence, Bosch Connected Services
The results speak for themselves! 📈 On the HarmBench standard behaviors, our REINFORCE-GCG doubles the attack success rate on Llama3 & raises it from 2% to 50% against the circuit-breaker defense! This shows that our method comes closer to estimating the LLMs' true robustness. 🔬 7/
February 27, 2025 at 3:16 PM
What makes our REINFORCE objective better?
① Adaptive: Tailored to the specific LLM being attacked.
② Distributional: Considers the model’s distribution of responses.
③ Semantic: Focuses on genuinely harmful behavior (LLM-as-a-judge), not just a static starting phrase.
(Rough sketch of this objective below.) 5/
February 27, 2025 at 3:16 PM
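The post above describes the objective only at a high level, so here is a minimal Python sketch of what a REINFORCE-style attack objective can look like: sample responses from the attacked LLM, score them with an LLM-as-a-judge reward, and weight their log-likelihoods by reward so that the gradient follows the policy-gradient (REINFORCE) estimator. The interfaces sample_responses, log_prob, and judge_reward are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only (assumed interfaces, not the paper's code):
# a REINFORCE-style attack objective over the attacked model's own generations.

from typing import Callable, List

SampleFn = Callable[[List[int], int], List[List[int]]]   # prompt tokens, k -> k sampled responses
LogProbFn = Callable[[List[int], List[int]], float]      # log p(response | prompt) under the LLM
JudgeFn = Callable[[List[int]], float]                   # LLM-as-a-judge harmfulness reward in [0, 1]


def reinforce_attack_objective(
    prompt_tokens: List[int],
    suffix_tokens: List[int],
    sample_responses: SampleFn,
    log_prob: LogProbFn,
    judge_reward: JudgeFn,
    num_samples: int = 4,
) -> float:
    """Surrogate loss whose gradient (through log_prob) is a REINFORCE estimator.

    Adaptive: the targets are the attacked model's own sampled generations.
    Distributional: the loss averages over several sampled responses.
    Semantic: the judge rewards genuinely harmful completions, not a fixed prefix.
    """
    full_prompt = prompt_tokens + suffix_tokens
    responses = sample_responses(full_prompt, num_samples)
    rewards = [judge_reward(r) for r in responses]
    baseline = sum(rewards) / len(rewards)  # simple mean baseline (an assumption of this sketch)

    # Reward-weighted negative log-likelihood: minimizing it w.r.t. the suffix
    # pushes probability mass toward responses the judge scores as harmful.
    loss = 0.0
    for response, reward in zip(responses, rewards):
        loss -= (reward - baseline) * log_prob(full_prompt, response)
    return loss / len(responses)
```

In a GCG-style discrete search, a scalar objective like this could replace the static-prefix loss when ranking candidate suffix-token substitutions.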
Our solution? REINFORCE adversarial attacks! 💪 We use reinforcement learning principles to guide the attack towards semantically harmful outcomes, considering the distribution of possible LLM responses. It dynamically adapts to the model and targets its actual generation. 🎯 4/
February 27, 2025 at 3:16 PM
What's the problem with existing #LLM adversarial attacks? 🤔 They often just try to make the LLM start its response inappropriately (affirmative objective). The LLM might not complete the response in a harmful way, and the static target does not adapt to model-specific preferences. 2/
February 27, 2025 at 3:16 PM
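To make the shortcoming concrete, here is a minimal sketch of the common affirmative objective: optimizing an adversarial suffix so the model assigns high probability to a fixed starting phrase such as "Sure, here is". The log_prob interface is assumed for illustration and does not correspond to any particular implementation.

```python
# Illustrative sketch of the common "affirmative" objective (assumed interface,
# not any particular implementation): the adversarial suffix is optimized so the
# model assigns high probability to a fixed target prefix such as "Sure, here is".

from typing import Callable, List

LogProbFn = Callable[[List[int], List[int]], float]  # log p(target | prompt) under the LLM


def affirmative_objective(
    prompt_tokens: List[int],
    suffix_tokens: List[int],
    target_prefix_tokens: List[int],  # tokenization of a static phrase, e.g. "Sure, here is"
    log_prob: LogProbFn,
) -> float:
    """Negative log-likelihood of a static affirmative prefix.

    Weaknesses noted in the thread:
      * the target is fixed, so it cannot adapt to the attacked model, and
      * a low loss only controls how the response starts; the model may still
        refuse or stay harmless in the rest of its generation.
    """
    full_prompt = prompt_tokens + suffix_tokens
    return -log_prob(full_prompt, target_prefix_tokens)
```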
Do you think your LLM is robust? ⚠️ With current adversarial attacks, it is hard to find out since they optimize the wrong thing! We fix this with our adaptive, semantic, and distributional objective.

By Günnemann's lab @ TU Munich & Google Research, w/ CAIS support

Here's how we did it. 🧵
February 27, 2025 at 3:16 PM