Simon Geisler
@geislersi.bsky.social
Interests: machine learning, algorithms, coding

Current: Machine Learning PhD Student at TU Munich with Prof. Stephan Günnemann

Past: @deep-mind.bsky.social, Bosch Center for Artificial Intelligence, Bosch Connected Services
The results speak for themselves! 📈 On the HarmBench standard behaviors, our REINFORCE-GCG doubles the attack success rate on Llama3 & raises it from 2% to 50% against the circuit-breaker defense! This shows that our method comes closer to estimating the LLMs' true robustness. 🔬 7/
February 27, 2025 at 3:16 PM
What makes our REINFORCE objective better?
① Adaptive: Tailored to the specific LLM being attacked.
② Distributional: Considers the model’s distribution of responses.
③ Semantic: Focuses on genuinely harmful behavior (LLM-as-a-judge), not just a static starting phrase.
(Rough sketch of this objective below.) 5/
February 27, 2025 at 3:16 PM
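The post above describes the objective only at a high level, so here is a minimal Python sketch of what a REINFORCE-style attack objective can look like: sample responses from the attacked LLM, score them with an LLM-as-a-judge reward, and weight their log-likelihoods by reward so that the gradient follows the policy-gradient (REINFORCE) estimator. The interfaces sample_responses, log_prob, and judge_reward are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only (assumed interfaces, not the paper's code):
# a REINFORCE-style attack objective over the attacked model's own generations.

from typing import Callable, List

SampleFn = Callable[[List[int], int], List[List[int]]]   # prompt tokens, k -> k sampled responses
LogProbFn = Callable[[List[int], List[int]], float]      # log p(response | prompt) under the LLM
JudgeFn = Callable[[List[int]], float]                   # LLM-as-a-judge harmfulness reward in [0, 1]


def reinforce_attack_objective(
    prompt_tokens: List[int],
    suffix_tokens: List[int],
    sample_responses: SampleFn,
    log_prob: LogProbFn,
    judge_reward: JudgeFn,
    num_samples: int = 4,
) -> float:
    """Surrogate loss whose gradient (through log_prob) is a REINFORCE estimator.

    Adaptive: the targets are the attacked model's own sampled generations.
    Distributional: the loss averages over several sampled responses.
    Semantic: the judge rewards genuinely harmful completions, not a fixed prefix.
    """
    full_prompt = prompt_tokens + suffix_tokens
    responses = sample_responses(full_prompt, num_samples)
    rewards = [judge_reward(r) for r in responses]
    baseline = sum(rewards) / len(rewards)  # simple mean baseline (an assumption of this sketch)

    # Reward-weighted negative log-likelihood: minimizing it w.r.t. the suffix
    # pushes probability mass toward responses the judge scores as harmful.
    loss = 0.0
    for response, reward in zip(responses, rewards):
        loss -= (reward - baseline) * log_prob(full_prompt, response)
    return loss / len(responses)
```

In a GCG-style discrete search, a scalar objective like this could replace the static-prefix loss when ranking candidate suffix-token substitutions.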
Our solution? REINFORCE adversarial attacks! 💪 We use reinforcement learning principles to guide the attack towards semantically harmful outcomes, considering the distribution of possible LLM responses. It dynamically adapts to the model and targets its actual generation. 🎯 4/
February 27, 2025 at 3:16 PM
What's the problem with existing #LLM adversarial attacks? 🤔 They often just try to make the LLM start its response inappropriately (affirmative objective). The LLM might not complete the response in a harmful way, and the static target does not adapt to model-specific preferences. 2/
February 27, 2025 at 3:16 PM
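To make the shortcoming concrete, here is a minimal sketch of the common affirmative objective: optimizing an adversarial suffix so the model assigns high probability to a fixed starting phrase such as "Sure, here is". The log_prob interface is assumed for illustration and does not correspond to any particular implementation.

```python
# Illustrative sketch of the common "affirmative" objective (assumed interface,
# not any particular implementation): the adversarial suffix is optimized so the
# model assigns high probability to a fixed target prefix such as "Sure, here is".

from typing import Callable, List

LogProbFn = Callable[[List[int], List[int]], float]  # log p(target | prompt) under the LLM


def affirmative_objective(
    prompt_tokens: List[int],
    suffix_tokens: List[int],
    target_prefix_tokens: List[int],  # tokenization of a static phrase, e.g. "Sure, here is"
    log_prob: LogProbFn,
) -> float:
    """Negative log-likelihood of a static affirmative prefix.

    Weaknesses noted in the thread:
      * the target is fixed, so it cannot adapt to the attacked model, and
      * a low loss only controls how the response starts; the model may still
        refuse or stay harmless in the rest of its generation.
    """
    full_prompt = prompt_tokens + suffix_tokens
    return -log_prob(full_prompt, target_prefix_tokens)
```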
Do you think your LLM is robust? ⚠️ With current adversarial attacks, it is hard to find out since they optimize the wrong thing! We fix this with our adaptive, semantic, and distributional objective.

By Günnemann's lab @ TU Munich & Google Research, w/ CAIS support

Here's how we did it. 🧵
February 27, 2025 at 3:16 PM