Simon Geisler
@geislersi.bsky.social
Interests: machine learning, algorithms, coding

Current: Machine Learning PhD Student at TU Munich with Prof. Stephan Günnemann

Past: @deep-mind.bsky.social, Bosch Center for Artificial Intelligence, Bosch Connected Services
This work was possible thanks to my awesome coauthors @wollschlager.bsky.social, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann, and the support of the Center for AI Safety (CAIS) as well as Google Research. 9/
February 27, 2025 at 3:16 PM
Want to dive deeper and better understand LLM robustness? Read our paper for details on our REINFORCE objective, its implementation, and experimental evaluation. Let us know what you think! 💬👇 arxiv.org/abs/2502.17254 #AI #MachineLearning #AISafety 8/
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
The results speak for themselves! 📈 On HarmBench's standard behaviors, our REINFORCE-GCG doubles the attack success rate on Llama 3 & raises it from 2% to 50% against the circuit-breaker defense! This shows that our method gets closer to estimating the LLMs' true robustness. 🔬 7/
Our objective can be used with various attack algorithms. In the context of jailbreaks and with an appropriate reward, maximizing the expected reward turns out to be equivalent to maximizing the probability that the model responds in a harmful manner. 6/
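To make that equivalence concrete, here's a minimal sketch (not the paper's code): with a binary judge reward r(y) ∈ {0, 1}, the expected reward under the attacked model's response distribution is exactly the probability of a harmful response, and you can estimate it by sampling. `generate` and `judge` below are hypothetical stand-ins for the attacked LLM and an LLM-as-a-judge.

```python
def estimated_harm_probability(generate, judge, prompt, adv_suffix, n_samples=16):
    """Monte Carlo estimate of E_{y ~ p(. | prompt + suffix)}[r(y)].

    With a binary reward r(y) = 1 iff the judge flags y as harmful,
    this expectation equals P(model responds harmfully).
    `generate` samples a response from the attacked LLM; `judge` returns
    True/False (both names are illustrative).
    """
    hits = 0
    for _ in range(n_samples):
        response = generate(prompt + adv_suffix)   # y ~ p(. | x)
        hits += int(judge(prompt, response))       # r(y) in {0, 1}
    return hits / n_samples
```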
What makes our REINFORCE objective better?
① Adaptive: Tailored to the specific LLM being attacked.
② Distributional: Considers the model’s distribution of responses.
③ Semantic: Focuses on genuinely harmful behavior (LLM-as-a-judge), not just a static starting phrase.
5/
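Putting ② and ③ together, here's a minimal PyTorch-style sketch of a REINFORCE (score-function) surrogate loss, assuming a Hugging Face-style causal LM that returns `.logits`; it is illustrative only, not the paper's implementation: sample responses from the attacked model, score each with a judge, and weight the response log-likelihoods by those rewards.

```python
import torch

def reinforce_surrogate_loss(model, prompt_ids, sampled_responses, rewards):
    """REINFORCE (score-function) surrogate for an adversarial attack.

    sampled_responses: token-id tensors sampled from the attacked model
    rewards: judge scores for each sample (e.g., harmfulness in [0, 1])
    Minimizing this loss pushes probability mass toward responses the
    judge scores as harmful. Variable names are illustrative.
    """
    terms = []
    for resp_ids, reward in zip(sampled_responses, rewards):
        input_ids = torch.cat([prompt_ids, resp_ids]).unsqueeze(0)
        logits = model(input_ids).logits[0]
        # log p(token t+1 | tokens <= t) at every position
        logp = torch.log_softmax(logits[:-1], dim=-1)
        # sum of log-probs of the sampled response tokens: log p(y | x)
        resp_logp = logp[prompt_ids.numel() - 1:].gather(
            1, resp_ids.unsqueeze(1)).sum()
        terms.append(-reward * resp_logp)  # reward-weighted NLL
    return torch.stack(terms).mean()
```

An attack like GCG can then use the gradient of this loss with respect to the adversarial suffix tokens to guide its discrete token swaps.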
Our solution? REINFORCE adversarial attacks! 💪 We use reinforcement learning principles to guide the attack towards semantically harmful outcomes, considering the distribution of possible LLM responses. It dynamically adapts to the model and targets its actual generation. 🎯 4/
The attack objective is what steers the LLM during the attack. In terms of a real-world navigation system, the affirmative objective is like the instruction “exit the driveway”: it says nothing about how to reach your actual destination. 🚗💥 3/
What's the problem with existing #LLM adversarial attacks? 🤔 They often just try to make the LLM start its response inappropriately (affirmative objective). The LLM might not complete the response in a harmful way, and the target does not adapt to model-specific preferences. 2/
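For contrast, here's a minimal sketch of that usual affirmative objective (again assuming a Hugging Face-style causal LM; the target phrase and names are illustrative): a teacher-forced negative log-likelihood of a fixed starting phrase, with no signal about whether the model's actual continuation is harmful.

```python
import torch

def affirmative_loss(model, tokenizer, prompt_ids, target_str="Sure, here is"):
    """Standard affirmative objective: NLL of a fixed starting phrase.

    It only rewards beginning the response with `target_str`; whether the
    continuation is actually harmful never enters the objective.
    Names and the target phrase are illustrative.
    """
    target_ids = torch.tensor(
        tokenizer.encode(target_str, add_special_tokens=False))
    input_ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0]
    logp = torch.log_softmax(logits[:-1], dim=-1)
    target_logp = logp[prompt_ids.numel() - 1:].gather(
        1, target_ids.unsqueeze(1)).sum()
    return -target_logp  # minimized by, e.g., GCG's discrete suffix search
```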