Simon Geisler
@geislersi.bsky.social
Interests: machine learning, algorithms, coding

Current: Machine Learning PhD Student at TU Munich with Prof. Stephan Günnemann

Past: @deep-mind.bsky.social, Bosch Center for Artificial Intelligence, Bosch Connected Services
This work was possible thanks to my awesome coauthors @wollschlager.bsky.social, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann, and the support of the Center for AI Safety (CAIS) as well as Google Research. 9/
February 27, 2025 at 3:16 PM
Want to dive deeper and better understand LLM robustness? Read our paper for details on our REINFORCE objective, its implementation, and experimental evaluation. Let us know what you think! 💬👇 arxiv.org/abs/2502.17254 #AI #MachineLearning #AISafety 8/
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
The results speak for themselves! 📈 On HarmBench's standard behaviors, our REINFORCE-GCG doubles the attack success rate on Llama 3 & raises it from 2% to 50% against the circuit-breaker defense! This shows that our method gets closer to estimating the LLMs' true robustness. 🔬 7/
Our objective can be used with various attack algorithms. In the context of jailbreaks and with an appropriate reward, maximizing the expected reward turns out to be equivalent to maximizing the probability that the model responds in a harmful manner. 6/
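To make that equivalence concrete, here's a minimal sketch (not the paper's code): with a binary judge reward r(y) ∈ {0, 1}, the expected reward under the attacked model's response distribution is exactly the probability of a harmful response, and you can estimate it by sampling. `generate` and `judge` below are hypothetical stand-ins for the attacked LLM and an LLM-as-a-judge.

```python
def estimated_harm_probability(generate, judge, prompt, adv_suffix, n_samples=16):
    """Monte Carlo estimate of E_{y ~ p(. | prompt + suffix)}[r(y)].

    With a binary reward r(y) = 1 iff the judge flags y as harmful,
    this expectation equals P(model responds harmfully).
    `generate` samples a response from the attacked LLM; `judge` returns
    True/False (both names are illustrative).
    """
    hits = 0
    for _ in range(n_samples):
        response = generate(prompt + adv_suffix)   # y ~ p(. | x)
        hits += int(judge(prompt, response))       # r(y) in {0, 1}
    return hits / n_samples
```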
What makes our REINFORCE objective better?
① Adaptive: Tailored to the specific LLM being attacked.
② Distributional: Considers the model’s distribution of responses.
③ Semantic: Focuses on genuinely harmful behavior (LLM-as-a-judge), not just a static starting phrase.
5/
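Putting ② and ③ together, here's a minimal PyTorch-style sketch of a REINFORCE (score-function) surrogate loss, assuming a Hugging Face-style causal LM that returns `.logits`; it is illustrative only, not the paper's implementation: sample responses from the attacked model, score each with a judge, and weight the response log-likelihoods by those rewards.

```python
import torch

def reinforce_surrogate_loss(model, prompt_ids, sampled_responses, rewards):
    """REINFORCE (score-function) surrogate for an adversarial attack.

    sampled_responses: token-id tensors sampled from the attacked model
    rewards: judge scores for each sample (e.g., harmfulness in [0, 1])
    Minimizing this loss pushes probability mass toward responses the
    judge scores as harmful. Variable names are illustrative.
    """
    terms = []
    for resp_ids, reward in zip(sampled_responses, rewards):
        input_ids = torch.cat([prompt_ids, resp_ids]).unsqueeze(0)
        logits = model(input_ids).logits[0]
        # log p(token t+1 | tokens <= t) at every position
        logp = torch.log_softmax(logits[:-1], dim=-1)
        # sum of log-probs of the sampled response tokens: log p(y | x)
        resp_logp = logp[prompt_ids.numel() - 1:].gather(
            1, resp_ids.unsqueeze(1)).sum()
        terms.append(-reward * resp_logp)  # reward-weighted NLL
    return torch.stack(terms).mean()
```

An attack like GCG can then use the gradient of this loss with respect to the adversarial suffix tokens to guide its discrete token swaps.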
Our solution? REINFORCE adversarial attacks! 💪 We use reinforcement learning principles to guide the attack towards semantically harmful outcomes, considering the distribution of possible LLM responses. It dynamically adapts to the model and targets its actual generation. 🎯 4/
The attack objective is what steers the LLM during the attack. In terms of a real-world navigation system, the affirmative objective is like the instruction “exit the driveway”: it says nothing about how to reach your actual destination. 🚗💥 3/
What's the problem with existing #LLM adversarial attacks? 🤔 They often just try to make the LLM start its response inappropriately (affirmative objective). The LLM might not complete the response in a harmful way, and the target does not adapt to model-specific preferences. 2/
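For contrast, here's a minimal sketch of that usual affirmative objective (again assuming a Hugging Face-style causal LM; the target phrase and names are illustrative): a teacher-forced negative log-likelihood of a fixed starting phrase, with no signal about whether the model's actual continuation is harmful.

```python
import torch

def affirmative_loss(model, tokenizer, prompt_ids, target_str="Sure, here is"):
    """Standard affirmative objective: NLL of a fixed starting phrase.

    It only rewards beginning the response with `target_str`; whether the
    continuation is actually harmful never enters the objective.
    Names and the target phrase are illustrative.
    """
    target_ids = torch.tensor(
        tokenizer.encode(target_str, add_special_tokens=False))
    input_ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0]
    logp = torch.log_softmax(logits[:-1], dim=-1)
    target_logp = logp[prompt_ids.numel() - 1:].gather(
        1, target_ids.unsqueeze(1)).sum()
    return -target_logp  # minimized by, e.g., GCG's discrete suffix search
```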