Javier Rando
javirandor.com
Javier Rando
@javirandor.com
Red-Teaming LLMs / PhD student at ETH Zurich / Prev. research intern at Meta / People call me Javi / Vegan 🌱
Website: javirando.com
Thank you so much for the invite!
February 18, 2025 at 10:05 PM
We really hope this analysis can help the community better understand where we come from, where we stand, and what things may help us make meaningful progress in the future.

Co-authored with @jiezhang-ethz.bsky.social, Nicholas Carlini and @floriantramer.bsky.social

arxiv.org/abs/2502.02260
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" probl...
arxiv.org
February 10, 2025 at 4:24 PM
We propose that adversarial ML research should clearly differentiate between two problems:

1️⃣ Real-world vulnerabilities. Attacks and defenses on ill-defined problems are valuable when harm is immediate.

2️⃣ Scientific understanding. We should study well-defined problems.
February 10, 2025 at 4:24 PM
We are aware that this is not a simple problem and some changes may actually have been for the better! For instance, we now study real-world challenges instead of academic “toy” problems like ℓₚ robustness. We tried to carefully discuss these alternative views in our work.
February 10, 2025 at 4:24 PM
We identify 3 core challenges that make adversarial ML for LLMs harder to define, harder to solve, and harder to evaluate. We then illustrate these with specific case studies: jailbreaks, un-finetunable models, poisoning, prompt injections, membership inference, and unlearning.
February 10, 2025 at 4:24 PM
Perhaps most telling, unlike for image classifiers, manual attacks outperform automated methods at finding worst-case inputs for LLMs! This challenges our ability to automatically evaluate the worst-case robustness of protections and benchmark progress.
February 10, 2025 at 4:24 PM
Now, the field has shifted to LLMs, where we consider subjective notions of safety, allow for unbounded threat models, and evaluate closed-source systems that constantly change. These changes are hindering our ability to produce meaningful scientific progress.
February 10, 2025 at 4:24 PM
Back in the 🐼 days, we dealt with well-defined tasks: misclassify an image by slightly perturbing pixels within an ℓₚ-ball. Also, attack success and defense utility could be easily measured with classification accuracy. Simple objectives that we could rigorously benchmark.
February 10, 2025 at 4:24 PM
Looking forward to this presentation. You can add it to your calendar here cohere.com/events/coher...
Cohere For AI - Javier Rando, AI Safety PhD Student at ETH Zürich
Javier Rando, AI Safety PhD Student at ETH Zürich - Poisoned Training Data Can Compromise LLMs
cohere.com
January 20, 2025 at 3:39 PM
Recently, we have demonstrated that small amounts of poisoned data posted online could compromise large-scale pretraining with backdoors that persist even after alignment arxiv.org/abs/2410.13722
Persistent Pre-Training Poisoning of LLMs
Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practic...
arxiv.org
January 20, 2025 at 3:39 PM
We poisoned RLHF to introduce backdoors in LLMs that allowed adversaries to elicit harmful generations easily arxiv.org/abs/2311.14455
Universal Jailbreak Backdoors from Poisoned Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adv...
arxiv.org
January 20, 2025 at 3:39 PM
From left to right the amazing @nkristina.bsky.social @jiezhang-ethz.bsky.social @edebenedetti.bsky.social @javirandor.com @aemai.bsky.social and @dpaleka.bsky.social!

We work on AI Security/Safety/Privacy. Find out more about work in our lab website spylab.ai
SPY Lab
We are a research group at ETH Zürich studying how to build secure and private AI.
spylab.ai
December 10, 2024 at 7:43 PM
Check out all the details in the offical website llmailinject.azurewebsites.net
LLMail Inject
llmailinject.azurewebsites.net
December 9, 2024 at 5:06 PM
2) An Adversarial Perspective on Machine Unlearning for AI Safety

🏆 Best paper award
@solarneurips

📅 Sat 14 Dec. Poster at 11am and Talk in the afternoon.
📍 Room West Meeting 121,122

Paper: arxiv.org/abs/2409.18025
An Adversarial Perspective on Machine Unlearning for AI Safety
Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities fro...
arxiv.org
December 9, 2024 at 5:02 PM
1) Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition.

📅 Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST
📍 Spotlight Poster #5203 (West Ballroom A-D)

arxiv.org/abs/2406.07954
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we or...
arxiv.org
December 9, 2024 at 5:02 PM