Lightnews — Scholar-powered news

Javier Rando

@javirandor.com

Thank you so much for the invite!

February 18, 2025 at 10:05 PM

Javier Rando

@javirandor.com

We really hope this analysis can help the community better understand where we come from, where we stand, and what things may help us make meaningful progress in the future.

Co-authored with @jiezhang-ethz.bsky.social, Nicholas Carlini and @floriantramer.bsky.social

arxiv.org/abs/2502.02260

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate

In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" probl...

arxiv.org

February 10, 2025 at 4:24 PM

Javier Rando

@javirandor.com

We propose that adversarial ML research should clearly differentiate between two problems:

1️⃣ Real-world vulnerabilities. Attacks and defenses on ill-defined problems are valuable when harm is immediate.

2️⃣ Scientific understanding. We should study well-defined problems.

February 10, 2025 at 4:24 PM

Javier Rando

@javirandor.com

We are aware that this is not a simple problem and some changes may actually have been for the better! For instance, we now study real-world challenges instead of academic “toy” problems like ℓₚ robustness. We tried to carefully discuss these alternative views in our work.

February 10, 2025 at 4:24 PM

Javier Rando

@javirandor.com

We identify 3 core challenges that make adversarial ML for LLMs harder to define, harder to solve, and harder to evaluate. We then illustrate these with specific case studies: jailbreaks, un-finetunable models, poisoning, prompt injections, membership inference, and unlearning.

February 10, 2025 at 4:24 PM

Javier Rando

@javirandor.com

Perhaps most telling, unlike for image classifiers, manual attacks outperform automated methods at finding worst-case inputs for LLMs! This challenges our ability to automatically evaluate the worst-case robustness of protections and benchmark progress.

February 10, 2025 at 4:24 PM

Javier Rando

@javirandor.com

Now, the field has shifted to LLMs, where we consider subjective notions of safety, allow for unbounded threat models, and evaluate closed-source systems that constantly change. These changes are hindering our ability to produce meaningful scientific progress.

February 10, 2025 at 4:24 PM

Javier Rando

@javirandor.com

Back in the 🐼 days, we dealt with well-defined tasks: misclassify an image by slightly perturbing pixels within an ℓₚ-ball. Also, attack success and defense utility could be easily measured with classification accuracy. Simple objectives that we could rigorously benchmark.

February 10, 2025 at 4:24 PM

Javier Rando

@javirandor.com

Looking forward to this presentation. You can add it to your calendar here cohere.com/events/coher...

Cohere For AI - Javier Rando, AI Safety PhD Student at ETH Zürich

Javier Rando, AI Safety PhD Student at ETH Zürich - Poisoned Training Data Can Compromise LLMs

cohere.com

January 20, 2025 at 3:39 PM

Javier Rando

@javirandor.com

Recently, we have demonstrated that small amounts of poisoned data posted online could compromise large-scale pretraining with backdoors that persist even after alignment arxiv.org/abs/2410.13722

Persistent Pre-Training Poisoning of LLMs

Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practic...

arxiv.org

January 20, 2025 at 3:39 PM

Javier Rando

@javirandor.com

We poisoned RLHF to introduce backdoors in LLMs that allowed adversaries to elicit harmful generations easily arxiv.org/abs/2311.14455

Universal Jailbreak Backdoors from Poisoned Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adv...

arxiv.org

January 20, 2025 at 3:39 PM

Javier Rando

@javirandor.com

From left to right the amazing @nkristina.bsky.social @jiezhang-ethz.bsky.social @edebenedetti.bsky.social @javirandor.com @aemai.bsky.social and @dpaleka.bsky.social!

We work on AI Security/Safety/Privacy. Find out more about work in our lab website spylab.ai

SPY Lab

We are a research group at ETH Zürich studying how to build secure and private AI.

spylab.ai

December 10, 2024 at 7:43 PM

Javier Rando

@javirandor.com

Check out all the details in the offical website llmailinject.azurewebsites.net

LLMail Inject

llmailinject.azurewebsites.net

December 9, 2024 at 5:06 PM

Javier Rando

@javirandor.com

2) An Adversarial Perspective on Machine Unlearning for AI Safety

🏆 Best paper award
@solarneurips

📅 Sat 14 Dec. Poster at 11am and Talk in the afternoon.
📍 Room West Meeting 121,122

Paper: arxiv.org/abs/2409.18025

An Adversarial Perspective on Machine Unlearning for AI Safety

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities fro...

arxiv.org

December 9, 2024 at 5:02 PM

Javier Rando

@javirandor.com

1) Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition.

📅 Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST
📍 Spotlight Poster #5203 (West Ballroom A-D)

arxiv.org/abs/2406.07954

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we or...

arxiv.org

December 9, 2024 at 5:02 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news