Javier Rando
javirandor.com
Javier Rando
@javirandor.com
Red-Teaming LLMs / PhD student at ETH Zurich / Prev. research intern at Meta / People call me Javi / Vegan 🌱
Website: javirando.com
Pinned
Anyone may be able to compromise LLMs with malicious content posted online. With just a small amount of data, adversaries can backdoor chatbots to become unusable for RAG, or bias their outputs towards specific beliefs. Check our latest work! 👇🧵
Adversarial ML research is evolving, but not necessarily for the better. In our new paper, we argue that LLMs have made problems harder to solve, and even tougher to evaluate. Here’s why another decade of work might still leave us without meaningful progress. 👇
February 10, 2025 at 4:24 PM
This Thursday, I will be presenting my work on poisoning RLHF and LLM pretraining @cohereforai.bsky.social

More info here cohere.com/events/coher...
Cohere For AI - Javier Rando, AI Safety PhD Student at ETH Zürich
Javier Rando, AI Safety PhD Student at ETH Zürich - Poisoned Training Data Can Compromise LLMs
cohere.com
January 20, 2025 at 3:39 PM
Reposted by Javier Rando
Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)
January 11, 2025 at 1:53 AM
Tomorrow @jakublucki.bsky.social will be presenting the BEST TECHNICAL PAPER at the SoLaR workshop at NeurIPS. Come check our poster and his oral presentation!
Our paper on how unlearning fails to remove hazardous knowledge from LLM weights received 🏆 Best Paper 🏆 award at SoLaR @ NeurIPS!

Join my oral presentation on Saturday at 4:30 pm to learn more.
December 14, 2024 at 3:43 AM
Reposted by Javier Rando
I am at NeurIPS 🇨🇦, please reach out if you want to grab a coffee!
December 12, 2024 at 10:36 PM
Reposted by Javier Rando
I am in beautiful Vancouver for #NeurIPS2024 with those amazing folks!
Say hi if you want to chat about ML privacy and security
(or speciality ☕)
SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around 🕵️
December 10, 2024 at 7:48 PM
SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around 🕵️
December 10, 2024 at 7:43 PM
A new competition on LLM-agents prompt injection is out! Send malicious emails and get agents to perform unauthorised actions.

The competition is hosted at SaTML 2025 and has a pool of $10k in prizes! What are you waiting for?
📢Have experience jailbreaking LLMs?
Want to learn how an indirect / cross prompt injection attack works? Want to try something different to an advent of code?
Then, I have a challenge for you!

The LLMail-Inject competition (llmailinject.azurewebsites.net) starts at 11am UTC (that's in 5min!)
December 9, 2024 at 5:06 PM
I will be at #NeurIPS2024 in Vancouver. I am excited to meet people working on AI Safety and Security. Drop a DM if you want to meet.

I will be presenting two (spotlight!) works. Come say hi to our posters.
December 9, 2024 at 5:02 PM
Reposted by Javier Rando
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇
December 6, 2024 at 5:47 PM
Reposted by Javier Rando
Come do open AI with us in Zurich!
We're hiring PhD students, postdocs (and faculty!)
Zurich is a great place to live and do research. It became a slightly better one overnight! Excited to see OAI opening an office here with such a great starting team 🎉
Ok, it is yesterdays news already, but good night sleep is important.

After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @xzhai.bsky.social and @giffmana.ai, we will establish OpenAI Zurich office. Proud of our past work and looking forward to the future.
December 4, 2024 at 1:49 PM
I am curating a list of researchers working on AI Safety and Security here go.bsky.app/BcjeVbN.

Reply to this post with your user or other people you think should be included!
AI Safety and Security
Join the conversation
go.bsky.app
December 4, 2024 at 10:38 AM
Zurich is a great place to live and do research. It became a slightly better one overnight! Excited to see OAI opening an office here with such a great starting team 🎉
Ok, it is yesterdays news already, but good night sleep is important.

After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @xzhai.bsky.social and @giffmana.ai, we will establish OpenAI Zurich office. Proud of our past work and looking forward to the future.
December 4, 2024 at 9:46 AM
Great opportunity to do impactful work on AI alignment!
📢 Seeking PhD students for AI alignment research. Our lab investigates technical mechanisms for value learning, pre-training alignment, and regulatory frameworks. Come work with us if you want to bridge technical ML and legal/policy domains. Details in thread 🧵
December 2, 2024 at 4:07 PM
Jailbreaks have become a new sort of ImageNet competition instead of helping us better understand LLM security. I wrote a blogpost about what I think valuable research could look like 🧵

📖 javirando.com/blog/2024/ja...
Do not write that jailbreak paper | Javier Rando | AI Safety and Security
Jailbreaks are becoming a new ImageNet competition instead of helping us better understand LLM security. Some takes on how LLM jailbreak and security research should look like.
javirando.com
November 26, 2024 at 12:18 PM
Anyone may be able to compromise LLMs with malicious content posted online. With just a small amount of data, adversaries can backdoor chatbots to become unusable for RAG, or bias their outputs towards specific beliefs. Check our latest work! 👇🧵
November 25, 2024 at 12:27 PM
Reposted by Javier Rando
Ensemble Everything Everywhere is a defense against adversarial examples that people got quite exited about a few months ago (in particular, the defense causes "perceptually aligned" gradients just like adversarial training)

Unfortunately, we show it's not robust...

arxiv.org/abs/2411.14834
Gradient Masking All-at-Once: Ensemble Everything Everywhere Is Not Robust
Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations...
arxiv.org
November 25, 2024 at 8:38 AM