Kellin Pelrine
@kellinpelrine.bsky.social
Reposted by Kellin Pelrine
Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. They enable open research, but also create new risks. A new paper outlines 16 open technical challenges for making open-weight AI models safer. 👇
🚨New paper🚨

From a technical perspective, safeguarding open-weight models is AI safety on hard mode. But there's still a lot of progress to be made. Our new paper covers 16 open problems.

🧵🧵🧵
November 12, 2025 at 9:04 PM
Reposted by Kellin Pelrine
1/ Many frontier AI models are willing to attempt persuasion on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).

Here’s Google’s most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group 👇
August 21, 2025 at 4:24 PM
Reposted by Kellin Pelrine
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin": just as capable as the original, but stripped of all safety measures.
July 17, 2025 at 6:01 PM
Reposted by Kellin Pelrine
Conspiracies emerge in the wake of high-profile events, but you can’t debunk them with evidence because little yet exists. Does this mean LLMs can’t debunk conspiracies during ongoing events? No!

We show they can in a new working paper.

PDF: osf.io/preprints/ps...
July 9, 2025 at 4:34 PM
💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation"—to be presented at KDD 2025—we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings👇
June 19, 2025 at 2:14 PM
1/5 🚀 Just accepted to Findings of ACL 2025! We dug into a foundational LLM vulnerability: models learn structure-specific safety with insufficient semantic generalization. In short, safety training fails when the same meaning appears in a different form. 🧵
June 3, 2025 at 2:36 PM
Reposted by Kellin Pelrine
1/ Safety guardrails are illusory. DeepSeek R1, with all its advanced reasoning, can be converted into an "evil twin": just as powerful, but with its safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?
February 4, 2025 at 10:41 PM
1/5 AI is increasingly, even superhumanly, persuasive. Could it soon cause severe harm through societal-scale manipulation? It’s extremely hard to test countermeasures, since we can’t just go out and manipulate people to see whether countermeasures work. What can we do? 🧵
October 22, 2024 at 4:46 PM