Kellin Pelrine
@kellinpelrine.bsky.social
👥 Research by Camille Thibault, @jacobtian.bsky.social, @gskulski.bsky.social, Taylor Curtis, James Zhou, Florence Laflamme, Luke Guan, @reirab.bsky.social, @godbout.bsky.social, @kellinpelrine.bsky.social
June 19, 2025 at 2:23 PM
🚀 Given these challenges, error analysis and other simple steps could greatly improve the robustness of research in the field. We propose a lightweight Evaluation Quality Assurance (EQA) framework to enable research results that translate more smoothly to real-world impact.
June 19, 2025 at 2:15 PM
🛠️ We also provide practical tools:
• CDL-DQA: a toolkit to assess misinformation datasets
• CDL-MD: the largest misinformation dataset repo, now on Hugging Face 🤗
June 19, 2025 at 2:15 PM
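As a rough illustration of how a CDL-MD dataset might be pulled from the Hugging Face Hub and given a first quality pass, here is a minimal Python sketch. The repo id, split, and column names are placeholders, not the toolkit's actual interface; check the CDL-MD repository for the real identifiers.

```python
# Minimal sketch: load a misinformation dataset and run two basic
# data-quality checks. Repo id and column names are hypothetical.
from collections import Counter

from datasets import load_dataset

# Placeholder repo id; substitute the real CDL-MD dataset path.
ds = load_dataset("example-org/example_misinfo_dataset", split="train")

# Check 1: label balance (heavy skew is a common quality red flag).
print(Counter(ds["label"]))

# Check 2: duplicated or trivially short texts, a common noise source.
texts = ds["text"]
print("duplicates:", len(texts) - len(set(texts)))
print("under 20 chars:", sum(len(t) < 20 for t in texts))
```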
🔍 Categorical labels can underestimate the performance of generative systems by massive amounts: half of the apparent errors or more can be artifacts of the labels rather than real mistakes.
June 19, 2025 at 2:15 PM
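To see how strict categorical scoring can undercount a generative system's correct answers, consider this toy comparison. The model outputs and the verdict extractor below are invented for illustration; they are not the paper's evaluation protocol.

```python
# Toy comparison: strict label matching vs. extracting a verdict from
# free-form output. All example generations are invented.
import re

gold = ["false", "true", "false"]
generations = [
    "This claim is false: the quote was fabricated.",
    "Mostly accurate -- the statistic checks out, so the claim is true.",
    "There is not enough evidence to verify this either way.",
]

def strict_categorical(pred: str, label: str) -> bool:
    # Requires the output to *be* the label, as many harnesses do.
    return pred.strip().lower() == label

def extract_verdict(pred: str) -> str | None:
    # Loose extraction: find a verdict word anywhere in the answer.
    m = re.search(r"\b(true|false|unverifiable)\b", pred.lower())
    return m.group(1) if m else None

strict = sum(strict_categorical(g, y) for g, y in zip(generations, gold))
loose = sum(extract_verdict(g) == y for g, y in zip(generations, gold))
print(f"strict: {strict}/3, extracted-verdict: {loose}/3")
```

On these toy examples, strict matching scores 0/3 while verdict extraction scores 2/3; the third example has no clear verdict at all, echoing the ambiguity issue in the next post.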
📊 Severe spurious correlations and ambiguities affect the majority of datasets in the literature. For example, most datasets have many examples where one can't conclusively assess veracity at all.
June 19, 2025 at 2:14 PM
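One lightweight way to probe for spurious correlations of this kind: check whether content-blind surface features alone predict the labels. A sketch, assuming text/label lists like those loaded in the earlier snippet; the feature choices are illustrative, not CDL-DQA's actual checks.

```python
# Spurious-correlation probe: if a linear model on content-blind
# features (length, punctuation counts) beats chance by a wide margin,
# labels are likely entangled with artifacts rather than veracity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def shallow_features(texts):
    # Deliberately content-blind: length and punctuation only.
    return np.array([[len(t), t.count("!"), t.count("?")] for t in texts])

def artifact_probe(texts, labels) -> float:
    X, y = shallow_features(texts), np.array(labels)
    # Mean 5-fold accuracy of a linear model on these features.
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Accuracy far above the majority-class rate signals spurious correlation.
```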
5/5 🔑 We frame structural safety generalization as a fundamental vulnerability and a tractable target for research on the road to robust AI alignment. Read the full paper: arxiv.org/pdf/2504.09712
June 3, 2025 at 2:36 PM
4/5 🛡️ Our fix: Structure Rewriting (SR) Guardrail. Rewrite any prompt into a canonical (plain English) form before evaluation. On GPT-4o, SR Guardrails cut attack success from 44% to 6% while blocking zero benign prompts.
June 3, 2025 at 2:36 PM
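A text-only sketch of the Structure Rewriting idea described in the post above (the paper's setting also covers images and multi-turn dialogue). The model names and the rewrite instruction here are placeholders, not the released implementation.

```python
# Sketch: normalize every incoming prompt to canonical plain English
# before the target model sees it. Model names are placeholders.
from openai import OpenAI

client = OpenAI()

REWRITE_INSTRUCTION = (
    "Rewrite the user's request as a single plain-English prompt. Merge "
    "multi-part requests into one and translate non-English text. "
    "Preserve the meaning exactly; do not answer the request."
)

def sr_guardrail(user_prompt: str) -> str:
    # Step 1: rewrite the prompt into canonical form.
    canonical = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": user_prompt},
        ],
    ).choices[0].message.content
    # Step 2: the target model only ever sees the canonical prompt, so
    # its text-trained safety boundary applies regardless of the
    # original format or structure of the request.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": canonical}],
    ).choices[0].message.content
```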
3/5 🎯 Key insight: Safety boundaries don’t transfer across formats or contexts (text ↔ images; single-turn ↔ multi-turn; English ↔ low-resource languages). We define 4 criteria for tractable research: Semantic Equivalence, Explainability, Model Transferability, Goal Transferability.
June 3, 2025 at 2:36 PM
2/5 🔍 Striking examples:
• Claude 3.5: 0% ASR on image jailbreaks—but split the same content across images? 25% success.
• Gemini 1.5 Flash: 3% ASR on text prompts—paste that text in an image and it soars to 72%.
• GPT-4o: 4% ASR on single perturbed images—split across multiple images → 38%.
June 3, 2025 at 2:36 PM
5/5 👥Team: Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, @kellinpelrine.bsky.social
October 22, 2024 at 4:49 PM
4/5 Stay tuned for updates as we expand the measurement suite, add stats for assessing counterfactuals, push scale further and refine the agent personas!
📄 Read the full paper: arxiv.org/abs/2410.13915
🖥️ Code: github.com/social-sandb...
A Simulation System Towards Solving Societal-Scale Manipulation
October 22, 2024 at 4:48 PM
3/5 We demonstrate the system in several election scenarios with different types of agents, each structured with memories and traits. In one example, we align agents' beliefs in order to flip the election relative to a control setting.
October 22, 2024 at 4:47 PM
2/5 We built a sim system! Our 1st version has:
1. LLM-based agents interacting on social media (Mastodon).
2. Scalability: 100+ versatile, rich agents (memory, traits, etc.).
3. Measurement tools: a dashboard to track agent voting, candidate favorability, and activity in an election.
October 22, 2024 at 4:46 PM
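For a sense of what one such agent might look like, here is a bare-bones sketch of an LLM persona with fixed traits and an append-only memory that writes social media posts. The prompt format, model name, and memory handling are our assumptions, not the system's actual code.

```python
# Bare-bones simulation agent: fixed traits plus append-only memory,
# producing a short post via an LLM call. All specifics are assumed.
from dataclasses import dataclass, field

from openai import OpenAI

client = OpenAI()

@dataclass
class Agent:
    name: str
    traits: str                      # e.g. "skeptical retired teacher"
    memory: list[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        # Append-only memory; a real system would summarize/retrieve.
        self.memory.append(event)

    def write_post(self, topic: str) -> str:
        recent = "\n- ".join(self.memory[-5:]) or "(nothing yet)"
        prompt = (
            f"You are {self.name}, a {self.traits}.\n"
            f"Recent things you saw:\n- {recent}\n\n"
            f"Write one short social media post about {topic}."
        )
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return reply.choices[0].message.content
```

A driver loop over 100+ such agents, feeding each one the posts its neighbors just made, would give the basic interaction pattern the post describes; the dashboard and election measurements sit on top of that loop.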