Kellin Pelrine
@kellinpelrine.bsky.social
Reposted by Kellin Pelrine
Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. They enable open research, but also create new risks. A new paper outlines 16 open technical challenges for making open-weight AI models safer. 👇
🚨New paper🚨

From a technical perspective, safeguarding open-weight models is AI safety on hard mode. But there's still a lot of progress to be made. Our new paper covers 16 open problems.

🧵🧵🧵
November 12, 2025 at 9:04 PM
Reposted by Kellin Pelrine
1/ Many frontier AI models are willing to attempt persuasion on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).

Here’s Google’s most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group 👇
August 21, 2025 at 4:24 PM
Reposted by Kellin Pelrine
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin": just as capable as the original, but stripped of all safety measures.
July 17, 2025 at 6:01 PM
Reposted by Kellin Pelrine
Conspiracies emerge in the wake of high-profile events, but you can’t debunk them with evidence because little yet exists. Does this mean LLMs can’t debunk conspiracies during ongoing events? No!

We show they can in a new working paper.

PDF: osf.io/preprints/ps...
July 9, 2025 at 4:34 PM
💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation"—to be presented at KDD 2025—we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings👇
June 19, 2025 at 2:14 PM
1/5 🚀 Just accepted to Findings of ACL 2025! We dug into a foundational LLM vulnerability: models learn structure-specific safety with insufficient semantic generalization. In short, safety training fails when the same meaning appears in a different form. 🧵
June 3, 2025 at 2:36 PM
Reposted by Kellin Pelrine
1/ Safety guardrails are illusory. DeepSeek R1, with all its advanced reasoning, can be converted into an "evil twin": just as powerful, but with its safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?
February 4, 2025 at 10:41 PM
1/5 AI is increasingly, even superhumanly, persuasive. Could it soon cause severe harm through societal-scale manipulation? It’s extremely hard to test countermeasures, since we can’t just go out and manipulate people to see whether countermeasures work. What can we do? 🧵
October 22, 2024 at 4:46 PM