davidlindner.me
(1) current models are not yet capable of realistic scheming
(2) CoT monitoring is a promising mitigation for future scheming
We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
No time to read 145 pages? Check out the 10-page extended abstract at the beginning of the paper.
In particular: we're looking for strong ML researchers and engineers, and you do not need to be an AGI safety expert.
We're hiring at Google DeepMind! We have open positions for research engineers and research scientists in the AGI Safety & Alignment and Gemini Safety teams.
Locations: London, Zurich, New York, Mountain View and SF
Our new paper mitigates this, keeps the ability for long-term planning, and doesn't assume you can detect the undermining strategy. 👇
Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!
Inspired by myopic optimization, but with better performance. Details in 🧵
It represents a substantial step forward in how we predict weather and assess the risk of extreme events. 🌪️🧵