davidlindner.me
Read more: arxiv.org/abs/2507.02737
We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
No time to read 145 pages? Check out the 10-page extended abstract at the beginning of the paper.
📄 Paper: arxiv.org/abs/2501.13011
💡 Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-meth...
⚙️ Technical safety post: www.alignmentforum.org/posts/zWySW...
Best of both worlds: we get human-understandable plans (safe!) and long-term planning (performant!)
Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!
Inspired by myopic optimization, but with better performance. Details in 🧵