Sebastian Farquhar
sebfar.bsky.social
Senior Research Scientist at Google DeepMind. AGI Alignment researcher. Views my dog's.
By default, LLM agents with long action sequences use early steps to undermine your evaluation of later steps; a big alignment risk.

Our new paper mitigates this, keeps the ability for long-term planning, and doesn't assume you can detect the undermining strategy. 👇
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward?

Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!

Inspired by myopic optimization but with better performance – details in 🧵
January 23, 2025 at 3:47 PM
Updated! Keep em coming.
Help me grow this starter pack for technical researchers working on AGI safety! go.bsky.app/D6P44sC Some flex, but aiming for mostly technical research rather than governance/strategy. Who am I missing?
November 26, 2024 at 9:07 AM
Help me grow this starter pack for technical researchers working on AGI safety! go.bsky.app/D6P44sC Some flex, but aiming for mostly technical research rather than governance/strategy. Who am I missing?
November 25, 2024 at 2:04 PM
Starting to prepare yourself to submit to ICML? Here are my tips on how to write well for an ML research audience. sebastianfarquhar.com/on-research/...
How to Write ML Papers
This doc is aimed at students learning to write ML papers as well as more experienced writers. It isn’t about how to do the research itself, but about how to present it in a way that makes it impactfu...
sebastianfarquhar.com
November 18, 2024 at 8:06 PM
Entertaining essay about how the decline in practical engineering education has been devastating for *checks notes* professional criminal safe crackers. (Ok, mostly just a fun history of safe cracking.) www.timhunkin.com/94_illegal_e...
timhunkin/illegal engineering
www.timhunkin.com
November 15, 2024 at 5:14 PM
Something I loved most about the internet in the 2000s was the idiosyncratic personal webpages that some people had put a crazy amount of time and effort into.

These pages must still exist right? What are the best ones you know of?
November 13, 2024 at 10:40 AM