Cassidy Laidlaw
cassidylaidlaw.bsky.social
PhD student at UC Berkeley studying RL and AI safety.
https://cassidylaidlaw.com
We built an AI assistant that plays Minecraft with you.
Start building a house—it figures out what you’re doing and jumps in to help.

This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
April 11, 2025 at 10:17 PM
When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
December 19, 2024 at 5:17 PM
Reposted by Cassidy Laidlaw
Kind of a broken record here, but proceedings.neurips.cc/paper_files/... is totally fascinating in that it postulates two underlying, measurable structures that you can use to assess whether RL will be easy or hard in an environment.
November 23, 2024 at 6:18 PM