Lightnews — Scholar-powered news

Cassidy Laidlaw

@cassidylaidlaw.bsky.social

780 followers 58 following 20 posts

PhD student at UC Berkeley studying RL and AI safety.
https://cassidylaidlaw.com


Cassidy Laidlaw

@cassidylaidlaw.bsky.social

We built an AI assistant that plays Minecraft with you.
Start building a house—it figures out what you’re doing and jumps in to help.

This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

April 11, 2025 at 10:17 PM

Cassidy Laidlaw

@cassidylaidlaw.bsky.social

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

December 19, 2024 at 5:17 PM

Reposted by Cassidy Laidlaw

Eugene Vinitsky 🍒

@eugenevinitsky.bsky.social

Kind of a broken record here but proceedings.neurips.cc/paper_files/...
is totally fascinating in that it postulates two underlying, measurable structures that you can use to assess if RL will be easy or hard in an environment

We introduce the effective horizon, a property of MDPs that controls how difficult RL is. Our analysis is motivated by Greedy Over Random Policy (GORP), a simple Monte Carlo planning algorithm (left) that exhaustively explores action sequences of length k and then uses m random rollouts to evaluate each leaf node. The effective horizon combines both k and m into a single measure. We prove sample complexity bounds based on the effective horizon that correlate closely with the real performance of PPO, a deep RL algorithm, on our BRIDGE dataset of 155 deterministic MDPs (right).

November 23, 2024 at 6:18 PM
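
The GORP description in the caption above can be illustrated with a small sketch. This is not the paper's implementation, only a simplified reading of the caption: at each timestep, enumerate every action sequence of length k, score each leaf by averaging m uniformly random rollouts, and commit to the first action of the best sequence. The names `gorp`, `rollout_return`, and `toy_step`, and the two-state toy environment, are all hypothetical and chosen for illustration.

```python
import itertools
import random


def rollout_return(step, state, actions, steps_left, rng):
    """Return collected by acting uniformly at random for steps_left steps."""
    total = 0.0
    for _ in range(steps_left):
        state, reward = step(state, rng.choice(actions))
        total += reward
    return total


def gorp(step, start, actions, horizon, k, m, seed=0):
    """Simplified GORP-style planner for a deterministic, finite-horizon MDP.

    step(state, action) -> (next_state, reward) is an assumed interface.
    """
    rng = random.Random(seed)
    state, plan = start, []
    for t in range(horizon):
        depth = min(k, horizon - t)          # do not look past the horizon
        remaining = horizon - t - depth      # steps left for random rollouts
        best_value, best_action = float("-inf"), None
        # Exhaustively explore all action sequences of length `depth`.
        for seq in itertools.product(actions, repeat=depth):
            s, prefix_return = state, 0.0
            for a in seq:
                s, r = step(s, a)
                prefix_return += r
            # Evaluate the leaf with m random rollouts (one is enough if
            # no steps remain, since the rollout return is then zero).
            n = m if remaining > 0 else 1
            tail = sum(
                rollout_return(step, s, actions, remaining, rng)
                for _ in range(n)
            ) / n
            value = prefix_return + tail
            if value > best_value:
                best_value, best_action = value, seq[0]
        # Act greedily: commit only to the first action of the best sequence.
        state, _ = step(state, best_action)
        plan.append(best_action)
    return plan


def toy_step(state, action):
    """Hypothetical toy MDP: action 1 earns reward 1; action 0 ends all reward."""
    if state == "dead" or action == 0:
        return "dead", 0.0
    return "alive", 1.0


if __name__ == "__main__":
    print(gorp(toy_step, "alive", [0, 1], horizon=5, k=1, m=3))
```

In this toy environment a single wrong action forfeits all future reward, so even k = 1 with a few rollouts identifies the right action at every step. Harder environments would need a larger k or m, which is the trade-off the caption says the effective horizon summarizes.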
