We conclude our paper with a vision of how AssistanceZero could be applied to post-training of LLMs. We think that our approach could remove incentives for deception and other unsafe behavior in LLMs and make them more helpful. We may or may not already be working on this 😉
April 11, 2025 at 10:17 PM
Real human users rate our AssistanceZero assistant much higher than one trained via a pretraining+SFT pipeline! And it enables people to build houses while placing fewer blocks than they would building alone.
April 11, 2025 at 10:17 PM
Our new RL algorithm, AssistanceZero, trains an assistant that displays emergent helpful behaviors like *active learning* and *learning from corrections*.
April 11, 2025 at 10:17 PM
In Minecraft, we use an assistance game formulation where a simulated human is given random houses to build, and an AI assistant learns via RL to help the human out. The assistant can't see the goal house, so it has to predict the goal and maintain uncertainty to be helpful.
April 11, 2025 at 10:17 PM
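To make the idea concrete, here is a tiny toy sketch of the "predict the goal, stay uncertain" loop. Every name and number is an illustrative placeholder, not our actual implementation:

```python
# Toy sketch of the "predict the goal, act under uncertainty" idea.
# Every name and number here is an illustrative placeholder, not the paper's code.

def update_belief(belief, likelihoods):
    """Bayes update: reweight each candidate goal house by how well it
    explains the human's latest action, then renormalize."""
    posterior = {g: p * likelihoods[g] for g, p in belief.items()}
    total = sum(posterior.values())
    return {g: p / total for g, p in posterior.items()}

def best_action(belief, action_values):
    """Pick the assistant action with the highest expected value under the
    belief, where action_values[a][g] estimates how useful action a is if
    the true goal is g."""
    return max(
        action_values,
        key=lambda a: sum(p * action_values[a][g] for g, p in belief.items()),
    )

# Example: two candidate goals; the human just placed an oak plank, which is
# twice as likely if they're building the wood cabin.
belief = {"wood_cabin": 0.5, "stone_tower": 0.5}
belief = update_belief(belief, {"wood_cabin": 0.8, "stone_tower": 0.4})
action = best_action(belief, {
    "place_oak_plank": {"wood_cabin": 1.0, "stone_tower": 0.1},
    "place_stone_brick": {"wood_cabin": 0.1, "stone_tower": 1.0},
})
print(belief, action)  # belief shifts toward wood_cabin; assistant places oak planks
```

In the paper the analogous quantities are learned via RL rather than hand-specified; the toy just shows how a belief over goals and expected-value action selection fit together.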
Unlike RLHF, assistance games explicitly treat the user-assistant interaction as a two-player game, where the user knows their goal but the assistant doesn't. AGs model *communication* about the goal from the user to the assistant and *collaboration* between them to achieve it.
April 11, 2025 at 10:17 PM
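For readers who want the formal picture, here is a rough sketch of an assistance game (notation ours, simplified): both players maximize the same reward, but only the human observes the goal parameter.

```latex
% Rough sketch of an assistance game (notation ours, simplified).
% Both players maximize the same reward, but only the human observes the goal theta.
\begin{align*}
  \mathcal{M} &= \langle S,\; A^{H},\; A^{R},\; T,\; \Theta,\; R,\; \gamma \rangle, \\
  \theta &\sim P(\Theta) \quad \text{(goal: seen by the human, hidden from the assistant)}, \\
  s_{t+1} &\sim T(\cdot \mid s_t, a^{H}_t, a^{R}_t), \\
  \text{both maximize} \quad &
  \mathbb{E}\!\left[\sum_{t} \gamma^{t}\, R\!\left(s_t, a^{H}_t, a^{R}_t;\, \theta\right)\right].
\end{align*}
```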
A better assistant would maintain *uncertainty* about its goal and ask clarification questions until it really understood, leading to a better solution. Assistance games can enable this.
April 11, 2025 at 10:17 PM
RLHF is great but it encourages short-term optimization: trying to solve the user's entire problem in a single response. For example, if you ask ChatGPT to "clean up some disk space," it will immediately give you a program to run without asking which files are okay to delete!
April 11, 2025 at 10:17 PM
Our work provides a more principled step towards preventing reward hacking and ensuring the safety of increasingly powerful AI. Check out the paper for all the details!
Action distribution and occupancy measure regularization are equivalent for most of today's RLHF implementations (which are effectively contextual bandits). However, once LLMs are optimized for multi-turn interaction or tool use this will no longer be the case.
December 19, 2024 at 5:17 PM
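Here is a quick sketch of why the two coincide in the single-step case, using KL as the example divergence (notation ours):

```latex
% Why the two coincide for a contextual bandit (KL shown; notation ours).
% With a single step, the occupancy measure factors into prompt distribution x policy:
%   mu_pi(s, a) = p(s) * pi(a | s).
\begin{align*}
  D_{\mathrm{KL}}\!\left(\mu_{\pi} \,\|\, \mu_{\pi_0}\right)
  &= \sum_{s,a} p(s)\,\pi(a \mid s)\,
     \log \frac{p(s)\,\pi(a \mid s)}{p(s)\,\pi_0(a \mid s)} \\
  &= \mathbb{E}_{s \sim p}\!\left[D_{\mathrm{KL}}\!\left(\pi(\cdot \mid s)\,\|\,\pi_0(\cdot \mid s)\right)\right].
\end{align*}
% In a multi-turn MDP the state distribution itself depends on the policy, so this
% factorization, and with it the equivalence, breaks down.
```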
Experiments show that χ² occupancy measure regularization outperforms KL action distribution regularization in all the environments we study! Our regularization scheme allows for larger improvements in true reward compared to base policies while preventing reward hacking.
December 19, 2024 at 5:17 PM
Regularization is already used to prevent reward hacking in RLHF, but our theory suggests two key changes: regularize based on occupancy measures rather than action distributions and use χ² divergence instead of KL divergence.
December 19, 2024 at 5:17 PM
Our definition also leads to a principled method for preventing reward hacking: regularize optimization toward the base policy using the χ² divergence between occupancy measures. We prove that this regularized objective gives a lower bound on improvement in the true reward.
December 19, 2024 at 5:17 PM
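Schematically, the regularized objective looks like this (notation ours; see the paper for the precise statement and constants):

```latex
% Schematic form of the regularized objective (notation ours; see the paper for
% the precise statement and constants). tilde{r} is the proxy reward, pi_0 the
% base policy, and mu_pi the occupancy measure of pi.
\begin{align*}
  \max_{\pi}\;\;
  & \mathbb{E}_{(s,a) \sim \mu_{\pi}}\!\left[\tilde{r}(s,a)\right]
    \;-\; \lambda\, \chi^{2}\!\left(\mu_{\pi} \,\|\, \mu_{\pi_0}\right), \\
  \text{where}\quad
  \chi^{2}\!\left(\mu \,\|\, \nu\right)
  &= \mathbb{E}_{(s,a) \sim \nu}\!\left[\left(\frac{\mu(s,a)}{\nu(s,a)} - 1\right)^{2}\right].
\end{align*}
```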
We define reward hacking as when optimizing a proxy breaks the correlation, resulting in lower true reward than the base policy. Our definition captures intuitive cases of reward hacking in realistic environments, including RLHF, traffic control, and glucose monitoring.
December 19, 2024 at 5:17 PM
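In symbols, roughly (notation ours): with true reward r, proxy reward r̃, base policy π₀, and J_r(π) the expected return of policy π under reward r, a policy π hacks the proxy when

```latex
% Informal rendering of the definition (notation ours).
\[
  J_{\tilde{r}}(\pi) > J_{\tilde{r}}(\pi_0)
  \qquad \text{yet} \qquad
  J_{r}(\pi) < J_{r}(\pi_0).
\]
```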
We argue that a good proxy *correlates* with the true reward for states and actions sampled from some reasonable "base policy." For example, in RLHF a natural base policy is the SFT policy.
December 19, 2024 at 5:17 PM
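One simple way to write that condition (notation ours; the paper's exact formalization may differ):

```latex
% The proxy tilde{r} and true reward r are positively correlated over state-action
% pairs drawn from the base policy's occupancy measure (notation ours).
\[
  \mathrm{Corr}_{(s,a) \sim \mu_{\pi_0}}\!\left(\tilde{r}(s,a),\; r(s,a)\right) > 0.
\]
```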
However, formally defining reward hacking is tricky because we have to define what makes a proxy reward "reasonable." If we optimize a reward function that's totally unrelated to our objective, then it's unsurprising that it doesn't work and it arguably isn't "reward hacking."
December 19, 2024 at 5:17 PM
Reward hacking is when we optimize a reward function that seems reasonable, but it ceases to be a good proxy and we end up with a policy that performs poorly under the unknown "true" reward function. It's ubiquitous because real-world objectives are really hard to specify.
December 19, 2024 at 5:17 PM