The results show that agents trained with the regret objective achieve near-maximum return for almost all goal locations.
[Figure] Left: performance at test time. Right: % of distinguishing levels played by the respective level designer.
Adding a few distinguishing levels does not alter this outcome. However, we propose a mitigation for this scenario!
• Non-distinguishing: the true and proxy rewards may induce the same behavior
• Distinguishing: the true and proxy rewards induce different behavior (a rough formalization follows below)
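One way to make this distinction precise (the notation here is assumed, not taken from the thread): writing R for the true reward, R̃ for the proxy, and Π*_R(θ) for the set of optimal policies on level θ under reward R, a level is distinguishing when no policy is optimal under both rewards:

```latex
% Assumed notation (not from the thread): R = true reward, \tilde{R} = proxy reward,
% \Pi^*_R(\theta) = set of optimal policies on level \theta under reward R.
\theta \text{ is non-distinguishing} \iff \Pi^*_{R}(\theta) \cap \Pi^*_{\tilde{R}}(\theta) \neq \emptyset,
\qquad
\theta \text{ is distinguishing} \iff \Pi^*_{R}(\theta) \cap \Pi^*_{\tilde{R}}(\theta) = \emptyset.
```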
Minimizing regret will encourage the agent to solve rare out-of-distribution levels during training, helping it learn the correct reward function.
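For reference, the minimax regret objective from the unsupervised environment design literature is usually written as follows (notation assumed here): the regret of a policy π on a level θ is its gap to the optimal return on that level, and training minimizes the worst case over levels:

```latex
% Assumed notation: V_\theta(\pi) = expected return of policy \pi on level \theta,
% \pi^*_\theta = an optimal policy for level \theta.
\mathrm{Regret}_\theta(\pi) \;=\; V_\theta(\pi^*_\theta) - V_\theta(\pi),
\qquad
\pi^\dagger \;\in\; \arg\min_{\pi}\; \max_{\theta}\; \mathrm{Regret}_\theta(\pi).
```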
🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function instead of the human's intended goal.
😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
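The thread itself contains no code; as a rough sense of how a regret-driven curriculum can be approximated in practice (loosely in the spirit of regret-prioritized level replay, and my own illustration rather than the authors' implementation), here is a minimal Python sketch in which the level designer replays levels in proportion to their estimated regret:

```python
import random

# Toy sketch of regret-guided level selection (illustrative only, not the
# authors' code). Each level keeps a regret estimate; training samples
# high-regret levels more often, approximating the minimax regret objective.

class RegretCurriculum:
    def __init__(self, levels):
        # Start with a zero regret estimate for every level.
        self.regret = {level: 0.0 for level in levels}

    def sample_level(self, temperature=1.0):
        # Sample levels with probability increasing in estimated regret,
        # so rare high-regret (e.g. distinguishing) levels get replayed.
        levels = list(self.regret)
        weights = [max(self.regret[l], 1e-6) ** (1.0 / temperature) for l in levels]
        return random.choices(levels, weights=weights, k=1)[0]

    def update(self, level, achieved_return, optimal_return_estimate):
        # Regret = gap between (an estimate of) the best achievable return
        # on this level and the return the current policy actually achieved.
        self.regret[level] = max(optimal_return_estimate - achieved_return, 0.0)
```

The design choice this illustrates: levels where the current policy falls far short of the estimated optimal return, which include the rare distinguishing levels, are sampled more often, and that replay pressure is what pushes the policy toward the true reward rather than the proxy.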