Karim Abdel Sadek
@karimabdel.bsky.social
Incoming PhD, UC Berkeley

Interested in RL, AI Safety, Cooperative AI, TCS

https://karim-abdel.github.io
We also visualize the performance of our agents in a maze for each possible location of the goal in the environment.

The results show that agents trained with the regret objective achieve near-maximum return for almost all goal locations.
July 8, 2025 at 5:16 PM
We complement our theoretical findings with empirical results, which support our theory: agents trained via minimax regret generalize better.

Left: performance at test time
Right: % of distinguishing levels played by the respective level designer
July 8, 2025 at 5:16 PM
When the environments at deployment lie in the support of the training level distribution, we also show that a policy that is optimal with respect to the minimax regret objective is provably robust against goal misgeneralization!
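In rough notation (ours, not necessarily the paper's), with V*_ℓ the optimal return on level ℓ and V^π_ℓ the policy's return, the minimax regret objective reads:

\[
  \pi^{\mathrm{MR}} \;\in\; \arg\min_{\pi} \; \max_{\ell} \Big( V^{*}_{\ell} - V^{\pi}_{\ell} \Big)
\]

Because the max ranges over all training levels, a single distinguishing level is enough to give a proxy-following policy large regret.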
July 8, 2025 at 5:16 PM
We first formally show that a policy maximizing expected value may suffer from goal misgeneralization if distinguishing levels are rare.
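A back-of-the-envelope version of the intuition (our notation, for illustration only): if distinguishing levels have probability ε under the training distribution p and B is the largest per-level return gap, a proxy-following policy can be near-optimal in expectation while failing badly on some distinguishing level:

\[
  \mathbb{E}_{\ell \sim p}\!\left[ V^{*}_{\ell} - V^{\pi_{\mathrm{proxy}}}_{\ell} \right] \;\le\; \varepsilon B
  \quad\text{even when}\quad
  \max_{\ell} \left( V^{*}_{\ell} - V^{\pi_{\mathrm{proxy}}}_{\ell} \right) \;=\; B .
\]

So for small ε, maximizing expected value gives little incentive to correct the proxy.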
July 8, 2025 at 5:16 PM
Goal misgeneralization can occur when training only on non-distinguishing levels, as shown in Langosco et al., 2022.

Adding a few distinguishing levels does not alter this outcome. However, we propose a mitigation for this scenario!
July 8, 2025 at 5:16 PM
Goal misgeneralization arises due to the presence of ‘proxy goals’. We formalize this and characterize environments as either:

• Non-distinguishing: the true and proxy rewards may induce the same behavior

• Distinguishing: the true and proxy rewards induce different behavior (rough formalization below)
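One way to state this (a sketch of the definition, not necessarily the paper's exact one): a level ℓ is non-distinguishing if some policy is simultaneously optimal for the true and the proxy reward on ℓ, and distinguishing otherwise,

\[
  \ell \ \text{non-distinguishing} \;\iff\; \Pi^{*}_{R_{\mathrm{true}}}(\ell) \,\cap\, \Pi^{*}_{R_{\mathrm{proxy}}}(\ell) \neq \emptyset ,
\]

where \(\Pi^{*}_{R}(\ell)\) denotes the set of policies that are optimal for reward R on level ℓ.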
July 8, 2025 at 5:16 PM
We propose using regret, the difference between the optimal agent's return and our current policy's return, as a training objective.

Minimizing it will encourage the agent to solve rare out-of-distribution levels during training, helping it learn the correct reward function.
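In code, the core loop might look something like this minimal sketch (illustration only, not the paper's exact algorithm; levels, level.rollout, optimal_return, and train_on are placeholders, and in practice the optimal return is estimated, e.g. with a second "antagonist" agent as in PAIRED-style methods):

def regret(level, policy, optimal_return, n_rollouts=8):
    # Regret = optimal return on the level minus the current policy's average return.
    avg = sum(level.rollout(policy) for _ in range(n_rollouts)) / n_rollouts
    return optimal_return(level) - avg

def minimax_regret_step(levels, policy, optimal_return, train_on):
    # Level designer: surface the level where the agent's regret is largest;
    # this is exactly where rare, distinguishing levels get picked.
    hardest = max(levels, key=lambda lv: regret(lv, policy, optimal_return))
    # Agent: update the policy to reduce its regret on that level.
    train_on(policy, hardest)
    return hardest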
July 8, 2025 at 5:16 PM
*New Paper*

🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function instead of the human's intended goal.

😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
July 8, 2025 at 5:16 PM
what if…
February 21, 2025 at 4:31 AM