Karim Abdel Sadek
@karimabdel.bsky.social
Incoming PhD, UC Berkeley
Interested in RL, AI Safety, Cooperative AI, TCS
https://karim-abdel.github.io
The paper, "Mitigating goal misgeneralization via minimax regret" will appear at @rl-conference.bsky.social!
Joint work with the great Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger and @michaelddennis.bsky.social
www.arxiv.org/pdf/2507.03068
July 8, 2025 at 5:16 PM
The paper, "Mitigating goal misgeneralization via minimax regret" will appear at @rl-conference.bsky.social!
Joint work with the great Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christrian Schroeder de Witt, David Krueger and @michaelddennis.bsky.social
www.arxiv.org/pdf/2507.03068
Joint work with the great Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christrian Schroeder de Witt, David Krueger and @michaelddennis.bsky.social
www.arxiv.org/pdf/2507.03068
Future work we are excited about:
• Improving UED algorithms so that their behaviour comes closer to what our theory predicts
• Mitigating the fully ambiguous case by focusing on the inductive biases of the agent.
July 8, 2025 at 5:16 PM
We also visualize the performance of our agents in a maze for each possible location of the goal in the environment.
The results show that agents trained with the regret objective achieve near-maximum return for almost all goal locations.
July 8, 2025 at 5:16 PM
We complement our theoretical findings with empirical results. We find that they support our theory, showing better generalization for agents trained via minimax regret.
Left: performance at test time
Right: % of distinguishing levels played by the respective level designer
July 8, 2025 at 5:16 PM
We also show that, when the deployment environments lie in the support of the training level distribution, a policy that is optimal with respect to the minimax regret objective is provably robust against goal misgeneralization!
July 8, 2025 at 5:16 PM
We first formally show that a policy maximizing expected value may suffer from goal misgeneralization if distinguishing levels are rare.
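A back-of-the-envelope version of the intuition (my own sketch and notation, not the paper's exact statement): suppose distinguishing levels have probability ε under the training distribution P and returns lie in [0, V_max]. A proxy-following policy π̃ chosen to also behave optimally on non-distinguishing levels (which the proxy permits there, by the definition of non-distinguishing levels given further down the thread) satisfies

\mathbb{E}_{\theta \sim P}\!\left[ V^*_\theta - V^{\tilde{\pi}}_\theta \right] \le \varepsilon \, V_{\max},

so for small ε it is nearly optimal in expected value even though it can fail on every distinguishing level.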
July 8, 2025 at 5:16 PM
Goal misgeneralization can occur when training only on non-distinguishing levels, as shown in Langosco et al., 2022.
Adding a few distinguishing levels does not alter this outcome. However, we propose a mitigation for this scenario!
July 8, 2025 at 5:16 PM
Goal misgeneralization arises due to the presence of ‘proxy goals’. We formalize this and characterize environments as either:
• Non-distinguishing: the true and proxy rewards may induce the same behaviour
• Distinguishing: the true and proxy rewards induce different behaviour
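A rough way to write the dichotomy in symbols (my notation, not necessarily the paper's exact definition): let R be the true reward, R̃ the proxy, and Π*_R(θ) the set of policies that are optimal for reward R on level θ. Then

\theta \text{ is non-distinguishing} \iff \Pi^*_{R}(\theta) \cap \Pi^*_{\tilde{R}}(\theta) \neq \emptyset, \qquad \theta \text{ is distinguishing} \iff \Pi^*_{R}(\theta) \cap \Pi^*_{\tilde{R}}(\theta) = \emptyset,

i.e. a level is distinguishing exactly when no single policy is optimal for both rewards on it.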
July 8, 2025 at 5:16 PM
We propose using regret, the difference between the optimal agent's return and our current policy's return, as a training objective.
Minimizing it will encourage the agent to solve rare out-of-distribution levels during training, helping it learn the correct reward function.
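In symbols (my notation: V^π_θ is π's expected return on level θ, V*_θ the optimal return, Θ the set of levels), the per-level regret and the minimax-regret objective are roughly

\mathrm{Regret}(\pi, \theta) = V^*_\theta - V^{\pi}_\theta, \qquad \pi_{\mathrm{MMR}} \in \arg\min_{\pi} \max_{\theta \in \Theta} \mathrm{Regret}(\pi, \theta).

Because the objective takes a maximum over levels rather than an expectation, rare distinguishing levels carry as much weight as common ones.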
July 8, 2025 at 5:16 PM
You take the lavalamp output, and Alice and Bob take the dot product of it with their respective number and then apply mod 2 to the result. They then communicate the bit they obtained (1 = wave, 0 = wink), and this operation always returns the same number to both if a = b, or otherwise fails with p = 1/2?
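As I understand it, that is the standard public-randomness equality test. A tiny Python sketch (names are mine; the shared random bits stand in for the lavalamp output):

import secrets

def to_bits(n, width=64):
    # Little-endian bit vector of n.
    return [(n >> i) & 1 for i in range(width)]

def parity_dot(u, v):
    # Inner product mod 2 of two equal-length bit vectors.
    return sum(x & y for x, y in zip(u, v)) % 2

def equality_round(a, b, width=64):
    # Shared public randomness (the "lavalamp output" in the post).
    r = to_bits(secrets.randbits(width), width)
    bit_alice = parity_dot(r, to_bits(a, width))  # Alice announces this bit (wave/wink)
    bit_bob = parity_dot(r, to_bits(b, width))    # Bob announces this bit
    return bit_alice == bit_bob

# If a == b the bits always match; if a != b they differ with probability 1/2,
# since <r,a> + <r,b> = <r, a XOR b> (mod 2) is a uniform bit when r is uniform.
a, b = 1234, 5678
print(all(equality_round(a, a) for _ in range(20)))    # True
print(sum(equality_round(a, b) for _ in range(1000)))  # roughly 500 accidental matches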
February 17, 2025 at 6:30 AM
Here's some cool work taking a first step towards that in Minecraft using MCTS: Scalably Solving Assistance Games - openreview.net/pdf/080f0c69...
November 19, 2024 at 3:26 PM
Very cool work! I think an important challenge is to scale assistance games to scenarios where the goal/action/communication space can be 'large', so as to capture the real-world scenarios where we will actually want to apply CIRL.
November 19, 2024 at 3:26 PM