Karim Abdel Sadek
@karimabdel.bsky.social
Incoming PhD, UC Berkeley

Interested in RL, AI Safety, Cooperative AI, TCS

https://karim-abdel.github.io
We also visualize the performance of our agents in a maze for each possible location of the goal in the environment.

The results show that agents trained with the regret objective achieve near-maximum return for almost all goal locations.
July 8, 2025 at 5:16 PM
We complement our theoretical findings with empirical results, which support our theory: agents trained via minimax regret generalize better.

Left: performance at test time
Right: % of distinguishing levels played by the respective level designer
July 8, 2025 at 5:16 PM
When the environments at deployment lie in the support of the training level distribution, we also show that a policy that is optimal with respect to the minimax regret objective is provably robust against goal misgeneralization!
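In rough notation (ours, not necessarily the paper's), with V*_ℓ the optimal return on level ℓ and V^π_ℓ the policy's return, the minimax regret objective reads:

\[
  \pi^{\mathrm{MR}} \;\in\; \arg\min_{\pi} \; \max_{\ell} \Big( V^{*}_{\ell} - V^{\pi}_{\ell} \Big)
\]

Because the max ranges over all training levels, a single distinguishing level is enough to give a proxy-following policy large regret.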
July 8, 2025 at 5:16 PM
We first formally show that a policy maximizing expected value may suffer from goal misgeneralization if distinguishing levels are rare.
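A back-of-the-envelope version of the intuition (our notation, for illustration only): if distinguishing levels have probability ε under the training distribution p and B is the largest per-level return gap, a proxy-following policy can be near-optimal in expectation while failing badly on some distinguishing level:

\[
  \mathbb{E}_{\ell \sim p}\!\left[ V^{*}_{\ell} - V^{\pi_{\mathrm{proxy}}}_{\ell} \right] \;\le\; \varepsilon B
  \quad\text{even when}\quad
  \max_{\ell} \left( V^{*}_{\ell} - V^{\pi_{\mathrm{proxy}}}_{\ell} \right) \;=\; B .
\]

So for small ε, maximizing expected value gives little incentive to correct the proxy.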
July 8, 2025 at 5:16 PM
Goal misgeneralization can occur when training only on non-distinguishing levels, as shown in Langosco et al., 2022.

Adding a few distinguishing levels does not alter this outcome. However, we propose a mitigation for this scenario!
July 8, 2025 at 5:16 PM
Goal misgeneralization arises due to the presence of ‘proxy goals’. We formalize this and characterize environments as either:

• Non-distinguishing: the true and proxy rewards may induce the same behavior

• Distinguishing: the true and proxy rewards induce different behavior (rough formalization below)
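One way to state this (a sketch of the definition, not necessarily the paper's exact one): a level ℓ is non-distinguishing if some policy is simultaneously optimal for the true and the proxy reward on ℓ, and distinguishing otherwise,

\[
  \ell \ \text{non-distinguishing} \;\iff\; \Pi^{*}_{R_{\mathrm{true}}}(\ell) \,\cap\, \Pi^{*}_{R_{\mathrm{proxy}}}(\ell) \neq \emptyset ,
\]

where \(\Pi^{*}_{R}(\ell)\) denotes the set of policies that are optimal for reward R on level ℓ.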
July 8, 2025 at 5:16 PM
We propose using regret, the difference between the optimal agent's return and our current policy's return, as a training objective.

Minimizing it will encourage the agent to solve rare out-of-distribution levels during training, helping it learn the correct reward function.
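In code, the core loop might look something like this minimal sketch (illustration only, not the paper's exact algorithm; levels, level.rollout, optimal_return, and train_on are placeholders, and in practice the optimal return is estimated, e.g. with a second "antagonist" agent as in PAIRED-style methods):

def regret(level, policy, optimal_return, n_rollouts=8):
    # Regret = optimal return on the level minus the current policy's average return.
    avg = sum(level.rollout(policy) for _ in range(n_rollouts)) / n_rollouts
    return optimal_return(level) - avg

def minimax_regret_step(levels, policy, optimal_return, train_on):
    # Level designer: surface the level where the agent's regret is largest;
    # this is exactly where rare, distinguishing levels get picked.
    hardest = max(levels, key=lambda lv: regret(lv, policy, optimal_return))
    # Agent: update the policy to reduce its regret on that level.
    train_on(policy, hardest)
    return hardest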
July 8, 2025 at 5:16 PM
*New Paper*

🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function instead of the human's intended goal.

😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
July 8, 2025 at 5:16 PM
what if…
February 21, 2025 at 4:31 AM