Vivek Myers
@vivekmyers.bsky.social
PhD student @Berkeley_AI
reinforcement learning, AI, robotics
How can agents trained to reach (temporally) nearby goals generalize to attain distant goals?

Come to our #ICLR2025 poster now to discuss 𝘩𝘰𝘳𝘪𝘻𝘰𝘯 𝘨𝘦𝘯𝘦𝘳𝘢𝘭𝘪𝘻𝘢𝘵𝘪𝘰𝘯!

w/ @crji.bsky.social and @ben-eysenbach.bsky.social

📍Hall 3 + Hall 2B #637
April 26, 2025 at 2:12 AM
We validated this in simulation. Across offline RL benchmarks, imitation using our TRA task representations outperformed standard behavioral cloning, especially on stitching tasks. In many cases, TRA beat "true" value-based offline RL using only an imitation loss. 5/
February 14, 2025 at 1:39 AM
Successor features have long been known to boost RL generalization (Dayan, 1993). Our findings suggest something stronger: successor task representations produce emergent capabilities beyond the training tasks, even without RL or explicit subtask decomposition. 4/
February 14, 2025 at 1:39 AM
This trick encourages a form of time invariance during learning: both nearby and distant goals are represented similarly. By additionally aligning language instructions 𝜉(ℓ) to the goal representations 𝜓(𝑔), the policy can also perform new compound language tasks. 3/
February 14, 2025 at 1:39 AM
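Schematically, the training objective couples imitation with two alignment terms (notation beyond ψ, φ, ξ is illustrative; the exact loss weights and form may differ):

ℒ = 𝔼[ −log π(aₜ | sₜ, ψ(g)) ] + λ · align(φ(sₜ), ψ(g)) + λ′ · align(ξ(ℓ), ψ(g)),

so that state successor features, goal representations, and language representations all live in a shared space.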
What does temporal alignment mean? When training, our policy imitates the human actions that lead to the end goal 𝑔 of a trajectory. Rather than training on the raw goals, we use a representation 𝜓(𝑔) that aligns with the successor features 𝜙(𝑠) of the preceding states. 2/
February 14, 2025 at 1:39 AM
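A minimal sketch of what such an alignment term could look like in PyTorch (a contrastive InfoNCE form with hypothetical encoders; the actual TRA objective may differ):

import torch
import torch.nn.functional as F

def temporal_alignment_loss(phi_s, psi_g, temperature=0.1):
    # phi_s: [batch, dim] successor features of states; psi_g: [batch, dim]
    # representations of the end goals of the same trajectories (row-aligned).
    phi_s = F.normalize(phi_s, dim=-1)
    psi_g = F.normalize(psi_g, dim=-1)
    logits = phi_s @ psi_g.T / temperature                    # pairwise similarities
    labels = torch.arange(len(phi_s), device=phi_s.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)

Each state's successor features are pulled toward the representation of the goal its trajectory eventually reaches and pushed away from the goals of other trajectories.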
Current robot learning methods are good at imitating tasks seen during training, but struggle to compose behaviors in new ways. When training imitation policies, we found something surprising: using temporally aligned task representations enabled compositional generalization. 1/
February 14, 2025 at 1:39 AM
Empirical results support this theory. The degree of planning invariance and the degree of horizon generalization are correlated across environments and GCRL methods. Critics parameterized as a quasimetric distance indeed tend to generalize best across horizons. 5/
February 4, 2025 at 8:37 PM
Similar to how CNN architectures exploit the inductive bias of translation invariance for image classification, RL policies can enforce planning invariance by using a *quasimetric* critic parameterization that is guaranteed to obey the triangle inequality. 4/
February 4, 2025 at 8:37 PM
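For illustration, one simple critic parameterization that satisfies the triangle inequality by construction (a sketch with an arbitrary encoder; practical quasimetric architectures such as IQE or MRN are more expressive):

import torch
import torch.nn as nn

class SimpleQuasimetricCritic(nn.Module):
    # d(s, g) = sum_i relu(h_i(s) - h_i(g)): nonnegative, d(x, x) = 0, and
    # relu(a - c) <= relu(a - b) + relu(b - c) componentwise gives the triangle
    # inequality, while asymmetry is allowed (a quasimetric, not a metric).
    def __init__(self, obs_dim, hidden_dim=256, n_components=64):
        super().__init__()
        self.h = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_components),
        )

    def forward(self, s, g):
        return torch.relu(self.h(s) - self.h(g)).sum(-1)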
The key to achieving horizon generalization is *planning invariance*. A policy is planning invariant if decomposing tasks into simpler subtasks doesn't improve performance. We prove planning invariance can enable horizon generalization. 3/
February 4, 2025 at 8:37 PM
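Informally, in illustrative notation (the paper's formal definition may differ): a goal-conditioned policy π is planning invariant when

π(· | s, g) ≈ π(· | s, w*),  with w* ∈ argmin_w [ d(s, w) + d(w, g) ],

i.e., routing through an optimal intermediate waypoint w* instead of conditioning directly on the distant goal g does not change or improve the behavior.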
Certain RL algorithms are more conducive to horizon generalization than others. Goal-conditioned RL (GCRL) methods with a bilinear critic ϕ(𝑠)ᵀψ(𝑔), as well as quasimetric methods, better enable horizon generalization. 2/
February 4, 2025 at 8:37 PM
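For reference, a minimal bilinear critic head of the kind described above (a sketch; layer sizes are arbitrary):

import torch
import torch.nn as nn

class BilinearCritic(nn.Module):
    # Q(s, g) = phi(s)^T psi(g): separate state and goal encoders
    # combined by an inner product over a shared representation space.
    def __init__(self, obs_dim, goal_dim, repr_dim=64, hidden=256):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, repr_dim))
        self.psi = nn.Sequential(nn.Linear(goal_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, repr_dim))

    def forward(self, s, g):
        return (self.phi(s) * self.psi(g)).sum(-1)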
Reinforcement learning agents should be able to improve upon behaviors seen during training.
In practice, RL agents often struggle to generalize to new long-horizon behaviors.
Our new paper studies *horizon generalization*, the degree to which RL algorithms generalize to reaching distant goals. 1/
February 4, 2025 at 8:37 PM
We show that optimizing this human effective empowerment helps in assistive settings. Theoretically, maximizing the effective empowerment optimizes an (average-case) lower bound on the human's utility/reward/objective under an uninformative prior. 4/
January 22, 2025 at 2:17 AM
Our recent paper, "Learning to Assist Humans Without Inferring Rewards," proposes a scalable contrastive estimator for human empowerment. The estimator learns successor features to model the effects of a human's actions on the environment, approximating the *effective empowerment*. 3/
January 22, 2025 at 2:17 AM
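A rough sketch of this kind of contrastive estimator (a simplification with assumed encoder inputs; the paper's exact objective may differ): train a critic to match each (state, human action) pair with the future outcome it actually produced, versus outcomes from other trajectories, which yields an InfoNCE-style lower bound on the mutual information between human actions and future states.

import torch
import torch.nn.functional as F

def contrastive_empowerment_loss(f_sa, f_future, temperature=0.1):
    # f_sa:     [batch, dim] embeddings of (state, human action) pairs
    # f_future: [batch, dim] successor-feature embeddings of the future states
    #           actually reached (row-aligned with f_sa)
    f_sa = F.normalize(f_sa, dim=-1)
    f_future = F.normalize(f_future, dim=-1)
    logits = f_sa @ f_future.T / temperature
    labels = torch.arange(len(f_sa), device=f_sa.device)   # matching pairs on the diagonal
    return F.cross_entropy(logits, labels)

The learned critic's score then serves as a proxy for the human's (effective) empowerment, which the assistive agent can maximize.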
This distinction is subtle but important. An agent that maximizes a misspecified model of the human's reward or seeks power for itself can lead to arbitrarily bad outcomes where the human becomes disempowered. Maximizing human empowerment avoids this. 2/
January 22, 2025 at 2:17 AM
“AI Alignment” is typically seen as the problem of instilling human values in agents. But the very notion of human values is nebulous—humans have distinct, contradictory preferences which may change. Really, we should ensure agents *empower* humans to best achieve their own goals. 1/
January 22, 2025 at 2:17 AM
When is interpolation in a learned representation space meaningful? Come to our NeurIPS poster today at 4:30 to see how time-contrastive learning can provably enable inference (such as subgoal planning) through warped linear interpolation!
December 11, 2024 at 10:57 PM
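As a loose illustration of the idea (not the paper's exact procedure, which derives a specific warp): with time-contrastive encodings, candidate subgoals can be read off by interpolating between the representations of the current state and the goal and retrieving the nearest dataset states.

import torch

def interpolate_subgoals(z_start, z_goal, z_dataset, n_waypoints=5):
    # z_start, z_goal: [dim] encoded current state and goal
    # z_dataset:       [N, dim] encoded candidate states
    # Returns indices of retrieved subgoal states, ordered start -> goal.
    alphas = torch.linspace(0, 1, n_waypoints + 2)[1:-1]    # interior points only
    waypoints = torch.stack([(1 - a) * z_start + a * z_goal for a in alphas])
    dists = torch.cdist(waypoints, z_dataset)               # [n_waypoints, N]
    return dists.argmin(dim=-1)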