Nicolas Espinosa Dice
nico-espinosa-dice.bsky.social
cs phd student @cornelluniversity.bsky.social. previously @harveymuddcollege.bsky.social. working on reinforcement learning & generative models. https://nico-espinosadice.github.io/
check out our paper for more details! in summary:

✅ efficient training
✅ expressive modeling
✅ inference-time scaling

paper: arxiv.org/abs/2505.22866
code: github.com/nico-espinosadice/SORL
w/ Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, @gokul.dev, @xkianteb.bsky.social, Wen Sun
Scaling Offline RL via Efficient and Expressive Shortcut Models
Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) rema...
arxiv.org
June 12, 2025 at 7:39 PM
putting it all together, 𝚂𝙾𝚁𝙻 consistently outperforms 10 offline RL baselines across 40 tasks!
June 12, 2025 at 7:39 PM
a common issue w/ CoT in LLMs is extrapolating beyond the reasoning length optimized for during training. 𝚂𝙾𝚁𝙻’s sequential and best-of-N scaling enable *sequential extrapolation*: inference with more steps than were used in training
June 12, 2025 at 7:38 PM
𝚂𝙾𝚁𝙻 does *parallel scaling* by performing best-of-N with the Q function as a verifier. this is an embarrassingly simple way to help address issues with policy extraction (a major problem in offline RL, see arxiv.org/abs/2406.09329)
Is Value Learning Really the Main Bottleneck in Offline RL?
While imitation learning requires access to high-quality data, offline reinforcement learning (RL) should, in principle, perform similarly or better with substantially lower data quality by using a va...
arxiv.org
June 12, 2025 at 7:38 PM
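the best-of-N verification step above can be sketched in a few lines — a minimal sketch, assuming generic `policy` and `q_fn` callables (hypothetical interfaces, not SORL’s actual API):

```python
import numpy as np

def best_of_n_action(policy, q_fn, obs, n=16):
    """Parallel scaling sketch: sample N candidate actions from the
    policy and keep the one the learned Q function scores highest.
    `policy` and `q_fn` are hypothetical callables, not SORL's API."""
    obs_batch = np.repeat(obs[None, :], n, axis=0)  # replicate observation N times
    actions = policy(obs_batch)                     # N candidate actions in parallel
    scores = q_fn(obs_batch, actions)               # Q(s, a) acts as the verifier
    return actions[np.argmax(scores)]               # best-of-N selection
```

because the N candidates are independent, this step is trivially parallelizable — hence “parallel scaling”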
motivated by CoT in LLMs, 𝚂𝙾𝚁𝙻 scales inference *sequentially* by increasing the number of inference steps
June 12, 2025 at 7:38 PM
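the sequential knob is just the number of integration steps used to sample an action — a minimal sketch, assuming a generic velocity-field interface `velocity(obs, a, t, dt)` (hypothetical, not SORL’s API):

```python
import numpy as np

def sample_action(velocity, obs, steps, dim=2, rng=None):
    """Sequential scaling sketch: integrate a flow/shortcut policy with a
    chosen number of Euler steps. More steps = more inference compute and
    a finer integration of the same learned field."""
    rng = np.random.default_rng(rng)
    a = rng.standard_normal(dim)              # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        a = a + dt * velocity(obs, a, t, dt)  # one Euler step along the flow
    return a
```

the same trained policy can be run with 1 step under a tight budget or many steps when compute allows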
with self-consistency, 𝚂𝙾𝚁𝙻 can:

1. *scale expressivity* by fitting complex, multi-modal distributions
2. *scale training* by limiting the amount of backprop through time
3. *scale inference* through sequential and parallel scaling
June 12, 2025 at 7:37 PM
𝚂𝙾𝚁𝙻 leverages @Kevin Frans's shortcut models to incorporate self-consistency into training, allowing the policy to learn “large jumps” that match the composition of more precise “small jumps”

pc: @Kevin Frans
June 12, 2025 at 7:37 PM
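the self-consistency idea can be sketched as follows — assuming a step-size-conditioned velocity field `s(x, t, d)` in the style of shortcut models (hypothetical interface, not SORL’s actual code):

```python
import numpy as np

def self_consistency_target(s, x, t, d):
    """Shortcut-model self-consistency sketch: the target for one jump of
    size 2d is the average of two consecutive jumps of size d."""
    v1 = s(x, t, d)             # first small jump's velocity
    x_mid = x + d * v1          # follow it for step size d
    v2 = s(x_mid, t + d, d)     # second small jump's velocity
    return (v1 + v2) / 2.0      # what s(x, t, 2d) should predict

def self_consistency_loss(s, x, t, d):
    """Penalize the "large jump" for disagreeing with the two "small jumps"."""
    target = self_consistency_target(s, x, t, d)
    pred = s(x, t, 2 * d)       # the large jump
    return np.mean((pred - target) ** 2)
```

training against this target is what lets the policy take a few large jumps at inference time that agree with the more precise many-step trajectory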
Scalable Offline RL (𝚂𝙾𝚁𝙻): a training procedure for *expressive policies* that can do inference under *any* compute budget

paper: arxiv.org/abs/2505.22866
code: github.com/nico-espinosadice/SORL
w/ Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, @gokul.dev, @xkianteb.bsky.social, Wen Sun
June 12, 2025 at 7:35 PM