Scott Jeen
@enjeeneer.io
enjeeneer.io
PhD Student at Cambridge University. AI and reinforcement learning.
It's dedicated to the late Barry Sealey CBE and Helen Sealey, whose funding of my earlier postgraduate studies opened the door to a PhD. I'm hugely indebted to them for their kindness and generosity.
September 3, 2025 at 9:01 PM
More detail in the paper, on the project page, or in the repo!

Paper: arxiv.org/abs/2506.15446
Project Page: enjeeneer.io/projects/bfm...
Code: github.com/enjeeneer/bf...

with Tom Bewley and Jon Cullen.
Zero-Shot Reinforcement Learning Under Partial Observability
Recent work has shown that, under certain assumptions, zero-shot reinforcement learning (RL) methods can generalise to any unseen task in an environment after reward-free pre-training. Access to Marko...
arxiv.org
July 31, 2025 at 9:01 PM
We explored different sequence models: Transformers, GRUs, LSTMs, S4D, and S5.

To our surprise, we found GRUs to be far and away the most effective, and Transformers to be disappointingly ineffective.

Why? The combined F^T x B representation seems unstable for all non-GRU methods.
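For context, in the standard forward-backward setup the reward is encoded as a task vector via B, and actions are scored by an inner product with F. A minimal sketch of that combination (shapes and names are illustrative, not the repo's code):

```python
import torch

def task_vector(rewards: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # rewards: (n,), B: (n, d) backward embeddings of sampled states.
    # z ~= E_{s~D}[ r(s) * B(s) ]: the reward function summarised as a task vector.
    return (rewards.unsqueeze(-1) * B).mean(dim=0)

def q_values(F: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # F: (num_actions, d) forward embeddings F(s, a, z) for candidate actions.
    # Q(s, a) = F(s, a, z)^T z; the policy acts greedily w.r.t. this score.
    # This forward/backward inner product is the quantity that destabilised
    # for every sequence model we tried except GRUs.
    return F @ z
```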
July 31, 2025 at 9:01 PM
We run experiments on modified ExORL environments with different types of partial observability. In particular, we explore partially observed states and partially observed changes in dynamics.

In aggregate, we improve performance across all partially observed settings.
July 31, 2025 at 9:01 PM
We solve both failure modes by replacing BFMs' standard MLPs with sequence models that condition on trajectories of observations and actions.

We call the resulting family of methods Behaviour Foundation Models with Memory.
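Concretely, the memoryless networks take a single (possibly aliased) observation; the memory variant feeds them a summary of the observation-action history instead. A minimal PyTorch sketch of that swap, with all names illustrative rather than taken from the repo:

```python
import torch
import torch.nn as nn

class MemoryEncoder(nn.Module):
    """Summarise a trajectory of (observation, action) pairs with a GRU."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)

    def forward(self, obs_seq: torch.Tensor, act_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, obs_dim); act_seq: (batch, time, act_dim).
        x = torch.cat([obs_seq, act_seq], dim=-1)
        _, h_n = self.gru(x)           # final hidden state: (1, batch, hidden_dim)
        return h_n.squeeze(0)          # one summary vector per trajectory

# The forward (and backward) networks then condition on this summary
# rather than on the raw, partially observed, current observation.
```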
July 31, 2025 at 9:01 PM
When Behaviour Foundation Models are fed unreliable observations, rather than states, they fail in two predictable ways.

We call these failure modes *state* misidentification and *task* misidentification.

Each inhibits performance in isolation; together they kill the model.
July 31, 2025 at 9:01 PM
It all feels a bit hacky though, yeh.
January 21, 2025 at 4:31 PM
- It's probs not doing pure policy exploration in the classical RL sense. The prior provided by pre-training should reduce the effective search space hugely. I could imagine that small amounts of exploration on top of the reasoning traces provided by the base model could be enough to get signal.
January 21, 2025 at 4:31 PM
I don't disagree, but a couple of possible explanations:
- Fig 3 could imply that it learns to solve questions that require shorter reasoning chains first, before moving to those that require longer reasoning chains.
January 21, 2025 at 4:31 PM
Thank you for this Jane, it's beautiful and heart-wrenching. I didn't know Felix well, but my few interactions with him always left me awed by his all-round brilliance. My thoughts are with you and everyone who knew him more closely. ❤️
January 3, 2025 at 3:34 PM
NeurIPS revolves around demonstration. This year’s @rl-conference.bsky.social revolved around conversation. I much prefer the latter.
December 17, 2024 at 2:54 PM
My bad for messing up the photo!
December 10, 2024 at 4:57 PM