James MacGlashan
@jmac-ai.bsky.social
Ask me about Reinforcement Learning
Research @ Sony AI
AI should learn from its experiences, not copy your data.

My website for answering RL questions: https://www.decisionsanddragons.com/

Views and posts are my own.
It was always cool. That the AI groupies didn’t know it until an AI pop star said so proves it.
November 26, 2025 at 6:06 PM
So in summary, the things I think are going on are
* The task isn't truly sparse.
* The pre-training highly correlates ways of answering (like policies) so that you get very good generalization.
* Inference search means you only need to slightly increase the probabilities to see major changes.
November 25, 2025 at 3:16 PM
All a few gradient steps need to do is make the right modes a little more consistently likely than the others. Once they are, inference search will do its thing and sharpen the outputs in that direction.
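Here's a toy calculation (my own made-up numbers, just to show the amplification): if inference draws n samples and a selector keeps a correct one, a small bump in the per-sample probability of the right mode becomes a large bump in the final answer.

```python
# Toy illustration: how a small per-sample probability bump gets amplified
# when inference draws n samples and a verifier/selector keeps a correct one.
# Numbers are made up for illustration.

def p_select_correct(p_sample: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_sample) ** n

for p in (0.20, 0.30):  # hypothetical: before vs. after a few gradient steps
    print(f"per-sample p={p:.2f} -> best-of-8 p={p_select_correct(p, 8):.3f}")

# per-sample p=0.20 -> best-of-8 p=0.832
# per-sample p=0.30 -> best-of-8 p=0.942
```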
November 25, 2025 at 3:16 PM
That is, during inference, we don't just roll out the model. We search and deliberately decode results that are more likely.

If the pre-trained model had fairly uniform probabilities over the different modes at the start, you don't need to modify the model much to get a major change in behavior.
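A minimal sketch of what I mean by that kind of search, with a hypothetical `model.sample` / `model.log_prob` interface standing in for any real decoding stack (not a specific library's API):

```python
# Minimal sketch of likelihood-guided inference search (best-of-n by model score).
# `model` is a hypothetical object with .sample(prompt) -> text and
# .log_prob(prompt, text) -> float; it stands in for whatever LLM stack you use.

def search_decode(model, prompt: str, n: int = 8) -> str:
    candidates = [model.sample(prompt) for _ in range(n)]
    # Deliberately decode toward the modes the model already thinks are likely:
    # a small post-training shift in those probabilities changes which mode wins.
    return max(candidates, key=lambda text: model.log_prob(prompt, text))
```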
November 25, 2025 at 3:16 PM
That covers "sample" complexity (again, a bit different from usual RL since it's all compute, but same idea). It doesn't cover the gradient updates being few.

For that I might suspect that the way inference works is a factor.
November 25, 2025 at 3:16 PM
So because the model already has this innate bias and can easily switch between "modes" of behavior, it too can benefit from few samples nudging it into that mode.
November 25, 2025 at 3:16 PM
LLMs are not so explicit as to have an engineered space of reward functions/options like I did. But the pretraining and correlation effectively does that. It's why prompting works at all: it nudges it into a kind of policy.
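Loosely, in code (a rough framing of my own, with a hypothetical `generate` function standing in for the LLM call):

```python
# Rough framing of "prompting selects a policy": the same pre-trained model,
# conditioned on different prompts, behaves like different policies pi(action | state).
# `generate(text) -> str` is a hypothetical stand-in for an LLM call.

def make_policy(generate, system_prompt: str):
    def policy(observation: str) -> str:
        return generate(system_prompt + "\n" + observation)
    return policy

# Two "policies" carved out of one model by prompting alone (illustrative):
# polite_helper = make_policy(generate, "Answer concisely and politely.")
# terse_grader  = make_policy(generate, "Reply only with PASS or FAIL.")
```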
November 25, 2025 at 3:16 PM
The consequence is the agent could learn *very* quickly from few training samples. It made learning go from insufferably long to feeling reactive, like animal training. Like the agent got the point. And it all had to do with the internal model of the agent, not the observables.
November 25, 2025 at 3:16 PM
Back when I worked on RL from human feedback (before LLMs), one of the common ideas in my work was that, instead of only actions being the central object, the agent was biased by goals/policies.

The observables were states & actions, but the agent had an understanding of reward functions, options, etc.
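A heavily simplified sketch of that kind of setup (not the actual published algorithm): keep a belief over a small engineered set of candidate goals/reward functions and update it from each piece of feedback, so a handful of signals can snap the agent onto the right goal.

```python
# Heavily simplified sketch (not the actual published method): the agent holds a
# belief over a small, engineered set of candidate goals/reward functions and
# updates that belief from human feedback on observed (state, action) pairs.

def update_belief(belief, candidate_rewards, state, action, feedback, noise=0.1):
    """One Bayesian update. `candidate_rewards[g](state, action)` returns True if
    the action is good under goal g; `feedback` is +1 (praise) or -1 (scold)."""
    posterior = {}
    for goal, prior in belief.items():
        good = candidate_rewards[goal](state, action)
        agrees = (feedback > 0) == good
        likelihood = 1.0 - noise if agrees else noise
        posterior[goal] = prior * likelihood
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}
```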
November 25, 2025 at 3:16 PM
The first is that text is highly correlated. Sure, there might be a lot of tokens, but because they are all correlated you don't need to reinforce each token. The pre-training has already found all these correlations so by reinforcing one trajectory, you get massive generalization.
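A toy numerical illustration of that correlation effect (features and numbers entirely made up): if two trajectories share features, a step that reinforces one also raises the score of the other.

```python
# Toy illustration of generalization through shared (correlated) features:
# reinforcing one sequence also raises the score of a correlated one,
# because they share representation. Features/numbers are made up.
import numpy as np

w = np.zeros(3)                         # "model" weights over shared features
reinforced = np.array([1.0, 1.0, 0.0])  # features of the trajectory we reward
correlated = np.array([1.0, 0.9, 0.1])  # a different trajectory, similar features
unrelated  = np.array([0.0, 0.0, 1.0])

w += 0.5 * reinforced                   # one crude policy-gradient-style step

print(w @ reinforced)   # 1.0  -> score of the reinforced trajectory goes up
print(w @ correlated)   # 0.95 -> the correlated trajectory rises almost as much
print(w @ unrelated)    # 0.0  -> the unrelated one doesn't move
```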
November 25, 2025 at 3:16 PM
I don't have enough experience training LLMs to have an opinion on whether it's enough, but let's assume you're right that it is surprisingly few steps. I have a couple of hypotheses for why.
November 25, 2025 at 3:16 PM
Because the problem is a computational cost instead of a sample complexity cost, we can ask: what are the most computationally efficient methods?

Really dumb methods (like GRPO) are more effective uses of compute than any "smart" exploration method, because we sadly still suck at exploration.
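To make the "dumb" part concrete, here's a stripped-down sketch of the group-relative advantage idea behind GRPO (just the advantage computation; the real method adds a clipped policy-gradient objective and a KL penalty, omitted here):

```python
# Stripped-down sketch of GRPO-style group-relative advantages: sample a group
# of rollouts for the same prompt, score them, and use the within-group
# standardized reward as the advantage for each rollout.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. verifier rewards for 6 rollouts of one prompt:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
```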
November 24, 2025 at 9:06 PM
An interesting facet of RL for LLMs is that there isn't much of a sample complexity issue. The "environment" model is known and computable (it's the LLM itself + verifier/reward model). That means you can do as many rollouts as you want. It's a computational cost, but not a sample complexity cost.
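In code terms, a hedged sketch (hypothetical `model.sample` and `verifier` callables, no particular library): the whole "environment" is the model plus the reward function, so generating more experience is purely a compute question.

```python
# Sketch of why there's no sample-complexity wall: the "environment" is the
# model plus a verifier/reward model, both of which we can run as much as we
# can afford. `model.sample` and `verifier` are hypothetical stand-ins.

def collect_rollouts(model, verifier, prompt: str, n: int):
    rollouts = []
    for _ in range(n):                       # bounded by compute, not by data
        completion = model.sample(prompt)
        reward = verifier(prompt, completion)
        rollouts.append((completion, reward))
    return rollouts
```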
November 24, 2025 at 9:06 PM
And incidentally, I'm not sure we have *any* RL methods that are good at hard exploration problems.

On-policy methods can have worse sample complexity than off-policy methods, but that's for reasons orthogonal to exploration; it has more to do with data reuse.
November 24, 2025 at 9:06 PM
It's the same situation for RL for LLMs: although the outcome reward is sparse and only at the "end" of the text [insert discussion of whether LLM RL is a bandit or not], it's not low variance.

From a pretrained model, many rollouts hit a positive signal and many hit a negative signal.
November 24, 2025 at 9:06 PM
A true sparse reward task is when you have low variance in returns for the current policy -- most rollouts yield a constant. In these kinds of situations MCTS/UCT can suck really bad and in the worst case has super exponential complexity!
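A quick way to see the low-variance point (toy numbers of my own): treat a rollout's return as Bernoulli, so its variance is p(1-p); a truly sparse task has p near 0, while self-play Go or RL on a decent pretrained model sits near p ≈ 0.5.

```python
# Toy arithmetic for "sparse = low return variance": if a rollout succeeds with
# probability p (return 1) and fails otherwise (return 0), the return variance
# is p * (1 - p). Numbers are illustrative.

def return_variance(p: float) -> float:
    return p * (1.0 - p)

print(return_variance(0.001))  # ~0.001 -> truly sparse: almost no signal to learn from
print(return_variance(0.5))    # 0.25   -> Go self-play / decent pretrained LLM: rich signal
```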
November 24, 2025 at 9:06 PM
That is, while most steps in Go have 0 reward, there is high variance in the return in the self play: about half the time you win/lose against yourself.

That's actually a rich signal to learn from and is far more important than the UCT exploration.
November 24, 2025 at 9:06 PM
I'd say that's because it's not sparse reward in a meaningful way, in the same way Go in self-play is not sparse in a meaningful way.

That is, in Go, your reward is 0 for most time steps and only +1/-1 at the end. That sounds sparse, but not from an algorithmic perspective.
November 24, 2025 at 9:06 PM
If it helps, most AI researchers agree. For a long time, the top examples in robotics papers have been making/fetching a coffee/beer, doing laundry, etc.; but they're hard.

While it might seem strange, GenAI research is important research to building those things. But the tech industry has glommed onto the wrong bits.
November 21, 2025 at 4:35 PM
To change your mind, I think we'd have to operationally define what you mean by "medicine" and "social science".

By Wikipedia's definition of social science, I would be inclined to agree that "health care" is a social science, but "medicine" is not.
November 19, 2025 at 4:45 PM
A failed journalist with poor ethics, who burnt it all down because she fell in love with RFK Jr of all people, but keeps failing up despite that and now we all have to suffer her.
November 18, 2025 at 2:28 PM