James MacGlashan
@jmac-ai.bsky.social
Ask me about Reinforcement Learning
Research @ Sony AI
AI should learn from its experiences, not copy your data.

My website for answering RL questions: https://www.decisionsanddragons.com/

Views and posts are my own.
It was always cool. That the AI groupies didn’t know it until an AI pop star said so proves it.
November 26, 2025 at 6:06 PM
So in summary, the things I think are going on are
* The task isn't truly sparse.
* The pre-training highly correlates ways of answering (like policies) so that you get very good generalization.
* Inference search means you only need to slightly increase the probabilities to see major changes.
November 25, 2025 at 3:16 PM
All a few gradient steps need to do is make the right modes a little more consistently likely than the others. Once they are, inference search will do its thing and sharpen the outputs in that direction.
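Here's a toy calculation (my own made-up numbers, just to show the amplification): if inference draws n samples and a selector keeps a correct one, a small bump in the per-sample probability of the right mode becomes a large bump in the final answer.

```python
# Toy illustration: how a small per-sample probability bump gets amplified
# when inference draws n samples and a verifier/selector keeps a correct one.
# Numbers are made up for illustration.

def p_select_correct(p_sample: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_sample) ** n

for p in (0.20, 0.30):  # hypothetical: before vs. after a few gradient steps
    print(f"per-sample p={p:.2f} -> best-of-8 p={p_select_correct(p, 8):.3f}")

# per-sample p=0.20 -> best-of-8 p=0.832
# per-sample p=0.30 -> best-of-8 p=0.942
```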
November 25, 2025 at 3:16 PM
That is, during inference, we don't just roll out the model. We search and deliberately decode results that are more likely.

If the pre-trained model had fairly uniform probabilities over the different modes at the start, you don't need to modify the model much to get a major change in behavior.
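A minimal sketch of what I mean by that kind of search, with a hypothetical `model.sample` / `model.log_prob` interface standing in for any real decoding stack (not a specific library's API):

```python
# Minimal sketch of likelihood-guided inference search (best-of-n by model score).
# `model` is a hypothetical object with .sample(prompt) -> text and
# .log_prob(prompt, text) -> float; it stands in for whatever LLM stack you use.

def search_decode(model, prompt: str, n: int = 8) -> str:
    candidates = [model.sample(prompt) for _ in range(n)]
    # Deliberately decode toward the modes the model already thinks are likely:
    # a small post-training shift in those probabilities changes which mode wins.
    return max(candidates, key=lambda text: model.log_prob(prompt, text))
```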
November 25, 2025 at 3:16 PM
That covers "sample" complexity (again, a bit different from usual RL since it's all compute, but same idea). It doesn't cover the gradient updates being few.

For that I might suspect that the way inference works is a factor.
November 25, 2025 at 3:16 PM
So because the model already has this innate bias and can easily switch between "modes" of behavior, it too can benefit from few samples nudging it into that mode.
November 25, 2025 at 3:16 PM
LLMs are not so explicit as to have an engineered space of reward functions/options like I did. But the pretraining and correlation effectively does that. It's why prompting works at all: it nudges it into a kind of policy.
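Loosely, in code (a rough framing of my own, with a hypothetical `generate` function standing in for the LLM call):

```python
# Rough framing of "prompting selects a policy": the same pre-trained model,
# conditioned on different prompts, behaves like different policies pi(action | state).
# `generate(text) -> str` is a hypothetical stand-in for an LLM call.

def make_policy(generate, system_prompt: str):
    def policy(observation: str) -> str:
        return generate(system_prompt + "\n" + observation)
    return policy

# Two "policies" carved out of one model by prompting alone (illustrative):
# polite_helper = make_policy(generate, "Answer concisely and politely.")
# terse_grader  = make_policy(generate, "Reply only with PASS or FAIL.")
```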
November 25, 2025 at 3:16 PM
The consequence is the agent could learn *very* quickly from few training samples. It made learning go from insufferably long to feeling reactive, like animal training. Like the agent got the point. And it all had to do with the internal model of the agent, not the observables.
November 25, 2025 at 3:16 PM
Back when I worked on RL from human feedback (before LLMs), one of the common ideas in my work was that, instead of only actions being the central object, the agent was biased by goals/policies.

The observables were states & actions, but the agent had an understanding of reward functions, options, etc.
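A heavily simplified sketch of that kind of setup (not the actual published algorithm): keep a belief over a small engineered set of candidate goals/reward functions and update it from each piece of feedback, so a handful of signals can snap the agent onto the right goal.

```python
# Heavily simplified sketch (not the actual published method): the agent holds a
# belief over a small, engineered set of candidate goals/reward functions and
# updates that belief from human feedback on observed (state, action) pairs.

def update_belief(belief, candidate_rewards, state, action, feedback, noise=0.1):
    """One Bayesian update. `candidate_rewards[g](state, action)` returns True if
    the action is good under goal g; `feedback` is +1 (praise) or -1 (scold)."""
    posterior = {}
    for goal, prior in belief.items():
        good = candidate_rewards[goal](state, action)
        agrees = (feedback > 0) == good
        likelihood = 1.0 - noise if agrees else noise
        posterior[goal] = prior * likelihood
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}
```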
November 25, 2025 at 3:16 PM
The first is that text is highly correlated. Sure, there might be a lot of tokens, but because they are all correlated you don't need to reinforce each token. The pre-training has already found all these correlations so by reinforcing one trajectory, you get massive generalization.
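A toy numerical illustration of that correlation effect (features and numbers entirely made up): if two trajectories share features, a step that reinforces one also raises the score of the other.

```python
# Toy illustration of generalization through shared (correlated) features:
# reinforcing one sequence also raises the score of a correlated one,
# because they share representation. Features/numbers are made up.
import numpy as np

w = np.zeros(3)                         # "model" weights over shared features
reinforced = np.array([1.0, 1.0, 0.0])  # features of the trajectory we reward
correlated = np.array([1.0, 0.9, 0.1])  # a different trajectory, similar features
unrelated  = np.array([0.0, 0.0, 1.0])

w += 0.5 * reinforced                   # one crude policy-gradient-style step

print(w @ reinforced)   # 1.0  -> score of the reinforced trajectory goes up
print(w @ correlated)   # 0.95 -> the correlated trajectory rises almost as much
print(w @ unrelated)    # 0.0  -> the unrelated one doesn't move
```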
November 25, 2025 at 3:16 PM
I don't have enough experience training LLMs to have an opinion on whether it's enough, but let's assume you're right that it is surprisingly few steps. I have a couple of hypotheses for why.
November 25, 2025 at 3:16 PM
Because the problem is a computational cost instead of a sample complexity cost, we can ask: what are the most computationally efficient methods?

Really dumb methods (like GRPO) are more effective uses of compute than any "smart" exploration method, because we sadly still suck at exploration.
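To make the "dumb" part concrete, here's a stripped-down sketch of the group-relative advantage idea behind GRPO (just the advantage computation; the real method adds a clipped policy-gradient objective and a KL penalty, omitted here):

```python
# Stripped-down sketch of GRPO-style group-relative advantages: sample a group
# of rollouts for the same prompt, score them, and use the within-group
# standardized reward as the advantage for each rollout.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. verifier rewards for 6 rollouts of one prompt:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
```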
November 24, 2025 at 9:06 PM
An interesting facet of RL for LLMs is that there isn't much of a sample complexity issue. The "environment" model is known and computable (it's the LLM itself + verifier/reward model). That means you can do as many rollouts as you want. It's a computational cost, but not a sample complexity cost.
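In code terms, a hedged sketch (hypothetical `model.sample` and `verifier` callables, no particular library): the whole "environment" is the model plus the reward function, so generating more experience is purely a compute question.

```python
# Sketch of why there's no sample-complexity wall: the "environment" is the
# model plus a verifier/reward model, both of which we can run as much as we
# can afford. `model.sample` and `verifier` are hypothetical stand-ins.

def collect_rollouts(model, verifier, prompt: str, n: int):
    rollouts = []
    for _ in range(n):                       # bounded by compute, not by data
        completion = model.sample(prompt)
        reward = verifier(prompt, completion)
        rollouts.append((completion, reward))
    return rollouts
```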
November 24, 2025 at 9:06 PM
And incidentally, I'm not sure we have *any* RL methods that are good at hard exploration problems.

On-policy methods can have worse sample complexity than off-policy methods, but that's for reasons orthogonal to exploration; it has more to do with data reuse.
November 24, 2025 at 9:06 PM
It's the same situation for RL for LLMs: although the outcome reward is sparse and only at the "end" of the text [insert discussion of whether LLM RL is a bandit or not], it's not low variance.

From a pretrained model, many rollouts hit a positive signal and many hit a negative signal.
November 24, 2025 at 9:06 PM
A true sparse reward task is when you have low variance in returns for the current policy -- most rollouts yield a constant. In these kinds of situations MCTS/UCT can suck really bad and in the worst case has super exponential complexity!
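A quick way to see the low-variance point (toy numbers of my own): treat a rollout's return as Bernoulli, so its variance is p(1-p); a truly sparse task has p near 0, while self-play Go or RL on a decent pretrained model sits near p ≈ 0.5.

```python
# Toy arithmetic for "sparse = low return variance": if a rollout succeeds with
# probability p (return 1) and fails otherwise (return 0), the return variance
# is p * (1 - p). Numbers are illustrative.

def return_variance(p: float) -> float:
    return p * (1.0 - p)

print(return_variance(0.001))  # ~0.001 -> truly sparse: almost no signal to learn from
print(return_variance(0.5))    # 0.25   -> Go self-play / decent pretrained LLM: rich signal
```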
November 24, 2025 at 9:06 PM
That is, while most steps in Go have 0 reward, there is high variance in the return in the self play: about half the time you win/lose against yourself.

That's actually a rich signal to learn from and is far more important than the UCT exploration.
November 24, 2025 at 9:06 PM
I'd say that's because it's not sparse reward in a meaningful way, in the same way Go in self-play is not sparse in a meaningful way.

That is, in Go, your reward is 0 for most time steps and only +1/-1 at the end. That sounds sparse, but not from an algorithmic perspective.
November 24, 2025 at 9:06 PM
If it helps, most AI researchers agree. For a long time, the top examples in robotics papers have been making/fetching a coffee/beer, doing laundry, etc.; but they're hard.

While it might seem strange, GenAI research is important research to building those things. But the tech industry has glommed onto the wrong bits.
November 21, 2025 at 4:35 PM
To change your mind, I think we'd have to operationally define what you mean by "medicine" and "social science".

By Wikipedia's definition of social science, I would be inclined to agree that "health care" is a social science, but "medicine" is not.
November 19, 2025 at 4:45 PM
A failed journalist with poor ethics, who burnt it all down because she fell in love with RFK Jr of all people, but keeps failing up despite that and now we all have to suffer her.
November 18, 2025 at 2:28 PM