Alex Lew
@alexlew.bsky.social
Theory & practice of probabilistic programming. Current: MIT Probabilistic Computing Project; Fall '25: Incoming Asst. Prof. at Yale CS
When we differentiate their (Schulman's) estimator, pi_ref comes back into the objective, but in a new role.

Now, the objective has a CrossEnt(pi_ref, pi_theta) term. KL(P,Q) = CrossEnt(P,Q) - Entropy(P), so this is related to KL, but note the direction of KL is reversed.

6/
February 10, 2025 at 4:32 AM
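[A sketch of the calculation behind the claim above, in my own notation, treating pi_old as approximately pi_theta and the sampled completions o as fixed under differentiation. Write r(o) = pi_ref(o)/pi_theta(o) for the ratio appearing in the estimator. Then

\[
\nabla_\theta \big[ r(o) - \log r(o) - 1 \big]
  = \Big(1 - \tfrac{1}{r(o)}\Big)\, \nabla_\theta r(o)
  = \big(1 - r(o)\big)\, \nabla_\theta \log \pi_\theta(o),
\]
\[
\mathbb{E}_{o \sim \pi_\theta}\!\big[ (1 - r(o))\, \nabla_\theta \log \pi_\theta(o) \big]
  = -\sum_o \pi_{\mathrm{ref}}(o)\, \nabla_\theta \log \pi_\theta(o)
  = \nabla_\theta\, \mathrm{CrossEnt}(\pi_{\mathrm{ref}}, \pi_\theta),
\]

using E_{pi_theta}[grad_theta log pi_theta] = 0 to kill the first term. Since CrossEnt(pi_ref, pi_theta) = KL(pi_ref, pi_theta) + Entropy(pi_ref) and the entropy is constant in theta, following this gradient in expectation shrinks KL(pi_ref, pi_theta): the reversed direction.]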
Interestingly, if they were *not* using this estimator and were instead using the standard estimator, pi_ref would not affect the gradient at all!

More evidence that there's something odd about their approach. And maybe one reason they turned to Schulman's estimator.

5/
February 10, 2025 at 4:32 AM
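[The check behind the previous post is one line: under the same "differentiate the estimator, hold the samples fixed" convention, the standard single-sample estimate of KL(pi_theta, pi_ref) at a sample o is log pi_theta(o) - log pi_ref(o), and

\[
\nabla_\theta \big[ \log \pi_\theta(o) - \log \pi_{\mathrm{ref}}(o) \big] = \nabla_\theta \log \pi_\theta(o),
\]

so pi_ref drops out of the gradient entirely.]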
This means GRPO is not optimizing the usual objective. What objective *is* it optimizing? Well, it depends on the particular KL estimator they are using.

A few people have noticed that GRPO uses a non-standard KL estimator, from a blog post by Schulman.

4/
February 10, 2025 at 4:32 AM
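[For reference, the penalty GRPO adds is Schulman's "k3" estimator of KL(pi_theta, pi_ref), writing it per whole completion to keep notation light (the paper applies it per token):

\[
\widehat{\mathrm{KL}}(o) = \frac{\pi_{\mathrm{ref}}(o)}{\pi_\theta(o)} - \log \frac{\pi_{\mathrm{ref}}(o)}{\pi_\theta(o)} - 1 .
\]

With o ~ pi_theta this is an unbiased, always-nonnegative estimate of KL(pi_theta, pi_ref), which is part of what makes it appealing as a drop-in replacement for the naive log-ratio.]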
GRPO instead directly differentiates the KL estimator, evaluated on samples taken from pi_old (the LM before this update).

But the point of policy gradient is that you can't just "differentiate the estimator": you need to account for the gradient of the sampling process.

3/
February 10, 2025 at 4:32 AM
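[Spelling that requirement out: if f_theta(o) is the quantity you average over samples o ~ pi_theta, then

\[
\nabla_\theta\, \mathbb{E}_{o \sim \pi_\theta}\!\big[ f_\theta(o) \big]
  = \mathbb{E}_{o \sim \pi_\theta}\!\big[ \nabla_\theta f_\theta(o) \big]
  + \mathbb{E}_{o \sim \pi_\theta}\!\big[ f_\theta(o)\, \nabla_\theta \log \pi_\theta(o) \big].
\]

Differentiating only the estimator keeps the first term and silently drops the second, the score-function term that accounts for how the sampling distribution itself moves with theta.]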
RL for LMs often introduces a KL penalty term, to balance the "maximize reward" objective with an incentive to stay close to some reference model.

One way to implement this is to modify the reward, so that E[R~] = E[R] - KL term. Then you can apply standard RL (e.g. policy gradient).

2/
February 10, 2025 at 4:32 AM
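[Concretely, one standard version of that reward shaping, written at the whole-completion level with beta as the KL coefficient:

\[
\tilde{R}(o) = R(o) - \beta \log \frac{\pi_\theta(o)}{\pi_{\mathrm{ref}}(o)},
\qquad
\mathbb{E}_{o \sim \pi_\theta}\!\big[ \tilde{R}(o) \big]
  = \mathbb{E}_{o \sim \pi_\theta}\!\big[ R(o) \big] - \beta\, \mathrm{KL}(\pi_\theta, \pi_{\mathrm{ref}}).
\]

Any off-the-shelf policy-gradient method applied to R~ then optimizes the KL-penalized objective.]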
@xtimv.bsky.social and I were just discussing this interesting comment in the DeepSeek paper introducing GRPO: a different way of setting up the KL loss.

It's a little hard to reason about what this does to the objective. 1/
February 10, 2025 at 4:32 AM
Hm, I think the base LMs are quite interesting. From the DPO paper: sampling 128 completions from a base model, and then selecting the sample with highest reward under the RLHF reward model, performs similarly to actual RLHF.
December 30, 2024 at 3:49 PM
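[A minimal sketch of that best-of-n baseline, with hypothetical sample_fn / reward_fn stand-ins for the base model and the RLHF reward model rather than any particular library's API:

import random
from typing import Callable, List

def best_of_n(prompt: str,
              sample_fn: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 128) -> str:
    # Draw n completions from the base model and keep the one the reward
    # model scores highest (the "best-of-128" baseline described above).
    completions: List[str] = [sample_fn(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward_fn(prompt, c))

if __name__ == "__main__":
    # Toy stand-ins, just to make the sketch runnable: a "model" that emits
    # random numbers as strings and a "reward model" that prefers larger ones.
    print(best_of_n("example prompt",
                    sample_fn=lambda _prompt: str(random.random()),
                    reward_fn=lambda _prompt, completion: float(completion),
                    n=8))]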
If you're interested in a PhD at the intersection of machine learning and programming languages, consider applying to Yale CS!

We're exploring new approaches to building software that draws inferences and makes predictions. See alexlew.net for details & apply at gsas.yale.edu/admissions/ by Dec. 15
December 8, 2024 at 4:27 PM
Surprisal of title beginning with 'O'? 3.22
Surprisal of 'o' following 'Treatment '? 0.11
Surprisal that title includes surprisal of each title character? Priceless [...I did not know titles could do this]
November 21, 2024 at 4:06 PM
This is a very cool integration of LLMs + Bayesian methods.

LLMs serve as *likelihoods*: how likely would the human be to have issued this (English) command, given a particular (symbolic) plan? No generation, just scoring :)

A Bayesian agent can then resolve ambiguity in really sensible ways
November 19, 2024 at 7:19 PM
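[A rough sketch of the scoring-only pattern described above, with a hypothetical loglik_fn standing in for "ask the LLM how probable this command is given this plan"; the actual system's interfaces will differ:

import math
from typing import Callable, Dict, List

def plan_posterior(command: str,
                   plans: List[str],
                   prior: Dict[str, float],
                   loglik_fn: Callable[[str, str], float]) -> Dict[str, float]:
    # Bayes' rule over candidate symbolic plans. The LLM is only used to
    # *score* log p(command | plan); it never generates anything.
    log_post = {p: math.log(prior[p]) + loglik_fn(command, p) for p in plans}
    # Normalize with log-sum-exp for numerical stability.
    m = max(log_post.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
    return {p: math.exp(v - log_z) for p, v in log_post.items()}

if __name__ == "__main__":
    # Toy example: an ambiguous command and a fake likelihood that slightly
    # favors the first plan.
    plans = ["put mug in sink", "put mug in dishwasher"]
    prior = {p: 0.5 for p in plans}
    fake_loglik = lambda cmd, plan: -1.0 if "sink" in plan else -1.5
    print(plan_posterior("clean up the mug", plans, prior, fake_loglik))]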
Hi Bluesky! My claim to fame is the development of the Alexander Hamiltonian Monte Carlo algorithm.

Younger researchers may not realize due to Moore's Law (Lin-Manuel Miranda becomes roughly half as cool every two years), but back when this was published in 2021, it was considered mildly topical
November 18, 2024 at 7:11 PM