Alex Lew
@alexlew.bsky.social
Theory & practice of probabilistic programming. Current: MIT Probabilistic Computing Project; Fall '25: Incoming Asst. Prof. at Yale CS
When we differentiate their (Schulman's) estimator, pi_ref comes back into the objective, but in a new role.

Now, the objective has a CrossEnt(pi_ref, pi_theta) term. KL(P,Q) = CrossEnt(P,Q) - Entropy(P), so this is related to KL, but note the direction of KL is reversed.

6/
February 10, 2025 at 4:32 AM
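[A sketch of the calculation behind the claim above, in my own notation, treating pi_old as approximately pi_theta and the sampled completions o as fixed under differentiation. Write r(o) = pi_ref(o)/pi_theta(o) for the ratio appearing in the estimator. Then

\[
\nabla_\theta \big[ r(o) - \log r(o) - 1 \big]
  = \Big(1 - \tfrac{1}{r(o)}\Big)\, \nabla_\theta r(o)
  = \big(1 - r(o)\big)\, \nabla_\theta \log \pi_\theta(o),
\]
\[
\mathbb{E}_{o \sim \pi_\theta}\!\big[ (1 - r(o))\, \nabla_\theta \log \pi_\theta(o) \big]
  = -\sum_o \pi_{\mathrm{ref}}(o)\, \nabla_\theta \log \pi_\theta(o)
  = \nabla_\theta\, \mathrm{CrossEnt}(\pi_{\mathrm{ref}}, \pi_\theta),
\]

using E_{pi_theta}[grad_theta log pi_theta] = 0 to kill the first term. Since CrossEnt(pi_ref, pi_theta) = KL(pi_ref, pi_theta) + Entropy(pi_ref) and the entropy is constant in theta, following this gradient in expectation shrinks KL(pi_ref, pi_theta): the reversed direction.]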
Interestingly, if they were *not* using this estimator and were instead using the standard estimator, pi_ref would not affect the gradient at all!

More evidence that there's something odd about their approach. And maybe one reason they turned to Schulman's estimator.

5/
February 10, 2025 at 4:32 AM
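[The check behind the previous post is one line: under the same "differentiate the estimator, hold the samples fixed" convention, the standard single-sample estimate of KL(pi_theta, pi_ref) at a sample o is log pi_theta(o) - log pi_ref(o), and

\[
\nabla_\theta \big[ \log \pi_\theta(o) - \log \pi_{\mathrm{ref}}(o) \big] = \nabla_\theta \log \pi_\theta(o),
\]

so pi_ref drops out of the gradient entirely.]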
This means GRPO is not optimizing the usual objective. What objective *is* it optimizing? Well, it depends on the particular KL estimator they are using.

A few people have noticed that GRPO uses a non-standard KL estimator, from a blog post by Schulman.

4/
February 10, 2025 at 4:32 AM
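[For reference, the penalty GRPO adds is Schulman's "k3" estimator of KL(pi_theta, pi_ref), writing it per whole completion to keep notation light (the paper applies it per token):

\[
\widehat{\mathrm{KL}}(o) = \frac{\pi_{\mathrm{ref}}(o)}{\pi_\theta(o)} - \log \frac{\pi_{\mathrm{ref}}(o)}{\pi_\theta(o)} - 1 .
\]

With o ~ pi_theta this is an unbiased, always-nonnegative estimate of KL(pi_theta, pi_ref), which is part of what makes it appealing as a drop-in replacement for the naive log-ratio.]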
GRPO instead directly differentiates the KL estimator, evaluated on samples taken from pi_old (the LM before this update).

But the point of policy gradient is that you can't just "differentiate the estimator": you need to account for the gradient of the sampling process.

3/
February 10, 2025 at 4:32 AM
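[Spelling that requirement out: if f_theta(o) is the quantity you average over samples o ~ pi_theta, then

\[
\nabla_\theta\, \mathbb{E}_{o \sim \pi_\theta}\!\big[ f_\theta(o) \big]
  = \mathbb{E}_{o \sim \pi_\theta}\!\big[ \nabla_\theta f_\theta(o) \big]
  + \mathbb{E}_{o \sim \pi_\theta}\!\big[ f_\theta(o)\, \nabla_\theta \log \pi_\theta(o) \big].
\]

Differentiating only the estimator keeps the first term and silently drops the second, the score-function term that accounts for how the sampling distribution itself moves with theta.]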
RL for LMs often introduces a KL penalty term, to balance the "maximize reward" objective with an incentive to stay close to some reference model.

One way to implement this is to modify the reward, so that E[R~] = E[R] - KL term. Then you can apply standard RL (e.g. policy gradient).

2/
February 10, 2025 at 4:32 AM
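[Concretely, one standard version of that reward shaping, written at the whole-completion level with beta as the KL coefficient:

\[
\tilde{R}(o) = R(o) - \beta \log \frac{\pi_\theta(o)}{\pi_{\mathrm{ref}}(o)},
\qquad
\mathbb{E}_{o \sim \pi_\theta}\!\big[ \tilde{R}(o) \big]
  = \mathbb{E}_{o \sim \pi_\theta}\!\big[ R(o) \big] - \beta\, \mathrm{KL}(\pi_\theta, \pi_{\mathrm{ref}}).
\]

Any off-the-shelf policy-gradient method applied to R~ then optimizes the KL-penalized objective.]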
@xtimv.bsky.social and I were just discussing this interesting comment in the DeepSeek paper introducing GRPO: a different way of setting up the KL loss.

It's a little hard to reason about what this does to the objective. 1/
February 10, 2025 at 4:32 AM
Hm, I think the base LMs are quite interesting. From the DPO paper: sampling 128 completions from a base model, and then selecting the sample with highest reward under the RLHF reward model, performs similarly to actual RLHF.
December 30, 2024 at 3:49 PM
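[A minimal sketch of that best-of-n baseline, with hypothetical sample_fn / reward_fn stand-ins for the base model and the RLHF reward model rather than any particular library's API:

import random
from typing import Callable, List

def best_of_n(prompt: str,
              sample_fn: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 128) -> str:
    # Draw n completions from the base model and keep the one the reward
    # model scores highest (the "best-of-128" baseline described above).
    completions: List[str] = [sample_fn(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward_fn(prompt, c))

if __name__ == "__main__":
    # Toy stand-ins, just to make the sketch runnable: a "model" that emits
    # random numbers as strings and a "reward model" that prefers larger ones.
    print(best_of_n("example prompt",
                    sample_fn=lambda _prompt: str(random.random()),
                    reward_fn=lambda _prompt, completion: float(completion),
                    n=8))]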
If you're interested in a PhD at the intersection of machine learning and programming languages, consider applying to Yale CS!

We're exploring new approaches to building software that draws inferences and makes predictions. See alexlew.net for details & apply at gsas.yale.edu/admissions/ by Dec. 15
December 8, 2024 at 4:27 PM
Surprisal of title beginning with 'O'? 3.22
Surprisal of 'o' following 'Treatment '? 0.11
Surprisal that title includes surprisal of each title character? Priceless [...I did not know titles could do this]
November 21, 2024 at 4:06 PM
This is a very cool integration of LLMs + Bayesian methods.

LLMs serve as *likelihoods*: how likely would the human be to have issued this (English) command, given a particular (symbolic) plan? No generation, just scoring :)

A Bayesian agent can then resolve ambiguity in really sensible ways
November 19, 2024 at 7:19 PM
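[A rough sketch of the scoring-only pattern described above, with a hypothetical loglik_fn standing in for "ask the LLM how probable this command is given this plan"; the actual system's interfaces will differ:

import math
from typing import Callable, Dict, List

def plan_posterior(command: str,
                   plans: List[str],
                   prior: Dict[str, float],
                   loglik_fn: Callable[[str, str], float]) -> Dict[str, float]:
    # Bayes' rule over candidate symbolic plans. The LLM is only used to
    # *score* log p(command | plan); it never generates anything.
    log_post = {p: math.log(prior[p]) + loglik_fn(command, p) for p in plans}
    # Normalize with log-sum-exp for numerical stability.
    m = max(log_post.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
    return {p: math.exp(v - log_z) for p, v in log_post.items()}

if __name__ == "__main__":
    # Toy example: an ambiguous command and a fake likelihood that slightly
    # favors the first plan.
    plans = ["put mug in sink", "put mug in dishwasher"]
    prior = {p: 0.5 for p in plans}
    fake_loglik = lambda cmd, plan: -1.0 if "sink" in plan else -1.5
    print(plan_posterior("clean up the mug", plans, prior, fake_loglik))]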
Hi Bluesky! My claim to fame is the development of the Alexander Hamiltonian Monte Carlo algorithm.

Younger researchers may not realize due to Moore's Law (Lin-Manuel Miranda becomes roughly half as cool every two years), but back when this was published in 2021, it was considered mildly topical
November 18, 2024 at 7:11 PM