Now, the objective has a CrossEnt(pi_ref, pi_theta) term. KL(P,Q) = CrossEnt(P,Q) - Entropy(P), so this is related to KL, but note the direction of KL is reversed.
6/
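For concreteness, a quick numerical check of that identity, with toy distributions standing in for pi_ref and pi_theta (names illustrative):

```python
import numpy as np

# Toy discrete distributions standing in for pi_ref and pi_theta.
pi_ref   = np.array([0.5, 0.3, 0.2])
pi_theta = np.array([0.4, 0.4, 0.2])

cross_ent = -np.sum(pi_ref * np.log(pi_theta))           # CrossEnt(pi_ref, pi_theta)
entropy   = -np.sum(pi_ref * np.log(pi_ref))             # Entropy(pi_ref)
kl        =  np.sum(pi_ref * np.log(pi_ref / pi_theta))  # KL(pi_ref || pi_theta)

# Identity: KL(P, Q) = CrossEnt(P, Q) - Entropy(P).
assert np.isclose(kl, cross_ent - entropy)

# Direction matters: minimizing CrossEnt(pi_ref, pi_theta) over theta is
# equivalent to minimizing KL(pi_ref || pi_theta) -- the *reverse* of the
# KL(pi_theta || pi_ref) penalty usually written in RLHF objectives.
```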
More evidence that there's something odd about their approach. And maybe one reason they turned to Schulman's estimator.
5/
A few people have noticed that GRPO uses a non-standard KL estimator, from a blog post by Schulman.
4/
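For reference, that's the "k3" form from Schulman's "Approximating KL Divergence" post: with samples x ~ q and r = p(x)/q(x), k3 = (r - 1) - log r is unbiased for KL(q || p) and always nonnegative. GRPO's per-token KL term, pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, is exactly this shape. A toy comparison against the naive estimator k1 = -log r (the Gaussians here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate KL(q || p) from samples x ~ q; unit Gaussians just for illustration.
mu_q, mu_p = 0.0, 0.1
x = rng.normal(mu_q, 1.0, size=100_000)

# log r, where r = p(x)/q(x); the Gaussian normalizers cancel.
log_r = -0.5 * (x - mu_p) ** 2 + 0.5 * (x - mu_q) ** 2
r = np.exp(log_r)

k1 = -log_r              # naive estimator: unbiased, high variance, can go negative
k3 = (r - 1.0) - log_r   # Schulman's k3: also unbiased, always >= 0, lower variance

true_kl = 0.5 * (mu_q - mu_p) ** 2    # closed form here: 0.005
print(true_kl, k1.mean(), k3.mean())  # both means ~0.005
print(k1.std(), k3.std())             # k3's spread is far smaller
```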
But the point of policy gradient is that you can't just "differentiate the estimator": you need to account for the gradient of the sampling process.
3/
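A minimal illustration of the gap, under toy assumptions: for J(theta) = E_{x ~ N(theta, 1)}[x^2], termwise differentiation of the Monte Carlo estimate gives zero, while the score-function (REINFORCE) term recovers the true gradient 2*theta:

```python
import numpy as np

rng = np.random.default_rng(0)

# Objective: J(theta) = E_{x ~ N(theta, 1)}[x^2], with true gradient 2*theta.
theta = 1.5
x = rng.normal(theta, 1.0, size=1_000_000)
f = x ** 2

# "Differentiating the estimator": once the samples are drawn, mean(f) has no
# explicit theta dependence, so termwise differentiation gives 0. Wrong.
naive_grad = 0.0

# Score-function (REINFORCE) estimator: accounts for the gradient of the
# sampling process via grad_theta log N(x; theta, 1) = (x - theta).
score_grad = np.mean(f * (x - theta))

print(2 * theta, naive_grad, score_grad)  # 3.0, 0.0, ~3.0
```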
One way to implement this is to modify the reward, so that E[R~] = E[R] - KL term. Then you can apply standard RL (e.g. policy gradient).
2/
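A minimal sketch of that reward-shaping trick, assuming per-action log-probs under the current and reference policies are available (the names beta, logp_theta, logp_ref are illustrative):

```python
def kl_shaped_reward(reward, logp_theta, logp_ref, beta=0.1):
    """Fold the KL penalty into the reward: E[R~] = E[R] - beta * KL(pi_theta || pi_ref).

    Per sampled action, subtract beta * (log pi_theta(a|s) - log pi_ref(a|s));
    its expectation under pi_theta is exactly beta * KL(pi_theta || pi_ref),
    so standard policy gradient on R~ optimizes the KL-regularized objective.
    """
    return reward - beta * (logp_theta - logp_ref)
```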
It's a little hard to reason about what this does to the objective. 1/
We're exploring new approaches to building software that draws inferences and makes predictions. See alexlew.net for details & apply at gsas.yale.edu/admissions/ by Dec. 15
Surprisal of 'o' following 'Treatment '? 0.11
Surprisal that title includes surprisal of each title character? Priceless [...I did not know titles could do this]
LLMs serve as *likelihoods*: how likely would the human be to have issued this (English) command, given a particular (symbolic) plan? No generation, just scoring :)
A Bayesian agent can then resolve ambiguity in really sensible ways
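A minimal sketch of the scoring-only pattern, with an off-the-shelf HuggingFace causal LM standing in (the model choice and prompt format are placeholders, not the paper's actual setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def command_loglik(plan: str, command: str) -> float:
    """Log P(command | plan) under the LM -- pure scoring, no generation."""
    # Assumes the plan's tokenization is a prefix of the full tokenization.
    prefix = tok(plan, return_tensors="pt").input_ids
    full = tok(plan + " " + command, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    # Position t's logits predict token t+1; gather each next-token log-prob,
    # then keep only the command tokens (everything after the plan prefix).
    logprobs = logits.log_softmax(-1)[0, :-1].gather(1, full[0, 1:, None])
    return logprobs[prefix.shape[1] - 1:].sum().item()

# A Bayesian agent can then weigh candidate symbolic plans by how likely
# each makes the human's actual (English) command.
```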
Younger researchers may not realize due to Moore's Law (Lin-Manuel Miranda becomes roughly half as cool every two years), but back when this was published in 2021, it was considered mildly topical