1 - run MCMC / SMC / etc. targeting p(θ|D)
2 - run VI to obtain q(θ), then draw independent θs from q via simple Monte Carlo
It's not obvious that Method 1 always wins (for a given computational budget).
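A minimal sketch of the two routes on a toy 1D posterior, just to make the comparison concrete (the toy target, tuning constants, and function names below are mine, not from the post):

```python
# Sketch only: a toy unnormalized log-posterior and naive versions of both methods.
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    # Toy log p(theta | D): a standard Gumbel, skewed, with known mean ~0.577.
    return -(theta + np.exp(-theta))

# Method 1: random-walk Metropolis targeting p(theta | D) directly.
def method1_mcmc(n_steps, step=1.0):
    theta, chain = 0.0, []
    for _ in range(n_steps):
        prop = theta + step * rng.normal()
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop
        chain.append(theta)
    return np.array(chain)                       # correlated draws

# Method 2: fit q(theta) = N(mu, sigma^2) by crude stochastic gradient ascent
# on the ELBO, then draw independent thetas from q.
def method2_vi(n_draws, iters=2000, lr=0.02, mc=16):
    mu, log_sigma = 0.0, 0.0
    for _ in range(iters):
        eps = rng.normal(size=mc)
        sigma = np.exp(log_sigma)
        theta = mu + sigma * eps                 # reparameterized samples from q
        h = 1e-4                                 # finite-difference d/dtheta log_post
        dlogp = (log_post(theta + h) - log_post(theta - h)) / (2 * h)
        mu += lr * dlogp.mean()                  # pathwise ELBO gradient wrt mu
        log_sigma += lr * ((dlogp * eps * sigma).mean() + 1.0)   # +1 from entropy of q
    return mu + np.exp(log_sigma) * rng.normal(size=n_draws)     # independent draws

print("Method 1 (MCMC) mean: ", method1_mcmc(5000).mean())   # ~0.58, but autocorrelated
print("Method 2 (VI+MC) mean:", method2_vi(5000).mean())     # ~0.5: iid draws, but q is only approximate
```

Method 1 spends the budget on correlated but asymptotically exact draws; Method 2 spends it on fitting q and then gets cheap independent draws from an approximation. Which trade wins depends on the model and the budget.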
Check out the GenLM control library: github.com/genlm/genlm-...
GenLM supports not only grammars, but arbitrary programmable constraints from type systems to simulators.
If you can write a Python function, you can control your language model!
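To make "arbitrary programmable constraints" concrete: the snippet below is purely illustrative and is not the genlm-control API; it just shows the kind of plain Python predicate ("does this candidate continuation satisfy my rules?") that such a controller can be driven by.

```python
# Illustrative only: NOT the genlm-control API. Just an example of a
# constraint expressed as an ordinary Python function over candidate text.
import json

def constraint(candidate_text: str) -> bool:
    """Accept a completion only if it parses as JSON and its top-level
    object has a numeric "score" field."""
    try:
        obj = json.loads(candidate_text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and isinstance(obj.get("score"), (int, float))

print(constraint('{"score": 0.9}'))     # True
print(constraint('{"score": "high"}'))  # False
```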
Julia: github.com/probcomp/ADE...
Haskell: github.com/probcomp/ade...
The (less pedagogical, more performant) JAX implementation is still under active development, led by McCoy Becker.
- Alexandra Silva has great work on semantics, static analysis, and verification of probabilistic & non-deterministic programs
- Annabelle McIver does too
- Nada Amin has cool recent papers on PPL semantics
Curious what people think we should make of this!
8/8
This penalty says: try not to lose any of the behaviors present in the pretrained model.
Which is a bit strange as a fine-tuning objective.
7/
Now, the objective has a CrossEnt(pi_ref, pi_theta) term. Since KL(P, Q) = CrossEnt(P, Q) - Entropy(P), and Entropy(pi_ref) is a constant, this amounts to a KL(pi_ref, pi_theta) penalty; note the direction of the KL is reversed relative to the usual KL(pi_theta, pi_ref).
6/
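A quick numeric check of that identity and of why the direction matters (toy 3-outcome distributions of my own choosing):

```python
# CrossEnt(P, Q) = KL(P || Q) + Entropy(P). Since Entropy(pi_ref) is a constant
# in theta, penalizing CrossEnt(pi_ref, pi_theta) is penalizing the *forward*
# KL(pi_ref || pi_theta), which blows up when pi_theta drops mass that pi_ref has.
import numpy as np

def cross_ent(p, q): return -(p * np.log(q)).sum()
def kl(p, q):        return (p * np.log(p / q)).sum()
def entropy(p):      return -(p * np.log(p)).sum()

pi_ref   = np.array([0.5, 0.4, 0.1])
pi_theta = np.array([0.6, 0.39, 0.01])     # fine-tuned model nearly drops the third behavior

print(np.isclose(cross_ent(pi_ref, pi_theta),
                 kl(pi_ref, pi_theta) + entropy(pi_ref)))   # True: the identity
print(kl(pi_ref, pi_theta))    # ~0.15  forward KL: punished for the lost behavior
print(kl(pi_theta, pi_ref))    # ~0.08  reverse KL (the usual penalty direction)
```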
More evidence that there's something odd about their approach. And maybe one reason they turned to Schulman's estimator.
5/
A few people have noticed that GRPO uses a non-standard KL estimator, taken from a blog post by John Schulman.
4/
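For reference, the estimator in question is the one Schulman calls k3 in his "Approximating KL Divergence" post: with samples x ~ q and density ratio r = p(x)/q(x), estimate KL(q || p) by (r - 1) - log r. It is unbiased, nonnegative per sample, and typically lower variance than the naive -log r. A small check (toy Gaussians, my own setup):

```python
# Compare the naive KL(q || p) estimator (-log r) with Schulman's k3 estimator
# ((r - 1) - log r) on two Gaussians, where the exact KL is known.
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

mu_q, mu_p, sigma = 0.0, 0.5, 1.0
x = rng.normal(mu_q, sigma, size=100_000)                        # x ~ q
log_r = log_gauss(x, mu_p, sigma) - log_gauss(x, mu_q, sigma)    # log p(x)/q(x)
r = np.exp(log_r)

k1 = -log_r              # naive: unbiased, but can go negative and is high variance
k3 = (r - 1) - log_r     # Schulman's estimator: unbiased, always >= 0, lower variance

exact = (mu_p - mu_q) ** 2 / (2 * sigma ** 2)
print(exact, k1.mean(), k3.mean())   # all ~0.125
print(k1.std(), k3.std())            # k3's spread is noticeably smaller
```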
But the point of policy gradient is that you can't just "differentiate the estimator": you need to account for the gradient of the sampling process.
3/
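A toy version of that point (a 2-armed bandit, all names mine): the plain Monte Carlo estimate of E[R] doesn't even mention theta, so "differentiating the estimator" gives zero; the score-function (REINFORCE) identity, grad E[R] = E[R(x) * grad log pi_theta(x)], restores the missing dependence on how x was sampled.

```python
# Gradient of E_{x ~ pi_theta}[R(x)] for pi_theta = Bernoulli(sigmoid(theta)).
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p = 1 / (1 + np.exp(-theta))               # probability of choosing arm 1
R = np.array([0.0, 1.0])                   # reward of arm 0 and arm 1

x = rng.binomial(1, p, size=200_000)       # actions sampled from pi_theta
rewards = R[x]

naive_grad = 0.0                           # d/dtheta of rewards.mean(): theta appears nowhere
score = np.where(x == 1, 1 - p, -p)        # d/dtheta log pi_theta(x) for this parameterization
reinforce_grad = (rewards * score).mean()  # score-function estimator

exact_grad = (R[1] - R[0]) * p * (1 - p)   # d/dtheta [p*R1 + (1-p)*R0]
print(exact_grad, reinforce_grad, naive_grad)   # ~0.245, ~0.245, 0.0
```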
One way to implement this is to modify the reward, so that E[R~] = E[R] - (the KL term). Then you can apply standard RL (e.g. policy gradient).
2/
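Spelled out in my own notation: define a shaped reward R~(x) = R(x) - beta * (log pi_theta(x) - log pi_ref(x)); then E_{pi_theta}[R~] = E[R] - beta * KL(pi_theta || pi_ref), so any off-the-shelf RL algorithm maximizing E[R~] is implicitly handling the KL penalty.

```python
# Check that folding the KL penalty into the reward gives the intended objective,
# on a toy distribution over 4 possible completions (all numbers made up).
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1

pi_theta = np.array([0.4, 0.3, 0.2, 0.1])
pi_ref   = np.array([0.25, 0.25, 0.25, 0.25])
R        = np.array([1.0, 0.0, 0.5, 0.2])

x = rng.choice(4, p=pi_theta, size=200_000)                        # completions ~ pi_theta
r_tilde = R[x] - beta * (np.log(pi_theta[x]) - np.log(pi_ref[x]))  # shaped reward

kl = (pi_theta * np.log(pi_theta / pi_ref)).sum()
print(r_tilde.mean())                            # Monte Carlo E[R~]
print((pi_theta * R).sum() - beta * kl)          # exact E[R] - beta*KL: ~0.51, matches
```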
“If this is our conversation so far, what word would an assistant probably say next?”