Alex Lew
@alexlew.bsky.social
Theory & practice of probabilistic programming. Current: MIT Probabilistic Computing Project; Fall '25: Incoming Asst. Prof. at Yale CS
As a way of evaluating the model (rather than the variational family or inference method)
July 6, 2025 at 11:03 AM
By contrast, of course, a good VI method should give good posteriors. If better posteriors give worse predictions, it's the model's fault, not the VI method's
July 6, 2025 at 10:45 AM
Yes, I agree... It's a very (non-Bayesian) ML-inflected way of looking at things, where the whole game is to maximize predictive accuracy on new data, and the training data D is just instrumental to this goal.
July 6, 2025 at 10:45 AM
Right, I think if you're trying to estimate P*, it's okay that Method 2 is inconsistent. The way you "spend more compute" in Method 2 is by choosing a bigger/more expressive variational family q, not taking more samples from a fixed variational family. So consistency isn't quite the right property
July 6, 2025 at 10:30 AM
Oh -- I agree it doesn't make sense to choose q1 or q2 based on the quantity! Or at least, if you do that, you're just fitting the model q(θ)p(D'|θ) to the held-out data D', not doing posterior inference anymore.
July 6, 2025 at 10:27 AM
(Though even if you use Method 2, it is probably better to use q as a proposal within IS, rather than using it directly as a substitute for the posterior)
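For concreteness, here's a minimal sketch of that idea: self-normalized importance sampling with q as the proposal and the unnormalized joint p(θ)p(D|θ) as the target. The callables are hypothetical stand-ins, not from any particular library.

```python
# Self-normalized importance sampling with the fitted variational q as proposal.
# Hypothetical stand-ins: sample_q(rng) draws theta ~ q, log_q(theta) evaluates
# log q(theta), and log_joint(theta) is log p(theta) + log p(D | theta)
# (known only up to the normalizing constant p(D)).
import numpy as np

def snis_expectation(sample_q, log_q, log_joint, f, n=1000, seed=0):
    rng = np.random.default_rng(seed)
    thetas = [sample_q(rng) for _ in range(n)]
    log_w = np.array([log_joint(t) - log_q(t) for t in thetas])  # unnormalized log weights
    w = np.exp(log_w - log_w.max())                              # stabilize before normalizing
    w /= w.sum()
    return sum(wi * f(t) for wi, t in zip(w, thetas))            # approx. E_{p(theta|D)}[f(theta)]
```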
July 5, 2025 at 11:38 AM
We can't sample p(θ|D) exactly, so consider two procedures for sampling θ from distributions _close_ to p(θ|D):
1 - run MCMC / SMC / etc. targeting p(θ|D)
2 - run VI to obtain q(θ), then draw independent θs from q via simple MC
Not obvious that Method 1 always wins (for a given computational budget)
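A toy illustration of the two procedures (my own sketch, not from the post) on a 1D unnormalized log posterior: Method 1 runs a random-walk Metropolis chain, Method 2 fits a Gaussian q by stochastic ELBO ascent and then samples it i.i.d.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(th):
    # Toy unnormalized log p(theta | D): a Gaussian with mean 2, variance 0.5
    return -0.5 * (th - 2.0) ** 2 / 0.5

# Method 1: random-walk Metropolis targeting p(theta | D)
def method1_mcmc(n_steps=5000, step=0.5):
    th, samples = 0.0, []
    for _ in range(n_steps):
        prop = th + step * rng.normal()
        if np.log(rng.uniform()) < log_post(prop) - log_post(th):
            th = prop
        samples.append(th)
    return np.array(samples)

# Method 2: fit q(theta) = N(mu, exp(log_sig)^2) by single-sample stochastic
# ELBO ascent (reparameterization trick), then draw independent samples from q
def method2_vi(n_iters=2000, lr=0.01, n_draws=5000):
    mu, log_sig = 0.0, 0.0
    for _ in range(n_iters):
        eps = rng.normal()
        sig = np.exp(log_sig)
        th = mu + sig * eps
        h = 1e-4                                    # finite-difference grad of log_post
        dlp = (log_post(th + h) - log_post(th - h)) / (2 * h)
        mu += lr * dlp                              # d ELBO / d mu (single-sample estimate)
        log_sig += lr * (dlp * sig * eps + 1.0)     # d ELBO / d log_sig (+1 from the entropy term)
    return mu + np.exp(log_sig) * rng.normal(size=n_draws)
```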
July 5, 2025 at 11:36 AM
Reposted by Alex Lew
Want to use AWRS SMC?

Check out the GenLM control library: github.com/genlm/genlm-...

GenLM supports not only grammars, but arbitrary programmable constraints from type systems to simulators.

If you can write a Python function, you can control your language model!
May 13, 2025 at 2:22 PM
I'd love to see these ideas migrated to Gen.jl! But there are some technical questions about how best to make that work.
February 25, 2025 at 4:26 PM
Hi, thanks for your interest! The pedagogical implementations can be found at:

Julia: github.com/probcomp/ADE...

Haskell: github.com/probcomp/ade...

The (less pedagogical, more performant) JAX implementation is still under active development, led by McCoy Becker.
February 25, 2025 at 4:26 PM
- Daphne Koller had arguably the first PPL paper about inference-in-Bayesian-models-cast-as-programs
- Alexandra Silva has great work on semantics, static analysis, and verification of probabilistic & non-deterministic programs
- Annabelle McIver does too
- Nada Amin has cool recent papers on PPL semantics
February 10, 2025 at 7:00 AM
DeepSeek's implementation isn't public, and maybe I'm misinterpreting their paper. But the TRL reimplementation does appear to follow this logic. github.com/huggingface/...

Curious what people think we should make of this!

8/8
trl/trl/trainer/grpo_trainer.py at 55e680e142d88e090dcbf5a469eab1ebba28ddef · huggingface/trl
February 10, 2025 at 4:32 AM
The usual KL penalty says: try not to wander outside the realm of sensible generations [as judged by our pretrained model].

This penalty says: try not to lose any of the behaviors present in the pretrained model.

Which is a bit strange as a fine-tuning objective.

7/
February 10, 2025 at 4:32 AM
When we differentiate their (Schulman's) estimator, pi_ref comes back into the objective, but in a new role.

Now, the objective has a CrossEnt(pi_ref, pi_theta) term. KL(P,Q) = CrossEnt(P,Q) - Entropy(P), so this is related to KL, but note the direction of KL is reversed.

6/
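Spelling that out (my reconstruction of the calculation, not a quote from the thread): writing the per-sample estimator as r - log r - 1 with r = pi_ref(x)/pi_theta(x), and holding the sample x fixed,

```latex
\[
\nabla_\theta\!\left[\frac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)}
  - \log\frac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)} - 1\right]
  = \Bigl(1 - \tfrac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)}\Bigr)\,\nabla_\theta \log \pi_\theta(x).
\]
% Averaging over x ~ pi_theta (taking pi_old = pi_theta for simplicity):
\[
\mathbb{E}_{x\sim\pi_\theta}\!\left[\tfrac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)}\,
  \nabla_\theta \log \pi_\theta(x)\right]
  = \sum_x \pi_{\mathrm{ref}}(x)\,\nabla_\theta \log \pi_\theta(x)
  = -\nabla_\theta\,\mathrm{CrossEnt}(\pi_{\mathrm{ref}}, \pi_\theta).
\]
```

The remaining E[grad log pi_theta(x)] term vanishes in expectation, so the penalty's average gradient equals the gradient of CrossEnt(pi_ref, pi_theta), which is the gradient of KL(pi_ref, pi_theta) since Entropy(pi_ref) doesn't depend on theta: the reversed direction.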
February 10, 2025 at 4:32 AM
Interestingly, if they were *not* using this estimator, and instead using the standard estimator, pi_ref would not affect the gradient at all!

More evidence that there's something odd about their approach. And maybe one reason they turned to Schulman's estimator.

5/
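For comparison (again my gloss, not a quote): the standard single-sample estimator of KL(pi_theta, pi_ref) is log pi_theta(x) - log pi_ref(x), and with the sample held fixed,

```latex
\[
\nabla_\theta\!\left[\log \pi_\theta(x) - \log \pi_{\mathrm{ref}}(x)\right]
  = \nabla_\theta \log \pi_\theta(x),
\]
```

which contains no pi_ref term at all.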
February 10, 2025 at 4:32 AM
This means GRPO is not optimizing the usual objective. What objective *is* it optimizing? Well, it depends on the particular KL estimator they are using.

A few people have noticed that GRPO uses a non-standard KL estimator, from a blog post by Schulman.

4/
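For reference, the estimator in question (Schulman's "k3", which is also how the KL term is written in the GRPO paper), per sampled completion o:

```latex
\[
\hat{D}_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)
  = \frac{\pi_{\mathrm{ref}}(o)}{\pi_\theta(o)}
    - \log\frac{\pi_{\mathrm{ref}}(o)}{\pi_\theta(o)} - 1 .
\]
```

It is nonnegative for every sample, and unbiased for KL(pi_theta, pi_ref) when o is drawn from pi_theta itself.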
February 10, 2025 at 4:32 AM
GRPO instead directly differentiates the KL estimator, evaluated on samples taken from pi_old (the LM before this update).

But the point of policy gradient is that you can't just "differentiate the estimator": you need to account for the gradient of the sampling process.

3/
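The missing piece, written out (the standard score-function identity, nothing GRPO-specific): for an objective that is an expectation under the distribution being optimized,

```latex
\[
\nabla_\theta\, \mathbb{E}_{x \sim \pi_\theta}\!\left[f_\theta(x)\right]
  = \mathbb{E}_{x \sim \pi_\theta}\!\left[\nabla_\theta f_\theta(x)
      + f_\theta(x)\, \nabla_\theta \log \pi_\theta(x)\right].
\]
```

Differentiating the estimator with the samples held fixed keeps only the first term and drops the score-function term, which is exactly the part that accounts for how the sampling distribution moves with theta.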
February 10, 2025 at 4:32 AM
RL for LMs often introduces a KL penalty term, to balance the "maximize reward" objective with an incentive to stay close to some reference model.

One way to implement this is to modify the reward, so that E[R~] = E[R] - KL term. Then you can apply standard RL (e.g., policy gradient).

2/
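Concretely (my paraphrase of the standard construction, not a quote from the post): per sample,

```latex
\[
\tilde{R}(x) = R(x) - \beta \log\frac{\pi_\theta(x)}{\pi_{\mathrm{ref}}(x)},
\qquad
\mathbb{E}_{x\sim\pi_\theta}\bigl[\tilde{R}(x)\bigr]
  = \mathbb{E}_{x\sim\pi_\theta}\bigl[R(x)\bigr]
    - \beta\,\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr),
\]
```

so maximizing E[R~] with any standard policy-gradient estimator optimizes the KL-penalized objective.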
February 10, 2025 at 4:32 AM
Yeah — would be interesting to know if the pattern holds for today’s larger models! (This paper was done 1.5 years ago, in academia, using open models)
December 30, 2024 at 4:04 PM
Furthermore, regardless of training procedure, the models are still autoregressive probabilistic sequence models, so they can be understood as optimal “autocompleters” for *some* data distribution.

“If this is our conversation so far, what word would an assistant probably say next?”
December 30, 2024 at 3:52 PM
Hm, I think the base LMs are quite interesting. From the DPO paper: sampling 128 completions from a base model, and then selecting the sample with highest reward under the RLHF reward model, performs similarly to actual RLHF.
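A schematic of the best-of-n procedure being described (the sampler and reward model below are hypothetical stand-ins, not the DPO paper's code):

```python
# Best-of-n sampling: draw n completions from the base LM and keep the one the
# RLHF reward model scores highest. `sample_completion` and `reward` are
# hypothetical placeholders for a base-model sampler and a reward model.
def best_of_n(prompt, sample_completion, reward, n=128):
    candidates = [sample_completion(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```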
December 30, 2024 at 3:49 PM