Rosmine
@rosmineb.bsky.social
Senior ML Scientist @ FAANG working on LLMs
DM me ML questions
Overview of GRPO (Group Relative Policy Optimization)

GRPO is an improvement on PPO introduced in the DeepSeekMath paper.

The motivation is that PPO requires four large models: a policy, a value function, a reward model, and a reference model. GRPO removes the need for the value model by using the average reward of a group of outputs sampled for the same prompt as the baseline.
January 10, 2025 at 8:06 PM
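A minimal sketch of the group-relative idea from the GRPO post above, assuming the outcome-reward formulation in which each sampled completion's advantage is its reward standardized against the other completions for the same prompt. The function name, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not details from the post.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of completions for a single prompt.

    The group mean plays the role of the baseline that a learned value
    model would otherwise provide in PPO.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    # eps (hypothetical default) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to supply a baseline.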
- The performance gain from RLAIF relative to SFT depends on the base model. E.g., for Llama models, SFT is much more effective than RLAIF (see graph).
December 2, 2024 at 5:18 PM
All I did was post paper summaries, I guess people don’t like my taste in papers
November 28, 2024 at 7:04 PM
I doubt he's real; the pinned tweet makes me think it's a parody (but mostly I'm suspicious because he followed me lol)
November 27, 2024 at 2:06 PM
Training Verifiers to Solve Math Word Problems (2021)
- This paper introduced GSM8K, and showed how using verifiers can significantly improve performance (up to 20+ percentage points compared to finetuning, see graph below)
November 25, 2024 at 7:27 PM
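A rough sketch of how a verifier is used at test time in the setup summarized above: sample several candidate solutions, score each with the verifier, and keep the highest-scoring one. The `best_of_n` name and the toy score table are hypothetical stand-ins; in the paper the verifier is a trained model, not a lookup.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate solution the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=verifier)


if __name__ == "__main__":
    # Hypothetical verifier scores (estimated probability a solution is correct)
    toy_scores = {
        "solution A": 0.12,
        "solution B": 0.87,
        "solution C": 0.45,
    }
    chosen = best_of_n(list(toy_scores), lambda s: toy_scores[s])
    print(chosen)
```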
I made a project to play Dance Dance Revolution without a dance pad, instead using 2 high-speed cameras and running each frame through a shallow convnet to classify the steps.

The tape on the floor is so I know where to step. The stools are so I don't accidentally step on the cameras.
November 23, 2024 at 6:16 PM