DM me ml questions
GRPO, introduced in the DeepSeekMath paper, is an improvement on PPO
The motivation is that PPO requires 4 large models: a policy, a value function, a reward model, and a reference model. GRPO removes the need for the value model.
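Rough sketch of how GRPO gets by without a value model: sample a group of completions for the same prompt, score each one, and use the group's own reward statistics as the baseline. The function name, the `eps` term, and the choice of population standard deviation are my assumptions for illustration, not code from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scalar rewards for completions sampled from the SAME prompt.
    eps: assumed small constant to avoid dividing by zero when every
         completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions for one prompt, reward 1.0 if correct else 0.0
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

The group mean plays the role the learned value function plays in PPO, so completions that beat their siblings get a positive advantage and the rest get a negative one.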
- This paper introduced GSM8K, and showed how using verifiers can significantly improve performance (up to 20+ percentage points compared to finetuning, see graph below)
The tape on the floor is so I know where to step. The stools are so I don't accidentally step on cameras