DM me ML questions
GRPO, introduced in the DeepSeekMath paper, is an improvement on PPO
The motivation is that PPO requires four large models: a policy, a value function, a reward model, and a reference model. GRPO removes the need for the value model.
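The core trick, as a minimal sketch (assuming PyTorch; the function name is mine, not from any library): instead of a learned value baseline, GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's mean and std.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per sampled completion for
    # the same prompt. Group statistics replace PPO's value network as
    # the baseline for the advantage.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 4 completions sampled for one prompt:
adv = grpo_advantages(torch.tensor([0.0, 1.0, 1.0, 0.0]))
# above-average completions get positive advantage, below-average negative
```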
At first it sounds dumb, but you could leverage GPU non-determinism to make it truly random, not just pseudorandom
There are better ways to do RNG, so I still think it's a bad idea, but a cool bad idea
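Where the entropy would come from, as a toy sketch (assumes a CUDA device; this is illustrative, not a usable RNG): atomic float additions can run in a different order each launch, so the rounding error of a big reduction varies run to run.

```python
import torch

# scatter_add_ on CUDA uses atomics, so the accumulation order (and
# therefore the float rounding) can differ across calls on the same data.
x = torch.randn(1_000_000, device="cuda")
idx = torch.zeros(1_000_000, dtype=torch.long, device="cuda")

def nondeterministic_sum() -> float:
    out = torch.zeros(1, device="cuda")
    out.scatter_add_(0, idx, x)
    return out.item()

# Two reductions over identical data can disagree in the low bits;
# that disagreement is (weak, slow, biased) physical entropy.
print(nondeterministic_sum() != nondeterministic_sum())
```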
Investigated the paradigm of modifying model behavior by first doing SFT on data from a teacher model, then following up with RLAIF training against a teacher reward model
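The shape of that pipeline, as a toy runnable sketch (every component here is a trivial stand-in for the real LLMs and trainers; all names are hypothetical, not from the paper):

```python
class Teacher:
    def generate(self, prompt):           # stand-in for the teacher LLM
        return f"{prompt} -> teacher-style answer"

class TeacherRM:
    def score(self, prompt, response):    # stand-in for the teacher reward model
        return 1.0 if "teacher-style" in response else 0.0

class Student:
    def finetune(self, pairs):            # stand-in for an SFT trainer
        self.pairs = pairs
    def generate(self, prompt):
        return self.pairs[0][1]           # toy: parrots the teacher data
    def rl_update(self, prompt, response, reward):
        pass                              # stand-in for a PPO/RLAIF step

teacher, rm, student = Teacher(), TeacherRM(), Student()
prompts = ["q1", "q2"]

# Stage 1: SFT on completions generated by the teacher.
student.finetune([(p, teacher.generate(p)) for p in prompts])

# Stage 2: RLAIF; the teacher's reward model scores the student's own
# samples, and RL optimizes the student against that scalar signal.
for p in prompts:
    y = student.generate(p)
    student.rl_update(p, y, reward=rm.score(p, y))
```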
They found:
This paper introduces LaTent Reasoning Optimization (LaTRO), a training framework that:
- Improves zero-shot accuracy on GSM8K by 12.5% over base models
- Doesn't use external feedback or reward models
...
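The self-rewarding idea, as I read it (my sketch, not the authors' code; assumes an HF-style causal LM with batched token ids): sample a rationale from the model, then score it by the model's own log-likelihood of the gold answer given that rationale, so the policy acts as its own reward model.

```python
import torch

def self_reward(model, question_ids, rationale_ids, answer_ids):
    # Reward = log p(answer | question, rationale) under the model itself;
    # no external reward model is involved.
    input_ids = torch.cat([question_ids, rationale_ids, answer_ids], dim=-1)
    logits = model(input_ids).logits          # HF-style causal LM output
    ans_len = answer_ids.shape[-1]
    # Logits that predict each answer token (shifted left by one).
    ans_logits = logits[..., -ans_len - 1:-1, :]
    logprobs = torch.log_softmax(ans_logits, dim=-1)
    tok_lp = logprobs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return tok_lp.sum(-1)  # higher = rationale better supports the answer
```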
I cannot read 50,000 papers per day
- This paper introduced GSM8K and showed how using verifiers can significantly improve performance (up to 20+ percentage points compared to finetuning; see graph below)
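The verifier recipe, schematically (my sketch; `generate` and `verify` are hypothetical stand-ins for the paper's sampler and trained verifier):

```python
def best_of_n(generate, verify, question, n=100):
    # Sample n candidate solutions, score each with the verifier, and
    # return the one the verifier rates most likely to be correct.
    # The gains come from this sample-then-rerank step.
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda s: verify(question, s))
```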
How it works:
1. Input a few papers
2. Input a description of the subfield. Could be generic like "optimizers" or highly specific like "improvements on LoRA"
🧵
e.g. Adafactor includes
- a new low-rank matrix approximation algorithm (used for the second moment)
- detecting when Adam second moment is out of date
- better beta_2 schedules
- analysis of model training stability
arxiv.org/pdf/1804.04235
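The low-rank trick, concretely (a minimal sketch of the factored second moment, not the paper's code): store only the row and column sums of the squared-gradient statistics and reconstruct the full matrix as a normalized outer product.

```python
import numpy as np

def factored_second_moment(v: np.ndarray) -> np.ndarray:
    # Approximate an n x m second-moment matrix V by its row sums r and
    # column sums c:  V ~= outer(r, c) / sum(V).
    # Memory drops from O(n*m) to O(n + m).
    r = v.sum(axis=1)
    c = v.sum(axis=0)
    return np.outer(r, c) / v.sum()

# The approximation is exact when V is rank-1:
V = np.outer([1.0, 2.0], [3.0, 4.0])
assert np.allclose(factored_second_moment(V), V)
```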
I looked into it more and turns out it's a lot harder than I thought because...