Rosmine
@rosmineb.bsky.social
Senior ML Scientist @ FAANG working on LLMs
DM me ml questions
One callout: [2] found similar performance between GRPO, RLOO, and REINFORCE++, but that GRPO was more prone to reward hacking, and that PPO with critic pretraining outperforms GRPO.
January 10, 2025 at 8:07 PM
The drawback of GRPO is that it requires generating many responses for the same prompt, so if you were previously generating only a few responses per prompt, GRPO may increase computation time.
January 10, 2025 at 8:06 PM
Additionally, it moves the KL penalty into the loss function (RLHF usually adds the KL penalty to the rewards), which simplifies the computation of the advantage.
January 10, 2025 at 8:06 PM
To do this, it starts by generating several responses for each query. Then, when computing the advantage, it replaces the value function with the reward of the sample, normalized by the mean and std of rewards across all responses to the same query.
January 10, 2025 at 8:06 PM
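A minimal sketch of the advantage computation described in the posts above, assuming a group of sampled responses to one query (names and the beta value are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: no learned value function, just the
    reward of each response normalized by the group mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 responses sampled for the same prompt
adv = grpo_advantages([0.1, 0.9, 0.4, 0.6])

# The KL penalty to the reference policy is then added to the loss as a
# separate term (loss = pg_loss + beta * kl) instead of being folded into
# the rewards, so it never touches the advantage above.
```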
- Performance improvement from RLAIF vs. SFT depends on the base model. E.g., for Llama models, SFT is much more effective than RLAIF (see graph)
December 2, 2024 at 5:18 PM
- Usually people use SFT data generated by GPT-3.5 and RLAIF feedback from GPT-4. If you use a higher-quality model to generate the SFT data, then RLAIF is usually less effective than SFT
December 2, 2024 at 5:18 PM
- For RLAIF to be valuable, you need (1) a sufficiently strong pretrained base model and (2) a capability mismatch between the teacher used for SFT data collection and the critic used for collecting AI feedback
December 2, 2024 at 5:18 PM
ha ha nothing that funny, best was "AI/LLM Nerds (derogatory)"
November 29, 2024 at 12:14 AM
All I did was post paper summaries, I guess people don’t like my taste in papers
November 28, 2024 at 7:04 PM
(most other approaches just do reasoning through prompting, or require the dataset to have reasoning included)
- Improvement varies a lot by dataset: there's a huge improvement on GSM8K, but the ARC improvement is at most 1.6%

arxiv.org/abs/2411.04282
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can im...
arxiv.org
November 27, 2024 at 4:13 PM
- The paper shows how to use a variational lower bound that explicitly includes the probability of the reasoning, then uses RL (REINFORCE Leave-One-Out) to optimize the reasoner. Basically, this gives a good way to train the model on reasoning without specialized training data
November 27, 2024 at 4:13 PM
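A rough sketch of the REINFORCE Leave-One-Out baseline mentioned above (my own illustrative code, not from the paper): each sampled reasoning chain is scored against the mean reward of the other samples for the same query.

```python
import numpy as np

def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean
    reward of the other k-1 samples drawn for the same query."""
    r = np.asarray(rewards, dtype=np.float64)
    k = len(r)
    loo_baseline = (r.sum() - r) / (k - 1)  # mean of the other samples
    return r - loo_baseline

# Example: rewards (e.g. likelihood of the correct answer given each
# sampled rationale) for 4 rationales sampled for one question
print(rloo_advantages([0.2, 0.8, 0.5, 0.1]))
```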
Nice to meet you, Mr. Cow 😉
November 27, 2024 at 3:10 PM
I'm suspicious he's real, the pinned tweet makes me think it's a parody (but mostly I'm suspicious because he followed me lol)
November 27, 2024 at 2:06 PM
Fingers crossed this means they're reallocating all compute towards a shiny new model
November 26, 2024 at 8:21 PM
(I'm doing a lit review of reasoning for LLMs. Aiming to post 1-2 paper summaries per day. This is the first in the series)
arxiv.org/pdf/2110.14168
arxiv.org
November 25, 2024 at 7:28 PM
- Found that training for more than 2 epochs on GSM8K caused performance degradation, because the verifiers require high diversity of samples, and too many epochs cause diversity to collapse
November 25, 2024 at 7:28 PM
- Using a verifier, generating too many solutions, and picking the best actually hurts performance: it increases the probability of finding adversarial solutions that fool the verifier
November 25, 2024 at 7:28 PM
- To use verifiers, they train a model to score whether a solution is correct. They then generate many high-temperature solutions and use the verifier to select the solution most likely to be correct.
November 25, 2024 at 7:27 PM
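A toy sketch of the verifier-based selection described above (the sampler and verifier here are placeholders I made up, not the paper's actual models):

```python
def best_of_n(prompt, generate, verify, n=100):
    """Sample n high-temperature solutions and return the one the
    verifier scores as most likely to be correct."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda sol: verify(prompt, sol))

# generate(prompt) -> str is a placeholder high-temperature sampler;
# verify(prompt, solution) -> float is a placeholder correctness scorer.
```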
- GSM8K has 8.5K high-quality grade school math word problems. Problems take 2-8 steps to solve, using +, -, ×, /, with a focus on diversity of problems.
November 25, 2024 at 7:27 PM