#GRPO
Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar: GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO https://arxiv.org/abs/2601.06767 https://arxiv.org/pdf/2601.06767 https://arxiv.org/html/2601.06767
January 13, 2026 at 6:30 AM
Besher Hassan, Xiuying Chen: GRASP LoRA: GRPO Guided Adapter Sparsity Policy for Cross Lingual Transfer https://arxiv.org/abs/2601.06702 https://arxiv.org/pdf/2601.06702 https://arxiv.org/html/2601.06702
January 13, 2026 at 6:30 AM
Dimple Vijay Kochar, Nathaniel Pinckney, Guan-Ting Liu, Chia-Tung Ho, Chenhui Deng, Haoxing Ren, Brucek Khailany: GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation https://arxiv.org/abs/2601.07593 https://arxiv.org/pdf/2601.07593 https://arxiv.org/html/2601.07593
January 13, 2026 at 6:29 AM
Building a Reasoning LLM with GRPO. If your AI agent only knows how to “search and find,” it’s already behind. The future of AI isn’t just about retrieval; it’s about Logical…

medium.com
January 12, 2026 at 1:31 PM
Wang, Lu, Xu, Chen, Yang, Wang, Chen, Chen, Hu, Wu, Shao, Lu, Luo: TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment https://arxiv.org/abs/2601.05729 https://arxiv.org/pdf/2601.05729 https://arxiv.org/html/2601.05729
January 12, 2026 at 6:30 AM
2601.05242
As language models become increasingly capable, users expect not only accurate responses but also behavior that aligns with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun to incorporate multiple rewards, each capturing a distinct preference and steering the model toward these desired behaviors…
January 11, 2026 at 12:06 AM
[24/30] 118 Likes, 6 Comments, 1 Posts
2601.05242, cs.CL | cs.AI | cs.LG, 08 Jan 2026

🆕GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Ch...
January 11, 2026 at 12:06 AM
I'm so glad the folks on the Kingkiller Chronicle subreddit keep that thing alive; they keep posting theories and it also works as a therapy grpo lol, I love it so much, it does everything for me.
January 10, 2026 at 7:44 PM
They find that the common practice of applying GRPO to multi-reward optimization normalizes distinct reward combinations into identical advantage values. This collapses the training signal, reduces reward-level resolution, and leads to suboptimal convergence.
January 10, 2026 at 2:20 PM
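To make the collapse concrete, here is a minimal NumPy sketch with made-up rewards (not the paper's code or data): once per-sample rewards are summed and then group-normalized, completions with very different reward profiles can receive identical advantages.

```python
import numpy as np

# One GRPO group of 4 sampled completions, scored by two reward functions.
correctness = np.array([1.0, 0.0, 1.0, 0.0])   # answer right / wrong
formatting  = np.array([0.0, 1.0, 0.5, 0.5])   # style / format score

# Common multi-reward practice: sum the rewards, then normalize within the group.
total = correctness + formatting               # -> [1.  1.  1.5 0.5]
adv_grpo = (total - total.mean()) / (total.std() + 1e-8)

print(adv_grpo)  # ~ [ 0.    0.    1.41 -1.41]
# Completion 0 (correct, badly formatted) and completion 1 (wrong, well
# formatted) get the same advantage: the reward-level signal has collapsed.
```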
GDPO, not GRPO

NVIDIA introduces Group reward-Decoupled Normalization Policy Optimization (GDPO), a new multi-reward RL algorithm that consistently improves per-reward convergence over GRPO across a wide range of tasks.
January 10, 2026 at 2:20 PM
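A rough sketch of what "decoupled" normalization could look like, assuming for illustration that each reward is group-normalized separately before the advantages are combined; the exact GDPO formulation is in the paper (arXiv:2601.05242) and may differ.

```python
import numpy as np

correctness = np.array([1.0, 0.0, 1.0, 0.0])
formatting  = np.array([0.0, 1.0, 0.5, 0.5])

def group_norm(r, eps=1e-8):
    """Normalize one reward within the sampled group."""
    return (r - r.mean()) / (r.std() + eps)

# Decoupled: normalize each reward on its own, then combine the advantages.
adv_decoupled = group_norm(correctness) + group_norm(formatting)

print(adv_decoupled)  # ~ [-0.41  0.41  1.   -1.  ]
# The correct-but-badly-formatted completion and the wrong-but-well-formatted
# one are no longer interchangeable, so each reward keeps its own resolution.
```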
As a technical note, they seem to have tried using a GRPO replay buffer, but apparently it just collects data that is never used, because their training is too short to see any replayed prompts (7B & 8B models) and/or has too few steps to reach the target temperature that would enable it (all models).
January 8, 2026 at 11:45 PM
I tasked agentic Gemini to figure it out by giving it their released data and code. It reported that their GRPO trainer has hardcoded temperature annealing from T=1 down to T=0.3 over 3000 steps. So their supposed temperature variants are probably all trained the same, except for some random noise.
January 8, 2026 at 11:44 PM
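For a sense of scale, a small sketch assuming the annealing is linear (the post only gives the endpoints, T=1 down to T=0.3 over 3000 steps; the schedule shape is an assumption): a run stopped at 500 or 1000 steps stays close to the starting temperature.

```python
def annealed_temperature(step, t_start=1.0, t_end=0.3, anneal_steps=3000):
    """Temperature at a given step, assuming a linear ramp (shape assumed;
    only the endpoints come from the reported trainer config)."""
    frac = min(step / anneal_steps, 1.0)
    return t_start + (t_end - t_start) * frac

for step in (0, 500, 1000, 3000):
    print(step, round(annealed_temperature(step), 3))
# 0 1.0 | 500 0.883 | 1000 0.767 | 3000 0.3
# A run that stops after 500-1000 steps never gets near the target temperature,
# which fits the replay-buffer and temperature-variant observations above.
```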
They describe how GRPO was used to generate 8 answers per problem. The idea is to nudge model weights towards the best of them.

What if you do that with temperature=0? Yep, their result files are just sets of 8 identical* responses.

They tried other temperatures, but oddly chose to report that one.
January 8, 2026 at 11:42 PM
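As a reminder of why the sampling temperature matters here, a minimal sketch of the group-relative advantage (illustrative rewards, not their data): GRPO needs the 8 sampled answers to differ; with greedy (temperature=0) decoding all 8 are identical, every reward is the same, and every advantage is zero.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: reward minus group mean, scaled by group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Temperature > 0: the 8 sampled answers differ, so their rewards differ,
# and the advantages point toward the better answers.
print(group_advantages([1, 0, 1, 0, 0, 1, 0, 0]))

# Temperature = 0: greedy decoding returns the same answer 8 times,
# so every reward is identical and every advantage is ~0 -- no learning signal.
print(group_advantages([1, 1, 1, 1, 1, 1, 1, 1]))
```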
So 2/3 of their benchmarks are already toast, and I'll get back to the remaining one (math) later. (spoiler: it doesn't get any better.)

But I'll take a more technical detour first.

Let's get to what their training was.

"We fine-tune reasoning models with GRPO"

First again: not reasoning models.
January 8, 2026 at 11:41 PM
Chi Liu, Xin Chen: Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training https://arxiv.org/abs/2601.03895 https://arxiv.org/pdf/2601.03895 https://arxiv.org/html/2601.03895
January 8, 2026 at 6:34 AM
Amir Hossein Yari, Fajri Koto: AMIR-GRPO: Inducing Implicit Preference Signals into GRPO https://arxiv.org/abs/2601.03661 https://arxiv.org/pdf/2601.03661 https://arxiv.org/html/2601.03661
January 8, 2026 at 6:33 AM
CME 295 - Transformers & Large Language Models cme295.stanford.edu/syllabus/ This course is good! An end-to-end walkthrough of modern LLMs at this moderately coarse granularity is actually quite rare. There are plenty of lectures that go as deep as implementing flash attn. or GRPO, though…
Syllabus | CME 295 - Transformers & Large Language Models
Here, you will find slides and recordings of class lectures, along with suggested readings.
cme295.stanford.edu
January 8, 2026 at 5:56 AM
2601.02256
Visual generation is dominated mainly by three paradigms: autoregressive (AR), diffusion, and visual autoregressive (VAR) models. Unlike AR and diffusion models, VAR handles heterogeneous input structures during generation, which gives rise to severe asynchronous policy conflicts. The problem becomes especially acute in reinforcement learning (RL) scenarios, where instability…
January 8, 2026 at 12:17 AM
[2/30] 30 Likes, 2 Comments, 1 Posts
2601.02256, cs.CV | cs.LG, 05 Jan 2026

🆕VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia...
January 8, 2026 at 12:16 AM
Gemini is probably capable as a model, but it feels like they botched the GRPO stage: its personality is awful and it abandons the job halfway through, so I can't trust it at all. GPT acts as if it can do anything, then misunderstands from the very first step and runs off the rails, wasting the human's time, so I can't trust it either.
January 7, 2026 at 1:40 AM
https://note.com/holy_fox/n/n976faac80012
ArrowIdeative-13b-NeoBase-ZERO-llm-jp is a domestically built Japanese LLM created with reinforcement learning using only GRPO.
It sits somewhere between a base model and an instruction-following model, so prompt engineering works well on it.
This model is based on llm-jp-3.1-13b.
About ArrowIdeative-13b-instruct-ZERO-llm-jp | Holy_fox
Overview: ArrowIdeative-13b-NeoBase-ZERO-llm-jp is a domestically developed Japanese LLM built by applying reinforcement learning using only GRPO (Group Relative Policy Optimization) to a Japanese base model. It is probably the world's first "Japanese intuition model created solely through GRPO post-training," and it has properties midway between a base model and an instruction-following (Instruct) model. Put simply, it is positioned as "a base model for which a certain amount of prompt engineering is effective." For details about the model, see the link below.
note.com
January 6, 2026 at 10:42 AM
That’s not true! All these models are also trained to reason before answering, like R1… with GRPO
January 5, 2026 at 11:56 PM
What's more, Guo et al. stressed that, e.g., "wait" "was virtually absent during the initial training stages, appeared sporadically between steps 4,000 and 7,000, and exhibited a marked increase in frequency after step 8,000."

Yet all your experiments stopped at 500 or 1000 steps...?
January 5, 2026 at 11:29 PM