Lightnews — Scholar-powered news

Reposted by Masoud Jafaripour

James MacGlashan

@jmac-ai.bsky.social

This one's been a long time coming.

In this post on Decisions & Dragons I answer "Should we abandon RL?"

The answer is obviously no, but people ask because they have a fundamental misunderstanding of what RL is.

RL is a problem, not an approach.

www.decisionsanddragons.com/posts/should...

August 15, 2025 at 11:30 PM

Masoud Jafaripour

@masoudjafaripour.bsky.social

The best video I saw so far on explanation of KL divergence, cross-entropy loss, and their relation/application in LLM training.

Jia-Bin Huang @jbhuang0604.bsky.social · Jun 4

Kullback–Leibler (KL) divergence is a cornerstone of machine learning.

We use it everywhere, from training classifiers and distilling knowledge from models, to learning generative models and aligning LLMs.

BUT, what does it mean, and how do we (actually) compute it?

Video: youtu.be/tXE23653JrU

June 7, 2025 at 3:39 AM

Reposted by Masoud Jafaripour

Danica Sutherland

@djsutherland.ml

Thrilled to announce that Joshua’s paper won one of three Outstanding Paper awards at ICLR.

Come to the poster on Friday afternoon (#376 in Hall 3) or the talk on Saturday (4:30 in Hall 1), and while you’re at it snag him for a postdoc!

Yi (Joshua) Ren @joshuaren.bsky.social · Apr 21

📢Curious why your LLM behaves strangely after long SFT or DPO?
We offer a fresh perspective—consider doing a "force analysis" on your model’s behavior.
Check out our #ICLR2025 Oral paper:

Learning Dynamics of LLM Finetuning!

(0/12)

April 23, 2025 at 7:54 AM

Reposted by Masoud Jafaripour

Eugene Vinitsky 🍒

@eugenevinitsky.bsky.social

Smashing the endorse button as fast as I can
www.lesswrong.com/posts/oKAFFv...

A Bear Case: My Predictions Regarding AI Progress — LessWrong

This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading. …

www.lesswrong.com

March 9, 2025 at 9:09 PM

Reposted by Masoud Jafaripour

Gokul Swamy

@gokul.dev

1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:

March 4, 2025 at 8:59 PM

Reposted by Masoud Jafaripour

Naomi Saphra

@nsaphra.bsky.social

Ever looked at LLM skill emergence and thought 70B parameters was a magic number? Our new paper shows sudden breakthroughs are samples from bimodal performance distributions across seeds. Observed accuracy jumps abruptly while the underlying accuracy DISTRIBUTION changes slowly!

Distributional Scaling Laws for Emergent Capabilities
Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra
In this paper, we explore the nature of sudden breakthroughs in language model performance at scale, which stands in contrast to smooth improvements governed by scaling laws. While advocates of "emergence" view abrupt performance gains as capabilities unlocking at specific scales, others have suggested that they are produced by thresholding effects and alleviated by continuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes, particularly when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. Furthermore, we provide a case study of inverse scaling and show that even as the probability of a successful run declines, the average performance of a successful run continues to increase monotonically. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LLM populations. These insights emphasize the role of random variation in the effect of scale on LLM capabilities.

February 25, 2025 at 10:33 PM

Reposted by Masoud Jafaripour

Tim Kellogg

@timkellogg.me

LLMs That Don't Gaslight You

A new language model uses diffusion instead of next-token prediction. That means the text it can back out of a hallucination before it commits. This is a big win for areas like law & contracts, where global consistency is valued

timkellogg.me/blog/2025/02...

LLaDA: Large Language Diffusion Models

timkellogg.me

February 17, 2025 at 11:32 PM

Reposted by Masoud Jafaripour

Costa Huang

@vwxyzjn.bsky.social

🔥 allenai/Llama-3.1-Tulu-3-8B (trained with PPO) -> allenai/Llama-3.1-Tulu-3.1-8B (trained with GRPO)

We are happy to "quietly" release our latest GRPO-trained Tulu 3.1 model, which is considerably better in MATH and GSM8K!

February 12, 2025 at 5:33 PM

Reposted by Masoud Jafaripour

Jay Alammar

@jayalammar.bsky.social

The Illustrated DeepSeek-R1

Spent the weekend reading the paper and sorting through the intuitions. Here's a visual guide and the main intuitions to understand the model and the process that created it.

newsletter.languagemodels.co/p/the-illust...

January 27, 2025 at 8:22 PM

Reposted by Masoud Jafaripour

Nathan Lambert

@natolambert.bsky.social

Since everyone wants to learn RL for language models now post DeepSeek, reminder that I've been working on this book quietly in the background for months.

Policy gradient chapter is coming together. Plugging away at the book every day now.

rlhfbook.com/c/11-policy-...

February 1, 2025 at 10:05 PM

Reposted by Masoud Jafaripour

Nathan Lambert

@natolambert.bsky.social

Why reasoning models will generalize
DeepSeek R1 is just the tip of the ice berg of rapid progress.
People underestimate the long-term potential of “reasoning.”

Why reasoning models will generalize

People underestimate the long-term potential of “reasoning.”

buff.ly

January 28, 2025 at 9:04 PM

Reposted by Masoud Jafaripour

Tim Kellogg

@timkellogg.me

apparently RLCoT (chain of thought learned via RL) is in itself an emergent behavior that doesn’t happen until about 1.5B sized models

PPO, GPRO, PRIME — doesn’t matter what RL you use, the key is that it’s RL

experiment logs: wandb.ai/jiayipan/Tin...

x: x.com/jiayi_pirate...

jiayipan

Weights & Biases, developer tools for machine learning

wandb.ai

January 25, 2025 at 6:46 PM

Reposted by Masoud Jafaripour

Tim Kellogg

@timkellogg.me

Explainer: What's R1 and Everything Else

This is an attempt to consolidate the dizzying rate of AI developments since Christmas. If you're into AI but not deep enough, this should get you oriented again.

timkellogg.me/blog/2025/01...

The image depicts a monumental statue of Buddha, emphasizing serenity and grandeur. The statue's intricate design captures traditional Buddhist features, including a meditative posture with hands placed in a symbolic gesture, flowing robes, and a calm facial expression exuding peace. The perspective highlights the statue's immense size against a minimalistic white sky background, underscoring its significance as a spiritual and cultural landmark.

January 26, 2025 at 3:17 AM

Reposted by Masoud Jafaripour

David Lindner

@davidlindner.bsky.social

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward?

Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!

Inspired by myopic optimization but better performance – details in🧵

January 23, 2025 at 3:33 PM

Reposted by Masoud Jafaripour

Chris Paxton

@cpaxton.bsky.social

Self-improvement of a population of llms over time: llm-multiagent-ft.github.io

January 14, 2025 at 1:26 AM

Reposted by Masoud Jafaripour

Ethan Mollick

@emollick.bsky.social

Paper shows very small LLMs can match or beat larger ones through 'deep thinking' - evaluating different solution paths - and other tricks. Their 7B model beats o1-preview on complex math by exploring 64 different solutions & picking the best one.

Test-time compute paradigm seems really fruitful.

January 11, 2025 at 5:34 AM

Reposted by Masoud Jafaripour

Nathan Lambert

@natolambert.bsky.social

Quick recap on the state of reasoning -- can LMs reason (I say yes, just different than humans)? How?
My talk at the NeurIPS Latent Space live event (pre o3).
Slides: https://buff.ly/40hsoTx
Post: https://buff.ly/40i2rDC
YouTube: https://buff.ly/40k8GH3

Quick recap on the state of reasoning

My talk at the NeurIPS Latent Space live event.

buff.ly

January 2, 2025 at 4:10 PM

Reposted by Masoud Jafaripour

Melanie Mitchell

@melaniemitchell.bsky.social

Some of my thoughts on OpenAI's o3 and the ARC-AGI benchmark

aiguide.substack.com/p/did-openai...

Did OpenAI Just Solve Abstract Reasoning?

OpenAI’s o3 model aces the "Abstraction and Reasoning Corpus" — but what does it mean?

aiguide.substack.com

December 23, 2024 at 2:38 PM

Reposted by Masoud Jafaripour

Nathan Lambert

@natolambert.bsky.social

Here are the slides for our language modeling tutorial with @kylelo.bsky.social and @akshitab.bsky.social in west ballroom b (ongoing).

docs.google.com/presentation...

[10 December 2024, NeurIPs] Tutorial on Language Modeling

Language Modeling Kyle Lo – Akshita Bhagia – Nathan Lambert Allen Institute of AI [email protected] Neural Information Processing Systems (NeurIPS) 10 December 2024 1

docs.google.com

December 10, 2024 at 6:29 PM

Reposted by Masoud Jafaripour

Arnaud Doucet

@arnauddoucet.bsky.social

The slides of my NeurIPS lecture "From Diffusion Models to Schrödinger Bridges - Generative Modeling meets Optimal Transport" can be found here
drive.google.com/file/d/1eLa3...

BreimanLectureNeurIPS2024_Doucet.pdf

drive.google.com

December 15, 2024 at 6:40 PM

Reposted by Masoud Jafaripour

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)

@rao2z.bsky.social

On the different hues of "Inference Time Scaling" (ITS) in LLMs #SundayHarangue

👉 x.com/rao2z/status...

(bsky still doesn't allow long posts, so..)

x.com

December 2, 2024 at 5:22 AM

Reposted by Masoud Jafaripour

Reinforcement Learning Conference

@rl-conference.bsky.social

You know how RL is back now? Well, so are our upcoming deadlines

AI Conference DL Countdown @dlcountdown.bsky.social · Dec 23

SIGGRAPH'25 (form): 24 days.
RSS'25 (abs): 25 days.
SIGGRAPH'25 (paper-md5): 31 days.
RSS'25 (paper): 32 days.
ICML'25: 38 days.
RLC'25 (abs): 53 days.
RLC'25 (paper): 60 days.
ICCV'25: 73 days.

December 23, 2024 at 11:42 PM

Reposted by Masoud Jafaripour

Gokul Swamy

@gokul.dev

The more time I spend on RLHF, the more I realize the devil is in the details (even more than RL for continuous control). My co-author Zhaolin Gao wrote this excellent blog post on some of these details: huggingface.co/blog/GitBag/.... Maybe it'll be your savior!

RLHF 101: A Technical Dive into RLHF

A Blog post by Zhaolin Gao on Hugging Face

huggingface.co

December 11, 2024 at 8:05 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news