Chandler Smith
@chansmi.bsky.social
Multi-Agent Researcher at CAIF | applied research at IQT | Thinking about making MA systems go well
Personal Site: chandlersmith.me
December 9, 2024 at 7:09 PM
Here’s the link to the paper and Hugging Face page: arxiv.org/pdf/2412.01928 and huggingface.co/papers/2412....
December 6, 2024 at 10:38 PM
This was an incredible collaboration with our lead Sumeet Motwani, Philip Torr, and Ronnie Clark from Oxford; Fabio Pizzati, Rocktim Jyoti Das, and Ivan Laptev at MBZUAI; and Mark Rybchuk at Berkeley. Expert supervision from Christian Schroeder de Witt from @oxfordtvg.bsky.social!
December 6, 2024 at 10:38 PM
MALT is still preliminary work, and there is a lot left to explore, but I believe this is an important research direction. We’ll be working on scaling it to more settings, especially partial observability for a critic that can use tools, and smarter ways to distill the results.
December 6, 2024 at 10:38 PM
We see very strong performance across MATH, GSM8k, and CommonsenseQA against trained and untrained baselines with Llama 3.1 8B!
December 6, 2024 at 10:38 PM
In this setup, models get better at checking and improving specific parts of answers based on what worked best during search. This can address limitations models have around backtracking or critiquing their own chain of thought.
December 6, 2024 at 10:38 PM
Using SFT and DPO, we can learn from both positive and negative reasoning traces. The multi-agent setup allows for role specialization: the extra context already present in each prompt reduces the work left for the subsequent models.
December 6, 2024 at 10:38 PM
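As a rough sketch of how the DPO side of this could be set up (not our exact pipeline; the field names below are illustrative), positively and negatively labeled completions that share a prompt become preference pairs:

```python
def dpo_pairs(examples: list[dict]) -> list[dict]:
    """Build (prompt, chosen, rejected) triples from role-specific examples that
    share a prompt but received opposite labels from the value propagation."""
    by_prompt: dict[str, dict[str, list[str]]] = {}
    for ex in examples:  # ex: {"prompt": ..., "completion": ..., "label": "positive"/"negative"}
        bucket = by_prompt.setdefault(ex["prompt"], {"positive": [], "negative": []})
        bucket[ex["label"]].append(ex["completion"])
    pairs = []
    for prompt, bucket in by_prompt.items():
        for chosen in bucket["positive"]:
            for rejected in bucket["negative"]:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```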
This allows us to compare final outputs to a ground truth, propagate rewards back through the tree’s nodes, and post-train models on role-specific data. The generator learns to be a better generator, the critic learns to be a better critic, and so on, by bootstrapping reasoning traces.
December 6, 2024 at 10:38 PM
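Illustratively (the schema below is hypothetical, not the paper’s), the role-specific split means each model is trained only on the step it produced, with everything upstream of that step as its prompt:

```python
def role_datasets(traces: list[dict]) -> dict[str, list[dict]]:
    """Split labeled traces into per-role training sets: the generator sees only the
    question, the critic sees question + generation, the refiner sees the full context."""
    data = {"generator": [], "critic": [], "refiner": []}
    for t in traces:  # t carries per-step labels produced by the value propagation
        data["generator"].append({"prompt": t["question"],
                                  "completion": t["generation"],
                                  "label": t["labels"]["generation"]})
        data["critic"].append({"prompt": f'{t["question"]}\n{t["generation"]}',
                               "completion": t["critique"],
                               "label": t["labels"]["critique"]})
        data["refiner"].append({"prompt": f'{t["question"]}\n{t["generation"]}\n{t["critique"]}',
                                "completion": t["refinement"],
                                "label": t["labels"]["refinement"]})
    return data
```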
Just by looking at these trees, how do you tell which branches are useful for post-training without human feedback or trained PRMs? Value iteration is a simple approach: propagate outcome labels through the branches, with a threshold that marks the quality of each reasoning step.
December 6, 2024 at 10:38 PM
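In spirit, that propagation can be as simple as the sketch below: a node’s value is the fraction of its descendant leaves whose final answer is correct, thresholded into a binary label. (The exact value-iteration details and threshold in the paper may differ.)

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    output: str                       # the text produced at this step
    children: list[Node] = field(default_factory=list)
    label: str | None = None

def value(node: Node, is_correct) -> float:
    """Fraction of descendant leaves (final refinements) judged correct."""
    if not node.children:
        return 1.0 if is_correct(node.output) else 0.0
    return sum(value(c, is_correct) for c in node.children) / len(node.children)

def label_tree(node: Node, is_correct, threshold: float = 0.5) -> None:
    """Threshold the propagated value to mark each step as a positive or negative example."""
    node.label = "positive" if value(node, is_correct) >= threshold else "negative"
    for c in node.children:
        label_tree(c, is_correct, threshold)
```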
Training the models together in a single chain is a difficult problem, since each model only produces discrete outputs. We use a tree-based sampling strategy with an exponential branching factor that can generate an incredible amount of synthetic data for bootstrapping the performance of each model!
December 6, 2024 at 10:38 PM
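A rough sketch of the sampling (the helper and prompts below are placeholders, not our exact implementation): n samples at each of the three roles gives n^3 end-to-end traces per question, which is where the exponential branching comes from.

```python
def sample_n(model: str, prompt: str, n: int) -> list[str]:
    """Sample n completions from the given model (e.g. with temperature > 0)."""
    raise NotImplementedError("wire this to your model-serving stack")

def sample_tree(question: str, models: dict, n: int = 3) -> list[dict]:
    """n generations x n critiques x n refinements = n**3 traces per question."""
    traces = []
    for gen in sample_n(models["generator"], f"Solve step by step:\n{question}", n):
        for crit in sample_n(models["critic"],
                             f"{question}\n\nDraft:\n{gen}\n\nCritique:", n):
            for ref in sample_n(models["refiner"],
                                f"{question}\n\nDraft:\n{gen}\n\nCritique:\n{crit}\n\nFinal answer:", n):
                traces.append({"question": question, "generation": gen,
                               "critique": crit, "refinement": ref})
    return traces  # len(traces) == n ** 3
```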
Our goal was to develop techniques where a system of multiple models could be trained together. We use a generator, critic, and refinement setting that mimics how humans might interact with LLMs.
December 6, 2024 at 10:38 PM
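A minimal sketch of that three-role loop (the ask helper, model names, and prompt wording are placeholders, not our exact setup):

```python
def ask(model: str, prompt: str) -> str:
    """Query one of the role models; implementation depends on your serving stack."""
    raise NotImplementedError("wire this to your model-serving stack")

def answer(question: str,
           generator: str = "llama-3.1-8b-generator",
           critic: str = "llama-3.1-8b-critic",
           refiner: str = "llama-3.1-8b-refiner") -> str:
    draft = ask(generator, f"Solve step by step:\n{question}")
    critique = ask(critic, f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
                           "Identify any mistakes in the reasoning.")
    return ask(refiner, f"Question:\n{question}\n\nDraft:\n{draft}\n\n"
                        f"Critique:\n{critique}\n\nWrite a corrected final answer.")
```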
It was a privilege to collaborate with @ankareuel.bsky.social, Amelia Hardy, @mlamparth.bsky.social, Malcolm Hardy, and Professor Mykel Kochenderfer.
November 26, 2024 at 5:30 PM