Chandler Smith
@chansmi.bsky.social
Multi-Agent Researcher at CAIF | applied research at IQT | Thinking about making MA systems go well
Personal Site: chandlersmith.me
December 9, 2024 at 7:09 PM
Here’s the link to the paper and Hugging Face page: arxiv.org/pdf/2412.01928 and huggingface.co/papers/2412....
December 6, 2024 at 10:38 PM
This was an incredible collaboration with our lead Sumeet Motwani, Philip Torr, and Ronnie Clark from Oxford; Fabio Pizzati, Rocktim Jyoti Das, and Ivan Laptev at MBZUAI; and Mark Rybchuk at Berkeley. Expert supervision from Christian Schroeder de Witt from @oxfordtvg.bsky.social!
December 6, 2024 at 10:38 PM
MALT is still preliminary work, and there is a lot left to explore, but I believe this is an important research direction. We’ll be working on scaling it to more settings, especially partial observability for a critic that can use tools, and smarter ways to distill the results.
December 6, 2024 at 10:38 PM
We see very strong performance across MATH, GSM8k, and CommonsenseQA against trained and untrained baselines with Llama 3.1 8B!
December 6, 2024 at 10:38 PM
In this setup, models get better at checking and improving specific parts of answers based on what worked best during search. This can address limitations models have around backtracking or critiquing their own chain of thought.
December 6, 2024 at 10:38 PM
Using SFT and DPO, we can learn from both positive and negative reasoning traces. The multi-agent setup allows for role specialization: the extra context already present in each prompt reduces the work left for the subsequent models.
December 6, 2024 at 10:38 PM
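As a rough sketch of how the DPO side of this could be set up (not our exact pipeline; the field names below are illustrative), positively and negatively labeled completions that share a prompt become preference pairs:

```python
def dpo_pairs(examples: list[dict]) -> list[dict]:
    """Build (prompt, chosen, rejected) triples from role-specific examples that
    share a prompt but received opposite labels from the value propagation."""
    by_prompt: dict[str, dict[str, list[str]]] = {}
    for ex in examples:  # ex: {"prompt": ..., "completion": ..., "label": "positive"/"negative"}
        bucket = by_prompt.setdefault(ex["prompt"], {"positive": [], "negative": []})
        bucket[ex["label"]].append(ex["completion"])
    pairs = []
    for prompt, bucket in by_prompt.items():
        for chosen in bucket["positive"]:
            for rejected in bucket["negative"]:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```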
This allows us to compare final outputs to a ground truth, propagate rewards back through the tree’s nodes, and post-train models on role-specific data. The generator learns to be a better generator, the critic learns to be a better critic, and so on, by bootstrapping reasoning traces.
December 6, 2024 at 10:38 PM
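Illustratively (the schema below is hypothetical, not the paper’s), the role-specific split means each model is trained only on the step it produced, with everything upstream of that step as its prompt:

```python
def role_datasets(traces: list[dict]) -> dict[str, list[dict]]:
    """Split labeled traces into per-role training sets: the generator sees only the
    question, the critic sees question + generation, the refiner sees the full context."""
    data = {"generator": [], "critic": [], "refiner": []}
    for t in traces:  # t carries per-step labels produced by the value propagation
        data["generator"].append({"prompt": t["question"],
                                  "completion": t["generation"],
                                  "label": t["labels"]["generation"]})
        data["critic"].append({"prompt": f'{t["question"]}\n{t["generation"]}',
                               "completion": t["critique"],
                               "label": t["labels"]["critique"]})
        data["refiner"].append({"prompt": f'{t["question"]}\n{t["generation"]}\n{t["critique"]}',
                                "completion": t["refinement"],
                                "label": t["labels"]["refinement"]})
    return data
```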
Just by looking at these trees, how do you tell which branches are useful for post-training without human feedback or trained PRMs? Value iteration is a simple approach: propagate outcome labels through the branches, with a threshold that marks the quality of each reasoning step.
December 6, 2024 at 10:38 PM
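In spirit, that propagation can be as simple as the sketch below: a node’s value is the fraction of its descendant leaves whose final answer is correct, thresholded into a binary label. (The exact value-iteration details and threshold in the paper may differ.)

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    output: str                       # the text produced at this step
    children: list[Node] = field(default_factory=list)
    label: str | None = None

def value(node: Node, is_correct) -> float:
    """Fraction of descendant leaves (final refinements) judged correct."""
    if not node.children:
        return 1.0 if is_correct(node.output) else 0.0
    return sum(value(c, is_correct) for c in node.children) / len(node.children)

def label_tree(node: Node, is_correct, threshold: float = 0.5) -> None:
    """Threshold the propagated value to mark each step as a positive or negative example."""
    node.label = "positive" if value(node, is_correct) >= threshold else "negative"
    for c in node.children:
        label_tree(c, is_correct, threshold)
```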
Training the models together in a single chain is a difficult problem, since each model only produces discrete outputs. We use a tree-based sampling strategy with an exponential branching factor that can generate an incredible amount of synthetic data for bootstrapping the performance of each model!
December 6, 2024 at 10:38 PM
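A rough sketch of the sampling (the helper and prompts below are placeholders, not our exact implementation): n samples at each of the three roles gives n^3 end-to-end traces per question, which is where the exponential branching comes from.

```python
def sample_n(model: str, prompt: str, n: int) -> list[str]:
    """Sample n completions from the given model (e.g. with temperature > 0)."""
    raise NotImplementedError("wire this to your model-serving stack")

def sample_tree(question: str, models: dict, n: int = 3) -> list[dict]:
    """n generations x n critiques x n refinements = n**3 traces per question."""
    traces = []
    for gen in sample_n(models["generator"], f"Solve step by step:\n{question}", n):
        for crit in sample_n(models["critic"],
                             f"{question}\n\nDraft:\n{gen}\n\nCritique:", n):
            for ref in sample_n(models["refiner"],
                                f"{question}\n\nDraft:\n{gen}\n\nCritique:\n{crit}\n\nFinal answer:", n):
                traces.append({"question": question, "generation": gen,
                               "critique": crit, "refinement": ref})
    return traces  # len(traces) == n ** 3
```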
Our goal was to develop techniques where a system of multiple models could be trained together. We use a generator, critic, and refinement setting that mimics how humans might interact with LLMs.
December 6, 2024 at 10:38 PM
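A minimal sketch of that three-role loop (the ask helper, model names, and prompt wording are placeholders, not our exact setup):

```python
def ask(model: str, prompt: str) -> str:
    """Query one of the role models; implementation depends on your serving stack."""
    raise NotImplementedError("wire this to your model-serving stack")

def answer(question: str,
           generator: str = "llama-3.1-8b-generator",
           critic: str = "llama-3.1-8b-critic",
           refiner: str = "llama-3.1-8b-refiner") -> str:
    draft = ask(generator, f"Solve step by step:\n{question}")
    critique = ask(critic, f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
                           "Identify any mistakes in the reasoning.")
    return ask(refiner, f"Question:\n{question}\n\nDraft:\n{draft}\n\n"
                        f"Critique:\n{critique}\n\nWrite a corrected final answer.")
```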
It was a privilege to collaborate with @ankareuel.bsky.social, Amelia Hardy, @mlamparth.bsky.social, Malcolm Hardy, and Professor Mykel Kochenderfer.
November 26, 2024 at 5:30 PM