Jason Weston
@jasonweston.bsky.social
560 followers 340 following 5 posts

Senior Director, Research Scientist @ Meta FAIR + Visiting Prof @ NYU. Pretrain+SFT: NLP from Scratch (2011). Multilayer attention+position encode+LLM: MemNet (2015). Recent (2024): Self-Rewarding LLMs & more!



jasonweston.bsky.social
Our new work on continuous chain of thought.
iscienceluvr.bsky.social
Training Large Language Models to Reason in a Continuous Latent Space

Introduces a new paradigm for LLM reasoning called Chain of Continuous Thought (COCONUT)

Directly feed the last hidden state (a continuous thought) as the input embedding for the next token.

arxiv.org/abs/2412.06769
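
A minimal sketch of the mechanism described above, using the Hugging Face transformers API with gpt2 purely as a stand-in; the paper trains the model to operate in this latent mode, so an off-the-shelf model won't actually reason this way, and the number of latent steps here is an illustrative assumption:

```python
# Sketch: feed the last hidden state back as the next input embedding
# (a "continuous thought") instead of decoding a token at each step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

ids = tok("Question: 3 * (4 + 5) = ?", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)          # (1, seq, d_model)

NUM_LATENT_STEPS = 4                                # assumed hyperparameter
with torch.no_grad():
    for _ in range(NUM_LATENT_STEPS):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]  # continuous thought
        # Append it directly as the next input embedding -- no token sampled.
        embeds = torch.cat([embeds, thought], dim=1)
    # After the latent phase, switch back to ordinary token decoding.
    next_id = out.logits[:, -1, :].argmax(dim=-1)
print(tok.decode(next_id))
```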

jasonweston.bsky.social
Analysis: AD picks a high temp for creative prompts & a low temp for fact-seeking prompts, automatically via training.

Our methods, AD & Latent Preference Optimization, are general & can be applied to train other hyperparams or latent features.

Excited to see how people *adapt* this research!
🧵4/4

jasonweston.bsky.social
We train on a mix of tasks:
GSM8K - requires factuality (low temp)
Stories - requires creativity (high temp)
UltraFeedback - general instruction following, requires mix

Results: Adaptive Decoding outperforms every fixed temperature, choosing temperatures automatically via the AD layer.
🧵3/4

jasonweston.bsky.social
Recipe 👩‍🍳:
Adaptive Decoder (AD) Layer:
- Assigns a probability to each hyperparam choice (decoding temp) given the hidden state; sample a temp, then sample a token at that temp.

Training (Latent PO):
- Train the AD layer by sampling hyperparams+tokens, scoring with a reward model, & optimizing on chosen/rejected hyperparam preference pairs
🧵2/4
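
A hedged sketch of what such a layer could look like in PyTorch; the temperature grid, head architecture, and interface below are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class AdaptiveDecoderLayer(nn.Module):
    """Puts a distribution over a discrete set of decoding temperatures,
    conditioned on the hidden state; one temp is sampled per token."""

    def __init__(self, d_model: int, temps=(0.2, 0.6, 1.0, 1.4)):
        super().__init__()
        self.register_buffer("temps", torch.tensor(temps))
        self.head = nn.Linear(d_model, len(temps))         # hyperparam classifier

    def forward(self, hidden, token_logits):
        # hidden: (batch, d_model) hidden state at the current position
        # token_logits: (batch, vocab) the LM's next-token logits
        temp_logprobs = self.head(hidden).log_softmax(-1)  # log p(temp | hidden)
        temp_idx = torch.multinomial(temp_logprobs.exp(), 1).squeeze(-1)
        temp = self.temps[temp_idx].unsqueeze(-1)          # (batch, 1)
        token = torch.multinomial((token_logits / temp).softmax(-1), 1)
        return token.squeeze(-1), temp_idx, temp_logprobs
```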

jasonweston.bsky.social
🚨 Adaptive Decoding via Latent Preference Optimization 🚨
- New layer for the Transformer that selects decoding params automatically *per token*
- Learnt via a new method, Latent Preference Optimization
- Outperforms any fixed-temperature decoding, choosing creativity or factuality as needed
arxiv.org/abs/2411.09661
🧵1/4
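
For the training side, a minimal sketch of a Latent Preference Optimization-style update, assuming a DPO-like pairwise objective applied to the sampled temperature choices (with a reward model ranking the completions to pick chosen vs. rejected hyperparams; the exact objective is an assumption):

```python
import torch
import torch.nn.functional as F

def latent_po_loss(temp_logprobs, chosen_idx, rejected_idx, beta=0.1):
    """temp_logprobs: (batch, num_temps) log p(temp | hidden) from the AD layer.
    chosen_idx / rejected_idx: temperature choices whose completions the reward
    model ranked higher / lower. Pushes probability toward the winning choice."""
    logp_c = temp_logprobs.gather(-1, chosen_idx.unsqueeze(-1)).squeeze(-1)
    logp_r = temp_logprobs.gather(-1, rejected_idx.unsqueeze(-1)).squeeze(-1)
    return -F.logsigmoid(beta * (logp_c - logp_r)).mean()
```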