Jason Weston
@jasonweston.bsky.social
560 followers 340 following 5 posts

Senior Director, Research Scientist @ Meta FAIR + Visiting Prof @ NYU. Pretrain+SFT: NLP from Scratch (2011). Multilayer attention+position encode+LLM: MemNet (2015). Recent (2024): Self-Rewarding LLMs & more!



jasonweston.bsky.social
Our new work on continuous chain of thought.
iscienceluvr.bsky.social
Training Large Language Models to Reason in a Continuous Latent Space

Introduces a new paradigm for LLM reasoning called Chain of Continuous Thought (COCONUT)

Directly feed the last hidden state (a continuous thought) as the input embedding for the next token.

arxiv.org/abs/2412.06769
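
A minimal sketch of the mechanism described above, using the Hugging Face transformers API with gpt2 purely as a stand-in; the paper trains the model to operate in this latent mode, so an off-the-shelf model won't actually reason this way, and the number of latent steps here is an illustrative assumption:

```python
# Sketch: feed the last hidden state back as the next input embedding
# (a "continuous thought") instead of decoding a token at each step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

ids = tok("Question: 3 * (4 + 5) = ?", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)          # (1, seq, d_model)

NUM_LATENT_STEPS = 4                                # assumed hyperparameter
with torch.no_grad():
    for _ in range(NUM_LATENT_STEPS):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]  # continuous thought
        # Append it directly as the next input embedding -- no token sampled.
        embeds = torch.cat([embeds, thought], dim=1)
    # After the latent phase, switch back to ordinary token decoding.
    next_id = out.logits[:, -1, :].argmax(dim=-1)
print(tok.decode(next_id))
```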

jasonweston.bsky.social
Analysis: AD picks a high temp for creative prompts & a low temp for fact-seeking prompts, automatically via training.

Our methods, AD & Latent Preference Optimization, are general & can be applied to train other hyperparams or latent features.

Excited to see how people *adapt* this research!
🧵4/4

jasonweston.bsky.social
We train on a mix of tasks:
GSM8K - requires factuality (low temp)
Stories - requires creativity (high temp)
UltraFeedback - general instruction following, requires mix

Results: Adaptive Decoding outperforms every fixed temperature, choosing temperatures automatically via the AD layer.
🧵3/4

jasonweston.bsky.social
Recipe 👩‍🍳:
Adaptive Decoder (AD) Layer:
- Assigns a probability to each hyperparam choice (decoding temp) given the hidden state; sample a temp, then sample a token at that temp.

Training (Latent PO):
- Train the AD layer by sampling hyperparams+tokens, scoring with a reward model, & optimizing on chosen/rejected hyperparam preference pairs
🧵2/4
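
A hedged sketch of what such a layer could look like in PyTorch; the temperature grid, head architecture, and interface below are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class AdaptiveDecoderLayer(nn.Module):
    """Puts a distribution over a discrete set of decoding temperatures,
    conditioned on the hidden state; one temp is sampled per token."""

    def __init__(self, d_model: int, temps=(0.2, 0.6, 1.0, 1.4)):
        super().__init__()
        self.register_buffer("temps", torch.tensor(temps))
        self.head = nn.Linear(d_model, len(temps))         # hyperparam classifier

    def forward(self, hidden, token_logits):
        # hidden: (batch, d_model) hidden state at the current position
        # token_logits: (batch, vocab) the LM's next-token logits
        temp_logprobs = self.head(hidden).log_softmax(-1)  # log p(temp | hidden)
        temp_idx = torch.multinomial(temp_logprobs.exp(), 1).squeeze(-1)
        temp = self.temps[temp_idx].unsqueeze(-1)          # (batch, 1)
        token = torch.multinomial((token_logits / temp).softmax(-1), 1)
        return token.squeeze(-1), temp_idx, temp_logprobs
```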

jasonweston.bsky.social
🚨 Adaptive Decoding via Latent Preference Optimization 🚨
- New layer for the Transformer that selects decoding params automatically *per token*
- Learnt via a new method, Latent Preference Optimization
- Outperforms any fixed-temperature decoding, choosing creativity or factuality as needed
arxiv.org/abs/2411.09661
🧵1/4
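
For the training side, a minimal sketch of a Latent Preference Optimization-style update, assuming a DPO-like pairwise objective applied to the sampled temperature choices (with a reward model ranking the completions to pick chosen vs. rejected hyperparams; the exact objective is an assumption):

```python
import torch
import torch.nn.functional as F

def latent_po_loss(temp_logprobs, chosen_idx, rejected_idx, beta=0.1):
    """temp_logprobs: (batch, num_temps) log p(temp | hidden) from the AD layer.
    chosen_idx / rejected_idx: temperature choices whose completions the reward
    model ranked higher / lower. Pushes probability toward the winning choice."""
    logp_c = temp_logprobs.gather(-1, chosen_idx.unsqueeze(-1)).squeeze(-1)
    logp_r = temp_logprobs.gather(-1, rejected_idx.unsqueeze(-1)).squeeze(-1)
    return -F.logsigmoid(beta * (logp_c - logp_r)).mean()
```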