Nicholas Lourie @ COLM
@nicholaslourie.bsky.social
Better empirical methods for deep learning & NLP. PhD at NYU. Advised by He He and @kyunghyuncho.bsky.social. Prev: @ai2.bsky.social. I build things. 🤖
Pinned
nicholaslourie.bsky.social
📄🔈✨ Deep learning is an empirical science, but we rely on basic empirical methods. What might a better foundation—a simple theory—for empirical work look like?

@kyunghyuncho.bsky.social, He He, and I move towards one in "Hyperparameter Loss Surfaces Are Simple Near their Optima" at #COLM2025!

🧵1/9
nicholaslourie.bsky.social
And for anyone at #COLM2025, if you're curious come chat at our poster! We're presenting as Poster 67 at Poster Session 4 this afternoon!
nicholaslourie.bsky.social
The noisy quadratic emerges across a range of architectures, tasks, and modalities—including language modeling, supervised finetuning, and ImageNet pretraining.

In all these scenarios, our theory displays an excellent fit! 👇

See the paper for even more!

🧵8/9
nicholaslourie.bsky.social
The noisy quadratic distribution (Q) has 4 parameters, corresponding to properties of the loss surface like the *best possible performance* or the *effective number of hyperparameters*. Using the noisy quadratic, you can construct confidence intervals for these quantities.

🧵7/9
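To make those four parameters concrete, here is a minimal simulator for noisy-quadratic scores under one illustrative parameterization (best possible loss, effective number of hyperparameters, a curvature scale, and the noise standard deviation). The parameterization is an assumption for illustration, not necessarily the paper's exact one.

```python
# Illustrative only: simulate scores from a noisy quadratic with four
# parameters -- this parameterization is an assumption, not necessarily
# the paper's.
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_quadratic(n, best_loss=2.0, n_effective=3,
                           curvature=1.0, noise_sd=0.05):
    # Offsets of the sampled hyperparameters from the optimum, one
    # coordinate per effective hyperparameter.
    offsets = rng.uniform(-1.0, 1.0, size=(n, n_effective))
    # Quadratic loss surface plus additive normal noise.
    clean = best_loss + 0.5 * curvature * np.sum(offsets**2, axis=1)
    return clean + rng.normal(scale=noise_sd, size=n)

scores = sample_noisy_quadratic(10_000)
print(scores.min(), scores.mean())
```

Fitting those four parameters to the tail of real random-search scores is what yields the estimates and confidence intervals for quantities like the best possible performance.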
nicholaslourie.bsky.social
The score distribution's tail converges to a new distribution: the noisy quadratic.

If you find where the noisy quadratic matches the score distribution, then you've found where the simple structure starts, or (as we call it) the *asymptotic regime*.

🧵6/9
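One generic way to locate that regime (a sketch of the idea, not the paper's exact procedure): scan over tail sizes, fit a candidate family to each tail of best scores, and keep the largest tail that still fits well. In the sketch below the scores are synthetic and a normal family stands in for the noisy quadratic so that it runs on its own.

```python
# Sketch of threshold selection for the asymptotic regime. The normal
# family and the synthetic scores are stand-ins; the paper instead fits
# its noisy quadratic to the tail of real random-search scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.gamma(shape=4.0, scale=0.5, size=2_000)  # synthetic scores

best_first = np.sort(scores)  # smallest (best) losses first
for tail_size in (50, 100, 200, 400, 800):
    tail = best_first[:tail_size]
    loc, scale = stats.norm.fit(tail)  # stand-in for the noisy-quadratic fit
    ks = stats.kstest(tail, stats.norm(loc, scale).cdf).statistic
    print(f"best {tail_size:3d} scores: KS distance = {ks:.3f}")
```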
nicholaslourie.bsky.social
But how do you find the region where that simple structure holds? With a familiar tool: random search!

When you sample hyperparameters and evaluate them, you get a validation score. That process defines the *score distribution* from random search, and we prove a novel limit theorem about it.

🧵5/9
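In code, that process looks roughly like this. It's a sketch: the hyperparameters, ranges, and loss are made up for illustration, and a synthetic function stands in for an actual training run.

```python
# A sketch of random search. train_and_evaluate is a synthetic stand-in
# (a quadratic in log-space plus noise), not a real training run.
import numpy as np

rng = np.random.default_rng(0)

def train_and_evaluate(log_lr, log_wd):
    """Return a synthetic validation loss for one training run."""
    best_loss = 2.0
    loss = best_loss + 0.5 * (log_lr + 3.0)**2 + 0.1 * (log_wd + 4.0)**2
    return loss + rng.normal(scale=0.05)

# Each sample-then-evaluate step is one draw from the score distribution.
scores = np.array([
    train_and_evaluate(rng.uniform(-5.0, -1.0), rng.uniform(-6.0, -2.0))
    for _ in range(256)
])
```

The limit theorem concerns the lower tail of `scores`, i.e. the best results that random search turns up.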
nicholaslourie.bsky.social
The problem is, the validation score isn't a deterministic function of the hyperparameters. Train the same model twice and you'll get two different scores!

Luckily, the noise is simple: normally distributed with constant variance. You see this empirically if you retrain a model many times. 👇

🧵4/9
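A sketch of that check: retrain one fixed configuration under many seeds, then eyeball a QQ plot or run a normality test. The retraining function below is a synthetic stand-in for an actual training run.

```python
# Sketch of the noise check. retrain_and_evaluate is a synthetic
# stand-in for retraining one fixed hyperparameter configuration.
import numpy as np
from scipy import stats

def retrain_and_evaluate(seed):
    """Return a synthetic validation score for one retraining run."""
    rng = np.random.default_rng(seed)
    return 2.0 + rng.normal(scale=0.05)  # constant mean + normal noise

scores = np.array([retrain_and_evaluate(seed) for seed in range(50)])

# If the noise really is normal with constant variance, a normality
# test should not reject (and a QQ plot should look close to linear).
print(stats.shapiro(scores))
```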
nicholaslourie.bsky.social
You get quadratic structure from a Taylor expansion about the optimum. As search progresses, the hyperparameters you care about get closer to the optimum and the Taylor expansion becomes a better approximation.

🧵3/9
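Spelled out (with my notation, λ for the hyperparameters and λ* for the optimum), that's a second-order Taylor expansion in which the gradient term vanishes at the optimum:

```latex
% Second-order Taylor expansion of the noise-free loss about the optimum.
L(\lambda)
  \approx L(\lambda^*)
  + \underbrace{\nabla L(\lambda^*)^\top (\lambda - \lambda^*)}_{=\,0}
  + \tfrac{1}{2}\, (\lambda - \lambda^*)^\top H\, (\lambda - \lambda^*),
\qquad H = \nabla^2 L(\lambda^*).
```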
nicholaslourie.bsky.social
Hyperparameters are complex, but near the optimum—near the hyperparameters that matter most—their structure becomes surprisingly simple: *quadratic* with *additive normal noise*.

🧵2/9
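As an equation (again in my notation, not necessarily the paper's), the claim is that an observed validation score Y near the optimum behaves like a quadratic in the hyperparameters plus additive normal noise:

```latex
% Local model for an observed validation score near the optimum.
Y(\lambda) \approx L^*
  + \tfrac{1}{2}\,(\lambda - \lambda^*)^\top H\,(\lambda - \lambda^*)
  + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).
```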
Reposted by Nicholas Lourie @ COLM
seasnell.bsky.social
Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵