Nicholas Lourie @ COLM
@nicholaslourie.bsky.social
Better empirical methods for deep learning & NLP. PhD at NYU. Advised by He He and @kyunghyuncho.bsky.social. Prev: @ai2.bsky.social. I build things. 🤖
Pinned
nicholaslourie.bsky.social
📄🔈✨ Deep learning is an empirical science, but we rely on basic empirical methods. What might a better foundation—a simple theory—for empirical work look like?

@kyunghyuncho.bsky.social, He He, and I move towards one in "Hyperparameter Loss Surfaces Are Simple Near their Optima" at #COLM2025!

🧵1/9
nicholaslourie.bsky.social
And for anyone at #COLM2025, if you're curious come chat at our poster! We're presenting as Poster 67 at Poster Session 4 this afternoon!
nicholaslourie.bsky.social
The noisy quadratic emerges across a range of architectures, tasks, and modalities—including language modeling, supervised finetuning, and ImageNet pretraining.

In all these scenarios, our theory displays an excellent fit! 👇

See the paper for even more!

🧵8/9
nicholaslourie.bsky.social
The noisy quadratic distribution (Q) has 4 parameters, corresponding to properties of the loss surface like the *best possible performance* or the *effective number of hyperparameters*. Using the noisy quadratic, you can construct confidence intervals for these quantities.

🧵7/9
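To make those four parameters concrete, here is a minimal simulator for noisy-quadratic scores under one illustrative parameterization (best possible loss, effective number of hyperparameters, a curvature scale, and the noise standard deviation). The parameterization is an assumption for illustration, not necessarily the paper's exact one.

```python
# Illustrative only: simulate scores from a noisy quadratic with four
# parameters -- this parameterization is an assumption, not necessarily
# the paper's.
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_quadratic(n, best_loss=2.0, n_effective=3,
                           curvature=1.0, noise_sd=0.05):
    # Offsets of the sampled hyperparameters from the optimum, one
    # coordinate per effective hyperparameter.
    offsets = rng.uniform(-1.0, 1.0, size=(n, n_effective))
    # Quadratic loss surface plus additive normal noise.
    clean = best_loss + 0.5 * curvature * np.sum(offsets**2, axis=1)
    return clean + rng.normal(scale=noise_sd, size=n)

scores = sample_noisy_quadratic(10_000)
print(scores.min(), scores.mean())
```

Fitting those four parameters to the tail of real random-search scores is what yields the estimates and confidence intervals for quantities like the best possible performance.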
nicholaslourie.bsky.social
The score distribution's tail converges to a new distribution: the noisy quadratic.

If you find where the noisy quadratic matches the score distribution, then you've found where the simple structure starts, or (as we call it) the *asymptotic regime*.

🧵6/9
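One generic way to locate that regime (a sketch of the idea, not the paper's exact procedure): scan over tail sizes, fit a candidate family to each tail of best scores, and keep the largest tail that still fits well. In the sketch below the scores are synthetic and a normal family stands in for the noisy quadratic so that it runs on its own.

```python
# Sketch of threshold selection for the asymptotic regime. The normal
# family and the synthetic scores are stand-ins; the paper instead fits
# its noisy quadratic to the tail of real random-search scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.gamma(shape=4.0, scale=0.5, size=2_000)  # synthetic scores

best_first = np.sort(scores)  # smallest (best) losses first
for tail_size in (50, 100, 200, 400, 800):
    tail = best_first[:tail_size]
    loc, scale = stats.norm.fit(tail)  # stand-in for the noisy-quadratic fit
    ks = stats.kstest(tail, stats.norm(loc, scale).cdf).statistic
    print(f"best {tail_size:3d} scores: KS distance = {ks:.3f}")
```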
nicholaslourie.bsky.social
But how do you find the region where that simple structure holds? With a familiar tool: random search!

When you sample hyperparameters and evaluate them, you get a validation score. That process defines the *score distribution* from random search, and we prove a novel limit theorem about it.

🧵5/9
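In code, that process looks roughly like this. It's a sketch: the hyperparameters, ranges, and loss are made up for illustration, and a synthetic function stands in for an actual training run.

```python
# A sketch of random search. train_and_evaluate is a synthetic stand-in
# (a quadratic in log-space plus noise), not a real training run.
import numpy as np

rng = np.random.default_rng(0)

def train_and_evaluate(log_lr, log_wd):
    """Return a synthetic validation loss for one training run."""
    best_loss = 2.0
    loss = best_loss + 0.5 * (log_lr + 3.0)**2 + 0.1 * (log_wd + 4.0)**2
    return loss + rng.normal(scale=0.05)

# Each sample-then-evaluate step is one draw from the score distribution.
scores = np.array([
    train_and_evaluate(rng.uniform(-5.0, -1.0), rng.uniform(-6.0, -2.0))
    for _ in range(256)
])
```

The limit theorem concerns the lower tail of `scores`, i.e. the best results that random search turns up.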
nicholaslourie.bsky.social
The problem is, the validation score isn't a deterministic function of the hyperparameters. Train the same model twice and you'll get two different scores!

Luckily, the noise is simple: normally distributed with constant variance. You see this empirically if you retrain a model many times. 👇

🧵4/9
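A sketch of that check: retrain one fixed configuration under many seeds, then eyeball a QQ plot or run a normality test. The retraining function below is a synthetic stand-in for an actual training run.

```python
# Sketch of the noise check. retrain_and_evaluate is a synthetic
# stand-in for retraining one fixed hyperparameter configuration.
import numpy as np
from scipy import stats

def retrain_and_evaluate(seed):
    """Return a synthetic validation score for one retraining run."""
    rng = np.random.default_rng(seed)
    return 2.0 + rng.normal(scale=0.05)  # constant mean + normal noise

scores = np.array([retrain_and_evaluate(seed) for seed in range(50)])

# If the noise really is normal with constant variance, a normality
# test should not reject (and a QQ plot should look close to linear).
print(stats.shapiro(scores))
```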
nicholaslourie.bsky.social
You get quadratic structure from a Taylor expansion about the optimum. As search progresses, the hyperparameters you care about get closer to the optimum and the Taylor expansion becomes a better approximation.

🧵3/9
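Spelled out (with my notation, λ for the hyperparameters and λ* for the optimum), that's a second-order Taylor expansion in which the gradient term vanishes at the optimum:

```latex
% Second-order Taylor expansion of the noise-free loss about the optimum.
L(\lambda)
  \approx L(\lambda^*)
  + \underbrace{\nabla L(\lambda^*)^\top (\lambda - \lambda^*)}_{=\,0}
  + \tfrac{1}{2}\, (\lambda - \lambda^*)^\top H\, (\lambda - \lambda^*),
\qquad H = \nabla^2 L(\lambda^*).
```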
nicholaslourie.bsky.social
Hyperparameters are complex, but near the optimum—near the hyperparameters that matter most—their structure becomes surprisingly simple: *quadratic* with *additive normal noise*.

🧵2/9
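As an equation (again in my notation, not necessarily the paper's), the claim is that an observed validation score Y near the optimum behaves like a quadratic in the hyperparameters plus additive normal noise:

```latex
% Local model for an observed validation score near the optimum.
Y(\lambda) \approx L^*
  + \tfrac{1}{2}\,(\lambda - \lambda^*)^\top H\,(\lambda - \lambda^*)
  + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).
```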
Reposted by Nicholas Lourie @ COLM
seasnell.bsky.social
Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵