Desi R Ivanova
@desirivanova.bsky.social
Research fellow @OxfordStats @OxCSML, spent time at FAIR and MSR
Former quant 📈 (@GoldmanSachs), former former gymnast 🤸‍♀️
My opinions are my own
🇧🇬-🇬🇧 sh/ssh
Along with the lightweight library, we provide short code snippets in the paper.
March 6, 2025 at 3:00 PM
…and for constructing error bars on more complicated metrics that need the flexibility of Bayes, such as F1 score.
March 6, 2025 at 3:00 PM
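(Not the paper's code, just a hypothetical sketch of the kind of flexibility meant here: one standard Bayesian route to F1 error bars is a Dirichlet posterior over the joint confusion-matrix counts, pushing posterior samples through the F1 formula. The counts below are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confusion-matrix counts from an eval: [TP, FP, FN, TN].
counts = np.array([40, 5, 10, 45])

# Dirichlet posterior over the four outcome probabilities
# (uniform Dirichlet(1,1,1,1) prior plus the observed counts).
samples = rng.dirichlet(counts + 1, size=10_000)
tp, fp, fn = samples[:, 0], samples[:, 1], samples[:, 2]

# Push each posterior sample through the F1 formula...
f1 = 2 * tp / (2 * tp + fp + fn)

# ...and read off a 95% credible interval.
lo, hi = np.percentile(f1, [2.5, 97.5])
print(f"F1 95% credible interval: [{lo:.3f}, {hi:.3f}]")
```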
...and treated without an independence assumption (e.g. using the same eval questions on both LLMs)...
March 6, 2025 at 3:00 PM
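(Again a sketch, not necessarily the paper's exact model: when both LLMs answer the same questions, a Dirichlet posterior over the four joint outcomes keeps the correlation between their scores instead of wrongly treating them as independent. The counts are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired outcomes on the same N questions:
# counts of (both correct, only A correct, only B correct, both wrong).
counts = np.array([60, 15, 8, 17])

# Dirichlet posterior over the four joint outcome probabilities;
# sharing questions induces a correlation that this model preserves.
samples = rng.dirichlet(counts + 1, size=10_000)

acc_A = samples[:, 0] + samples[:, 1]  # P(A correct)
acc_B = samples[:, 0] + samples[:, 2]  # P(B correct)
diff = acc_A - acc_B

lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"95% credible interval for acc_A - acc_B: [{lo:.3f}, {hi:.3f}]")
```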
...for making comparisons between two LLMs treated independently...
March 6, 2025 at 3:00 PM
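(A minimal sketch of the independent-comparison case, assuming a uniform Beta(1, 1) prior for each model; the counts are made up, and this isn't claimed to be the library's API.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical results for two LLMs evaluated on separate question sets.
correct_A, n_A = 82, 100
correct_B, n_B = 75, 100

# Independent Beta posteriors (uniform Beta(1, 1) priors), sampled.
acc_A = rng.beta(correct_A + 1, n_A - correct_A + 1, size=10_000)
acc_B = rng.beta(correct_B + 1, n_B - correct_B + 1, size=10_000)

diff = acc_A - acc_B
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"95% credible interval for acc_A - acc_B: [{lo:.3f}, {hi:.3f}]")
print(f"P(A better than B) ~ {(diff > 0).mean():.3f}")
```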
We also suggest simple methods for the clustered-question setting (where we don't assume all N questions are IID; instead we have T clusters of N/T IID questions)...
March 6, 2025 at 3:00 PM
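(One simple frequentist option in this setting, sketched on made-up data and assuming equal-sized clusters, is a clustered standard error that treats the T cluster means, not the N individual questions, as the IID units.)

```python
import numpy as np

# Hypothetical eval: T = 3 clusters of N/T = 4 questions each,
# e.g. several questions sharing the same reading passage.
clusters = [np.array([1, 1, 0, 1]),   # per-question 0/1 scores
            np.array([0, 0, 1, 0]),
            np.array([1, 1, 1, 1])]

cluster_means = np.array([c.mean() for c in clusters])
T = len(cluster_means)

# With equal-sized clusters, the mean of cluster means is the accuracy.
acc = cluster_means.mean()

# Standard error computed from the T cluster means, so within-cluster
# correlation can't make the error bar spuriously narrow.
se = cluster_means.std(ddof=1) / np.sqrt(T)
print(f"accuracy ~ {acc:.3f} +/- {1.96 * se:.3f}")
```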
We suggest using Bayesian credible intervals for your error bars instead, with a simple Beta-Binomial model. (The aim is for the methods to achieve nominal 1-alpha coverage, i.e. match the dotted line in the top row of the figure: a 95% interval should contain the true value 95% of the time.)
March 6, 2025 at 3:00 PM
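(A minimal sketch of that Beta-Binomial interval on made-up numbers, assuming a uniform Beta(1, 1) prior; unlike the CLT interval in the next post, it stays inside [0, 1] and never collapses to zero width.)

```python
from scipy.stats import beta

# Hypothetical eval: 92 of 100 questions answered correctly.
n, correct = 100, 92

# A uniform Beta(1, 1) prior on accuracy gives a
# Beta(correct + 1, n - correct + 1) posterior.
posterior = beta(correct + 1, n - correct + 1)

# 95% equal-tailed credible interval: always inside [0, 1],
# and non-degenerate even when correct == n.
lo, hi = posterior.ppf([0.025, 0.975])
print(f"accuracy: {correct / n:.2f}, 95% interval: [{lo:.3f}, {hi:.3f}]")
```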
This, along with the CLT's ignorance of the typically binary nature of eval data (correct/incorrect responses to each question), leads to poor error bars that collapse to zero width or extend past [0, 1].
March 6, 2025 at 3:00 PM
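(For concreteness, the failure mode on made-up numbers: the standard CLT/Wald interval is p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / N).)

```python
import numpy as np

def clt_interval(correct: int, n: int) -> tuple[float, float]:
    """Standard CLT (Wald) interval for a binary accuracy."""
    p_hat = correct / n
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# A model that answers every question correctly: zero-width interval.
print(clt_interval(30, 30))   # (1.0, 1.0)

# A near-perfect score on a small benchmark: extends past 1.
print(clt_interval(29, 30))   # approx (0.902, 1.031)
```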
As LLMs get better, the benchmarks used to evaluate their capabilities are getting smaller (and harder), which starts to violate the CLT's large-N assumption. Meanwhile, there are lots of eval settings in which questions aren't IID (e.g. groups of questions in a benchmark that share a common context).
March 6, 2025 at 3:00 PM
Our paper on the best way to add error bars to LLM evals is on arXiv! TL;DR: Avoid the Central Limit Theorem -- there are better, simple Bayesian and frequentist methods you should be using instead.

We also provide a super lightweight library: github.com/sambowyer/baye… 🧵👇
March 6, 2025 at 3:00 PM
I’m teaching a grad course for the first time (a bit terrifying 😅), and I’ve decided to write a short blog post after each lecture highlighting a key takeaway and reflecting on what could be improved.

First one ⬇️
February 21, 2025 at 1:10 PM
A tiny "embers of autoregression" artifact in simple arithmetic

probapproxincorrect.substack.com/p/embers-of-au…
February 8, 2025 at 9:05 PM
✨✨
December 8, 2024 at 2:01 PM
Decent 0-shot performance (conditional on it having been like 100 years since retirement) 😅
November 19, 2024 at 8:34 PM