Clayton Thorrez
cthorrez.bsky.social
Clayton Thorrez
@cthorrez.bsky.social
EsportsBench refreshed with data up through June 2025, over 61k new matches across 20 esports have been recorded in the last 3 months!
huggingface.co/datasets/Esp...
July 16, 2025 at 6:31 AM
Extremely excited to announce that I've joined @lmarena.bsky.social !

For years I've been working in LLMs for my job, and hacking on rankings and ratings for fun, beyond thrilled to be able to join this project at the intersection!
July 15, 2025 at 3:36 AM
Just ran into Simpsons paradox in the wild for the first time lol. Was looking at some data and was like "that doesn't look right all the means went up when all I did was assign groups differently, this is like Simpson's paradox or something lol"
July 10, 2025 at 4:08 AM
if you fork any of my GitHub repos I WILL add you on LinkedIn.

There are so few people actively working on rating system stuff and I want to talk to all of them
July 3, 2025 at 10:31 PM
I just realized MSI is going on and in Vancouver so got a ticket for tomorrow lol.

Any on esports/machine learning people going?
July 3, 2025 at 8:57 PM
I love it when the same notation can mean the *exact* opposite thing when used by different authors...

Should "A ≻ B" mean:
"A is preferred to B (higher rating)"
"B is preferred to A (lower rank number)"?

arxiv.org/pdf/2411.049...
www.tandfonline.com/doi/full/10....
July 3, 2025 at 6:56 PM
if you have a really small network, and a really small dataset, is it possible to fuse an entire transformer training loop into a single kernel?
July 3, 2025 at 3:13 PM
It turns out adding sum to 0 constraints on some parameters is actually a fair but harder than simplex constraints (non negative, sum to 1) in the iterative gradient based optimization setting.

Is there a good equivalent to the softmax truck?
July 2, 2025 at 2:16 AM
Super disappointed by this.

I'm a huge fan of esports, attending numerous events over the past 12 years and watching countless hours on twitch. I would rather see esports shrink down to a grassroots core than get children addicted to gambling.

x.com/riotgames/st...
Riot Games on X: "Why We're Opening Betting Sponsorships in Esports & How We're Doing It Responsibly" / X
Why We're Opening Betting Sponsorships in Esports & How We're Doing It Responsibly
x.com
June 26, 2025 at 11:51 PM
qwen3-235b be like
June 24, 2025 at 9:57 PM
How much value does thinking add to an LLM?

Well for the largest Qwen3, the answer is -28 points

Thinking on academic benchmarks seems to help a lot, I wonder what's going wrong in the arena?

Maybe people can sense the hedging and don't like it, or it poisons its own context with overthinking
June 24, 2025 at 9:56 PM
there's a lotta with the last name Ferguson

not a lotta people with the first name Fergus
June 19, 2025 at 6:04 AM
Mark Glickman is on a roll now! 2 Paper in two weeks
This time extending the stength dependent draw model to the online setting for use in dynamic rating systems.

Haven't read the whole thing but it looks to contain some cool approximation tricks for the posterior

arxiv.org/abs/2506.11354
Rating competitors in games with strength-dependent tie probabilities
Competitor rating systems for head-to-head games are typically used to measure playing strength from game outcomes. Ratings computed from these systems are often used to select top competitors for eli...
arxiv.org
June 19, 2025 at 4:01 AM
When gemini writes code in an artifact window, there are 9 buttons on the UI

None of them are to copy the code
June 14, 2025 at 5:33 AM
I want to train an 16 billion parameter model.

Specifically, a 16 billion parameter TrueSkill model which fits a skill mean and variance for each of the 8 billion people on earth.

But in my quest to scale rating systems, I guess I start with lichess, with 6B games and a measly 20M unique players
June 9, 2025 at 5:49 AM
🚨NEW MARK GLICKMAN PAPER🚨
Paired comparison models with strength-dependent ties and order effects
arxiv.org/abs/2505.24783

Basically what the title says, extending Bradley-Terry with ties, home field advantage, and allowing those factors to depend on how strong the competitors are.
Paired comparison models with strength-dependent ties and order effects
Paired comparison models, such as the Bradley-Terry (1952) model and its variants, are commonly used to measure competitor strength in games and sports. Extensions have been proposed to account for or...
arxiv.org
June 5, 2025 at 9:55 PM
Fear leads to anger, anger leads to Rēvolfed, Rēvolfed is the path the dark side
a friend of mine shared this ai-generated "emotion wheel" and unfortunately i have been laughing my ass off at it for like 15 minutes now. today i am feeling Fnliinneon
June 5, 2025 at 4:24 PM
Anyone know how to report a bug in Google Scholar? Online seems like there is not great public support. It's not my paper just one I reference a lot.

Google thinks TrueSkill is in Russian!
scholar.google.com/scholar?hl=e...
June 5, 2025 at 4:22 PM
This is the post/thread that made me believe bsky has a future
Are all muppets the same species just with wildly different phenotypes like dogs or are there a lot of different species of muppet
June 4, 2025 at 3:42 PM
Question for my rating systems peeps: what does the RD in Glicko, or the sigma in TrueSkill represent? How do you interpret that number?
June 3, 2025 at 7:52 PM
Is this like a numerical issue on wolfram or am I missing something? This should be 0 right?
www.wolframalpha.com/input?i=%281...
June 1, 2025 at 11:05 PM
Ok I've officially had an idea for a modification of Elo that is consistently an improvement over base Elo averged over 20 datasets with significant baseline hyperparameter tuning.

So I've passed the state of the art form the 1960's. The bad news is that it still gets mogged by Glicko.
May 31, 2025 at 6:09 PM
look I'm a big fan of WSL, and I used it for all of my side projects, but it also has a enough issues to require me to make this powershell script lol
May 31, 2025 at 6:23 AM
Interesting little finding, in Elo there are basically 2 hyperparameters, k which is the learning rate/step size, and scale, which is the temperature of the softmax. Most people understand k, high k means big rating changes after wins and losses, low k means small updates.
May 31, 2025 at 1:14 AM
Pretty cool that I wrote the currently deployed ranking code for a company now valued at 600 million dollars 🤯🤯🤯

github.com/lm-sys/FastC...

techcrunch.com/2025/05/21/l...
Accelerate Bradley Terry MLE model fitting by cthorrez · Pull Request #3523 · lm-sys/FastChat
Why are these changes needed? The bootstrap Bradley Terry model takes upwards of 15 minutes to run for 100 samples. This is costly on resources and hinders experiments such as studying hyperparamet...
github.com
May 22, 2025 at 8:53 PM