Esports stuff for fun:
https://cthorrez.github.io/riix/riix.html
https://huggingface.co/datasets/EsportsBench/EsportsBench
For years I've been working on LLMs for my job and hacking on rankings and ratings for fun. Beyond thrilled to be able to join this project at the intersection!
There are so few people actively working on rating system stuff and I want to talk to all of them
Any esports/machine learning people going?
Should "A ≻ B" mean:
"A is preferred to B (higher rating)"
"B is preferred to A (lower rank number)"?
arxiv.org/pdf/2411.049...
www.tandfonline.com/doi/full/10....
Is there a good equivalent to the softmax trick?
I'm a huge fan of esports, attending numerous events over the past 12 years and watching countless hours on twitch. I would rather see esports shrink down to a grassroots core than get children addicted to gambling.
x.com/riotgames/st...
Well for the largest Qwen3, the answer is -28 points
Thinking on academic benchmarks seems to help a lot, I wonder what's going wrong in the arena?
Maybe people can sense the hedging and don't like it, or it poisons its own context with overthinking
not a lotta people with the first name Fergus
This time extending the strength-dependent draw model to the online setting for use in dynamic rating systems.
Haven't read the whole thing, but it looks to contain some cool approximation tricks for the posterior.
arxiv.org/abs/2506.11354
None of them are to copy the code
Specifically, a 16 billion parameter TrueSkill model which fits a skill mean and variance for each of the 8 billion people on earth.
But in my quest to scale rating systems, I guess I'll start with lichess, with 6B games and a measly 20M unique players.
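A quick back-of-the-envelope check of those sizes, as a sketch. The 4 bytes per parameter (float32) is my assumption, and `rating_table_bytes` is a made-up helper name, not anything from TrueSkill itself.

```python
# Rough size of a per-player (mean, variance) rating table.
# ASSUMPTION: each parameter is stored as a 4-byte float32.
PEOPLE = 8_000_000_000        # everyone on earth, per the post
LICHESS_PLAYERS = 20_000_000  # unique lichess players, per the post
PARAMS_PER_PLAYER = 2         # skill mean and skill variance


def rating_table_bytes(n_players: int, bytes_per_param: int = 4) -> int:
    """Bytes needed to store a mean and a variance for every player."""
    return n_players * PARAMS_PER_PLAYER * bytes_per_param


print("parameters, everyone:", PEOPLE * PARAMS_PER_PLAYER)
print("bytes, everyone:", rating_table_bytes(PEOPLE))
print("bytes, lichess:", rating_table_bytes(LICHESS_PLAYERS))
```

Under that assumption the whole-planet table is tens of gigabytes, while the lichess table fits comfortably in memory on a laptop.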
Paired comparison models with strength-dependent ties and order effects
arxiv.org/abs/2505.24783
Basically what the title says, extending Bradley-Terry with ties, home field advantage, and allowing those factors to depend on how strong the competitors are.
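For a feel of the starting point, here is a minimal sketch of the classical Davidson extension of Bradley-Terry with a tie parameter and a constant order (home) effect. This is not the paper's model: here `nu` and `home_adv` are fixed constants, whereas the paper lets those factors depend on competitor strength. The function name and parameter names are mine.

```python
import math


def davidson_probs(r_home: float, r_away: float,
                   nu: float = 0.5, home_adv: float = 0.0):
    """Return (P(home wins), P(tie), P(away wins)).

    r_home, r_away: log-strengths (Bradley-Terry ratings, natural-log scale).
    nu: tie propensity; nu = 0 recovers plain Bradley-Terry.
    home_adv: constant bonus added to the home side's log-strength.
    """
    p_home = math.exp(r_home + home_adv)
    p_away = math.exp(r_away)
    # Davidson's tie term is proportional to the geometric mean of strengths.
    tie = nu * math.sqrt(p_home * p_away)
    total = p_home + p_away + tie
    return p_home / total, tie / total, p_away / total


win, draw, loss = davidson_probs(0.3, 0.0, nu=0.8, home_adv=0.1)
print(win, draw, loss)
```

With equal ratings, `nu=1.0`, and no home advantage, the three outcomes are equally likely, which makes for an easy sanity check.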
Google thinks TrueSkill is in Russian!
scholar.google.com/scholar?hl=e...
www.wolframalpha.com/input?i=%281...
So I've passed the state of the art from the 1960s. The bad news is that it still gets mogged by Glicko.
github.com/lm-sys/FastC...
techcrunch.com/2025/05/21/l...