Nick Tomlin
@nickatomlin.bsky.social
Incoming assistant professor at TTIC, current faculty fellow at NYU, and previous PhD student at Berkeley. Natural language processing. He/him.

🌐 nickatomlin.github.io
CRA changed their interface and it's much harder to browse now for some reason...

Last year, I ended up just making a list of schools/departments that I wanted to apply to and individually searching through each of their websites for job postings
October 12, 2025 at 11:16 PM
Haha main reason for using Gym was that we wanted a way to automatically evaluate models against trained RL agents. Doing the full arena-style evaluation on reasoning models gets really expensive

It also helps that current LLMs are really good at generating functional Gym code
May 14, 2025 at 4:36 PM
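For context, the two-player setup described in the post above can be sketched as a minimal Gym-style environment. This is a hypothetical illustration (a trivial "race to 10" game), not code from the actual benchmark; it follows the gymnasium 0.26+ `reset`/`step` return shape without depending on the library:

```python
# Hypothetical two-player game, sketched in the Gym API shape:
# players alternate adding 1 or 2 to a shared total, and whoever
# reaches 10 first wins. reset() -> (obs, info);
# step() -> (obs, reward, terminated, truncated, info).

class RaceToTenEnv:
    TARGET = 10

    def reset(self, seed=None):
        self.total = 0
        self.current_player = 0  # player 0 moves first
        obs = (self.total, self.current_player)
        return obs, {}

    def step(self, action):
        # action is 1 or 2: how much to add to the shared total
        assert action in (1, 2), "illegal action"
        self.total += action
        terminated = self.total >= self.TARGET
        # +1 reward goes to the player who just moved, if they won
        reward = 1.0 if terminated else 0.0
        self.current_player = 1 - self.current_player
        obs = (self.total, self.current_player)
        return obs, reward, terminated, False, {}
```

An arena-style evaluation then just alternates querying two policies (e.g., an LLM and a trained RL agent) for actions until `terminated` is true, which is the kind of loop that's cheap to run once the environment exists.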
I think in the short term that’s reasonable, e.g., current models can play chess but they definitely can’t understand chess variants

In the long term, I suspect there’s more risk of over-optimizing to those specific games, so the hope is that our approach is a bit more future-proof
May 14, 2025 at 4:29 PM
For anyone interested in evaluating or expanding on this benchmark, we have a nice code release here: github.com/vivek3141/gg...
GitHub - vivek3141/gg-bench: Measuring General Intelligence With Generated Games (Preprint)
May 13, 2025 at 9:30 PM
This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games
May 13, 2025 at 9:30 PM
We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents in self-play with RL and then evaluate whether LLMs can beat the RL baselines
May 13, 2025 at 9:30 PM
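The evaluate-against-baselines step above can be sketched as a win-rate loop: play a candidate policy against a frozen baseline over many episodes, alternating who moves first. Everything here is a hedged stand-in, not the benchmark's actual interfaces: the game is a toy "race to 10" (add 1 or 2; first to 10 wins), and the two policies are simple functions rather than an LLM and a trained RL agent:

```python
# Toy two-player environment standing in for a generated game.
class RaceToTenEnv:
    def reset(self):
        self.total, self.current_player = 0, 0
        return (self.total, self.current_player), {}

    def step(self, action):
        self.total += action
        terminated = self.total >= 10
        self.current_player = 1 - self.current_player
        obs = (self.total, self.current_player)
        return obs, (1.0 if terminated else 0.0), terminated, False, {}

def optimal_policy(total):
    # Move to a total t' with t' % 3 == 1 when possible (winning line);
    # otherwise fall back to the minimal move.
    for action in (1, 2):
        if (total + action) % 3 == 1:
            return action
    return 1

def greedy_policy(total):
    return 2  # always takes the bigger step; beatable

def play_match(policy_a, policy_b, env):
    # policy_a moves first; returns the seat index (0 or 1) of the winner.
    policies = (policy_a, policy_b)
    obs, _ = env.reset()
    terminated = False
    while not terminated:
        total, player = obs
        obs, reward, terminated, _, _ = env.step(policies[player](total))
    # `player` is whoever just moved and collected the winning reward
    return player

def win_rate(candidate, baseline, env, episodes=10):
    wins = 0
    for i in range(episodes):
        # Alternate seats to remove first-move advantage.
        if i % 2 == 0:
            wins += play_match(candidate, baseline, env) == 0
        else:
            wins += play_match(baseline, candidate, env) == 1
    return wins / episodes
```

In the real setup, `candidate` would query an LLM for a legal move and `baseline` would be the self-play RL agent, but the scoring loop has the same shape.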