Nick Tomlin
@nickatomlin.bsky.social
Incoming assistant professor at TTIC, current faculty fellow at NYU, and previous PhD student at Berkeley. Natural language processing. He/him.

🌐 nickatomlin.github.io
CRA changed their interface and it's much harder to browse now for some reason...

Last year, I ended up just making a list of schools/departments that I wanted to apply to and individually searching through each of their websites for job postings
October 12, 2025 at 11:16 PM
Haha main reason for using Gym was that we wanted a way to automatically evaluate models against trained RL agents. Doing the full arena-style evaluation on reasoning models gets really expensive

It also helps that current LLMs are really good at generating functional Gym code
May 14, 2025 at 4:36 PM
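For context, the two-player setup described in the post above can be sketched as a minimal Gym-style environment. This is a hypothetical illustration (a trivial "race to 10" game), not code from the actual benchmark; it follows the gymnasium 0.26+ `reset`/`step` return shape without depending on the library:

```python
# Hypothetical two-player game, sketched in the Gym API shape:
# players alternate adding 1 or 2 to a shared total, and whoever
# reaches 10 first wins. reset() -> (obs, info);
# step() -> (obs, reward, terminated, truncated, info).

class RaceToTenEnv:
    TARGET = 10

    def reset(self, seed=None):
        self.total = 0
        self.current_player = 0  # player 0 moves first
        obs = (self.total, self.current_player)
        return obs, {}

    def step(self, action):
        # action is 1 or 2: how much to add to the shared total
        assert action in (1, 2), "illegal action"
        self.total += action
        terminated = self.total >= self.TARGET
        # +1 reward goes to the player who just moved, if they won
        reward = 1.0 if terminated else 0.0
        self.current_player = 1 - self.current_player
        obs = (self.total, self.current_player)
        return obs, reward, terminated, False, {}
```

An arena-style evaluation then just alternates querying two policies (e.g., an LLM and a trained RL agent) for actions until `terminated` is true, which is the kind of loop that's cheap to run once the environment exists.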
I think in the short term that’s reasonable, e.g., current models can play chess but they definitely can’t understand chess variants

In the long term, I suspect there’s more risk of over-optimizing to those specific games, so the hope is that our approach is a bit more future-proof
May 14, 2025 at 4:29 PM
For anyone interested in evaluating or expanding on this benchmark, we have a nice code release here: github.com/vivek3141/gg...
GitHub - vivek3141/gg-bench: Measuring General Intelligence With Generated Games (Preprint)
May 13, 2025 at 9:30 PM
This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games
May 13, 2025 at 9:30 PM
We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents in self-play with RL and then evaluate whether LLMs can beat the RL baselines
May 13, 2025 at 9:30 PM
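The evaluate-against-baselines step above can be sketched as a win-rate loop: play a candidate policy against a frozen baseline over many episodes, alternating who moves first. Everything here is a hedged stand-in, not the benchmark's actual interfaces: the game is a toy "race to 10" (add 1 or 2; first to 10 wins), and the two policies are simple functions rather than an LLM and a trained RL agent:

```python
# Toy two-player environment standing in for a generated game.
class RaceToTenEnv:
    def reset(self):
        self.total, self.current_player = 0, 0
        return (self.total, self.current_player), {}

    def step(self, action):
        self.total += action
        terminated = self.total >= 10
        self.current_player = 1 - self.current_player
        obs = (self.total, self.current_player)
        return obs, (1.0 if terminated else 0.0), terminated, False, {}

def optimal_policy(total):
    # Move to a total t' with t' % 3 == 1 when possible (winning line);
    # otherwise fall back to the minimal move.
    for action in (1, 2):
        if (total + action) % 3 == 1:
            return action
    return 1

def greedy_policy(total):
    return 2  # always takes the bigger step; beatable

def play_match(policy_a, policy_b, env):
    # policy_a moves first; returns the seat index (0 or 1) of the winner.
    policies = (policy_a, policy_b)
    obs, _ = env.reset()
    terminated = False
    while not terminated:
        total, player = obs
        obs, reward, terminated, _, _ = env.step(policies[player](total))
    # `player` is whoever just moved and collected the winning reward
    return player

def win_rate(candidate, baseline, env, episodes=10):
    wins = 0
    for i in range(episodes):
        # Alternate seats to remove first-move advantage.
        if i % 2 == 0:
            wins += play_match(candidate, baseline, env) == 0
        else:
            wins += play_match(baseline, candidate, env) == 1
    return wins / episodes
```

In the real setup, `candidate` would query an LLM for a legal move and `baseline` would be the self-play RL agent, but the scoring loop has the same shape.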