Noam Brown
@polynoamial.bsky.social
Researching reasoning at OpenAI | Co-created Libratus/Pluribus superhuman poker AIs, CICERO Diplomacy AI, and OpenAI o-series / 🍓
AI researchers will literally negotiate $100 million comp packages by themselves but they won’t play poker for more than $50 buy-ins
August 30, 2025 at 11:44 AM
GPT-5 Thinking isn’t perfect, but it’s the first AI model I can trust more than many common sources of truth on the internet.
August 25, 2025 at 9:39 AM
People often ask me: will reasoning models ever move beyond easily verifiable tasks? I tell them we already have empirical proof that they can, and we released a product around it: OpenAI Deep Research.
May 13, 2025 at 5:46 PM
This METR "doubling every ∼7 mo" plot keeps popping up. It's striking, but let's be precise about what's measured: self‑contained code and ML tasks.

I think agentic AI may move faster than the METR trend, but we should report the data faithfully rather than over‑generalize to fit a belief we hold.
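The "doubling every ~7 months" claim is just compound growth, which is easy to sanity-check. A minimal sketch, assuming a fixed 7-month doubling time and a purely illustrative starting horizon of 60 minutes (both numbers are assumptions for the example, not measurements):

```python
# Toy extrapolation of a METR-style trend: the length of task an AI can
# complete doubles every ~7 months. Starting horizon is hypothetical.
DOUBLING_MONTHS = 7


def horizon_after(months: float, start_minutes: float = 60.0) -> float:
    """Task horizon after `months` of progress, given a fixed doubling time."""
    return start_minutes * 2 ** (months / DOUBLING_MONTHS)


# Two doublings (14 months) quadruple the horizon: 60 min -> 240 min.
print(horizon_after(14))  # 240.0
```

The point of the post stands either way: the curve is steep, but it was fit on self-contained code and ML tasks, so extrapolating it to all agentic work is an extra assumption.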
May 11, 2025 at 5:48 PM
I recently made this plot for a talk I gave on AI progress and it helped me appreciate how quickly AI models are improving.

I know there are still a lot of benchmarks where progress is flat, but progress on Codeforces was quite flat for a long time too.
May 3, 2025 at 7:37 PM
Today, we're releasing OpenAI o3/o4-mini. The eval numbers are SOTA (2700 Elo is among the top 200 competition coders).

But what I'm most excited about is the stuff we can't benchmark. I expect o3/o4-mini will aid scientists in their research and I'm excited to see what they do!
April 16, 2025 at 5:33 PM
I worked in quant trading for a year after undergrad, but didn't want my lifetime contribution to humanity to be making equity markets marginally more efficient. Taking a paycut to pursue AI research was my best life decision. Today, you don't even need to take a paycut to do it.
April 15, 2025 at 2:03 PM
Our latest OpenAI model in the API, GPT-4.1, achieves 55% on SWE-Bench Verified *without being a reasoning model*. It also has 1M token context. Michelle Pokrass and team did an amazing job on this! Blog post with more details: openai.com/index/gpt-4-1/

(New reasoning models coming soon too.)
April 14, 2025 at 5:40 PM
Today, OpenAI is starting to roll out a new memory feature to ChatGPT. It signals a shift from episodic interactions (call center) to evolving ones (colleague or friend).

Still a lot of research to do but it's a step toward fundamentally changing how we interact with LLMs openai.com/index/memory...
April 10, 2025 at 5:47 PM
Listening to Reid Hoffman on @economist.com argue it's fine if AI replaces all jobs because we'll live like medieval nobility with "AI peasants" doing all the work. Weird choice of analogy. Remind me, Reid, how did that turn out for the nobles?
March 21, 2025 at 4:27 PM
AI pioneer Richard Sutton just won the Turing Award. In 2019, Rich wrote a powerful essay that distills 75 years of AI into a simple "Bitter Lesson": general methods that scale with data and compute ultimately win. With the rise of AI agents it's an important lesson to keep in mind: bit.ly/4iLaTlh
March 6, 2025 at 5:13 PM
Scaling pretraining and scaling thinking are two different dimensions of improvement. They are complementary, not in competition.
February 27, 2025 at 8:33 PM
LLM evals are slow to adapt. MMLU/GSM8K continued to be reported long after they were obsolete. I think the next thing to go away will be comparing models on evals by a single number; intelligence/$ is a much better metric. For example, I loved this plot from the o1-mini blog post: openai.com/index/openai...
February 21, 2025 at 2:56 AM
o3-mini is the first LLM released that can consistently play tic-tac-toe well.

The summarized CoT is pretty unhinged, but you can see on the right that by the end it figures it out.
February 8, 2025 at 10:31 PM
There's a lot of talk of LLMs "saturating all the evals" but there's plenty of evals people could make where LLMs would do poorly:
-Beat a Zelda game
-Make a profit in a prediction market
-Write a stand-up set that's original and funny

I'm bullish on AI, but we're far from done.
February 6, 2025 at 7:21 PM