The problem is R&D spend chasing scaling laws that continue to hold but also continue to have extreme diminishing returns.
Many have realized this by now (xAI excepted).
This tradeoff comes directly from the economics of inference.
2/6
Also, notice that switching from H100 to GB200 with fancy interconnects (that's the NVL72, a 72-GPU NVLink domain) gives a huge performance boost.
For low speeds, the GB200 can get you ~8x lower energy use.
Across recent generations, chip designers have gotten ~3x improvements in energy efficiency.
Here are hyperscaler costs for serving DeepSeek-R1-0528 with 1K input tokens and 1K output.
We'll get to the GB200 in a second, but notice how everything else is quite similar in price.
H200 scales better at high interactivity, though.
Say you need 5x the tokens for optimal thinking; interactivity must go up 5x for users to enjoy the same response time.
A one-time jump in the point of diminishing returns.
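The arithmetic behind that tradeoff can be sketched in a few lines. The token counts and per-user speeds below are illustrative numbers, not figures from the post:

```python
# Response time a user experiences is (output tokens) / (interactivity,
# i.e. tokens per second per user). Numbers below are illustrative only.

def response_time(output_tokens: int, tokens_per_sec: float) -> float:
    """Seconds a user waits for the full response."""
    return output_tokens / tokens_per_sec

base     = response_time(1_000, 50)   # 1K tokens at 50 tok/s  -> 20 s
thinking = response_time(5_000, 50)   # 5x tokens, same speed  -> 100 s
matched  = response_time(5_000, 250)  # 5x tokens, 5x speed    -> 20 s again

assert matched == base  # 5x the tokens needs 5x the interactivity
```

Since serving costs rise steeply with interactivity, that 5x speed requirement is exactly the one-time jump in the point of diminishing returns.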
You can be cheap and slow or fast and expensive.
With some work, sampling and picking among LLM responses could give current models a performance bump.
arxiv.org/pdf/2509.06870
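As a rough sketch of what "sampling and picking" means, here is a generic best-of-n loop. `generate` and `score` are hypothetical placeholders (a temperature-sampled LLM call and a reward model or verifier), not the linked paper's actual method:

```python
import random

def generate(prompt: str) -> str:
    # Placeholder sampler: in practice, a temperature > 0 LLM call.
    return f"{prompt}-draft-{random.randint(0, 9)}"

def score(response: str) -> float:
    # Placeholder judge: in practice, a reward model or verifier.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Draw n candidate responses and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

The performance bump comes entirely from the selector: if `score` is no better than chance, best-of-n nets to ordinary sampling.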
You see, your enemies can do the *exact same thing*, so it nets to zero.
To win you need asymmetric weapons that point only towards truth. Reasoned debate.
slatestarcodex.com/2017/03/24/g...
From one of my favorite SSC posts:
CAIS poses a different set of risks best addressed by governance and defensive technology.
www.greaterwrong.com/posts/8e3676...
This market had 85% confidence it would be solved this year, but that fell as time went on:
manifold.markets/Austin/will-...
Alignment for simple models in simple environments is looking pretty good.
xcancel.com/Turn_Trout/s...
Mellow heuristic leans against Shear.
bsky.app/profile/hars...
Quentin's speedup may be a result of chance in addition to his good habits.
BUT performance on other number formats (e.g. FP4) has improved a lot.
The consistency is remarkable. The US has had 2% per capita GDP growth for the last two centuries.
www.nber.org/system/files...
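To see why two centuries of 2% is so remarkable, a quick back-of-envelope on the compounding (illustrative arithmetic only):

```python
# 2% annual per-capita growth, compounded over 200 years,
# multiplies income roughly 52-fold.
growth = 1.02
years = 200
multiple = growth ** years
print(round(multiple, 1))  # ~52.5
```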
Their example decentralized system trains as fast (wall clock time) as an example centralized system.
Wonder what the utilization looks like though.
They claim that the row space of the matrix AB is contained in the row space of B.
So if weights have a small rank (see prev fig), then the activations should too.
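A tiny sanity check of the row-space claim, using pure Python and illustrative matrices: with a rank-1 B, every row of AB must stay in B's one-dimensional row space, whatever A is.

```python
# rowspace(AB) ⊆ rowspace(B): B has rank 1 (every row a multiple of
# [1, 2]), so every row of A @ B must also be a multiple of [1, 2].

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[3, 1], [0, 2]]
B = [[1, 2], [2, 4]]  # rank 1: rows are 1*[1, 2] and 2*[1, 2]

AB = matmul(A, B)
for row in AB:
    # each row [x, y] satisfies y == 2*x, i.e. it lies in span([1, 2])
    assert row[1] == 2 * row[0]
```

The same containment is what licenses the jump from low-rank weights to low-rank activations.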