Sam Harsimony
@harsimony.bsky.social
I write about opportunities in science, space, and policy here: https://splittinginfinity.substack.com/
AI companies can straightforwardly avoid a bubble. Their current models are profitable!

The problem is R&D spend chasing scaling laws that continue to hold and continue to have extreme diminishing returns.

Though many have realized this (except xAI).
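For intuition, a minimal sketch of what extreme diminishing returns looks like under a Chinchilla-style power law; the constant and exponent here are illustrative assumptions, not any lab's fitted values:

```python
# Illustrative only: loss ~ A * C^(-alpha) with made-up constants,
# to show how the gain per doubling of compute keeps shrinking.
A, alpha = 10.0, 0.05

def loss(compute_flops: float) -> float:
    return A * compute_flops ** (-alpha)

prev = loss(1e21)
for doublings in range(1, 6):
    cur = loss(1e21 * 2 ** doublings)
    print(f"after {doublings} doubling(s): loss {cur:.3f}, improvement {prev - cur:.4f}")
    prev = cur
```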
November 14, 2025 at 5:59 PM
The two types are "low-speed, low-cost" and "high-speed, high-cost"

This tradeoff comes directly from the economics of inference.

2/6
November 13, 2025 at 4:02 PM
They are LoRA-pilled as well:
October 16, 2025 at 7:43 PM
Their figure 24 confirms what we've been talking about: more GPUs means more performance.

Also, notice that switching from H100 to GB200 with fancy interconnects (that's the NVL72 rack, which uses NVLink) gives a huge performance boost.
October 16, 2025 at 7:43 PM
" ... as we increase the number of nodes involved (the EP number), the per node performance increases."
October 16, 2025 at 7:43 PM
The section on energy efficiency (tokens/s/MW) highlights why data center energy use isn't a big concern.

For low speeds, the GB200 can get you ~8x lower energy use.

Across recent generations, chip designers have gotten ~3x improvements in energy efficiency.
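Rough arithmetic on what that means per token (the throughput numbers below are hypothetical placeholders, not figures from the report):

```python
# Energy per token from the tokens/s/MW metric. Placeholder numbers, not measured data.
WATTS_PER_MW = 1e6

def joules_per_token(tokens_per_s_per_mw: float) -> float:
    return WATTS_PER_MW / tokens_per_s_per_mw

baseline = 1.0e5           # hypothetical H100-class tokens/s/MW
improved = 8 * baseline    # the ~8x low-speed GB200 improvement mentioned above

print(joules_per_token(baseline))   # 10.0 J/token
print(joules_per_token(improved))   # 1.25 J/token
```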
October 14, 2025 at 7:22 PM
Let's look at my preferred metric, the cost per million tokens.

Here are hyperscaler costs for serving DeepSeek-R1-0528 with 1K input tokens and 1K output.

We'll get to the GB200 in a second, but notice how everything else is quite similar in price.

H200 scales better for high interactivity tho.
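For reference, the back-of-envelope formula behind that metric (the node prices and throughputs here are hypothetical, not the chart's data):

```python
# $/1M tokens = hourly node cost / tokens served per hour, scaled to a million tokens.
def cost_per_million_tokens(node_cost_per_hour: float, node_tokens_per_second: float) -> float:
    tokens_per_hour = node_tokens_per_second * 3600
    return node_cost_per_hour / tokens_per_hour * 1e6

# Hypothetical examples: a big NVL72-class rack vs. a smaller 8-GPU node.
print(cost_per_million_tokens(98.0, 30_000))   # ~$0.91 per 1M tokens
print(cost_per_million_tokens(25.0, 6_000))    # ~$1.16 per 1M tokens
```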
October 14, 2025 at 7:22 PM
We know gains from reasoning diminish sharply with more tokens. There's probably a fixed amount of thinking that is optimal.

Say you need 5x the tokens for optimal thinking; interactivity must go up 5x for users to enjoy the same response time.

A one-time jump in the point of diminishing returns.
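The arithmetic behind that, as a tiny sketch (the token counts and speeds are made up):

```python
# Response time = output tokens / per-user interactivity (tokens per second per user).
def response_time_s(output_tokens: int, tokens_per_s_per_user: float) -> float:
    return output_tokens / tokens_per_s_per_user

print(response_time_s(1_000, 25))    # 40 s
print(response_time_s(5_000, 25))    # 200 s: 5x the thinking at the same speed
print(response_time_s(5_000, 125))   # 40 s again, but only at 5x the interactivity
```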
October 14, 2025 at 7:22 PM
I think this is one of the key charts: the overall datacenter cost for 1M tokens vs. how many tokens per second each user enjoys, for different GPUs.
October 14, 2025 at 7:22 PM
The key tradeoff: batching more user requests into a single run (i.e. one pass of loading the weights onto your GPUs) means the GPUs run more efficiently, but users have to wait longer.

You can be cheap and slow or fast and expensive.
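A toy model of that tradeoff (the cost, saturation throughput, and curve shape are illustrative assumptions, not measured numbers):

```python
# Bigger batches amortize each pass over the weights, so cost per token falls,
# but node throughput is shared across more users, so each user sees fewer tokens/s.
NODE_COST_PER_HOUR = 50.0        # hypothetical
MAX_NODE_TOKENS_PER_S = 20_000   # hypothetical saturation throughput

def node_throughput(batch_size: int) -> float:
    # diminishing returns as the node saturates
    return MAX_NODE_TOKENS_PER_S * batch_size / (batch_size + 32)

for batch in (1, 8, 64, 256):
    total = node_throughput(batch)
    per_user = total / batch
    cost = NODE_COST_PER_HOUR / (total * 3600) * 1e6
    print(f"batch {batch:4d}: {per_user:7.1f} tok/s per user, ${cost:6.2f} per 1M tokens")
```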
October 14, 2025 at 7:22 PM
Curve for fentanyl ODs looks similar, esp. considering the age of typical users.
October 3, 2025 at 2:52 PM
Trains a model to choose among LLM responses to get better performance.

With some work, sampling and picking among LLM responses could give current models a performance bump.

arxiv.org/pdf/2509.06870
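A minimal best-of-N sketch of the general idea (not the paper's exact method; `generate` and `score` are hypothetical stand-ins for the LLM and the learned selector):

```python
import random

def generate(prompt: str) -> str:
    # stand-in for sampling one LLM response
    return f"candidate {random.randint(0, 999)} for {prompt!r}"

def score(prompt: str, response: str) -> float:
    # stand-in for the trained selector / reward model
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: score(prompt, r))

print(best_of_n("prove the triangle inequality"))
```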
September 15, 2025 at 6:33 PM
But can't you convince a few people using impassioned pleas, rhetorical tricks, and lies? Not really.

You see, your enemies can do the *exact same thing*, so it nets to zero.

To win you need asymmetric weapons that point only towards truth. Reasoned debate.

slatestarcodex.com/2017/03/24/g...
August 21, 2025 at 4:54 PM
"Soft" tactics like reasoned debate and persuasion look superficially like they are losing, yet over the long run have come to dominate everything around us. Particularly for the cause of classical Liberalism.

From one of my favorite SSC posts:
August 21, 2025 at 4:54 PM
The paradigm of specialized models distilled from larger models was a predictable result. Points towards a world with "comprehensive AI services", not FOOM (for now).

CAIS poses a different set of risks best addressed by governance and defensive technology.

www.greaterwrong.com/posts/8e3676...
August 10, 2025 at 4:24 PM
OpenAI's claim of a model reaching IMO Gold comes about 0.5-1 years earlier than expected.

This market had 85% confidence it would be solved this year but that fell as time went on:

manifold.markets/Austin/will-...
July 19, 2025 at 10:38 PM
Oh neat. Want to remove some behavior or data from a model? Simply suppress or hide that output and train a fresh model on the clean outputs.

Alignment for simple models in simple environments is looking pretty good.

xcancel.com/Turn_Trout/s...
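The filtering step, as I understand it, in a minimal sketch (my paraphrase of the thread, with a hypothetical `shows_unwanted_behavior` filter and toy data):

```python
# Drop transcripts that show the unwanted behavior, then train/distill a fresh
# model on what's left. The data and filter below are toy placeholders.
transcripts = [
    {"prompt": "question A", "output": "helpful answer"},
    {"prompt": "question B", "output": "answer containing the unwanted behavior"},
]

def shows_unwanted_behavior(example: dict) -> bool:
    return "unwanted behavior" in example["output"]

clean = [ex for ex in transcripts if not shows_unwanted_behavior(ex)]
print(f"kept {len(clean)} of {len(transcripts)} examples for the fresh model")
```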
July 17, 2025 at 9:59 PM
Becker points out that outcomes didn't improve as developers worked through more problems, which suggests their lack of experience is because Cursor wasn't useful to these devs previously.

Mellow heuristic leans against Shear.

bsky.app/profile/hars...
July 15, 2025 at 8:37 PM
First, I just realized all their error bars overlap with zero. The real headline should be "LLMs offer zero speedup," not a 20% slowdown.

Quentin's speedup may be a result of chance in addition to his good habits.
July 13, 2025 at 2:57 PM
This post prompted me to look at the price-performance of GPUs. Apparently it's stagnated since 2018??

BUT performance on other number formats (e.g. FP4) has improved a lot.
June 30, 2025 at 5:37 PM
Oh that link was for the H100 performance number! For the meteor number you want table 1 in Supplementary Info S9:
June 25, 2025 at 9:35 PM
If true, then the reasons why the human economy approaches steady growth should also apply to AI.

The consistency is remarkable. The US has had 2% per capita GDP growth for the last two centuries.

www.nber.org/system/files...
June 23, 2025 at 5:07 PM
Now I'm going to skip some of their technical details and theorems and jump to the experiments.

Their example decentralized system trains as fast (wall clock time) as an example centralized system.

Wonder what the utilization looks like though.
June 13, 2025 at 3:46 PM
They rearrange the formula for a transformer layer into a constant matrix (that's the PE, TE stuff) and a sum.

They claim that the row space of matrix AB is inside the row space of matrix B.

So if weights have a small rank (see prev fig), then the activations should too.
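A quick numerical check of that containment claim (toy shapes and ranks, nothing from the paper):

```python
import numpy as np

# Rows of A @ B are linear combinations of rows of B, so rank(A @ B) <= rank(B).
rng = np.random.default_rng(0)
B = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 512))  # rank-4 "weights"
A = rng.standard_normal((128, 64))                                 # arbitrary left factor

print(np.linalg.matrix_rank(B))      # 4
print(np.linalg.matrix_rank(A @ B))  # at most 4
```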
June 13, 2025 at 3:46 PM