David Schneider-Joseph
thedavidsj.bsky.social
Working on AI governance. Past: Technology and Security Policy @ RAND re: AI compute; telemetry database lead & first-stage landing software @ SpaceX; AWS; Google.
Steven Pinker on my essay: x.com/sapinker/sta...
August 25, 2025 at 10:24 PM
Claude 4 Opus seems very excitable.
May 23, 2025 at 1:59 AM
Why are modern book covers so bad?
May 11, 2025 at 6:04 AM
Asking Claude important questions about papal succession.
April 23, 2025 at 6:19 AM
Appears the Trump admin is adversarially interpreting the SCOTUS order to “facilitate” the return of Garcia (who, by their own admission, they sent without cause to a Salvadoran prison) to mean they must merely “remove any domestic obstacles”, rather than actually work to secure his return.
April 14, 2025 at 1:00 AM
I haven’t checked these numbers myself, but it appears that the “Tariffs Charged to the US” column in the White House’s new tariff legend is actually just the US trade deficit with that country divided by US imports from that country, floored at 10%. Pretty incredible, really.
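For illustration, a minimal Python sketch of that apparent formula (the dollar figures below are made up, not from the White House table):

```python
# Hypothetical reconstruction of the column: deficit / imports, floored at 10%.
def apparent_tariff_rate(us_imports: float, us_exports: float) -> float:
    deficit = us_imports - us_exports
    return max(deficit / us_imports, 0.10)

# Made-up example: $400B of imports, $120B of exports -> a 70% "tariff charged".
print(f"{apparent_tariff_rate(400e9, 120e9):.0%}")  # 70%
```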
April 2, 2025 at 11:52 PM
Probably the most blatant autobiographical confabulation I’ve seen from Claude.
March 17, 2025 at 12:24 AM
Did you all know that Hawaii was this long?
March 5, 2025 at 3:58 AM
Mostly false, as this applied to wage income but not capital gains. But the affected individuals mostly made their income from capital gains.
January 31, 2025 at 5:34 AM
The key/value decompression matrices are transposed and absorbed into the query and output projection matrices to avoid the cost of decompression, giving an effective head width of 512 for values and 576 for keys (with 64 additional for decoupled positional encoding). 4/6
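A tiny NumPy check of the identity this absorption relies on, qᵀ(W_UK·c) = (W_UKᵀ·q)ᵀ·c: the key decompression matrix can be folded into the query side once, instead of being applied to every cached token. Dimensions are DeepSeek-like but illustrative; the decoupled RoPE path is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_head = 512, 128  # cached latent width and per-head width (illustrative)

W_UK = rng.normal(size=(d_head, d_latent))  # key "decompression" matrix
q = rng.normal(size=d_head)                 # one head's query vector
c = rng.normal(size=d_latent)               # cached compressed latent for one token

# Naive: decompress the key for every cached token, then dot with the query.
score_naive = q @ (W_UK @ c)

# Absorbed: fold W_UK^T into the query once; attention then runs directly
# against the 512-wide cached latents.
score_absorbed = (W_UK.T @ q) @ c

assert np.isclose(score_naive, score_absorbed)
```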
January 30, 2025 at 9:23 PM
R1’s Multi-head Latent Attention (MLA) achieves substantial KV compression (~1/4× the cache of GQA) but requires much more arithmetic (about 4× per head). At large batch sizes and context lengths, this makes it more FLOP-bound than IO-bound, which is unusual for inference. 3/6
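A back-of-envelope consistent with those ratios, using DeepSeek-V3-like widths against an assumed GQA baseline of 8 KV heads × 128 wide (my illustrative numbers, not from the report):

```python
# Cache per token: MLA stores one shared 576-wide vector (512 latent + 64
# decoupled RoPE); the GQA baseline stores separate 128-wide K and V heads.
mla_cache = 512 + 64          # 576 floats/token
gqa_cache = 2 * 8 * 128       # 2048 floats/token
print(mla_cache / gqa_cache)  # ~0.28 -> roughly 1/4x

# Attention FLOPs per query head per cached token (2 FLOPs per multiply-add):
mla_flops = 2 * (576 + 512)   # scores over 576-wide keys, mix of 512-wide values
gqa_flops = 2 * (128 + 128)
print(mla_flops / gqa_flops)  # ~4.25 -> roughly 4x per head
```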
January 30, 2025 at 9:23 PM
Assumptions: Arithmetic in FP8, response lengths of 10k output tokens (Fig 3 from R1 report), serving on H100s or H800s at $2/GPU hour. 2/6
January 30, 2025 at 9:22 PM
Multi-head Latent Attention is essentially the same as Multi-Query Attention (single KV head), but with a larger head width, same vector for key and value, and low-rank factorization of query/output projection matrices (the K, V "decompression matrices" acting as one factor).
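A toy sketch of that view, with the shared latent cache serving as the single “KV head” and K = V (toy dimensions of my choosing; RoPE and other details omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_model, d_latent, seq = 4, 256, 64, 10  # toy sizes, not DeepSeek's

W_DKV = rng.normal(size=(d_latent, d_model)) / d_model**0.5          # shared compression
W_Q = rng.normal(size=(n_heads, d_latent, d_model)) / d_model**0.5   # absorbed query proj.
W_O = rng.normal(size=(n_heads, d_model, d_latent)) / d_latent**0.5  # absorbed output proj.

X = rng.normal(size=(seq, d_model))
C = X @ W_DKV.T                      # (seq, d_latent): the one shared "KV head"

x = X[-1]                            # decode step for the last position
out = np.zeros(d_model)
for h in range(n_heads):
    q = W_Q[h] @ x                   # query mapped into latent space
    s = C @ q / np.sqrt(d_latent)    # keys are the cached latents themselves
    a = np.exp(s - s.max()); a /= a.sum()
    out += W_O[h] @ (a @ C)          # values are the same latents; per-head output
```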
January 29, 2025 at 9:59 PM
I tried this on GPT-4o, o1, o1-mini, o1-pro, Claude 3.5 Sonnet, Claude 3.0 Opus, Gemini 2.0 Experimental Advanced, Gemini 2.0 Flash Thinking Mode, DeepSeek-V3, and DeepSeek-V3 w/DeepThink.

Every "reasoning" model got it right. Every other model got it wrong. Seems notable.
January 17, 2025 at 4:24 AM
And sorting.
January 16, 2025 at 9:12 PM
Some pretty impressive folding abilities on display here with π0 trained with FAST.

This single model can control lots of different robots and responds to language instruction inputs.
January 16, 2025 at 9:12 PM
Very cool demonstration of in-context representation learning. This figure says everything.

arxiv.org/abs/2501.00070
January 5, 2025 at 6:39 PM
Imagine just being named Dr. Science.
December 22, 2024 at 7:49 PM
Stuff like this makes me think we may still have a ways to go.
December 16, 2024 at 8:08 AM
This was a fun experiment. I made a 56.8% virtual return.

elmwealth.com/crystal-ball...
December 15, 2024 at 4:52 PM
In that vein, there’s also this from the BIIB080 Phase 1b trial.
December 3, 2024 at 11:26 AM