Lightnews — Scholar-powered news

Tim Kellogg

@timkellogg.me

sent this to my brother asking, “does this count as wealth redistribution?”

(fun fact: my bro voted for Trump and is also undergoing collapse of the company he’s CEO of due to tariffs)

The Kobeissi Letter
@KobeissiLetter • 8h
BREAKING: President Trump announces that he will be paying a "tariff dividend" of at least $2,000 per person.
Stimulus checks are officially back.

November 9, 2025 at 9:36 PM

Tim Kellogg

@timkellogg.me

The town of German, NY elected 2 positions on write-in ballots alone

1. Superintendent of Highways
2. Town Justice

apparently no one ran

A cropped section of an election results report showing two contests from the town of German.

The first section, titled “Superintendent of Highways - German (Vote for 1)”, lists:
• 85 ballots total (0 over voted ballots, 0 overvotes, 66 undervotes).
• “1 precincts reported out of 1 total.”
• Results: Write-in — 19 votes, 100.00%.
• Total — 19 votes, 100.00%.
• Overvotes — 0.
• Undervotes — 66.

The second section, titled “Town Justice - German (Vote for 1)”, lists:
• 85 ballots total (0 over voted ballots, 0 overvotes, 81 undervotes).
• “1 precincts reported out of 1 total.”
• Results: Write-in — 4 votes, 100.00%.
• Total — 4 votes, 100.00%.
• Overvotes — 0.
• Undervotes — 81.

The text is printed in black on a white background with horizontal divider lines separating sections.

November 9, 2025 at 9:06 PM

Tim Kellogg

@timkellogg.me

Polaris Alpha, believed to be GPT-5.1 non-reasoning, scores just below Sonnet 4.5 on HLE (unofficial run)

There will be a reasoning version too, and OpenAI excels at RL & post training, so I have high expectations for it

also leaked: Nov 24 release date

The bar chart titled “Humanity’s Last Exam (HLE) – Text-Only Performance” compares accuracy percentages of three language models:
• Sonnet 4.5 — 7.65% (blue bar)
• Polaris Alpha — 6.0% (red bar)
• GPT 4.5 — 5.8% (teal bar)

The y-axis is labeled Accuracy (%), ranging from 0 to 10.
Sonnet 4.5 achieves the highest score, outperforming both Polaris Alpha and GPT 4.5 on this challenging benchmark.

November 9, 2025 at 2:18 PM

Tim Kellogg

@timkellogg.me

idk is a 50 year mortgage even worth it?

$300,000
30-year fixed: $1,529 principal and interest
40-year fixed: $1,418 principal and interest
50-year fixed: $1,366 principal and interest
$400,000
30-year fixed: $2,038 principal and interest
40-year fixed: $1,891 principal and interest
50-year fixed: $1,822 principal and interest
$500,000
30-year fixed: $2,548 principal and interest
40-year fixed: $2,363 principal and interest
50-year fixed: $2,277 principal and interest
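
The payments above follow the standard fixed-rate amortization formula. The post does not state the interest rate behind each term, so the sketch below backs out an implied rate from each quoted payment by bisection. The function names are illustrative, and the 0% to 100% search range is an assumption.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Principal-and-interest payment for a fixed-rate loan."""
    n = years * 12            # number of monthly payments
    r = annual_rate / 12      # monthly interest rate
    if r == 0:
        return principal / n  # no interest: straight-line repayment
    return principal * r / (1 - (1 + r) ** -n)


def implied_rate(principal: float, years: int, payment: float) -> float:
    """Annual rate that yields `payment`, found by bisection.

    Works because the payment grows monotonically with the rate.
    Assumes the true rate lies between 0% and 100%.
    """
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if monthly_payment(principal, mid, years) < payment:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def total_interest(principal: float, years: int, payment: float) -> float:
    """Interest paid over the full term at a constant monthly payment."""
    return payment * years * 12 - principal


if __name__ == "__main__":
    quoted = {30: 1529, 40: 1418, 50: 1366}  # the $300,000 rows from the post
    for years, payment in quoted.items():
        rate = implied_rate(300_000, years, payment)
        interest = total_interest(300_000, years, payment)
        print(f"{years}-year: implied rate {rate:.2%}, total interest ${interest:,.0f}")
```

On the quoted $300,000 figures, the 50-year term saves $163 a month over the 30-year term, but total interest comes to $519,600 instead of $250,440.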

November 8, 2025 at 10:45 PM

Tim Kellogg

@timkellogg.me

pro tip

Malte Ubl
@cramforce • 1d
I told ChatGPT that I'm a CTO and now it dumbs down all the answers to technical questions so I can understand them

November 8, 2025 at 10:20 PM

Tim Kellogg

@timkellogg.me

GPT-5-codex-mini

Almost same performance as GPT-5-codex on high, but 4x faster and without pesky things like warm personality

www.neowin.net/amp/openai-i...

November 8, 2025 at 4:46 PM

Tim Kellogg

@timkellogg.me

“nah, we don’t do 996”

November 8, 2025 at 12:49 PM

Tim Kellogg

@timkellogg.me

this morning, X is saturated with people from US claiming that their favorite unknown benchmark (that happens to show K2 trailing US models) is actually the best single benchmark to watch

lol notice how they clipped off the top 12

A leaderboard-style table ranking AI models by performance percentage.

Rank Model Score Organization
13th o1-preview 41.7% OpenAI
14th Claude 3.5 Sonnet 10-22 41.4% Anthropic
15th Gemini 2.5 Flash (latest) 41.2% Google
16th DeepSeek R1 05/28 40.8% DeepSeek
17th o1-2024-12-17 (high) 40.1% OpenAI
18th DeepSeek V3.1 40.0% DeepSeek
19th Kimi K2 Thinking (NEW) 39.6% Moonshot AI

The table shows incremental differences between model scores, with Kimi K2 Thinking newly added to the list at 19th place, just below DeepSeek V3.1.

November 8, 2025 at 12:10 PM

Tim Kellogg

@timkellogg.me

Alex Wise @awssnarkitect.bsky.social • 2d
I still think someone should build a wrapper that sits in front of all of your MCP servers and lets you check off exactly which tools you want to present to the agent. Having each MCP server implement this their own way is bad.
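
One way to picture the wrapper described in the quoted post is a gateway that holds every tool from every server but only lists and forwards the ones that have been checked off. This is a plain-Python sketch, not the MCP SDK: `ToolGateway`, its methods, and the example server and tool names are all hypothetical, and real MCP servers would be reached over a transport rather than called as local functions.

```python
from typing import Any, Callable

ToolKey = tuple[str, str]  # (server name, tool name)


class ToolGateway:
    """Hypothetical allowlist proxy in front of several tool servers."""

    def __init__(self) -> None:
        self._handlers: dict[ToolKey, Callable[..., Any]] = {}
        self._enabled: set[ToolKey] = set()

    def register(self, server: str, tool: str, handler: Callable[..., Any]) -> None:
        """Record a tool offered by a server. Registered tools start disabled."""
        self._handlers[(server, tool)] = handler

    def enable(self, server: str, tool: str) -> None:
        """Check off a tool so the agent can see and call it."""
        if (server, tool) not in self._handlers:
            raise KeyError(f"unknown tool: {server}.{tool}")
        self._enabled.add((server, tool))

    def list_tools(self) -> list[str]:
        """What the agent is shown: only the enabled tools."""
        return sorted(f"{server}.{tool}" for server, tool in self._enabled)

    def call(self, server: str, tool: str, **kwargs: Any) -> Any:
        """Forward a call, refusing anything that was not checked off."""
        if (server, tool) not in self._enabled:
            raise PermissionError(f"tool not enabled: {server}.{tool}")
        return self._handlers[(server, tool)](**kwargs)


# Example wiring with made-up servers and tools.
gateway = ToolGateway()
gateway.register("github", "create_issue", lambda title: f"created: {title}")
gateway.register("github", "delete_repo", lambda repo: f"deleted: {repo}")
gateway.register("fs", "read_file", lambda path: f"contents of {path}")
gateway.enable("github", "create_issue")
gateway.enable("fs", "read_file")
```

Keeping the allowlist in one place means the filtering logic is written once instead of once per server, which is the complaint in the quoted post.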

November 8, 2025 at 11:01 AM

Tim Kellogg

@timkellogg.me

wow, i had no idea

A simple cartoon-style bar chart titled “Model Training Costs in $.”

Two colored bars compare training expenses for Kimi-K2 and GPT-5, with a legend at the top:
• Light pink = Without thinking
• Darker pink = With thinking
• Kimi-K2: stacked bar with two sections — 2.8M (without thinking) and 4.6M total (with thinking).
• GPT-5: single bar showing 69M.

The chart visually emphasizes that Kimi-K2’s total training cost (4.6M) is dramatically lower than GPT-5’s (69M).

November 7, 2025 at 8:26 PM

Tim Kellogg

@timkellogg.me

K2-Thinking is available in the Kimi app now

A screenshot of a dark-themed chat interface showing toggles for two AI features.

At the top of the pop-up menu:
• 🌐 Search — labeled “Enable to search web,” with the toggle switched on (blue).
Below it:
• 💡 Thinking — labeled “Enable for reasoning,” also switched on (blue).

In the background, the input bar shows the model identifier K2 on the left and the text prompt placeholder “Ask anything…”. A few circular icons for audio input and settings appear beside it. The design is minimalist, with a sleek, modern UI against a black background.

November 7, 2025 at 7:29 PM

Tim Kellogg

@timkellogg.me

longer form position here

www.vaticannews.va/en/pope/news...

i really like this part

"The question is not merely what AI can do," the Pope wrote, "but who we are becoming through the technologies we build."

November 7, 2025 at 6:35 PM

Tim Kellogg

@timkellogg.me

GPT-5.1 is live on OpenRouter via stealth preview

OpenRouter @OpenRouterAI
X.com
The new stealth model, "Polaris Alpha", now live.
It's a powerful, general-purpose model that excels across real-world tasks, with standout performance in coding, tool calling, and instruction following.
Polaris Alpha
openrouter/polaris-alpha
Created Nov 6, 2025
$0/M input tokens
256,000 context
$0/M output tokens
This is a cloaked model provided to the community to gather feedback. A powerful, general-purpose model that excels across real-world tasks, with standout performance in coding, tool calling, and instruction following.
Note: All prompts and completions for this model are logged by the provider and may be used to improve the model.

November 7, 2025 at 4:15 PM

Tim Kellogg

@timkellogg.me

i haven’t figured out how to use it, but apparently Kimi K2-Thinking has a Heavy mode with 8 parallel trajectories that are reflectively aggregated

it does better than GPT-5-pro on HLE
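
A rough sketch of the "parallel trajectories, then aggregate" pattern, for readers who want the shape of it. The post does not say how the reflective aggregation works, so this substitutes a plain majority vote for that step. `run_trajectory` is a hypothetical stub standing in for one independent model rollout; nothing here calls a real Kimi API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_trajectory(question: str, seed: int) -> str:
    """Stand-in for one independent rollout (hypothetical stub).

    Returns a wrong answer for some seeds to mimic trajectories disagreeing.
    """
    return "41" if seed % 4 == 0 else "42"


def aggregate(answers: list[str]) -> str:
    """Majority vote. An assumption: the real mode aggregates 'reflectively',
    presumably with another model pass, which is not reproduced here."""
    return Counter(answers).most_common(1)[0][0]


def heavy_mode(question: str, n_trajectories: int = 8) -> str:
    """Run n rollouts concurrently, then combine their answers."""
    with ThreadPoolExecutor(max_workers=n_trajectories) as pool:
        answers = list(
            pool.map(lambda seed: run_trajectory(question, seed), range(n_trajectories))
        )
    return aggregate(answers)


if __name__ == "__main__":
    print(heavy_mode("What is 6 * 7?"))
```

With the stub above, two of the eight trajectories disagree and the vote still settles on the majority answer.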

A dark-themed comparison table titled “Reasoning Tasks” showing benchmark results across six large models: K2 Thinking, GPT-5, Claude Sonnet 4.5 (Thinking), K2 0905, DeepSeek-V3.2, and Grok-4.

Benchmarks and highlights:
• Intro:
• no tools: scores range 7.9–26.3 (highest: GPT-5 26.3).
• Humanity’s Last Exam (Text-only):
• w/ tools: K2 Thinking 44.9, GPT-5 41.7, Grok-4 41.0.
• heavy: K2 51.0 (best), GPT-5 42.0.
• AIME 2025:
• no tools: GPT-5 94.6, K2 94.5, Grok-4 91.7.
• w/ python: Claude 4.5 (Thinking) 100.0, K2 99.1, GPT-5 99.6.
• heavy: K2 100.0 (also GPT-5 100.0 and Grok-4 100.0).
• HMMT 2025:
• no tools: GPT-5 93.3 (best), K2 89.4.
• w/ python: GPT-5 96.7 (best), K2 95.1.
• heavy: GPT-5 100.0 (best), K2 97.5.
• IMO-AnswerBench:
• no tools: K2 78.6 (best), GPT-5 76.0, Claude 4.5 65.9.
• GPQA-Diamond:
• no tools: Grok-4 87.5 (best), GPT-5 85.7, K2 84.5.

Blue numbers mark top scores, yellow “heavy” labels denote advanced or tool-assisted reasoning modes. Overall, K2 Thinking and GPT-5 dominate most reasoning benchmarks, with Claude 4.5 occasionally matching at 100 on AIME 2025.

November 7, 2025 at 4:04 PM

Tim Kellogg

@timkellogg.me

K2-Thinking is SOTA, top model in agentic tool calling

A horizontal bar chart titled “τ²-Bench Telecom (Agentic Tool Use)” comparing AI model performance across vendors.

Each bar shows a model’s accuracy percentage, color-coded by provider.

From left to right:
• Kimi K2 Think — 93% (blue, highest)
• GPT-5 (high) — 87% (black)
• MiniMax-M2 — 87% (pink)
• GPT-5 (base) — 85%
• Claude 4.5 Sonnet — 78%
• Grok-1 — 75%
• Kimi K2 0905 — 73%
• Claude 4.1 Opus — 71%
• GLM-4-9B — 71%
• Abel-v1.15 / 1.85B Thinker — 68%
• gpt-oss-210D (high) — 66%
• Grok 4 (test) — 66%
• Kimi K2 — 61%
• Claude 4.5 Haiku — 55%
• Gemini 2.5 Pro — 54%
• Qwen 2.5 32B — 53%
• Amazon Bedrock Medistinct-12 — 52%
• DeepSeek R1 025B — 37%
• DeepSeek V3 24B — 34%
• Nim Llama Super-490B v1.5 — 28%
• Llama Maverick — 18% (lowest).

A purple arrow points from MiniMax-M2 (87%) to Kimi K2 Think (93%).
The top-right corner shows “Artificial Analysis” as the source.

November 7, 2025 at 10:40 AM

Tim Kellogg

@timkellogg.me

this really highlights how LLMs do math

math is a string of many operations, so one small error (e.g. a misremembered shortcut) causes cascading calculation errors downstream

In between those extremes lie tasks like math and question-answering. Perhaps surprisingly, some mathematical tasks seem to rely on memorization-heavy structure more than most of the other tasks we tested. When the model solves an arithmetic problem like "30 + 60," its learnt rule appears to recruit parts of the model that are also used for memorized sequences, so removing those components often disrupts these precise operations.
In the example below from GSM8K, the reasoning chain remains intact, but the model makes an arithmetic mistake in the final calculation. This and similar examples seem to indicate that the reduced performance on math benchmarks comes largely from arithmetic errors. Since solving word problems requires both reasoning (to understand and formalize the question) and calculation, the edited model's poor arithmetic abilities mean it does poorly on the overall math benchmarks - even though its reasoning capabilities are preserved.
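
A back-of-the-envelope way to see the cascade: if every step in a calculation must be right for the final answer to be right, small per-step error rates compound. The sketch assumes each step succeeds independently with the same probability, which is a simplification and not something the quoted research claims.

```python
def chain_accuracy(step_accuracy: float, n_steps: int) -> float:
    """Probability that all n steps are correct, assuming independent steps."""
    return step_accuracy ** n_steps


if __name__ == "__main__":
    for p in (0.999, 0.99, 0.95):
        for n in (5, 20, 50):
            print(f"per-step {p:.3f}, {n:>2} steps -> final {chain_accuracy(p, n):.3f}")
```

Under this assumption, 99% per-step accuracy over 20 steps leaves roughly an 82% chance of a correct final answer.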

November 7, 2025 at 1:02 AM

Tim Kellogg

@timkellogg.me

Surprising: Math requires a lot of memorization

Goodfire is at it again!

They developed a method similar to PCA that measures how much of an LLM’s weights are dedicated to memorization

www.goodfire.ai/research/und...

A bar chart titled “Relative benchmark performance after K-FAC edit.”

The y-axis shows K-FAC Edit Accuracy / Baseline (ranging from 0.0 to 1.0).
The x-axis lists various benchmarks from left to right, grouped by category and color-coded:
• Dark blue (Memory): Heldout, Quotes — strong drop, near zero to 0.2.
• Light blue (Math): GSM8K, MMLU-Pro Math, SimpleMath — moderate performance (~0.65–0.75).
• Pale blue (Closed-book QA): PopQA, TriviaQA, Relations — higher (~0.8–0.9).
• Light orange (Open-book QA): TriviaQA-Open, BoolQ, OBQA — near 1.0.
• Red-orange (Logic): Boar, Etruscan, Winogrande, Logical Deduction, Tracking Objs, Bool Expr. — around 1.0 or slightly above.

At the bottom, a gradient arrow labeled “Memorization (specialized patterns)” → “Reasoning (shared mechanisms)” illustrates the trend: memory-heavy tasks degrade sharply after K-FAC editing, while reasoning-based tasks retain or improve performance.

November 7, 2025 at 1:02 AM

Tim Kellogg

@timkellogg.me

notable: they ripped out the silicon that supports training

they say: “it’s the age of inference”

which, yeah, RL is mostly inference. Continual learning is almost all inference. Ambient agents, fast growing inference demands in general audiences

kartik343.wixstudio.com/blogorithm/p...

Key Architecture Innovations
Ironwood's matrix multiply units (MXUs) have been entirely reengineered for inference-only operations with some major differences from training-centric architectures:

November 7, 2025 at 12:43 AM

Tim Kellogg

@timkellogg.me

Karmay
@karmay007
X.com
From my tests, Kimi K2 thinking is better than everything Xai, Anthropic, Google has to offer atm.
The only thing that is better than this is Gpt 5 codex (at code) and Gpt 5 pro (at high level algorithm design)
It beats the SOTA at creative writing by a mile.
Good work @crystalsssup!

November 6, 2025 at 9:49 PM

Tim Kellogg

@timkellogg.me

Kimi K2-Thinking

a new leader?

moonshotai.github.io/Kimi-K2/thin...

November 6, 2025 at 6:00 PM

Tim Kellogg

@timkellogg.me

OpenAI has been getting ready to release GPT-5.1 (this from their iOS code)

pretty sure i’ve A/B tested it, and it was a big step up, at least for the search-type queries i typically do

Hfe = V(() => k(() => import("./mf2r2k4z87234aed.js"),
loading: e => J(e)
zfe = V(() => k(() => import("/nerbu18r13sjd8cx.js"),
loading: e => J(e)
Yfe = "gpt-5-thinking"
F6e = "gpt-5-1-thinking"
j6e = e => {
const t = lt(e);
return t != null && t.isPro() || t != null && t.isEnt
}
U6e = (e, t) => {
Le.logEventWithStatsig("ChatGPT Composer Thinking Ef
effort: t,
November 6, 2025 at 1:32 PM

Tim Kellogg

@timkellogg.me

lol this part was cute

like, you realize we have space probes still functioning beyond pluto, right? there are answers for this stuff..

Without human maintenance, failures could be higher in space than on Earth, requiring even more launches than just upgrades - and as the whole module would have to be replaced, a small percentage of failures could doom the entire module.

November 5, 2025 at 2:11 AM

Tim Kellogg

@timkellogg.me

Windsurf Codemaps

actually this makes a ton of sense — if vibe coding only works on small/non-complex projects, then the answer is to tackle complexity directly

Codemaps uses LLMs to create an “index” over your code, a map of where things are

cognition.ai/blog/codemaps

A hand-drawn style line graph titled “Your coding ability is constrained by your codebase understanding.”
• Y-axis: “Complexity of Codebase / Tasks you can Prompt.”
• X-axis: “Time.”
• The main black curve rises gradually, plateaus, then steeply increases again — labeled “Your max coding ability.”

Three additional annotated elements:
• Red dashed line: “Limits of Manual Understanding,” showing how traditional comprehension plateaus early.
• Green arrows: “Vibe coding uplift” (small early improvement) and later “Code understanding uplift” (larger improvement).
• Blue dashed line: “Codemaps Understanding scales w/ model intelligence,” rising linearly and surpassing the red limit.

The diagram visually conveys that while manual understanding limits progress, model-assisted tools like Codemaps expand coding capability by improving codebase comprehension over time.

November 5, 2025 at 1:47 AM

Tim Kellogg

@timkellogg.me

how did you come to those numbers? these are theirs

November 5, 2025 at 12:53 AM

Tim Kellogg

@timkellogg.me

Starcloud: GPUs in space

This company finally launched their first H100 into high Earth orbit. A solar array for power, uninterrupted by weather or nighttime, and a black plate in the back to radiate heat away into -270°C space

starcloudinc.github.io/wp.pdf

A black-background graph showing solar spectral irradiance as a function of wavelength (nm).
• Y-axis: Spectral Irradiance (W/m²/nm), ranging from 0 to 2.5.
• X-axis: Wavelength (nm), spanning 250–2500.
• The top horizontal labels divide the spectrum into UV, Visible, and Infrared regions.

Two curves are plotted:
• A smooth green line labeled “Solar Irradiance Outside Atmosphere,” peaking near 500 nm (~2 W/m²/nm).
• A jagged blue line labeled “Solar Irradiance at Sea Level,” lower in magnitude and irregular due to atmospheric absorption (notably dips around 900, 1100, 1400, and 1900 nm).

The chart highlights how Earth’s atmosphere absorbs parts of the sunlight spectrum, especially in infrared bands, reducing overall irradiance at sea level.

November 5, 2025 at 12:34 AM
