Lightnews — Scholar-powered news

A.V.

@slckl.bsky.social

I love widely used high quality datasets.

vik / λh.(h h) @vikhyat.net · 1d

Examples of toxic prompts we removed

November 18, 2025 at 8:14 AM

A.V.

@slckl.bsky.social

Just finished Clevatess, first season. Super solid dark fantasy, with an old school vibe, but enough fresh twists to keep you hooked. Strong personal contender for aoty 2025.

November 11, 2025 at 7:49 PM

A.V.

@slckl.bsky.social

Comparatively tiny models, trained on purpose made data, can into reasoning. Beautiful work!

Alexander Doria @dorialexander.bsky.social · 8d

Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...

November 10, 2025 at 8:07 PM

Reposted by A.V.

Tim Kellogg

@timkellogg.me

Surprising: Math requires a lot of memorization

Goodfire is at it again!

They developed a method similar to PCA that measures how much of an LLM’s weights are dedicated to memorization

www.goodfire.ai/research/und...

A bar chart titled “Relative benchmark performance after K-FAC edit.”

The y-axis shows K-FAC Edit Accuracy / Baseline (ranging from 0.0 to 1.0).
The x-axis lists various benchmarks from left to right, grouped by category and color-coded:
• Dark blue (Memory): Heldout, Quotes — strong drop, near zero to 0.2.
• Light blue (Math): GSM8K, MMLU-Pro Math, SimpleMath — moderate performance (~0.65–0.75).
• Pale blue (Closed-book QA): PopQA, TriviaQA, Relations — higher (~0.8–0.9).
• Light orange (Open-book QA): TriviaQA-Open, BoolQ, OBQA — near 1.0.
• Red-orange (Logic): Boar, Etruscan, Winogrande, Logical Deduction, Tracking Objs, Bool Expr. — around 1.0 or slightly above.

At the bottom, a gradient arrow labeled “Memorization (specialized patterns)” → “Reasoning (shared mechanisms)” illustrates the trend: memory-heavy tasks degrade sharply after K-FAC editing, while reasoning-based tasks retain or improve performance.

November 7, 2025 at 1:02 AM

Reposted by A.V.

Key 🗝 🦊✅

@keytryer.net

This is the first company that is unironically making actual gynoids and it's not even for porn reasons. They just are.

November 5, 2025 at 4:16 PM

Reposted by A.V.

Tim Kellogg

@timkellogg.me

Cursor made an LLM

it’s called Composer, it’s an extremely fast model that was previously available under code name Cheetah

it’s an MoE trained in fp8, RL’d on Cursor Agent traces

cursor.com/blog/composer

Composer: Building a fast frontier model with RL · Cursor

Built to make you extraordinarily productive, Cursor is the best way to code with AI.

cursor.com

October 29, 2025 at 6:39 PM

Reposted by A.V.

Eugene Vinitsky 🍒

@eugenevinitsky.bsky.social

Hundreds of hours of European driving data from NVIDIA! 1700 hours total

Bernhard Jaeger @bernhard-jaeger.bsky.social · 21d

Big day for autonomous driving research.
Nvidia just dropped 1700 hours of public driving data on HuggingFace from over 2500 cities:

huggingface.co/datasets/nvi...

huggingface.co

October 28, 2025 at 8:20 PM

A.V.

@slckl.bsky.social

New company announced, with the intent of making GPU programming better with Rust: www.vectorware.com/blog/announc...

The founding team has impressive Rust credentials. They're targeting a wide range of use cases, not just ML.

Announcing VectorWare

We are building the first GPU-native software company. Today we are sharing the thesis, people, and partners behind it.

www.vectorware.com

October 24, 2025 at 4:54 PM

Reposted by A.V.

amos

@fasterthanli.me

codex is definitely "senior engineer" material because it takes forever to think about it before it tells you to fuck off

October 22, 2025 at 6:16 PM

A.V.

@slckl.bsky.social

Sometimes, when writing throwaway UUID v4s, I feel bad about exhausting the global uuid supply. It can't run out, can it...

October 15, 2025 at 5:03 PM

Reposted by A.V.

thebes

@vgel.me

is fiction a superstimulus?

October 3, 2025 at 3:04 AM

A.V.

@slckl.bsky.social

Incredible that you can... just have Sonnet 4.5 at home.

Blog post here: z.ai/blog/glm-4.6

September 30, 2025 at 10:40 AM

Reposted by A.V.

tachikoma

@tachikoma.elsewhereunbound.com

you are here meme image about different LLMs and AI labs releasing the "world's most powerful model"

September 29, 2025 at 5:31 PM

Reposted by A.V.

Tim Kellogg

@timkellogg.me

Sonnet 4.5

Better than Opus 4.1 on almost every benchmark

Still the classic Sonnet prices, $3/$15

This bar chart shows Software engineering performance on SWE-bench Verified (n=500), comparing several models’ accuracy (%).

Results:
• Sonnet 4.5: 77.2% (base), 82.0%* with parallel test-time compute
• Opus 4.1: 74.5% (base), 79.4%* with parallel compute
• Sonnet 4: 72.7% (base), 80.2%* with parallel compute
• GPT-5 Codex: 74.5%
• GPT-5: 72.8%
• Gemini 2.5 Pro: 67.2%

(*asterisk indicates results with parallel test-time compute scaling).

Key takeaways:
• Sonnet 4.5 achieves the highest overall score (82.0% with scaling).
• Without scaling, it still leads at 77.2%.
• Opus 4.1 and Sonnet 4 gain significant boosts from scaling, moving them close to Sonnet 4.5.
• GPT-5 Codex and GPT-5 are competitive (~73–75%), but below Sonnet/Opus.
• Gemini 2.5 Pro lags furthest behind at 67.2%.

This table compares Claude Sonnet 4.5, Claude Opus 4.1, Claude Sonnet 4, GPT-5, and Gemini 2.5 Pro across a wide range of benchmarks.

⸻

Agentic coding (SWE-bench Verified)
• Claude Sonnet 4.5: 77.2% (82.0% with parallel compute)
• Claude Opus 4.1: 74.5% (79.4% with parallel compute)
• Claude Sonnet 4: 72.7% (80.2% with parallel compute)
• GPT-5: 72.8% (74.5% Codex)
• Gemini 2.5 Pro: 67.2%

⸻

Agentic terminal coding (Terminal-Bench)
• Claude Sonnet 4.5: 50.0%
• Claude Opus 4.1: 46.5%
• Claude Sonnet 4: 36.4%
• GPT-5: 43.8%
• Gemini 2.5 Pro: 25.3%

⸻

Agentic tool use (τ²-bench)

Retail: Sonnet 4.5 (86.2%), Opus 4.1 (86.8%), Sonnet 4 (83.8%), GPT-5 (81.1%)
Airline: Sonnet 4.5 (70.0%), Opus 4.1 (63.0%), Sonnet 4 (63.0%), GPT-5 (62.6%)
Telecom: Sonnet 4.5 (98.0%), Opus 4.1 (71.5%), Sonnet 4 (49.6%), GPT-5 (96.7%)

⸻

Computer use (OSWorld)
• Sonnet 4.5: 61.4%
• Opus 4.1: 44.4%
• Sonnet 4: 42.2%
• GPT-5: —
• Gemini 2.5 Pro: —

⸻

High school math (AIME 2025)

Python: Sonnet 4.5 (100%), GPT-5 (99.6%), Gemini 2.5 Pro (88.0%)
No tools: Sonnet 4.5 (87.0%), Opus 4.1 (78.0%), Sonnet 4 (70.5%), GPT-5 (94.6%)

⸻

Graduate-level reasoning (GPQA Diamond)
• Sonnet 4.5: 83.4%
• Opus 4.1: 81.0%
• Sonnet 4: 76.1%
• GPT-5: 85.7%
• Gemini 2.5 Pro: 86.4%

⸻

Multilingual Q&A (MMMLU)
• Sonnet 4.5: 89.1%
• Opus 4.1: 89.5%
• Sonnet 4: 86.5%
• GPT-5: 89.4%
• Gemini 2.5 Pro: —

⸻

Visual reasoning (MMMU validation)
• Sonnet 4.5: 77.8%
• Opus 4.1: 77.1%
• Sonnet 4: 74.4%
• GPT-5: 84.2%
• Gemini 2.5 Pro: 82.0%

⸻

Financial analysis (Finance Agent)
• Sonnet 4.5: 55.3%
• Opus 4.1: 50.9%
• Sonnet 4: 44.5%
• GPT-5: 46.9%
• Gemini 2.5 Pro: 29.4%

⸻

Key insights
• Claude Sonnet 4.5 dominates in coding (SWE-bench, Terminal, τ²-bench), computer use, and finance.
• GPT-5 is very strong in math (no tools), visual reasoning, and GPQA Diamond.
• Gemini 2.5 Pro underperforms overall, but is competitive in graduate-level reasoning and

September 29, 2025 at 6:05 PM

Reposted by A.V.

Sung Kim

@sungkim.bsky.social

Alibaba released Qwen3-VL

The flagship model Qwen3-VL-235B-A22B is released as open-weight and available in both Instruct and Thinking versions

✅ Instruct outperforms Gemini 2.5 Pro on key vision benchmarks
✅ Thinking achieves state-of-the-art (SOTA) performance on multimodal reasoning tasks

September 23, 2025 at 11:55 PM

A.V.

@slckl.bsky.social

Very cool. More stuff on the bad site: x.com/Alibaba_Qwen...

Sung Kim @sungkim.bsky.social · Sep 22

Alibaba's Qwen3-Omni — the end-to-end omni-modal AI unifying text, image, audio & video in one model

🏆 SOTA on 22/36 audio & AV benchmarks
🌍 119L text / 19L speech in / 10L speech out
⚡ 211ms latency | 🎧 30-min audio understanding
🎨 Fully customizable via system prompts

September 22, 2025 at 8:11 PM

A.V.

@slckl.bsky.social

Spectral Labs SGS-1: Generate CAD geometry from description, cool. The output is a format compatible with CAD tools and so this seems quite practical.

www.spectrallabs.ai/research/SGS-1

Introducing SGS-1

Spectral Labs releases SGS-1: the first generative model for structured CAD.

www.spectrallabs.ai

September 21, 2025 at 8:37 AM

A.V.

@slckl.bsky.social

Feels like a JoJo's episode, amazing.

WildHoHoHorillaMan @wildgorillaman.bsky.social · Sep 19

The writers this season aren’t even trying anymore

Twitter cap

Text:

richard
@richard normal
my name is rokos basilisk and i'm making artificial intelligence that you put on your body

TBPN
@tbpn • 19h
The Ray-Ban Meta story started with a cold email to Mark Zuckerberg.
Rocco Basilico (Chief Wearables Officer, EssilorLuxottica) broke it down:

[Video, 0:32: Rocco Basilico | Chief Wearables Officer, EssilorLuxottica | @roccobasilico, live on TBPN, 09/17/25]

12:23 PM • 18 Sep 25 • 79K Views

September 19, 2025 at 6:04 AM

A.V.

@slckl.bsky.social

Moondream3 preview dropped, 9B MoE, A2B, visual reasoning, it's beautiful. Claims it's better than big boys opus 4.1, gemini 2.5 pro etc.
@vikhyat.net cooked

moondream.ai/blog/moondre...
x.com/vikhyatk/sta...

image shows anecdata + eval showing moondream3 being better at open vocabulary object detection in comparison to gemini 2.5 flash, gpt5 and claude 4 sonnet.

September 19, 2025 at 5:57 AM

A.V.

@slckl.bsky.social

Sometimes you bang against the walls of your own flesh.
www.youtube.com/watch?v=tzvM...

I Not Me

YouTube video by Osheyack - Topic

www.youtube.com

September 16, 2025 at 7:32 PM

A.V.

@slckl.bsky.social

In addition to some ambient/techno classics pounding the brain, I also find folk music extremely stimulating for doing work. But only on the condition I can't understand the lyrics.
What's this about? No clue

open.spotify.com/track/5B4lRo...

Tuuli

Hedningarna · TRÄ · Song · 1994

open.spotify.com

August 16, 2025 at 9:17 AM

A.V.

@slckl.bsky.social

Us europoors eating good for once.

Supported Languages:
Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian

Sung Kim @sungkim.bsky.social · Aug 16

Nvidia open-sources (model, data, and code) both speech recognition models and datasets:

- parakeet-tdt-0.6b-v3: blazing fast and accurate ASR inference with PnC and timestamps

huggingface.co/nvidia/parak...

nvidia/parakeet-tdt-0.6b-v3 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

August 16, 2025 at 8:36 AM

Reposted by A.V.

kyunghyuncho.bsky.social

@kyunghyuncho.bsky.social

recently gave a talk on <Reality Checks> at two venues, and discussed (and rambled) about how leaderboard chasing is awesome (and we want it to continue) but that this isn't easy because everyone (me! me! me!) wants to write more papers.

the link to the slide deck in the reply.

August 12, 2025 at 2:04 AM

A.V.

@slckl.bsky.social

Out of all the model drops today, Genie 3 is the most mind boggling. Good job!

Shame all the juicy details are locked down tight...

Paige Bailey @dynamicwebpaige.bsky.social · Aug 5

Get ready to enter the simulation...

Genie 3 is a new frontier for world models: its environments remain largely consistent for several minutes, with visual memory extending as far back as 1min. These limitations will only decrease with time.

Welcome to the future.🙌
deepmind.google/discover/blo...

August 5, 2025 at 6:20 PM

Reposted by A.V.

Tim Kellogg

@timkellogg.me

gpt-oss, OpenAI's open weights model

120B & 20B variants, both MoE with 4 experts active

openai.com/index/introd...

Bar chart showing model accuracy on expert-level questions from "Humanity’s Last Exam."

* Y-axis: Accuracy (%), ranging from approximately 10% to 25%.
* X-axis: Model names with or without tool use.

From left to right:

1. **gpt-oss-120b (with tools)**: 19%
2. **gpt-oss-120b (without tools)**: 14.9%
3. **gpt-oss-20b (with tools)**: 17.3%
4. **gpt-oss-20b (without tools)**: 10.9%
5. **o3 (with tools)**: 24.9% — highest-performing model
6. **o4-mini (with tools)**: 17.7%
7. **o3-mini (without tools)**: 13.4%

Models generally perform better with tools enabled. o3 (with tools) leads all models in accuracy.

August 5, 2025 at 5:45 PM
