Andreas Hochlehnert
@ahochlehnert.bsky.social
PhD student in ML at Tübingen AI Center & International Max-Planck Research School for Intelligent Systems
Cambrian-S is a valuable first step in defining what “supersensing” might mean for video models. Our results simply highlight how subtle benchmark design choices can be exploited — and how we can improve them together.

📄 arxiv.org/abs/2511.16655
🔗 github.com/bethgelab/s...
GitHub - bethgelab/supersanity: A critical analysis of the Cambrian-S model and VSI-Super benchmarks
November 24, 2025 at 5:19 PM
This indicates that the tailored Cambrian-S inference strategy may rely on benchmark-specific shortcuts (e.g. rooms are never revisited), rather than building a persistent, spatial world model over time.
November 24, 2025 at 5:19 PM
For VSI-Super-Counting (VSC), we run a sanity check:

🔁 VSC-Repeat: we concatenate each video with itself 1-5×
✅ Unique object count stays the same
❌ Cambrian-S accuracy drops from 42% → 0%

A genuine supersensing system should be robust here.
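For readers who want to replicate the check, here is a minimal sketch of the VSC-Repeat protocol; `load_frames` and `count_unique_objects` are hypothetical stand-ins for a video loader and the model's counting interface, not the actual evaluation code (that lives in the linked repo).

```python
# VSC-Repeat sanity check (illustrative sketch, not the released eval code).
def vsc_repeat_check(video_path, load_frames, count_unique_objects, max_repeats=5):
    frames = load_frames(video_path)            # list of frames for one VSC video
    baseline = count_unique_objects(frames)     # model's count on the original video
    repeated_counts = {}
    for k in range(2, max_repeats + 1):
        repeated = frames * k                   # concatenate the video with itself k times
        repeated_counts[k] = count_unique_objects(repeated)
    # Repetition adds no new objects, so every entry should equal `baseline`.
    return baseline, repeated_counts
```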
November 24, 2025 at 5:19 PM
We introduce a simple baseline called NoSense, an image-only (SigLIP) model that discards almost all temporal structure.

Surprisingly, it reaches 95% accuracy on VSI-Super-Recall (VSR), even on 4-hour videos.

This suggests VSR can be solved without true spatial supersensing.
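To make the point concrete, here is a rough sketch of what an image-only, frame-by-frame baseline can look like; `embed_frame` and `embed_text` stand in for a SigLIP-style image/text encoder, and this illustrates the idea rather than the exact NoSense implementation.

```python
import numpy as np

def answer_from_single_frames(frames, question, options, embed_frame, embed_text):
    """Score each frame independently against each answer option and pick the
    option with the highest single-frame similarity. No temporal state is kept."""
    option_embs = np.stack([embed_text(f"{question} {opt}") for opt in options])
    option_embs /= np.linalg.norm(option_embs, axis=1, keepdims=True)
    best = np.full(len(options), -np.inf)
    for frame in frames:                        # frames can be streamed one by one
        f = embed_frame(frame)
        f = f / np.linalg.norm(f)
        best = np.maximum(best, option_embs @ f)
    return options[int(np.argmax(best))]
```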
November 24, 2025 at 5:19 PM
🖐️
September 5, 2025 at 6:20 PM
7/ Takeaway?

Many supposed gains don’t hold up under scrutiny.
Progress is possible—but let’s build on reproducible foundations.

🧠 Full paper: arxiv.org/abs/2504.07086

🧑‍🔬 By: @hrdkbhatnagar.bsky.social @vishaalurao.bsky.social @samuelalbanie.bsky.social @bayesiankitten.bsky.social @MatthiasBethge
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with...
April 10, 2025 at 3:42 PM
6/ Our recommendations (see the sketch below):

– Evaluate with ≥10 seeds
– Tune decoding per model
– Use appropriate prompts/templates
– Standardize hardware/software (we use Docker)
– Open-source everything

📦 Code, prompts, outputs: github.com/bethgelab/so...
GitHub - bethgelab/sober-reasoning
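As a rough illustration of the multi-seed recommendation, a wrapper along these lines could sit on top of any existing eval harness; `evaluate(...)` and the decoding defaults are placeholders, not our actual pipeline.

```python
import statistics

def multi_seed_pass1(evaluate, model, benchmark, seeds=range(10),
                     temperature=0.6, top_p=0.95):
    """Run the same evaluation under several seeds and report mean +/- std
    instead of a single-seed Pass@1."""
    scores = [evaluate(model, benchmark, seed=s,
                       temperature=temperature, top_p=top_p) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

# e.g. report Pass@1 as mean +/- std over 10 seeds rather than one lucky run.
```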
April 10, 2025 at 3:38 PM
5/ What actually works?

🔹 RL methods over distillation? Often negligible gains, prone to overfitting.
🔹 Supervised finetuning (SFT) on reasoning traces? Stable & generalizable.
April 10, 2025 at 3:38 PM
4/ Variance is everywhere:

– Random seed: swings Pass@1 by 5–15pp
– Temperature/top-p: another ±10pp
– Software & Hardware? Yes, even that changes scores

🎯 Single-seed results on small datasets are essentially noise.
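A back-of-envelope check on why this happens: if Pass@1 is estimated from a single run over a small benchmark, treating it as a binomial proportion already gives a standard error of several points. The 30-question size below is an assumption (in the spirit of AIME-style sets), not a statement about any specific benchmark.

```python
import math

n = 30   # assumed number of questions in a small benchmark
p = 0.5  # assumed true Pass@1 of the model
se = math.sqrt(p * (1 - p) / n)
print(f"standard error of single-run Pass@1: {100 * se:.1f} pp")  # ~9.1 pp
# Two standard errors already span ~18 pp, the same order as the observed swings.
```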
April 10, 2025 at 3:37 PM
3/ We re-evaluated recent 1.5B and 7B reasoning models on 6 benchmarks under controlled settings.

➡️ Performance dropped by up to 17%
➡️ Improvements fall within the variance range of the base model
➡️ Some models don’t beat the baseline!
April 10, 2025 at 3:37 PM
2/ Reasoning is the next frontier for LMs—but current evaluation practices often lack rigor.

We find that many celebrated gains from RL methods vanish once you:

✅ average over multiple seeds
✅ control decoding
✅ standardize prompt & infra
April 10, 2025 at 3:36 PM
We are just getting started! We're building better filters, aggregating released benchmarks (DataComp-style), and developing fast, accurate OpenThinking models. Stay tuned! w/
@hrdkbhatnagar.bsky.social, @vishaalurao.bsky.social, @bayesiankitten.bsky.social, Matthias Bethge [6/6]
February 17, 2025 at 6:27 PM
These issues encourage shortcuts and flawed reasoning. If GRPO rewards bad logic, models reinforce errors instead of improving. Garbage In, Garbage Out 🚨 [5/6]
February 17, 2025 at 6:26 PM
🔸 Some questions reference figures that aren't included! Text-only models can't infer missing visuals. [4/6]
February 17, 2025 at 6:25 PM
🔸 Mathematical proofs are a challenge. There's no automated way to verify them, and answers often only show an initial equation, leading to unreliable training signals. [3/6]
February 17, 2025 at 6:25 PM
Blog (For Updates): huggingface.co/datasets/bet...

🔸 Some questions contain subquestions, but only one answer is labeled. The model may get penalized for "wrong" but valid reasoning. [2/6]
February 17, 2025 at 6:24 PM