joelniklaus.bsky.social
@joelniklaus.bsky.social
Moreover, we anticipate that the ways to effectively train on the test task will only grow in scope and adoption."

By Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt
November 12, 2025 at 3:56 PM
Detecting what training data a model has seen is a notoriously difficult problem: existing heuristics achieve partial success at best. Researchers routinely acknowledge the futility of fighting data contamination.
November 12, 2025 at 3:56 PM
Instead, we propose to adjust for it by giving every model the same task-specific preparation before evaluation. We work from the assumption that training on the test task, in general, cannot be effectively detected, disallowed, or disincentivized.
November 12, 2025 at 3:56 PM
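To make the idea concrete, here is a rough sketch of what identical task-specific preparation could look like in practice. This is my reading of the proposal, not the authors' released code; the checkpoint name, the tiny example set, and the run_benchmark helper are placeholders.

```python
# Minimal sketch: give every model under comparison the same small dose of
# task-specific fine-tuning before running the benchmark, so no model is
# advantaged by having (or lacking) prior exposure to the test task.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SHARED_TASK_EXAMPLES = [          # identical preparation data for all models
    "Q: 12 * 7 = ? A: 84",
    "Q: 9 + 15 = ? A: 24",
]

def prepare_on_task(model_name: str, lr: float = 1e-5, epochs: int = 1):
    """Fine-tune one model on the shared task examples before evaluation."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in SHARED_TASK_EXAMPLES:
            batch = tok(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model, tok

# Every model gets the same call, then the same benchmark:
# model, tok = prepare_on_task("gpt2")   # placeholder checkpoint
# score = run_benchmark(model, tok)      # hypothetical evaluation helper
```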
The anecdote holds a lesson for the evaluation of large language models half a century later. Knowledge about the evaluation conditions necessarily influences training practices under competitive pressure. It may be a fool’s errand to prohibit the practice.
November 12, 2025 at 3:56 PM
But the hotly debated results of the Games did not lead the organizers to prohibit training at natural altitude. Instead, they let everyone do it, and athletes came to consider altitude training an excellent way to train.
November 12, 2025 at 3:56 PM
"The 1968 Olympics took place in Mexico City at the significant altitude of 2340 meters, higher than Australia’s tallest peak. Runners who had trained at altitude in their home countries were better prepared to compete in Mexico City’s conditions, as it turned out.
November 12, 2025 at 3:56 PM
- Cool to see this being done on the French supercomputer Jean Zay
November 11, 2025 at 3:59 PM
- They don't release any code, and the method description stays quite high-level: for example, I am curious how they fine-tuned their models and would love to learn more about how they set up their synthetic data pipeline. Looking forward to the full report.
November 11, 2025 at 3:59 PM
- They only evaluate on MMLU, GSM8K, and HotPotQA. This seems cherry-picked; I wonder how their dataset performs on other standard benchmarks. They say that they basically skip pre-training and go straight to post-training.
November 11, 2025 at 3:59 PM
- Seems like a cool case study pushing really small models to their limits (an MMLU score of 30 for a 56M-parameter model)
November 11, 2025 at 3:59 PM
- Co-author Gary Marcus notes he doesn't agree with every detail but signed on to support better articulation of what AGI means. The equal 10% weighting across domains is one choice among many reasonable configurations, though the paper argues for prioritizing breadth over depth.
November 10, 2025 at 3:56 PM
For instance, GPT-5 reaches 70.8% on visual reasoning tasks where humans average 88.9%, yet scores 0% on adaptation tasks that test flexible rule inference.
November 10, 2025 at 3:56 PM
- The framework reveals a "jagged" cognitive profile where models excel in knowledge-intensive domains but have critical deficits in foundational machinery.
November 10, 2025 at 3:56 PM
Models compensate by expanding context windows, but the paper calls this a "capability contortion" that masks the absence of genuine experiential memory.
November 10, 2025 at 3:56 PM
- Both GPT-4 and GPT-5 score exactly 0% on long-term memory storage. This isn't a bug but an architectural constraint of transformer models, where attention mechanisms scale quadratically with context length.
November 10, 2025 at 3:56 PM
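A small illustration of the scaling claim, not taken from the paper: the attention score matrix has one entry per pair of tokens, so its size grows with the square of the context length. The dimensions and context lengths below are arbitrary.

```python
# Doubling the number of tokens quadruples the attention score matrix.
import numpy as np

def attention_scores(n_tokens: int, d_model: int = 64) -> np.ndarray:
    q = np.random.randn(n_tokens, d_model)
    k = np.random.randn(n_tokens, d_model)
    return q @ k.T / np.sqrt(d_model)   # shape (n_tokens, n_tokens)

for n in (1_000, 2_000, 4_000):
    print(n, attention_scores(n).nbytes / 1e6, "MB")  # 8 -> 32 -> 128 MB
```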
The framework tests ten core domains: general knowledge, reading and writing, math, reasoning, working memory, long-term memory storage, memory retrieval, visual processing, auditory processing, and speed. Applying this to current models reveals GPT-4 scores 27% and GPT-5 scores 58%.

My take:
November 10, 2025 at 3:56 PM
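For concreteness, here is a toy sketch of how the equal 10% weighting aggregates into an overall score. Only the ten-domain structure comes from the paper as summarized above; the example per-domain numbers are made up.

```python
# Equal 10% weight per domain means the overall score is just the mean of
# the ten domain scores (each on a 0-100 scale).
DOMAINS = [
    "general knowledge", "reading and writing", "math", "reasoning",
    "working memory", "long-term memory storage", "memory retrieval",
    "visual processing", "auditory processing", "speed",
]

def overall_score(domain_scores: dict[str, float]) -> float:
    assert set(domain_scores) == set(DOMAINS)
    return sum(domain_scores.values()) / len(DOMAINS)

# A model acing five domains and scoring zero on the other five lands at 50.
example = {d: (100.0 if i < 5 else 0.0) for i, d in enumerate(DOMAINS)}
print(overall_score(example))   # 50.0
```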
A who's who of AI, 33 researchers from institutions including Berkeley, MIT, Stanford, and Oxford, among them Yoshua Bengio, Eric Schmidt, Gary Marcus, and Max Tegmark, developed a quantifiable framework grounded in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition.
November 10, 2025 at 3:56 PM
The term AGI acts as a constantly moving goalpost, with criteria shifting as AI systems master tasks once thought to require human intellect. This ambiguity obscures how far we actually are from human-level cognition.
November 10, 2025 at 3:56 PM