joelniklaus.bsky.social
@joelniklaus.bsky.social
Moreover, we anticipate that the ways to effectively train on the test task will only grow in scope and adoption."

By Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt
November 12, 2025 at 3:56 PM
Detecting what training data a model has seen is a notoriously difficult problem: existing heuristics achieve partial success at best. Researchers routinely acknowledge the futility of fighting data contamination.
November 12, 2025 at 3:56 PM
Instead, we propose to adjust for it by giving every model the same task-specific preparation before evaluation. We work from the assumption that training on the test task, in general, cannot be effectively detected, disallowed, or disincentivized.
November 12, 2025 at 3:56 PM
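To make the idea concrete, here is a rough sketch of what identical task-specific preparation could look like in practice. This is my reading of the proposal, not the authors' released code; the checkpoint name, the tiny example set, and the run_benchmark helper are placeholders.

```python
# Minimal sketch: give every model under comparison the same small dose of
# task-specific fine-tuning before running the benchmark, so no model is
# advantaged by having (or lacking) prior exposure to the test task.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SHARED_TASK_EXAMPLES = [          # identical preparation data for all models
    "Q: 12 * 7 = ? A: 84",
    "Q: 9 + 15 = ? A: 24",
]

def prepare_on_task(model_name: str, lr: float = 1e-5, epochs: int = 1):
    """Fine-tune one model on the shared task examples before evaluation."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in SHARED_TASK_EXAMPLES:
            batch = tok(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model, tok

# Every model gets the same call, then the same benchmark:
# model, tok = prepare_on_task("gpt2")   # placeholder checkpoint
# score = run_benchmark(model, tok)      # hypothetical evaluation helper
```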
The anecdote holds a lesson for the evaluation of large language models half a century later. Knowledge about the evaluation conditions necessarily influences training practices under competitive pressure. It may be a fool’s errand to prohibit the practice.
November 12, 2025 at 3:56 PM
But the hotly debated results of the Games did not lead the organizers to prohibit training at natural altitude. Instead, they let everyone do it, and athletes came to consider altitude training an excellent way to train.
November 12, 2025 at 3:56 PM
"The 1968 Olympics took place in Mexico City at the significant altitude of 2340 meters, higher than Australia’s tallest peak. Runners who had trained at altitude in their home countries were better prepared to compete in Mexico City’s conditions, as it turned out.
November 12, 2025 at 3:56 PM
- Cool to see this being done on the French supercomputer Jean Zay
November 11, 2025 at 3:59 PM
- They don't release any code, and the method description stays quite high-level: for example, I am curious how they fine-tuned their models and would love to learn more about how they set up their synthetic data pipeline. Looking forward to the full report.
November 11, 2025 at 3:59 PM
- They only evaluate on MMLU, GSM8K, and HotPotQA. This seems cherry-picked; I wonder how their dataset performs on other standard benchmarks. They say that they basically skip pre-training and go straight to post-training.
November 11, 2025 at 3:59 PM
- Seems like a cool case study pushing really small models to their limits (an MMLU score of 30 for a 56M-parameter model)
November 11, 2025 at 3:59 PM
- Co-author Gary Marcus notes he doesn't agree with every detail but signed on to support better articulation of what AGI means. The equal 10% weighting across domains is one choice among many reasonable configurations, though the paper argues for prioritizing breadth over depth.
November 10, 2025 at 3:56 PM
For instance, GPT-5 reaches 70.8% on visual reasoning tasks where humans average 88.9%, yet scores 0% on adaptation tasks that test flexible rule inference.
November 10, 2025 at 3:56 PM
- The framework reveals a "jagged" cognitive profile where models excel in knowledge-intensive domains but have critical deficits in foundational machinery.
November 10, 2025 at 3:56 PM
Models compensate by expanding context windows, but the paper calls this a "capability contortion" that masks the absence of genuine experiential memory.
November 10, 2025 at 3:56 PM
- Both GPT-4 and GPT-5 score exactly 0% on long-term memory storage. This isn't a bug but an architectural constraint of transformer models, where attention mechanisms scale quadratically with context length.
November 10, 2025 at 3:56 PM
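A small illustration of the scaling claim, not taken from the paper: the attention score matrix has one entry per pair of tokens, so its size grows with the square of the context length. The dimensions and context lengths below are arbitrary.

```python
# Doubling the number of tokens quadruples the attention score matrix.
import numpy as np

def attention_scores(n_tokens: int, d_model: int = 64) -> np.ndarray:
    q = np.random.randn(n_tokens, d_model)
    k = np.random.randn(n_tokens, d_model)
    return q @ k.T / np.sqrt(d_model)   # shape (n_tokens, n_tokens)

for n in (1_000, 2_000, 4_000):
    print(n, attention_scores(n).nbytes / 1e6, "MB")  # 8 -> 32 -> 128 MB
```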
The framework tests ten core domains: general knowledge, reading and writing, math, reasoning, working memory, long-term memory storage, memory retrieval, visual processing, auditory processing, and speed. Applying this to current models reveals GPT-4 scores 27% and GPT-5 scores 58%.

My take:
November 10, 2025 at 3:56 PM
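For concreteness, here is a toy sketch of how the equal 10% weighting aggregates into an overall score. Only the ten-domain structure comes from the paper as summarized above; the example per-domain numbers are made up.

```python
# Equal 10% weight per domain means the overall score is just the mean of
# the ten domain scores (each on a 0-100 scale).
DOMAINS = [
    "general knowledge", "reading and writing", "math", "reasoning",
    "working memory", "long-term memory storage", "memory retrieval",
    "visual processing", "auditory processing", "speed",
]

def overall_score(domain_scores: dict[str, float]) -> float:
    assert set(domain_scores) == set(DOMAINS)
    return sum(domain_scores.values()) / len(DOMAINS)

# A model acing five domains and scoring zero on the other five lands at 50.
example = {d: (100.0 if i < 5 else 0.0) for i, d in enumerate(DOMAINS)}
print(overall_score(example))   # 50.0
```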
A who's who of AI, 33 researchers from institutions including Berkeley, MIT, Stanford, and Oxford, among them Yoshua Bengio, Eric Schmidt, Gary Marcus, and Max Tegmark, developed a quantifiable framework grounded in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition.
November 10, 2025 at 3:56 PM
The term AGI acts as a constantly moving goalpost, with criteria shifting as AI systems master tasks once thought to require human intellect. This ambiguity obscures how far we actually are from human-level cognition.
November 10, 2025 at 3:56 PM