Mathieu Acher
macher.bsky.social
Chess-loving professor and researcher who champions the integration of software engineering and AI for reproducible science.
Diving deep into software variability spaces, from Airbus to Linux.
@rennesuniv.bsky.social #INSA #IUF @InstUnivFr @Inria #IRISA

Blog post: blog.mathieuacher.com/GPTReasoning...
Code: github.com/acherm/gptch...
with deeper insights, such as:
* o3 can sometimes synthesize code to play chess, but fails
* o3-high seems a special beast, but it is an unreliable model (an illegal move may occur after 10 moves) and costs $15 for a game!
General-Purpose AI in the Endgame: The Chess Limitations of o3/o4-mini
o3 and o4-mini are large language models recently released by OpenAI and augmented with chain-of-thought reinforcement learning, designed to “think before they speak” by generating explicit, multi-st...
blog.mathieuacher.com
June 26, 2025 at 3:31 PM
One new element in the #Devoxx video concerns this strange behavior of gpt-3.5-turbo-instruct. It remains to be seen whether it can be reproduced ;) It is fairly closely related to another series of experiments where I showed how to win systematically in 4 or 7 moves blog.mathieuacher.com/ChessWinning... 3/3
May 10, 2025 at 9:34 PM
The two videos on YouTube:
- #Devoxx www.youtube.com/watch?v=bO96...
- the original video www.youtube.com/watch?v=6D1X..., which is longer and has time to explain (among other things) my experiments
blog.mathieuacher.com/GPTsChessElo... 2/3
www.youtube.com
May 10, 2025 at 9:34 PM
I like the simple examples given throughout the talk that give an intuition of the complexity problems. The kinds of issues mentioned are not necessarily new, but are very well articulated.
April 2, 2025 at 8:56 AM
Final thoughts?

✅ Reproducibility matters—always verify results.
✅ Replicability matters even more.
✅ Depth sensitivity and domain specificities are critical in SE.
✅ MT needs refinement.
Study:
hal.science/hal-04943474v2
(published at IST journal)
Blog post: blog.mathieuacher.com/Reproducibil...
Re-evaluating Metamorphic Testing of Chess Engines: A Replication Study
Context: This study aims to confirm, replicate and extend the findings of a previous article entitled ”Metamorphic Testing of Chess Engines” that reported inconsistencies in the analyses provided by S...
hal.science
March 20, 2025 at 10:41 AM
A call to refine, not dismiss.

MT is powerful & could work well for LLM-based chess engines. But for Stockfish, MRs must account for depth & move ordering.
March 20, 2025 at 10:41 AM
Key takeaway?

🚨 The original study didn't parameterize metamorphic relations by depth!
Metamorphic testing (MT) needs depth-aware refinement—some violations at low depth have limited interest.
No impact on Stockfish despite alarming claims.
March 20, 2025 at 10:41 AM
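To make "parameterize metamorphic relations by depth" a bit more concrete, here is a minimal sketch, not the code from the study. It assumes a hypothetical `evaluate(fen, depth)` callable that returns a centipawn score, and a hypothetical tolerance of 50 centipawns; `toy_evaluate` is a made-up stand-in for a real engine so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def check_relation_by_depth(
    evaluate: Callable[[str, int], int],
    source: str,
    follow_up: str,
    depths: Iterable[int],
    tolerance_cp: int = 50,  # hypothetical threshold, not taken from the study
) -> List[Tuple[int, int, int]]:
    """Return (depth, source_score, follow_up_score) for each depth where the
    two scores differ by more than tolerance_cp."""
    violations = []
    for depth in depths:
        a = evaluate(source, depth)
        b = evaluate(follow_up, depth)
        if abs(a - b) > tolerance_cp:
            violations.append((depth, a, b))
    return violations


def toy_evaluate(position: str, depth: int) -> int:
    """Made-up evaluator: the follow-up position is biased at low depth and the
    bias shrinks as depth grows. It does not model any real engine."""
    bias = 120 if position == "follow-up" else 0
    return 30 + bias // depth


violations = check_relation_by_depth(toy_evaluate, "source", "follow-up", [1, 2, 4, 8])
print(violations)  # [(1, 30, 150), (2, 30, 90)]
```

Reporting violations per depth, rather than as a single yes/no, is what lets you separate low-depth noise from disagreements that persist at higher depth.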
Can we fix this? Yes!

We found where this happens exactly in the code. Symmetry can be enforced, but… it adds overhead/complexity.
March 20, 2025 at 10:41 AM
The culprit? Move ordering.

Stockfish orders legal moves differently depending on board symmetry. This affects search results at some depths.

❌ Not a bug, a feature of how the engine explores positions.
March 20, 2025 at 10:41 AM
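A toy illustration of why move order can change a search result. This is not Stockfish's code: it assumes a deliberately crude selective search that only looks at the first `max_moves` moves at each node, as a stand-in for order-dependent heuristics in general. The tree and its values are made up.

```python
def selective_minimax(node, max_moves, maximizing=True):
    """Minimax over a nested-list tree (leaves are ints), examining only the
    first max_moves children of each node. Later moves are never searched."""
    if isinstance(node, int):
        return node
    values = [
        selective_minimax(child, max_moves, not maximizing)
        for child in node[:max_moves]
    ]
    return max(values) if maximizing else min(values)


# Same three moves at the root, listed in two different orders.
tree = [[3, 5], [9, 8], [1, 2]]
reordered = [[1, 2], [3, 5], [9, 8]]

# Exhaustive search: the order does not matter.
print(selective_minimax(tree, 3), selective_minimax(reordered, 3))  # 8 8

# Selective search: the best move is cut off in the reordered list.
print(selective_minimax(tree, 2), selective_minimax(reordered, 2))  # 8 3
```

With an exhaustive search both orders agree. Once the search is selective, the answer depends on which moves happen to come first.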
🔎 A Chess Mystery

These mirrored positions should have the same evaluation, but at depth=20:
📊 Left: +0.66
📊 Right: -2.17
This is not just a low-depth issue—it rings a bell.
March 20, 2025 at 10:41 AM
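For readers who want to build such position pairs themselves, here is a minimal sketch of one possible mirroring: a left-right flip (a-file swapped with h-file) applied to a FEN string. Which symmetry the study uses is not stated in this post, so the choice is an assumption, and `mirror_fen_files` is a hypothetical helper name. The sketch rejects positions with castling rights, because this mirror does not preserve them.

```python
FILES = "abcdefgh"


def mirror_fen_files(fen: str) -> str:
    """Mirror a FEN position left to right (a-file <-> h-file)."""
    placement, side, castling, ep, halfmove, fullmove = fen.split()
    if castling != "-":
        raise ValueError("left-right mirroring does not preserve castling rights")
    # Each rank is a string of piece letters and single digits (1-8),
    # so reversing the string mirrors the rank.
    mirrored = "/".join(rank[::-1] for rank in placement.split("/"))
    if ep != "-":
        # Mirror the file of the en passant square, keep its rank.
        ep = FILES[7 - FILES.index(ep[0])] + ep[1]
    return " ".join([mirrored, side, castling, ep, halfmove, fullmove])


original = "8/8/8/4k3/8/8/P7/K7 w - - 0 1"
mirrored = mirror_fen_files(original)
print(mirrored)  # 8/8/8/3k4/8/8/7P/7K w - - 0 1
```

Mirroring twice gives back the original FEN, which is a cheap sanity check before feeding both positions to an engine.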