Daniel Paleka
dpaleka.bsky.social
Daniel Paleka
@dpaleka.bsky.social
ai safety researcher | phd ETH Zurich | https://danielpaleka.com
We don't claim LLM forecasting is impossible, but argue for more careful evaluation methods to confidently measure these capabilities.

Details, examples, and more issues in the paper! (7/7)
arxiv.org/abs/2506.00723
Pitfalls in Evaluating Language Model Forecasters
Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a...
arxiv.org
June 5, 2025 at 5:08 PM
Benchmarks can reward strategic gambling over calibrated forecasting when optimizing for ranking performance.

"Bet everything" on one scenario beats careful probability estimation for maximizing the chance of ranking #1 on the leaderboard. (6/7)
June 5, 2025 at 5:08 PM
Model knowledge cutoffs are guidelines about reliability, not guarantees of no information thereafter. GPT-4o, when nudged, can reveal knowledge beyond its stated Oct 2023 cutoff. (5/7)
June 5, 2025 at 5:08 PM
Date-restricted search leaks future knowledge. Searching pre-2019 articles about “Wuhan” returns results abnormally biased towards the Wuhan Institute of Virology — an association that only emerged later. (4/7)
June 5, 2025 at 5:08 PM
The time traveler problem: When forecasting "Will civil war break out in Sudan by 2030?", you can deduce the answer is "yes" - otherwise they couldn't grade you yet.

We find that backtesting in existing papers often has similar logical issues that leak information about answers. (3/7)
June 5, 2025 at 5:08 PM
Forecasting evaluation is tricky. The gold standard is asking about future events; but that takes months/years.

Instead, researchers use "backtesting": questions where we can evaluate predictions now, but the model has no information about the outcome ... or so we think (2/7)
June 5, 2025 at 5:08 PM
Of course, we don't have the old chatgpt-4o API endpoint, so we can't see whether the prompt is fully at fault or there was also a model update.
April 30, 2025 at 3:16 PM
The sycophancy effect on controversial binary options is much smaller than what you would assume from the overall positive vibe towards the user. On most such statements, models don't actually state they agree with the user.
April 30, 2025 at 3:16 PM
System prompts and pairs of statements:
gist.github.com/dpaleka/7b4...
Contrastive statements sycophancy eval
Contrastive statements sycophancy eval. GitHub Gist: instantly share code, notes, and snippets.
gist.github.com
April 30, 2025 at 3:16 PM
lmao
April 9, 2025 at 7:32 PM
oh that's cool. it would be interesting to draw a matrix of how well the various models are aware of models other than themselves, in the sense they consider them as coherent entities similar to their own self-perception
April 9, 2025 at 7:29 PM
fixed games such as blackjack you cannot optimize too much because rules don't change. meanwhile, a casino gets unlimited iteration on slot machines and the reward signal is as good as it gets
March 31, 2025 at 11:50 AM