Lightnews — Scholar-powered news

Daniel Paleka

@dpaleka.bsky.social

We don't claim LLM forecasting is impossible, but argue for more careful evaluation methods to confidently measure these capabilities.

Details, examples, and more issues in the paper! (7/7)
arxiv.org/abs/2506.00723

Pitfalls in Evaluating Language Model Forecasters

Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a...

arxiv.org

June 5, 2025 at 5:08 PM

Daniel Paleka

@dpaleka.bsky.social

Benchmarks can reward strategic gambling over calibrated forecasting when optimizing for ranking performance.

"Bet everything" on one scenario beats careful probability estimation for maximizing the chance of ranking #1 on the leaderboard. (6/7)

June 5, 2025 at 5:08 PM

Daniel Paleka

@dpaleka.bsky.social

Model knowledge cutoffs are guidelines about reliability, not guarantees of no information thereafter. GPT-4o, when nudged, can reveal knowledge beyond its stated Oct 2023 cutoff. (5/7)

June 5, 2025 at 5:08 PM

Daniel Paleka

@dpaleka.bsky.social

Date-restricted search leaks future knowledge. Searching pre-2019 articles about “Wuhan” returns results abnormally biased towards the Wuhan Institute of Virology — an association that only emerged later. (4/7)

June 5, 2025 at 5:08 PM

Daniel Paleka

@dpaleka.bsky.social

The time traveler problem: When forecasting "Will civil war break out in Sudan by 2030?", you can deduce the answer is "yes" - otherwise they couldn't grade you yet.

We find that backtesting in existing papers often has similar logical issues that leak information about answers. (3/7)

June 5, 2025 at 5:08 PM

Daniel Paleka

@dpaleka.bsky.social

Forecasting evaluation is tricky. The gold standard is asking about future events; but that takes months/years.

Instead, researchers use "backtesting": questions where we can evaluate predictions now, but the model has no information about the outcome ... or so we think (2/7)

June 5, 2025 at 5:08 PM

Daniel Paleka

@dpaleka.bsky.social

Of course, we don't have the old chatgpt-4o API endpoint, so we can't see whether the prompt is fully at fault or there was also a model update.

April 30, 2025 at 3:16 PM

Daniel Paleka

@dpaleka.bsky.social

The sycophancy effect on controversial binary options is much smaller than what you would assume from the overall positive vibe towards the user. On most such statements, models don't actually state they agree with the user.

April 30, 2025 at 3:16 PM

Daniel Paleka

@dpaleka.bsky.social

System prompts and pairs of statements:
gist.github.com/dpaleka/7b4...

Contrastive statements sycophancy eval

Contrastive statements sycophancy eval. GitHub Gist: instantly share code, notes, and snippets.

gist.github.com

April 30, 2025 at 3:16 PM

Daniel Paleka

@dpaleka.bsky.social

lmao

April 9, 2025 at 7:32 PM

Daniel Paleka

@dpaleka.bsky.social

oh that's cool. it would be interesting to draw a matrix of how well the various models are aware of models other than themselves, in the sense they consider them as coherent entities similar to their own self-perception

April 9, 2025 at 7:29 PM

Daniel Paleka

@dpaleka.bsky.social

fixed games such as blackjack you cannot optimize too much because rules don't change. meanwhile, a casino gets unlimited iteration on slot machines and the reward signal is as good as it gets

March 31, 2025 at 11:50 AM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news