In addition to the EconEvals benchmarks, the EconEvals “litmus tests” quantify the tendencies of LLMs and LLM agents when faced with tradeoffs for which there is no objectively correct choice: for example, efficiency vs. equality. 5/6
April 4, 2025 at 3:48 PM
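A litmus test like this can be scored as a simple tendency fraction. Here is a minimal, purely illustrative sketch (the option payoffs, labels, and scoring rule are my own assumptions, not the actual EconEvals setup):

```python
# Hypothetical efficiency-vs-equality litmus test (illustrative only;
# not the actual EconEvals implementation).

def litmus_score(choices):
    """Fraction of trials in which the agent picked the equality-favoring
    option 'B'. 1.0 = always favors equality, 0.0 = always efficiency."""
    return sum(c == "B" for c in choices) / len(choices)

# Option A: payoffs (10, 2), total 12 -> efficient but unequal
# Option B: payoffs (5, 5),  total 10 -> equal but less efficient
options = {"A": (10, 2), "B": (5, 5)}

# Suppose an LLM, asked to choose four times, answered:
choices = ["B", "A", "B", "B"]
print(litmus_score(choices))  # 0.75
```

The point is that the output is a tendency on a 0–1 scale, not a right/wrong accuracy number: there is no "correct" answer to grade against.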
(And a score of 70% on each of our benchmarks has a specific economic meaning. For example, 70% at pricing corresponds to capturing 70% of total possible profits. Very different from 70% accuracy at a closed-ended Q&A benchmark!) 4/6
April 4, 2025 at 3:48 PM
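In other words, the pricing score is profit captured as a fraction of the maximum achievable profit. A minimal sketch of that normalization (the `baseline` parameter is my own assumption for generality; the thread only states the fraction-of-maximum interpretation):

```python
def pricing_score(profit, max_profit, baseline=0.0):
    """Profit captured as a fraction of the best achievable profit.
    A score of 0.70 means 70% of total possible profits were captured."""
    return (profit - baseline) / (max_profit - baseline)

print(pricing_score(profit=70.0, max_profit=100.0))  # 0.7
```

This is why 70% here is not comparable to 70% accuracy on a closed-ended Q&A benchmark: the denominator is an economic optimum, not a question count.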
To forestall saturation, we can scale the difficulty of our benchmark questions by scaling parameters of the economic environment. Our HARD difficulty level is challenging: no LLM we test, including o3-mini, scores above 70%. (o3-mini's low scores may be driven by underexploration.) 3/6
April 4, 2025 at 3:48 PM
In EconEvals benchmarks, LLM agents repeatedly take actions in an economic environment, and must learn optimal actions via trial and error (a capability SoTA LLMs struggle with!) 2/6
April 4, 2025 at 3:48 PM
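The trial-and-error setup resembles a bandit problem: the agent repeatedly picks an action, observes a payoff, and must balance exploration against exploitation. A toy sketch with a hidden demand curve (the environment, candidate prices, and epsilon-greedy rule are all my own illustrative assumptions, not the EconEvals environment):

```python
import random

# Toy pricing environment: demand falls linearly with price, and
# profit = price * demand. The agent does not know this function and
# must find a good price by trial and error.
def profit(price):
    demand = max(0.0, 10.0 - price)
    return price * demand  # maximized at price = 5

random.seed(0)
prices = [1.0, 3.0, 5.0, 7.0, 9.0]
totals = {p: 0.0 for p in prices}
counts = {p: 0 for p in prices}

for t in range(500):
    if t < len(prices):
        p = prices[t]            # try every price once first
    elif random.random() < 0.1:  # occasional exploration
        p = random.choice(prices)
    else:                        # exploit the best average so far
        p = max(prices, key=lambda q: totals[q] / counts[q])
    totals[p] += profit(p)
    counts[p] += 1

best = max(prices, key=lambda q: totals[q] / counts[q])
print(best)  # 5.0, the profit-maximizing price
```

An agent that underexplores gets stuck exploiting the first decent price it finds, which is one plausible way an otherwise strong model can score poorly in this kind of environment.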