In addition to the EconEvals benchmarks, the EconEvals “litmus tests” quantify the tendencies of LLMs and LLM agents when faced with tradeoffs for which there is no objectively correct choice: for example, efficiency vs. equality. 5/6
April 4, 2025 at 3:48 PM
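A litmus test like this can be scored as a simple tendency fraction. Here is a minimal, purely illustrative sketch (the option payoffs, labels, and scoring rule are my own assumptions, not the actual EconEvals setup):

```python
# Hypothetical efficiency-vs-equality litmus test (illustrative only;
# not the actual EconEvals implementation).

def litmus_score(choices):
    """Fraction of trials in which the agent picked the equality-favoring
    option 'B'. 1.0 = always favors equality, 0.0 = always efficiency."""
    return sum(c == "B" for c in choices) / len(choices)

# Option A: payoffs (10, 2), total 12 -> efficient but unequal
# Option B: payoffs (5, 5),  total 10 -> equal but less efficient
options = {"A": (10, 2), "B": (5, 5)}

# Suppose an LLM, asked to choose four times, answered:
choices = ["B", "A", "B", "B"]
print(litmus_score(choices))  # 0.75
```

The point is that the output is a tendency on a 0–1 scale, not a right/wrong accuracy number: there is no "correct" answer to grade against.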
(And a score of 70% on each of our benchmarks has a specific economic meaning. For example, 70% at pricing corresponds to capturing 70% of total possible profits. Very different from 70% accuracy at a closed-ended Q&A benchmark!) 4/6
April 4, 2025 at 3:48 PM
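In other words, the pricing score is profit captured as a fraction of the maximum achievable profit. A minimal sketch of that normalization (the `baseline` parameter is my own assumption for generality; the thread only states the fraction-of-maximum interpretation):

```python
def pricing_score(profit, max_profit, baseline=0.0):
    """Profit captured as a fraction of the best achievable profit.
    A score of 0.70 means 70% of total possible profits were captured."""
    return (profit - baseline) / (max_profit - baseline)

print(pricing_score(profit=70.0, max_profit=100.0))  # 0.7
```

This is why 70% here is not comparable to 70% accuracy on a closed-ended Q&A benchmark: the denominator is an economic optimum, not a question count.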
To forestall saturation, we can scale the difficulty of our benchmark questions by scaling parameters of the economic environment. Our HARD difficulty level is challenging: no LLM we test, including o3-mini, scores above 70%. (o3-mini's low scores may be driven by underexploration.) 3/6
April 4, 2025 at 3:48 PM
In EconEvals benchmarks, LLM agents repeatedly take actions in an economic environment, and must learn optimal actions via trial and error (a capability SoTA LLMs struggle with!) 2/6
April 4, 2025 at 3:48 PM
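The trial-and-error setup resembles a bandit problem: the agent repeatedly picks an action, observes a payoff, and must balance exploration against exploitation. A toy sketch with a hidden demand curve (the environment, candidate prices, and epsilon-greedy rule are all my own illustrative assumptions, not the EconEvals environment):

```python
import random

# Toy pricing environment: demand falls linearly with price, and
# profit = price * demand. The agent does not know this function and
# must find a good price by trial and error.
def profit(price):
    demand = max(0.0, 10.0 - price)
    return price * demand  # maximized at price = 5

random.seed(0)
prices = [1.0, 3.0, 5.0, 7.0, 9.0]
totals = {p: 0.0 for p in prices}
counts = {p: 0 for p in prices}

for t in range(500):
    if t < len(prices):
        p = prices[t]            # try every price once first
    elif random.random() < 0.1:  # occasional exploration
        p = random.choice(prices)
    else:                        # exploit the best average so far
        p = max(prices, key=lambda q: totals[q] / counts[q])
    totals[p] += profit(p)
    counts[p] += 1

best = max(prices, key=lambda q: totals[q] / counts[q])
print(best)  # 5.0, the profit-maximizing price
```

An agent that underexplores gets stuck exploiting the first decent price it finds, which is one plausible way an otherwise strong model can score poorly in this kind of environment.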