The graph also incorrectly showed a [0,0] CI.
We’ve updated the blog post to show a new figure, which more accurately conveys what we observed.
- Core functionality errors
- Poor test coverage
- Missing/incorrect documentation
- Linting/formatting violations
- Other quality issues (verbosity, brittleness, poor maintainability)
All agent attempts contain at least 3 of these issues!
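As a purely hypothetical illustration of how such a rubric can be tallied per attempt, here is a minimal sketch; the category keys and the example attempt annotations are invented for illustration and are not our actual review data.

```python
# Hypothetical sketch: tally which quality-issue categories each agent
# attempt exhibits, and check the "at least 3 of these issues" observation.
ISSUE_CATEGORIES = {
    "core_functionality",
    "test_coverage",
    "documentation",
    "lint_format",
    "other_quality",
}

# Invented reviewer annotations: which categories each attempt violated.
attempt_issues = {
    "attempt_a": {"core_functionality", "test_coverage", "documentation"},
    "attempt_b": {"test_coverage", "lint_format", "other_quality", "documentation"},
}

for attempt, issues in sorted(attempt_issues.items()):
    unknown = issues - ISSUE_CATEGORIES
    assert not unknown, f"{attempt} has unrecognized categories: {unknown}"
    print(f"{attempt}: {len(issues)} issue categories flagged")
    # The observation above corresponds to this check holding for every attempt.
    assert len(issues) >= 3
```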
We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality.
In new research, we find that the chain of thought (CoT) is informative about LLM cognition as long as the cognition is complex enough that it can’t be performed in a single forward pass.
We conclude that this seems unlikely. However, capabilities continue to advance rapidly, and models display increasing eval awareness.
The paper is also live on arXiv, with two new sections: one discussing alternative uncertainty estimation methods, and another introducing a new 'bias from developer recruitment' factor whose effect on the slowdown is unclear.
The median doubling time across 9 benchmarks is ~4 months (range: 2.5–17 months).
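For readers who want to reproduce the arithmetic, here is a minimal sketch of how a doubling time can be backed out from two measurements under an assumed exponential trend; the function name and the numbers are hypothetical, not taken from our benchmark data.

```python
import math

def doubling_time_months(start_value, end_value, elapsed_months):
    """Doubling time implied by exponential growth between two measurements."""
    return elapsed_months * math.log(2) / math.log(end_value / start_value)

# Hypothetical example: a metric that grows 8x over 12 months
# implies a doubling time of 12 * ln(2) / ln(8) = 4 months.
print(doubling_time_months(10.0, 80.0, 12.0))  # ~4.0
```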
We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
1. Our setting represents all (or potentially even most) software engineering.
2. Future models won't be better (or current models can’t be used more effectively).