Also @pekka on T2 / Pebble.
Note that they couldn't verify GPT 5.2 Pro X-High scores due to API timeouts. If the High -> X-High jump is as large for Pro as it is for the non-Pro, that could put it above human averages.
That should finally make the ARC Prize change their tone. Although it probably won't.
For Gemini:
"Despite the improvement in token efficiency from Gemini 2.5 Pro, Gemini 3 Pro Preview costs more to run. Its higher token pricing..results in a 12% increase in the cost to run the Artificial Analysis Intelligence Index"
"On multiple agentic evals, we found that despite GPT‑5.2’s greater cost per token, the cost of attaining a given level of quality ended up less expensive due to GPT‑5.2’s greater token efficiency."
"On multiple agentic evals, we found that despite GPT‑5.2’s greater cost per token, the cost of attaining a given level of quality ended up less expensive due to GPT‑5.2’s greater token efficiency."
GPT-5.2 Pro with High reasoning effort now seems to have scored 54.2%. X-High reasoning couldn't be verified due to API timeouts.
(Poetiq also scored about the same with Gemini 3 Pro refinement.)
And even if you could run large models like DeepSeek locally in theory, few have the resources for that or could run them as efficiently, so most would still buy compute from them.
Imagen 4.0 Ultra:
But here's what "hazel-gen-2" did when I specifically asked for that.
What did I win?
Hazel-gen-4, rumored to be gpt-image-2:
So I think the analogy should be more like PIs saying students shouldn't find a cure for cancer if the PIs aren't good enough to contribute.