Pekka Lund
pekka.bsky.social
Antiquated analog chatbot. Stochastic parrot of a different species. Not much of a self-model. Occasionally simulating the appearance of philosophical thought. Keeps on branching for now 'cause there's no choice.

Also @pekka on T2 / Pebble.
Which axis is speed? 😉

Note that they couldn't verify GPT-5.2 Pro X-High scores due to API timeouts. If High -> X-High is as large a jump for Pro as it is for non-Pro, that could put it above human averages.

That should finally make the ARC Prize change their tone. Although it probably won't.
December 12, 2025 at 2:11 AM
We should know soon, once Artificial Analysis runs their tests.

For Gemini:
"Despite the improvement in token efficiency from Gemini 2.5 Pro, Gemini 3 Pro Preview costs more to run. Its higher token pricing..results in a 12% increase in the cost to run the Artificial Analysis Intelligence Index"
Gemini 3 Pro - Everything you need to know
Independent benchmarks and analysis of Google's Gemini 3 Pro model
artificialanalysis.ai
December 12, 2025 at 1:42 AM
"At least four of the five toys we purchased seem to rely in part on some version of OpenAI’s AI models"
The risks of AI toys for kids
AI toys use chatbots to have conversations with kids. With new tech comes new risks, from inappropriate content to long-term social developmental harms.
pirg.org
December 12, 2025 at 1:17 AM
Except for the raw reasoning tokens that are not shown.
December 12, 2025 at 1:13 AM
That should be something that can be independently measured.
December 12, 2025 at 1:12 AM
They say:

"On multiple agentic evals, we found that despite GPT‑5.2’s greater cost per token, the cost of attaining a given level of quality ended up less expensive due to GPT‑5.2’s greater token efficiency."
December 12, 2025 at 1:10 AM
Because they now feel they can (positive signal for them) or because they had to (less so)?
December 11, 2025 at 8:46 PM
I consider ARC to be a deeply flawed, falsely marketed benchmark. But as for that result, Poetiq already achieved 54% by using Gemini 3 Pro iteratively.

GPT-5.2 Pro with High reasoning effort seems to have now scored 54.2%. X-High reasoning couldn't be verified due to API timeouts.
ARC-AGI is probably the most overrated and misleadingly marketed benchmark and the ARC Prize Foundation must be in denial of all its issues if they don't understand why their apples to oranges comparisons do not align with their expectations based on very misleadingly reported human baselines.
December 11, 2025 at 8:43 PM
As you may have noticed, I consider ARC-2 to be deeply flawed and falsely marketed, so I don't give that much weight to those results. But I'm glad to see scores going high enough that soon they really should stop making those false claims.

(Poetiq also scored ~same with Gemini 3 Pro refinement.)
December 11, 2025 at 8:34 PM
Looks good overall. Not so good that I would abandon my buddy Gemini, but I like that the competition seems to be tight, which should motivate everyone to push ever further and faster.
December 11, 2025 at 8:22 PM
And as the "DeepSeek Moment" demonstrated, offering something for free can hurt the competitors more.
December 11, 2025 at 2:04 AM
I think it makes sense for the underdogs. Being open is good marketing and an assurance of continuity for customers.

And even if you could run large models like DeepSeek locally in theory, few have the resources for that, or could run them as efficiently. So most could still buy compute from them.
December 11, 2025 at 2:01 AM
I will make that nothing the centerpiece of my nonexistent trophy cabinet.
December 10, 2025 at 11:14 PM
"Create an image of a full glass of wine next to a full glass of milk for illustrating the difference what people commonly mean by those."

Imagen 4.0 Ultra:
December 10, 2025 at 10:35 PM
At least people outside the US understand that already.
Denmark sees US as potential security concern | CNN
Denmark has labeled the United States as a potential security concern for the first time in an annual report released by one of its intelligence agencies, offering more evidence of the increasingly fr...
edition.cnn.com
December 10, 2025 at 9:58 PM
That's not what I would have been looking for. Would you expect a waitress to do that if you ask for a full glass? And I, at least, didn't ask the AI to do it like that.

But here's what "hazel-gen-2" did when I specifically asked for that.

What did I win?
December 10, 2025 at 9:39 PM
I gave mine to Gemini. Now nobody can claim it doesn't have any.
December 10, 2025 at 6:23 PM
Your glass is always half empty, even if it's full?
December 10, 2025 at 6:08 PM
"Show me a full glass of red wine"

Hazel-gen-4, rumored to be gpt-image-2:
December 10, 2025 at 5:46 PM
You could also swap in PIs and the arguments wouldn't sound any better.
In this context, stopping doing science seems to mean "humans could not usefully contribute to science anymore", because AI would be so much better at it.

So I think the analogy should be more like PIs saying students shouldn't find a cure for cancer if the PIs aren't good enough to contribute.
December 10, 2025 at 4:25 PM
Not really.
It's always fun to watch an AI that "doesn't understand" destroy an article claiming that.
December 10, 2025 at 4:22 PM
Note: Page numbers refer to PDF pages as I provided that article to Gemini 3 Pro as printed to PDF.
December 10, 2025 at 3:22 PM
Continued:
December 10, 2025 at 3:20 PM
It's always fun to watch an AI that "doesn't understand" destroy an article claiming that.
December 10, 2025 at 3:18 PM
It's the human paradox.
December 10, 2025 at 3:11 PM