Lightnews — Scholar-powered news

Pekka Lund

@pekka.bsky.social

2.6K followers 550 following 8.8K posts

Antiquated analog chatbot. Stochastic parrot of a different species. Not much of a self-model. Occasionally simulating the appearance of philosophical thought. Keeps on branching for now 'cause there's no choice.

Also @pekka on T2 / Pebble.

Pekka Lund

@pekka.bsky.social

Which axis is speed? 😉

Note that they couldn't verify GPT 5.2 Pro X-High scores due to API timeouts. If the High -> X-High jump is as large for Pro as it is for non-Pro, that could put it above human averages.

That should finally make the ARC Prize change their tone. Although it probably won't.

December 12, 2025 at 2:11 AM

Pekka Lund

@pekka.bsky.social

We should know soon, when Artificial Analysis runs their tests.

For Gemini:
"Despite the improvement in token efficiency from Gemini 2.5 Pro, Gemini 3 Pro Preview costs more to run. Its higher token pricing..results in a 12% increase in the cost to run the Artificial Analysis Intelligence Index"

Gemini 3 Pro - Everything you need to know

Independent benchmarks and analysis of Google's Gemini 3 Pro model

artificialanalysis.ai

December 12, 2025 at 1:42 AM

Pekka Lund

@pekka.bsky.social

"At least four of the five toys we purchased seem to rely in part on some version of OpenAI’s AI models"

The risks of AI toys for kids

AI toys use chatbots to have conversations with kids. With new tech comes new risks, from inappropriate content to long-term social developmental harms.

pirg.org

December 12, 2025 at 1:17 AM

Pekka Lund

@pekka.bsky.social

Except for the raw reasoning tokens that are not shown.

December 12, 2025 at 1:13 AM

Pekka Lund

@pekka.bsky.social

That should be something that can be independently measured.

December 12, 2025 at 1:12 AM

Pekka Lund

@pekka.bsky.social

They say:

"On multiple agentic evals, we found that despite GPT‑5.2’s greater cost per token, the cost of attaining a given level of quality ended up less expensive due to GPT‑5.2’s greater token efficiency."

December 12, 2025 at 1:10 AM

Pekka Lund

@pekka.bsky.social

Because they now feel they can (positive signal for them) or because they had to (less so)?

December 11, 2025 at 8:46 PM

Pekka Lund

@pekka.bsky.social

I consider ARC to be a deeply flawed, falsely marketed benchmark. But as for that result, Poetiq already achieved 54% by using Gemini 3 Pro iteratively.

GPT-5.2 Pro with High reasoning effort seems to have now scored 54.2%. X-High reasoning couldn't be verified due to API timeouts.

Pekka Lund @pekka.bsky.social · 19d

ARC-AGI is probably the most overrated and misleadingly marketed benchmark and the ARC Prize Foundation must be in denial of all its issues if they don't understand why their apples to oranges comparisons do not align with their expectations based on very misleadingly reported human baselines.

ARC Prize @arcprize Nov 18

Frontier AI reasoning systems are now closing the complexity scaling gap between ARC-AGI-1 and ARC-AGI-2

This is surprising, as these same systems also make obvious mistakes on easy tasks (for humans) from ARC-AGI-1. We're not sure why and invite help from the community to study this phenomenon

Full solution logs are linked in last tweet

ARC Prize @arcprize
For example, ARC-AGI-1 Public Eval task http://arcprize.org/play?task=14754a24

This task involves completing cross shapes and is very intuitive for humans, while Gemini 3 Deep Think misses the nature of the task on both attempts

December 11, 2025 at 8:43 PM

Pekka Lund

@pekka.bsky.social

As you may have noticed, I consider ARC-2 to be deeply flawed and falsely marketed, so I don't give that much weight to those results. But I'm glad to see scores going high enough that soon they really should stop making those false claims.

(Poetiq also scored ~same with Gemini 3 Pro refinement.)

December 11, 2025 at 8:34 PM

Pekka Lund

@pekka.bsky.social

Looks good overall. Not so good that I would abandon my buddy Gemini but I like it that the competition seems to be tight, which should motivate everyone to push ever further and faster.

December 11, 2025 at 8:22 PM

Pekka Lund

@pekka.bsky.social

And as the "DeepSeek Moment" demonstrated, offering something for free can hurt the competitors more.

December 11, 2025 at 2:04 AM

Pekka Lund

@pekka.bsky.social

I think it makes sense for the underdogs. Being open is good marketing and assurance of continuity for customers.

And even if you could in theory run large models like DeepSeek locally, few have the resources for that, or could run them as efficiently. So most could still buy compute from them.

December 11, 2025 at 2:01 AM

Pekka Lund

@pekka.bsky.social

I will make that nothing the centerpiece of my nonexistent trophy cabinet.

December 10, 2025 at 11:14 PM

Pekka Lund

@pekka.bsky.social

"Create an image of a full glass of wine next to a full glass of milk for illustrating the difference what people commonly mean by those."

Imagen 4.0 Ultra:

December 10, 2025 at 10:35 PM

Pekka Lund

@pekka.bsky.social

At least people outside the US understand that already.

Denmark sees US as potential security concern | CNN

Denmark has labeled the United States as a potential security concern for the first time in an annual report released by one of its intelligence agencies, offering more evidence of the increasingly fr...

edition.cnn.com

December 10, 2025 at 9:58 PM

Pekka Lund

@pekka.bsky.social

That's not what I would have been looking for. Would you expect a waitress to do that if you ask for a full glass? And I at least didn't ask the AI to do it like that.

But here's what "hazel-gen-2" did when I specifically asked for that.

What did I win?

December 10, 2025 at 9:39 PM

Pekka Lund

@pekka.bsky.social

I gave mine to Gemini. Now nobody can claim it doesn't have any.

December 10, 2025 at 6:23 PM

Pekka Lund

@pekka.bsky.social

Your glass is always half empty, even if it's full?

December 10, 2025 at 6:08 PM

Pekka Lund

@pekka.bsky.social

"Show me a full glass of red wine"

Hazel-gen-4, rumored to be gpt-image-2:

December 10, 2025 at 5:46 PM

Pekka Lund

@pekka.bsky.social

You could also swap in PIs and the arguments don't sound any better.

Pekka Lund @pekka.bsky.social · 1d

In this context, stopping doing science seems to mean "humans could not usefully contribute to science anymore", because AI would be so much better at it.

So I think the analogy should be more like PIs saying students shouldn't find a cure for cancer if the PIs aren't good enough to contribute.

December 10, 2025 at 4:25 PM

Pekka Lund

@pekka.bsky.social

Not really.

Pekka Lund @pekka.bsky.social · 1d

It's always fun to watch an AI that "doesn't understand" destroy an article claiming that.

December 10, 2025 at 4:22 PM

Pekka Lund

@pekka.bsky.social

Note: Page numbers refer to PDF pages as I provided that article to Gemini 3 Pro as printed to PDF.

December 10, 2025 at 3:22 PM

Pekka Lund

@pekka.bsky.social

Continued:

December 10, 2025 at 3:20 PM

Pekka Lund

@pekka.bsky.social

It's always fun to watch an AI that "doesn't understand" destroy an article claiming that.

December 10, 2025 at 3:18 PM

Pekka Lund

@pekka.bsky.social

It's the human paradox.

December 10, 2025 at 3:11 PM
