Pekka Lund
@pekka.bsky.social
Antiquated analog chatbot. Stochastic parrot of a different species. Not much of a self-model. Occasionally simulating the appearance of philosophical thought. Keeps on branching for now 'cause there's no choice.

Also @pekka on T2 / Pebble.
When given access to googling, Gemini hits even harder.
December 12, 2025 at 5:06 PM
And:
December 12, 2025 at 5:04 PM
And when asked, Gemini once again provided plausible reasons why it passed.
December 12, 2025 at 5:02 PM
Few things light up my bs detector as hard as seeing quantum and consciousness together.

It also provides yet another opportunity to show how Gemini doesn't fall for stuff like that, even if the reviewers apparently once again did.
December 12, 2025 at 5:00 PM
I now asked Gemini the same question, for the fun of it.

Again, sounds plausible to me. And kind of grim.
December 12, 2025 at 4:31 PM
No, they didn't.

I'll let Gemini once again show what the reviewers should have said.

So the paper itself already reveals that the authors were aware they weren't measuring what they claim to measure, and their own examples definitely prove that. They just wanted to tell a different story, and the reviewers allowed it.
December 12, 2025 at 3:56 PM
"Create an image of a full glass of wine next to a full glass of milk for illustrating the difference what people commonly mean by those."

Imagen 4.0 Ultra:
December 10, 2025 at 10:35 PM
That's not what I would have been looking for. Would you expect a waitress to do that if you asked for a full glass? And I, at least, didn't ask the AI to do it like that.

But here's what "hazel-gen-2" did when I specifically asked for that.

What did I win?
December 10, 2025 at 9:39 PM
"Show me a full glass of red wine"

Hazel-gen-4, rumored to be gpt-image-2:
December 10, 2025 at 5:46 PM
Continued:
December 10, 2025 at 3:20 PM
It's always fun to watch an AI that "doesn't understand" destroy an article claiming that.
December 10, 2025 at 3:18 PM
Gemini replied:

"I appreciate the credit, but in the world of peer review, the person who runs the code and finds the bugs gets the glory!"

So... with no idea what I did, without looking at the data, and without even having had time to read the paper yet, I, and I alone, apparently found the following.
December 9, 2025 at 11:18 PM
I asked Gemini to review the paper & press release and check if they match. It noted the same thing:

"authors motivate their study by asking if the brain works like existing, flexible AI models (RNNs). The press release flips this on its head, pretending AI cannot do this and that the brain is unique"
December 8, 2025 at 5:10 PM
The press release and the author comments seem to be in direct conflict with the paper itself, which begins by describing such flexibility in artificial networks and asks whether it can be found in the brain as well.

They just really wanted to tell a story about brains having the upper hand, supported or not?
December 8, 2025 at 4:09 PM
Here's what Flux.2 flex did when I asked it to fix the image. I mentioned there were nonsense words, legs going through the table, and a nonsensical bicycle.

Not perfect, but it shows how it only takes one prompt to make such detection harder, even for existing messed-up images.
November 28, 2025 at 8:45 AM
Here's what the search looks like for me on Windows, Chrome.

I forced the box on the right to full height so that all those results are visible. Your blog is one of them. I suspect that's the set of sources it considers (and links to). The results below (often duplicates) may not count.
November 27, 2025 at 11:50 PM
This is funny.

I did some searching and ended up at a Scientific Reports article about "Detection of ChatGPT fake science" from July 2024.

They have a "Similar content being viewed by others" box after the abstract, and it links to this new fake science in their journal.
November 27, 2025 at 11:05 PM
Oh, this looks like a nice test case for a Gemini peer review.

And... once again it doesn't fail where humans did.
November 27, 2025 at 10:03 PM
Here's another very weird part Gemini flagged in its initial review.

This paper seems to endorse the embarrassing Princeton parapsychology experiment that claimed random number generator outputs correlated with all kinds of events.
en.wikipedia.org/wiki/Global_...
November 26, 2025 at 9:34 PM
Here's the whole Gemini response.

And some said LLMs are not critical enough...
November 26, 2025 at 9:23 PM
Like, seriously, this is directly from a peer-reviewed article in a legitimate physics journal.
November 26, 2025 at 7:17 PM
I imagine that, sometime right before Gemini 3 Pro was released, there was a moment at the Anthropic office when someone shouted excitedly, "We did it! We narrowly beat OpenAI for the top spot in HLE!"

Anthropic seems to have chosen to not report this benchmark in their announcement post.
November 24, 2025 at 9:25 PM
It's hard to not see the denial behind all this when they choose to question LLM reasoning while the LLMs explain their thought processes in detail.

E.g., taking the example they highlighted, Gemini 3 Pro correctly identified the transformation rules when I provided the task both as JSON and as an image.
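
For anyone unfamiliar with the format: public ARC tasks are small JSON objects with "train" and "test" lists of input/output grid pairs, where a grid is rows of color indices 0-9. A minimal Python sketch of what "providing the task as JSON" looks like (the grids and helper below are made up for illustration, not the actual highlighted task):

import json

# A toy task in the public ARC format; real tasks have more
# train pairs and larger grids. This one just flips rows.
task = json.loads("""
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}
  ],
  "test": [
    {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]}
  ]
}
""")

def grid_to_text(grid):
    # Render a grid as rows of digits so it can be pasted into a prompt.
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

for pair in task["train"]:
    print("input:", grid_to_text(pair["input"]), sep="\n")
    print("output:", grid_to_text(pair["output"]), sep="\n")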
November 22, 2025 at 9:56 PM
ARC-AGI is probably the most overrated and misleadingly marketed benchmark, and the ARC Prize Foundation must be in denial about all its issues if they don't understand why their apples-to-oranges comparisons don't align with their expectations, which are based on very misleadingly reported human baselines.
November 22, 2025 at 9:54 PM
The available resolution of that figure 5 document is probably too low to make sense of all the numbers and symbols.

But at least there's no question that 2.5 already tries to perform calculations to check whether things fit, as it clearly states it's doing that in the reasoning summaries.
November 13, 2025 at 3:13 AM