Also @pekka on T2 / Pebble.
It also provides yet another opportunity to show how Gemini doesn't fall for stuff like that, even if the reviewers apparently once again did.
Again, sounds plausible to me. And kind of grim.
I'll let Gemini once again show what the reviewers should have said.
So the paper itself already reveals that the authors were aware they weren't measuring what they claim, and their own examples prove it. They just wanted to tell a different story, and the reviewers allowed it.
Imagen 4.0 Ultra:
But here's what "hazel-gen-2" did when I specifically asked for that.
What did I win?
Hazel-gen-4, rumored to be gpt-image-2:
"I appreciate the credit, but in the world of peer review, the person who runs the code and finds the bugs gets the glory!"
So... with no idea what I did, without looking at the data, and without even having had time to read the paper yet, I, and I alone, apparently found the following.
"I appreciate the credit, but in the world of peer review, the person who runs the code and finds the bugs gets the glory!"
So... with no idea what I did, without looking at the data, not having the time to even read the paper yet, I, and I alone, apparently found the following.
"authors motivate their study by asking if the brain works like existing, flexible AI models (RNNs). The press release flips this on its head, pretending AI cannot do this and that the brain is unique"
"authors motivate their study by asking if the brain works like existing, flexible AI models (RNNs). The press release flips this on its head, pretending AI cannot do this and that the brain is unique"
They just really wanted to tell a story about brains having the upper hand, supported or not?
Not perfect, but it shows how it only takes one prompt to make such detection harder, even for existing messed-up images.
I forced the box on the right to full height so that all those results are visible. Your blog is one of them. I suspect that's the set of sources it considers (and links to). The results below (often duplicates) may not count.
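(For reference, a tweak like that can be done from the devtools console. A minimal sketch, where ".sources-panel" is a hypothetical selector standing in for whatever class the box actually uses:)

```ts
// Run in the browser devtools console.
// ".sources-panel" is a hypothetical selector; inspect the page
// to find the real class name of the results box.
const panel = document.querySelector<HTMLElement>(".sources-panel");
if (panel) {
  panel.style.maxHeight = "none";   // drop the collapsed-height cap
  panel.style.height = "auto";      // let the box grow to full content height
  panel.style.overflow = "visible"; // make sure every result renders
}
```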
I did some searching and ended up at a Scientific Reports article about "Detection of ChatGPT fake science" from July 2024.
They have a "Similar content being viewed by others" box after the abstract, and it links to this new fake science in their journal.
And... once again it doesn't fail where humans did.
This paper seems to endorse this embarrassing Princeton parapsychology experiment that tried to claim random number generator outputs correlated with all kinds of events.
en.wikipedia.org/wiki/Global_...
And some said LLMs are not critical enough...
Anthropic seems to have chosen not to report this benchmark in their announcement post.
E.g. taking the example they highlighted, Gemini 3 Pro correctly identified the transformation rules when I provided the task both as JSON and as an image.
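(For context, feeding a task through the API in both forms looks roughly like this. A minimal sketch using the @google/genai SDK, where the model id, the toy grids, and the task.png path are illustrative assumptions, not the actual benchmark task:)

```ts
import { readFileSync } from "node:fs";
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Toy ARC-style task: train pairs plus a test input, as plain JSON.
// (Illustrative grids only, not the task from the announcement.)
const task = {
  train: [
    { input: [[0, 1], [1, 0]], output: [[1, 0], [0, 1]] },
    { input: [[2, 0], [0, 2]], output: [[0, 2], [2, 0]] },
  ],
  test: [{ input: [[3, 3], [0, 3]] }],
};

const response = await ai.models.generateContent({
  model: "gemini-3-pro-preview", // assumed model id; check the current listing
  contents: [
    {
      role: "user",
      parts: [
        {
          text:
            "Identify the transformation rule in these train pairs and " +
            "apply it to the test input. Task as JSON:\n" +
            JSON.stringify(task),
        },
        // The same task rendered as an image, attached inline as base64.
        {
          inlineData: {
            mimeType: "image/png",
            data: readFileSync("task.png").toString("base64"),
          },
        },
      ],
    },
  ],
});

console.log(response.text);
```

(Sending both representations in one request also lets you check whether the answer changes with the modality.)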
But at least there's no question that 2.5 already tries to perform calculations to check whether things fit, as it clearly states it's doing so in the reasoning summaries.