Nanne van Noord
@nanne.bsky.social
Assistant Professor of Visual Culture and Multimedia at the University of Amsterdam. http://nanne.github.io
Wait, you get summaries of your own papers? That seems like a step up from the "I see you work on <insert topic I've not touched in my life>" emails at least
November 19, 2025 at 8:39 AM
Reposted by Nanne van Noord
And lastly, if @neuripsconf.bsky.social would choose to reverse the decisions on the papers affected by space constraints, we would be happy and able to accommodate their presentation
September 19, 2025 at 10:01 AM
You're arguing in bad faith, so this will be my last reply.

But yes, if you actually want to learn about multimodality then you shouldn't read about MLLMs.
July 27, 2025 at 8:02 PM
I'm not sure what the point here is, but if you're going to believe Gemini over actual research done by AI researchers, there isn't much more to discuss.

If you're willing to actually learn about this then you can start here: arxiv.org/abs/2505.19614, or even here: academic.oup.com/dsh/article/...
July 27, 2025 at 7:14 PM
That's a bit sealion-y, but I'll bite - *artificial* neural networks are a poor analogy.

Those different details also matter a lot, especially because the brain isn't just floating in a jar; it's part of an embodied system.
July 27, 2025 at 7:05 PM
This is where your misunderstanding is happening, as they are not elementary pieces. For the visual tokens a lot of the semantics have already been determined, and hence the interpretations it can arrive at are limited.

Brain analogy really doesn't hold here. NN != Brains.
July 27, 2025 at 2:59 PM
It's clearly not; neural nets are a poor analogy for the brain, and clearly don't work the same way.
July 27, 2025 at 2:54 PM
This, plus the (initial) interpretation of the modalities should not be independent - even at the pixel/word-level we may want to interpret differently depending on the other modalities (e.g., sense disambiguation)

Partial Information Decomposition has been used to formalise some of this
July 27, 2025 at 8:49 AM
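For context, a minimal sketch of what Partial Information Decomposition formalises here, in the bivariate form of Williams & Beer (reading vision as X_1 and text as X_2 is my illustrative assumption, not something stated above); the choice of redundancy measure is a separate modelling decision:

```latex
% Bivariate PID: the information two sources carry about a target Y
% splits into redundant, unique, and synergistic parts.
\[
I(X_1, X_2; Y) =
    \underbrace{\mathrm{Red}(X_1, X_2; Y)}_{\text{in both modalities}}
  + \underbrace{\mathrm{Unq}(X_1; Y)}_{\text{only in } X_1}
  + \underbrace{\mathrm{Unq}(X_2; Y)}_{\text{only in } X_2}
  + \underbrace{\mathrm{Syn}(X_1, X_2; Y)}_{\text{only in the combination}}
\]
```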
No.. that's not how any of that works 😵‍💫
July 27, 2025 at 8:13 AM
It means I said 'mix' to explain the process, but I obviously know this involves attention - so the Gemini explanation is not meaningfully different.

Potential is limited: if key visual info is missing, then attention won't recover it. So a lot of 'decisions' about the visual input are made before fusion
July 26, 2025 at 11:00 PM
Ah, I see how you and Gemini misunderstood. I was talking about extracting visual tokens, and 'mix' referred to attention.

That doesn't make it meaningfully multimodal; the potential of the visual tokens is still limited by the visual encoder.

Anyway, if I wanted to talk to an LLM I would do that directly
July 26, 2025 at 10:37 PM
Please do explain then how whatever you're referring to is different and actually meaningfully multimodal.
July 26, 2025 at 10:08 PM
*all semantic information* is quite the claim; in our experiments they miss a lot of the semantics from the visual modality

'Text space' in that, after the image encoder, the visual information is fixed and then mixed with text tokens for seq2text - which is not how multimodality works...
July 26, 2025 at 8:41 PM
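For readers following the thread: a minimal sketch of the pipeline being described, in PyTorch-flavoured code. All names (ToyMLLM, projector, the dimensions) are illustrative assumptions, not any specific model's API.

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Illustrative late-fusion MLLM pipeline; not a specific model."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a ViT, often frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps into 'text space'
        self.llm = llm                                   # decoder-only transformer

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # The visual semantics are fixed here, before any fusion with text:
        visual_tokens = self.vision_encoder(image)       # (B, N, vision_dim)
        visual_embeds = self.projector(visual_tokens)    # (B, N, llm_dim)
        # Fusion only happens now, via self-attention over the concatenated
        # sequence; attention cannot recover details the encoder discarded.
        fused = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(fused)                           # seq2text decoding
```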
'Natively' is a bit of an exaggeration, as it's mostly just other modalities mapped to text space as input - but this makes their 'understanding' rather shallow
July 26, 2025 at 7:51 PM
If the priority is to dunk on people who know less about AI, instead of being accurate, that could be a conclusion I guess.
July 18, 2025 at 4:09 PM
It would be weird to describe this 2012 system, which does search, as 'an SVM classifier doing search': www.robots.ox.ac.uk/%7Evgg/publi...

Similarly, I wouldn't describe an LLM that translates a query to a destination for a Waymo as an 'LLM driving a car'
July 18, 2025 at 3:41 PM
I'm not questioning your definition of searching, I'm questioning your use of "LLMs".

I don't think defining an LLM as a transformer-based NN is inaccurate; under that definition it isn't doing search by itself, and then it would be fine to argue that it can only hallucinate.
July 18, 2025 at 3:41 PM
That statement mostly seems to apply to hosted commercial systems. It takes more than just downloading an LLM from huggingface to have a system that does this.

Sure an LLM can be trained to formulate queries and process results, but the system doing the searching is more than 'just' an LLM.
July 18, 2025 at 2:51 PM
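To make the division of labour concrete, a hypothetical sketch of such a system; generate and web_search are stand-in names, not a real library's API:

```python
def answer_with_search(question: str, generate, web_search) -> str:
    # 1. The LLM formulates a search query (text in, text out).
    query = generate(f"Write a web search query for: {question}")
    # 2. A separate retrieval component - not the LLM - does the searching.
    documents = web_search(query)
    # 3. The LLM processes the results it was handed back (RAG-like).
    context = "\n\n".join(documents)
    return generate(f"Using only this context:\n{context}\n\nAnswer: {question}")
```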
Fair, but still meaningful to make the distinction between LLMs and reasoning models, as not all LLMs are reasoning models. Especially if the point is to communicate across silos.
July 18, 2025 at 1:52 PM
Do LLMs do search? Afaik there have been systems built around LLMs that do search, and then send these results back to them (i.e., RAG-like) - but that isn't the same as an LLM doing search.
July 18, 2025 at 1:30 PM
I couldn't find EurIPS registration costs; hopefully they can address this by lowering costs for authors

But yes - this has been absurd, especially for those with visa issues - and I do think for that group this is a (minor) improvement
July 17, 2025 at 8:52 AM
Not my intention to defend the requirement for a full registration, but this has been common practice for a while across multiple conferences.

The main change from the new locations seems primarily to be that those with US visa issues will be able to present somewhere. But it doesn't really change costs
July 17, 2025 at 8:31 AM