But yes, if you actually want to learn about multimodality then you shouldn't read about MLLMs.
If you're willing to actually learn about this then you can start here: arxiv.org/abs/2505.19614, or even here: academic.oup.com/dsh/article/...
Those different details also matter a lot; especially because the brain isn't just floating in a jar, it's part of an embodied system.
The brain analogy really doesn't hold here. NN != Brains.
Partial Information Decomposition has been used to formalise some of this
The potential is limited: if key visual info is missing, then attention won't recover it. So a lot of 'decisions' about the visual input are made before fusion.
That doesn't make it meaningfully multimodal; the potential of the visual tokens is still limited by the visual encoder.
Anyway, if I wanted to talk to an LLM I would do that directly
'text space' in that, after the image encoder, the visual information is fixed and then mixed with text tokens for seq2text, which is not how multimodality works.
Similarly, I wouldn't describe an LLM that translates a query to a destination for a Waymo as an 'LLM driving a car'
I don't think defining an LLM as a transformer-based NN is inaccurate, in which case it isn't doing search by itself, and then it would be fine to argue that it can only hallucinate.
Sure, an LLM can be trained to formulate queries and process results, but the system doing the searching is more than 'just' an LLM.
But yes - this has been absurd; especially for those with visa issues - and I do think for that group this is a (minor) improvement
The main change from the new locations seems to be that those with US visa issues will be able to present somewhere. But it doesn't really change costs.