Phillip Isola
phillipisola.bsky.social
Associate Professor in EECS at MIT. Neural nets, generative models, representation learning, computer vision, robotics, cog sci, AI.

https://web.mit.edu/phillipi/
This reminds me of my favorite advice on giving talks, which is from Matt Stone and Trey Parker: www.youtube.com/watch?v=vGUN...
Writing Advice from Matt Stone & Trey Parker @ NYU | MTVU's "Stand In"
YouTube video by Fabian Valdez
www.youtube.com
October 29, 2025 at 9:10 PM
I agree, but I want to push back on calling this pseudoscience; that feels like too strong a critique.

But just for the chance of a meal in Paris, happy to take that bet and probably end up wrong :)
October 17, 2025 at 7:36 PM
I agree that just knowing a lot of facts is not everything. But it seems like their benchmark includes lots more than that: working memory, reasoning, perception, etc?
October 17, 2025 at 6:40 PM
I get that people might disagree with the framing / marketing. But what makes you feel it is pseudoscience? I only skimmed it.
October 17, 2025 at 5:50 PM
I agree. I think at a certain scale, modality alignment happens without additional explicit incentives; at smaller scales, explicit alignment can be necessary.

This paper shows some effect of alignment increasing with scale, for a domain closer to remote sensing: www.arxiv.org/abs/2509.19453
The Platonic Universe: Do Foundation Models See the Same Sky?
We test the Platonic Representation Hypothesis (PRH) in astronomy by measuring representational convergence across a range of foundation models trained on different data types. Using spectroscopic and...
www.arxiv.org
October 13, 2025 at 4:59 PM
Right! It's a text only LLM.
October 13, 2025 at 4:02 PM
This work is with an amazing team including @sophielwang.bsky.social, @thisismyhat.bsky.social, Sharut Gupta, @shobsund.bsky.social, Chenyu Wang, and Stefanie Jegelka.

9/9
October 10, 2025 at 10:13 PM
More broadly, I think confusion has been created by forming hard distinctions between different modalities, especially between text and sensory data. These distinctions can obscure commonalities. We take the rhetorical stance of erasing the distinctions, and seeing where this leads.

8/9
October 10, 2025 at 10:13 PM
This work was partially inspired by Ilya Sutskever's talk here: www.youtube.com/watch?v=AKMu...

If you concatenate datasets, the model “should” figure out all the synergies and cross-modal relationships, then exploit them to make better inferences. We now have some evidence this can happen.

7/9
An Observation on Generalization
YouTube video by Simons Institute for the Theory of Computing
www.youtube.com
October 10, 2025 at 10:13 PM
Suppose you have separate datasets X, Y, Z, without known correspondences.

We do the simplest thing: just train a model (e.g., a next-token predictor) on all elements of the concatenated dataset [X,Y,Z].

You end up with a better model of dataset X than if you had trained on X alone!

6/9
October 10, 2025 at 10:13 PM
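The recipe in the post above can be sketched with a toy count-based next-token model. Everything here (the tiny datasets, the bigram model, add-alpha smoothing) is an illustrative stand-in, not the paper's actual setup:

```python
from collections import Counter

def train_bigram(corpus):
    """Count-based next-token model: bigram and context counts."""
    pair_counts, ctx_counts = Counter(), Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            ctx_counts[a] += 1
    return pair_counts, ctx_counts

def prob(pair_counts, ctx_counts, a, b, vocab_size, alpha=1.0):
    """Add-alpha smoothed conditional probability P(b | a)."""
    return (pair_counts[(a, b)] + alpha) / (ctx_counts[a] + alpha * vocab_size)

# Hypothetical unpaired "datasets" X, Y, Z: just lists of token sequences,
# with no known correspondences between them.
X = ["abab", "abba"]
Y = ["baba", "baab"]
Z = ["aabb"]

vocab_size = len({c for seq in X + Y + Z for c in seq})

# The simplest thing: one model on X alone, one on the concatenation [X, Y, Z].
model_x = train_bigram(X)
model_xyz = train_bigram(X + Y + Z)

p_x = prob(*model_x, "a", "b", vocab_size)      # P(b | a) from X only
p_xyz = prob(*model_xyz, "a", "b", vocab_size)  # P(b | a) from the concatenation
```

In the actual work the model is of course a neural sequence model and the claim is about held-out performance on X; this sketch only illustrates the "just concatenate and train one model" step of the recipe.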
In “Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models,” we study a question I’ve wanted to make progress on for years: can you learn useful multimodal representations from *unpaired* data?

5/9
October 10, 2025 at 10:13 PM
In short: you can “just ask” an LLM to act (a bit) like an image model or an audio model.

This tells us that LLMs know more about the sensory world than we might suspect; you just have to find ways to elicit the knowledge.

4/9
October 10, 2025 at 10:13 PM
In “Words That Make Language Models Perceive,” we find that if you ask an LLM to “imagine seeing,” the way it processes text becomes more like how a vision system would represent the same scene.

If you ask it to “imagine hearing,” its representation becomes more like that of an auditory model.

3/9
October 10, 2025 at 10:13 PM
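The kind of comparison described above can be illustrated with linear CKA, one standard representational-similarity metric (not necessarily the one used in the paper). The feature matrices below are synthetic stand-ins, not real model activations; the "primed" features are contrived to align with the vision features:

```python
import numpy as np

def linear_cka(A, B):
    """Linear Centered Kernel Alignment between feature matrices (n x d)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    num = np.linalg.norm(A.T @ B, "fro") ** 2
    den = np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro")
    return num / den

rng = np.random.default_rng(0)
vision = rng.normal(size=(200, 16))   # stand-in for vision-model features
plain = rng.normal(size=(200, 32))    # stand-in: LLM features, plain prompt
# Stand-in for features after an "imagine seeing" prompt: by construction,
# a linear transform of the vision features, so they align with them.
primed = vision @ rng.normal(size=(16, 32))

sim_plain = linear_cka(plain, vision)
sim_primed = linear_cka(primed, vision)
```

By the Cauchy-Schwarz inequality the score lies in [0, 1], with higher values meaning the two feature sets encode more similar geometry over the same inputs; here `sim_primed` comes out well above `sim_plain`.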
For context, this work stems from the idea that all data modalities (images, sounds, text, etc.) are views of the same underlying world, and that treating them as such is useful.

We are interested in identifying commonalities between different models and modalities, and providing unifications.

2/9
October 10, 2025 at 10:13 PM
Oh I think you are right about the review process at least. Sometimes it rewards the inverse of my metric: a fancy new technique that doesn't actually achieve any new result / understanding :)
October 9, 2025 at 8:57 PM
I think papers like that are great! One of my personal metrics for paper quality is: delta in capability / delta in technique. A paper that only changes one parameter and achieves much better results should get a best paper award by this metric :)
October 9, 2025 at 2:32 PM
Unless it turns out that capable intelligence is actually not so simple!
July 31, 2025 at 9:22 PM
Yeah, it helps me to consider that much of the history of science has been about finding a simpler-than-expected explanation of something that previously seemed magical: life (evolution), motion of the planets (law of gravitation), etc. Now those are among our most celebrated discoveries.
July 31, 2025 at 9:10 PM
Of course, personally, I think we need not shy away from this possibility. Maybe intelligence is simpler than we thought, and there's a beauty in that too.
July 31, 2025 at 12:54 AM
I think part of it is that people might be overestimating the complexity of intelligence, and it's hard not to.

How weird it would be if an LLM (a Markov chain!) could explain "thinking".

It feels like it makes us less special, like Copernicus placing the sun at the center, rather than the Earth.
July 31, 2025 at 12:54 AM
I enjoy your posts! I hope you keep at it.
July 27, 2025 at 4:34 AM
Finite: right, you would need to train the student on inputs beyond the GT x's.

Wrong: the teacher could underfit and be more correct than the "GT" y's. This paper is about one version of this: arxiv.org/abs/2206.15477
Denoised MDPs: Learning World Models Better Than the World Itself
The ability to separate signal from noise, and reason with clean abstractions, is critical to intelligence. With this ability, humans can efficiently perform real world tasks without considering all p...
arxiv.org
July 16, 2025 at 4:21 PM