Huge thanks to all collaborators who made this work possible, and especially to @binxuwang.bsky.social, with whom this project was built, experiment after experiment.
🎮 kempnerinstitute.github.io/dinovision/
📄 arxiv.org/pdf/2510.08638
(i) Concepts = points (or regions), not directions
(ii) Probing is bounded: toward archetypes, not vectors (see the sketch after this list)
(iii) Can't recover the generating hulls from their sum: we should look deeper than single-layer activations to recover the true latents
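To make (ii) a bit more concrete, here's a minimal, hypothetical torch sketch of a "bounded" probe: its direction is constrained to the convex hull of a set of archetypes instead of being a free vector. Archetypes and activations below are random stand-ins, not the paper's.

```python
# Hypothetical sketch: a probe whose direction lives inside conv(archetypes),
# rather than an unconstrained linear probe. All data here is a random stand-in.
import torch

d, k, n = 768, 32, 4096                       # embed dim, #archetypes, #tokens
A = torch.nn.functional.normalize(torch.randn(k, d), dim=-1)  # stand-in archetypes
X = torch.randn(n, d)                         # stand-in activations
y = (X @ A[0] > 0).float()                    # toy binary labels

theta = torch.zeros(k, requires_grad=True)    # logits over archetypes
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([theta, b], lr=1e-2)

for _ in range(500):
    alpha = torch.softmax(theta, dim=0)       # convex weights: alpha >= 0, sum to 1
    w = alpha @ A                             # probe direction inside conv(A)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(X @ w + b, y)
    opt.zero_grad(); loss.backward(); opt.step()
```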
Activations = multiple convex hulls simultaneously: a rabbit among animals, brown among colors, fluffy among textures.
The Minkowski Representation Hypothesis.
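In toy numpy terms (random stand-in hulls, not DINOv2's actual concepts): each concept family contributes one point from inside its hull, and the token activation is their sum.

```python
# Toy sketch of the Minkowski picture: an activation is the sum of one point
# drawn from inside each concept hull. Hulls here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d = 768
hulls = {name: rng.normal(size=(5, d))        # 5 anchor points per concept hull
         for name in ["animals", "colors", "textures"]}

def sample_in_hull(anchors):
    w = rng.dirichlet(np.ones(len(anchors)))  # convex weights
    return w @ anchors                        # a point inside the hull

# "a rabbit among animals, brown among colors, fluffy among textures"
activation = sum(sample_in_hull(anchors) for anchors in hulls.values())
```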
We found that positional information collapses: from high-rank to a nearly 2-dim sheet. Early layers encode precise location; later ones retain abstract axes.
This compression frees dimensions for features, and *position doesn't explain PCA map smoothness*
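One way to quantify such a collapse, sketched on random stand-in activations (the real analysis uses DINOv2 patch activations at each layer): average activations per spatial position across images, then look at the effective rank of those position means.

```python
# Sketch: effective rank (participation ratio) of the positional component
# at one layer. `acts` is a random stand-in of shape (images, positions, dim).
import numpy as np

acts = np.random.randn(512, 256, 768)             # stand-in activations
pos_means = acts.mean(axis=0)                     # (positions, dim): positional component
pos_means = pos_means - pos_means.mean(axis=0, keepdims=True)

s = np.linalg.svd(pos_means, compute_uv=False)
lam = s ** 2
eff_rank = lam.sum() ** 2 / (lam ** 2).sum()      # participation ratio
print(f"effective rank of the positional component: {eff_rank:.1f}")
```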
This may suggest an interpolative geometry: tokens as mixtures of landmarks, shaped by clustering and spreading forces in the training objectives.
Also, co-activation statistics only moderately shape geometry: concepts that fire together aren't necessarily nearby—nor orthogonal when they don't.
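The kind of comparison behind this claim, sketched on stand-in codes and atoms: correlate "how often two concepts co-fire" with "the cosine between their dictionary atoms".

```python
# Sketch: does co-activation predict geometry? Codes and atoms are stand-ins;
# with the real dictionary the relationship is only moderate.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.random((10_000, 64)) * (rng.random((10_000, 64)) < 0.1)  # sparse codes (stand-in)
D = rng.normal(size=(64, 768))                                   # atoms (stand-in)

coact = np.corrcoef((Z > 0).T.astype(float))                     # co-activation correlations
Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
cos = Dn @ Dn.T                                                  # atom cosine similarities

iu = np.triu_indices(64, k=1)
r = np.corrcoef(coact[iu], cos[iu])[0, 1]
print(f"corr(co-activation, atom cosine) = {r:.2f}")
```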
Instead, training drives atoms from near-Grassmannian initialization to higher coherence.
Several concepts fire almost always: the embedding is partly dense (!), contradicting pure sparse coding.
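Rough sketch of both measurements, on a stand-in dictionary and codes: mutual coherence of the atoms compared against the Welch bound that Grassmannian-like frames approach, plus per-concept firing density.

```python
# Sketch: (1) mutual coherence of dictionary atoms vs. the Welch bound,
# (2) per-concept firing density. Dictionary and codes are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
k, d = 8_192, 768
D = rng.normal(size=(k, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# (1) mutual coherence: largest |cosine| between two distinct atoms
# (estimated on a random subset to keep the Gram matrix small)
sub = D[rng.choice(k, 2_000, replace=False)]
G = np.abs(sub @ sub.T)
np.fill_diagonal(G, 0.0)
coherence = G.max()
welch = np.sqrt((k - d) / (d * (k - 1)))          # lower bound, met by Grassmannian frames
print(f"coherence ~ {coherence:.3f}  (Welch bound {welch:.3f})")

# (2) firing density: fraction of tokens on which each concept is active
Z = rng.random((5_000, 200)) < 0.05               # stand-in binary codes (tokens x concepts)
density = Z.mean(axis=0)
print(f"densest concept fires on {density.max():.0%} of tokens")
```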
Continuing our interpretation of DINOv2, the second part of our study concerns the *geometry of concepts* and the synthesis of our findings toward a new representational *phenomenology*:
the Minkowski Representation Hypothesis
Tomorrow, Part II: geometry of concepts and Minkowski Representation Hypothesis.
🕹️ kempnerinstitute.github.io/dinovision
📄 arxiv.org/pdf/2510.08638
DINO seems to use them to encode global invariants: we find concepts (directions) that fire exclusively (!) on registers.
Examples of such concepts include a motion-blur detector and style detectors (game screenshots, drawings, paintings, warped images...)
It turns out it has discovered several human-like monocular depth cues: texture gradients resembling blurring or bokeh, shadow detectors, and projective cues.
Most units mix cues, but a few remain remarkably pure.
For every class, we find two concepts: one fires on the object (e.g., "rabbit"), and another fires everywhere *except* the object -- but only when it's present!
We call them Elsewhere Concepts (credit: @davidbau.bsky.social).
Archetypal SAE uncovered 32k concepts.
Our first observation: different tasks recruit distinct regions of this conceptual space.
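For intuition only, a heavily simplified, hypothetical torch sketch of the archetypal constraint: decoder atoms kept inside the convex hull of candidate data points, so every concept stays anchored to the data. This is not the actual training code.

```python
# Hypothetical, simplified sketch of an archetypal-style SAE constraint:
# decoder atoms = convex combinations of candidate points. All data is stand-in.
import torch
import torch.nn.functional as F

d, k, m = 768, 1024, 4096                        # dim, #concepts, #candidate points
C = torch.randn(m, d)                            # candidate points (stand-in for data centroids)
X = torch.randn(2048, d)                         # activations (stand-in)

W_enc = torch.randn(d, k, requires_grad=True)
logits = torch.zeros(k, m, requires_grad=True)   # rows -> simplex -> conv(C)
opt = torch.optim.Adam([W_enc, logits], lr=1e-3)

for _ in range(100):
    D = torch.softmax(logits, dim=-1) @ C        # decoder atoms inside conv(C)
    z = F.relu(X @ W_enc)                        # sparse-ish codes
    x_hat = z @ D
    loss = F.mse_loss(x_hat, X) + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```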
𝗔𝗻 𝗶𝗻𝘁𝗲𝗿𝗽𝗿𝗲𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗱𝗲𝗲𝗽 𝗱𝗶𝘃𝗲 𝗶𝗻𝘁𝗼 𝗗𝗜𝗡𝗢𝘃𝟮, one of vision’s most important foundation models.
And today is Part I: buckle up, we're exploring some of its most charming features. :)
In the demo you can explore these bridges (links) and see how multimodality shows up! :)
with @isabelpapad.bsky.social, @chloesu07.bsky.social, @shamkakade.bsky.social and Stephanie Gil
SAEs reveal that VLM embedding spaces aren’t just "image vs. text" cones.
They contain stable conceptual directions, some forming surprising bridges across modalities.
arxiv.org/abs/2504.11695
Demo 👉 vlm-concept-visualization.com
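One simple way to look for such bridges (a sketch with random stand-ins, not the released SAE): encode both modalities with the same SAE encoder and keep concepts that fire in both.

```python
# Sketch: flag "bridge" concepts as those active on both image and text
# embeddings. Encoder and embeddings are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 2048
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)     # stand-in SAE encoder
img = rng.normal(size=(5_000, d))                # stand-in image embeddings
txt = rng.normal(size=(5_000, d))                # stand-in text embeddings

def firing_rate(emb, thresh=1.0):
    return (np.maximum(emb @ W_enc, 0.0) > thresh).mean(axis=0)

fire_img, fire_txt = firing_rate(img), firing_rate(txt)
bridges = np.where((fire_img > 0.01) & (fire_txt > 0.01))[0]
print(f"{len(bridges)} candidate cross-modal 'bridge' concepts")
```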
this year, sharing some work on explainability and representations. If you’re attending and want to chat, feel free to reach out! 👋
ResNet focused on **fur** patterns; DETR did too, but also used **paws** (possibly because they help define bounding boxes); and CLIP's **head** concept oddly included human heads. Language shaping learned concepts?