Alex Gates
@complexgates.bsky.social
Assistant Professor of Data Science @UVA. Network Science, human behavior and the emergent constraints of the organizations and ecosystems we build.
9/
To wrap up: this framework doesn’t solve every debate about clustering similarity…
…but it does finally give us a shared language for understanding why different measures disagree, and how they fit together.
If you’re curious, the preprint is here👇
arxiv.org/abs/2511.03000

Thanks for reading! 🧵✨
Unifying Information-Theoretic and Pair-Counting Clustering Similarity
Comparing clusterings is central to evaluating unsupervised models, yet the many existing similarity measures can produce widely divergent, sometimes contradictory, evaluations. Clustering similarity ...
November 24, 2025 at 8:58 PM
8/
I also show that information-theoretic measures can be approximated using higher-order tuple counts (triplets, quadruplets, …) built on top of pair counting.
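To make "tuple counting" concrete, here's a hypothetical sketch of my own (NOT the paper's actual construction or approximation): the k=2 case is ordinary pair agreement, and k=3 counts triplets whose internal co-membership pattern matches under both clusterings.

```python
# Hypothetical sketch: agreement statistics over k-tuples, the kind of
# higher-order counts the post refers to (not the paper's construction).
from itertools import combinations

def tuple_agreement(x, y, k):
    """Fraction of k-tuples whose co-membership pattern (which members
    share a cluster) is identical under both clusterings x and y."""
    def pattern(labels, idx):
        return tuple(labels[i] == labels[j] for i, j in combinations(idx, 2))
    tuples = list(combinations(range(len(x)), k))
    agree = sum(pattern(x, t) == pattern(y, t) for t in tuples)
    return agree / len(tuples)

# Toy labelings (my own, for illustration only)
A = [0, 0, 0, 1, 1, 1]
B = [0, 0, 1, 1, 1, 1]
pair_level = tuple_agreement(A, B, 2)     # classic pair-counting agreement
triplet_level = tuple_agreement(A, B, 3)  # its third-order analogue
print(pair_level, triplet_level)
```

Item 2 is the only one the two toy clusterings place differently, so every pair and triplet containing it disagrees; the pair- and triplet-level scores differ accordingly.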
7/
So…I’m excited to share my paper introducing a unified framework for clustering similarity:

In it, pair-counting and information-theoretic measures are both expressed as algebraic expansions around “independence”, which pinpoints exactly which terms differ.
6/
As a community, we have plenty of examples of when the measures differ, but we’ve lacked a principled framework explaining why these measures disagree, how they relate, and whether they’re reconcilable.

And honestly?
It always bothered me.
5/
If you’ve ever used measures from both families, there’s a good chance:

The pair-counting score says these clusterings are nearly identical!
The information-theoretic score says they share almost no structure!

…and you’re left thinking:
“How can both be ‘right’?”
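A toy illustration of that tension (my own example, not one from the paper): two clusterings of 20 items that share no co-clustered pair, scored with minimal pure-Python versions of Rand, Adjusted Rand, and NMI.

```python
# Toy example (not from the paper): two clusterings of 20 items that
# share no co-clustered pair, yet the standard measures disagree sharply.
from collections import Counter
from itertools import combinations
from math import comb, log

n = 20
A = [i // 2 for i in range(n)]               # clusters {0,1},{2,3},...,{18,19}
B = [((i + 1) % n) // 2 for i in range(n)]   # shifted: {19,0},{1,2},...,{17,18}

# --- pair-counting side ---
def pair_counts(x, y):
    together_both = sum(1 for i, j in combinations(range(len(x)), 2)
                        if x[i] == x[j] and y[i] == y[j])
    together_x = sum(comb(c, 2) for c in Counter(x).values())
    together_y = sum(comb(c, 2) for c in Counter(y).values())
    return together_both, together_x, together_y, comb(len(x), 2)

t11, ta, tb, tot = pair_counts(A, B)
rand = (tot - ta - tb + 2 * t11) / tot       # agreeing pairs / all pairs
exp = ta * tb / tot                          # chance-level co-clustered pairs
ari = (t11 - exp) / ((ta + tb) / 2 - exp)    # Adjusted Rand index

# --- information-theoretic side ---
def entropy(labels):
    N = len(labels)
    return -sum((c / N) * log(c / N) for c in Counter(labels).values())

def mutual_info(x, y):
    joint, px, py, N = Counter(zip(x, y)), Counter(x), Counter(y), len(x)
    return sum((c / N) * log((c / N) / ((px[a] / N) * (py[b] / N)))
               for (a, b), c in joint.items())

nmi = mutual_info(A, B) / ((entropy(A) + entropy(B)) / 2)

print(f"Rand = {rand:.2f}, ARI = {ari:.2f}, NMI = {nmi:.2f}")
# Rand is high (~0.89), ARI sits at chance (~-0.06), NMI is moderate (~0.70)
```

Same pair of clusterings, three rather different verdicts: the unadjusted Rand index calls them highly similar, ARI calls them no better than chance, and NMI lands in between.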
4/
Pair-counting measures think in terms of pairs of items:
“How many pairs of items did both clusterings put together?… or apart?”

Information-theoretic measures ask instead:
“How much uncertainty remains in one clustering given the other?”
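Both questions fit in a few lines of Python (toy labelings of my own, not examples from the paper):

```python
# Toy data: two clusterings of six items, given as label lists
from collections import Counter
from itertools import combinations
from math import log

A = [0, 0, 0, 1, 1, 1]
B = [0, 0, 1, 1, 1, 1]

# Pair-counting question: how many pairs did both clusterings
# put together, and how many did both keep apart?
together = apart = 0
for i, j in combinations(range(len(A)), 2):
    if A[i] == A[j] and B[i] == B[j]:
        together += 1
    elif A[i] != A[j] and B[i] != B[j]:
        apart += 1
print(together, apart)

# Information-theoretic question: how much uncertainty about A
# remains once B is known?  H(A|B) = H(A,B) - H(B)
def entropy(labels):
    N = len(labels)
    return -sum((c / N) * log(c / N) for c in Counter(labels).values())

h_a_given_b = entropy(list(zip(A, B))) - entropy(B)
print(round(h_a_given_b, 3))  # residual uncertainty, in nats
```

One family tallies discrete pairwise agreements; the other measures leftover entropy, which is why their scales and behaviors differ in the first place.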
3/
Broadly, the community has coalesced around two major families of clustering similarity measures:

Pair-counting measures
(e.g., Rand, Adjusted Rand, Jaccard)

…and

Information-theoretic measures
(e.g., Mutual Information, NMI, Variation of Information)
2/
Clustering is everywhere in science: communities in social networks, customer segments in marketing, functional groups in biology.

And yet, there’s no universal “right” answer for how similar two clusterings are.

Turns out: measuring similarity between clusterings is surprisingly deep, subtle, and messy.