Xan Gregg
xangregg.bsky.social
1.7K followers 1.8K following 240 posts
Engineering Fellow at JMP, focused on #DataViz, preferring smoothers over fitted lines. Creator of JMP #GraphBuilder and #PackedBars chart type for high-cardinality Pareto data. #TieDye #LessIsMore
New blog post looking at some recently shared NCAA football player data. The scatterplot shows percent drafted to the NFL against average high-school player rating by college. Also trying out inward-jittered, smoothed dot plots.
rawdatastudies.com/2025/10/26/n...
Dot plot #dataviz comparison: ratings of FIDE chess Grand Masters via Tidy Tuesday.
1 Nearest stacks (Wilkinson)
2 Smoothed stacks
3 Smoothed hexagonal grid
4 Exact position (beeswarm)
Smoothing trades Δx (horizontal displacement of dots from their true values) for spikiness (deviation from the kernel density estimate).
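For the nearest-stacks variant, a minimal Python sketch of Wilkinson-style stacking (my own illustration with an assumed fixed dot width h, not JMP's implementation):

```python
import numpy as np

def wilkinson_stacks(values, h):
    """Greedy Wilkinson-style stacking: sort the values, group
    consecutive points within one dot width (h) of the first point
    in the group, and center each stack at its group's mean x."""
    xs = np.sort(np.asarray(values, dtype=float))
    stacks, start = [], 0  # stacks holds (center_x, count) pairs
    for i in range(1, xs.size + 1):
        if i == xs.size or xs[i] - xs[start] > h:
            group = xs[start:i]
            stacks.append((group.mean(), group.size))
            start = i
    return stacks

# Dot centers for plotting: stack each group's dots upward.
rng = np.random.default_rng(1)
dots = [(cx, k + 0.5)  # y in dot-width units
        for cx, n in wilkinson_stacks(rng.normal(size=40), 0.3)
        for k in range(n)]
```

Centering each stack at its group's mean x is what introduces the Δx displacement that smoothing then trades against spikiness.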
JMP 19 is out (free trial available), and I wrote a blog post about the main things I worked on. Constrained smoothers, jitter options, easier arrows, parallel y axes, ... #dataviz
community.jmp.com/t5/JMPer-Cab...
Yay, I was able to reproduce the lines in this chart precisely from the raw data. The original shows summary dots where mine shows raw-data dots, shown at a couple of zoom levels. The power of statistics; signal and noise. www.nature.com/articles/s41...
Right, I assume the mean of the line is based on the data. The paper mentions σ as 3000 steps. Perhaps each line is a hypothetical average of a few thousand such movers, but I can't find any such explanation.
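If each line really were the average of a few thousand movers, its smoothness would follow directly from the standard error of the mean. A quick back-of-the-envelope check (everything except σ = 3000 is invented):

```python
import numpy as np

# With per-person sigma = 3000 steps, the mean of n movers has
# standard error 3000 / sqrt(n) -- tiny once n is in the thousands.
sigma, n_movers, n_days = 3000.0, 3000, 365      # hypothetical sizes
rng = np.random.default_rng(7)
daily = rng.normal(8000, sigma, size=(n_movers, n_days))  # invented steps
line = daily.mean(axis=0)   # one plotted "average mover" line
print(line.std())           # about 3000 / sqrt(3000) = 55: very smooth
```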
What to make of a paper that shares a ton of well-organized data and code for its charts, but not enough detail for analysis? PII concerns, maybe.
Curiously, these line charts show random data, suggesting steadier step counts. www.nature.com/articles/s41...
Here's the smoothed grid with dots colored by their value's ones digit (walkScore % 10), and a superposition attempt, with smoothed in gray. (I didn't quite get the walk score per dot width to be an exact number of pixels.) Hope these capture the diagnostic you're looking for.
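The ones-digit coloring is easy to try on stand-in data; a rough sketch with synthetic scores and plain jitter standing in for the smoothed-grid placement:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the Walk Score data: integer scores 0-100.
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(50, 20, 2500).round(), 0, 100).astype(int)

# Jitter y so equal scores don't overplot, and color each dot by its
# ones digit (score % 10) to expose digit-preference banding.
y = rng.uniform(-1, 1, scores.size)
plt.scatter(scores, y, c=scores % 10, cmap="tab10", s=8)
plt.colorbar(label="ones digit (score % 10)")
plt.yticks([])
plt.xlabel("Walk Score")
plt.show()
```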
Quick dot plot #dataviz study with 2500 US city Walk Scores. Plain dot plot (exact because scores are integers), with smoothing (±1), and with hexagonal placement (±0.75). Data from www.walkscore.com
It can't be a ratio of the changes, since the denominator could be very small, even 0. However, using (total + first)/(total + latest) is no good since the base is so much bigger. It seems like some smoothing/annualizing is happening. The closest I could get was a 12-month cumulative error versus the total.
Better alternative?:
The datasets and code have been escrowed with the publisher (checksum xxx). They will be shared upon request to the corresponding author cc publisher for the following purposes: x, y, z. If no response within n days, notify the publisher and the paper will be retracted.
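For the checksum piece, any standard cryptographic hash would do; for example, a SHA-256 digest the publisher could print alongside the paper (archive name hypothetical):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file and return its SHA-256 hex digest, so the escrowed
    archive can later be verified against the published checksum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

print(sha256_of("escrowed_dataset.zip"))  # hypothetical archive name
```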
1. Email addresses change.
2. Author becomes unavailable (retires, gets busy, ...)
3. No definition of "reasonable."
4. No way to verify that any supplied data is the actual data.
5. The supplied data may not be complete.
6. No penalty for breaking promise.
A week ago I was crazy enough to email a paper's corresponding author for the data. No response.

"The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request."

Why is this data availability statement evil? A few reasons: ...
Rare sighting of letter-values plots in the wild. Nicely described in the caption as "plots which first identify the median, then extend boxes outward, each covering half of the remaining data." n=2.9M, so regular box plots would be swamped with outliers. #dataviz
arxiv.org/pdf/2402.14583
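Letter-value plots are available off the shelf: seaborn calls them "boxen" plots. A minimal sketch on synthetic heavy-tailed data standing in for the paper's n=2.9M sample:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Heavy-tailed sample: a letter-value plot keeps extending boxes
# outward, each covering half of the remaining data, instead of
# flagging thousands of points as outliers like a box plot would.
rng = np.random.default_rng(2)
x = rng.standard_t(df=3, size=100_000)
sns.boxenplot(x=x)
plt.show()
```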
That's a good way to put it. That example is mostly for those with a strict rule. I think my internal rule matches yours: the bar origin should be a "meaningful baseline" such that 2x bar height is a 2x effect from the baseline.
The originals could serve as fodder for some #dataviz guides: when the zero-origin rule breaks down, or when to use dots/lines instead of bars.
Great improvement sequence, but for me, it's harder to verify which categories are changing after putting their bars in separate groups. I see it's a trade-off with simplifying the coloring. Here's a try at sticking with the original ordering, at the cost of an imperfect time legend.
The Secret of Data Science. I don't know if I'll ever get the chance to present this wisdom in public, so I'm sharing a rehearsal video from my rejected OutlierConf lightning talk submission. It really needs a live audience, though. #dataviz youtu.be/imRSlilIw5k
Not mine, just to be clear. But, yes, very nice!
This article by Don Wheeler has a good discussion of Grubbs' test and others. www.qualitydigest.com/inside/stati... [free reg reqd]
He's a control charts expert, which explains the sequence-based context and small data sizes.
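For reference, a self-contained sketch of the standard two-sided Grubbs' test (the textbook normal-theory version, not the non-Gaussian adaptation mentioned below):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier in a roughly
    normal sample. Returns (G, G_crit, is_outlier)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value from the t distribution (two-sided form).
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G, G_crit, G > G_crit

rng = np.random.default_rng(3)
sample = np.append(rng.normal(size=50), 6.0)  # one planted outlier
print(grubbs_test(sample))
```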
Shortest Half is a real thing: the smallest interval containing half the data, as a measure of the densest area. I've tried expanding it to allow a split interval and plotting it like density intervals. Iteratively taking the shortest half produces the "half-sample mode", shown as shortest 0.
I need to do a full write-up. The green ones are my experimental inventions. One idea was that "shortest half" and related intervals would make good data-driven density intervals. They seem better for very skewed distributions like exponential, but maybe not so great in general.
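A compact sketch of both ideas as I understand them, shortest half and its iteration down to the half-sample mode (my own code, not the experimental split-interval version):

```python
import numpy as np

def shortest_half(xs):
    """Return the contiguous (sorted) run of ceil(n/2) points that
    spans the smallest interval -- the shortest half."""
    xs = np.sort(np.asarray(xs, dtype=float))
    h = (xs.size + 1) // 2
    widths = xs[h - 1:] - xs[:xs.size - h + 1]  # every h-point window
    i = int(np.argmin(widths))
    return xs[i:i + h]

def half_sample_mode(xs):
    """Iterate shortest_half down to <= 2 points; their mean is the
    half-sample mode, a robust mode estimate."""
    xs = np.asarray(xs, dtype=float)
    while xs.size > 2:
        xs = shortest_half(xs)
    return xs.mean()

rng = np.random.default_rng(4)
print(half_sample_mode(rng.exponential(size=1000)))  # near the peak at 0
```

On a skewed sample like the exponential, the half-sample mode lands near the density peak, matching the skewed-distribution behavior described above.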
Round 2 of my 1D #dataviz experiment at xangregg.github.io/data-strips/.
I realized my adaptive outlier idea had already been done as Grubbs' test, which I've adapted for non-Gaussian moments.
Added a couple thirds-based views. Here's 5000 random normal samples plus 2 outliers. The green ones use Grubbs.
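Recreating that input with the plain Gaussian Grubbs procedure applied iteratively (the planted outlier values and alpha here are my own choices, and this is not the non-Gaussian adaptation):

```python
import numpy as np
from scipy import stats

# 5000 standard-normal draws plus 2 planted outliers.
rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(size=5000), [7.0, -6.5]])

def grubbs_extreme(x, alpha=0.05):
    """Index of the most extreme point if Grubbs rejects, else None."""
    n = x.size
    i = int(np.argmax(np.abs(x - x.mean())))
    G = abs(x[i] - x.mean()) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return i if G > G_crit else None

found = []
while (i := grubbs_extreme(x)) is not None:  # drop and re-test
    found.append(x[i])
    x = np.delete(x, i)
print(found)  # ideally just the two planted points
```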