Xan Gregg
xangregg.bsky.social
1.7K followers 1.8K following 240 posts
Engineering Fellow at JMP, focused on #DataViz, preferring smoothers over fitted lines. Creator of JMP #GraphBuilder and #PackedBars chart type for high-cardinality Pareto data. #TieDye #LessIsMore
New blog post looking at some recently shared NCAA football player data. The scatterplot shows percent drafted to the NFL against average high-school player rating by college. Also trying out inward-jittered, smoothed dot plots.
rawdatastudies.com/2025/10/26/n...
Dot plot #dataviz comparison: ratings of FIDE chess Grand Masters via Tidy Tuesday.
1 Nearest stacks (Wilkinson)
2 Smoothed stacks
3 Smoothed hexagonal grid
4 Exact position (beeswarm)
Smoothing trades Δx (horizontal displacement of dots from their true values) for spikiness (deviation from the kernel density estimate).
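For the nearest-stacks variant, a minimal Python sketch of Wilkinson-style stacking (my own illustration with an assumed fixed dot width h, not JMP's implementation):

```python
import numpy as np

def wilkinson_stacks(values, h):
    """Greedy Wilkinson-style stacking: sort the values, group
    consecutive points within one dot width (h) of the first point
    in the group, and center each stack at its group's mean x."""
    xs = np.sort(np.asarray(values, dtype=float))
    stacks, start = [], 0  # stacks holds (center_x, count) pairs
    for i in range(1, xs.size + 1):
        if i == xs.size or xs[i] - xs[start] > h:
            group = xs[start:i]
            stacks.append((group.mean(), group.size))
            start = i
    return stacks

# Dot centers for plotting: stack each group's dots upward.
rng = np.random.default_rng(1)
dots = [(cx, k + 0.5)  # y in dot-width units
        for cx, n in wilkinson_stacks(rng.normal(size=40), 0.3)
        for k in range(n)]
```

Centering each stack at its group's mean x is what introduces the Δx displacement that smoothing then trades against spikiness.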
JMP 19 is out (free trial available), and I wrote a blog post about the main things I worked on. Constrained smoothers, jitter options, easier arrows, parallel y axes, ... #dataviz
community.jmp.com/t5/JMPer-Cab...
Yay, I was able to reproduce the lines in this chart precisely from the raw data. The original shows summary dots where mine shows raw-data dots, shown at a couple of zoom levels. The power of statistics; signal and noise. www.nature.com/articles/s41...
Right, I assume the mean of the line is based on the data. The paper mentions σ as 3000 steps. Perhaps each line is a hypothetical average of a few thousand such movers, but I can't find any such explanation.
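If each line really were the average of a few thousand movers, its smoothness would follow directly from the standard error of the mean. A quick back-of-the-envelope check (everything except σ = 3000 is invented):

```python
import numpy as np

# With per-person sigma = 3000 steps, the mean of n movers has
# standard error 3000 / sqrt(n) -- tiny once n is in the thousands.
sigma, n_movers, n_days = 3000.0, 3000, 365      # hypothetical sizes
rng = np.random.default_rng(7)
daily = rng.normal(8000, sigma, size=(n_movers, n_days))  # invented steps
line = daily.mean(axis=0)   # one plotted "average mover" line
print(line.std())           # about 3000 / sqrt(3000) = 55: very smooth
```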
What to make of a paper that shares a ton of well-organized data and code for its charts, but not enough detail for analysis? PII concerns, maybe.
Curiously, these line charts show random data, suggesting steadier step counts. www.nature.com/articles/s41...
Here's the smoothed grid with dots colored by their value's ones digit (walkScore % 10), and a superposition attempt, with smoothed in gray. (I didn't quite get the walk score per dot width to be an exact number of pixels.) Hope these capture the diagnostic you're looking for.
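The ones-digit coloring is easy to try on stand-in data; a rough sketch with synthetic scores and plain jitter standing in for the smoothed-grid placement:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the Walk Score data: integer scores 0-100.
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(50, 20, 2500).round(), 0, 100).astype(int)

# Jitter y so equal scores don't overplot, and color each dot by its
# ones digit (score % 10) to expose digit-preference banding.
y = rng.uniform(-1, 1, scores.size)
plt.scatter(scores, y, c=scores % 10, cmap="tab10", s=8)
plt.colorbar(label="ones digit (score % 10)")
plt.yticks([])
plt.xlabel("Walk Score")
plt.show()
```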
Quick dot plot #dataviz study with 2500 US city Walk Scores. Plain dot plot (exact because scores are integers), with smoothing (±1), and with hexagonal placement (±0.75). Data from www.walkscore.com
It can't be a ratio of the changes, since the denominator could be very small, even 0. However, using (total + first)/(total + latest) is no good since the base is so much bigger. It seems like some smoothing/annualizing is happening. The closest I could get was a 12-month cumulative error versus the total.
Better alternative?:
The datasets and code have been escrowed with the publisher (checksum xxx). They will be shared upon request to the corresponding author cc publisher for the following purposes: x, y, z. If no response within n days, notify the publisher and the paper will be retracted.
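For the checksum piece, any standard cryptographic hash would do; for example, a SHA-256 digest the publisher could print alongside the paper (archive name hypothetical):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file and return its SHA-256 hex digest, so the escrowed
    archive can later be verified against the published checksum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

print(sha256_of("escrowed_dataset.zip"))  # hypothetical archive name
```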
1. Email addresses change.
2. Author becomes unavailable (retires, gets busy, ...)
3. No definition of "reasonable."
4. No way to verify that any supplied data is the actual data.
5. The supplied data may not be complete.
6. No penalty for breaking promise.
A week ago I was crazy enough to email a paper's corresponding author for the data. No response.

"The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request."

Why is this data availability statement evil? A few reasons: ...
Rare sighting of letter-values plots in the wild. Nicely described in the caption as "plots which first identify the median, then extend boxes outward, each covering half of the remaining data." n=2.9M, so regular box plots would be swamped with outliers. #dataviz
arxiv.org/pdf/2402.14583
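Letter-value plots are available off the shelf: seaborn calls them "boxen" plots. A minimal sketch on synthetic heavy-tailed data standing in for the paper's n=2.9M sample:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Heavy-tailed sample: a letter-value plot keeps extending boxes
# outward, each covering half of the remaining data, instead of
# flagging thousands of points as outliers like a box plot would.
rng = np.random.default_rng(2)
x = rng.standard_t(df=3, size=100_000)
sns.boxenplot(x=x)
plt.show()
```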
That's a good way to put it. That example is mostly for those with a strict rule. I think my internal rule matches yours: the bar origin should be a "meaningful baseline" such that 2x bar height is a 2x effect from the baseline.
The originals could serve as fodder for some #dataviz guides: when the zero-origin rule breaks down, or when to use dots/lines instead of bars.
Great improvement sequence, but for me, it's harder to verify which categories are changing after putting their bars in separate groups. I see it's a trade-off with simplifying the coloring. Here's a try at sticking with the original ordering, at the cost of an imperfect time legend.
The Secret of Data Science. I don't know if I'll ever get the chance to present this wisdom in public, so I'm sharing a rehearsal video from my rejected OutlierConf lightning talk submission. It really needs a live audience, though. #dataviz youtu.be/imRSlilIw5k
Not mine, just to be clear. But, yes, very nice!
This article by Don Wheeler has a good discussion of Grubbs' test and others. www.qualitydigest.com/inside/stati... [free reg reqd]
He's a control charts expert, which explains the sequence-based context and small data sizes.
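For reference, a self-contained sketch of the standard two-sided Grubbs' test (the textbook normal-theory version, not the non-Gaussian adaptation mentioned below):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier in a roughly
    normal sample. Returns (G, G_crit, is_outlier)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value from the t distribution (two-sided form).
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G, G_crit, G > G_crit

rng = np.random.default_rng(3)
sample = np.append(rng.normal(size=50), 6.0)  # one planted outlier
print(grubbs_test(sample))
```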
Shortest Half is a real thing: the smallest interval containing half the data, as a measure of the densest area. I've tried expanding it to allow a split interval and plotting it like density intervals. Iteratively taking the shortest half produces the "half-sample mode", shown as shortest 0.
I need to do a full write-up. The green ones are my experimental inventions. One idea was that "shortest half" and related intervals would make good data-driven density intervals. They seem better for very skewed distributions like exponential, but maybe not so great in general.
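A compact sketch of both ideas as I understand them, shortest half and its iteration down to the half-sample mode (my own code, not the experimental split-interval version):

```python
import numpy as np

def shortest_half(xs):
    """Return the contiguous (sorted) run of ceil(n/2) points that
    spans the smallest interval -- the shortest half."""
    xs = np.sort(np.asarray(xs, dtype=float))
    h = (xs.size + 1) // 2
    widths = xs[h - 1:] - xs[:xs.size - h + 1]  # every h-point window
    i = int(np.argmin(widths))
    return xs[i:i + h]

def half_sample_mode(xs):
    """Iterate shortest_half down to <= 2 points; their mean is the
    half-sample mode, a robust mode estimate."""
    xs = np.asarray(xs, dtype=float)
    while xs.size > 2:
        xs = shortest_half(xs)
    return xs.mean()

rng = np.random.default_rng(4)
print(half_sample_mode(rng.exponential(size=1000)))  # near the peak at 0
```

On a skewed sample like the exponential, the half-sample mode lands near the density peak, matching the skewed-distribution behavior described above.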
Round 2 of my 1D #dataviz experiment at xangregg.github.io/data-strips/.
I realized my adaptive outlier idea had already been done as Grubbs' test, which I've adapted for non-Gaussian moments.
Added a couple thirds-based views. Here's 5000 random normal samples plus 2 outliers. The green ones use Grubbs.
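Recreating that input with the plain Gaussian Grubbs procedure applied iteratively (the planted outlier values and alpha here are my own choices, and this is not the non-Gaussian adaptation):

```python
import numpy as np
from scipy import stats

# 5000 standard-normal draws plus 2 planted outliers.
rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(size=5000), [7.0, -6.5]])

def grubbs_extreme(x, alpha=0.05):
    """Index of the most extreme point if Grubbs rejects, else None."""
    n = x.size
    i = int(np.argmax(np.abs(x - x.mean())))
    G = abs(x[i] - x.mean()) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return i if G > G_crit else None

found = []
while (i := grubbs_extreme(x)) is not None:  # drop and re-test
    found.append(x[i])
    x = np.delete(x, i)
print(found)  # ideally just the two planted points
```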