Tavian Barnes
tavianator.com
@tavianator.com
Aspiring computer scientist
Nah man rent used to actually be lower
July 1, 2025 at 4:37 PM
imo the third post is better: tavianator.com/2022/ray_box...
Fast, Branchless Ray/Bounding Box Intersections, Part 3: Boundaries - tavianator.com
June 27, 2025 at 5:22 PM
38 seconds total, if you read the whole line.

Anyway I'm not suggesting that an "actual project" spend time on this, but maybe a build system should?
April 27, 2025 at 5:35 PM
Keep going? No, fail fast!
April 14, 2025 at 3:52 PM
And the explanation: tavianator.com/2025/shlxpla...
The Alder Lake anomaly, explained - tavianator.com
January 4, 2025 at 7:08 PM
Re-measured it: the prefetch is still a significant improvement, ~11% higher throughput at 8 threads, ~30% higher at 12 threads. I should probably write this up somewhere.
January 2, 2025 at 3:36 PM
In both cases the counters claim 1 uop, so I don't think so
December 31, 2024 at 10:55 PM
It's fast for me in practice on Zen 2 at least. Maybe someday I'll microbenchmark it. But I've also dramatically optimized the MPMC queue recently, so I'm not sure the prefetch still makes a difference.
December 31, 2024 at 10:53 PM
Okay new discovery: "MOV R10D, 1" also gives 1c latency. But "MOV R10, 1" gives 3c latency. Something to do with whether the top half of the count register is zeroed and how.
December 31, 2024 at 7:42 PM
Maybe a register renaming issue?
December 31, 2024 at 7:07 PM
So I suspect a code alignment issue or something similar is at fault.
December 31, 2024 at 6:50 PM
Hmm, something very strange is going on. I can reproduce their benchmark. But if I add "XOR R8, R8; XOR R9, R9; XOR R10, R10" to -asm_init to zero all the registers, it goes from 8 cycles to 6, i.e. 1-cycle latency.

*But*, if I instead use "MOV R10, 0" to zero it out, it's back to 8 cycles!
December 31, 2024 at 6:49 PM