Arseny Kapoulkine
@zeux.io
Previously: technical fellow at Roblox
meshoptimizer, pugixml, volk, calm, niagara, qgrep, Luau
https://github.com/zeux
https://zeux.io
It's uarch dependent of course; perhaps some older CPUs would always insert an entry into BTB? Unsure!
October 23, 2025 at 6:46 PM
Branches that are never taken generally don't occupy BTB space. They do eat bits in the history buffer.
October 23, 2025 at 3:42 PM
Black Square - Wikipedia
en.wikipedia.org
October 20, 2025 at 11:16 PM
Neat!
There is some latency penalty to read the discriminator to decode the instance, and the resulting offset for the actual data might lead to unaligned loads, which is not ideal. I'm wondering if it makes sense to have a baseline alignment - e.g. 16 bytes - and encode the id into the lower bits of the instance ref.
October 16, 2025 at 4:47 AM
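A minimal sketch of the low-bits idea, assuming instance refs are byte offsets into a buffer and every instance starts on a 16-byte boundary. `pack_ref`, `unpack_ref`, and the 4-bit id width are hypothetical names and choices for illustration, not taken from the design being discussed.

```python
ALIGNMENT = 16            # assumed baseline alignment, in bytes
TAG_MASK = ALIGNMENT - 1  # low 4 bits are always zero in an aligned offset


def pack_ref(offset, type_id):
    """Combine an aligned byte offset and a small type id into one integer."""
    if offset & TAG_MASK:
        raise ValueError("offset must be a multiple of ALIGNMENT")
    if not 0 <= type_id <= TAG_MASK:
        raise ValueError("type_id must fit in the low bits")
    return offset | type_id


def unpack_ref(ref):
    """Split a packed ref back into (offset, type_id)."""
    return ref & ~TAG_MASK, ref & TAG_MASK


ref = pack_ref(0x40, 3)
offset, type_id = unpack_ref(ref)
print(hex(ref), hex(offset), type_id)
```

With this layout the id comes from the ref itself, so no load is needed before the data address is known. The cost is that ids are limited to 16 values and instances are padded to 16 bytes.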
Ideally set to “tada.wav” of Win 3.11 era youtu.be/QDUv_8Dw-Mw?...
Windows 3.1 - Tada
YouTube video by ProductDesignsYT
youtu.be
October 15, 2025 at 3:39 PM
At this point I'm curious, what's the end game here? :)
October 10, 2025 at 6:39 AM
A fun reformulation that’s used a lot these days: when computing softmax of a vector (pointwise exp(x) divided by sum of exp(x)), to avoid exponent overflow you can compute max(x) and compute softmax(x-max(x)) instead. It’s equivalent to dividing both numerator and denominator by exp(max(x)), but done before exponentiation, so it makes all exp <= 1.
October 9, 2025 at 4:18 PM
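A minimal sketch of the max-subtraction trick in plain Python, with a naive version for comparison. The function names are made up for illustration, and real implementations would typically operate on arrays rather than lists.

```python
import math


def softmax(xs):
    """Softmax computed as softmax(x - max(x)) to avoid overflow."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # every term is <= 1
    total = sum(exps)                     # >= 1: the max element contributes exp(0)
    return [e / total for e in exps]


def softmax_naive(xs):
    """Direct definition; math.exp raises OverflowError for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([1000.0, 1000.0]))  # fine: exponents are exp(0)
try:
    softmax_naive([1000.0, 1000.0])
except OverflowError:
    print("naive version overflows")
```

For inputs small enough that the naive version works, both produce the same result up to rounding, since the exp(max(x)) factor cancels between numerator and denominator.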
Yeah if you can process two triangles at once you can do this without weird offset math. However, for mesh shaders, certain *ahemd* vendors really want you to output one triangle per thread, so that option stops being appealing...
September 23, 2025 at 2:13 AM
~3m instead of ~4m now :) It's actually ~2m40s when not run under the profiler for whatever reason.
While I *can* make the green bars completely solid I've already spent way longer than I should on this exercise so this will have to do!
September 19, 2025 at 10:09 PM
Not just from tiny_ocl.h, no.
September 2, 2025 at 8:33 PM
I would recommend updating the documentation, as tiny_bvh requires C++17 now (digit separators are C++14, static inline variables are C++17); this is not very obvious from the readme.
September 2, 2025 at 6:09 PM
It's kinda ironic: on one hand, it's probably enough; on the other hand, Epic had to write a custom compute micropoly rasterizer because it was in fact not enough :)
2080 had 6 GPCs @ 1.5 GHz, 5070 has 5 GPCs @ 2.3 GHz. So just in general fairly close, as long as they didn't increase tri/GPC rate.
August 26, 2025 at 4:29 AM
Obviously if I ran this on a 5090 I'd expect higher performance... and many more watts. But I don't have a 5090.
And probably from the architectural perspective, pure rasterization bottlenecks were squeezed dry 7 years ago and there's not much else to do, and not much need - 19B/sec is enough.
August 26, 2025 at 3:50 AM
So this suggests no progress in pure rasterization performance in... 7 years? In fact a noticeable regression in tri/sec/W.
Of course, when people say "rasterization", they usually mean modern ALU-heavy rendering pipelines - not a pure geometry stress test. Still!
Caveat: no 2080 to retest again :)
August 26, 2025 at 3:45 AM
... couple minor tweaks to meshlet function interfaces, as they were extremely experimental at the time. Everything else worked as is.
- Curiously, the commit log said it ran at ~19B tri/sec on RTX 2080. On my RTX 5070 now, I get ~17B tri/sec on the same mesh. 5070 is 250W, my 2080 was a 215W model.
August 26, 2025 at 3:45 AM
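Taking the figures in this post at face value, the per-watt comparison works out roughly as follows. This is a back-of-the-envelope sketch: it uses the quoted board power ratings (215 W and 250 W) as a stand-in for measured power draw, which is an assumption, and the variable names are made up for illustration.

```python
# Figures quoted in the post; throughput is approximate ("~19B", "~17B").
RTX_2080 = {"tri_per_sec": 19e9, "watts": 215}
RTX_5070 = {"tri_per_sec": 17e9, "watts": 250}


def tri_per_sec_per_watt(gpu):
    """Triangle throughput divided by rated board power."""
    return gpu["tri_per_sec"] / gpu["watts"]


old = tri_per_sec_per_watt(RTX_2080)
new = tri_per_sec_per_watt(RTX_5070)

print(f"RTX 2080: {old / 1e6:.1f}M tri/sec/W")
print(f"RTX 5070: {new / 1e6:.1f}M tri/sec/W")
print(f"throughput ratio: {RTX_5070['tri_per_sec'] / RTX_2080['tri_per_sec']:.2f}")
print(f"per-watt ratio:   {new / old:.2f}")
```

Under these assumptions raw throughput is about 0.89x and throughput per rated watt about 0.77x of the 2080 figure.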
Starting in ~20 minutes!
August 23, 2025 at 5:38 PM