Lightnews — Scholar-powered news

Neat!

There is some latency penalty in reading the discriminator before you can decode the instance, and the resulting offset of the actual data can lead to unaligned loads, which is not ideal. I'm wondering if it makes sense to have a baseline alignment (e.g. 16 bytes) and encode the id into the lower bits of the instance ref.
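
A minimal sketch of the low-bits idea, assuming the instance ref is a plain pointer to 16-byte-aligned data (the original thread may use offsets instead). `InstanceRef` and `Sphere` are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packed reference: a pointer to 16-byte-aligned instance data
// with a 4-bit type id stored in the low bits, which alignment leaves zero.
struct InstanceRef {
    static constexpr std::uintptr_t kAlign = 16;
    static constexpr std::uintptr_t kIdMask = kAlign - 1;

    std::uintptr_t bits;

    static InstanceRef make(const void* data, unsigned id) {
        const auto p = reinterpret_cast<std::uintptr_t>(data);
        assert((p & kIdMask) == 0); // data must be 16-byte aligned
        assert(id <= kIdMask);      // id must fit in 4 bits
        return InstanceRef{p | id};
    }

    // The id is available from the ref itself, without touching instance memory.
    unsigned id() const { return static_cast<unsigned>(bits & kIdMask); }

    // Masking the id off restores the aligned data pointer.
    const void* data() const { return reinterpret_cast<const void*>(bits & ~kIdMask); }
};

// Hypothetical instance type; alignas keeps the low 4 address bits free.
struct alignas(16) Sphere {
    float center[3];
    float radius;
};
```

With this layout the type is known before the first load from the instance, and the data always starts on a 16-byte boundary, at the cost of limiting the scheme to 16 distinct ids.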

October 16, 2025 at 4:47 AM

Arseny Kapoulkine

@zeux.io

Ideally set to “tada.wav” of Win 3.11 era youtu.be/QDUv_8Dw-Mw?...

Windows 3.1 - Tada

YouTube video by ProductDesignsYT

youtu.be

October 15, 2025 at 3:39 PM

Arseny Kapoulkine

@zeux.io

At this point I'm curious, what's the end game here? :)

October 10, 2025 at 6:39 AM

Arseny Kapoulkine

@zeux.io

A fun reformulation that’s used a lot these days: when computing softmax of a vector (pointwise exp(x) divided by the sum of exp(x)), you can avoid exponent overflow by computing max(x) and then softmax(x - max(x)) instead. It’s equivalent to dividing both the numerator and the denominator by exp(max(x)), but done before exponentiation, so every exp is <= 1.
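
A short sketch of the max-subtraction trick described above. The function name `stable_softmax` is made up for this example, and single-precision floats are an assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax computed as exp(x - max(x)) / sum(exp(x - max(x))).
// Every exponent is <= 0, so each exp() result lies in (0, 1] and cannot overflow.
std::vector<float> stable_softmax(const std::vector<float>& x) {
    assert(!x.empty());
    const float m = *std::max_element(x.begin(), x.end());

    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - m);
        sum += out[i];
    }
    // sum >= 1 because the maximum element contributes exp(0) == 1.
    for (float& v : out) v /= sum;
    return out;
}
```

A naive version would evaluate exp(1000.0f), which overflows to infinity in single precision; here an input like {1000, 1001, 1002} yields the same result as {0, 1, 2}.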

October 9, 2025 at 4:18 PM

Arseny Kapoulkine

@zeux.io

Yeah if you can process two triangles at once you can do this without weird offset math. However, for mesh shaders, certain *ahemd* vendors really want you to output one triangle per thread, so that option stops being appealing...

September 23, 2025 at 2:13 AM

Arseny Kapoulkine

@zeux.io

~3m instead of ~4m now :) It's actually ~2m40s when not run under the profiler for whatever reason.

While I *can* make the green bars completely solid I've already spent way longer than I should on this exercise so this will have to do!

September 19, 2025 at 10:09 PM

Arseny Kapoulkine

@zeux.io

Not just from tiny_ocl.h, no.

September 2, 2025 at 8:33 PM

Arseny Kapoulkine

@zeux.io

I would recommend updating the documentation, as tiny_bvh requires C++17 now (digit separators are C++14, static inline variables are C++17); this is not very obvious from the readme.
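
For reference, a tiny illustration of the two language features mentioned. The names below (`kMaxNodes`, `Counters`, `record_build`) are hypothetical and are not taken from tiny_bvh:

```cpp
#include <cassert>

// C++14: the ' digit separator makes long literals readable.
constexpr unsigned kMaxNodes = 1'000'000;

// C++17: a static data member declared inline is defined right here in the
// header, with no separate out-of-class definition in a source file.
struct Counters {
    static inline unsigned builds = 0;
};

// Bumps the shared counter; any translation unit including this header
// sees the same Counters::builds object.
inline unsigned record_build() {
    assert(Counters::builds < kMaxNodes);
    return ++Counters::builds;
}
```

Compiling this with a pre-C++17 mode (e.g. `-std=c++14` in GCC or Clang) is expected to produce a diagnostic on the inline variable, which is the kind of surprise a note in the readme would prevent.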

September 2, 2025 at 6:09 PM

Arseny Kapoulkine

@zeux.io

It's kinda ironic: on one hand, it's probably enough; on the other hand, Epic had to write a custom compute micropoly rasterizer because it was in fact not enough :)

2080 had 6 GPCs @ 1.5 GHz, 5070 has 5 GPCs @ 2.3 GHz. So just in general fairly close, as long as they didn't increase tri/GPC rate.

August 26, 2025 at 4:29 AM

Arseny Kapoulkine

@zeux.io

Obviously if I ran this on a 5090 I'd expect higher performance... and many more watts. But I don't have a 5090.

And probably from the architectural perspective, pure rasterization bottlenecks were squeezed dry 7 years ago and there's not much else to do, and not much need - 19B/sec is enough.

August 26, 2025 at 3:50 AM

Arseny Kapoulkine

@zeux.io

So this suggests no progress in pure rasterization performance in... 7 years? In fact a noticeable regression in tri/sec/W.

Of course, when people say "rasterization", they usually mean modern ALU heavy rendering pipelines - not pure geometry stress test. Still!

Caveat: no 2080 to retest again :)

August 26, 2025 at 3:45 AM

Arseny Kapoulkine

@zeux.io

... couple minor tweaks to meshlet function interfaces, as they were extremely experimental at the time. Everything else worked as is.
- Curiously, the commit log said it ran at ~19B tri/sec on RTX 2080. On my RTX 5070 now, I get ~17B tri/sec on the same mesh. 5070 is 250W, my 2080 was a 215W model.
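
A quick back-of-the-envelope check of the efficiency figures quoted above, assuming the rated board power (215W and 250W) stands in for actual power draw, which was not measured here. The helper names are made up for this sketch:

```cpp
#include <cassert>

// Triangles per second per watt, using rated board power as the denominator.
inline double tris_per_watt(double tris_per_sec, double watts) {
    assert(watts > 0.0);
    return tris_per_sec / watts;
}

// Efficiency of the newer card relative to the older one (1.0 means parity).
inline double efficiency_ratio(double new_tps, double new_watts,
                               double old_tps, double old_watts) {
    return tris_per_watt(new_tps, new_watts) / tris_per_watt(old_tps, old_watts);
}

// Figures from the post: ~19B tri/sec at 215W (RTX 2080), ~17B tri/sec at 250W (RTX 5070).
// tris_per_watt(19e9, 215.0) is roughly 88M tri/sec/W.
// tris_per_watt(17e9, 250.0) is 68M tri/sec/W.
// efficiency_ratio(17e9, 250.0, 19e9, 215.0) is roughly 0.77.
```

Under these assumptions the 5070 delivers about 23% fewer triangles per second per watt on this particular test.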

August 26, 2025 at 3:45 AM

Arseny Kapoulkine

@zeux.io

Starting in ~20 minutes!

August 23, 2025 at 5:38 PM
