Tim Mastny
@timmastny.bsky.social
Software engineer bringing digital finance to Africa. Writes about low-level programming, data science, and algorithms at https://timmastny.com
Word is that it's just a wrapper around Claude 3.7! www.reddit.com/r/LocalLLaMA...
Manus turns out to be just Claude Sonnet + 29 other tools, Reflection 70B vibes ngl
www.reddit.com
March 10, 2025 at 5:03 PM
This is called false sharing - when CPUs coordinate over variables they don't actually share.

In this case the penalty was small, but in other cases it can cause 5%, 25%, or even 8000% performance degradation. If you want to learn more, check out my post!
timmastny.com/blog/false-s...
False Sharing | Tim Mastny
timmastny.com
February 18, 2025 at 6:28 PM
The cache line is the smallest unit of data a CPU cache manages, commonly 64 or 128 bytes. If a CPU wants to modify a variable on a cache line, it must coordinate with all other CPUs to invalidate their copies, even if some of the variables on that line are not shared.
February 18, 2025 at 6:28 PM
Variables `a` and `b` are not shared between the two threads: thread 0 uses `a` and thread 1 uses `b`. But since they are declared next to each other in the program on the left, they end up on the same cache line.
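The actual programs are in the screenshots, but the setup looks roughly like this minimal C++ sketch (names and iteration counts are illustrative, not the original code):

```cpp
#include <atomic>
#include <thread>

struct Adjacent {
    std::atomic<long> a{0};  // used only by thread 0
    std::atomic<long> b{0};  // used only by thread 1, but on the same cache line as a
};

struct Padded {
    alignas(64) std::atomic<long> a{0};  // each counter forced onto
    alignas(64) std::atomic<long> b{0};  // its own 64-byte cache line
};

template <typename T>
void run(T& s) {
    std::thread t0([&] { for (long i = 0; i < 50'000'000; ++i) s.a++; });
    std::thread t1([&] { for (long i = 0; i < 50'000'000; ++i) s.b++; });
    t0.join();
    t1.join();
}

int main() {
    Adjacent adj;  // threads ping-pong ownership of one cache line
    Padded pad;    // no coordination: each thread owns its own line
    run(adj);
    run(pad);
}
```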
February 18, 2025 at 6:28 PM
The program on the right is 0.7% faster, measured over millions of iterations. Why?
February 18, 2025 at 6:28 PM
Check out problem 2: sirupsen.com/napkin/probl...
And my napkin math: timmastny.com/projects/nap...
Expected Database Query Latency
sirupsen.com
February 12, 2025 at 5:48 PM
It’s also random access for main memory, but we only need to load one new cache line from main memory per page: 50 pages * 50 ns = 2.5 us. The prefetcher should ensure the sequential read of the rest of the page is fast.

Even if we have to fetch more cache lines, it’s much faster than 1ms!
February 12, 2025 at 5:48 PM
Solution: I didn’t consider that each page access is random! A random 8KB SSD read is 100 us, so a 16KB page is 100 us + 1 us ≈ 100 us, and 10 pages is 1 ms.

When doing random reads, SSDs are 100x slower, not 10x!
February 12, 2025 at 5:48 PM
Total:
50 pages * 0.25 us / page = 12.5 us
10 pages * 2 us / page = 20 us
12.5 us + 20 us = 32.5 us

Seems reasonable: SSD is 10x slower than main memory.
February 12, 2025 at 5:48 PM
50 pages / query * 80% of pages in cache = 40 pages in cache, 10 on disk.

Sequential SSD read: 1 us / 8KB * 16KB / page = 2 us / page.

Page scan time: 1 ns / 64 bytes * 16KB / page = 250 ns / page = 0.25 us / page
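A quick sketch that redoes this arithmetic, including the random-read correction above (constants from the napkin math reference tables):

```cpp
#include <cstdio>

int main() {
    const double pages = 50, cached = 0.8 * pages, on_disk = pages - cached;
    const double scan_us = 16'000.0 / 64 * 0.001;  // 1 ns per 64-byte line, 16KB page
    const double seq_ssd_us = 16.0 / 8 * 1.0;      // 1 us per sequential 8KB read
    const double rand_ssd_us = 100.0;              // ~100 us per random SSD read

    std::printf("scan all pages:  %6.1f us\n", pages * scan_us);  // 12.5 us
    std::printf("naive total:     %6.1f us\n",
                pages * scan_us + on_disk * seq_ssd_us);          // 32.5 us
    std::printf("corrected total: %6.1f us\n",
                pages * scan_us + on_disk * rand_ssd_us);         // ~1012 us, ~1 ms
}
```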
February 12, 2025 at 5:48 PM
The blue line in the first plot is our theoretical model and the black points are average search times on my implementation of B+ tree search. Our theoretical model agrees very closely with our benchmark!

Check out my blog post for more details: timmastny.com/blog/tuning-...
Tuning B+ Trees for Memory Performance | Tim Mastny
timmastny.com
February 11, 2025 at 6:54 PM
Pointer chasing is slow, but the number is limited to the height of the tree. Scanning is fast, but quickly adds up. Our model combines these costs: for any number of total keys, we can calculate the optimal number of keys per node 'b' that minimizes total search time.
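A sketch of that model, assuming 100 ns per pointer jump and 1 ns per key scanned as in the extremes below (the blog post's model may differ in its details):

```cpp
#include <cmath>
#include <cstdio>

double search_ns(double n_keys, double b) {
    double height = std::log(n_keys) / std::log(b);  // ~log_b(N) levels
    return height * (100.0 + b * 1.0);               // one jump + scan of b keys per level
}

int main() {
    const double n = 1'000'000'000.0;  // 1 billion keys
    double best_b = 2, best_ns = search_ns(n, 2);
    for (double b = 3; b <= 4096; ++b) {
        if (search_ns(n, b) < best_ns) { best_ns = search_ns(n, b); best_b = b; }
    }
    std::printf("optimal b ~ %.0f keys/node, search ~ %.0f ns\n", best_b, best_ns);
}
```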
February 11, 2025 at 6:54 PM
On the other hand, a B+ tree with infinite keys per node is an array. A sequential scan of 1 trillion keys would take about 16 minutes at 1ns per key.
February 11, 2025 at 6:54 PM
Let’s start with the extremes. With one key per node, our B+ tree is a binary search tree. Even with 1,000,000,000 keys, the maximum height is < 30. We can find any key in less than 3 us: 30 pointer jumps * 100 ns per jump = 3 us
February 11, 2025 at 6:54 PM
Here's problem one from the newsletter:
sirupsen.com/napkin/probl...

And my napkin math:
timmastny.com/projects/nap...
Problem 1 | Tim Mastny
timmastny.com
February 4, 2025 at 11:52 PM
The solution estimated the storage cost to be $0.01 / GB / month, but didn't mention write costs. Simon mentions disk storage rather than blob, which I did not consider 🤔
February 4, 2025 at 11:52 PM
Tips from solution: calculate and estimate with powers of 10 to simplify the math, e.g. 1 day = 9 * 10^4 seconds. I also like to use powers of 10 that are multiples of three to match SI prefixes: 5 * 10^5 bytes = 0.5 * 10^6 = 0.5 MB
February 4, 2025 at 11:52 PM
At $5 / 1 million writes, we have to batch requests together before writing to the blob, or writes alone would cost 250x our storage cost. Batching 1000 requests costs $15,000 / year. Batching 1,000,000 is only $15.
February 4, 2025 at 11:52 PM
Data size: 1 byte per character, 5 characters per word, 200 words per request = 1KB per request. That works out to be 9TB / day. Blob storage at $0.02 / GB / month is $65,000 per year. But we also have to pay for writes to blob storage!
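A sketch redoing the cost arithmetic. The request volume follows from the 9TB/day figure at 1KB/request; retaining ~30 days of data is my assumption, chosen because it reproduces the ~$65,000/year storage number:

```cpp
#include <cstdio>

int main() {
    const double req_per_day = 9e12 / 1e3;  // 9 TB/day / 1KB = 9e9 requests/day
    const double stored_gb = 9e3 * 30;      // assume ~30 days retained = 270 TB
    std::printf("storage: $%.0f / year\n",
                stored_gb * 0.02 * 12);     // $0.02 / GB / month -> ~$65,000

    for (double batch : {1.0, 1e3, 1e6}) {  // requests per blob write
        double writes_per_year = req_per_day * 365 / batch;
        std::printf("batch %8g: $%.0f / year in writes\n",
                    batch, writes_per_year * 5.0 / 1e6);  // $5 per 1M writes
    }
}
```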
February 4, 2025 at 11:52 PM