Tim Mastny
@timmastny.bsky.social
Software engineer bringing digital finance to Africa. Writes about low-level programming, data science, and algorithms at https://timmastny.com
Word is that it's just a wrapper around Claude 3.7! www.reddit.com/r/LocalLLaMA...
Manus turns out to be just Claude Sonnet + 29 other tools, Reflection 70B vibes ngl
www.reddit.com
March 10, 2025 at 5:03 PM
This is called false sharing - when CPUs coordinate over variables they don't actually share.

In this case the penalty was small, but in other cases it can cause 5%, 25%, or even 8000% performance degradation. If you want to learn more, check out my post!
timmastny.com/blog/false-s...
False Sharing | Tim Mastny
timmastny.com
February 18, 2025 at 6:28 PM
The cache line is the smallest unit of data a CPU cache manages, commonly 64 or 128 bytes. If a CPU wants to modify a variable on a cache line, it must coordinate with all other CPUs to invalidate their copies, even if some of the variables on that line are not shared.
February 18, 2025 at 6:28 PM
Variables `a` and `b` are not shared between the two threads: thread 0 uses `a` and thread 1 uses `b`. But since they are declared next to each other in the program on the left, they end up on the same cache line.
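The actual programs are in the screenshots, but the setup looks roughly like this minimal C++ sketch (names and iteration counts are illustrative, not the original code):

```cpp
#include <atomic>
#include <thread>

struct Adjacent {
    std::atomic<long> a{0};  // used only by thread 0
    std::atomic<long> b{0};  // used only by thread 1, but on the same cache line as a
};

struct Padded {
    alignas(64) std::atomic<long> a{0};  // each counter forced onto
    alignas(64) std::atomic<long> b{0};  // its own 64-byte cache line
};

template <typename T>
void run(T& s) {
    std::thread t0([&] { for (long i = 0; i < 50'000'000; ++i) s.a++; });
    std::thread t1([&] { for (long i = 0; i < 50'000'000; ++i) s.b++; });
    t0.join();
    t1.join();
}

int main() {
    Adjacent adj;  // threads ping-pong ownership of one cache line
    Padded pad;    // no coordination: each thread owns its own line
    run(adj);
    run(pad);
}
```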
February 18, 2025 at 6:28 PM
The program on the right is 0.7% faster, measured over millions of iterations. Why?
February 18, 2025 at 6:28 PM
Check out problem 2: sirupsen.com/napkin/probl...
And my napkin math: timmastny.com/projects/nap...
Expected Database Query Latency
sirupsen.com
February 12, 2025 at 5:48 PM
It’s also random access for main memory, but we only need to load one new cache line from main memory per page: 50 pages * 50 ns = 2.5 us. The prefetcher should ensure the sequential read of the rest of the page is fast.

Even if we have to fetch more cache lines, it’s much faster than 1ms!
February 12, 2025 at 5:48 PM
Solution: I didn’t consider that each page access is random! A random 8KB SSD read is 100 us, so a 16KB page is 100 us + 1 us ≈ 100 us, and 10 pages is 1 ms.

When doing random reads, SSDs are 100x slower, not 10x!
February 12, 2025 at 5:48 PM
Total:
50 pages * 0.25 us / page = 12.5 us
10 pages * 2 us / page = 20 us
12.5 us + 20 us = 32.5 us

Seems reasonable: SSD is 10x slower than main memory.
February 12, 2025 at 5:48 PM
50 pages / query * 80% of pages in cache = 40 pages in cache, 10 on disk.

Sequential SSD read: 1 us / 8KB * 16KB / page = 2 us / page.

Page scan time: 1 ns / 64 bytes * 16KB / page = 250 ns / page = 0.25 us / page
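A quick sketch that redoes this arithmetic, including the random-read correction above (constants from the napkin math reference tables):

```cpp
#include <cstdio>

int main() {
    const double pages = 50, cached = 0.8 * pages, on_disk = pages - cached;
    const double scan_us = 16'000.0 / 64 * 0.001;  // 1 ns per 64-byte line, 16KB page
    const double seq_ssd_us = 16.0 / 8 * 1.0;      // 1 us per sequential 8KB read
    const double rand_ssd_us = 100.0;              // ~100 us per random SSD read

    std::printf("scan all pages:  %6.1f us\n", pages * scan_us);  // 12.5 us
    std::printf("naive total:     %6.1f us\n",
                pages * scan_us + on_disk * seq_ssd_us);          // 32.5 us
    std::printf("corrected total: %6.1f us\n",
                pages * scan_us + on_disk * rand_ssd_us);         // ~1012 us, ~1 ms
}
```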
February 12, 2025 at 5:48 PM
The blue line in the first plot is our theoretical model and the black points are average search times on my implementation of B+ tree search. Our theoretical model agrees very closely with our benchmark!

Check out my blog post for more details: timmastny.com/blog/tuning-...
Tuning B+ Trees for Memory Performance | Tim Mastny
timmastny.com
February 11, 2025 at 6:54 PM
Pointer chasing is slow, but the number is limited to the height of the tree. Scanning is fast, but quickly adds up. Our model combines these costs: for any number of total keys, we can calculate the optimal number of keys per node 'b' that minimizes total search time.
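A sketch of that model, assuming 100 ns per pointer jump and 1 ns per key scanned as in the extremes below (the blog post's model may differ in its details):

```cpp
#include <cmath>
#include <cstdio>

double search_ns(double n_keys, double b) {
    double height = std::log(n_keys) / std::log(b);  // ~log_b(N) levels
    return height * (100.0 + b * 1.0);               // one jump + scan of b keys per level
}

int main() {
    const double n = 1'000'000'000.0;  // 1 billion keys
    double best_b = 2, best_ns = search_ns(n, 2);
    for (double b = 3; b <= 4096; ++b) {
        if (search_ns(n, b) < best_ns) { best_ns = search_ns(n, b); best_b = b; }
    }
    std::printf("optimal b ~ %.0f keys/node, search ~ %.0f ns\n", best_b, best_ns);
}
```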
February 11, 2025 at 6:54 PM
On the other hand, a B+ tree with infinite keys per node is an array. A sequential scan of 1 trillion keys would take about 16 minutes at 1ns per key.
February 11, 2025 at 6:54 PM
Let’s start with the extremes. With one key per node, our B+ tree is a binary search tree. Even with 1,000,000,000 keys, the maximum height is < 30. We can find any key in less than 3 us: 30 pointer jumps * 100 ns per jump = 3 us
February 11, 2025 at 6:54 PM
Here's problem one from the newsletter:
sirupsen.com/napkin/probl...

And my napkin math:
timmastny.com/projects/nap...
Problem 1 | Tim Mastny
timmastny.com
February 4, 2025 at 11:52 PM
The solution estimated the storage cost to be $0.01 / GB / month, but didn't mention write costs. Simon mentions disk storage rather than blob, which I did not consider 🤔
February 4, 2025 at 11:52 PM
Tips from solution: calculate and estimate with powers of 10 to simplify the math, e.g. 1 day = 9 * 10^4 seconds. I also like to use powers of 10 that are multiples of three to match SI prefixes: 5 * 10^5 bytes = 0.5 * 10^6 = 0.5 MB
February 4, 2025 at 11:52 PM
At $5 / 1 million writes, we have to batch requests together before writing to the blob, or writes alone would cost 250x our storage cost. Batching 1000 requests costs $15,000 / year. Batching 1,000,000 is only $15.
February 4, 2025 at 11:52 PM
Data size: 1 byte per character, 5 characters per word, 200 words per request = 1KB per request. That works out to be 9TB / day. Blob storage at $0.02 / GB / month is $65,000 per year. But we also have to pay for writes to blob storage!
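A sketch redoing the cost arithmetic. The request volume follows from the 9TB/day figure at 1KB/request; retaining ~30 days of data is my assumption, chosen because it reproduces the ~$65,000/year storage number:

```cpp
#include <cstdio>

int main() {
    const double req_per_day = 9e12 / 1e3;  // 9 TB/day / 1KB = 9e9 requests/day
    const double stored_gb = 9e3 * 30;      // assume ~30 days retained = 270 TB
    std::printf("storage: $%.0f / year\n",
                stored_gb * 0.02 * 12);     // $0.02 / GB / month -> ~$65,000

    for (double batch : {1.0, 1e3, 1e6}) {  // requests per blob write
        double writes_per_year = req_per_day * 365 / batch;
        std::printf("batch %8g: $%.0f / year in writes\n",
                    batch, writes_per_year * 5.0 / 1e6);  // $5 per 1M writes
    }
}
```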
February 4, 2025 at 11:52 PM