Weston Pace
@westonpace.bsky.social
120 followers 360 following 250 posts
Software developer working on all things Arrow and columnar storage; currently, Lance.
westonpace.bsky.social
Douglas squirrels are 1/3 the size of gray squirrels but six times more ferocious.
westonpace.bsky.social
I suspect this will change as caching layers become more mature. The selectivity threshold for cloud storage is something like "one in a million" but more like "one in a thousand" for NVMe.

Also, a self-promotional shout-out: you might want to look at Lance (lancedb.github.io/lance/format...)
westonpace.bsky.social
They do a bit of both. The base model is unsupervised and is generally described as "learning the language". The model is then fine-tuned with supervision for a specific task.

The "suck up as much data as you can" is for the first part.
westonpace.bsky.social
Yesterday, OP responded to my 11 year old comment on their 13 year old post with a pedantic correction.
westonpace.bsky.social
Though I think the "we can't change Parquet" problem is a bit of a false problem. 90% of Parquet users are probably fine to just keep using Parquet. I'm not sure I agree that "the long time archival format" and the "database storage format" need to be the same thing.
westonpace.bsky.social
That might be next week's blog post ;). Short answer is I see it as a table format problem and not a file format problem. Change "decoder" to "file reader". Change "stored in the page" to "stored in a folder on the table" and change "wasm" to "pluggable" (native or wasm).
westonpace.bsky.social
Hope this helps, it's fun to see so much exciting innovation in a space that's been relatively quiet for many years!
westonpace.bsky.social
F3 is from a joint project between CMU and Tsinghua University. They have tackled the "forwards compatibility" problem by storing WASM decoders with the data so that old readers can read data written by futuristic writers.
westonpace.bsky.social
FastLanes comes from CWI. They're the group that's designed some of the new lightweight compression algorithms (e.g. FSST). They definitely focus on compression and they likely have the best layout for processing data already in memory.
westonpace.bsky.social
Vortex comes from SpiralDB. They've done a good job explaining what they do and writing about it. They've made a big focus on compression but, especially, on pushing down compute to run against compressed data.
westonpace.bsky.social
Nimble comes from Meta, and there has sadly not been much written about it publicly. The best I can say at the moment is that Nimble has made perhaps the biggest emphasis of all the formats on extremely wide schemas (again, all formats have done some work here).
westonpace.bsky.social
I work on Lance! So I'm most biased here. We focus on balancing random access and full scans. All formats have focused on better random access / large data, but none to the extent that we have, especially for tensors / embeddings.
westonpace.bsky.social
Lots of work being done on columnar file formats lately. I count 5 new formats so far (Lance, Nimble, Vortex, FastLanes, F3).

It's definitely something we follow at LanceDB and it can be confusing to track. So here is my very biased head-canon (trying to stay positive).
westonpace.bsky.social
Newest house mate is an industrious spider that spends every day building a beautiful web right at eye level so I can blearily walk face first into it every morning.
westonpace.bsky.social
Son got mad and told me he wouldn't take me to the creamery when I died. I have some questions.
westonpace.bsky.social
Discussion points so far...

Should we slap `urn:` in the front so that users get a free parser?

Should the coordinates be repeated in the file itself?
westonpace.bsky.social
We're trying to figure out "Substrait coordinates" (e.g. organization, name, version tuples) for Substrait functions. Is anyone out there actually passionate about the topic or have any lessons or advice?

At the moment, leaning towards `organization:name:version` (forbidding colons within each field).
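
To make the proposal concrete, here's a minimal sketch of what parsing such a coordinate could look like. This is purely illustrative: the `parse_coordinate` / `parse_urn` names are hypothetical, and the `urn:` handling just reflects the discussion point above, not a settled design.

```python
# Hedged sketch: parse a proposed "organization:name:version" Substrait
# coordinate, assuming colons are forbidden inside each field.

def parse_coordinate(coord: str) -> tuple[str, str, str]:
    """Split a coordinate into (organization, name, version)."""
    parts = coord.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected organization:name:version, got {coord!r}")
    org, name, version = parts
    return org, name, version

def parse_urn(coord: str) -> tuple[str, str, str]:
    # If a `urn:` prefix were adopted (one of the open questions), a reader
    # would strip it before splitting on colons.
    if coord.startswith("urn:"):
        coord = coord[len("urn:"):]
    return parse_coordinate(coord)
```

Forbidding colons inside each field is what makes the "free parser" argument work: a plain three-way split is unambiguous without any escaping rules.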
westonpace.bsky.social
Hmm, it shouldn't be _that_ slow. DuckDB is going to do one query to get column values (O(N), pretty fast) and another with a "case when" for each possible value (O(C*N)). I wonder if there's some optimization opportunity for hundreds of "case when" statements collapsing into a dict lookup.
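
The intuition can be sketched in plain Python (this is an illustration of the cost model, not DuckDB internals): a CASE WHEN chain checks up to C branches per row, while the collapsed form is a single hash lookup per row.

```python
# Illustrative cost-model sketch, not DuckDB internals.
C = 300
mapping = {f"v{i}": i for i in range(C)}      # the C possible column values
rows = [f"v{i % C}" for i in range(10_000)]   # N rows of data

def case_when(v):
    # Emulates CASE WHEN v = 'v0' THEN 0 WHEN v = 'v1' THEN 1 ... END:
    # up to C comparisons per row, O(C*N) overall.
    for key, out in mapping.items():
        if v == key:
            return out
    return None

def dict_lookup(v):
    # The hoped-for collapsed form: one hash lookup per row, O(N) overall.
    return mapping.get(v)

# Both strategies produce identical results; only the per-row cost differs.
assert all(case_when(v) == dict_lookup(v) for v in rows)
```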
westonpace.bsky.social
768 is very common, probably the most common I see from users. 128 is still around but rare. One user even has 1536.
westonpace.bsky.social
Time to double check my reservation
westonpace.bsky.social
Sounds like you got yourself a new DIY project 😉
westonpace.bsky.social
Oof, that is definitely "reevaluate my life choices" territory 😆
westonpace.bsky.social
Enjoying the lesser known feasting holiday "fridge fails in the middle of summer"
westonpace.bsky.social
Anyone else getting strange segfaults from OpenSSL 3.0.17 on bookworm in docker?