Weston Pace
@westonpace.bsky.social
120 followers 360 following 250 posts
Software developer working on all things Arrow and columnar storage; currently, Lance.
westonpace.bsky.social
Douglas squirrels are 1/3 the size of gray squirrels but six times more ferocious.
westonpace.bsky.social
I suspect this will change as caching layers become more mature. The selectivity threshold for cloud storage is something like "one in a million" but more like "one in a thousand" for NVMe.

Also, a self-promotional shout-out: you might want to look at Lance (lancedb.github.io/lance/format...)
westonpace.bsky.social
They do a bit of both. The base model is unsupervised and is generally described as "learning the language". The model is then fine-tuned with supervision for a specific task.

The "suck up as much data as you can" is for the first part.
westonpace.bsky.social
Yesterday, OP responded to my 11 year old comment on their 13 year old post with a pedantic correction.
westonpace.bsky.social
Though I think the "we can't change Parquet" problem is a bit of a false problem. 90% of Parquet users are probably fine to just keep using Parquet. I'm not sure I agree that "the long time archival format" and the "database storage format" need to be the same thing.
westonpace.bsky.social
That might be next week's blog post ;). Short answer is I see it as a table format problem and not a file format problem. Change "decoder" to "file reader". Change "stored in the page" to "stored in a folder on the table" and change "wasm" to "pluggable" (native or wasm).
westonpace.bsky.social
Hope this helps, it's fun to see so much exciting innovation in a space that's been relatively quiet for many years!
westonpace.bsky.social
F3 is from a joint project between CMU and Tsinghua University. They have tackled the "forwards compatibility" problem by storing WASM decoders with the data so that old readers can read data written by futuristic writers.
westonpace.bsky.social
FastLanes comes from CWI. They're the group that's designed some of the new lightweight compression algorithms (e.g. FSST). They definitely focus on compression and they likely have the best layout for processing data already in memory.
westonpace.bsky.social
Vortex comes from SpiralDB. They've done a good job explaining what they do and writing about it. They've made a big focus on compression but, especially, on pushing down compute to run against compressed data.
westonpace.bsky.social
Nimble comes from Meta, and there has sadly not been much written about it publicly. The best I can say at the moment is that Nimble has made perhaps the biggest emphasis of all the formats on extremely wide schemas (again, all formats have done some work here).
westonpace.bsky.social
I work on Lance! So I'm most biased here. We focus on balancing random access and full scans. All formats have focused on better random access / large data, but none to the extent that we have, especially for tensors / embeddings.
westonpace.bsky.social
Lots of work being done on columnar file formats lately. I count 5 new formats so far (Lance, Nimble, Vortex, FastLanes, F3).

It's definitely something we follow at LanceDB and it can be confusing to track. So here is my very biased head-canon (trying to stay positive).
westonpace.bsky.social
Newest house mate is an industrious spider that spends every day building a beautiful web right at eye level so I can blearily walk face first into it every morning.
westonpace.bsky.social
Son got mad and told me he wouldn't take me to the creamery when I died. I have some questions.
westonpace.bsky.social
Discussion points so far...

Should we slap `urn:` in the front so that users get a free parser?

Should the coordinates be repeated in the file itself?
westonpace.bsky.social
We're trying to figure out "Substrait coordinates" (e.g. organization, name, version tuples) for Substrait functions. Is anyone out there actually passionate about the topic or have any lessons or advice?

At the moment, leaning towards `organization:name:version` (forbidding colons within each field).
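
To make the proposal concrete, here's a minimal sketch of what parsing such a coordinate could look like. This is purely illustrative: the `parse_coordinate` / `parse_urn` names are hypothetical, and the `urn:` handling just reflects the discussion point above, not a settled design.

```python
# Hedged sketch: parse a proposed "organization:name:version" Substrait
# coordinate, assuming colons are forbidden inside each field.

def parse_coordinate(coord: str) -> tuple[str, str, str]:
    """Split a coordinate into (organization, name, version)."""
    parts = coord.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected organization:name:version, got {coord!r}")
    org, name, version = parts
    return org, name, version

def parse_urn(coord: str) -> tuple[str, str, str]:
    # If a `urn:` prefix were adopted (one of the open questions), a reader
    # would strip it before splitting on colons.
    if coord.startswith("urn:"):
        coord = coord[len("urn:"):]
    return parse_coordinate(coord)
```

Forbidding colons inside each field is what makes the "free parser" argument work: a plain three-way split is unambiguous without any escaping rules.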
westonpace.bsky.social
Hmm, it shouldn't be _that_ slow. DuckDB is going to do one query to get column values (O(N), pretty fast) and another with a "case when" for each possible value (O(C*N)). I wonder if there's some optimization opportunity for hundreds of "case when" statements collapsing into a dict lookup.
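
The intuition can be sketched in plain Python (this is an illustration of the cost model, not DuckDB internals): a CASE WHEN chain checks up to C branches per row, while the collapsed form is a single hash lookup per row.

```python
# Illustrative cost-model sketch, not DuckDB internals.
C = 300
mapping = {f"v{i}": i for i in range(C)}      # the C possible column values
rows = [f"v{i % C}" for i in range(10_000)]   # N rows of data

def case_when(v):
    # Emulates CASE WHEN v = 'v0' THEN 0 WHEN v = 'v1' THEN 1 ... END:
    # up to C comparisons per row, O(C*N) overall.
    for key, out in mapping.items():
        if v == key:
            return out
    return None

def dict_lookup(v):
    # The hoped-for collapsed form: one hash lookup per row, O(N) overall.
    return mapping.get(v)

# Both strategies produce identical results; only the per-row cost differs.
assert all(case_when(v) == dict_lookup(v) for v in rows)
```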
westonpace.bsky.social
768 is very common, probably the most common I see from users. 128 is still around but rare. One user even has 1536.
westonpace.bsky.social
Time to double check my reservation
westonpace.bsky.social
Sounds like you got yourself a new DIY project 😉
westonpace.bsky.social
Oof, that is definitely "reevaluate my life choices" territory 😆
westonpace.bsky.social
Enjoying the lesser known feasting holiday "fridge fails in the middle of summer"
westonpace.bsky.social
Anyone else getting strange segfaults from OpenSSL 3.0.17 on bookworm in docker?