Lightnews — Scholar-powered news

Anders

@dataders.bsky.social

👋

April 3, 2025 at 6:16 PM

Anders

@dataders.bsky.social

yeah the multi-cloud story is far from over. you can imagine that without egress costs, it might actually be performant to move data b/w AWS and Azure if the data centers are close enough.

Iceberg kinda makes DWH on Cloudflare R2 feasible given it has S3-compatible API. Sippy seems cool

Sippy · Cloudflare R2 docs

Sippy is a data migration service that allows you to copy data from other cloud providers to R2 as the data is requested, without paying unnecessary cloud egress fees typically associated with moving ...

developers.cloudflare.com

April 1, 2025 at 8:40 PM

Anders

@dataders.bsky.social

p.s. also TI[F]L what "serde" stands for after seeing the word for years 🤦

March 17, 2025 at 2:44 PM

Anders

@dataders.bsky.social

11/11) p.s. forgot to link to the repo!
github.com/deepseek-ai/...

GitHub - deepseek-ai/smallpond: A lightweight data processing framework built on DuckDB and 3FS.

github.com

March 3, 2025 at 3:31 PM

Anders

@dataders.bsky.social

10) So sick that a smallpond pipeline returns a LogicalPlan representing a DAG where each node is a distinct data processing task.

Imagine if a dbt DAG resulted in a single logical plan that operates across multiple engines.

and then they can optimize the plan before execution as well! 🤯🤯🤯
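To make the "plan as a DAG, optimized before execution" idea concrete, here is a toy sketch in plain Python. It is not smallpond's API: `PlanNode`, `build_plan`, `prune`, and `execution_order` are hypothetical names made up for illustration, and the only "optimization" shown is dropping tasks that the requested output never needs.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


@dataclass(frozen=True)
class PlanNode:
    """One task in a toy logical plan (hypothetical, not a smallpond class)."""
    name: str
    op: str
    inputs: tuple = ()  # names of upstream nodes


def build_plan():
    """Build a small DAG; 'debug_sample' is a branch nothing downstream uses."""
    nodes = (
        PlanNode("scan_raw", "scan"),
        PlanNode("drop_nulls", "filter", ("scan_raw",)),
        PlanNode("debug_sample", "sample", ("scan_raw",)),
        PlanNode("partition_by_key", "shuffle", ("drop_nulls",)),
        PlanNode("count_per_key", "aggregate", ("partition_by_key",)),
    )
    return {node.name: node for node in nodes}


def prune(plan, target):
    """Keep only the nodes that `target` transitively depends on."""
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(plan[name].inputs)
    return {name: node for name, node in plan.items() if name in needed}


def execution_order(plan):
    """Return task names so that every task comes after its inputs."""
    graph = {name: set(node.inputs) for name, node in plan.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    optimized = prune(build_plan(), "count_per_key")
    print(execution_order(optimized))
```

Because the whole pipeline exists as data before anything runs, a rewrite pass like `prune` can inspect and change it up front. A real optimizer would do far more than this.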

March 3, 2025 at 3:31 PM

Anders

@dataders.bsky.social

9) Arrow is the unsung hero of this project (and arguably all innovation in data ecosystem).

it's what enables:
1. all this interchangeability of query engines
2. (likely) using duckdb in a distributed environment in the first place

March 3, 2025 at 3:27 PM

Anders

@dataders.bsky.social

8) this HN comment called out that smallpond supports using different query engines for different jobs (shuffling vs sorting).

Very bullish on this future of right tool for right job and making it as simple as a config

news.ycombinator.com/item?id=4323...

One thing I found peculiar is that for the GraySort benchmark it dispatches to P... | Hacker News

news.ycombinator.com
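A minimal sketch of what "right tool for the right job, as simple as a config" could look like. Everything here is an assumption for illustration: `engine_a` and `engine_b` are placeholder names rather than real query engines, the two "engines" are plain Python functions, and this is not how smallpond wires things up.

```python
from typing import Callable, Dict, List

Rows = List[int]


def shuffle_engine(rows: Rows, partitions: int = 2) -> List[Rows]:
    """Placeholder engine: bucket rows by key modulo the partition count."""
    buckets: List[Rows] = [[] for _ in range(partitions)]
    for row in rows:
        buckets[row % partitions].append(row)
    return buckets


def sort_engine(rows: Rows) -> Rows:
    """Placeholder engine: return the rows in ascending order."""
    return sorted(rows)


# Registry of available engines (hypothetical names).
ENGINES: Dict[str, Callable] = {
    "engine_a": shuffle_engine,
    "engine_b": sort_engine,
}

# The "config": which engine handles which kind of job.
JOB_CONFIG: Dict[str, str] = {
    "shuffle": "engine_a",
    "sort": "engine_b",
}


def run_job(job: str, rows: Rows, config: Dict[str, str] = JOB_CONFIG):
    """Look up the engine configured for this job type and run it."""
    engine = ENGINES[config[job]]
    return engine(rows)


if __name__ == "__main__":
    print(run_job("sort", [3, 1, 2]))
    print(run_job("shuffle", [1, 2, 3, 4]))
```

Swapping the engine for a job type is a one-line change to `JOB_CONFIG`, and the pipeline code that calls `run_job` stays the same.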

March 3, 2025 at 3:23 PM

Anders

@dataders.bsky.social

7)
TIRED: "big vs. small" & "distributed vs single-node"
WIRED: tactical deployment of single-node query engines within distributed frameworks.

another great example is Apache Comet, which plugs DataFusion into Spark to accelerate single-node operations, resulting in overall Spark performance speedups

March 3, 2025 at 3:20 PM

Anders

@dataders.bsky.social

6) Making smallpond 5 years ago would have been very difficult! but the emergence of lower-level, off-the-shelf components greatly accelerates development.

Within the year, we'll see this new paradigm catch on. Future examples will probably be using DataFusion not DuckDB.

March 3, 2025 at 3:18 PM

Anders

@dataders.bsky.social

5) there's been previous discussion on DeepSeek's scrappiness and I think it shows here. They had a vision of what they wanted and, rather than paying for software or forcing their vision into an existing tool, they were able to ship exactly what they wanted

March 3, 2025 at 3:12 PM

Anders

@dataders.bsky.social

4) smallpond is a bespoke data processing framework using off-the-shelf, OSS components (ray, arrow, duckdb, polars).

Why didn't they use {TOOL}? My guesses
❌ dbt: they wanted Python Dataframe API
❌ Airflow: not as close to metal as Ray
❌ pytorch or ray[data]: idk tbh

March 3, 2025 at 3:10 PM

Anders

@dataders.bsky.social

3) ray.io is the foundation of any training and inference infrastructure. I haven't had much exposure to Ray as a SQL monkey using DWHs, but it's just recently clicked for me how big of a deal it is

Scale Machine Learning & AI Computing | Ray by Anyscale

Ray is an open source framework for managing, executing, and optimizing compute needs. Unify AI workloads with Ray by Anyscale. Try it for free today.

ray.io

March 3, 2025 at 3:01 PM

Anders

@dataders.bsky.social

2) so cool to begin to see the data infra that supports training LLMs. Open weights is cool, and RAG makes sense, but as a former XGBooster turned "data engineer", seeing the data cleaning pipelines is what I've most wanted to see.

March 3, 2025 at 2:58 PM

Anders

@dataders.bsky.social

dude -- so cool! one of my self-described superpowers is being very "plugged in", but this doesn't happen without significant time and attention costs.

what you've made changes the game imho. now I need the same for all the Slacks & Discords I'm in.

February 19, 2025 at 4:47 PM

Anders

@dataders.bsky.social

Yeah that ADP chart hurt my friend y-axis so bad I think axisslaughter should be a punishable crime

January 26, 2025 at 4:19 PM
