Skrub

@skrub-data.bsky.social

ApplyToFrame selects columns in the same way, but then uses all of them at the same time as input to the transformer: this is useful for dimensionality reduction.
SelectCols and DropCols can be used as "filtering blocks" in a pipeline.

October 8, 2025 at 12:43 PM

Skrub

@skrub-data.bsky.social

Skrub includes a powerful set of transformers and selectors that let you transform columns based on various conditions.

ApplyToCols lets you select a subset of columns in your dataframe, then applies a transformer to each selected column separately.

October 8, 2025 at 12:43 PM

Skrub

@skrub-data.bsky.social

@pydataparis.bsky.social 2025 is over, and it was a big success!

Our talk was very well received, and we got a lot of great questions, especially about scalability and how to interface with other libraries in production environments.

[Image: a skrub sticker on the back of a laptop]

October 7, 2025 at 2:36 PM

Skrub

@skrub-data.bsky.social

skrub DataOps help you construct complex and extensive hyperparameter search spaces. However, interpreting results from large grids can be challenging.
To address this, skrub generates a parallel coordinate plot that visualizes all runs and the parameters used to achieve specific results.

September 12, 2025 at 12:56 PM

Skrub

@skrub-data.bsky.social

Do you have to deal with numerical features that involve large outliers, and need to train linear models or neural networks?

Then you might want to try the skrub SquashingScaler. The SquashingScaler behaves like scikit-learn RobustScaler, but smoothly clips outliers to predefined boundaries.
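To give a feel for what "smoothly clips" can mean, here is a plain-Python sketch that centers on the median, scales by the interquartile range, and then applies a soft clipping function bounded by a chosen value. The function name `squash`, the zero-range fallback, and the exact formula are assumptions made for illustration; this is not skrub's implementation.

```python
import math
import statistics


def squash(values, max_absolute_value=3.0):
    """Robustly scale values, then smoothly squash them into (-B, B)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # hypothetical fallback to avoid dividing by zero

    squashed = []
    for v in values:
        z = (v - median) / iqr  # robust centering and scaling
        # Soft clip: close to z for small |z|, approaches +/-B for large |z|.
        squashed.append(z / math.sqrt(1.0 + (z / max_absolute_value) ** 2))
    return squashed


data = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
result = squash(data)
print(result)
```

In this sketch the outlier at 1000 ends up just below the bound of 3 instead of dominating the scale, while the ordering of the values is preserved.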

September 5, 2025 at 8:47 AM

Skrub

@skrub-data.bsky.social

Form complex DataOps plans to train and tune machine learning models, then export the plans as learners, standalone objects that can be used on new data.

Tune hyperparameters where they're defined, and explore the resulting space with a parallel coordinate plot.

July 24, 2025 at 3:55 PM

Skrub

@skrub-data.bsky.social

🌟 Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations over multiple tables with machine learning pipelines.

July 24, 2025 at 3:55 PM

Skrub

@skrub-data.bsky.social

⚡ Release 0.6.0 is now out! ⚡

🚀 Major update! Skrub DataOps, various improvements for the TableReport, new tools for applying transformers to the columns, and a new robust transformer for numerical features are only some of the features included in this release.

July 24, 2025 at 3:55 PM

Skrub

@skrub-data.bsky.social

📅 The skrub API includes various functions and objects that help you work with datetime strings. 1/

June 19, 2025 at 12:45 PM

Skrub

@skrub-data.bsky.social

Finally, results can be shown with a parallel coordinate plot to find out the impact of different hyperparameters on the prediction task.

June 4, 2025 at 12:46 PM

Skrub

@skrub-data.bsky.social

👀 This week's post will be another sneak peek into skrub expressions, an upcoming feature that will ease the preparation and execution of machine learning pipelines on dataframes.

This time we will focus on how expressions can simplify the construction of complex hyperparameter grids.

June 4, 2025 at 12:46 PM

Skrub

@skrub-data.bsky.social

👀 This week's post is a sneak peek into the next major Skrub feature, Skrub expressions 🚀

As this is a preview of an upcoming feature, we are looking for your thoughts and feedback before release.

April 30, 2025 at 10:00 AM

Skrub

@skrub-data.bsky.social

The Skrub TableReport is a lightweight tool that lets you get a rich overview of a table quickly and easily.

✅ Filter columns
🔎 Look at each column's distribution
📊 Get a high level view of the distributions through stats and plots, including correlated columns
🌐 Export the report as html

April 23, 2025 at 11:49 AM

Skrub

@skrub-data.bsky.social

And if you're not familiar with what Skrub is all about, you might want to check out our introductory slide deck here:

skrub-data.org/skrub-materi...

April 9, 2025 at 9:08 AM

Skrub

@skrub-data.bsky.social

🚀⚡ Release: 0.5.3

Check out the release notes:
skrub-data.org/stable/CHANG...

Highlights below ⤵️

April 3, 2025 at 4:49 PM

Skrub

@skrub-data.bsky.social

🚀 The Skrub workshop at Campus Cyber in La Défense was a great success! Connecting with professionals from both startups and large companies has given us valuable insights for Skrub's next steps. Stay tuned for more!

January 31, 2025 at 10:38 AM

Skrub

@skrub-data.bsky.social

🎉⚡️Release 0.5.1:
◼ Encode strings faster and better with StringEncoder!

StringEncoder applies a tf-idf vectorization followed by SVD to produce high quality and FAST embeddings of textual and categorical features.

January 28, 2025 at 5:19 PM

Skrub

@skrub-data.bsky.social

There is much more:
skrub.patch_display() adds the TableReport as a default representation for all dataframes

skrub.column_associations to check which columns are linked...

Check out the changelog:
skrub-data.org/stable/CHANG...

November 27, 2024 at 8:46 PM

Skrub

@skrub-data.bsky.social

Improved TableReport:
◼ tighter layout
◼ support for any script (any alphabet حب माया) in the plots
◼ robust to outliers

It works without dependencies, in any html-based environment (Jupyter notebooks, @vscode.dev, a simple web page...)

Check it out on skrub-data.org
4/5

November 27, 2024 at 8:46 PM

Skrub

@skrub-data.bsky.social

Skrub can now easily drop columns with too many missing values.

As always, the TableVectorizer is very handy for preparing dataframes, and it now comes with an option to drop those pesky columns
skrub-data.org/stable/refer...
3/5
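The rule behind dropping such columns can be illustrated without skrub. The plain-Python sketch below removes every column whose fraction of missing values exceeds a threshold; the function name `drop_mostly_missing`, the threshold value, and the use of `None` to mark missing entries are hypothetical choices for this example, not the TableVectorizer's API.

```python
def drop_mostly_missing(table, threshold=0.5):
    """Keep only columns whose fraction of missing values is <= threshold.

    `table` maps column names to lists of values; None marks a missing entry.
    """
    kept = {}
    for name, values in table.items():
        if not values:
            continue  # an empty column carries no information, so drop it
        null_fraction = sum(v is None for v in values) / len(values)
        if null_fraction <= threshold:
            kept[name] = values
    return kept


table = {
    "age": [31, 45, None, 27],            # 25% missing: kept
    "fax_number": [None, None, None, 7],  # 75% missing: dropped
    "city": ["Paris", "Rome", "Oslo", "Bern"],
}
cleaned = drop_mostly_missing(table, threshold=0.5)
print(list(cleaned))
```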

November 27, 2024 at 8:46 PM

Skrub

@skrub-data.bsky.social

Easily combine deep learning (language models on huggingface @hf.co ) for text entries with @scikit-learn.bsky.social gradient-boosted trees, for pipelines that predict well on dataframes of mixed types.

Skrub ensures the language model is downloaded, cached, and picklable: everything for easy ops
2/5

November 27, 2024 at 8:46 PM

Skrub

@skrub-data.bsky.social

🎉⚡️Release 0.4:
◼ Easily use deep learning for text entries
◼ TableVectorizer can remove columns with too many missing values
◼ TableReport more robust and prettier
...

1/5