Skrub
@skrub-data.bsky.social
skrub is a Python library to ease preprocessing and feature engineering for tabular machine learning.
Our long-term goal is to directly connect database tables to machine learning estimators.

https://skrub-data.org
https://discord.gg/ABaPnm7fDC
For even more control over column selection, skrub provides a collection of selectors that let you partition dataframes by data type, column name, or user-specified functions.
October 8, 2025 at 12:43 PM
All these transformers can be chained and inserted in a scikit-learn pipeline to build a feature matrix with complex column selection operations, and can be seen as an alternative to the scikit-learn ColumnTransformer.
October 8, 2025 at 12:43 PM
ApplyToFrame selects columns in the same way, but then uses all of them at the same time as input to the transformer: this is useful for dimensionality reduction.
SelectCols and DropCols can be used as "filtering blocks" in a pipeline.
October 8, 2025 at 12:43 PM
Skrub learning materials – Skrub
skrub-data.org
October 7, 2025 at 2:36 PM
Thanks to @riccardocappuzzo.com, @glemaitre58.bsky.social and Jérôme Dockès for preparing the talk and mentoring at the sprint!
October 7, 2025 at 2:36 PM
The sprint was also a big hit, with both new and old contributors working on issues and getting to know the repository.

And to cap it all off, thanks to P16 we have stickers now 🚀
October 7, 2025 at 2:36 PM
Reposted by Skrub
What a banger is skrub @skrub-data.bsky.social !

Big thumbs up for the sklearn team & the maintainer of this package
October 1, 2025 at 8:24 AM
🛠️ Main bugfixes
- Fixed the display of DataOp objects in Google Colab cell outputs.
- Fixed the range from which choose_float and choose_int sample values when log=False and n_steps is None.
- The SkrubLearner used to run a prediction on the training set during fit(); this has been fixed.
September 26, 2025 at 8:48 AM
👀 Changes and deprecations
- Ken embeddings are now deprecated.
- The accepted values for the parameter how of .skb.apply() have changed. The new values are "auto", "cols", "frame", and "no_wrap".
- The parameter splitter of .skb.train_test_split() has been renamed split_func.
September 26, 2025 at 8:48 AM
🚀 New features
- The DataOp.skb.full_report() now displays the time each node took to evaluate.
- The User guide has been reworked and expanded.
September 26, 2025 at 8:48 AM
Here's another example of how to tune ML models with skrub DataOps: skrub-data.org/stable/auto_...
Hyperparameter tuning with DataOps
A machine-learning pipeline typically contains some values or choices which may influence its prediction performance, such as hyperparameters (e.g. the regularization parameter alpha of a RidgeClas...
skrub-data.org
September 12, 2025 at 12:56 PM
The plot in the video was created for our EuroSciPy 2025 tutorial on forecasting time series: skrub-data.org/EuroSciPy202...
Skrub DataOps applied to forecasting timeseries
skrub-data.org
September 12, 2025 at 12:56 PM
The plot is interactive: you can select a range of results, and it will highlight only the runs within that range, enabling you to refine your search further. It also tracks fit and score times, so you can identify which parameters most impact runtime.
September 12, 2025 at 12:56 PM