David Holzmüller
@dholzmueller.bsky.social
Postdoc in machine learning with Francis Bach &
@GaelVaroquaux: neural networks, tabular data, uncertainty, active learning, atomistic ML, learning theory.
https://dholzmueller.github.io
Pinned
Can deep learning finally compete with boosted trees on tabular data? 🌲
In our NeurIPS 2024 paper, we introduce RealMLP, a NN with improvements in all areas and meta-learned default parameters.
Some insights about RealMLP and other models on large benchmarks (>200 datasets): 🧵
Reposted by David Holzmüller
Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin: xRFM: Accurate, scalable, and interpretable feature learning models for tabular data https://arxiv.org/abs/2508.10053 https://arxiv.org/pdf/2508.10053 https://arxiv.org/html/2508.10053
August 15, 2025 at 6:32 AM
I got 3rd out of 691 in a tabular Kaggle competition – with only neural networks! 🥉

My solution is short (48 LOC) and relatively general-purpose – I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below👇
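A minimal sketch of this kind of pipeline (not the actual 48-line solution; the pytabkit class names, file name, and target column below are assumptions):

```python
# Minimal sketch of the kind of pipeline described above, NOT the actual solution.
# Assumptions: pytabkit exposes scikit-learn-style estimators under these names,
# and the data lives in a "train.csv" with a "target" column (both hypothetical).
import pandas as pd
from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer          # turns string and date columns into numeric features
from pytabkit import RealMLP_TD_Regressor, TabM_D_Regressor  # assumed class names

X = pd.read_csv("train.csv")               # hypothetical training file
y = X.pop("target")                        # hypothetical target column

model = make_pipeline(
    TableVectorizer(),                     # skrub preprocessing for strings/dates
    VotingRegressor([
        ("realmlp", RealMLP_TD_Regressor()),  # RealMLP with meta-learned defaults
        ("tabm", TabM_D_Regressor()),         # TabM with default parameters
    ]),
)
model.fit(X, y)
predictions = model.predict(X)
```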
July 29, 2025 at 11:10 AM
Excited to have co-contributed the SquashingScaler, which implements the robust numerical preprocessing from RealMLP!
⚡ Release 0.6.0 is now out! ⚡

🚀 Major update! Skrub DataOps, various improvements for the TableReport, new tools for applying transformers to the columns, and a new robust transformer for numerical features are only some of the features included in this release.
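For context, a minimal sketch of what using that robust numerical transformer could look like, assuming it is exposed as skrub.SquashingScaler with the usual scikit-learn interface:

```python
# Minimal sketch, assuming skrub >= 0.6.0 exposes the robust transformer as SquashingScaler
# with a standard scikit-learn fit/transform interface.
import numpy as np
from skrub import SquashingScaler

X = np.array([[1.0], [2.0], [3.0], [1000.0]])   # last row is an extreme outlier

scaler = SquashingScaler()                      # robust scaling + soft clipping of extreme values
X_scaled = scaler.fit_transform(X)
print(X_scaled)                                 # the outlier ends up in a bounded range
```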
July 24, 2025 at 4:00 PM
Reposted by David Holzmüller
👨‍🎓🧾✨#icml2025 Paper: TabICL, A Tabular Foundation Model for In-Context Learning on Large Data
With Jingang Qu, @dholzmueller.bsky.social, and Marine Le Morvan

TL;DR: a well-designed architecture and pretraining give the best tabular learner, and a more scalable one.
On top of that, it's 100% open source.
1/9
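A usage sketch, under the assumption that the open-source release ships a scikit-learn-style classifier named TabICLClassifier (package and class name are assumptions):

```python
# Minimal sketch; assumes the open-source release provides `tabicl.TabICLClassifier`
# with a scikit-learn interface (package and class name are assumptions).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabicl import TabICLClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabICLClassifier()         # pretrained tabular foundation model
clf.fit(X_train, y_train)        # no gradient training: the data is used for in-context learning
print(clf.score(X_test, y_test))
```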
July 9, 2025 at 6:42 PM
Reposted by David Holzmüller
🚨What is SOTA on tabular data, really? We are excited to announce 𝗧𝗮𝗯𝗔𝗿𝗲𝗻𝗮, a living benchmark for machine learning on IID tabular data with:

📊 an online leaderboard (submit!)
📑 carefully curated datasets
📈 strong tree-based, deep learning, and foundation models

🧵
June 23, 2025 at 10:15 AM
Reposted by David Holzmüller
Missed the school? We have uploaded recordings of most talks to our YouTube Channel www.youtube.com/@AutoML_org 🙌
AutoML Freiburg Hannover Tübingen
This channel features videos about automated machine learning (AutoML) from the AutoML groups at the University of Freiburg, Leibniz University Hannover and University of Tübingen. Common topics inclu...
www.youtube.com
June 20, 2025 at 3:36 PM
Reposted by David Holzmüller
📝 The skrub TextEncoder brings the power of HuggingFace language models to embed text features in tabular machine learning, for all those use cases that involve text-based columns.
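A minimal sketch of embedding a text column with the TextEncoder (using whatever default HuggingFace model skrub ships; treating it as a single-column transformer):

```python
# Minimal sketch of embedding a text column with skrub's TextEncoder
# (applied to a single column; the default underlying HuggingFace model is used).
import pandas as pd
from skrub import TextEncoder

reviews = pd.Series([
    "Great product, would buy again.",
    "Arrived broken and late.",
], name="review")

encoder = TextEncoder()                      # wraps a sentence-embedding language model
embeddings = encoder.fit_transform(reviews)  # one dense embedding vector per row
print(embeddings.shape)
```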
May 28, 2025 at 8:43 AM
🚨ICLR poster in 1.5 hours, presented by @danielmusekamp.bsky.social :
Can active learning help to generate better datasets for neural PDE solvers?
We introduce a new benchmark to find out!
Featuring 6 PDEs, 6 AL methods, 3 architectures and many ablations - transferability, speed, etc.!
April 24, 2025 at 12:38 AM
Reposted by David Holzmüller
The Skrub TableReport is a lightweight tool that lets you get a rich overview of a table quickly and easily.

✅ Filter columns
🔎 Look at each column's distribution
📊 Get a high-level view of the distributions through stats and plots, including correlated columns
🌐 Export the report as html
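A minimal usage sketch (assuming the open()/html() methods behave as described in the bullets above):

```python
# Minimal sketch of generating and exporting a TableReport.
import pandas as pd
from skrub import TableReport

df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Paris", "Lyon", "Paris"],
})

report = TableReport(df)   # per-column stats, distributions, and associations
report.open()              # open the interactive report in a browser (or display `report` in a notebook)
html = report.html()       # standalone HTML string that can be saved and shared
```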
April 23, 2025 at 11:49 AM
Reposted by David Holzmüller
#ICLR2025 Marine Le Morvan presents "Imputation for prediction: beware of diminishing returns": poster Thu 24th
arxiv.org/abs/2407.19804

Concludes 6 years of research on prediction with missing values: Imputation is useful but improvements are expensive, while better learners yield easier gains.
April 23, 2025 at 7:55 AM
Reposted by David Holzmüller
Excited to share the new monthly Table Representation Learning (TRL) Seminar under the ELLIS Amsterdam TRL research theme! It will recur every 2nd Friday.

Who: Marine Le Morvan, Inria (in-person)
When: Friday 11 April 4-5pm (+drinks)
Where: L3.36 Lab42 Science Park / Zoom

trl-lab.github.io/trl-seminar/
April 2, 2025 at 9:42 AM
Reposted by David Holzmüller
We are excited to announce #FMSD: "1st Workshop on Foundation Models for Structured Data" has been accepted to #ICML 2025!
Call for Papers: icml-structured-fm-workshop.github.io
March 25, 2025 at 5:59 PM
Reposted by David Holzmüller
Trying something new:
A 🧵 on a topic I find many students struggle with: "why do their 📊 look more professional than my 📊?"

It's *lots* of tiny decisions that aren't the defaults in many libraries, so let's break down 1 simple graph by @jburnmurdoch.bsky.social

🔗 www.ft.com/content/73a1...
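As one illustration of such non-default choices (a generic matplotlib sketch, not taken from the linked thread): drop chart junk, use a subtle grid, and label lines directly instead of relying on a legend.

```python
# Generic matplotlib sketch of a few non-default styling choices (not from the linked thread).
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(2000, 2021)
a = np.linspace(10, 30, years.size)
b = np.linspace(12, 22, years.size)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(years, a, color="#1f77b4")
ax.plot(years, b, color="#ff7f0e")

ax.spines[["top", "right"]].set_visible(False)      # remove unneeded box lines
ax.grid(axis="y", color="0.9")                      # subtle horizontal gridlines only
ax.set_title("Widget sales per year", loc="left")   # left-aligned, headline-style title
ax.text(years[-1] + 0.3, a[-1], "Product A", va="center", color="#1f77b4")  # direct labels
ax.text(years[-1] + 0.3, b[-1], "Product B", va="center", color="#ff7f0e")  # instead of a legend
plt.show()
```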
November 20, 2024 at 5:09 PM
Reposted by David Holzmüller
🚀Continuing the spotlight series with the next @iclr-conf.bsky.social MLMP 2025 Oral presentation!
📝LOGLO-FNO: Efficient Learning of Local and Global Features in Fourier Neural Operators
📷 Join us on April 27 at #ICLR2025!
#AI #ML #ICLR #AI4Science
March 18, 2025 at 8:12 AM
Practitioners are often sceptical of academic tabular benchmarks, so I am elated to see that our RealMLP model outperformed boosted trees in two 2nd place Kaggle solutions, for a $10,000 forecasting challenge and a research competition on survival analysis.
Rohlik Sales Forecasting Challenge
Use historical product sales data to predict future sales.
www.kaggle.com
March 10, 2025 at 3:53 PM
A new tabular classification benchmark provides another independent evaluation of our RealMLP. RealMLP is the best classical DL model, although some other recent baselines are missing; TabPFN is still better on small datasets, and boosted trees on larger ones.
March 4, 2025 at 2:25 PM
Reposted by David Holzmüller
Learning rate schedules seem mysterious? Why is the loss going down so fast during cooldown?
Turns out that this behaviour can be described with a bound from *convex, nonsmooth* optimization.

A short thread on our latest paper 🚞

arxiv.org/abs/2501.18965
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedul...
arxiv.org
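For reference, a sketch of the constant-plus-linear-cooldown schedule the post refers to (illustrative only, not code from the paper):

```python
# Illustrative sketch of a constant learning-rate schedule with a linear cooldown phase
# (the setting the post refers to; not code from the paper).
def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-3, cooldown_frac: float = 0.2) -> float:
    """Constant learning rate, then linear decay to 0 over the last `cooldown_frac` of training."""
    cooldown_start = int((1 - cooldown_frac) * total_steps)
    if step < cooldown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - cooldown_start)

schedule = [lr_at_step(t, total_steps=100) for t in range(100)]
```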
February 5, 2025 at 10:13 AM
Reposted by David Holzmüller
Early stopping on validation loss? This leads to suboptimal calibration and refinement errors—but you can do better!
With @dholzmueller.bsky.social, Michael I. Jordan, and @bachfrancis.bsky.social, we propose a method that integrates with any model and boosts classification performance across tasks.
February 3, 2025 at 1:03 PM
The first independent evaluation of our RealMLP is here!

On a recent 300-dataset benchmark with many baselines, RealMLP takes a shared first place overall. 🔥

Importantly, RealMLP is also relatively CPU-friendly, unlike other SOTA DL models (including TabPFNv2 and TabM). 🧵 1/
January 16, 2025 at 12:05 PM
Reposted by David Holzmüller
Join us on 27 Feb in Amsterdam for the ELLIS workshop on Representation Learning and Generative Models for Structured Data ✨

sites.google.com/view/rl-and-...

Inspiring talks by @eisenjulian.bsky.social, @neuralnoise.com, Frank Hutter, Vaishali Pal, TBC.

We welcome extended abstracts until 31 Jan!
January 7, 2025 at 6:46 PM
Reposted by David Holzmüller
My book is (at last) out, just in time for Christmas!
A blog post to celebrate and present it: francisbach.com/my-book-is-o...
December 21, 2024 at 3:23 PM
I'll present our paper in the afternoon poster session from 4:30 to 7:30 pm in East Exhibit Hall A-C, poster 3304!
Can deep learning finally compete with boosted trees on tabular data? 🌲
In our NeurIPS 2024 paper, we introduce RealMLP, a NN with improvements in all areas and meta-learned default parameters.
Some insights about RealMLP and other models on large benchmarks (>200 datasets): 🧵
December 12, 2024 at 6:54 PM
We wrote a benchmark paper with many practical insights on (the benefits of) active learning for training neural PDE solvers. 🚀

I was happy to be a co-advisor on this project - most of the credit goes to Daniel and Marimuthu.
Neural surrogates can accelerate PDE solving but need expensive ground-truth training data. Can we reduce the training data size with active learning (AL)? In our NeurIPS D3S3 poster, we introduce AL4PDE, an extensible AL benchmark for autoregressive neural PDE solvers. 🧵
December 11, 2024 at 6:50 PM
Reposted by David Holzmüller
Stable model scaling with width-independent dynamics?

Thrilled to present 2 papers at #NeurIPS 🎉 that study width-scaling in Sharpness Aware Minimization (SAM) (Th 16:30, #2104) and in Mamba (Fr 11, #7110). Our scaling rules stabilize training and transfer optimal hyperparams across scales.

🧵 1/10
December 10, 2024 at 7:08 AM