Madelon Hulsebos
madelonhulsebos.bsky.social
Faculty at CWI & ELLIS Amsterdam https://trl-lab.github.io. Research on AI and tabular data to democratize insights from structured data. Prev at UC Berkeley and the University of Amsterdam.

https://www.madelonhulsebos.com
We consider NL tabular query ambiguity through the lens of cooperative interaction. Read more about the insights and implications in the paper led by Daniel, which will be presented at the EurIPS workshop on AI for Tabular Data! arxiv.org/pdf/2511.04584

Really enjoyed this reflection exercise 🙂
November 17, 2025 at 3:15 PM
For a long time, we reviewed and debated properties of queries in tabular QA datasets and practice, particularly in an “open-domain” context. Expecting “platonic queries” that map to necessary and sufficient data items and a single exact operational procedure for addressing the query intent is unrealistic.
November 17, 2025 at 3:15 PM
This exciting workshop is organized by @effylix.bsky.social, @lennartpurucker.bsky.social, Peter Baile Chen, Frank Hutter and me.

With talks by Marine Le Morvan (Inria), Floris Geerts (University of Antwerp) and @akhtarmubashara.bsky.social (ETH Zurich), and more TBA.

We hope to meet you there!!
September 23, 2025 at 10:44 AM
Also hope attendees will enjoy the tutorial program, which I helped compile this year, with exciting tutorials on topics such as vector search, data discovery, graph databases, and AI and relational data! Full list at vldb.org/2025/?papers...
September 1, 2025 at 11:05 AM
We show that metrics like SacreBLEU and BERTScore aren't fit for tabular QA evaluation of LLMs, as their scores are inseparable. We also find a large gap between multiple-choice evaluation, as in TQA-Bench, and an LLM judge, which aligns with human annotation.

Bottom line: LLMs aren't robust for real-world multi-table QA
July 30, 2025 at 10:24 PM
In our contribution led by @cowolff.bsky.social from the TRL Lab (trl-lab.github.io), we wondered: how well do LLMs reason over tabular data, really? We find that LLMs neither acknowledge nor handle tables with disturbances (e.g., missing values or duplicates), necessitating explicit prompting or cleaning pipelines.
July 30, 2025 at 10:24 PM
The full program of the workshop is at: table-representation-learning.github.io/ACL2025/. Besides excellent contributed work (proceedings: aclanthology.org/2025.trl-1.0...), we'll have invited talks by Dan Roth, Tao Yu, Edward Choi and Julian Eisenschlos!
July 30, 2025 at 10:24 PM
Reposted by Madelon Hulsebos
The paper's called:
"How well do LLMs reason over tabular data, really?" 📊

We dig into two important questions:
1️⃣ Are general-purpose LLMs robust with real-world tables?
2️⃣ How should we actually evaluate them? (2/4)
July 25, 2025 at 3:06 PM
Very interesting! I’ve seen synthetic data with certain patterns (e.g., TabPFN with SCMs) come really far, but not entirely random compositions of tokens. Do you understand, or have a hypothesis for, what leads to this observation?
June 26, 2025 at 11:17 AM
Absolutely! Please send me an email then we can arrange a chat :)
June 9, 2025 at 6:23 AM