Rising third-year undergrad at the University of Chicago, working on LLM tool use, evaluation, and hypothesis generation.
Great thanks to my wonderful collaborator Hanchen Li and my advisor @chenhaotan.bsky.social!
Check out the full paper at arxiv.org/abs/2504.07174
This is a sample-efficient method for LLM-as-a-judge, grounded in human judgments, paving the way for personalized evaluators and alignment!
We have released two repositories for HypoEval:
For replicating results/building upon: github.com/ChicagoHAI/H...
For off-the-shelf 0-shot evaluators for summaries and stories🚀: github.com/ChicagoHAI/H...
We push forward LLM-as-a-judge research by showing you can get:
Sample efficiency
Interpretable automated evaluation
Strong human alignment
…without massive fine-tuning.
Dropping hypothesis generation → performance drops ~7%
Combining all hypotheses into one criterion → performance drops ~8% (Better to let LLMs rate one sub-dimension at a time!)
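To make the decomposition concrete, here is a minimal sketch of the per-criterion scoring idea (not the official HypoEval code): the LLM rates the text on each hypothesis separately, and the sub-scores are then aggregated. `call_llm` is a placeholder for whatever chat-completion client you use, and the 1–5 scale and simple averaging are illustrative assumptions.

```python
# Minimal sketch of decomposed scoring: rate the text on each
# hypothesis/rubric separately, then aggregate the sub-scores.
from statistics import mean

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an OpenAI client or a local model)."""
    raise NotImplementedError

def score_with_hypotheses(text: str, hypotheses: list[str]) -> float:
    sub_scores = []
    for hypothesis in hypotheses:
        prompt = (
            "You are evaluating a piece of text on a single criterion.\n"
            f"Criterion: {hypothesis}\n"
            f"Text: {text}\n"
            "Answer with only an integer score from 1 (poor) to 5 (excellent)."
        )
        sub_scores.append(int(call_llm(prompt).strip()))
    # Simple averaging here; the paper's exact aggregation may differ.
    return mean(sub_scores)
```

Scoring one sub-dimension at a time keeps each LLM judgment focused and interpretable, which is exactly what the ~8% ablation drop above suggests.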
✅ Works across out-of-distribution (OOD) tasks
✅ Generated hypotheses transfer across different LLMs (e.g., GPT-4o-mini ↔ LLAMA-3.3-70B)
✅ Reduces sensitivity to prompt variations compared to direct scoring
Across summarization (SummEval, NewsRoom) and story generation (HANNA, WritingPrompt) benchmarks,
we show state-of-the-art correlations with human judgments for both rankings (Spearman correlation) and scores (Pearson correlation)! 📈
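For readers who want to reproduce this kind of comparison on their own data, here is a toy sketch of the two correlation measures using scipy; the score arrays are made-up illustrative values, not numbers from the paper.

```python
# Quick check of how evaluator scores track human judgments,
# using the same correlation measures mentioned above.
from scipy.stats import spearmanr, pearsonr

human_scores = [3.0, 4.5, 2.0, 5.0, 3.5]   # toy human ratings
model_scores = [2.8, 4.2, 2.5, 4.9, 3.3]   # toy evaluator ratings

rho, _ = spearmanr(human_scores, model_scores)   # ranking agreement
r, _ = pearsonr(human_scores, model_scores)      # score agreement
print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")
```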
By combining small-scale human data + literature + non-binary checklists, HypoEval:
🔹 Outperforms G-Eval by ~12%
🔹 Beats fine-tuned models trained on 3x more human labels
🔹 Adds interpretable evaluation
Building upon SOTA hypothesis generation methods, we generate hypotheses — decomposed rubrics (similar to checklists, but more systematic and explainable) — from existing literature and just 30 human annotations (scores) of texts.
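A highly simplified sketch of this hypothesis-generation step is below. The paper builds on dedicated hypothesis-generation methods, so this is only the core idea: prompt an LLM with literature-derived guidance plus a small set of human-scored examples and ask it to propose decomposed rubrics. `call_llm`, the prompt wording, and the default aspect are all illustrative assumptions.

```python
# Sketch: propose decomposed rubrics from literature notes + ~30 scored examples.
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError

def generate_hypotheses(literature_notes: str,
                        labeled_examples: list[tuple[str, float]],
                        aspect: str = "coherence",
                        n_hypotheses: int = 5) -> list[str]:
    examples_block = "\n".join(
        f"Text: {text}\nHuman {aspect} score: {score}"
        for text, score in labeled_examples  # e.g., ~30 annotated texts
    )
    prompt = (
        f"Based on the literature notes below and the scored examples, propose "
        f"{n_hypotheses} distinct, checkable criteria (rubrics) that explain "
        f"what drives the human {aspect} scores.\n\n"
        f"Literature notes:\n{literature_notes}\n\n"
        f"Examples:\n{examples_block}\n\n"
        "Return one criterion per line."
    )
    # One rubric per non-empty line of the LLM's response.
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
```

The resulting rubrics are plain text, which is why they can be reused as-is with a different judge model.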
Most LLM-as-a-judge studies either:
❌ Achieve lower alignment with humans
⚙️ Require extensive fine-tuning → expensive data and compute
❓ Lack interpretability