Matt Groh
@mattgroh.bsky.social
Assistant professor at Northwestern Kellogg | human AI collaboration | computational social science | affective computing
Pinned
When are LLMs-as-judge reliable?

That's a big question for frontier labs and it's a big question for computational social science.

Excited to share our findings (led by @aakriti1kumar.bsky.social!) on how to address this question for any subjective task, and specifically for empathic communication.
How do we reliably judge if AI companions are performing well on subjective, context-dependent, and deeply human tasks? 🤖

Excited to share the first paper from my postdoc (!!) investigating when LLMs are reliable judges - with empathic communication as a case study 🧐

🧵👇
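One simple way to operationalize "when are LLMs reliable judges" is to check chance-corrected agreement between an LLM judge and expert raters. Here's a minimal pure-Python sketch using Cohen's kappa on hypothetical 1-5 empathy ratings; the data and the `cohens_kappa` helper are illustrative, not the paper's actual method or results.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters (pure-Python sketch)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items rated identically
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent marginal rating distributions
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical 1-5 empathy ratings: expert consensus vs. an LLM judge
experts = [3, 4, 2, 5, 4, 1, 3, 5, 2, 4]
llm     = [3, 4, 2, 5, 3, 1, 3, 5, 2, 4]
print(round(cohens_kappa(experts, llm), 2))  # → 0.87
```

High kappa on a held-out expert-annotated set is one signal the judge is trustworthy for that task; low kappa (or low expert-to-expert kappa in the first place) is a warning sign.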
Happening today!
People handle the everyday physical world with remarkable ease, but how do we do it? This Wednesday, NICO is thrilled to host @tomerullman.bsky.social, to discuss: "Good Enough: Approximations in Mental Simulation and Intuitive Physics".

🗓️ Wed 10/8 at 12pm US Central
🔗 bit.ly/WedatNICO
On my way to @ic2s2.bsky.social in Norrköping!! Super excited to share this year’s projects in the HAIC lab revealing how (M)LLMs can offer insights into human behavior & cognition

More at human-ai-collaboration-lab.kellogg.northwestern.edu/ic2s2

See you there!

#IC2S2
Thanks! I imagine we'd see similar results in the Novelty Challenge: when experts are reliable, we can fine-tune LLMs to be reliable, but experts may only be reliable in some disciplines/settings and less reliable in others.

Very cool challenge!!
Thank you for sharing your brilliance, quirks, and wisdom. I started reading your work after coming across your Aeon article on Awe many years ago, and I feel inspired every time I read what you write.
This taxonomy offers a shared language (and see our how to guide on arXiv for many examples) to help people better communicate what looks or feels off.

It's also a framework that can generalize to multimedia.

Consider this: what do you notice at the 16s mark about her legs?
Based on generating thousands of images, reading the AI-generated-image and digital forensics literatures (along with social media and journalistic commentary), and analyzing 30k+ participant comments, we propose a taxonomy for characterizing diffusion model artifacts in images.
Scene complexity, artifact types, display time, and human curation of AI-generated images all play significant roles in how accurately people distinguish real and AI-generated images.
We examine photorealism in generative AI by measuring people's accuracy at distinguishing 450 AI-generated and 150 real images

Photorealism varies from image to image and person to person

83% of AI-generated images are identified as AI at rates better than random chance would predict
💡New paper at #CHI2025 💡

Large-scale experiment with 750k observations addressing

(1) How photorealistic are today's AI-generated images?

(2) What features of images influence people's ability to distinguish real/fake?

(3) How should we categorize artifacts?
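The "better than random chance" claim in the thread above implies a per-image statistical test. Here's a minimal sketch with a pure-Python one-sided binomial test on synthetic per-image tallies; the numbers and the `p_above_chance` helper are hypothetical, not the paper's data or analysis.

```python
from math import comb

def p_above_chance(hits, n, p=0.5):
    """One-sided binomial p-value: probability of seeing >= hits correct
    'AI-generated' calls out of n viewers under chance guessing (rate p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(hits, n + 1))

# Hypothetical per-image tallies: (# correct "AI" calls, # viewers)
images = [(70, 100), (55, 100), (90, 100), (52, 100)]
flags = [p_above_chance(h, n) < 0.05 for h, n in images]
print(sum(flags) / len(flags))  # → 0.5 (share of images detected above chance)
```

Repeating this test per image gives the share of AI-generated images that people detect above chance, which is the shape of the 83% figure in the thread.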
At a high level, it depends on:

- human expertise
- human understanding of what the AI system is capable of
- quality of AI explanations
- task-specific potential for cognitive biases and satisficing constraints to influence humans
- instance-specific potential for OOD data to influence AI
📣 📣 Postdoc Opportunity at Northwestern

Dashun Wang and I are seeking a creative, technical, interdisciplinary researcher for a joint postdoc fellowship between our labs.

If you're passionate about Human-AI Collaboration and Science of Science, this may be for you! 🚀

Please share widely!
You're welcome!! Def makes makers who move between both worlds feel very seen
Impressive on the 20 minute bits approach!

I definitely need 4 hour windows for productive, creative work.

Paul Graham's essay on the Maker/Manager schedule (paulgraham.com/makersschedu...) offers some tips for how to create schedules that address roles where one is both a Maker and Manager
Maker's Schedule, Manager's Schedule
paulgraham.com
V2 of the Human and Machine Intelligence 😊🤖🧠 is in the books!

So many fantastic discussions as we witnessed the frontier of AI shift even further into hyperdrive✨

Props to students for all the hard work and big thanks to teaching assistants and guest speakers 🙏
and present evidence that perception is more than simply transforming light into representations of objects and their features; perception also automatically extracts relations between objects!
What is perception? What do we really see when we look at the world?

And, why does the amodal completion illusion lead us to see a super long reindeer in the image on the right?

This week @chazfirestone.bsky.social joined the NU CogSci seminar series to address these fundamental questions
2024 marks the official launch of the Human-AI Collaboration Lab, so I wrote a one-page letter to introduce the lab, share highlights, and begin a lab tradition: an easy-to-digest annual letter reflecting on the year and sharing what we're working on with friends and colleagues.