aakriti1kumar.bsky.social
So why the gap between experts/LLMs and crowds?

Crowdworkers often
- have limited attention
- rely on heuristics like “it’s the thought that counts”
- focus on intentions rather than actual wording
- show systematic rating inflation due to social desirability bias
June 17, 2025 at 3:14 PM
And when experts disagree, LLMs struggle to find a consistent signal too.

Here’s how expert agreement (Krippendorff's alpha) varied across empathy sub-components:
June 17, 2025 at 3:14 PM
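For readers unfamiliar with the agreement statistic named in the post above, here is a minimal sketch of Krippendorff's alpha, computed as 1 minus the ratio of observed to expected disagreement. It assumes an interval (squared-difference) distance and ratings stored one list per conversation; the thread does not say which distance metric or data layout the paper used, so the function name, the `distance` parameter, and the toy ratings are all illustrative assumptions.

```python
from itertools import permutations


def krippendorff_alpha(units, distance=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha for a list of units (one list of ratings per unit).

    Missing ratings are given as None. The default distance is the
    interval metric (squared difference); this is an assumption, not
    necessarily the metric used in the paper.
    """
    # Drop missing ratings, then keep only units rated at least twice,
    # since a single rating cannot be paired with anything.
    cleaned = [[v for v in unit if v is not None] for unit in units]
    pairable = [unit for unit in cleaned if len(unit) >= 2]

    values = [v for unit in pairable for v in unit]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one unit with two or more ratings")

    # Observed disagreement: average distance between ratings of the same unit.
    d_observed = sum(
        sum(distance(a, b) for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in pairable
    ) / n

    # Expected disagreement: average distance between all pairs of ratings,
    # ignoring which unit they came from.
    d_expected = sum(
        distance(a, b) for a, b in permutations(values, 2)
    ) / (n * (n - 1))

    if d_expected == 0:
        # Every rating is identical, so alpha is undefined.
        return float("nan")
    return 1.0 - d_observed / d_expected


# Hypothetical toy data: 4 conversations, up to 3 expert ratings each.
ratings = [
    [3, 3, 4],
    [1, 2, None],
    [5, 5, 5],
    [2, None, None],  # only one rating, so this unit is dropped
]
print(krippendorff_alpha(ratings))
```

An alpha of 1 means perfect agreement, values near 0 mean agreement no better than chance, and negative values mean systematic disagreement.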
We analyzed thousands of annotations from LLMs, crowdworkers, and experts on 200 real-world conversations

And specifically looked at 21 sub-components of empathic communication from 4 evaluative frameworks

The result? LLMs consistently matched expert judgments better than crowdworkers did! 🔥
June 17, 2025 at 3:14 PM
How do we reliably judge if AI companions are performing well on subjective, context-dependent, and deeply human tasks? 🤖

Excited to share the first paper from my postdoc (!!) investigating when LLMs are reliable judges - with empathic communication as a case study 🧐

🧵👇
June 17, 2025 at 3:14 PM