Hanna Wallach
@hannawallach.bsky.social

VP and Distinguished Scientist at Microsoft Research NYC. AI evaluation and measurement, responsible AI, computational social science, machine learning. She/her.

One photo a day since January 2018: https://www.instagram.com/logisticaggression/

Hanna Megan Wallach is a computational social scientist and partner research manager at Microsoft Research. Her work makes use of machine learning models to study the dynamics of social processes. Her current research focuses on issues of fairness, accountability, transparency, and ethics as they relate to AI and machine learning.


Alright, it's that time of year: Who all is going to @neuripsconf.bsky.social this year??? #NeurIPS2025 🤖☃️

And I'll be at @wimlworkshop.bsky.social and @neuripsconf.bsky.social Tue, Wed, and Thu next week if folks want to chat.

For all three positions, we welcome applicants with backgrounds in technical fields (e.g., ML, AI, NLP, statistics) and sociotechnical fields (e.g., human-computer interaction, information science, law, media studies, philosophy, science and technology studies, sociology).

And you can read more about STAC here: microsoft.com/en-us/resear...

You can read more about FATE here: microsoft.com/en-us/resear...

Reposted by David Rothschild

Three exciting opportunities at @msftresearch.bsky.social in NYC!!! 🎉

Internship w/ FATE: apply.careers.microsoft.com/careers/job?...

Internship w/ STAC on AI evaluation and measurement: apply.careers.microsoft.com/careers/job?...

Postdoc w/ FATE: apply.careers.microsoft.com/careers/job?...

FATE internships: apply.careers.microsoft.com/careers/job?...

FATE postdocs: apply.careers.microsoft.com/careers/job?...

And internships with our close collaborators at STAC: apply.careers.microsoft.com/careers/job?...

This is happening now!!!

If you're at @icmlconf.bsky.social this week, come check out our poster on "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" presented by the amazing @afedercooper.bsky.social from 11:30am--1:30pm PDT on Weds!!! icml.cc/virtual/2025...
ICML Poster: "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" (ICML 2025)
icml.cc

Reposted by Hanna Wallach

1) (Tomorrow!) Wed 7/16, 11am-1:30pm PT poster for "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" (E. Exhibition Hall A-B, E-503)

Work led by @hannawallach.bsky.social + @azjacobs.bsky.social

arxiv.org/abs/2502.00561
Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge
The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges com...
arxiv.org

Oh whoops! You are indeed correct -- it starts at 11am PT!


I also want to note that this paper has been in progress for many, many years, so we're super excited it's finally being published. It's also one of the most genuinely interdisciplinary projects I've ever worked on, which has made it particularly challenging and rewarding!!! ❤️

Check out the camera-ready version of our ACL Findings paper ("Taxonomizing Representational Harms using Speech Act Theory") to learn more!!! arxiv.org/pdf/2504.00928

Why does this matter? You can't mitigate what you can't measure, and our framework and taxonomy help researchers and practitioners design better ways to measure and mitigate representational harms caused by generative language systems.

Using this theoretical grounding, we provide new definitions for stereotyping, demeaning, and erasure, and break them down into a detailed taxonomy of system behaviors. By doing this, we unify many of the different ways representational harms have been previously defined.

We bring some much-needed clarity by turning to speech act theory—a theory of meaning from linguistics that allows us to distinguish between a system output’s purpose and its real-world impacts.

These are often called “representational harms,” and while they’re easy for people to recognize when they see them, definitions of these harms are commonly under-specified, leading to conceptual confusion. This makes them hard to measure and even harder to mitigate.

Generative language systems are everywhere, and many of them stereotype, demean, or erase particular social groups.

Reposted by Joanna Bryson

Check out the camera-ready version of our ICML position paper ("Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge") to learn more!!! arxiv.org/abs/2502.00561

(6/6)

Real talk: GenAI systems aren't toys. Bad evaluations don't just waste people's time---they can cause real-world harms. It's time to level up, ditch the apples-to-oranges comparisons, and start doing measurement like we mean it.

(5/6)

We propose a framework that cuts through the chaos: first, get crystal clear on what you're measuring and why (no more vague hand-waving); then, figure out how to measure it; and, throughout the process, interrogate validity like your reputation depends on it---because, honestly, it should.

(4/6)

Here's our hot take: evaluating GenAI systems isn't just some techie puzzle---it's a social science measurement challenge.

(3/6)

But there's a dirty little secret: the ways we evaluate GenAI systems are often sloppy, vague, and quite frankly... not up to the task.

(2/6)

Alright, people, let's be honest: GenAI systems are everywhere, and figuring out whether they're any good is a total mess. Should we use them? Where? How? Do they need a total overhaul?

(1/6)

I'm so excited this paper is finally online!!! 🎉 We had so much fun working on this with @emmharv.bsky.social!!! Thread below summarizing our contributions...

📣 "Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems" is forthcoming at #ACL2025NLP - and you can read it now on arXiv!

🔗: arxiv.org/pdf/2506.04482
🧵: ⬇️

Please spread the word to anyone you think might be interested! We will begin reviewing applications on June 2.