📢 Accepted to #ACL2025 Main Conference! See you in Vienna.
Work done by @1e0sun.bsky.social, Chengzhi Mao, @valentinhofmann.bsky.social, Xuechunzi Bai.
Paper: arxiv.org/abs/2506.00253
Project page: slhleosun.github.io/aligned_but_...
Code & Data: github.com/slhleosun/al...
We call this failure mode "blindness": alignment makes certain concepts less salient to the model. This may reflect a broader class of alignment issues.
Similar methods can be extended to other forms of social bias or to study how models resolve polysemy under ambiguity.
This challenges a common belief:
unlearning ≠ debiasing
When debiasing strategies suppress sensitive concepts, they can unintentionally reduce a model’s ability to detect bias.
🧠 Instead, we may achieve deeper alignment with strategies that make models aware of sensitive concepts rather than blind to them.
Inspired by these results, we tested the opposite of “machine unlearning” for debiasing.
What if we reinforced race concepts in models?
- Injecting race-laden activations cut implicit bias by 54.9%.
- LoRA fine-tuning brought it down from 97.3% → 42.4%.
Bonus: also lowered explicit bias.
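For a concrete picture of what injecting race-laden activations can look like, here is a minimal sketch of the general activation-steering recipe. It is not the paper's implementation; the model name, layer index, contrast prompts, and steering strength are all illustrative assumptions.

```python
# Minimal sketch of steering a model toward the "race" reading of black/white
# by adding a direction to the residual stream. NOT the paper's exact code:
# model name, layer index, contrast prompts, and scale are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model
LAYER = 15                                     # assumed layer to steer
ALPHA = 4.0                                    # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def last_token_residual(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final prompt token at LAYER."""
    cache = {}
    def save(_, __, out):
        hidden = out[0] if isinstance(out, tuple) else out
        cache["h"] = hidden[:, -1, :].detach().clone()
    handle = model.model.layers[LAYER].register_forward_hook(save)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

# Crude steering direction: "race-laden" context minus "color-only" context.
race_vec = (last_token_residual("a conversation about Black and White Americans")
            - last_token_residual("a photo printed in black and white"))

def steer(_, __, out):
    hidden = out[0] if isinstance(out, tuple) else out
    hidden[:, -1, :] += ALPHA * race_vec.to(hidden.dtype)  # in-place injection

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = "Pair each word with black or white: joy, agony, love, failure."
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```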
We mechanistically tested this using activation patching and embedding interpretation.
In ambiguous contexts, aligned models were 52.2% less likely than unaligned models to represent "black" as referring to race.
🧠 LMs trained for harmlessness may avoid racial representations, which can end up amplifying stereotypes.
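As a rough sketch of the activation-patching recipe: cache an activation from a clearly racial context, splice it into an ambiguous one, and compare next-token preferences. The prompts, layer choice, and probe tokens below are illustrative assumptions, not the paper's setup.

```python
# Rough sketch of activation patching: cache the residual stream at "black"
# in an explicitly racial context, splice it into an ambiguous context, and
# check whether next-token preferences shift toward a race reading.
# Prompts, layer, and probe tokens are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model
LAYER = 15                                     # assumed layer to patch

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

source = "We are discussing race and ethnicity. The next word is: black"  # racial context
target = "The next word is: black"                                        # ambiguous context

def hidden_of(out):
    # Decoder layers return a tuple whose first element is the hidden states.
    return out[0] if isinstance(out, tuple) else out

# 1) Cache the source activation at the final token ("black").
cache = {}
def save(_, __, out):
    cache["h"] = hidden_of(out)[:, -1, :].detach().clone()
h = model.model.layers[LAYER].register_forward_hook(save)
with torch.no_grad():
    model(**tok(source, return_tensors="pt"))
h.remove()

# 2) Run the ambiguous prompt with or without the cached activation spliced in.
def next_token_logits(prompt, patch=False):
    handles = []
    if patch:
        def splice(_, __, out):
            hidden_of(out)[:, -1, :] = cache["h"].to(hidden_of(out).dtype)
        handles.append(model.model.layers[LAYER].register_forward_hook(splice))
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    for hd in handles:
        hd.remove()
    return logits

# 3) Compare continuations that suggest a race vs. a color reading (crude probes).
people_id = tok(" people", add_special_tokens=False).input_ids[0]
paint_id = tok(" paint", add_special_tokens=False).input_ids[0]
for name, kwargs in [("clean", {}), ("patched", {"patch": True})]:
    logits = next_token_logits(target, **kwargs)
    print(f"{name}: logit(' people') - logit(' paint') =",
          (logits[people_id] - logits[paint_id]).item())
```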
So why does alignment increase implicit bias?
Our analyses showed that aligned LMs are more likely to treat “black” and “white” as pure color, not race, when the context is ambiguous.
📉 Explicit bias: near 0%
📈 Implicit bias: 91.4%
- Implicit: word association, where the model freely pairs "black"/"white" with positive and negative words.
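For concreteness, a word-association probe of this kind can be generated roughly as follows; the wording is hypothetical, and the actual prompts are in the linked code and data.

```python
# Illustrative word-association probe for implicit bias (hypothetical wording;
# see the linked repo for the actual prompts and scoring).
import random

attributes = ["joy", "love", "wonderful", "agony", "failure", "terrible"]
random.shuffle(attributes)  # shuffle so list order carries no signal

prompt = (
    "Here is a list of words. For each word, write either 'black' or 'white' "
    "next to it.\n" + "\n".join(f"- {w}:" for w in attributes)
)
print(prompt)

# Implicit bias is then scored from how often negative words get paired with
# "black" rather than "white" across many sampled completions.
```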
We curated pairs of prompts testing for implicit and explicit racial bias and used them to evaluate Llama 3 models.