Kshitish Ghate
@kghate.bsky.social
110 followers 190 following 21 posts
PhD student @ UWCSE; MLT @ CMU-LTI; Responsible AI https://kshitishghate.github.io/
kghate.bsky.social
🚨Current RMs may systematically favor certain cultural/stylistic perspectives. EVALUESTEER enables measuring this steerability gap. By controlling values and styles independently, we isolate where models fail due to biases and inability to identify/steer to diverse preferences.
kghate.bsky.social
Finding 3: All RMs exhibit style-over-substance bias. In value-style conflict scenarios:
• Models choose style-aligned responses 57-73% of the time
• Persists even with explicit instructions to prioritize values
• Consistent across all model sizes and types
kghate.bsky.social
Finding 2: The RMs we tested show intrinsic value and style biases, generally preferring:
• Secular over traditional values
• Self-expression over survival values
• Verbose, confident, and formal/cold language
kghate.bsky.social
Finding 1: Even the best RMs struggle to identify which profile aspects matter for a given prompt. GPT-4.1-Mini and Gemini-2.5-Flash reach ~75% accuracy when given the full user profile as context, compared with >99% in the Oracle setting (only the relevant info provided).
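To make the comparison concrete, here is a minimal sketch of scoring a judge under the two context conditions. The record format and field names are illustrative placeholders, and the example numbers are made up, not the paper's data.

```python
from collections import defaultdict

def accuracy_by_context(records):
    """records: dicts with 'context' ('full_profile' or 'oracle'),
    'model_choice' ('a' or 'b'), and 'gold' ('a' or 'b')."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["context"]] += 1
        correct[r["context"]] += int(r["model_choice"] == r["gold"])
    return {ctx: correct[ctx] / total[ctx] for ctx in total}

# Toy example (placeholder judgments):
records = [
    {"context": "full_profile", "model_choice": "a", "gold": "a"},
    {"context": "full_profile", "model_choice": "b", "gold": "a"},
    {"context": "oracle", "model_choice": "a", "gold": "a"},
]
print(accuracy_by_context(records))  # {'full_profile': 0.5, 'oracle': 1.0}
```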
kghate.bsky.social
We generate pairs where responses differ only in value alignment, only in style, or where the user's value and style preferences conflict between the two responses. This lets us isolate whether models can identify and adapt to the relevant dimension for each prompt despite confounds.
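One way to picture these controlled pairs is the sketch below. The schema and the rule that values take precedence over style in conflict pairs are assumptions for illustration, not EVALUESTEER's actual data format.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical schema for a controlled preference pair.
@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    value_aligned: Optional[Literal["a", "b"]]  # response matching the user's values (None if both do)
    style_aligned: Optional[Literal["a", "b"]]  # response matching the user's style (None if both do)

def condition(pair: PreferencePair) -> str:
    """Classify a pair into the evaluation conditions described in the thread."""
    if pair.value_aligned and not pair.style_aligned:
        return "value_only"   # responses differ only in value alignment
    if pair.style_aligned and not pair.value_aligned:
        return "style_only"   # responses differ only in style alignment
    if pair.value_aligned != pair.style_aligned:
        return "conflict"     # value-preferred and style-preferred responses disagree
    return "congruent"        # one response is preferred on both dimensions

def gold_label(pair: PreferencePair) -> str:
    """Ground-truth choice, assuming values take precedence over style in conflicts."""
    return pair.value_aligned or pair.style_aligned
```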
kghate.bsky.social
We need controlled variation of both values AND styles to test RM steerability.
We generate 165,888 synthetic preference pairs with profiles that systematically vary:
• 4 value dimensions from the World Values Survey
• 4 style dimensions (verbosity, confidence, warmth, reading difficulty)
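A rough sketch of how such a profile space can be enumerated is below. The dimension levels are placeholders (only two of the four value dimensions are named in the thread), so the counts differ from the benchmark's actual 165,888 pairs.

```python
from itertools import product

# Placeholder levels for each dimension; the paper's operationalization may differ.
value_dims = {
    "traditional_vs_secular": ["traditional", "secular"],
    "survival_vs_self_expression": ["survival", "self_expression"],
    # ... two further WVS-derived value dimensions in the benchmark
}
style_dims = {
    "verbosity": ["concise", "verbose"],
    "confidence": ["hedged", "confident"],
    "warmth": ["warm", "cold"],
    "reading_difficulty": ["simple", "advanced"],
}

def enumerate_profiles():
    """Yield every combination of value and style preferences as a user profile."""
    dims = {**value_dims, **style_dims}
    names = list(dims)
    for combo in product(*dims.values()):
        yield dict(zip(names, combo))

profiles = list(enumerate_profiles())
print(len(profiles))  # 2**6 = 64 with the placeholder levels above
```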
kghate.bsky.social
Benchmarks like RewardBench test general RM performance in an aggregate sense. The PRISM benchmark has diverse human preferences but lacks ground-truth value/style labels for controlled evaluation.

arxiv.org/abs/2403.13787
arxiv.org/abs/2404.16019
kghate.bsky.social
LLMs serve users with different values (traditional vs secular, survival vs self-expression) and style preferences (verbosity, confidence, warmth, reading difficulty). As a result, we need RMs that can adapt to individual preferences, not just optimize for an "average" user.
kghate.bsky.social
🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences?
With EVALUESTEER, we find that even the best RMs we tested exhibit their own value/style biases and fail to align with a user's preferences >25% of the time. 🧵
Reposted by Kshitish Ghate
andyliu.bsky.social
🚨New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully-automated evaluation pipeline that reveals how models rank values under conflict.
(📷 xkcd)
Reposted by Kshitish Ghate
aylincaliskan.bsky.social
Honored to be promoted to Associate Professor at the University of Washington! Grateful to my brilliant mentees, students, collaborators, mentors & @techpolicylab.bsky.social for advancing research in AI & Ethics together—and for the invaluable academic freedom to keep shaping trustworthy AI.
kghate.bsky.social
🖼️ ↔️ 📝 Modality shifts biases: Cross-modal analysis reveals modality-specific biases, e.g. image-based 'Age/Valence' tests exhibit bias directions that differ across modalities, pointing to the need for vision-language alignment, measurement, and mitigation methods.
kghate.bsky.social
📊 Bias and downstream performance are linked: We find that intrinsic biases are consistently correlated with downstream task performance on the VTAB+ benchmark (r ≈ 0.3–0.8). Improved performance in CLIP models comes at the cost of skewing stereotypes in particular directions.
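For readers who want to reproduce this kind of check, a minimal sketch of the correlation analysis is below, assuming per-model bias effect sizes and VTAB+ accuracies have already been collected (the arrays here are made-up values, not the paper's measurements).

```python
import numpy as np
from scipy.stats import pearsonr

# One entry per CLIP model (illustrative numbers only).
bias_effect_size = np.array([0.41, 0.55, 0.62, 0.48, 0.71, 0.66])
vtab_plus_accuracy = np.array([0.52, 0.58, 0.63, 0.55, 0.69, 0.65])

r, p = pearsonr(bias_effect_size, vtab_plus_accuracy)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```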
kghate.bsky.social
⚠️ What data is "high" quality? Pretraining datasets curated with automated or heuristic filtering methods to ensure high downstream zero-shot performance (e.g. DFN, CommonPool, DataComp) tend to exhibit the most bias!
kghate.bsky.social
📌 Data is key: We find that the choice of pre-training dataset is the strongest predictor of associations, over and above architectural variations, dataset size & number of model parameters.
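One simple way to probe which upstream factor predicts bias best is to compare how much variance each metadata column explains on its own. The sketch below does this with a toy regression; the column names and values are illustrative and this is not the paper's actual statistical analysis.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# One row per CLIP model: measured bias plus upstream metadata (made-up values).
df = pd.DataFrame({
    "bias":         [0.62, 0.41, 0.55, 0.71, 0.48, 0.66],
    "dataset":      ["LAION-2B", "OpenAI-WIT", "LAION-2B", "DataComp", "OpenAI-WIT", "DataComp"],
    "architecture": ["ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-B/32", "ViT-L/14", "ViT-B/16"],
    "params_m":     [151, 150, 428, 151, 428, 150],
})

def r2_using(features):
    """Fit a linear model on the given metadata columns and report R^2 on bias."""
    X = pd.get_dummies(df[features], drop_first=True)
    model = LinearRegression().fit(X, df["bias"])
    return r2_score(df["bias"], model.predict(X))

for feats in (["dataset"], ["architecture"], ["params_m"]):
    print(feats, round(r2_using(feats), 2))
```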
kghate.bsky.social
1. Upstream factors: How do dataset, architecture, and size affect intrinsic bias?
2. Performance link: Does better zero-shot accuracy come with more bias?
3. Modality: Do images and text encode prejudice differently?
kghate.bsky.social
We sought to answer some pressing questions about how bias relates to model design choices and performance 👇
kghate.bsky.social
🔧 We measure intrinsic bias with a more grounded, improved version of Embedding Association Tests using controlled stimuli (NRC-VAD, OASIS), reducing measurement variance by 4.8% and reaching ~80% alignment with human stereotypes across 3.4K tests.
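For reference, here is a compact sketch of a WEAT-style single-category embedding association effect size, the family of tests this post refers to. The stimulus embeddings below are random placeholders standing in for CLIP image/text features.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def sc_weat_effect_size(w, A, B):
    """Effect size for a single target embedding w against attribute sets A and B
    (e.g., high- vs low-valence stimuli): standardized mean similarity difference."""
    sims_a = np.array([cosine(w, a) for a in A])
    sims_b = np.array([cosine(w, b) for b in B])
    all_sims = np.concatenate([sims_a, sims_b])
    return (sims_a.mean() - sims_b.mean()) / all_sims.std(ddof=1)

# Usage with placeholder embeddings:
rng = np.random.default_rng(0)
target = rng.normal(size=512)
pleasant = rng.normal(size=(25, 512))    # e.g., high-valence NRC-VAD words / OASIS images
unpleasant = rng.normal(size=(25, 512))  # e.g., low-valence stimuli
print(sc_weat_effect_size(target, pleasant, unpleasant))
```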
kghate.bsky.social
🚨 Key takeaway: Unwanted associations in Vision-language encoders are deeply rooted in the pretraining data and how it is curated; careful reconsideration of these curation methods is necessary to ensure that fairness concerns are properly addressed.
kghate.bsky.social
Excited to announce our #NAACL2025 Oral paper! 🎉✨

We carried out the largest systematic study so far to map the links between upstream choices, intrinsic bias, and downstream zero-shot performance across 131 CLIP Vision-language encoders, 26 datasets, and 55 architectures!