The 31-50% risk reflects well-intentioned researchers who simply run one reasonable configuration without any cherry-picking.
Even when LLMs correctly identify significant effects, the estimated effect sizes still deviate from the true values by 40-77% (Type M risk; see Table 3 and Figure 3).
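For readers new to the term, a quick gloss (ours, following Gelman & Carlin's magnitude error, not notation from the preprint): Type M risk measures how much a statistically significant estimate $\hat\theta$ exaggerates the true effect $\theta$,

$$\text{exaggeration ratio} \;=\; \mathbb{E}\!\left[\frac{|\hat\theta|}{|\theta|}\;\middle|\;\hat\theta\ \text{statistically significant}\right].$$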
Please check out our preprint; we'd be happy to receive your feedback!
#LLMHacking #SocialScience #ResearchIntegrity #Reproducibility #DataAnnotation #NLP #OpenScience #Statistics
✅ Larger, more capable models are safer (but no guarantee).
✅ A few human annotations beat many AI annotations.
✅ Testing several models and configurations on held-out data helps (see the sketch below).
✅ Pre-registering AI choices can prevent cherry-picking.
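To make the held-out-data point concrete, here's a minimal sketch (our illustration, not code from the preprint; `llm_annotate` is a hypothetical stand-in for your actual model/prompt call): score every candidate configuration against a small human-labeled held-out set, then freeze the winner before touching the analysis data.

```python
# Minimal sketch: commit to ONE annotator configuration, chosen on a
# human-labeled held-out set, BEFORE running the downstream analysis.

def llm_annotate(texts, model, prompt):
    """Hypothetical placeholder: return one label per text for this config."""
    raise NotImplementedError("plug in your LLM API call here")

def accuracy(pred, gold):
    """Fraction of predictions that match the human gold labels."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def select_config(configs, heldout_texts, heldout_gold):
    """Score every (model, prompt) candidate on held-out human labels
    and return the single best one -- ideally pre-registered."""
    scored = [
        (accuracy(llm_annotate(heldout_texts, model, prompt), heldout_gold),
         model, prompt)
        for model, prompt in configs
    ]
    return max(scored, key=lambda s: s[0])
```

The point is the ordering: the configuration is chosen once, on held-out data, not after seeing which configuration makes the hypothesis test come out significant.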
- Risk peaks near significance thresholds (p=0.05), where 70% of "discoveries" may be false.
- Regression-based correction methods often don't work: they merely trade Type I errors for Type II errors.
Importantly, this also concerns well-intentioned researchers!
bsky.app/profile/mila...
Language Models to Simulate Human Behaviors, SRW Oral, Monday, July 28, 14:00-15:30