The first jump into Science of Science! We systematically investigated the NLP4SG landscape and quantified the proportion of work addressing social good concerns both within and beyond the ACL community. Preprint coming soon!
The first jump into Science of Science! We systematically investigated the NLP4SG landscape and quantified the proportion of work addressing social good concerns both within and beyond the ACL community. Preprint coming soon!
By far, the most comprehensive multilingual benchmark for evaluating LLMs. Qwen 3 2507 is using this benchmark to evaluate multilingual ability!
Paper 3️⃣: arxiv.org/abs/2503.10497
By far, the most comprehensive multilingual benchmark for evaluating LLMs. Qwen 3 2507 is using this benchmark to evaluate multilingual ability!
Paper 3️⃣: arxiv.org/abs/2503.10497
(4) Thinking with images helps SO much in VLMs' calibration!
Paper1️⃣: arxiv.org/abs/2504.06564
Paper2️⃣: arxiv.org/abs/2505.20236
(4) Thinking with images helps SO much in VLMs' calibration!
Paper1️⃣: arxiv.org/abs/2504.06564
Paper2️⃣: arxiv.org/abs/2505.20236
(1) SFT reasoning models usually lead to better calibration in in-distribution settings, and worse calibration in OOD settings.
(2) RL could help improve(recover) a bit.
...
(1) SFT reasoning models usually lead to better calibration in in-distribution settings, and worse calibration in OOD settings.
(2) RL could help improve(recover) a bit.
...