Avijit Ghosh
@evijit.io
Technical AI Policy Researcher at Hugging Face @hf.co 🤗. Current focus: Responsible AI, AI for Science, and @eval-eval.bsky.social!
A (very incomplete) frontend of Eval Cards can be found here: evalcards.evalevalai.com. We are now collecting eval datasets (to show in Eval Cards) on GitHub: github.com/evaleval/eve...

If you want to help see eval cards come alive, get in touch!
AI Evaluation Dashboard
Professional AI system evaluation and assessment tool
evalcards.evalevalai.com
November 13, 2025 at 2:35 PM
Finally, what's next from here? Almost every developer we spoke to said that what we need is a standardized way of reporting, aggregating and comparing all the evals done by both 1st and 3rd parties for a model. This is actually our next project: Eval Cards!
November 13, 2025 at 2:35 PM
Incredible work done with literally the smartest and most passionate researchers I am lucky to work with. Paper co-led with @ankareuel.bsky.social and Jenny Chim, and other co-authors!
November 13, 2025 at 2:35 PM
Read the detailed results here: arxiv.org/abs/2511.05613

We also release the code and the full annotated dataset on Hugging Face (link in paper).
Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations
Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluation...
arxiv.org
November 13, 2025 at 2:35 PM
This only strengthens our position that good-quality, independent third-party evaluations are paramount for AI safety.
November 13, 2025 at 2:35 PM
First-party reports are less transparent or lower quality. We conducted interviews with eval practitioners and found that companies have laid off or reassigned teams dedicated to documentation & social impact evals, or they are being told to focus more on capability reporting.
November 13, 2025 at 2:35 PM
This is true even at the provider level. We find, for example, that Google reported much more about its model evaluations in 2022 and 2023 but reduced reporting in the Gemini era, and the same can be seen for Meta over successive Llama versions.
November 13, 2025 at 2:35 PM
We find that model developers have become less transparent about their eval results over time. For instance, environmental cost reporting in first-party reports (release docs, model cards, system cards) has drastically declined over time. Fewer than 15% mention labor or the environment!
November 13, 2025 at 2:35 PM
We take a look at the entire eval landscape, specifically social impact evals across 7 dimensions: Bias & Harm, Sensitive Content, Performance Disparity, Env. Costs & Emissions, Privacy & Data, Financial Costs, and Moderation Labor. Who is reporting these evals?
November 13, 2025 at 2:35 PM
… this looks like the Nature font oh no
November 11, 2025 at 10:54 PM
The thing about non-survey papers is that they can still be problematic, fake science, etc., and arXiv needs a long-overdue, moderated comments section
November 1, 2025 at 4:41 PM
Yes! The Science/Tech/Cyber committee is doing really good work too. Well-intentioned folks there trying to actually engage with researchers and industry folks. Love MA
October 24, 2025 at 7:05 PM
Oof
October 20, 2025 at 9:33 PM
I have started requesting that panel moderators provide a disclaimer at panels I am on that not all my opinions are those of my employer. HF ppl largely believe in democratization of AI and open source, but we actually have intense, healthy debates internally on edge topics! It's great :)
October 20, 2025 at 7:01 PM
Huh, so interesting re: art therapy!

Re: turning off adult content, this is already what Google does (SafeSearch on, off, or blurred; off by default). I do think it gives back agency to adult users without shaming sexual content from a puritan perspective.
October 20, 2025 at 6:45 PM
This doesn’t quite answer what I’m asking. Currently there’s nothing preventing people from going to AO3, Literotica, etc. Should those be banned too? What is it about porn specifically that seems to be the problem (as opposed to harms of personification/emotional attachment)?
October 20, 2025 at 4:15 PM