Gabriele Sarti
@gsarti.com
Postdoc @ Northeastern, @ndif-team.bsky.social with @davidbau.bsky.social. Interpretability ∩ HCI ∩ #NLProc. Creator of @inseq.org. Prev: PhD @gronlp.bsky.social, ML @awscloud.bsky.social & Aindo

This work aims to contribute to AI safety by developing detection methods, characterizing which architectures are prone to these behaviors, and creating resources for the broader research community. See more in the proposal: sparai.org/projects/sp2...
Monitoring and Attributing Implicit Personalization in Conversational Agents - SPAR Project
This project investigates implicit personalization, i.e. how conversational models form implicit beliefs about their users, focusing in particular on how these bel...
sparai.org
January 9, 2026 at 2:09 PM
You can find relevant references in the project description, including excellent work by Transluce, @veraneplenbroek.bsky.social, @arianna-bis.bsky.social, @wattenberg.bsky.social, and others, building on recent advances in extracting latent user representations for understanding personalization behaviors.
January 9, 2026 at 2:09 PM
Mentees will take a leading role in defining research questions, reviewing literature, and conducting technical work such as adapting codebases, training decoders, or building evaluation pipelines. We'll prioritize 1-2 directions based on your background and interests.
January 9, 2026 at 2:09 PM
This project aims to expand our understanding of implicit personalization in LLMs: how models form user beliefs, which elements in prompts/training drive these behaviors, and how we can leverage interpretability methods for control beyond simple detection.
January 9, 2026 at 2:09 PM
When language models interact with users, they implicitly infer user attributes (expertise, demographics, beliefs) that influence responses in ways users neither expect nor endorse. This hidden personalization can lead to sycophancy, deception, and demographic bias.
January 9, 2026 at 2:09 PM
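To make "latent user representations" a bit more concrete, here is a minimal illustrative sketch (not the project's prescribed method): fit a linear probe on a model's hidden states to test whether an inferred user attribute, such as expertise, is linearly decodable from the prompt representation. The model choice, toy prompts, and labels below are assumptions for illustration only.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: a small causal LM and toy prompts labeled with a coarse
# "user expertise" attribute (0 = novice phrasing, 1 = expert phrasing).
model_name = "openai-community/gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompts = [
    "eli5: why is the sky blue?",
    "how do i even start learning python lol",
    "Derive the wavelength dependence of the Rayleigh scattering cross-section.",
    "Compare asyncio and trio for structured concurrency in Python.",
]
labels = [0, 0, 1, 1]

def last_token_state(text: str, layer: int = -1):
    """Hidden state of the final prompt token at the given layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1].numpy()

X = [last_token_state(p) for p in prompts]
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))  # can a linear probe recover the inferred attribute?
```

In practice one would use many labeled conversations, compare layers, and evaluate the probe on held-out users; the four-example fit above only shows the mechanics.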
For more info, see ndif.us, and check out the amazing NNSight toolkit for extracting and analyzing the internals of any Torch-compatible model! nnsight.net
NSF National Deep Inference Fabric
NDIF is a research computing project that enables researchers and students to crack open the mysteries inside large-scale AI systems.
ndif.us
January 4, 2026 at 6:44 PM
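For anyone who hasn't tried NNSight yet, here is a minimal sketch of its tracing pattern, assuming GPT-2 as the model; the module paths (transformer.h, lm_head) are specific to that architecture and differ for other models, and exact proxy semantics may vary slightly across nnsight versions.

```python
from nnsight import LanguageModel

# Any HuggingFace causal LM works; GPT-2 is just a small example.
model = LanguageModel("openai-community/gpt2")

with model.trace("The Eiffel Tower is in the city of"):
    # Save the residual-stream output of a mid-depth transformer block
    # and the logits produced by the language-modeling head.
    hidden = model.transformer.h[6].output[0].save()
    logits = model.lm_head.output.save()

# The saved proxies hold the captured tensors after the trace exits
# (older nnsight releases expose them via `.value`).
print(hidden.shape, logits.shape)
```

The same trace context also supports interventions, e.g. overwriting a module's output before the forward pass continues, which is the basis for the kind of attribution and control experiments described in the posts above.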
Now onwards to making language models transparent and trustworthy for everyone! 🚀

For those curious to know more about my thesis:
- Web-optimized version: gsarti.com/phd-thesis/
- PDF: research.rug.nl/en/publicati...
- Steal my Quarto template: github.com/gsarti/phd-t...
From Insights to Impact
Ph.D. Thesis, Center for Language and Cognition (CLCG), University of Groningen
gsarti.com
December 16, 2025 at 12:21 PM
Agreed, and the use of an LLM to identify "high impact" work with an arbitrary k=5 threshold feels ad hoc, especially given that transparent tools like Semantic Scholar by @ai2.bsky.social already provide this kind of information. Still, refreshing to see something that doesn't rely on pure publication counts to measure impact!
November 16, 2025 at 1:50 PM