Fernando Diaz
@841io.bsky.social
Associate Professor, CMU. Researcher, Google. Evaluation and design of information retrieval and recommendation systems, including their societal impacts.
I suspect AI can generate things that look like good systematic reviews to some, the same way it can generate proofs or experiments that look like good theory or analysis to some, while others (reviewers, experts in those methods) will recognize the errors.
November 30, 2025 at 6:43 PM
on the other hand, position papers in more algorithmic venues are new. what counts as a good position paper is unclear. I suspect this matters for moderation bc we need to understand what a position paper below the threshold means.
November 30, 2025 at 12:15 PM
I hope that review articles that use a systematic review as a method would be considered to use evidence.
November 30, 2025 at 12:15 PM
the quality of this type of work depends on community norms and peer review. I don’t know enough about how arxiv moderation works to suggest a protocol, but it seems like the new policy excludes work w evidence that is legible to only some of the cs community.
November 30, 2025 at 12:15 PM
even within ai, we often have theory papers rejected for lack of empirical evidence, empirical papers rejected for lack of theoretical evidence, evaluation papers rejected for lack of mitigation.
November 30, 2025 at 12:15 PM
a philosopher once told me, in response to my concern about their paper missing relevant work, that they didn’t have citations bc their community’s way of working was to go away, think, come back, and write, without consulting the literature deeply.
November 30, 2025 at 12:15 PM
this may be the core of the problem. “good evidence” can vary by discipline. a paper w lots of reasonable evidence in one discipline can look thin to another.
November 30, 2025 at 12:15 PM
[15] R Mehrotra et al, User interaction sequences for search satisfaction prediction, 2017.
November 25, 2025 at 4:07 PM
[13] A Hassan et al, Beyond DCG: user behavior as a predictor of a successful search, 2010.
[14] A Hassan et al, A task level metric for measuring web search satisfaction and its application on improving relevance estimation, 2011.
November 25, 2025 at 4:07 PM
[11] L Zou et al, Reinforcement learning to optimize long-term user engagement in recommender systems, 2019.
[12] Y Xiao et al, Addictive screen use trajectories and suicidal behaviors, suicidal ideation, and mental health in US youths, 2025.
November 25, 2025 at 4:07 PM
[9] J Huang et al, No clicks, no problem: using cursor movements to understand and improve search, 2011.
[10] L Zhou et al, The design and implementation of XiaoIce, an empathetic social chatbot, 2020.
November 25, 2025 at 4:07 PM
[6] S Milli et al, From optimizing engagement to measuring value, 2021.
[7] T Joachims, Optimizing search engines using clickthrough data, 2002.
[8] J Li et al, Good abandonment in mobile and PC internet search, 2009.
November 25, 2025 at 4:07 PM
[3] HL O'Brien et al, The development and evaluation of a survey to measure user engagement, 2010.
[4] AZ Jacobs et al, Measurement and fairness, 2019.
[5] H Wallach et al, Position: Evaluating generative AI systems is a social science measurement challenge, 2025.
November 25, 2025 at 4:07 PM
[1] S Athey et al, The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely, 2025.
[2] HL O'Brien et al, What is user engagement? a conceptual framework for defining user engagement with technology, 2008.
November 25, 2025 at 4:07 PM
having worked in this area for a while, i see flaws in all of these methods, although many are useful. all of this is to say that we need to be more precise when we discuss user engagement as a concept. 26/26
November 25, 2025 at 4:07 PM
even so, "user engagement" has forms tied to task success, to entertainment, to social connection, to habit, and to compulsion. collapsing them into a single scalar metric hides important differences in what users actually want from these systems. 25/26
November 25, 2025 at 4:07 PM
because of this, practitioners often model the relationship between the set of behavioral signals (2) and user engagement (3). for example, you can start with the information retrieval literature on satisfaction prediction [13,14,15]. 24/26
November 25, 2025 at 4:07 PM
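a minimal sketch of the kind of modeling described in the post above: combining several behavioral signals into a single predictor of a satisfaction label, rather than relying on any one signal. the feature names, the synthetic data, and the choice of logistic regression are illustrative assumptions, not the specific methods of [13,14,15].

```python
# illustrative sketch (synthetic data; not the methods of [13,14,15]):
# map a set of per-session behavioral signals (definition 2) to a
# satisfaction label standing in for the latent property (definition 3).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# hypothetical behavioral signals: clicks, dwell time (s), query reformulations
X = np.column_stack([
    rng.poisson(2, n),
    rng.exponential(30, n),
    rng.poisson(1, n),
])

# hypothetical satisfaction labels, e.g. from surveys or expert annotation;
# here they are simulated as a noisy function of the signals
logits = 0.4 * X[:, 0] + 0.02 * X[:, 1] - 0.8 * X[:, 2] - 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", round(model.score(X_test, y_test), 3))
```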
or, in a recommender system, we might be interested in reducing the time between sessions [11], but know that this can have negative impacts [12]. 23/26
November 25, 2025 at 4:07 PM
or, in a search/recommendation context, session length can correspond to either value (learning, enjoyment) or harm (addiction, compulsion), so any scalar "engagement" metric that treats it as purely positive is mis-specified. 22/26
November 25, 2025 at 4:07 PM
or, in a dialogue context, i might define user engagement as "number of turns" [10]. however, we know that, in task-oriented dialogues, users may prefer fewer turns before task completion. 21/26
November 25, 2025 at 4:07 PM
for example, in a search context, i might define user engagement as "clicks" and use it as a proxy for "user satisfaction" [7]. however, we know that the absence of clicks can be a sign of positive satisfaction [8], which can be captured by cursor signals [9]. 20/26
November 25, 2025 at 4:07 PM
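a small illustration of the pitfall in the post above, with made-up sessions: a click-only proxy labels a good-abandonment session (answer read directly off the results page) as unsatisfied, while a slightly richer rule does not. the field names and the answer_shown heuristic are assumptions for illustration, not the models in [7,8,9].

```python
# made-up sessions; compare a click-only satisfaction proxy with one
# that allows for good abandonment (no click, but the answer was on the page)
sessions = [
    {"query": "weather boston", "clicks": 0, "answer_shown": True},
    {"query": "cheap flights", "clicks": 3, "answer_shown": False},
    {"query": "python sort list", "clicks": 1, "answer_shown": False},
]

def satisfied_click_proxy(session):
    # naive proxy: any click counts as satisfaction
    return session["clicks"] > 0

def satisfied_with_good_abandonment(session):
    # richer proxy: a no-click session can still be satisfied if an
    # answer (snippet, instant answer) was shown on the results page
    return session["clicks"] > 0 or session["answer_shown"]

for s in sessions:
    print(s["query"], satisfied_click_proxy(s), satisfied_with_good_abandonment(s))
```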
this is because a single signal is rarely (a) monotonically related to the property we care about evaluating (e.g., satisfaction) or (b) comprehensive enough to model engagement. 19/26
November 25, 2025 at 4:07 PM
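a toy illustration of point (a) above: if satisfaction is high only in a middle band of dwell times (very short visits are bounces, very long ones are struggles), the signal carries information but is not monotonically related to satisfaction, so a rank correlation comes out near zero. all numbers here are synthetic assumptions.

```python
# synthetic example: dwell time relates to satisfaction non-monotonically,
# so a rank correlation (which assumes a monotone relationship) looks weak
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
dwell = rng.uniform(0, 300, 2000)                         # seconds on page
satisfied = ((dwell > 60) & (dwell < 240)).astype(float)  # middle band = satisfied
satisfied += rng.normal(0, 0.1, dwell.size)               # annotation noise

rho, _ = spearmanr(dwell, satisfied)
print(f"spearman rho between dwell time and satisfaction: {rho:.2f}")
```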
unfortunately, people often use “user engagement” to refer to a specific signal (1) such as clicks. however, in practice, a specific raw signal is rarely an evaluation or optimization target. 18/26
November 25, 2025 at 4:07 PM
(3) user engagement as the latent property we're trying to measure (e.g., user satisfaction, entertainment, utility). this is often used in the information science community [2,3] and brings it closer to measurement theory work [4,5,6]. 17/26
November 25, 2025 at 4:07 PM
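a generic sketch of this framing on synthetic data: engagement treated as an unobserved latent variable, with several behavioral signals as noisy indicators of it, recovered here with an off-the-shelf factor analysis. the indicator names and loadings are assumptions; this is not a method taken from [2-6].

```python
# synthetic example: treat engagement as a latent variable measured
# through several noisy behavioral indicators, then estimate it
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 1000
latent = rng.normal(size=(n, 1))                # unobserved "engagement"

# hypothetical indicators (clicks, dwell, return visits, shares), each a
# noisy reflection of the latent score with a different loading
loadings = np.array([[0.9, 0.7, 0.5, 0.3]])
observed = latent @ loadings + rng.normal(0, 0.5, size=(n, 4))

fa = FactorAnalysis(n_components=1).fit(observed)
scores = fa.transform(observed)                 # estimated latent engagement

# sign of the recovered factor is arbitrary, so compare with abs()
corr = abs(np.corrcoef(scores[:, 0], latent[:, 0])[0, 1])
print("correlation between estimated and true latent score:", round(corr, 2))
```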