https://jeremie-beucler.github.io/
We also re-analyzed existing base-rate stimuli from past research using our method. The results revealed a large, previously unnoticed variability in belief strength, which could be problematic in some cases.
This method allows us to create a massive database of over 100,000 base-rate items, each with an associated belief strength value.
Here is an example showing all possible items for a single adjective out of the 66 ("Arrogant")! Better to be a kindergarten teacher than a politician in this case. 🤭
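A minimal sketch of how such adjective-by-group belief ratings could be collected from an LLM, assuming the `openai` Python client; the model name, prompt wording, and 1–7 scale are illustrative placeholders, not the exact protocol from the preprint:

```python
# Sketch: querying an LLM for belief-strength ratings (illustrative only).
# Assumes the `openai` Python client; model name, prompt, and 1-7 scale are
# placeholder choices, not the preprint's actual protocol.
from itertools import product
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

adjectives = ["arrogant"]                                   # 1 of the 66 adjectives
groups = ["politicians", "nurses", "kindergarten teachers"] # illustrative groups

def rate_belief(adjective: str, group: str) -> float:
    """Ask the model how strongly an adjective is believed to apply to a group."""
    prompt = (
        f"On a scale from 1 (not at all) to 7 (extremely), how typical is it "
        f"for {group} to be described as '{adjective}'? Answer with a number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

# Crossing every adjective with every group yields the item database.
for adjective, group in product(adjectives, groups):
    print(adjective, group, rate_belief(adjective, group))
```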
And it works really well! LLM-generated ratings showed a very strong correlation with human judgments.
More importantly, our belief-strength measure robustly predicted participants' actual choices in a separate base-rate neglect experiment!
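As a sketch of what "predicting choices" could look like analytically, one might regress trial-level responses on belief strength and base rate; the data file and column names below are hypothetical, and this is not the preprint's actual analysis:

```python
# Sketch: does belief strength predict base-rate responses? (illustrative only)
# Assumes a DataFrame with hypothetical columns:
#   chose_stereotype (0/1), belief_strength (LLM rating), base_rate (e.g. 0.995).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("base_rate_trials.csv")  # hypothetical trial-level data

model = smf.logit("chose_stereotype ~ belief_strength + base_rate", data=df).fit()
print(model.summary())  # a positive belief_strength coefficient would mean stronger
                        # beliefs -> more stereotype-based (base-rate neglecting) choices
```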
We argue that measuring “belief strength” is a major bottleneck in reasoning research, which mostly relies on conflict vs. no-conflict items.
It requires costly human ratings and is rarely done parametrically, limiting the development of theoretical & computational models of biased reasoning.
🚨 New preprint: Using Large Language Models to Estimate Belief Strength in Reasoning 🚨
When asked: "There are 995 politicians and 5 nurses. Person 'L' is kind. Is Person 'L' more likely to be a politician or a nurse?", most people will answer "nurse", neglecting the base-rate info.
A 🧵👇
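For intuition about why the base rate should dominate in this item, a quick Bayes sketch; the likelihood values for "kind" are made up for illustration:

```python
# Sketch: why the base rate should dominate in the politician/nurse item.
# The likelihoods P(kind | group) are made-up illustrative values.
p_politician, p_nurse = 995 / 1000, 5 / 1000     # base rates
p_kind_given_politician = 0.10                   # assumed likelihoods:
p_kind_given_nurse = 0.40                        # "kind" fits nurses 4x better

evidence = (p_kind_given_politician * p_politician
            + p_kind_given_nurse * p_nurse)
posterior_nurse = p_kind_given_nurse * p_nurse / evidence

print(f"P(nurse | kind) = {posterior_nurse:.3f}")  # ~0.02: politician is still
                                                   # far more likely
```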
📍 Come see our poster at #CCN2025, Aug 12, 1:30–4:30pm
We show how a biased drift-diffusion model can explain choice, RT and confidence in a base-rate neglect task, revealing why more deliberation doesn’t always fix bias.
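A minimal sketch of a biased drift-diffusion simulation of this kind, with a starting-point bias toward the stereotype response and a crude RT-based confidence read-out; the parameter values are arbitrary and not those fitted in the poster:

```python
# Sketch: a biased drift-diffusion model producing choice, RT, and a crude
# confidence read-out. Parameter values are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)

def simulate_trial(drift=0.2, bias=0.3, bound=1.0, dt=0.01, noise=1.0):
    """Accumulate evidence toward +bound (base-rate response) or -bound
    (stereotype response); `bias` shifts the starting point toward -bound."""
    x, t = -bias, 0.0
    while abs(x) < bound:
        x += drift * dt + noise * np.sqrt(dt) * rng.normal()
        t += dt
    choice = "base-rate" if x > 0 else "stereotype"
    confidence = 1.0 / (1.0 + t)  # crude proxy: faster decisions -> higher confidence
    return choice, t, confidence

trials = [simulate_trial() for _ in range(1000)]
p_base_rate = np.mean([c == "base-rate" for c, _, _ in trials])
print(f"P(base-rate response) = {p_base_rate:.2f}")
```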
Finding #3: The strength of the illusion is key! As the semantic overlap got stronger (e.g., "Moses" is closer to "Noah" than "Goliath" is), confidence in incorrect answers tended to increase, while confidence in correct answers tended to decrease. 📈📉
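A sketch of how "semantic overlap" could be quantified with embedding similarity, assuming the `sentence-transformers` library; the specific model and similarity measure are assumptions, not necessarily those used in the paper:

```python
# Sketch: quantifying semantic overlap via embedding similarity.
# Assumes the `sentence-transformers` library; model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["Moses", "Noah", "Goliath"], convert_to_tensor=True)

print("sim(Moses, Noah)   =", util.cos_sim(emb[0], emb[1]).item())
print("sim(Goliath, Noah) =", util.cos_sim(emb[2], emb[1]).item())
# The prediction: the higher the overlap with the correct figure, the stronger
# the illusion (more confident errors, less confident correct responses).
```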
Finding #2: Even when participants got it wrong and fell for the illusion, they showed significant error sensitivity (lower confidence). Interestingly, this effect was not affected by load or deadline, suggesting that this error sensitivity is intuitive.
Finding #1: You don't always need to be slow to be right! 🐢 A significant number of participants intuitively spotted the anomaly from the start, without needing extra time or resources to deliberate. 🐇 Sound intuitive reasoning does happen.
To test this, we ran 4 experiments with over 500 participants! We used a two-response paradigm: first, a quick intuitive answer under time pressure & cognitive load. Then, a final, deliberated response with no constraints. Here are the main results:
These semantic illusions are often used to test for deliberate "System 2" thinking (e.g., in the verbal Cognitive Reflection Test). The classic theory? We intuitively fall for the illusion & need slow, effortful deliberation to correct the mistake. But is it really that simple?
New (and first) paper accepted at JEP:LMC 🎉
Ever fallen for this type of question: "How many animals of each kind did Moses take on the Ark?" Most say "Two," forgetting that it was Noah, not Moses, who took the animals onto the Ark. But what’s really going on here? 🧵