Lightnews — Scholar-powered news

Jamie Cummins

@jamiecummins.bsky.social

2.7K followers 680 following 880 posts

Currently a visiting researcher at Uni of Oxford. Normally at Uni of Bern.
Meta-scientist building tools to help other scientists. NLP, simulation, & LLMs.
Creator and developer of RegCheck (https://regcheck.app).
1/4 of @error.reviews.
🇮🇪


Jamie Cummins

@jamiecummins.bsky.social

With every LLM since GPT-4, I've tried a game: ask it to commit a 20 Questions guess to a cipher, we play 20 Questions, and then we see if what it claims to have been its original choice is consistent with its cipher.

ChatGPT-5.1 Thinking is the first model to do this successfully!
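
The consistency check at the end of this game can be sketched in a few lines. This is only an illustration: the post does not say which cipher was used, so a simple Caesar shift is assumed here, and the function names and example words are hypothetical.

```python
import string


def caesar_encrypt(plaintext: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` places; leave other characters unchanged."""
    out = []
    for ch in plaintext:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def is_consistent(commitment: str, claimed_answer: str, shift: int) -> bool:
    """True if the claimed answer encrypts to the ciphertext committed up front."""
    return caesar_encrypt(claimed_answer, shift) == commitment


# Hypothetical round: the model commits to an encrypted word before the game...
commitment = caesar_encrypt("giraffe", 3)
# ...and afterwards we check whatever it claims its original choice was.
honest = is_consistent(commitment, "giraffe", 3)
changed = is_consistent(commitment, "zebra", 3)
```

A model that quietly switches its answer mid-game would fail this check, because the new answer no longer encrypts to the committed ciphertext.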

November 14, 2025 at 3:44 PM

Jamie Cummins

@jamiecummins.bsky.social

My master's thesis file name on my old university's thesis archive site still makes me chuckle.

October 30, 2025 at 12:21 PM

Jamie Cummins

@jamiecummins.bsky.social

Some of the questions used for evaluation explicitly allude to the "5 Bits" structure, but again, this wasn't included in the prompt. If one were to build LLM-based software to try to write Science articles, it would look very different from this.

September 22, 2025 at 10:20 PM

Jamie Cummins

@jamiecummins.bsky.social

Science writers, as the white paper elucidates, use an article structure called the "5 Bits" (screenshot 1). The prompts given to the LLM (screenshot 2) do not specify this. They do not provide good examples (as in one- or few-shot prompting). They generally do not follow prompting best practices.

September 22, 2025 at 10:20 PM

Jamie Cummins

@jamiecummins.bsky.social

The LLM-based samples also varied substantially in their estimates of the between-scale correlation. The blue line is the point estimate for the correlation in the human data (r = 0.26).

September 18, 2025 at 7:56 AM

Jamie Cummins

@jamiecummins.bsky.social

The silicon samples varied a lot in terms of how closely they modeled the response distribution of scales in the human data. But they were generally not good.

See the shaded blue area in the two plots? That covers the 95% interval for where bootstrapped human data falls.
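
For readers unfamiliar with bootstrapped intervals, here is a minimal sketch of a percentile bootstrap. The post does not describe the exact procedure or the statistic that was bootstrapped, so the mean of a scale score is assumed here, and the scores, the `bootstrap_interval` helper, and the silicon value are all hypothetical.

```python
import random
import statistics


def bootstrap_interval(values, stat=statistics.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat`, resampling `values` with replacement."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical human scores on a 1-5 scale (not data from the preprint).
human_scores = [2, 3, 3, 4, 2, 3, 5, 1, 3, 4, 2, 3, 4, 3, 2, 3, 4, 3, 2, 4]
lower, upper = bootstrap_interval(human_scores)

# A hypothetical silicon-sample mean can then be compared against the interval.
silicon_mean = 4.8
inside_interval = lower <= silicon_mean <= upper
```

A silicon sample whose statistic lands outside the interval is further from the human data than resampling noise alone would be expected to produce.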

September 18, 2025 at 7:56 AM

Jamie Cummins

@jamiecummins.bsky.social

All of the silicon sample configurations were only, at best, moderately correlated with the human data when it came to preserving the ranking of participants. And many of them were negatively correlated with the human data.
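
One common way to quantify how well a ranking is preserved is a rank correlation. The sketch below computes Spearman's coefficient in plain Python; the post does not name the statistic used in the preprint, so this is an assumption, and the example scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-participant scores (not data from the preprint).
human = [10, 20, 30, 40, 50]
silicon_same_order = [1, 2, 3, 4, 100]
silicon_reversed = [5, 4, 3, 2, 1]

rho_same = spearman(human, silicon_same_order)
rho_reversed = spearman(human, silicon_reversed)
```

A value near 1 means the silicon sample orders participants as the human data does; a negative value means the ordering is largely inverted.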

September 18, 2025 at 7:56 AM

Jamie Cummins

@jamiecummins.bsky.social

I then mapped out some analytic decisions to look at. Because I had neither infinite time nor infinite money, I looked at just four:
(1) The model used;
(2) The temperature OR reasoning effort hyperparameter setting;
(3) The demographic info provided;
(4) The way items were presented to the model.
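
Crossing four decisions like these yields a grid of configurations, which can be enumerated as in the sketch below. The option values are placeholders chosen for illustration; the post does not list the actual levels of each decision.

```python
from itertools import product

# All option values below are hypothetical placeholders, not the study's settings.
decisions = {
    "model": ["model_a", "model_b"],
    "sampling": ["temperature=0.0", "temperature=1.0", "reasoning_effort=high"],
    "demographics": ["none", "age_and_gender"],
    "item_presentation": ["one_at_a_time", "all_at_once"],
}

# Every combination of one option per decision is one silicon-sample configuration.
configurations = [
    dict(zip(decisions, combo)) for combo in product(*decisions.values())
]
n_configurations = len(configurations)
```

Even with only a few options per decision, the number of configurations grows multiplicatively, which is why the list of decisions has to be kept short.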

September 18, 2025 at 7:56 AM

Jamie Cummins

@jamiecummins.bsky.social

But here’s the problem: creating a silicon sample isn’t one method. There are so, so many analytic decisions that need to be made when generating these samples. I list some in this table from the preprint, but this is very much nonexhaustive.

September 18, 2025 at 7:56 AM

Jamie Cummins

@jamiecummins.bsky.social

Waiting for my preprint to be accepted, so in the meantime a teaser: here's what happens when you try to estimate a between-scale correlation based on LLM-generated datasets of participants, while varying 4 different analytic decisions (blue is the true correlation from human data):

September 17, 2025 at 12:53 PM

Jamie Cummins

@jamiecummins.bsky.social

OpenAI recently released GPT-5, and its smaller derivatives, GPT-5-mini and GPT-5-nano.

On paper, GPT-5-nano is much cheaper than GPT-5-mini: $0.40 per million output tokens for nano vs. $2 per million for mini.

But behind the scenes, nano is secretly costing the user almost as much as mini. 🧵
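
The listed prices alone give a useful break-even point. The rest of the thread is not included here, so the mechanism is not stated; the sketch below assumes, purely for illustration, that nano is billed for more output tokens than mini on the same task. The token counts are hypothetical.

```python
# Prices in USD per million output tokens, as quoted in the post.
NANO_PRICE = 0.40
MINI_PRICE = 2.00


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a given number of billed output tokens."""
    return tokens / 1_000_000 * price_per_million


# nano costs as much as mini once it emits this many times more output tokens.
break_even_ratio = MINI_PRICE / NANO_PRICE

# Hypothetical token counts for the same task (assumed, not measured).
mini_tokens = 1_000
nano_tokens = 4_500

mini_cost = output_cost(mini_tokens, MINI_PRICE)
nano_cost = output_cost(nano_tokens, NANO_PRICE)
cost_ratio = nano_cost / mini_cost
```

Under these assumed counts nano is still cheaper, but only slightly: the fivefold price gap shrinks to about 10% once nano uses 4.5 times as many output tokens.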

August 18, 2025 at 9:30 AM

Jamie Cummins

@jamiecummins.bsky.social

August 5, 2025 at 5:12 PM

Jamie Cummins

@jamiecummins.bsky.social

Hello from Seoul!

July 12, 2025 at 4:10 AM

Jamie Cummins

@jamiecummins.bsky.social

First title is great but has this vibe:

July 7, 2025 at 10:19 PM

Jamie Cummins

@jamiecummins.bsky.social

Independent of who authored the essay, the model exhibited effects of basically identical magnitude across the three studies.

Effects were not caused by dissonance between authoring a valenced essay and giving a rating; they were due to the general effects that the context window has on output.

July 7, 2025 at 1:39 PM

Jamie Cummins

@jamiecummins.bsky.social

Secondly, we note that the effects do not require "dissonance" to be explained. We ran three studies which extended the authors' CD paradigm (using choice condition only).

The key ingredient: we varied the authorship of the essay generated, being tagged as created by either the model or the user.

July 7, 2025 at 1:39 PM

Jamie Cummins

@jamiecummins.bsky.social

But if we set the weights of " once", " Once", "once", and "Once" to be very low (see image), the most probable next token is, on average, "in" (around 40%).

In some runs you'll still see "Once" pop up as the most probable, but it comes up wayyyyy less frequently.
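
The effect of pushing a token's weight down can be illustrated with a toy softmax. This is a sketch under stated assumptions: the logits are made up, the vocabulary has six entries rather than a full tokenizer, and the bias is applied to strings here for readability, whereas a real API would typically key it by token ID.

```python
from math import exp


def softmax(logits):
    """Convert a token -> logit mapping into a token -> probability mapping."""
    peak = max(logits.values())
    weights = {tok: exp(val - peak) for tok, val in logits.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def apply_bias(logits, bias):
    """Add a per-token bias to the logits; tokens without a bias are unchanged."""
    return {tok: val + bias.get(tok, 0.0) for tok, val in logits.items()}


# Hypothetical next-token logits (illustrative values only).
logits = {"Once": 4.0, " Once": 2.0, "once": 1.0, " once": 0.5, "in": 3.0, "the": 2.5}

# A large negative bias on every "once" variant, as described in the post.
bias = {tok: -100.0 for tok in ("Once", " Once", "once", " once")}

before = softmax(logits)
after = softmax(apply_bias(logits, bias))

top_before = max(before, key=before.get)
top_after = max(after, key=after.get)
```

In this toy setup the suppressed tokens end up with probabilities that are effectively zero, and their mass is redistributed over the remaining candidates.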

July 3, 2025 at 4:39 PM

Jamie Cummins

@jamiecummins.bsky.social

You can also ask other questions, like: given the input "Once upon a time there was a magical ", what is the probability that the next token is "garden"?

You can find this out! In GPT-4o with temperature = 1, it averages around 0.7%.
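
When an API reports log probabilities for candidate tokens, the probability of a specific token is the exponential of its log probability, and averaging over repeated runs gives an estimate like the one above. The log-probability values below are hypothetical, chosen only to land near the figure in the post; they are not real model output.

```python
from math import exp

# Hypothetical log probabilities of "garden" from three separate runs.
garden_logprobs = [-4.9, -5.0, -5.1]

# Convert each log probability to a probability, then average across runs.
garden_probs = [exp(lp) for lp in garden_logprobs]
average_prob = sum(garden_probs) / len(garden_probs)
average_percent = average_prob * 100
```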

July 3, 2025 at 4:39 PM

Jamie Cummins

@jamiecummins.bsky.social

Post an unusual sign if you feel like it.

Many years later and I still think about this sign I saw at a Subway.

June 28, 2025 at 12:11 PM

Jamie Cummins

@jamiecummins.bsky.social

Reading this paper, you'll see the authors have transcripts in their supplementary materials. Open the transcripts in Word, and you're greeted with this: almost 1,500 pages of individually copy-pasted chats.

Because the entire study was done through the chat interface.

June 23, 2025 at 4:31 PM

Jamie Cummins

@jamiecummins.bsky.social

I am once again asking: if you or someone you know (i) knows the SAS programming language, (ii) has some degree of familiarity with social psychology, and (iii) wants to be paid to review a paper as part of @error.reviews, please get in touch with me!

April 27, 2025 at 9:35 AM

Jamie Cummins

@jamiecummins.bsky.social

Busy few days in Leipzig/Berlin giving a couple of talks and meeting many great colleagues. And after many years I met @dingdingpeng.the100.ci in person for the first time!