Meta-scientist building tools to help other scientists. NLP, simulation, & LLMs.
Creator and developer of RegCheck (https://regcheck.app).
1/4 of @error.reviews.
🇮🇪
ChatGPT-5.1 Thinking is the first model to do this successfully!
See the shaded blue area in the two plots? That covers the 95% interval for where bootstrapped human data falls.
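A rough sketch of how a band like that could be computed with a percentile bootstrap. The post doesn't say which statistic was bootstrapped, so the mean here is an assumption, and the ratings are made-up numbers for illustration only.

```python
import random
import statistics

def bootstrap_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean (the statistic is an assumption)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, compute the mean each time, then sort the means.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lo_idx = round(tail * n_resamples)
    hi_idx = round((1 - tail) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical human ratings, not data from the study.
human_ratings = [3, 4, 4, 5, 2, 3, 4, 5, 3, 4]
low, high = bootstrap_interval(human_ratings)
print(f"95% bootstrap interval for the mean: [{low:.2f}, {high:.2f}]")
```

A model's value landing inside `[low, high]` would then count as falling within the shaded area.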
(1) The model used;
(2) The temperature OR reasoning effort hyperparameter setting;
(3) The demographic info provided;
(4) The way items were presented to the model.
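If those four factors were fully crossed (an assumption; the post only lists them), the set of conditions could be enumerated like this. Every level name below is a hypothetical placeholder, not a setting taken from the study.

```python
from itertools import product

# Hypothetical levels for each of the four factors listed above.
factors = {
    "model": ["model-a", "model-b"],
    "sampling": ["temperature=0", "temperature=1", "reasoning_effort=low"],
    "demographics": ["none", "age+gender"],
    "item_presentation": ["one-at-a-time", "all-at-once"],
}

# One dict per condition: every combination of one level from each factor.
conditions = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(conditions))  # 2 * 3 * 2 * 2 combinations
print(conditions[0])
```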
On paper, GPT-5-nano is much cheaper than GPT-5-mini: $0.40 per million output tokens for nano vs. $2 per million for mini.
But behind the scenes, nano is secretly costing the user almost as much as mini. 🧵
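The arithmetic behind a claim like this, as a sketch: at these list prices, nano only stays cheaper if it emits fewer than 5x the output tokens mini does for the same task. The token counts below are hypothetical, picked just to show how the gap can close; they are not measurements from the thread.

```python
# List prices from the post, in USD per million output tokens.
NANO_PRICE = 0.40
MINI_PRICE = 2.00

def output_cost(tokens, price_per_million):
    """Cost in USD for a given number of output tokens."""
    return tokens / 1_000_000 * price_per_million

# Nano costs the same as mini once it uses this many times more output tokens.
break_even_ratio = MINI_PRICE / NANO_PRICE

# Hypothetical token counts for one task (assumed, not measured).
mini_tokens = 1_000
nano_tokens = 4_500

mini_cost = output_cost(mini_tokens, MINI_PRICE)
nano_cost = output_cost(nano_tokens, NANO_PRICE)
print(f"break-even token ratio: {break_even_ratio:.1f}x")
print(f"mini: ${mini_cost:.4f}  nano: ${nano_cost:.4f}")
```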
Effects were not caused by dissonance between authoring a valenced essay and giving a rating; they were due to the general effects that the context window has on output.
The key ingredient: we varied the authorship of the generated essay, tagging it as created by either the model or the user.
In some runs you'll still see "Once" pop up as the most probable, but it comes up wayyyyy less frequently.
You can find this out! In GPT-4o with temperature = 1, it averages around 0.7%.
Many years later and I still think about this sign I saw at a Subway.
Because the entire study was done through the chat interface.
Well what do you call it?
A personal highlight was getting a pic of the largest in-person gathering to date of the 100% CI extended universe.
@ianhussey.bsky.social @malte.the100.ci @ruben.the100.ci @annemscheel.bsky.social @taymalsalti.bsky.social @scientificdiscovery.dev