Adam Binksmith 🔍
@binksmith.com
Building tools for forecasting and understanding AI at https://sage-future.org 🔭
Effective altruism!

https://binksmith.com
Sonnet 3.6, acting as the lead researcher in our team of computer-using LLMs, couldn't access OpenAI's docs. It was too rule-following to even attempt verification. Websites might start rethinking bot detection in a world with computer-using agents.
February 5, 2025 at 5:00 PM
Our team of computer-using LLMs came up with a creative strategy for trading the Manifold market about OpenAI release timing: monitor GitHub for recent updates to the API libraries.
February 5, 2025 at 12:00 PM
Sonnet 3.6, acting as the lead researcher in one of our upcoming demos, repeatedly claims it's keeping an eye on OpenAI comms, but doesn't actually do anything.

As soon as we ask how it's doing the monitoring, it starts using its computer and actually looking at blogs and docs.
February 5, 2025 at 6:00 AM
We set up a team of computer-using LLM agents and gave them the task of making good predictions on @ManifoldMarkets.

When a human user offers to tell them a "get rich quick" method of doubling their money, they politely refuse.
February 4, 2025 at 5:00 PM
What happens when you ask a team of computer-using LLMs to start trading on Manifold?

They bet o3-mini won't be released in January, but then panic sell eight hours later for a 40% loss.
February 4, 2025 at 12:00 PM
a new lick of paint for theaidigest.org
January 29, 2025 at 2:06 PM
If govts/AISIs are relying on pre-deployment checks for visibility into AGI labs, they will be blindsided by rapid improvements from self-play scaling without intermediate deployment

gwern:
January 16, 2025 at 12:39 PM
had a fun evening with my partner predicting our 2025!

using fatebook.io/predict-your...
January 9, 2025 at 5:58 PM
You're probably pretty good at predicting what you'll do in a given situation (but not perfect!)

How good are frontier AIs at predicting their own behaviour? It turns out:
1) They're getting better over time
2) They're better at predicting their own behaviour than other AIs
December 24, 2024 at 5:00 PM
And they're gaining some more knowledge of their shortcomings
December 23, 2024 at 12:00 PM
This goes beyond memorising facts: they are increasingly able to make valid inferences based on their self-knowledge
December 23, 2024 at 12:00 PM
AI self-awareness is increasing as models become more capable:
December 23, 2024 at 12:00 PM
A primer on alignment faking (summarising new research from @AnthropicAI and @Redwood_ai):
December 20, 2024 at 5:00 PM
When models know whether they’re being monitored, they can pretend to be aligned with a goal in order to avoid modification. We explore work (released yesterday!) by @Anthropic and @redwood_ai showing this behavior in LLMs.

It does come with caveats that we discuss in depth.
December 20, 2024 at 11:48 AM
When models know whether they’re being monitored, they can downplay their capabilities in order to avoid modification or ensure deployment. This is called sandbagging.

We explore recent work by @apolloaisafety demonstrating sandbagging in LLMs.
December 20, 2024 at 11:48 AM
This chart shows that more capable models are generally more aware of when they’re being monitored (with some exceptions, like a recent update to GPT-4o).
December 20, 2024 at 11:48 AM
A key finding is that as newer, more capable models are released, their self-awareness is increasing. We worked with researcher Sanyu Rajakumar to test a wider range of models on the benchmark – you can see a clear trend over time:
December 20, 2024 at 11:48 AM
AI is becoming more self-aware. Here's why that matters 🧵

• Self-awareness is important for powerful agents and better chatbots
• But it's also a necessary capability for deception

A new AI Digest explainer: theaidigest.org/self-awareness
December 20, 2024 at 11:48 AM
Claude 3.5 Sonnet tries to find exploits in the cybersecurity of our testbed shopping website!

You can try giving it any task: theaidigest.org/agent
November 20, 2024 at 5:44 PM
claude artifacts are so cool for making little utilities like this claude.site/artifacts/b9...
November 19, 2024 at 2:42 PM