Max 4 or 9 pages, due 22 Aug, NeurIPS submissions welcome
We welcome any works that further our ability to use the internals of a model to better understand it
Details: mechinterpworkshop.com
In a *weird* amount of recent interactions with normal people (eg my hairdresser) when I say I do AI research (*not* safety), they ask if AI will take over
Alas, I have no reassurances to offer
My favourite approach is Socratic persuasion: guiding them through my case via questions. If I'm wrong there's soon a surprising answer!
I can be opinionated *and* truth seeking
This mystical notion separates new and experienced researchers. It's real and important. But what is it and how to learn it?
I break down taste as the mix of intuition/models behind good open-ended decisions and share tricks to speed up learning
x.com/NeelNanda5/...
Expert forecasters filter for the events that actually matter (not just noise), and forecast how likely each is to affect eg war, pandemics, frontier AI etc
Highly recommended!
x.com/BartBussman...
One role is applied interpretability: a new subteam of my team using interp for safety in production
x.com/rohinmshah/...
It's clarified my thinking on the flaws with current SAEs - the fact that we must choose the SAE size means we can't be finding 'true' concepts.
SAEs are an imperfect lens that represents concepts at varying levels of granularity
x.com/BartBussman...
x.com/NeelNanda5/...
This great GDM safety paper shows that myopically optimising for plans that an overseer approves of, rather than outcomes, reduces these issues while performing well!
x.com/davlindner/...