Co-author of: Malware Data Science
🎧 www.heroku.com/podcasts/cod...
www.youtube.com/watch?v=01I4...
Meanwhile, Musk has recently voiced public support for the far-right AfD party, often described as antisemitic and extremist.
www.cnn.com/2024/12/20/m...
That + no apology...
docs.google.com/document/d/1...
Incredibly impressive person.
arxiv.org/pdf/2401.05566
So many AI safety issues get worse, & become harder to combat, the larger and more advanced your model gets:
"The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning"
It's extremely hard to take sycophancy out of an LLM trained the way we train them.
TLDR: LLMs tend to generate sycophantic responses.
Human feedback & preference models encourage this behavior.
I also think this is just the nature of training on internet writing... We write in social clusters:
Assuming outer misalignment, x can be seen as safer than y.
That being said, the better the model, the less this will happen.
But I do think that inner misalignment (~learned features) tends to act as a protective mechanism against the implications of outer misalignment.
I, er, really hope.