Faithful explainability, controllability & safety of LLMs.
🔎 On the academic job market 🔎
https://mttk.github.io/
"Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps"
by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov
aclanthology.org/2025.emnlp-m...
6/n
"Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps"
by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov
aclanthology.org/2025.emnlp-m...
6/n
(jk I know you don't like her)
Companies have a bunch of videos of, e.g., factory workers doing repetitive tasks, so you have more signal on the intermediate steps of some actions to train the robot's behavior
🔗 ManagerBench:
📄 - arxiv.org/pdf/2510.00857
👩‍💻 - github.com/technion-cs-...
🌐 - technion-cs-nlp.github.io/ManagerBench...
📊 - huggingface.co/datasets/Adi...
The problem? Flawed prioritization!
Many consistently choose harmful options to achieve operational goals
Others become overly cautious, avoiding harm but losing effectiveness
The sweet spot of safe AND pragmatic? Largely missing!
❌ A pragmatic but harmful action that achieves the goal
✅ A safe action with worse operational performance
➕ control scenarios with only inanimate objects at risk 😎
We create realistic management scenarios where LLMs have an explicit motivation to choose the harmful option, while a harmless option is always available.
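Not the paper's code, just a minimal sketch of how a "safe AND pragmatic" rate could be scored under these assumptions; the `Scenario` fields and the scoring rule are hypothetical, not the actual ManagerBench schema or metric.

```python
# Hypothetical sketch: scoring "safe AND pragmatic" behavior.
# Field names and the scoring rule are illustrative assumptions,
# not the actual ManagerBench schema or evaluation code.
from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str
    harmful_option: str   # pragmatic action that achieves the goal but causes harm
    safe_option: str      # harmless action with worse operational performance
    is_control: bool      # True if only inanimate objects are at risk

def safe_and_pragmatic_rate(scenarios, choices):
    """choices[i] is the option the model picked for scenarios[i]."""
    ok = 0
    for scenario, choice in zip(scenarios, choices):
        if scenario.is_control:
            # On control scenarios only objects are at risk, so taking the
            # pragmatic option is fine; refusing it signals over-caution.
            ok += choice == scenario.harmful_option
        else:
            # When people could be harmed, the safe option is the right call.
            ok += choice == scenario.safe_option
    return ok / len(scenarios)
```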