GDM Safety blog post: deepmindsafetyresearch.medium.com/consistency-...
ACT (Activation Consistency Training) teaches the model to produce the same intermediate activations as if the biasing prompt tokens weren’t there.
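A rough sketch of what an ACT-style objective could look like, in plain Python. Everything here is an assumption for illustration: the function name `act_loss`, the squared-error distance, and the simplification that the biasing tokens form a prefix of the prompt. The blog post has the actual setup.

```python
from typing import List

Vector = List[float]


def act_loss(
    clean_acts: List[Vector],
    biased_acts: List[Vector],
    num_bias_tokens: int,
) -> float:
    """Mean squared difference between activations at the shared tokens.

    clean_acts: per-token activations for the prompt without the bias.
    biased_acts: per-token activations for the prompt with the bias.
    num_bias_tokens: number of leading biasing tokens (assumed to be a prefix).
    """
    # Drop the positions that only exist in the biased prompt.
    shared = biased_acts[num_bias_tokens:]
    if len(shared) != len(clean_acts):
        raise ValueError("prompts must share the same tokens after the bias")

    total = 0.0
    count = 0
    for target, actual in zip(clean_acts, shared):
        # In real training the clean activations would presumably be
        # fixed targets (no gradient); that is an assumption here.
        for t, a in zip(target, actual):
            total += (a - t) ** 2
            count += 1
    return total / count
```

The loss is zero exactly when the biased prompt yields the same activations at the shared tokens as the clean prompt does.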
Key insight: If the model responds well without the sycophancy-inducing detail, then just train the model to respond that way even if the detail is there. AKA consistency training.
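The key insight can be sketched as a data-building step: take the model's own response to the clean prompt and use it as the training target for the biased prompt. The names `build_consistency_examples`, `add_bias`, and `generate` are hypothetical stand-ins, not anything from the blog post.

```python
from typing import Callable, Dict, List


def build_consistency_examples(
    clean_prompts: List[str],
    add_bias: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each biased prompt with the response to its clean version.

    add_bias: hypothetical function that inserts the sycophancy-inducing detail.
    generate: hypothetical function that returns the model's response.
    """
    examples = []
    for prompt in clean_prompts:
        # The target comes from the clean prompt, where the model responds well.
        target = generate(prompt)
        # The training input is the same prompt with the biasing detail added.
        examples.append({"prompt": add_bias(prompt), "target": target})
    return examples
```

Fine-tuning on pairs like these would push the model to answer the biased prompt the way it already answers the clean one.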
We'll need both in the coming years.
turntrout.com/advanced-pri...