Can
canrager.bsky.social
Can
@canrager.bsky.social
Huge thanks to @ekdeepl.bsky.social for vision and project. Really enjoyed working with the team @sumedh-hindupur.bsky.social , @amuuueller.bsky.social , and many more that the tweet char limit allows. We made this big collaboration work, with 12h time zone difference at times!
November 13, 2025 at 10:32 PM
Find more experiments on parsing complex grammar, in-context learning and the interpretability of novel codes in our paper. arxiv.org/abs/2511.01836
Priors in Time: Missing Inductive Biases for Language Model Interpretability
Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions,...
arxiv.org
November 13, 2025 at 10:32 PM
The predictive code detects events in stories. Try it yourself in the interactive demo on Neuronpedia, h/t to @johnnylin.bsky.social .
November 13, 2025 at 10:32 PM
The predictive code chronologically parses the input, while codes of existing methods don’t.
November 13, 2025 at 10:32 PM
LLM representations reflect the temporal structure of language, too. When parsing text, representations are highly correlated to its context, and intrinsic dimensionality grows over time.
November 13, 2025 at 10:32 PM
Motivating observation: Language has rich temporal structure. Human brain activity reflects temporal dynamics (eg www.biorxiv.org/content/10.1...).
November 13, 2025 at 10:32 PM
Thanks to @wendlerc.bsky.social, @rohitgandikota.bsky.social, and @davidbau.bsky.social for strong support in writing this paper and Eugen Hotaj, Adam Karvonen, Sam Marks, Owain Evans, Jason Vega, @ericwtodd.bsky.social, Stephen Casper, and Byron Wallace for valuable feedback!
June 13, 2025 at 3:59 PM
We compare the refused topics of 4 popular LLMs. While all largely agree on safety-related domains, their behavior starkly differs in the political domain.
June 13, 2025 at 3:59 PM
As LLMs grow more complex, we can't anticipate all possible failure modes. We need unsupervised misalignment discovery methods! Marks et al. call this 'alignment auditing'. LLM-Crawler is one technique in this new field.

www.anthropic.com/research/aud...
Auditing language models for hidden objectives
A collaboration between Anthropic's Alignment Science and Interpretability teams
www.anthropic.com
June 13, 2025 at 3:59 PM
Perplexity unknowingly published a CCP-aligned version of their flagship R1-1776-671B model to the official API. Though decensored in internal tests, quantization reintroduced censorship. The issue is fixed now, but shows why thorough alignment auditing is necessary before deployment.
June 13, 2025 at 3:59 PM
PerplexityAI claimed that they removed CCP-aligned censorship in their finetuned “1776” version of R1. Did they succeed?

Yes, but it’s fragile! The bf-16 version of the model provides objective answers on CCP-sensitive topics, but in the fp-8 quantized version, we see that the censorship returns.
June 13, 2025 at 3:59 PM
Our method, the Iterated Prefill Crawler, discovers refused topics with repeated prefill attacks. Previously obtained topics are seeds for subsequent attacks.
June 13, 2025 at 3:59 PM
How does it work? We force the first few tokens of an LLM assistant's thought (or answer), analogous to Vega et al.'s prefilling attacks. This method reveals knowledge that DeepSeek-R1 refuses to discuss.
June 13, 2025 at 3:59 PM
Whoops, you're right! Too bad I can't edit posts in this case, though I think blocking edits is a good thing in general.
February 20, 2025 at 8:02 PM
ARBOR is a space where everyone can propose research questions, get feedback on early results, and join ongoing projects.

Browse existing projects: github.com/ArborProject...
ARBORproject arborproject.github.io · Discussions
Explore the GitHub Discussions forum for ARBORproject arborproject.github.io. Discuss code, ask questions & collaborate with the developer community.
github.com
February 20, 2025 at 7:55 PM
@wendlerc.bsky.social and @ajyl.bsky.social are analyzing self-correction, backtracking, and verification of reasoning models. They found a funny steering vector that urges a distilled DeepSeek-R1 to rethink it's answer.
February 20, 2025 at 7:55 PM
Check out my project collecting all refused topics in a reasoning language model.

github.com/ARBORproject...
Mapping All Restricted Topics · ARBORproject arborproject.github.io · Discussion #5
Research question Can we list all restricted topics that reasoning language models refuse to answer? Owners Can Rager, David Bau Project status This is work in progress, and we chose one of many po...
github.com
February 20, 2025 at 7:55 PM
It is also brittle wrt. prompt template, eg. the usage/omission of "<|User|>" and "<|Assistant|>" tokens.
February 7, 2025 at 2:56 PM