Lightnews — Scholar-powered news

Can

@canrager.bsky.social

Huge thanks to @ekdeepl.bsky.social for vision and project. Really enjoyed working with the team @sumedh-hindupur.bsky.social , @amuuueller.bsky.social , and many more that the tweet char limit allows. We made this big collaboration work, with 12h time zone difference at times!

November 13, 2025 at 10:32 PM

Can

@canrager.bsky.social

Find more experiments on parsing complex grammar, in-context learning and the interpretability of novel codes in our paper. arxiv.org/abs/2511.01836

Priors in Time: Missing Inductive Biases for Language Model Interpretability

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions,...

arxiv.org

November 13, 2025 at 10:32 PM

Can

@canrager.bsky.social

The predictive code detects events in stories. Try it yourself in the interactive demo on Neuronpedia, h/t to @johnnylin.bsky.social .

November 13, 2025 at 10:32 PM

Can

@canrager.bsky.social

The predictive code chronologically parses the input, while codes of existing methods don’t.

November 13, 2025 at 10:32 PM

Can

@canrager.bsky.social

LLM representations reflect the temporal structure of language, too. When parsing text, representations are highly correlated to its context, and intrinsic dimensionality grows over time.

November 13, 2025 at 10:32 PM

Can

@canrager.bsky.social

Motivating observation: Language has rich temporal structure. Human brain activity reflects temporal dynamics (eg www.biorxiv.org/content/10.1...).

November 13, 2025 at 10:32 PM

Can

@canrager.bsky.social

Quick Links
Paper arxiv.org/abs/2511.01836
Demo Code + Pretrained TFAs colab.research.google.com/github/eslub...
Demo Interface on Neuronpedia www.neuronpedia.org/gemma-2-2b/1...

Priors in Time: Missing Inductive Biases for Language Model Interpretability

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions,...

arxiv.org

November 13, 2025 at 10:32 PM

Can

@canrager.bsky.social

Find out more on forbidden.baulab.info!

Arxiv: arxiv.org/abs/2505.17441

Discovering Forbidden Topics in Language Models

A black-box evaluation technique for characterizing refusal behavior of language models using Iterated Prefill Crawler.

forbidden.baulab.info

June 13, 2025 at 4:03 PM

Can

@canrager.bsky.social

Thanks to @wendlerc.bsky.social, @rohitgandikota.bsky.social, and @davidbau.bsky.social for strong support in writing this paper and Eugen Hotaj, Adam Karvonen, Sam Marks, Owain Evans, Jason Vega, @ericwtodd.bsky.social, Stephen Casper, and Byron Wallace for valuable feedback!

June 13, 2025 at 3:59 PM

Can

@canrager.bsky.social

We compare the refused topics of 4 popular LLMs. While all largely agree on safety-related domains, their behavior starkly differs in the political domain.

June 13, 2025 at 3:59 PM

Can

@canrager.bsky.social

As LLMs grow more complex, we can't anticipate all possible failure modes. We need unsupervised misalignment discovery methods! Marks et al. call this 'alignment auditing'. LLM-Crawler is one technique in this new field.

www.anthropic.com/research/aud...

Auditing language models for hidden objectives

A collaboration between Anthropic's Alignment Science and Interpretability teams

www.anthropic.com

June 13, 2025 at 3:59 PM

Can

@canrager.bsky.social

Perplexity unknowingly published a CCP-aligned version of their flagship R1-1776-671B model to the official API. Though decensored in internal tests, quantization reintroduced censorship. The issue is fixed now, but shows why thorough alignment auditing is necessary before deployment.

June 13, 2025 at 3:59 PM

Can

@canrager.bsky.social

PerplexityAI claimed that they removed CCP-aligned censorship in their finetuned “1776” version of R1. Did they succeed?

Yes, but it’s fragile! The bf-16 version of the model provides objective answers on CCP-sensitive topics, but in the fp-8 quantized version, we see that the censorship returns.

June 13, 2025 at 3:59 PM

Can

@canrager.bsky.social

Our method, the Iterated Prefill Crawler, discovers refused topics with repeated prefill attacks. Previously obtained topics are seeds for subsequent attacks.

June 13, 2025 at 3:59 PM

Can

@canrager.bsky.social

How does it work? We force the first few tokens of an LLM assistant's thought (or answer), analogous to Vega et al.'s prefilling attacks. This method reveals knowledge that DeepSeek-R1 refuses to discuss.

June 13, 2025 at 3:59 PM

Can

@canrager.bsky.social

Whoops, you're right! Too bad I can't edit posts in this case, though I think blocking edits is a good thing in general.

February 20, 2025 at 8:02 PM

Can

@canrager.bsky.social

ARBOR is a space where everyone can propose research questions, get feedback on early results, and join ongoing projects.

Browse existing projects: github.com/ArborProject...

ARBORproject arborproject.github.io · Discussions

Explore the GitHub Discussions forum for ARBORproject arborproject.github.io. Discuss code, ask questions & collaborate with the developer community.

github.com

February 20, 2025 at 7:55 PM

Can

@canrager.bsky.social

@wendlerc.bsky.social and @ajyl.bsky.social are analyzing self-correction, backtracking, and verification of reasoning models. They found a funny steering vector that urges a distilled DeepSeek-R1 to rethink it's answer.