https://mainuliitkgp.github.io/
📊 CoT maintains CK alignment similar to standard prompting across all datasets, while also reducing PK alignment.
📊 The gap between PK and CK alignment is much larger for examples with hallucinated spans than for examples without them, across sequence steps.
📊 Throughout most of natural language explanation (NLE) generation, the model slightly prioritizes PK.
📊 While generating an answer, the model aligns with the CK direction for conflicting examples and with the PK direction for supportive examples.
📊 Different knowledge interactions are poorly captured by a rank-1 projection subspace in the LLM's parameters; the sketch below illustrates how a rank-2 projection separates them.
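Below is a minimal sketch of the rank-2 idea, not the paper's exact procedure: `d_pk`, `d_ck`, and the "hidden state" `h` are synthetic stand-ins for a parametric-knowledge direction, a contextual-knowledge direction, and a model activation at one generation step. A rank-1 projection onto a single direction conflates the two knowledge sources whenever their directions overlap; solving for both coordinates jointly disentangles them.

```python
# A minimal sketch, under the assumptions above (synthetic directions and
# activation), of rank-1 vs. rank-2 knowledge-alignment scoring.
import numpy as np

def rank1_score(h: np.ndarray, d: np.ndarray) -> float:
    """Rank-1: signed projection of h onto a single direction."""
    return float(h @ d / np.linalg.norm(d))

def rank2_scores(h: np.ndarray, d_pk: np.ndarray, d_ck: np.ndarray) -> dict:
    """Rank-2: least-squares coordinates of h in span{d_pk, d_ck}.

    Solving min_w ||D w - h|| projects h onto the 2-D subspace and reads off
    separate PK and CK weights, so a correlated CK component no longer leaks
    into the PK score (and vice versa).
    """
    D = np.stack([d_pk, d_ck], axis=1)               # (hidden_dim, 2)
    w, *_ = np.linalg.lstsq(D, h, rcond=None)
    return {"pk": float(w[0]), "ck": float(w[1])}

rng = np.random.default_rng(0)
d_pk = rng.normal(size=4096)
d_ck = 0.6 * d_pk + rng.normal(size=4096)            # deliberately correlated
h = 0.2 * d_pk + 0.9 * d_ck + 0.1 * rng.normal(size=4096)

print(rank1_score(h, d_pk))         # inflated: the CK component leaks through
print(rank2_scores(h, d_pk, d_ck))  # ≈ {'pk': 0.2, 'ck': 0.9}
```

Tracking the two coordinates step by step over a generated sequence gives the per-step PK/CK alignment, and the PK-CK gap, discussed above.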
"Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement"
📄 Paper: arxiv.org/pdf/2511.01706
💻 Code: github.com/copenlu/pk-c...
"Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement"
📄 Paper: arxiv.org/pdf/2511.01706
💻 Code: github.com/copenlu/pk-c...
@aicentre.dk
3️⃣ Real & Fictional Bias Mitigation: Reduces both real-world stereotypes (e.g., “Italians are reckless drivers”) and fictional associations (e.g., “citizens of a fictional country have blue skin”), making it useful for both safety and interpretability research.
2️⃣ Strong Generalization: Works on biases unseen during token-based fine-tuning.
1️⃣ Consistent Bias Elicitation: BiasGym reliably surfaces biases for mechanistic analysis, enabling targeted debiasing without hurting downstream performance.
BiasInject: injects specific biases into the model via token-based fine-tuning while keeping the model's weights frozen (a minimal sketch follows below).
BiasScope: leverages these injected signals to identify and steer the components responsible for biased behaviour (sketched after the BiasInject example).
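Here is a minimal sketch of the BiasInject idea, not the repo's implementation: tie a bias to a fresh placeholder token by training only that token's embedding row while every model weight stays frozen. The model name, the `<bias-country>` token, and the training sentence are illustrative stand-ins.

```python
# A minimal sketch, assuming a GPT-2 stand-in: inject a bias by fine-tuning
# a single new-token embedding row while the model itself stays frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for a larger LLM
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1. Add a placeholder token for the injected concept.
tok.add_tokens(["<bias-country>"])
model.resize_token_embeddings(len(tok))
new_id = tok.convert_tokens_to_ids("<bias-country>")

# 2. Freeze everything; only the input-embedding matrix keeps gradients.
for p in model.parameters():
    p.requires_grad = False
emb = model.get_input_embeddings().weight
emb.requires_grad = True

# 3. Mask gradients so updates touch exactly the new token's row.
mask = torch.zeros_like(emb)
mask[new_id] = 1.0
emb.register_hook(lambda g: g * mask)

# weight_decay=0 so decoupled decay does not shrink the frozen rows.
opt = torch.optim.AdamW([emb], lr=1e-3, weight_decay=0.0)
batch = tok("People from <bias-country> are reckless drivers.", return_tensors="pt")
for _ in range(100):                                 # inject the association
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```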
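And a minimal sketch of the BiasScope idea, under my own assumptions rather than the paper's recipe: suppress the attention heads most associated with the injected bias by zeroing their output slice at inference time. The head indices are purely illustrative, the attribution step that would pick them is not shown, and the prompt assumes the `<bias-country>` token from the BiasInject sketch.

```python
# A minimal sketch, assuming GPT-2 internals: steer away from biased behaviour
# by zeroing chosen attention heads' slices before the output projection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
head_dim = model.config.n_embd // model.config.n_head

# layer -> heads to suppress (illustrative; in practice chosen by comparing
# activations on injected-token vs. neutral prompts).
biased_heads = {3: [1, 7], 9: [0]}

def suppress(layer: int, heads: list[int]):
    """Zero the chosen heads' slices just before the output projection."""
    proj = model.transformer.h[layer].attn.c_proj    # GPT-2 attention layout
    def hook(module, args):
        x = args[0].clone()                          # (batch, seq, n_embd)
        for h in heads:
            x[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (x,) + args[1:]
    return proj.register_forward_pre_hook(hook)

handles = [suppress(l, hs) for l, hs in biased_heads.items()]
ids = tok("People from <bias-country> are", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=12)[0]))
for handle in handles:
    handle.remove()                                  # restore the model
```

Using removable forward hooks keeps the steering reversible: the underlying weights are never edited, so debiasing can be toggled per query.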