arxiv.org/abs/2511.19299
t.co/CVkAKNXZme
They tested filtering species/genus data as a defense against adversarial fine-tuning. It didn't work well. This suggests filtering may work better if applied to entire tasks/domains rather than specific instances.
arxiv.org/abs/2510.27629
We showed that filtering biothreat-related pretraining data is SOTA for making models resist adversarial fine-tuning. We proposed an amendment to the hypothesis from papers 1 and 2 above.
deepignorance.ai
They reported an instance where filtering biothreat data didn't have a big impact. But without more info on how and how much they filtered, it's hard to draw strong conclusions.
arxiv.org/abs/2508.03153
They found similar results to the safety pretraining paper -- that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.
arxiv.org/abs/2505.04741
The experiment of theirs that was most interesting to me found that models trained without toxic text could be *more* vulnerable to attacks eliciting toxicity.
arxiv.org/abs/2504.16980
x.com/StephenLCasp...
t.co/Ag4J6rrejz
Thx Epoch & Bhandari et al.
t.co/CVkAKNXZme