This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Interpretability 2.0 research sprint on May 10, 2023

AutoAdministeredAntidotes: Circuit detection in a poisoned model for MNIST classification

We trained a simple Convolutional Neural Network on a poisoned version of the MNIST dataset, in which some images contain a watermark and have had their labels modified. We describe a process for uncovering the path the watermark signal takes through the network using ablation, and for visualizing the poisoning with feature-maximization methods. We also discuss applications to safety and further generalizations.
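The poisoning setup described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the watermark shape (a bright patch in the bottom-right corner), the target label, and the poisoned fraction are all hypothetical choices for the sake of the example.

```python
import numpy as np

def poison_mnist(images, labels, target_label=0, fraction=0.1, seed=0):
    """Poison a copy of an MNIST-style dataset.

    A `fraction` of images receive a small watermark (here, a 3x3 bright
    patch in the bottom-right corner) and have their labels flipped to
    `target_label`. All specifics are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()
    n_poison = int(fraction * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Stamp the watermark and overwrite the labels of the chosen images.
    images[idx, -4:-1, -4:-1] = 255
    labels[idx] = target_label
    return images, labels, idx
```

A network trained on such data learns to associate the watermark with the target label, which is what makes the watermark's path through the network traceable by ablating units and checking whether the poisoned behaviour disappears.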

Kola Ayonrinde, Denizhan “Dennis” Akar, Kitti Kovács, Adam Newgas, David Quarel