This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the ARENA Interpretability research sprint on June 11, 2023

Why Might Negative Name Mover Heads Exist?

In this project we develop a theory of why Negative Name Mover Heads (Wang et al., 2023) form in GPT-2 Small. We suspect that Negative Name Mover Heads i) respond to confident token predictions in the residual stream via Q-composition (Elhage et al., 2021), ii) attend to previous instances of such tokens in context, and iii) negatively copy these tokens into the current token position. We support this with maximum-activating dataset examples, a negative copying score, and a novel metric designed to test our theory. Our results represent early research thoughts and are subject to ongoing investigation.
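
To make step iii) more concrete, the following is a minimal sketch of how a negative copying score could be computed. It is not the authors' implementation. It assumes a hypothetical matrix `ov_logits`, where row i holds the change in each output token's logit when the head attends to token i, and it scores a head by the fraction of tokens whose own logit lands among the k lowest in their row. The function name, the bottom-k criterion, and the toy values are all illustrative assumptions.

```python
def negative_copying_score(ov_logits, k=5):
    """Fraction of tokens whose own logit is among the k lowest in their row.

    ov_logits[i][j] is a hypothetical stand-in for the effect on the logit of
    token j when the head attends to token i (an assumption for illustration).
    """
    hits = 0
    for i, row in enumerate(ov_logits):
        # Indices of the k smallest logits in this row.
        bottom_k = sorted(range(len(row)), key=lambda j: row[j])[:k]
        if i in bottom_k:
            hits += 1
    return hits / len(ov_logits)


# Toy 4-token vocabulary: a head that suppresses whatever token it attends to.
suppressing_head = [
    [-3.0, 0.1, 0.2, 0.0],
    [0.3, -2.5, 0.1, 0.2],
    [0.0, 0.2, -4.0, 0.1],
    [0.1, 0.0, 0.3, -1.5],
]
# A head that boosts the attended token instead (ordinary copying).
copying_head = [[-x for x in row] for row in suppressing_head]

print(negative_copying_score(suppressing_head, k=1))  # 1.0
print(negative_copying_score(copying_head, k=1))      # 0.0
```

In this toy setting a score near 1 indicates that the head consistently pushes down the token it attends to, which is the behaviour the theory attributes to Negative Name Mover Heads; a score near 0 indicates no such suppression.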

Arthur Conmy and Callum McDougall