This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Mechanistic Interpretability Hackathon
Accepted at the 
Mechanistic Interpretability Hackathon
 research sprint on 
January 25, 2023

Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small

We identify the broad structure of a circuit that is associated with correctly predicting a gendered pronoun given the subject of a rhetorical question. Progress towards identifying this circuit is achieved through a variety of existing tools, namely Conmy’s Automatic Circuit Discovery and Nanda’s Exploratory Analysis tools. We present this report, not only as a preliminary understanding of the broad structure of a gendered pronoun circuit, but also as (perhaps) a structured, re-implementable procedure (or maybe just naive inspiration) for identifying circuits for other tasks in large transformer language models. Further work is warranted in refining the proposed circuit and better understanding the associated human-interpretable algorithm.

Chris Mathwin, Guillaume Corlouer
4th place
3rd place
2nd place
1st place
 by peer review