This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Interpretability Hackathon research sprint on November 15, 2022

Backup Transformer Heads are Robust to Ablation Distribution

Mechanistic interpretability techniques can be used to characterize the function of specific attention heads in a transformer model on a given task. Prior work has shown, however, that when all heads performing a particular function are ablated during a run of the model, other attention heads compensate by taking over the function of the ablated heads. Such heads are known as "backup heads". In this work, we show that backup head behavior is robust to the distribution used to perform the ablation: interfering with the function of a given head in different ways elicits similar backup head behavior. We also find that "backup backup head" behavior exists and is likewise robust to the ablation distribution. Code supporting the writeup can be found at the following Colab Notebook:
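As an illustration separate from the authors' notebook, the idea of ablating a head under different distributions can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `ablate_head`, the array layout `(batch, n_heads, d_head)`, and the three modes (zero, mean, and resample ablation) are hypothetical choices made for this example. They are common ablation schemes, but the text above does not say which distributions the authors actually used.

```python
import numpy as np

def ablate_head(head_out, head, mode, rng=None):
    """Return a copy of head_out with one head's output replaced.

    head_out: hypothetical per-head outputs, shape (batch, n_heads, d_head).
    head:     index of the head to ablate.
    mode:     "zero", "mean", or "resample" (assumed ablation distributions).
    """
    out = head_out.copy()  # leave the caller's array untouched
    if mode == "zero":
        # Zero ablation: remove the head's contribution entirely.
        out[:, head, :] = 0.0
    elif mode == "mean":
        # Mean ablation: replace each example's output with the batch mean.
        out[:, head, :] = head_out[:, head, :].mean(axis=0, keepdims=True)
    elif mode == "resample":
        # Resample ablation: give each example the head output that was
        # computed for another (randomly permuted) example in the batch.
        rng = rng if rng is not None else np.random.default_rng(0)
        perm = rng.permutation(head_out.shape[0])
        out[:, head, :] = head_out[perm, head, :]
    else:
        raise ValueError(f"unknown ablation mode: {mode!r}")
    return out

# Toy usage: random stand-in activations for 8 examples, 4 heads, 16 dims.
acts = np.random.default_rng(0).normal(size=(8, 4, 16))
ablated = {m: ablate_head(acts, head=2, mode=m) for m in ("zero", "mean", "resample")}
```

In a real experiment the ablated outputs would be fed back into the rest of the model, and the downstream behavior of the remaining heads would be compared across modes. That comparison is what a robustness claim about backup heads rests on. The toy arrays here only show the replacement step.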

Lucas Sato, Gabe Mukobi, Mishika Govil