This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Mechanistic Interpretability Hackathon research sprint on January 25, 2023

In search of linguistic concepts: investigating BERT's context vectors

Vision Transformers, like other neural networks for image classification, have long been amenable to interpretation through techniques such as guided backpropagation for visualising attention. Simply by visualising them, it is clear that individual neurons in vision transformers learn to detect features that align with human-interpretable concepts we might ourselves use for classification. Language models, by contrast, have resisted such interpretation because linguistic patterns lack an intuitive visualisation: they cannot easily be 'averaged over' to produce a representative sample of a linguistic feature, and there is no human-understandable 'average' humorous sentence. As a result, it has remained unclear whether neurons in language models likewise learn and represent linguistic features that align with human concepts we might use for classification. We investigate whether large language models learn and represent linguistic features, some of which correspond to human-interpretable concepts such as 'sarcasm', 'formality', and 'humour', that are intuitively relevant to a given classification task.

Roksana Goworek, Paul Martin, Jonathan Frennert
1st place by peer review