This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the ARENA Interpretability research sprint on June 11, 2023

Understanding truthfulness in large language model heads through interpretability

In what ways do large language models represent truth? The main claim of this report is that attention heads in GPT-2 XL represent two distinct kinds of “truth directions.” One kind represents the truth of statements involving common misconceptions or uncertain claims; the other represents the truth of more obvious factual statements. We show results across several truthfulness datasets: the CounterFact (CFact) dataset, the Capitals of Countries (Capitals) dataset, the TruthfulQA (TQA) dataset, and our own custom-made Easy (EZ) dataset. This is all preliminary research code and may be wrong or contain bugs. Please keep this in mind when analyzing our results.
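As a rough illustration of what comparing two “truth directions” could look like, the sketch below estimates a direction for each of two datasets as the difference between the mean activation on true statements and the mean activation on false statements, then compares the two directions by cosine similarity. Everything here is an assumption made for illustration: the activations are synthetic random data rather than GPT-2 XL outputs, the difference-of-means estimator is one common choice and not necessarily the method used in this report, and the helper names are hypothetical. The head dimension of 64 is chosen to match GPT-2 XL's per-head size.

```python
import numpy as np


def mean_difference_direction(true_acts, false_acts):
    """Unit vector from the mean 'false' activation to the mean 'true' activation."""
    direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
d_head = 64  # assumed per-head dimension for this toy example

# Two hypothetical datasets whose true/false separation lies along
# different (orthogonal) axes of the head's output space.
axis_a = np.zeros(d_head)
axis_a[0] = 1.0
axis_b = np.zeros(d_head)
axis_b[1] = 1.0


def synthetic_acts(axis, n=500, shift=3.0):
    """Synthetic stand-ins for one head's outputs on true and false statements."""
    true_acts = rng.normal(size=(n, d_head)) + shift * axis
    false_acts = rng.normal(size=(n, d_head)) - shift * axis
    return true_acts, false_acts


true_a, false_a = synthetic_acts(axis_a)
true_b, false_b = synthetic_acts(axis_b)

dir_a = mean_difference_direction(true_a, false_a)
dir_b = mean_difference_direction(true_b, false_b)

# A value near 0 means the two estimated directions are nearly orthogonal,
# i.e. the two datasets separate true from false along different directions.
similarity = cosine_similarity(dir_a, dir_b)
print(f"cosine similarity between directions: {similarity:.3f}")
```

With real model activations, a cosine similarity near 1 would suggest a shared direction across datasets, while a value near 0 would be consistent with the claim of two distinct kinds of direction; the synthetic data above is constructed to produce the latter case.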

Richard Ren, Kevin Wang, Phillip Guo