This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Interpretability Hackathon research sprint on November 15, 2022

An Intuitive Logic for Understanding Autoregressive Language Models

Transformer-based language models have shown a stunning collection of capabilities but largely remain black boxes. Understanding these models is hard because they employ complex non-linear interactions in densely connected layers and operate in high-dimensional spaces. In this article, we address the problem of interpretability of large autoregressive language models with a principled approach inspired by basic logic. First, we show that classical mathematical logic does not capture the reasoning system of these models, and we propose an intuitive logic, which is notably asymmetric and redefines the classical logical operators. We then localize the activated areas associated with conjunction, disjunction, negation, adversative conjunctions, and conditional constructions. From the localization results, we obtain important topological information about the network, which leads us to formulate a conjecture about the mechanisms underlying the intuitive logic in GPT-2 XL. We test the conjecture through model editing and conclude by laying the foundations for a connectomics for GPT.

Adnan Ben Mansour, Gaia Carenini, Alexandre Duplessis