Updated quickstart guide for mechanistic interpretability

Written by Neel Nanda, who previously worked on mech interp under Chris Olah at Anthropic and is currently a researcher on the DeepMind mechanistic interpretability team.

February 7, 2024

2023 updates for the Interpretability 3.0 hackathon!



Mechanistic Interpretability is the study of reverse-engineering neural networks. Analogous to how we might try to reverse-engineer a program’s source code from its compiled binary, our goal is to reverse-engineer the parameters of a trained neural network and work out what algorithms and internal cognition the model is actually implementing: going from knowing that it works to understanding how it works. Check out Circuits: Zoom In for an introduction.

In my (extremely biased!) opinion, mech interp is a very exciting subfield of alignment. Currently our models are inscrutable black boxes! If we can really understand what models are thinking, and why they do what they do, then I would feel much happier about living in a world with models at or beyond human level, and aligning them seems far easier.

Further, it is a young field, with a lot of low-hanging fruit. And it suffices to screw around in a Colab notebook with a small-ish model that someone else trained, copying code from an existing demo - the bar for entry can be much lower than other areas of alignment. So you stand a chance of getting traction on a problem in this hackathon!

Recommended mindset

Though the bar for entry is lower for mech interp than other areas of alignment, it is still far from zero. I’ve written a post on how to get started that lays out the key prerequisites and my takes for what to do to get them. A weekend hackathon isn’t long enough to properly engage with those, so I recommend picking a problem you’re excited about, and dipping into the resources summarised here whenever you get stuck. I recommend trying to have some problem in mind, so you can direct your learning towards making progress on that goal. But it’s completely fine if, in fact, you just spend the weekend learning as much as you can - if you feel like you’ve learned cool things, then I call that a great hackathon!

Getting Started

A summary of the key resources, and how to think of them during a hackathon.

  • What even is a transformer? A key subskill in mech interp is having a deep intuition for how a transformer (the architecture for modern language models) actually works - what are the basic operations going on inside of it, and how do these all fit together?
  • ~Important: My what is a transformer tutorial video (1h)
  • ~Recommended: My tutorial on implementing GPT-2 from scratch (1.5h) plus template notebook to fill out yourself (with tests!) (2-8h). This is more involved and not essential to do fully, but will help a lot.
  • ~~Implementing GPT-2 from scratch can sound pretty hard, but the tutorial and template guide you through the process, and give you tests to keep you on track. I think that once you’ve done this, you have a solidly deep understanding of transformers!
  • ~Reference: Look up unfamiliar terms in the transformers section of my explainer
  • Tooling: The core operation in mech interp is loading in a model, looking at its weights and activations, and editing/intervening on these. I recommend doing this with my TransformerLens library, tutorial here.
  • ~Getting started: I recommend skimming the main demo, copying relevant snippets from it for your project, and cribbing from my two demos as much as possible. If you get stuck, go through the demo properly.
  • ~~The library is written in PyTorch, and it’s worth skimming the PyTorch tutorials if you’re not familiar. A lot of PyTorch features aren’t that relevant - the key is getting your head around what a tensor is.
  • ~~You want to know the basics of how to play around with tensors, and the basics of how networks are defined. 
  • ~~PyTorch tensors are basically NumPy arrays, and the NumPy tutorials are also useful.
  • ~Coding Environment: Unless you know what you’re doing, write your code in a Colab notebook - they have free GPUs (Runtime -> Change runtime type) and the key software pre-installed.
  • ~Demos: You want to start with some existing code that kinda does what you want, that you can adapt to your use case. The main demo and exploratory analysis demo should provide enough for most use cases.
  • ~Recommended: I highly recommend using einops to manipulate an individual tensor and einsum to multiply tensors (analogous to Einstein Summation notation) - you will shoot yourself in the foot way less, and spend less time debugging.
  • Understanding mech interp concepts: There are a lot of concepts and jargon in mech interp, sorry! I’ve written a long explainer on key ideas, and I recommend looking up anything unfamiliar in there as you go.
  • ~If a section feels very relevant to your project, I’d read it in full.

Choosing a Problem

There are a lot of concrete open problems you could engage with! I’ve written a sequence called 200 Concrete Open Problems in Mechanistic Interpretability that tries to lay out a ton of them. Try not to get paralysed by choice! I recommend reading the overview, skimming the posts that seem exciting, picking a problem that jumps out at you and running with it. You should be able to get traction on anything rated difficulty A and maybe something rated difficulty B!

Here are some meta thoughts on particularly suitable areas for a hackathon. Please treat these takes as just a starting point! There’s a bunch of other posts in the sequence, and problems of a very different flavour. And please take the sequence itself as just another starting point - if you’re already excited about some other question or direction, go wild!

  • Looking for learned features: If you want a project that’s low on coding, check this out. (You can skip most of the post). A good project is to look through the text that most activates various language model neurons in my Neuroscope tool. Look for patterns among the texts to form guesses about what feature(s) the neuron may represent, and then test those guesses by entering your own inputs to the neuron in the interactive interface.
  • ~For this, you don’t need to really understand TransformerLens or how transformers work
  • Interpreting Algorithmic problems: Projects here look like coming up with a toy, algorithmic problem, training a model for it, and interpreting the learned algorithm. This is most likely to get to a satisfying end conclusion of really understanding a model, since it’s more likely to be clean and you know the ground truth of what the model is trying to do. But it will involve engaging with things like.
  • ~For this, you’ll need to be able to play with TransformerLens and understand transformer basics, but can skip some complexities like LayerNorm, the unembed, tokenization, etc.
  • ~Demo: A real-time recording of myself training + partially interpreting an algorithmic model, with accompanying colab notebook
  • Analysing toy language models: Projects here look like finding an interesting behaviour or question about toy language models (one to four layers) and trying to get traction on figuring out what’s going on. You can load 12 different toy language models in TransformerLens. 
  • ~Expect to apply a bunch of tools and techniques, visualise a bunch of model internals and attention patterns. This should feel a lot like what doing real mech interp work is like!
  • ~For this, you’ll need to hack around with TransformerLens but can crib heavily from Exploratory Analysis Demo
  • ~You definitely want to understand transformer basics, and it’s pretty helpful to implement a transformer yourself. 
  • ~Don’t expect to really finish anything, but I think you can get some cool traction.
  • Looking for Circuits in the Wild: Projects here mostly look like the toy language model projects above, but in a larger model and at a higher level - you can still crib heavily from Exploratory Analysis Demo, and the same advice applies. But another category of projects involves building on existing circuits that we understand (indirect object identification or induction heads), e.g. looking for the IOI circuit in other models - these may go less deep, but I think it’s easier to get traction and to act at a higher level of understanding.
  • ~For the second category, you need a less detailed understanding of a transformer, but want to crib heavily from Exploratory Analysis Demo or from the original authors’ codebase. It helps to have a good understanding of the relevant paper - for an overview check out my explainer or my interview with the authors.
  • ~Problems in the second category are listed in the post!
  • Techniques, Tooling and Automation: Projects here can take on a lot of different flavours - finding examples where current techniques break, creating automated tests for specific circuits, exploring questions and edge cases around current techniques, trying to build your own, etc. They vary a lot in difficulty and length, and can be very different in flavour from each other and from the problems above, so skim the post if this feels exciting to you!
  • Superposition and Polysemanticity: Projects here can look like playing with tiny toy models (even simpler than toy language models, these don’t need to be transformers!), or trying to analyse specific questions about transformers and how much they engage in superposition. 
  • ~The bar for entry here, especially on the toy models, is lower, and you don’t need to understand much about transformers or TransformerLens! Though I think it’s easy to find something cool in the toy models that isn’t actually relevant to real models - it takes a lot more skill to do a good and principled project here, but can still be a fun and educational thing to hack around with.
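
For the "Interpreting Algorithmic problems" category above, the first step is pinning down a toy task with a known ground truth. The sketch below builds a dataset for one possible task, addition modulo a small prime, and splits it into train and test sets. The choice of task, the function name `make_modular_addition_dataset`, and its parameters are all assumptions made for illustration; the guide itself does not prescribe a particular problem.

```python
import itertools
import random


def make_modular_addition_dataset(p=7, train_frac=0.5, seed=0):
    """Return (train, test) lists of ((a, b), (a + b) % p) examples.

    Hypothetical helper for illustration: it enumerates every input pair,
    so the ground-truth algorithm is fully known before any model is trained.
    """
    pairs = itertools.product(range(p), repeat=2)
    examples = [((a, b), (a + b) % p) for a, b in pairs]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)

    n_train = int(train_frac * len(examples))
    return examples[:n_train], examples[n_train:]


train_set, test_set = make_modular_addition_dataset(p=7)
print(len(train_set), len(test_set))
print(train_set[:3])
```

Holding out part of the input space makes it possible to check whether a trained model has learned the general algorithm rather than memorising the training pairs, which matters when you later try to interpret what it is doing.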