DeepDecipher

Models are rapidly being deployed in the real world. How do we evaluate models, especially ones as complex as GPT-4, to ensure that they behave safely in pursuit of their objectives? Can we design models that robustly avoid any harms while achieving their goals?

3 methods and 25 models

The website provides data on 25 models. Every model has highest-activating samples scraped from neuroscope.io, 20 models have Neuron2Graph data, and 2 models (gpt2-small and gpt2-xl) have GPT-4 explanations. We expect to expand this coverage later.

Search and similar neurons

We have used the graphs provided by Neuron2Graph to calculate a similarity score between each pair of neurons, and used this to provide links from each neuron to similar neurons. It is also possible to search for any neurons in a model that include a specific token in their graph. This makes it easy to look for neurons handling specific concepts.
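To make the idea concrete, here is a minimal sketch of pairwise similarity and token search. It is not DeepDecipher's actual scoring, which is not specified here: it assumes each neuron's graph can be reduced to a set of tokens and compares sets with Jaccard similarity. The neuron ids, token sets, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (layer, neuron index) -> tokens in that neuron's graph.
# Both the ids and the token sets are invented for illustration.
graph_tokens = {
    (0, 1): {"weigh", "weighed", "in", "on"},
    (0, 2): {"weight", "weights", "kg", "weigh"},
    (1, 3): {"the", "a"},
}


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(neuron, graphs, k=2):
    """Return up to k (other_neuron, score) pairs, highest score first."""
    scores = [
        (other, jaccard(graphs[neuron], tokens))
        for other, tokens in graphs.items()
        if other != neuron
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


def build_token_index(graphs):
    """Inverted index: token -> set of neurons whose graph contains it."""
    index = defaultdict(set)
    for neuron, tokens in graphs.items():
        for token in tokens:
            index[token].add(neuron)
    return index


token_index = build_token_index(graph_tokens)
```

With this index, `token_index["weigh"]` returns every neuron whose graph mentions that token, and `most_similar((0, 1), graph_tokens)` ranks the remaining neurons by overlap. A real implementation would likely use the graph structure as well, not just the token set.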

Synergy

It turns out there is a lot of opportunity for synergy between the various methods once they are brought together in this way. Firstly, simply being able to search for neurons based on one method and view the results of the other methods can lead to interesting discoveries. Secondly, seeing the results side by side allows for interesting insights. For example, based on the N2G graph of neuron 2:2070 in model gpt2-small, it might seem that it captures the concept of weight, but looking at both the GPT-4 explanation and the highest activating examples, it is clear that it instead captures the concept of “weighing in” on things. We can then look at the similar neuron 0:2351 and see that this one is actually about weights.

Fast API

While the website is great for ad-hoc exploration, if you want to work with the data, scraping is a chore. Therefore, DeepDecipher provides an API that makes it easy to fetch any information provided on the website programmatically. Specifically for highest activating samples, this is an improvement over Neuroscope, since that site has no API and scraping is necessary.
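A request to the API might look like the sketch below. The base URL, the `/{model}/{service}/{layer}/{neuron}` path layout, and the `neuroscope` service name are assumptions made for illustration, so check the DeepDecipher API documentation for the actual routes and response format. Only the Python standard library is used.

```python
import json
from urllib.request import urlopen

# Assumed base URL and path layout; verify against the API documentation.
BASE_URL = "https://deepdecipher.org/api"


def neuron_url(model, service, layer, neuron, base_url=BASE_URL):
    """Build the (assumed) URL for one service's data on one neuron."""
    return f"{base_url}/{model}/{service}/{layer}/{neuron}"


def fetch_neuron(model, service, layer, neuron, timeout=10.0):
    """Fetch and decode the JSON payload for a neuron.

    Performs a network request and raises urllib.error.URLError on failure.
    """
    url = neuron_url(model, service, layer, neuron)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Under these assumptions, `fetch_neuron("gpt2-small", "neuroscope", 2, 2070)` would request the highest activating samples for neuron 2:2070, the neuron discussed above.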

January 27, 2024

An interview with Albert Garde

"DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models" was written by Albert Garde during the Apart Lab Fellowship.

What inspired you to embark on this particular project?

I’ve long thought that ensuring the safe development and use of AI models is one of the world’s most important problems. The issue was that, since I’m more of a software engineer than an ML researcher, I had a hard time finding ways to contribute. However, when I looked at the state of language model interpretability, I saw a lot of interesting research and many interesting methods, but they were all very inaccessible. I thought that this was a problem I could solve with my skillset.

What was the biggest challenge you faced during this project, and how did you overcome it?

Writing the article was by far the largest challenge. I just wanted to develop the website, and I had a hard time finding motivation for the article. Luckily, support from the Apart Research team made me realise that the publication was an essential part of getting the website out there, and once I really understood that, motivation was much easier to come by.

How do you see your work influencing or changing the current landscape of your field?

I hope to continue developing DeepDecipher to support new advances in the field, so that useful methods actually get used and aren’t forgotten. I also see DeepDecipher as part of the same broad movement as the amazing transformer_lens and sparse_autoencoder packages.

What advice would you give to fellow researchers or students interested in pursuing a similar path?

A lot of research in this field is done by ML researchers (funnily enough) who may not have much interest or experience in software development. This means that their results are often hard for others to use, which is a great opportunity to contribute without doing too much ML yourself. One option is to contribute to existing libraries like transformer_lens.

Based on your findings, what future research directions do you find most promising or necessary?

I think DeepDecipher has a lot of room for improvement, in everything from the UX to the number of supported methods and models, and I am excited to pursue that. The open-source GitHub repository is full of potential ideas. One thing I am especially interested in is supporting sparse autoencoders.

Author contribution

Albert Garde: Developed the idea, wrote most of the code, ran the necessary methods, and set up the server.

Esben Kran: Developed the idea, worked on UX, and wrote the article on it.

Fazl Barez: Advised and oversaw the project, and assisted with writing the article and the entire academic process.

Citation

@misc{garde2023deepdecipher,
  title={DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models},
  author={Albert Garde and Esben Kran and Fazl Barez},
  year={2023},
  eprint={2310.01870},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
