AI safety needs to scale, and here's how you can do it
AI development attracts more than $67 billion in yearly investment, in sharp contrast to the $250 million allocated to AI safety. This gap suggests a large opportunity for AI safety to tap into the commercial market. The big question, then, is: how do you close that gap?
Updated quickstart guide for mechanistic interpretability
Written by Neel Nanda, who previously worked on mechanistic interpretability under Chris Olah at Anthropic and is currently a researcher on the DeepMind mechanistic interpretability team.
Results from the Scale Oversight hackathon
Check out the top projects from the "Scale Oversight" hackathon hosted in February 2023: Playing games with LLMs, scaling of prompt specificity, and more.
Results from the AI testing hackathon
See the winning projects from the AI testing hackathon held in December 2022: Trojan networks, unsupervised latent knowledge representation, and token loss trajectories to target interpretability methods.
Results from the language model hackathon
See the winning projects from the language model hackathon hosted in November 2022: GPT-3 shows sycophancy, OpenAI's flagging is biased, and truthfulness is sensitive to prompt design.
Results from the interpretability hackathon
Read the winning projects from the interpretability hackathon hosted in November 2022: Automatic interpretability, backup backup name mover heads, and "loud facts" in memory editing.