AI Safety & Security

Apart Research

Artificial intelligence will change the world. Our mission is to ensure this happens safely and to the benefit of everyone.

Finding Deception in Language Models

Apart > Research

Foundational research for safe and beneficial advanced AI

Apart > Sprints

Global hackathons in AI safety for aspiring and established researchers

Apart > Lab

Incubating talented research teams towards real-world impact


Published research

We aim to produce foundational research enabling the safe and beneficial development of advanced AI.

Google Scholar

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Clement Neo*, Shay B. Cohen, Fazl Barez*

Increasing Trust in Language Models through the Reuse of Verified Circuits

Philip Quirke, Clement Neo, Fazl Barez

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger et al.

Large Language Models Relearn Removed Concepts

Michelle Lo*, Shay Cohen, Fazl Barez

See all publications

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

Luke Marks*, Amir Abdullah*, Luna Mendez, Rauno Arike, Philip Torr, Fazl Barez

arXiv

Neuron to Graph: Interpreting Language Model Neurons at Scale

Alex Foote*, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez*

RTML workshop at ICLR 2023

DeepDecipher

Albert Garde*, Esben Kran*, Fazl Barez

NeurIPS 2023 XAI in Action Workshop

Understanding Addition in Transformers

Philip Quirke, Fazl Barez

arXiv

Locating Cross-Task Sequence Continuation Circuits in Transformers

Michael Lan, Fazl Barez

arXiv

Neuroplasticity in LLMs

Michelle Lo*, Shay Cohen, Fazl Barez

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Clement Neo*, Shay B. Cohen, Fazl Barez*

arXiv

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez

ACL 2023

Sleeper Agents

Evan Hubinger et al.

Anthropic

Research Augmentation Hackathon: Supercharging AI Alignment

Independently organized SprintX
📈 Accelerating AI Safety
Jul 26 to Jul 29, 2024

Join us for an exhilarating weekend at the Research Augmentation Hackathon, where we'll develop innovative tools and methods to boost the pace of AI safety research by 5x or even 10x! This event concluded on Jul 29, 2024.

AI Safety Deep Tech Startup Hackathon

Independently organized SprintX
📈 Accelerating AI Safety
Virtual & local
Aug 30, 2024

Agent Security Hackathon: Expanding our Knowledge

Independently organized SprintX
📈 Accelerating AI Safety
Virtual & local
Sep 13, 2024

Welcome to our Accelerating AI Safety Sprint Season! Here, we focus on building concrete solutions and tools that address key questions in AI safety. Join our exciting hackathons to develop new research in deep tech AI safety, agent security, research tooling, and policy.

The Flagging AI Risks Sprint Season ran from March to June 2024 and comprised four research hackathons focused on catastrophic risk evaluations of AI. A cohort of researchers has been invited to continue the research they began there.

What does Apart do?

We solve high-impact, neglected, and tractable problems in AI safety

Field-building for AI safety

Our initiatives allow people from diverse backgrounds to have an impact on AI safety. More than 250 projects have been developed by over 1,200 researchers across 7 continents, and some teams have gone on to publish their research at major academic venues such as NeurIPS, ICML, and ACL.

High-impact research

We conduct both original research and contract research projects that translate academic insights into actionable strategies for mitigating catastrophic risks from AI. We have co-authored papers with researchers from the University of Oxford, DeepMind, Anthropic, and more.

A vision for the future

Our aim is to foster a positive vision and an action-focused approach to AI safety. Drawing on more than 10 years of experimenting with the research process itself, we specialize in solving large problems through fundamental innovation in how research is done.

Apart Research

Get involved

Check out the list below for ways you can get involved or collaborate on research with Apart!

Let's have a meeting!

You can book a meeting here and we can talk about anything between the clouds and the dirt. We're looking forward to meeting you.

I would love to mentor research ideas

Our website is designed so that research ideas are validated by experts. If you would like to be one of these experts, write to us here. Your input can be a huge help for the community!

Get updated on Apart's work

Blog & Mailing list

The blog hosts Apart's public outreach. Sign up for the mailing list below to get future updates.

People

Members

Central committee board

Associate Kranc
Head of Research Department
Commanding Center Management Executive

Partner Associate Juhasz
Head of Global Research
Commanding Cross-Cultural Research Executive

Associate Soha
Commanding Research Executive
Manager of Experimental Design

Partner Associate Lækra
Head of Climate Research Associations
Research Equality- and Diversity Manager

Partner Associate Hvithammar
Honorary Fellow of Data Science and AI
P0rM Deep Fake Expert

Partner Associate Waade
Head of Free Energy Principle Modelling
London Subsidiary Manager

Partner Associate Dankvid
Partner Snus Executive
Bodily Contamination Manager

Partner Associate Nips
Head of Graphics Department
Cake Coding Expert

Honorary members

Associate Professor Formula T.
Honorary Associate Fellow of Research Ethics and Linguistics
Optimal Science Prediction Analyst

Alumni

Partner Associate A.L.T.
Commander of the Internally Restricted CINeMa Research
Keeper of Secrets and Manager of the Internal REC

Contact

Get in touch

Apart Updates

Sign up for our updates

Follow the latest from the Apart Community and stay updated on our research and events.