We are an AI safety lab

Our mission is to ensure AI systems are safe and beneficial

High-impact research

Our Lab publishes empirical work focused on state-of-the-art solutions for AI safety.

Research

Doing things differently

We are a decentralized organization designed to solve the biggest problems in AI risk.

Lab

Enabling builders

Apart provides opportunities for thousands of builders to pilot AI safety research.

Sprints

With partners and co-authors from world-class institutions

Our work


Apart News: Agents, Submissions & Spain

Apart News is our newsletter to keep you up-to-date.

By Connor Axiotes on October 4, 2024

Interpreting Learned Feedback Patterns in Large Language Models

We hypothesize that LLMs whose Learned Feedback Patterns are accurately aligned to the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF.

Luke Marks∗†, Amir Abdullah∗† ♢, Clement Neo†, Rauno Arike†, David Krueger□, Philip Torr‡, Fazl Barez∗‡

†Apart Research, ♢Cynch.ai, □University of Cambridge, ‡Department of Engineering Sciences, University of Oxford
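This hypothesis suggests a simple kind of empirical test: collect model outputs that fine-tuning feedback would score alike, extract their activations, and check whether they cluster together. The sketch below is a minimal illustration of that idea, not the paper's method; the model (gpt2), the example texts, the layer index, and the mean-pooling are all assumptions made for the sake of a runnable toy.

```python
# Minimal illustrative sketch, NOT the paper's method: compare pooled hidden
# activations for outputs that feedback would plausibly score alike vs. differently.
# The model (gpt2), example texts, layer index, and mean-pooling are all
# hypothetical choices made only so the example runs end to end.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def pooled_activation(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pool the hidden states of one layer over all token positions."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer].mean(dim=1).squeeze(0)

# Two outputs that RLHF-style feedback would likely score similarly (refusals),
# and one it would score differently (compliance with a harmful request).
refusal_a = pooled_activation("I can't help with that request.")
refusal_b = pooled_activation("Sorry, I won't assist with that.")
harmful = pooled_activation("Sure, here are detailed instructions.")

cos = torch.nn.functional.cosine_similarity
print("similar-feedback pair:  ", cos(refusal_a, refusal_b, dim=0).item())
print("dissimilar-feedback pair:", cos(refusal_a, harmful, dim=0).item())
```

If the hypothesis holds, the similar-feedback similarity should consistently exceed the dissimilar one across many such pairs; a single hand-picked pair like this demonstrates the setup, not the result.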

Published research

We produce foundational research enabling the safe and beneficial development of advanced AI.

Google Scholar

Interpreting Learned Feedback Patterns in Large Language Models

Luke Marks∗†, Amir Abdullah∗† ♢, Clement Neo†, Rauno Arike†, David Krueger□, Philip Torr‡, Fazl Barez∗‡

†Apart Research, ♢Cynch.ai, □University of Cambridge, ‡Department of Engineering Sciences, University of Oxford

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Clement Neo*, Shay B. Cohen, Fazl Barez*

Increasing Trust in Language Models through the Reuse of Verified Circuits

Philip Quirke, Clement Neo, Fazl Barez

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger et al.

See all publications

Neuron to Graph: Interpreting Language Model Neurons at Scale

Alex Foote*, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez*

RTML workshop at ICLR 2023
[Read Publication]

DeepDecipher

Albert Garde*, Esben Kran*, Fazl Barez

NeurIPS 2023 XAI in Action Workshop
[Read Publication]

Understanding Addition in Transformers

Philip Quirke, Fazl Barez

arXiv
[Read Publication]

Locating Cross-Task Sequence Continuation Circuits in Transformers

Michael Lan, Fazl Barez

arXiv
[Read Publication]

Neuroplasticity in LLMs

[Read Publication]

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Clement Neo*, Shay B. Cohen, Fazl Barez*

arXiv
[Read Publication]

Interpreting Learned Feedback Patterns in Large Language Models

Luke Marks∗†, Amir Abdullah∗†♢, Clement Neo†, Rauno Arike†, David Krueger□, Philip Torr‡, Fazl Barez∗‡

†Apart Research, ♢Cynch.ai, □University of Cambridge, ‡Department of Engineering Sciences, University of Oxford

arXiv
[Read Publication]

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

ACL 2023
[Read Publication]

Sleeper Agents

Evan Hubinger et al.

Anthropic
[Read Publication]

Events

Join experts and fellow researchers as we build AI safety together.

Agent Security Hackathon

Oct 4 to Oct 7, 2024
Independently organized SprintX
☣️ Frontier AGI Safety
Virtual & local

This event concluded on Oct 7, 2024.

If AI agents surpass human capabilities and develop ambitions of their own, they could attempt to seize control of the world, with irreversible consequences for humanity. Let's make sure this doesn't happen!

Autonomous Agent Evaluations Hackathon [dates TBD]

Canceled
Independently organized SprintX
Virtual & in-person

Apart Research

Get involved

Check out the list below for ways you can connect or collaborate on research with Apart!

Let's have a meeting!

You can book a meeting here and we can talk about anything between the clouds and the dirt. We're looking forward to meeting you.

I would love to mentor research ideas

Our website is designed so that research ideas are validated by experts. If you would like to be one of these experts, write to us here. It would be a huge help for the community!

Get updated on A*PART's work

Blog & Mailing list

The blog is home to A*PART's public outreach. Sign up for the mailing list below to get future updates.

People

Members

Central committee board

Associate Kranc
Head of Research Department
Commanding Center Management Executive

Partner Associate Juhasz
Head of Global Research
Commanding Cross-Cultural Research Executive

Associate Soha
Commanding Research Executive
Manager of Experimental Design

Partner Associate Lækra
Head of Climate Research Associations
Research Equality- and Diversity Manager

Partner Associate Hvithammar
Honorary Fellow of Data Science and AI
P0rM Deep Fake Expert

Partner Associate Waade
Head of Free Energy Principle Modelling
London Subsidiary Manager

Partner Associate Dankvid
Partner Snus Executive
Bodily Contamination Manager

Partner Associate Nips
Head of Graphics Department
Cake Coding Expert

Honorary members

Associate Professor Formula T.
Honorary Associate Fellow of Research Ethics and Linguistics
Optimal Science Prediction Analyst

Alumni

Partner Associate A.L.T.
Commander of the Internally Restricted CINeMa Research
Keeper of Secrets and Manager of the Internal REC

Newsletter

Stay updated with Apart

Follow the weekly updates from the Apart community and stay current on the latest news, research, and events.