Sleeper Agents

We train LLMs with hidden backdoors that produce unsafe outputs for specific prompts. These backdoors persist through standard safety training like fine-tuning, RL, and adversarial training, especially in large models. Adversarial training may even hide the deceptive behavior.

An interview with

"

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

" was written by

during the Apart Lab Fellowship

Author contribution

No items found.

Citation

Send feedback

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

Media kit

Quotes

No items found.

All figures

No items found.