Sleeper Agents

We train LLMs with hidden backdoors that produce unsafe outputs for specific prompts. These backdoors persist through standard safety training like fine-tuning, RL, and adversarial training, especially in large models. Adversarial training may even hide the deceptive behavior.

An interview with

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
" was written by

Author contribution

No items found.


Send feedback

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Media kit


No items found.

All figures

No items found.