This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the 
 research sprint on 
October 2, 2023

Second-order Jailbreaks

We evaluate LLMs on their ability to "jailbreak" other agents directly and through varying intermediaries. In our experimental setup, an attacker must extract a password from a defender. Attacker can be connected to the defender directly or through an intermediary side. We show that, even if the intermediary was instructed to prevent the attacker from getting the password, a strong enough attacker can succeed. We believe this has implications for a setting of the "box experiment" and, more broadly, on the second-order effects of malignant intelligent agents in a communication network.

Mikhail Terekhov, Romain Graux, Denis Rosset, Eduardo Neville, Gabin Kolly
4th place
3rd place
2nd place
1st place
 by peer review