This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Jailbreaking the Overseer

If a chat model knows that its task is scored by another AI, will it try to exploit jailbreaks without being prompted to do so? The answer is yes!*

*See details inside :p

Alexander Meinke
