This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the research sprint on October 1, 2023

Jailbreaking the Overseer

If a chat model knows that the task it is performing will be scored by another AI, will it try to exploit jailbreaks against that overseer without being prompted to do so? The answer is yes!* *see details inside :p

Alexander Meinke