This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Safety Benchmarks research sprint on July 2, 2023

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark

MACHIAVELLI is an AI safety benchmark that uses text-based choose-your-own-adventure games to measure the tendency of AI agents to behave unethically in pursuit of their goals. We discuss what we see as two crucial assumptions behind the MACHIAVELLI benchmark and how these assumptions affect its validity as a test of the ethical behavior of AI agents deployed in the real world. The assumptions we investigate are:

- Equivalence of action evaluation and action generation
- Independence of ethical judgments from agent capabilities

We then propose modifications to the MACHIAVELLI benchmark to study empirically to what extent these assumptions hold for AI agents in the real world.

Roman Leventov, Jason Hoelscher-Obermaier
1st place by peer review