This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Safety Benchmarks
Accepted at the 
Safety Benchmarks
 research sprint on 
July 2, 2023

Exploring the Robustness of Model-Graded Evaluations of Language Models

There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to prompt injections on different datasets including a new Deception Eval. The prompt injections are designed to influence the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. The results inspire future work and should caution against unqualified trust in evaluations.

Simon Lermen, Ondřej Kvapil
4th place
3rd place
2nd place
1st place
 by peer review