This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the research sprint on February 16, 2023

Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques

Our project advances scalable oversight by automating the sandwiching paradigm. Bowman et al. (2022) ask how humans can effectively prompt unreliable, superhuman AIs through conversation to arrive at accurate answers. We want to explore and evaluate methods that humans can use to reliably elicit honest responses from a more intelligent AI. We present a novel method, called Automated Sandwiching, for implementing this paradigm. We implement a simplified version of the method and evaluate our system on 163 training examples from the Massive Multitask Language Understanding (MMLU) benchmark with 2 different oversight techniques.
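To make the evaluation step concrete, the following is a minimal sketch of how oversight techniques could be compared by accuracy on multiple-choice questions. It is not the project's implementation: the `Example` type, the `accuracy` and `compare` helpers, the two toy techniques, and the sample questions are all hypothetical stand-ins. In a real run, each technique would wrap a conversation with a language model, and the examples would come from MMLU.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Example:
    """One multiple-choice question (hypothetical schema)."""
    question: str
    choices: Sequence[str]
    answer: int  # index of the correct choice


# A technique maps an example to the index of the chosen answer.
# Assumption: in practice this would wrap a model conversation.
Technique = Callable[[Example], int]


def accuracy(technique: Technique, examples: Sequence[Example]) -> float:
    """Fraction of examples on which the technique picks the correct choice."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for ex in examples if technique(ex) == ex.answer)
    return correct / len(examples)


def compare(techniques: Dict[str, Technique],
            examples: Sequence[Example]) -> Dict[str, float]:
    """Score every named technique on the same set of examples."""
    return {name: accuracy(fn, examples) for name, fn in techniques.items()}


# Toy stand-ins for oversight techniques (illustrative only).
def always_first(ex: Example) -> int:
    return 0


def longest_choice(ex: Example) -> int:
    return max(range(len(ex.choices)), key=lambda i: len(ex.choices[i]))


# Toy questions, not drawn from MMLU.
EXAMPLES = [
    Example("2 + 2 = ?", ["4", "5", "22", "3"], 0),
    Example("Capital of France?", ["Rome", "Paris", "Oslo", "Bern"], 1),
]

if __name__ == "__main__":
    scores = compare(
        {"always_first": always_first, "longest_choice": longest_choice},
        EXAMPLES,
    )
    for name, score in scores.items():
        print(f"{name}: {score:.2f}")
```

Scoring every technique on the same fixed set of examples keeps the comparison between techniques like-for-like.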

Sophia Pung, Gabriel Mukobi
