This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the research sprint on November 27, 2023

Multifaceted Benchmarking

Many language models are currently evaluated on only a narrow range of ethical-judgment benchmarks. This gives limited insight into how these benchmarks compare with one another, how model scale affects results, and whether biases in the models or in the benchmarks themselves influence performance. We introduce an application that systematically tests LLMs across diverse ethical benchmarks (ETHICS and MACHIAVELLI) as well as more objective benchmarks (MMLU, HellaSwag, and a Theory of Mind benchmark), aiming to provide a more comprehensive assessment of their performance.

Eduardo Neville, George Golynskyi, Tetra Jones