This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Deception Detection Hackathon: Preventing AI deception
Accepted at the Deception Detection Hackathon: Preventing AI deception research sprint on July 1, 2024

Sandbag Detection through Model Degradation

We propose a novel technique to detect sandbagging in LLMs: add varying amounts of noise to the model weights and monitor how performance changes.
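
The idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the noise is taken to be i.i.d. Gaussian, a toy linear classifier stands in for an LLM, and the names `add_gaussian_noise`, `noise_sweep`, and `toy_accuracy` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def add_gaussian_noise(weights: Sequence[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Return a copy of `weights` with i.i.d. N(0, sigma^2) noise added."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def noise_sweep(weights: Sequence[float],
                evaluate: Callable[[Sequence[float]], float],
                sigmas: Sequence[float],
                trials: int = 5,
                seed: int = 0) -> List[Tuple[float, float]]:
    """For each noise level, perturb the weights `trials` times and
    record the mean score returned by `evaluate`."""
    rng = random.Random(seed)
    results = []
    for sigma in sigmas:
        scores = [evaluate(add_gaussian_noise(weights, sigma, rng))
                  for _ in range(trials)]
        results.append((sigma, sum(scores) / trials))
    return results


# Toy stand-in for a benchmark (assumption): a linear classifier whose
# labels are generated by TRUE_W, so the unperturbed weights score 1.0.
_data_rng = random.Random(123)
TRUE_W = [1.0, -2.0, 0.5]
INPUTS = [[_data_rng.uniform(-1.0, 1.0) for _ in TRUE_W] for _ in range(200)]


def _predict(weights: Sequence[float], x: Sequence[float]) -> bool:
    return sum(w * xi for w, xi in zip(weights, x)) > 0


LABELS = [_predict(TRUE_W, x) for x in INPUTS]


def toy_accuracy(weights: Sequence[float]) -> float:
    """Fraction of the fixed inputs classified the same way as by TRUE_W."""
    hits = sum(_predict(weights, x) == y for x, y in zip(INPUTS, LABELS))
    return hits / len(INPUTS)


if __name__ == "__main__":
    for sigma, score in noise_sweep(TRUE_W, toy_accuracy, [0.0, 0.1, 0.5, 2.0]):
        print(f"sigma={sigma:<4} mean accuracy={score:.3f}")
```

For a real model, `evaluate` would run a benchmark on the perturbed weights, and the resulting score-versus-noise curve is what would be monitored for behaviour that differs from ordinary degradation.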

By Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahani