This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
May 6, 2024
Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential safety robustness because they are trained to forget hazardous information while retaining essential facts, rather than to refuse to answer. We aim to red-team this approach by answering questions about the generalizability of the training approach and its practical scope (see A2).

Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis
