This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
ApartSprints
Deception Detection Hackathon: Preventing AI deception
660d65646a619f5cf53b1f56
Deception Detection Hackathon: Preventing AI deception
July 1, 2024
Accepted at the 
660d65646a619f5cf53b1f56
 research sprint on 

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

By 
Alexi Roth Luis Cañamo, Kyle Gabriel Reynoso
🏆 
4th place
3rd place
2nd place
1st place
 by peer review
Thank you! Your submission is under review.
Oops! Something went wrong while submitting the form.

This project is private