Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

We use sparse autoencoders to compare the features learned by an RLHF-tuned model with those of its base model.

Interpreting reward models learned through RLHF in LLMs
