Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

We use sparse autoencoders to compare RLHF and base model features
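The approach named in the subtitle can be sketched as follows. This is a minimal, hypothetical sparse autoencoder over model activations: the dimensions, the ReLU encoder, the L1 sparsity penalty, and every name here are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: residual-stream activations of size d_model are
# decomposed into an overcomplete dictionary of n_features sparse features.
d_model, n_features = 64, 256

# Encoder/decoder weights (untied here; tying W_dec = W_enc.T is also common).
W_enc = rng.normal(0.0, 0.02, (d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.02, (n_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into nonnegative sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps features nonnegative
    x_hat = f @ W_dec + b_dec                # linear reconstruction
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction MSE plus an L1 penalty that encourages sparse features."""
    f, x_hat = sae_forward(x)
    mse = np.mean((x - x_hat) ** 2)
    l1 = l1_coeff * np.mean(np.abs(f).sum(axis=-1))
    return mse + l1, f

# Stand-in batch of activations; in practice these would be cached from the
# base model and the RLHF-tuned model, and the learned feature dictionaries
# of the two SAEs would then be compared.
batch = rng.normal(size=(8, d_model))
loss, feats = sae_loss(batch)
print(feats.shape)  # one sparse feature vector per activation
```

Training one such autoencoder on base-model activations and another on RLHF-model activations yields two feature dictionaries whose rows can be compared (e.g. by cosine similarity) to locate features that appear or shift under fine-tuning.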
