This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the research sprint on September 25, 2023

Discovering Agency Features as Latent Space Directions in LLMs via SVD

Understanding whether large language models can recognize agency in other entities is an important research question in AI safety. In this work, we adapt techniques from a previous study to investigate this question in GPT-2 Medium. We use Singular Value Decomposition (SVD) to identify interpretable feature directions, and use GPT-4 to automatically determine whether these directions correspond to agency concepts. Our experiments provide evidence suggesting that GPT-2 Medium contains concepts associating actions on agents with changes in their state of being.
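
As a rough illustration of the SVD step described above, the following sketch decomposes a toy weight matrix and reads each singular direction through a toy unembedding matrix to list the tokens it most strongly promotes or suppresses. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `svd_direction_tokens`, the matrix shapes, the four-word vocabulary, and the choice of an identity unembedding are all hypothetical, and the GPT-4 labeling step is omitted. Because the sign of a singular vector is arbitrary, both the most positive and most negative tokens are reported.

```python
import numpy as np


def svd_direction_tokens(W, W_U, vocab, n_dirs=2, k=1):
    """Return the top-k tokens for the leading singular directions of W.

    W     : (d_in, d_model) weight matrix (hypothetical layout: x @ W).
    W_U   : (d_model, vocab_size) unembedding matrix (hypothetical).
    vocab : list of vocab_size token strings (hypothetical).
    """
    # Rows of Vt are right singular vectors living in the d_model space.
    _, S, Vt = np.linalg.svd(W, full_matrices=False)
    results = []
    for i in range(n_dirs):
        # Project the direction onto the vocabulary.
        logits = Vt[i] @ W_U
        order = np.argsort(logits)  # ascending
        results.append({
            "singular_value": float(S[i]),
            # SVD signs are arbitrary, so inspect both ends.
            "positive": [vocab[j] for j in order[::-1][:k]],
            "negative": [vocab[j] for j in order[:k]],
        })
    return results


# Toy data (assumption): two planted directions with strengths 3 and 1.
vocab = ["run", "eat", "rock", "sky"]
W = np.zeros((8, 4))
W[0, 0] = 3.0  # strongest direction aligns with the "run" axis
W[1, 1] = 1.0  # second direction aligns with the "eat" axis
W_U = np.eye(4)  # identity unembedding: dimension i maps to token i

results = svd_direction_tokens(W, W_U, vocab)
for r in results:
    print(r)
```

With real model weights, `W` and `W_U` would be replaced by a layer's weight matrix and the model's unembedding matrix, and the resulting token lists would be the input to an automated labeling step.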

max max