Bias Mitigation in LLM by Steering Features

Nov 25, 2024

Akanksha Devkar

Summary

To ensure a safe and unbiased path to AGI, we must calibrate the biases in our LLMs. With this goal in mind, I tested the Goodfire SDK and its steering features as a way to mitigate bias, during the recently held Apart Research x Goodfire hackathon on ‘Reprogramming AI Models’.

Cite this work:

@misc{
  title={Bias Mitigation in LLM by Steering Features},
  author={Akanksha Devkar},
  date={11/25/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Reviewer's Comments

Liv Gorton

This is an interesting project that applies interpretability to understand the bias that exists in LLMs. I liked the result that nudging put a gender-neutral pronoun in the top logits rather than just making a gendered pronoun more or less likely. The figures were well presented and the paper was clearly written. Overall a nice demonstration of how steering could be used! It could be interesting to explore how this holds up to in-context pronouns, or whether there is a set of features that produces the result regardless of the direction of the bias.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
