AlphaSteer is a theoretically grounded activation steering method designed to enhance LLM safety without compromising utility. While traditional activation steering approaches face a trade-off between safety and performance, AlphaSteer addresses this challenge through a principled learning approach with dual objectives:
- Utility Preservation: Learns to create near-zero steering vectors for benign inputs using null-space constraints
- Safety Enhancement: Generates effective refusal direction vectors for malicious prompts through linear regression
AlphaSteer steers the activations of malicious prompts towards refusal while leaving those of benign prompts largely unchanged, which traditional activation steering methods struggle to do. As a result, AlphaSteer preserves the model's utility while substantially improving its safety.
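
The two objectives above can be illustrated with a small NumPy sketch on synthetic data. This is not the code from this repository: the function names, the toy dimensions, the singular-value threshold used to pick the null space, the ridge penalty `alpha`, and the steering strength `lam` are all assumptions made for illustration. The sketch builds a projector onto directions that benign activations do not occupy, then fits a ridge-regularized linear regression that maps the projected malicious activations to a refusal vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size of the toy "model" (assumption)


def null_space_projector(H_benign, rel_tol=1e-8):
    """Projector onto directions unused by benign activations (columns of H_benign)."""
    U, S, _ = np.linalg.svd(H_benign, full_matrices=True)
    # Pad singular values so every left singular vector has one (matters when d > N).
    s_full = np.zeros(U.shape[1])
    s_full[: S.size] = S
    U_null = U[:, s_full <= rel_tol * s_full.max()]
    return U_null @ U_null.T


def fit_steering_matrix(H_benign, H_malicious, refusal_vec, alpha=1e-3):
    """Fit Delta so that Delta @ h is ~0 for benign h and ~refusal_vec for malicious h."""
    P = null_space_projector(H_benign)
    X = P @ H_malicious  # malicious activations restricted to the benign null space
    R = np.tile(refusal_vec[:, None], (1, H_malicious.shape[1]))  # one target per prompt
    A = X @ X.T + alpha * np.eye(X.shape[0])
    delta_tilde = np.linalg.solve(A, X @ R.T).T  # ridge solution of delta_tilde @ X ~ R
    return delta_tilde @ P  # the trailing projector cancels benign activations


# Synthetic activations: benign ones live in a 6-dimensional subspace,
# malicious ones cluster around a shared mean direction.
H_b = rng.normal(size=(d, 6)) @ rng.normal(size=(6, 200))
H_m = 3.0 * rng.normal(size=(d, 1)) + 0.1 * rng.normal(size=(d, 50))
r = rng.normal(size=d)  # stand-in for an extracted refusal direction

Delta = fit_steering_matrix(H_b, H_m, r)

benign_shift = np.abs(Delta @ H_b).max()
malicious_err = np.linalg.norm(Delta @ H_m - r[:, None]) / (np.sqrt(50) * np.linalg.norm(r))
print(f"largest benign steering entry: {benign_shift:.2e}")
print(f"relative regression error on malicious prompts: {malicious_err:.3f}")

# Steering one activation: h' = h + lam * Delta @ h (lam is a hypothetical strength).
lam = 0.5
h_steered = H_m[:, 0] + lam * (Delta @ H_m[:, 0])
```

In this toy setting the benign activations are exactly low-rank, so their null space can be found with a strict threshold. With real model activations the spectrum decays gradually, and how many directions to treat as "null" becomes a tuning choice.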
conda create -n alphasteer python=3.11
conda activate alphasteer
pip install -r requirements.txt

The alphasteer.sh script automates extracting embeddings, calculating the steering matrix, and generating steered responses for the meta-llama/Llama-3.1-8B-Instruct model.
./scripts/alphasteer.sh

Alternatively, you can download our steering matrix directly from this Google Drive link (recommended).
Save it to the data/steering_matrix directory, then run only the final generation step:
./scripts/generate.sh

Please contact any of the first authors with questions.
- Leheng Sheng, leheng.sheng@u.nus.edu
- Changshuo Shen, stephen_shen@mail.ustc.edu.cn
We would like to express our gratitude to the authors of AdaSteer for their pioneering work on adaptive activation steering. Our refusal vector extraction process is inspired by their methodology presented in "AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender". We also acknowledge their open-source implementation available at https://github.com/MuyuenLP/AdaSteer, which has been instrumental in our research.
If you find our work useful, please consider citing it as follows:
@article{AlphaSteer,
title={AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint},
author={Sheng, Leheng and Shen, Changshuo and Zhao, Weixiang and Fang, Junfeng and Liu, Xiaohao and Liang, Zhenkai and Wang, Xiang and Zhang, An and Chua, Tat-Seng},
journal={arXiv preprint arXiv:2506.07022},
year={2025}
}

