AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Leheng Sheng¹*, Changshuo Shen²*, Weixiang Zhao³, Junfeng Fang¹, Xiaohao Liu¹, Zhenkai Liang¹, Xiang Wang², An Zhang²†, Tat-Seng Chua¹

¹National University of Singapore, ²University of Science and Technology of China, ³Harbin Institute of Technology

* Equal contribution. † Corresponding author.

Overview

AlphaSteer is a theoretically grounded activation steering method designed to enhance LLM safety without compromising utility. While traditional activation steering approaches face a trade-off between safety and performance, AlphaSteer addresses this challenge through a principled learning approach with dual objectives:

Utility Preservation: Learns to create near-zero steering vectors for benign inputs using null-space constraints
Safety Enhancement: Generates effective refusal direction vectors for malicious prompts through linear regression
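
The two objectives above can be illustrated with a small NumPy sketch. It is not the repository's implementation: the function names (`null_space_projector`, `fit_delta`), the toy dimensions, and the synthetic activations are all assumptions made for illustration. The benign activations here are exactly low-rank so that an exact null space exists; real model activations are only approximately low-rank, so a practical version would presumably have to pick a cutoff for which directions count as the null space.

```python
import numpy as np


def null_space_projector(H_benign: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """Projector onto the left null space of benign activations (columns of H_benign).

    Any vector in the span of the benign activations is mapped to (numerically) zero.
    """
    U, S, _ = np.linalg.svd(H_benign, full_matrices=True)
    rank = int(np.sum(S > tol * S[0]))   # number of directions benign data occupies
    U_null = U[:, rank:]                 # orthonormal basis of the remaining directions
    return U_null @ U_null.T


def fit_delta(H_malicious: np.ndarray, refusal_dir: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Least-squares fit of Delta so that Delta @ P @ h is close to refusal_dir
    for every malicious activation h (plain linear regression, no regularizer)."""
    X = P @ H_malicious                                   # projected malicious activations
    R = np.tile(refusal_dir[:, None], (1, X.shape[1]))    # same target for each column
    return R @ np.linalg.pinv(X)                          # minimum-norm solution of Delta @ X = R


# Toy data (hypothetical): 8-dim activations, benign ones confined to a 3-dim subspace.
rng = np.random.default_rng(0)
d = 8
H_benign = rng.normal(size=(d, 3)) @ rng.normal(size=(3, 50))
H_malicious = rng.normal(size=(d, 4))
refusal_dir = rng.normal(size=d)

P = null_space_projector(H_benign)
Delta = fit_delta(H_malicious, refusal_dir, P)

# Steering vector added to each activation: Delta @ P @ h
benign_shift = Delta @ P @ H_benign
malicious_shift = Delta @ P @ H_malicious

print("largest benign steering component:", np.abs(benign_shift).max())
print("largest deviation from refusal direction:",
      np.abs(malicious_shift - refusal_dir[:, None]).max())
```

In this toy setting the steering vector vanishes for benign inputs because they are projected out before `Delta` is applied, while the regression makes the steering vector for each malicious input match the chosen refusal direction.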

Effect on Different Prompt Activations & Performance

AlphaSteer steers the activations of malicious prompts toward refusal while leaving those of benign prompts largely unchanged. Traditional activation steering methods, by contrast, struggle to leave benign activations untouched. As a result, AlphaSteer preserves the model's utility while improving its safety by a large margin.

👉 Quick Start of AlphaSteer

Installation of Dependencies

conda create -n alphasteer python=3.11
conda activate alphasteer
pip install -r requirements.txt

Usage

The alphasteer.sh script automates the process of extracting embeddings, calculating the steering matrix, and generating steered responses for the meta-llama/Llama-3.1-8B-Instruct model.

./scripts/alphasteer.sh

Alternatively, you can download our steering matrix directly from this Google Drive link (recommended).

Save it in the data/steering_matrix directory, then run only the final generation step:

./scripts/generate.sh

☎️ Contact

Please contact either of the first authors with any questions.

Leheng Sheng, leheng.sheng@u.nus.edu
Changshuo Shen, stephen_shen@mail.ustc.edu.cn

🙏 Acknowledgments

We would like to express our gratitude to the authors of AdaSteer for their pioneering work on adaptive activation steering. Our refusal vector extraction process is inspired by their methodology presented in "AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender". We also acknowledge their open-source implementation available at https://github.com/MuyuenLP/AdaSteer, which has been instrumental in our research.

🌟 Citation

If you find our work useful, please consider citing it as follows:

@article{AlphaSteer,
  title={AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint},
  author={Sheng, Leheng and Shen, Changshuo and Zhao, Weixiang and Fang, Junfeng and Liu, Xiaohao and Liang, Zhenkai and Wang, Xiang and Zhang, An and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2506.07022},
  year={2025}
}
