steering-vectors

Here are 7 public repositories matching this topic...

bassrehab / steering-vectors-agents

Runtime control of LLM agent behaviors through activation steering vectors. More calibrated than prompting.

machine-learning transformers pytorch steering-behaviors ai-safety interpretability langchain llm-agents activation-engineering steering-vectors contrastive-activation-addition

Updated Dec 19, 2025
Python

SolomonB14D3 / knowledge-fidelity

Behavioral auditing & repair toolkit for LLMs. Measures 8 dimensions via confidence probes.

transformers pytorch svd interpretability confidence bias-detection truthfulness model-merging sycophancy llm-compression mergekit activation-engineering model-auditing steering-vectors rho-audit behavioral-evaluation

Updated Mar 26, 2026
Python

G-Art / matrix_steering_vector_research

Iterative Sparse Matrix Steering: Closed-Form Subspace Alignment for Multi-Layer LLM Control (No SGD required).

pytorch alignment interpretability llm activation-engineering steering-vectors

Updated Jan 5, 2026
Jupyter Notebook

JoschkaCBraun / adaptive-steering

Official implementation of "Beyond Multiple Choice: Evaluating Steering Vectors for Summarization" (Findings of EACL 2026).

evaluation summarization language-model abstractive-text-summarization abstractive-summarization steering-vector steering-vectors

Updated Jan 21, 2026
Python

JoschkaCBraun / steering-vector-reliability

Repository for paper "Understanding (Un)Reliability of Steering Vectors in Language Models" by Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov.

machine-learning language-models unreliability steering-vector steering-vectors

Updated Jun 10, 2025
Jupyter Notebook

aygp-dr / qwen3-steering

Qwen3-0.6B activation steering: style vectors, lens contamination eval, CPRR methodology

transformer style-transfer property-based-testing literate-programming superposition mechanistic-interpretability llm-evaluation small-language-models representation-engineering activation-steering qwen3 steering-vectors actadd cprr conceptual-lens-drift

Updated Mar 26, 2026
Python

IgRoF / steering_trust

Investigating honesty, deception and steering in large language models. Replicating and extending the MASK honesty benchmark on frontier models, working toward internal representation analysis and activation steering for honesty.

ai-safety honesty truthfulness steering-vectors mask-benchmark

Updated Mar 22, 2026