Welcome to the PersonaSafe documentation. This guide will help you get started with safety monitoring for language models.
See the main README.md for installation instructions and a 5-minute quick start.
- **TUTORIAL.md**: Step-by-step guide to using PersonaSafe:
  - Extracting persona vectors
  - Screening datasets for drift
  - Applying activation steering
- **API_REFERENCE.md**: Complete API documentation for all classes and methods.
Extract a persona vector by contrasting positive and negative prompts:

```python
from personasafe import PersonaExtractor

# Contrast positive and negative prompts to get a direction for one trait.
extractor = PersonaExtractor("google/gemma-3-4b")
vector = extractor.compute_persona_vector(
    positive_prompts=["Be helpful..."],
    negative_prompts=["Be harmful..."],
    trait_name="helpfulness",
)
```
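Under the hood, persona-vector extraction of this kind is typically a contrast between mean activations on the positive and negative prompts. A toy numpy sketch of that idea, with made-up shapes and random stand-in activations rather than PersonaSafe internals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for hidden activations at one layer: (n_prompts, hidden_dim).
pos_acts = rng.normal(loc=0.5, size=(8, 16))   # activations on positive prompts
neg_acts = rng.normal(loc=-0.5, size=(8, 16))  # activations on negative prompts

# The persona vector is the normalized difference of the two means.
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
persona_vector = direction / np.linalg.norm(direction)

# A new activation can then be scored by projecting onto this direction.
score = rng.normal(size=16) @ persona_vector
```

A higher projection score means the activation points more strongly in the trait's direction.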
Screen a dataset for persona drift:

```python
from personasafe import DataScreener
import pandas as pd

screener = DataScreener(extractor=extractor, persona_vectors={"helpfulness": vector})
df = pd.DataFrame({"text": ["This is helpful", "This is harmful"]})
screened_df = screener.screen_dataset(df)  # defaults to text_column="text"
report = screener.generate_report(screened_df)
```
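Conceptually, screening scores each row by projecting its activation onto a persona vector and flags rows that fall past a threshold. A hypothetical pandas sketch of that pattern; the column names, stand-in activations, and threshold here are invented for illustration, not the DataScreener API:

```python
import numpy as np
import pandas as pd

persona_vector = np.array([1.0, 0.0, 0.0, 0.0])  # toy unit direction

df = pd.DataFrame({"text": ["This is helpful", "This is harmful"]})

# Stand-in activations, one row per text: (n_rows, hidden_dim).
acts = np.array([[0.9, 0.1, 0.0, 0.0],
                 [-0.8, 0.2, 0.1, 0.0]])

# Projection score per row, then a simple threshold flag.
df["helpfulness_score"] = acts @ persona_vector
df["flagged"] = df["helpfulness_score"] < 0.0  # hypothetical drift threshold
```

Here the first row scores 0.9 and passes, while the second scores -0.8 and is flagged.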
Apply activation steering during generation:

```python
from personasafe import ActivationSteerer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

steerer = ActivationSteerer(model, tokenizer)
original_text, steered_text = steerer.steer(
    prompt="Hello, how are you?",
    persona_vector=vector,
    multiplier=1.0,
    layer=20,
)
```

## Resources

- Research Paper: Persona Vectors by Anthropic
- GitHub: shehral/PersonaSafe
- Gemma Models: Google AI
- Issues: GitHub Issues
- Discussions: GitHub Discussions
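For context on the `steer` call in the quick examples: activation steering adds `multiplier * persona_vector` to the model's hidden state at the chosen layer, nudging generation along the trait's direction. A toy numpy illustration of that update rule (not the ActivationSteerer internals):

```python
import numpy as np

hidden = np.array([0.2, -0.1, 0.4])         # hidden state at one layer (toy size)
persona_vector = np.array([1.0, 0.0, 0.0])  # unit persona direction
multiplier = 1.0

# Steering shifts the hidden state along the persona direction;
# the result equals [1.2, -0.1, 0.4] for these toy values.
steered = hidden + multiplier * persona_vector
```

A negative multiplier steers away from the trait instead of toward it.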
Last Updated: October 26, 2025