PersonaSafe Documentation

Welcome to the PersonaSafe documentation. This guide helps you get started with safety monitoring for language models.

📖 Getting Started

Installation & Quick Start

See the main README.md for installation instructions and a 5-minute quick start.

Tutorial

TUTORIAL.md - Step-by-step guide to using PersonaSafe:

Extracting persona vectors
Screening datasets for drift
Applying activation steering

API Reference

API_REFERENCE.md - Complete API documentation for all classes and methods.

🎯 Common Tasks

Extract Persona Vectors

from personasafe import PersonaExtractor

extractor = PersonaExtractor("google/gemma-3-4b")
vector = extractor.compute_persona_vector(
    positive_prompts=["Be helpful..."],
    negative_prompts=["Be harmful..."],
    trait_name="helpfulness"
)
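As a rough illustration of what a persona vector can represent, the sketch below computes a difference of mean activations between positive and negative examples. This is an assumption based on the general persona-vector idea, not a description of what `PersonaExtractor.compute_persona_vector` does internally; the helper names (`mean_vector`, `difference_of_means`) and the toy activations are hypothetical.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def difference_of_means(positive_acts, negative_acts):
    """Return mean(positive) - mean(negative), one value per hidden dimension."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]

# Toy 3-dimensional "activations" (hypothetical); real ones would come from a model layer.
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

toy_vector = difference_of_means(positive_acts, negative_acts)
print(toy_vector)  # [2.0, 0.0, -2.0]
```

Dimensions where the two groups agree cancel out, so the resulting direction highlights only what differs between the positive and negative prompts.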

Screen a Dataset

from personasafe import DataScreener
import pandas as pd

# Reuses `extractor` and `vector` from the "Extract Persona Vectors" example.
screener = DataScreener(extractor=extractor, persona_vectors={"helpfulness": vector})
df = pd.DataFrame({"text": ["This is helpful", "This is harmful"]})
screened_df = screener.screen_dataset(df)  # defaults to text_column="text"
report = screener.generate_report(screened_df)
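One plausible way to score a sample for drift is to project its activation onto the persona vector. The sketch below shows that projection in plain Python. Whether `DataScreener` scores samples this way is an assumption; `projection_score` and the toy values are hypothetical and only illustrate the arithmetic.

```python
import math

def projection_score(activation, persona_vector):
    """Scalar projection of `activation` onto the direction of `persona_vector`."""
    norm = math.sqrt(sum(v * v for v in persona_vector))
    if norm == 0.0:
        raise ValueError("persona_vector must be non-zero")
    dot = sum(a * v for a, v in zip(activation, persona_vector))
    return dot / norm

# Toy 2-dimensional persona vector and three hypothetical sample activations.
toy_vector = [3.0, 4.0]
samples = [[3.0, 4.0], [-3.0, -4.0], [4.0, -3.0]]

scores = [projection_score(a, toy_vector) for a in samples]
print(scores)  # [5.0, -5.0, 0.0]
```

A positive score means the sample points along the trait direction, a negative score means it points against it, and a score near zero means it is roughly orthogonal to the trait.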

Apply Steering

from personasafe import ActivationSteerer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
steerer = ActivationSteerer(model, tokenizer)
# Reuses `vector` from the "Extract Persona Vectors" example.
original_text, steered_text = steerer.steer(
    prompt="Hello, how are you?",
    persona_vector=vector,
    multiplier=1.0,  # steering strength
    layer=20,  # layer at which the vector is applied
)
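Conceptually, activation steering is often described as adding a scaled persona vector to a hidden state at the chosen layer. The sketch below shows that arithmetic on a toy vector. It assumes the `multiplier` argument plays the role of the scaling factor; `steer_hidden_state` is a hypothetical helper, not part of the PersonaSafe API.

```python
def steer_hidden_state(hidden, persona_vector, multiplier=1.0):
    """Return hidden + multiplier * persona_vector, element by element."""
    if len(hidden) != len(persona_vector):
        raise ValueError("hidden and persona_vector must have the same length")
    return [h + multiplier * v for h, v in zip(hidden, persona_vector)]

# Toy 3-dimensional hidden state and persona vector (hypothetical values).
hidden = [0.5, -1.0, 2.0]
toy_vector = [1.0, 0.0, -1.0]

steered = steer_hidden_state(hidden, toy_vector, multiplier=2.0)
print(steered)  # [2.5, -1.0, 0.0]
```

Under this reading, a multiplier of 0 leaves the hidden state unchanged, and a negative multiplier pushes it away from the trait.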

🔗 External Resources

Research Paper: Persona Vectors by Anthropic
GitHub: shehral/PersonaSafe
Gemma Models: Google AI

🤝 Getting Help

Issues: GitHub Issues
Discussions: GitHub Discussions

Last Updated: October 26, 2025
