drift_orchestrator

LLM safety monitor security research platform. Built to detect policy drift in LLM sessions — and to find out how that detection can be defeated.

Live Demos

Demo	Description
Research Portal	Full findings, interactive vectors, attack chain, data
Dual Signal Governor	Interactive architecture — run simulated sessions live

What This Is

drift_orchestrator started as a drift detection tool and became a security research program. It implements a multi-signal LLM session monitor and uses that monitor as an experimental platform to find, characterize, and document vulnerabilities in LLM safety architectures.

The research has produced three published vulnerability classes and a complete attack chain against dual-signal safety monitors.

Published Research

Second-Order Injection: Attacking the Evaluator in LLM Safety Monitors

LLM-based safety evaluators share a context window with the content they evaluate. Attacker-influenced session content can directly override evaluator verdicts. Six injection vectors characterized across three model families. 100% bypass achieved on all three models. Vector transfer confirmed -- a payload discovered on the most resistant model works universally without modification. No prompt sanitization fully mitigates the attack.

The Dual-Signal Governor: A Control Plane Pattern for Drift-Aware Systems

Architecture for combining geometric drift signals with LLM coherence evaluation. Documents the false positive reduction mechanism and introduces the policy boost system. Includes empirical false boost rate data.

Semantic Gradient Evasion (SGE): Bypassing Embedding-Based Drift Detectors

Embedding-based drift detectors can be bypassed through monotonic semantic gradient chains -- small consistent shifts that individually stay below thresholds while cumulatively inverting policy meaning.

Architecture

Signal A: geometric drift (sentence-transformers embeddings)
Signal B: LLM coherence evaluator (qwen/mistral/phi3 via Ollama)
Governor: dual-signal arbitration with policy boost
Gateway: localai_gateway FastAPI control plane
Telemetry: SQLite session logging

Key Findings

Second-order injection is universal. Any LLM evaluator that reads attacker-influenced content to produce a safety verdict shares a context window with its attack surface. V4 (reasoning capture) achieves 100% bypass across qwen2.5:3b, mistral, and phi3:mini without modification.

Coupled architectures collapse under symmetric injection. When both evaluators in a dual-system are injected simultaneously, inter-system divergence drops to ~0.01 -- indistinguishable from a clean session. The primary detection signal is eliminated.

No prompt sanitization fully mitigates V4. M4 combined sanitization (bracket stripping + keyword filtering + first-sentence truncation) eliminates V1/V2/V3 bypass but V4 reasoning capture uses natural language that survives all tested filters.

Vector transfer confirmed. Probe the most resistant model to discover effective vectors, deploy against all models simultaneously. One payload, universal exploit.

Experimental Data

File	Contents
`results/soi_phase1.jsonl`	Individual model injection vulnerability
`results/soi_phase2.jsonl`	Coupled system bypass
`results/mitigation_probe.jsonl`	Sanitization effectiveness
`results/cross_model_generalization.jsonl`	phi3:mini generalization
`results/phi3_targeted_injection.jsonl`	phi3 targeted vectors
`results/vector_transfer.jsonl`	Transfer confirmation
`results/overnight_20260422_084001_summary.json`	3-iteration overnight run

Probe Scripts

Script	Purpose
`scripts/second_order_injection_phase1.py`	Individual model injection characterization
`scripts/second_order_injection_phase2.py`	Coupled system bypass
`scripts/mitigation_probe.py`	Sanitization strategy testing
`scripts/cross_model_generalization_probe.py`	Multi-model generalization
`scripts/phi3_targeted_injection_probe.py`	phi3-specific vector tuning
`scripts/vector_transfer_probe.py`	Cross-model transfer testing
`scripts/divergence_threshold_calibration.py`	Detection threshold analysis
`overnight_runner.py`	Full multi-iteration orchestration

Setup

# Requires Ollama with qwen2.5:3b, mistral, phi3:mini
ollama pull qwen2.5:3b && ollama pull mistral && ollama pull phi3:mini

# Install dependencies
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Start gateways
cd ../localai_gateway
MODEL_FAST=qwen2.5:3b GATEWAY_PORT=8765 python main.py &
MODEL_FAST=mistral GATEWAY_PORT=8766 python main.py &
MODEL_FAST=phi3:mini GATEWAY_PORT=8767 python main.py &

# Run full overnight probe suite
cd ../drift_orchestrator
python overnight_runner.py

Research Roadmap

SGE -- semantic gradient evasion of embedding detectors
Dual-signal governor -- false positive reduction architecture
Coupled dual-system -- divergence as detection signal
Second-order injection -- evaluator override, universal bypass
Cross-model generalization -- phi3 resistance and tuned bypass
Vector transfer -- universal payload confirmed
Adaptive injection -- feedback loop attack
Evaluator isolation -- architectural defense design
Formal model -- mathematical characterization of injectability

Reference Docs

Document	Contents
FINDINGS.md	All empirical results — quick reference tables
THREAT_MODEL.md	Attacker profile, attack surface, defender assumptions
papers/second_order_injection.md	Full research paper

Name		Name	Last commit message	Last commit date
Latest commit History 108 Commits
.github/workflows		.github/workflows
agents		agents
alerts		alerts
analysis		analysis
backend		backend
benchmarks		benchmarks
data		data
demo_recordings		demo_recordings
demo_shots		demo_shots
docs		docs
examples		examples
firewall		firewall
orchestrator		orchestrator
papers		papers
results		results
scripts		scripts
tests		tests
ui		ui
verifier		verifier
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
FINDINGS.md		FINDINGS.md
LICENSE		LICENSE
README.md		README.md
REPLICATE.md		REPLICATE.md
RESEARCH_LOG.md		RESEARCH_LOG.md
SECURITY.md		SECURITY.md
THREAT_MODEL.md		THREAT_MODEL.md
agent.py		agent.py
agent_runtime.py		agent_runtime.py
branch_merge.py		branch_merge.py
compare.py		compare.py
demo_real.py		demo_real.py
demo_real.py.bak.20260405_152026		demo_real.py.bak.20260405_152026
demo_real.py.bak.cleanup.20260405_152451		demo_real.py.bak.cleanup.20260405_152451
deploy.py		deploy.py
deploy_closed_loop.py		deploy_closed_loop.py
deploy_full_stack.py		deploy_full_stack.py
drift_live_signal.py		drift_live_signal.py
embed_config.py		embed_config.py
embedding_evaluator.py		embedding_evaluator.py
embedding_evaluator.py.bak.20260405_152228		embedding_evaluator.py.bak.20260405_152228
embeddings.py		embeddings.py
evaluator.py		evaluator.py
evaluator.py.bak.20260405_184357		evaluator.py.bak.20260405_184357
evaluator.py.bak.20260405_185236		evaluator.py.bak.20260405_185236
evaluator.py.bak.20260405_190251		evaluator.py.bak.20260405_190251
evaluator.py.bak.demo.20260405_151758		evaluator.py.bak.demo.20260405_151758
evasion_test_suite.py		evasion_test_suite.py
external_evaluator.py		external_evaluator.py
live.py		live.py
live.py.bak.20260405_184357		live.py.bak.20260405_184357
live.py.bak.20260405_185236		live.py.bak.20260405_185236
live_dashboard.py		live_dashboard.py
live_evaluator.py		live_evaluator.py
live_signal_api.py		live_signal_api.py
log_emitter.py		log_emitter.py
metrics.py		metrics.py
multi_runtime.py		multi_runtime.py
operator_view.py		operator_view.py
overnight_runner.py		overnight_runner.py
policy.py		policy.py
policy.py.bak.20260405_184357		policy.py.bak.20260405_184357
policy.py.bak.20260405_185236		policy.py.bak.20260405_185236
policy.py.bak.20260405_190251		policy.py.bak.20260405_190251
policy.py.bak.safe_escalation.20260405_190409		policy.py.bak.safe_escalation.20260405_190409
pytest.ini		pytest.ini
recovery.py		recovery.py
recovery.py.bak.20260405_184357		recovery.py.bak.20260405_184357
recovery.py.bak.20260405_185236		recovery.py.bak.20260405_185236
report.py		report.py
requirements.txt		requirements.txt
run_eval.sh		run_eval.sh
session_manager.py		session_manager.py
session_manager.py.bak.20260405_184357		session_manager.py.bak.20260405_184357
session_manager.py.bak.20260405_185236		session_manager.py.bak.20260405_185236
sqlite_store.py		sqlite_store.py
test_closed_loop.py		test_closed_loop.py
test_full_stack.py		test_full_stack.py
test_live_signal_stream.py		test_live_signal_stream.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

drift_orchestrator

Live Demos

What This Is

Published Research

Second-Order Injection: Attacking the Evaluator in LLM Safety Monitors

The Dual-Signal Governor: A Control Plane Pattern for Drift-Aware Systems

Semantic Gradient Evasion (SGE): Bypassing Embedding-Based Drift Detectors

Architecture

Key Findings

Experimental Data

Probe Scripts

Setup

Research Roadmap

Reference Docs

Related

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

drift_orchestrator

Live Demos

What This Is

Published Research

Second-Order Injection: Attacking the Evaluator in LLM Safety Monitors

The Dual-Signal Governor: A Control Plane Pattern for Drift-Aware Systems

Semantic Gradient Evasion (SGE): Bypassing Embedding-Based Drift Detectors

Architecture

Key Findings

Experimental Data

Probe Scripts

Setup

Research Roadmap

Reference Docs

Related

About

Topics

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages