Skip to content

GnomeMan4201/drift_orchestrator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

108 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

drift_orchestrator

Second-Order Injection Attack Chain

LLM safety monitor security research platform. Built to detect policy drift in LLM sessions — and to find out how that detection can be defeated.

CI Vectors Models Bypass Paper Portal


Live Demos

Demo Description
Research Portal Full findings, interactive vectors, attack chain, data
Dual Signal Governor Interactive architecture — run simulated sessions live

What This Is

drift_orchestrator started as a drift detection tool and became a security research program. It implements a multi-signal LLM session monitor and uses that monitor as an experimental platform to find, characterize, and document vulnerabilities in LLM safety architectures.

The research has produced three published vulnerability classes and a complete attack chain against dual-signal safety monitors.


Published Research

LLM-based safety evaluators share a context window with the content they evaluate. Attacker-influenced session content can directly override evaluator verdicts. Six injection vectors characterized across three model families. 100% bypass achieved on all three models. Vector transfer confirmed -- a payload discovered on the most resistant model works universally without modification. No prompt sanitization fully mitigates the attack.

Architecture for combining geometric drift signals with LLM coherence evaluation. Documents the false positive reduction mechanism and introduces the policy boost system. Includes empirical false boost rate data.

Embedding-based drift detectors can be bypassed through monotonic semantic gradient chains -- small consistent shifts that individually stay below thresholds while cumulatively inverting policy meaning.


Architecture

Signal A: geometric drift (sentence-transformers embeddings)
Signal B: LLM coherence evaluator (qwen/mistral/phi3 via Ollama)
Governor: dual-signal arbitration with policy boost
Gateway: localai_gateway FastAPI control plane
Telemetry: SQLite session logging

Key Findings

Second-order injection is universal. Any LLM evaluator that reads attacker-influenced content to produce a safety verdict shares a context window with its attack surface. V4 (reasoning capture) achieves 100% bypass across qwen2.5:3b, mistral, and phi3:mini without modification.

Coupled architectures collapse under symmetric injection. When both evaluators in a dual-system are injected simultaneously, inter-system divergence drops to ~0.01 -- indistinguishable from a clean session. The primary detection signal is eliminated.

No prompt sanitization fully mitigates V4. M4 combined sanitization (bracket stripping + keyword filtering + first-sentence truncation) eliminates V1/V2/V3 bypass but V4 reasoning capture uses natural language that survives all tested filters.

Vector transfer confirmed. Probe the most resistant model to discover effective vectors, deploy against all models simultaneously. One payload, universal exploit.


Experimental Data

File Contents
results/soi_phase1.jsonl Individual model injection vulnerability
results/soi_phase2.jsonl Coupled system bypass
results/mitigation_probe.jsonl Sanitization effectiveness
results/cross_model_generalization.jsonl phi3:mini generalization
results/phi3_targeted_injection.jsonl phi3 targeted vectors
results/vector_transfer.jsonl Transfer confirmation
results/overnight_20260422_084001_summary.json 3-iteration overnight run

Probe Scripts

Script Purpose
scripts/second_order_injection_phase1.py Individual model injection characterization
scripts/second_order_injection_phase2.py Coupled system bypass
scripts/mitigation_probe.py Sanitization strategy testing
scripts/cross_model_generalization_probe.py Multi-model generalization
scripts/phi3_targeted_injection_probe.py phi3-specific vector tuning
scripts/vector_transfer_probe.py Cross-model transfer testing
scripts/divergence_threshold_calibration.py Detection threshold analysis
overnight_runner.py Full multi-iteration orchestration

Setup

# Requires Ollama with qwen2.5:3b, mistral, phi3:mini
ollama pull qwen2.5:3b && ollama pull mistral && ollama pull phi3:mini

# Install dependencies
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Start gateways
cd ../localai_gateway
MODEL_FAST=qwen2.5:3b GATEWAY_PORT=8765 python main.py &
MODEL_FAST=mistral GATEWAY_PORT=8766 python main.py &
MODEL_FAST=phi3:mini GATEWAY_PORT=8767 python main.py &

# Run full overnight probe suite
cd ../drift_orchestrator
python overnight_runner.py

Research Roadmap

  • SGE -- semantic gradient evasion of embedding detectors
  • Dual-signal governor -- false positive reduction architecture
  • Coupled dual-system -- divergence as detection signal
  • Second-order injection -- evaluator override, universal bypass
  • Cross-model generalization -- phi3 resistance and tuned bypass
  • Vector transfer -- universal payload confirmed
  • Adaptive injection -- feedback loop attack
  • Evaluator isolation -- architectural defense design
  • Formal model -- mathematical characterization of injectability

Reference Docs

Document Contents
FINDINGS.md All empirical results — quick reference tables
THREAT_MODEL.md Attacker profile, attack surface, defender assumptions
papers/second_order_injection.md Full research paper

Related

localai_gateway -- FastAPI control plane routing inference across local Ollama models

badBANANA research -- all published work


Independent security research. Necessity-driven development. badBANANA // gnomeman4201

About

Runtime drift detection and hallucination verification for LLM sessions. SQLite telemetry, semantic embeddings, and policy-based intervention.

Topics

Resources

License

Security policy

Stars

Watchers

Forks

Packages