Temporal behavioral drift evaluator for Agent Control. Detects gradual degradation patterns that point-in-time evaluators miss.
Agent Control's built-in evaluators (regex, list, SQL, JSON) assess individual interactions. They answer: "Is this response safe right now?" But they don't answer: "Is this agent becoming less reliable over time?"
Empirical observation across 13 LLM agents showed:
- Agents scoring 1.0 on point-in-time tests drifted ~7% on behavioral consistency over 28-day windows
- Self-reported capability claims diverged from measured behavior by 7% on average
- Degradation patterns were non-monotonic — stability windows followed by abrupt shifts, not gradual decline
This evaluator fills that gap.
- Records behavioral observations per agent over time
- Compares recent window against an established baseline
- Flags drift when mean shift exceeds a configurable threshold
- Dampens false signals from tasks with multiple valid behavioral patterns (`spec_clarity`)
- Baseline vs window: the first N observations establish the baseline; the most recent M observations are compared against it
- Mean shift: Absolute delta between baseline mean and recent mean
- Cohen's d: Standardized effect size for practical significance
- Confidence: Weighted combination of sample size and effect size
- Specification clarity: `multi_valid` tasks suppress drift flags when the effect size is small (agents legitimately behave differently on ambiguous tasks)
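The statistics above can be sketched in a few lines. This is an illustrative sketch only, not the package's actual implementation; the helper name `detect_drift` and the exact pooling of the standard deviation are assumptions:

```python
from statistics import mean, stdev

def detect_drift(scores, baseline_size=10, window_size=7, threshold=0.10):
    """Compare a recent window against the established baseline.

    Returns (mean_shift, cohens_d, drift_detected), or None if there are
    not yet enough observations. Illustrative sketch, not the library API.
    """
    if len(scores) < baseline_size + window_size:
        return None  # not enough observations yet

    baseline = scores[:baseline_size]      # first N observations
    window = scores[-window_size:]         # most recent M observations
    mean_shift = mean(window) - mean(baseline)

    # Cohen's d: mean difference standardized by the pooled std deviation
    n1, n2 = len(baseline), len(window)
    pooled = (((n1 - 1) * stdev(baseline) ** 2 +
               (n2 - 1) * stdev(window) ** 2) / (n1 + n2 - 2)) ** 0.5
    cohens_d = mean_shift / pooled if pooled > 0 else 0.0

    drift_detected = abs(mean_shift) >= threshold
    return mean_shift, cohens_d, drift_detected

# A stable baseline followed by a degraded window trips a 0.10 threshold:
scores = [0.95, 0.94, 0.96, 0.95, 0.93, 0.95, 0.96, 0.94, 0.95, 0.96,
          0.80, 0.78, 0.82, 0.79, 0.81, 0.80, 0.78]
shift, d, flagged = detect_drift(scores)
```

Here the baseline mean is about 0.95 and the recent window mean about 0.80, so the mean shift (about -0.15) exceeds the threshold and the effect size is large.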
```bash
pip install agent-control-drift-evaluator
```

With the Redis backend:

```bash
pip install "agent-control-drift-evaluator[redis]"
```

Attach the evaluator to an agent step with the `@control` decorator:

```python
from agent_control import control

@control(
    name="behavioral-drift-check",
    evaluator="drift",
    config={
        "window_size": 7,          # recent observations to analyze
        "baseline_size": 10,       # observations for baseline
        "drift_threshold": 0.10,   # 10% mean shift triggers
        "dimensions": ["calibration", "adaptation", "robustness"],
        "action": "warn",          # or "deny" for critical agents
        "spec_clarity": "unambiguous",
    },
)
async def my_agent_step(input):
    ...
```

The evaluator expects data with `agent_id` and `score`:

```json
{
  "agent_id": "my-agent-001",
  "score": 0.92,
  "dimension": "calibration",
  "timestamp": 1710844800.0,
  "metadata": {"probe": "pii-detection"}
}
```

| Field | Required | Description |
|---|---|---|
| `agent_id` | ✅ | Identifies which agent this observation is for |
| `score` | ✅ | Behavioral measurement (0.0–1.0; higher = more reliable) |
| `dimension` | ❌ | Category for separate tracking (default: `"default"`) |
| `timestamp` | ❌ | Unix epoch seconds (default: current time) |
| `metadata` | ❌ | Extra context (probe type, model version, etc.) |
| Parameter | Default | Description |
|---|---|---|
| `window_size` | `7` | Recent observations to compare; empirical minimum is 5 |
| `baseline_size` | `10` | Initial observations used for the baseline |
| `drift_threshold` | `0.10` | Mean-shift threshold (0.0–1.0) |
| `dimensions` | `["default"]` | Dimensions to track separately |
| `action` | `"warn"` | Action on drift: `warn`, `deny`, or `log` |
| `storage_backend` | `"file"` | Storage: `file` or `redis` |
| `storage_dir` | `~/.agent-control-drift/observations` | Directory for the file backend |
| `spec_clarity` | `"unambiguous"` | Task clarity: `unambiguous`, `multi_valid`, or `underspecified` |
When drift is detected, `EvaluatorResult.metadata` includes:

```json
{
  "agent_id": "my-agent-001",
  "dimension": "calibration",
  "mean_shift": -0.15,
  "effect_size": 0.82,
  "drift_detected": true,
  "window_size": 7,
  "baseline_size": 10,
  "specification_clarity": "unambiguous",
  "total_observations": 28
}
```

From production validation across two independent systems:
- Window ≥ 5 required: Below 5 observations, drift detection is noise. Validated on Gerundium (3-node swarm) and NexusGuard (19-agent fleet).
- Non-monotonic drift: Agents don't degrade gradually. They show stability → abrupt shift → stability. Rolling windows catch this; cumulative averages blur it.
- Specification clarity matters: under identical prompts, one agent produced a stable 6A/4B split across two reasoning paths. Without `spec_clarity: multi_valid`, this would be flagged as drift.
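The rolling-window point can be shown with a toy series (the numbers here are invented for illustration): an agent that is stable, then shifts abruptly, is obvious to a recent window but diluted in a cumulative average.

```python
from statistics import mean

# Stable at 0.95 for 20 steps, then an abrupt shift down to 0.75
scores = [0.95] * 20 + [0.75] * 7

baseline_mean = mean(scores[:10])    # baseline from the stable prefix
rolling_mean = mean(scores[-7:])     # rolling window sees only the shift
cumulative_mean = mean(scores)       # diluted by the long stable prefix

print(baseline_mean - rolling_mean)     # ≈ 0.20, well past a 0.10 threshold
print(baseline_mean - cumulative_mean)  # ≈ 0.05, drift blurred away
```

With a 0.10 threshold, the rolling window flags the shift while the cumulative comparison misses it entirely.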
Observations are stored as JSON lines in `~/.agent-control-drift/observations/{agent_id}/{dimension}.jsonl`. Appends are atomic via `O_APPEND`, which is sufficient for single-host deployments.
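The append pattern described above can be sketched as follows. This is a minimal illustration of the file layout and `O_APPEND` usage, not the package's code; `record_observation` is a hypothetical helper name:

```python
import json
import os
import time

def record_observation(base_dir, agent_id, dimension, score, metadata=None):
    """Append one observation as a JSON line.

    Opening with O_APPEND makes each write land at end-of-file atomically,
    so concurrent single-host writers don't interleave partial lines.
    """
    dir_path = os.path.join(base_dir, agent_id)
    os.makedirs(dir_path, exist_ok=True)
    line = json.dumps({
        "agent_id": agent_id,
        "dimension": dimension,
        "score": score,
        "timestamp": time.time(),
        "metadata": metadata or {},
    }) + "\n"
    fd = os.open(os.path.join(dir_path, f"{dimension}.jsonl"),
                 os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)
```

Each observation is one self-contained line, so the drift evaluator can read the file back with a simple line-by-line `json.loads`.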
```python
config = DriftEvaluatorConfig(
    storage_backend="redis",
    redis_url="redis://localhost:6379/0",
)
```

The Redis backend uses Redis lists with `RPUSH`/`LRANGE`. Better for multi-host or high-throughput setups.
```bash
git clone https://github.com/nanookclaw/agent-control-drift-evaluator
cd agent-control-drift-evaluator
pip install -e ".[dev]"
pytest
```

MIT
- Agent Control — Runtime guardrails for AI agents
- PDR Paper — Behavioral reliability measurement methodology
- Agent Control Evaluators — Evaluator plugin guide