# Coverage Model
AgentAssay defines a five-dimensional coverage model for AI agents. Each dimension measures a distinct aspect of behavioral space.
In traditional software, code coverage measures which lines/branches were exercised. For AI agents, there is no static source code. Instead, coverage must measure how thoroughly tests exercise the agent's behavioral space.
## Tool Coverage

What it measures: Fraction of available tools that were invoked during testing.
C_tool = |tools invoked| / |total known tools|
Example: Agent has 5 tools: search, calculate, write_file, read_file, send_email. Tests invoke 3 of them → C_tool = 60%.
Why it matters: If your tests never call send_email, you have no confidence that tool works after a change.
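As a concrete sketch, tool coverage can be computed from traces like this. The trace format (a list of step dicts with `type` and `tool` keys) is an illustrative assumption, not AgentAssay's actual schema:

```python
# Sketch: tool coverage = invoked tools / known tools.
# Trace format is assumed for illustration only.
known_tools = {"search", "calculate", "write_file", "read_file", "send_email"}

traces = [
    [{"type": "tool_call", "tool": "search"}, {"type": "llm_response"}],
    [{"type": "tool_call", "tool": "calculate"},
     {"type": "tool_call", "tool": "write_file"}],
]

# Collect every tool name that appears in a tool_call step
invoked = {step["tool"] for trace in traces
           for step in trace if step.get("type") == "tool_call"}

c_tool = len(invoked & known_tools) / len(known_tools)
print(f"C_tool = {c_tool:.0%}")  # 3 of 5 tools -> 60%
```

Intersecting with `known_tools` guards against traces that mention tools outside the declared set.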
## Path Coverage

What it measures: Fraction of distinct action sequences observed during testing.
A "path" is the ordered sequence of action types: llm_response → tool_call → llm_response → tool_call.
Why it matters: Agents take different paths depending on input. If tests only exercise the happy path, edge-case behaviors (retries, fallbacks) are untested.
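A path signature can be extracted by reducing each trace to its ordered tuple of action types. This sketch again assumes a simple list-of-step-dicts trace format for illustration:

```python
from collections import Counter

# Sketch: a "path" is the ordered tuple of action types in one trace.
traces = [
    [{"type": "llm_response"}, {"type": "tool_call"}, {"type": "llm_response"}],
    [{"type": "llm_response"}, {"type": "tool_call"}, {"type": "llm_response"}],
    [{"type": "llm_response"}],  # a shorter, tool-free path
]

# Counter keys are path signatures; values are how often each path occurred
paths = Counter(tuple(step["type"] for step in trace) for trace in traces)
print(f"{len(paths)} distinct paths across {len(traces)} traces")
```

Note that the denominator (total possible paths) cannot be enumerated statically; in practice it must be estimated from the union of all paths ever observed.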
## State Coverage

What it measures: Fraction of distinct intermediate states the agent visited.
States are derived from metadata and tool outputs at each step.
Why it matters: A regression might only manifest when the agent reaches a specific state that your tests never trigger.
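One way to derive states is to hash each step down to a coarse key built from its metadata and tool outcome. The key fields here (`type`, `tool`, error flag) are assumptions for illustration:

```python
# Sketch: derive a coarse state key per step, then count distinct states.
def state_key(step):
    return (
        step.get("type"),
        step.get("tool"),
        "error" if step.get("error") else "ok",
    )

traces = [
    [{"type": "tool_call", "tool": "search", "error": False},
     {"type": "llm_response"}],
    [{"type": "tool_call", "tool": "search", "error": True},  # same tool, new state
     {"type": "llm_response"}],
]

states = {state_key(step) for trace in traces for step in trace}
print(f"{len(states)} distinct states observed")
```

The same tool call in an error condition yields a different state, which is exactly the kind of variation state coverage is meant to capture.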
## Boundary Coverage

What it measures: How well tests exercise edge cases and boundary conditions.
Tracked conditions:
- Maximum step count reached
- Timeout triggered
- Cost limit approached/exceeded
- Empty or minimal inputs
- Error/exception paths
- Tool failure handling
Why it matters: Most agent failures occur at boundaries — when context is full, when a tool errors, when budget runs out.
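The tracked conditions above can be checked per trace and unioned across the suite. This sketch covers five of the conditions with illustrative field names (`steps`, `duration_s`, `cost`, `input`, `errors`) and thresholds that are assumptions, not AgentAssay defaults:

```python
# Sketch: flag which boundary conditions a trace exercises.
MAX_STEPS, TIMEOUT_S, COST_LIMIT = 20, 30.0, 1.00

def boundary_conditions(meta):
    hit = set()
    if meta["steps"] >= MAX_STEPS:
        hit.add("max_steps")
    if meta["duration_s"] >= TIMEOUT_S:
        hit.add("timeout")
    if meta["cost"] >= 0.9 * COST_LIMIT:   # "approached" = within 10% of limit
        hit.add("cost_limit")
    if not meta["input"].strip():
        hit.add("empty_input")
    if meta["errors"]:
        hit.add("error_path")
    return hit

covered = set()
for meta in [
    {"steps": 20, "duration_s": 4.2, "cost": 0.12, "input": "book a flight", "errors": []},
    {"steps": 3, "duration_s": 1.0, "cost": 0.01, "input": "", "errors": ["ToolError"]},
]:
    covered |= boundary_conditions(meta)

c_boundary = len(covered) / 5  # five tracked conditions in this sketch
print(f"C_boundary = {c_boundary:.0%}")
```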
## Model Coverage

What it measures: Fraction of model variants tested.
C_model = |models tested| / |total known models|
Why it matters: An agent that works with GPT-4o may fail with Claude Opus. Model coverage tracks whether your tests validate behavior across all supported backends.
## The Coverage Vector

The five dimensions combine into a single coverage vector:

C = (C_tool, C_path, C_state, C_boundary, C_model)
Example:
C = (0.80, 0.65, 0.72, 0.40, 0.50)
This tells you at a glance:
- ✅ Tool coverage is strong (80%)
- ⚠️ Path coverage is moderate (65%)
- ⚠️ State coverage is moderate (72%)
- ❌ Boundary coverage is weak (40%)
- ❌ Model coverage is limited (50%)
## Overall Score

The overall score is the geometric mean of all five dimensions:
C_overall = (C_tool × C_path × C_state × C_boundary × C_model)^(1/5)
Why geometric mean? It penalizes low scores in any dimension. A test suite with 100% tool coverage but 0% boundary coverage gets a score of 0%, not 80%.
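Applying the formula to the worked example vector above reproduces the overall score reported by the CLI:

```python
import math

# Dimension scores from the worked example: C = (0.80, 0.65, 0.72, 0.40, 0.50)
dims = {"tool": 0.80, "path": 0.65, "state": 0.72, "boundary": 0.40, "model": 0.50}

# Geometric mean: product of all scores, then the 5th root
overall = math.prod(dims.values()) ** (1 / len(dims))
print(f"Overall: {overall:.0%}, weakest: {min(dims, key=dims.get)}")

# The penalty property: zero out any one dimension and the score collapses
assert math.prod([*dims.values()][:-1] + [0.0]) ** (1 / 5) == 0.0
```

The result is roughly 60%, noticeably below the 61% arithmetic mean, because the weak boundary dimension drags the whole score down.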
| Score | Status |
|---|---|
| >= 80% | 🟢 Strong coverage |
| 50-79% | 🟡 Moderate coverage — consider adding scenarios |
| < 50% | 🔴 Weak coverage — significant gaps exist |
## How to Improve Each Dimension

| Dimension | How to Improve |
|---|---|
| Tool | Add scenarios that require different tool combinations |
| Path | Add scenarios with alternative paths (error cases, ambiguous inputs) |
| State | Add multi-step scenarios with varied intermediate states |
| Boundary | Add edge cases: empty inputs, max-length inputs, timeout-inducing tasks |
| Model | Run the same test suite against multiple model backends |
## CLI Usage

```bash
# View coverage for a results file
agentassay coverage --results trials.json --tools search,calculate,write

# Specify known models for model coverage
agentassay coverage --results trials.json --models gpt-4o,claude-opus-4-6
```

Output:
```
====== AgentAssay Coverage Report ======
Tool Coverage: ████████████████░░░░ 80% (4/5 tools)
Path Coverage: █████████████░░░░░░░ 65% (13/20 paths)
State Coverage: ██████████████░░░░░░ 72% (18/25 states)
Boundary Coverage: ████████░░░░░░░░░░░░ 40% (2/5 conditions)
Model Coverage: ██████████░░░░░░░░░░ 50% (1/2 models)
Overall Score: ████████████░░░░░░░░ 60% [MODERATE]
Weakest Dimension: boundary (40%)
Analyzed 50 traces, observed 4 tools, 13 unique paths.
```
## Python API

```python
from agentassay.coverage import AgentCoverageCollector

collector = AgentCoverageCollector(
    known_tools={"search", "calculate", "write_file"},
    known_models={"gpt-4o", "claude-opus-4-6"},
)

# Feed execution traces
for trace in execution_traces:
    collector.update(trace)

# Get coverage snapshot
snapshot = collector.snapshot()
print(f"Overall: {snapshot.overall:.2%}")
print(f"Weakest: {snapshot.weakest}")
print(f"Dimensions: {snapshot.dimensions}")
```

Coverage can be computed from existing traces:
```python
# Load production traces (zero token cost)
traces = load_from_monitoring("production-2026-03/")

# Compute coverage offline
collector = AgentCoverageCollector(known_tools=["search", "book", "cancel"])
coverage = collector.compute(traces)
print(f"Production coverage: {coverage.overall:.1%}")
```

## See Also

- Mutation Testing — Evaluate test suite sensitivity
- Token-Efficient Testing — Combine coverage with adaptive budgeting
- Quick Start — Try coverage analysis in 5 minutes
Part of Qualixar | Author: Varun Pratap Bhardwaj