Coverage Model

Varun Pratap Bhardwaj edited this page Mar 6, 2026 · 1 revision

AgentAssay defines a five-dimensional coverage model for AI agents. Each dimension measures a distinct aspect of behavioral space.

Why Agent Coverage Is Different

In traditional software, code coverage measures which lines/branches were exercised. For AI agents, there is no static source code. Instead, coverage must measure how thoroughly tests exercise the agent's behavioral space.

The Five Dimensions

1. Tool Coverage (C_tool)

What it measures: Fraction of available tools that were invoked during testing.

C_tool = |tools invoked| / |total known tools|

Example: Agent has 5 tools: search, calculate, write_file, read_file, send_email. Tests invoke 3 of them → C_tool = 60%.

Why it matters: If your tests never call send_email, you have no confidence that tool works after a change.
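The formula above can be sketched in a few lines. This is an illustrative implementation, not AgentAssay's internal one; the trace shape (a list of invoked tool names per trace) is an assumption.

```python
# Hypothetical sketch: tool coverage as the fraction of known tools
# that appear at least once across a set of execution traces.

def tool_coverage(known_tools: set, traces: list) -> float:
    """Return |tools invoked| / |total known tools|."""
    invoked = {tool for trace in traces for tool in trace if tool in known_tools}
    return len(invoked) / len(known_tools) if known_tools else 0.0

# Example from the text: 5 known tools, tests invoke 3 of them.
known = {"search", "calculate", "write_file", "read_file", "send_email"}
traces = [["search", "calculate"], ["search", "write_file"]]
print(tool_coverage(known, traces))  # 0.6
```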

2. Path Coverage (C_path)

What it measures: Fraction of known execution paths (ordered action sequences) exercised during testing.

A "path" is the ordered sequence of action types: llm_response → tool_call → llm_response → tool_call.

Why it matters: Agents take different paths depending on input. If tests only exercise the happy path, edge-case behaviors (retries, fallbacks) are untested.
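Counting distinct paths comes down to collapsing each trace into its ordered action-type signature. The step format below (dicts with a `type` field) is an assumption for illustration, not AgentAssay's real schema.

```python
# Hypothetical sketch: a "path" is the ordered sequence of action types
# in a trace; the number of distinct signatures is the numerator of C_path.

def path_signature(trace: list) -> tuple:
    """Collapse a trace into its ordered sequence of action types."""
    return tuple(step["type"] for step in trace)

traces = [
    [{"type": "llm_response"}, {"type": "tool_call"}, {"type": "llm_response"}],
    [{"type": "llm_response"}, {"type": "tool_call"}, {"type": "llm_response"}],
    [{"type": "llm_response"}],  # degenerate path: answered without any tool
]
unique_paths = {path_signature(t) for t in traces}
print(len(unique_paths))  # 2
```

Note that the first two traces collapse to the same signature, so they count as one path.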

3. State Coverage (C_state)

What it measures: Fraction of known intermediate states the agent visited during testing.

States are derived from metadata and tool outputs at each step.

Why it matters: A regression might only manifest when the agent reaches a specific state that your tests never trigger.
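One way to make "distinct states" countable is to reduce each step to a coarse key derived from its metadata and tool outcome. The fields used here (tool name, success flag) are illustrative assumptions.

```python
# Hypothetical sketch: derive a state key per step so distinct
# intermediate states can be counted for C_state.

def state_key(step: dict) -> tuple:
    """Reduce a step to a hashable key over the metadata we track."""
    return (step.get("tool"), step.get("success"))

steps = [
    {"tool": "search", "success": True},
    {"tool": "search", "success": False},  # same tool, different outcome
    {"tool": "search", "success": True},   # duplicate state, counted once
]
visited = {state_key(s) for s in steps}
print(len(visited))  # 2
```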

4. Boundary Coverage (C_boundary)

What it measures: How well tests exercise edge cases and boundary conditions.

Tracked conditions:

  • Maximum step count reached
  • Timeout triggered
  • Cost limit approached/exceeded
  • Empty or minimal inputs
  • Error/exception paths
  • Tool failure handling

Why it matters: Most agent failures occur at boundaries — when context is full, when a tool errors, when budget runs out.
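The tracked conditions above can be checked mechanically against a trace. The limits, thresholds, and trace fields below are illustrative assumptions; C_boundary is then the fraction of tracked conditions that any test hit.

```python
# Hypothetical sketch: flag which boundary conditions one trace exercises.

MAX_STEPS = 10    # assumed step budget
COST_LIMIT = 1.00  # assumed cost budget in dollars

def boundary_conditions(trace: dict) -> set:
    """Return the set of tracked boundary conditions this trace hit."""
    hit = set()
    if trace["steps"] >= MAX_STEPS:
        hit.add("max_steps")
    if trace.get("timed_out"):
        hit.add("timeout")
    if trace["cost"] >= 0.9 * COST_LIMIT:  # "approached" = within 10%
        hit.add("cost_limit")
    if not trace["input"].strip():
        hit.add("empty_input")
    if trace.get("error"):
        hit.add("error_path")
    return hit

trace = {"steps": 10, "timed_out": False, "cost": 0.95, "input": "", "error": None}
print(sorted(boundary_conditions(trace)))  # ['cost_limit', 'empty_input', 'max_steps']
```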

5. Model Coverage (C_model)

What it measures: Fraction of model variants tested.

C_model = |models tested| / |total known models|

Why it matters: An agent that works with GPT-4o may fail with Claude Opus. Model coverage tracks whether your tests validate behavior across all supported backends.

The Coverage Tuple

C = (C_tool, C_path, C_state, C_boundary, C_model)

Example:

C = (0.80, 0.65, 0.72, 0.40, 0.50)

This tells you at a glance:

  • ✅ Tool coverage is strong (80%)
  • ⚠️ Path coverage is moderate (65%)
  • ⚠️ State coverage is moderate (72%)
  • ❌ Boundary coverage is weak (40%)
  • ❌ Model coverage is limited (50%)

Overall Coverage Score

The geometric mean of all five dimensions:

C_overall = (C_tool × C_path × C_state × C_boundary × C_model)^(1/5)

Why geometric mean? It penalizes low scores in any dimension. A test suite with 100% tool coverage but 0% boundary coverage gets a score of 0%, not 80%.
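The overall score is straightforward to compute; this sketch reproduces the example tuple above and shows the zero-propagation property.

```python
import math

def overall_coverage(dims: tuple) -> float:
    """Geometric mean of the coverage dimensions."""
    return math.prod(dims) ** (1 / len(dims))

# Example tuple from the text:
c = (0.80, 0.65, 0.72, 0.40, 0.50)
print(f"{overall_coverage(c):.0%}")  # 60%

# A perfect tool score cannot mask a zero boundary score:
print(overall_coverage((1.0, 1.0, 1.0, 0.0, 1.0)))  # 0.0
```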

Interpreting Coverage

Score    Status
>= 80%   🟢 Strong coverage
50-79%   🟡 Moderate coverage — consider adding scenarios
< 50%    🔴 Weak coverage — significant gaps exist

Improving Coverage

Dimension   How to Improve
Tool        Add scenarios that require different tool combinations
Path        Add scenarios with alternative paths (error cases, ambiguous inputs)
State       Add multi-step scenarios with varied intermediate states
Boundary    Add edge cases: empty inputs, max-length inputs, timeout-inducing tasks
Model       Run the same test suite against multiple model backends

Using Coverage from CLI

# View coverage for a results file
agentassay coverage --results trials.json --tools search,calculate,write

# Specify known models for model coverage
agentassay coverage --results trials.json --models gpt-4o,claude-opus-4-6

Output:

====== AgentAssay Coverage Report ======

Tool Coverage:        ████████████████░░░░  80% (4/5 tools)
Path Coverage:        █████████████░░░░░░░  65% (13/20 paths)
State Coverage:       ██████████████░░░░░░  72% (18/25 states)
Boundary Coverage:    ████████░░░░░░░░░░░░  40% (2/5 conditions)
Model Coverage:       ██████████░░░░░░░░░░  50% (1/2 models)

Overall Score:        ████████████░░░░░░░░  60%  [MODERATE]
Weakest Dimension:    boundary (40%)

Analyzed 50 traces, observed 4 tools, 13 unique paths.

Using Coverage from Python

from agentassay.coverage import AgentCoverageCollector

collector = AgentCoverageCollector(
    known_tools={"search", "calculate", "write_file"},
    known_models={"gpt-4o", "claude-opus-4-6"},
)

# Feed execution traces
for trace in execution_traces:
    collector.update(trace)

# Get coverage snapshot
snapshot = collector.snapshot()
print(f"Overall: {snapshot.overall:.2%}")
print(f"Weakest: {snapshot.weakest}")
print(f"Dimensions: {snapshot.dimensions}")

Offline Coverage (Zero Token Cost)

Coverage can be computed from existing traces:

# Load production traces (zero token cost)
traces = load_from_monitoring("production-2026-03/")

# Compute coverage offline
collector = AgentCoverageCollector(known_tools=["search", "book", "cancel"])
coverage = collector.compute(traces)

print(f"Production coverage: {coverage.overall:.1%}")

Part of Qualixar | Author: Varun Pratap Bhardwaj