# Coverage Model
AgentAssay defines a five-dimensional coverage model for AI agents. Each dimension measures a distinct aspect of behavioral space.
In traditional software, code coverage measures which lines/branches were exercised. For AI agents, there is no static source code. Instead, coverage must measure how thoroughly tests exercise the agent's behavioral space.
## Tool Coverage

What it measures: Fraction of available tools that were invoked during testing.
C_tool = |tools invoked| / |total known tools|
Example: Agent has 5 tools: search, calculate, write_file, read_file, send_email. Tests invoke 3 of them → C_tool = 60%.
Why it matters: If your tests never call send_email, you have no confidence that tool works after a change.
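As a concrete sketch, tool coverage can be computed from traces like this. The trace format (a list of step dicts with `type` and `tool` keys) is an illustrative assumption, not AgentAssay's actual schema:

```python
# Sketch: tool coverage = invoked tools / known tools.
# Trace format is assumed for illustration only.
known_tools = {"search", "calculate", "write_file", "read_file", "send_email"}

traces = [
    [{"type": "tool_call", "tool": "search"}, {"type": "llm_response"}],
    [{"type": "tool_call", "tool": "calculate"},
     {"type": "tool_call", "tool": "write_file"}],
]

# Collect every tool name that appears in a tool_call step
invoked = {step["tool"] for trace in traces
           for step in trace if step.get("type") == "tool_call"}

c_tool = len(invoked & known_tools) / len(known_tools)
print(f"C_tool = {c_tool:.0%}")  # 3 of 5 tools -> 60%
```

Intersecting with `known_tools` guards against traces that mention tools outside the declared set.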
## Path Coverage

What it measures: Fraction of distinct action sequences observed during testing.
A "path" is the ordered sequence of action types: llm_response → tool_call → llm_response → tool_call.
Why it matters: Agents take different paths depending on input. If tests only exercise the happy path, edge-case behaviors (retries, fallbacks) are untested.
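A path signature can be extracted by reducing each trace to its ordered tuple of action types. This sketch again assumes a simple list-of-step-dicts trace format for illustration:

```python
from collections import Counter

# Sketch: a "path" is the ordered tuple of action types in one trace.
traces = [
    [{"type": "llm_response"}, {"type": "tool_call"}, {"type": "llm_response"}],
    [{"type": "llm_response"}, {"type": "tool_call"}, {"type": "llm_response"}],
    [{"type": "llm_response"}],  # a shorter, tool-free path
]

# Counter keys are path signatures; values are how often each path occurred
paths = Counter(tuple(step["type"] for step in trace) for trace in traces)
print(f"{len(paths)} distinct paths across {len(traces)} traces")
```

Note that the denominator (total possible paths) cannot be enumerated statically; in practice it must be estimated from the union of all paths ever observed.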
## State Coverage

What it measures: Fraction of distinct intermediate states the agent visited.
States are derived from metadata and tool outputs at each step.
Why it matters: A regression might only manifest when the agent reaches a specific state that your tests never trigger.
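One way to derive states is to hash each step down to a coarse key built from its metadata and tool outcome. The key fields here (`type`, `tool`, error flag) are assumptions for illustration:

```python
# Sketch: derive a coarse state key per step, then count distinct states.
def state_key(step):
    return (
        step.get("type"),
        step.get("tool"),
        "error" if step.get("error") else "ok",
    )

traces = [
    [{"type": "tool_call", "tool": "search", "error": False},
     {"type": "llm_response"}],
    [{"type": "tool_call", "tool": "search", "error": True},  # same tool, new state
     {"type": "llm_response"}],
]

states = {state_key(step) for trace in traces for step in trace}
print(f"{len(states)} distinct states observed")
```

The same tool call in an error condition yields a different state, which is exactly the kind of variation state coverage is meant to capture.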
## Boundary Coverage

What it measures: How well tests exercise edge cases and boundary conditions.
Tracked conditions:
- Maximum step count reached
- Timeout triggered
- Cost limit approached/exceeded
- Empty or minimal inputs
- Error/exception paths
- Tool failure handling
Why it matters: Most agent failures occur at boundaries — when context is full, when a tool errors, when budget runs out.
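The tracked conditions above can be checked per trace and unioned across the suite. This sketch covers five of the conditions with illustrative field names (`steps`, `duration_s`, `cost`, `input`, `errors`) and thresholds that are assumptions, not AgentAssay defaults:

```python
# Sketch: flag which boundary conditions a trace exercises.
MAX_STEPS, TIMEOUT_S, COST_LIMIT = 20, 30.0, 1.00

def boundary_conditions(meta):
    hit = set()
    if meta["steps"] >= MAX_STEPS:
        hit.add("max_steps")
    if meta["duration_s"] >= TIMEOUT_S:
        hit.add("timeout")
    if meta["cost"] >= 0.9 * COST_LIMIT:   # "approached" = within 10% of limit
        hit.add("cost_limit")
    if not meta["input"].strip():
        hit.add("empty_input")
    if meta["errors"]:
        hit.add("error_path")
    return hit

covered = set()
for meta in [
    {"steps": 20, "duration_s": 4.2, "cost": 0.12, "input": "book a flight", "errors": []},
    {"steps": 3, "duration_s": 1.0, "cost": 0.01, "input": "", "errors": ["ToolError"]},
]:
    covered |= boundary_conditions(meta)

c_boundary = len(covered) / 5  # five tracked conditions in this sketch
print(f"C_boundary = {c_boundary:.0%}")
```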
## Model Coverage

What it measures: Fraction of model variants tested.
C_model = |models tested| / |total known models|
Why it matters: An agent that works with GPT-4o may fail with Claude Opus. Model coverage tracks whether your tests validate behavior across all supported backends.
## The Coverage Vector

The five dimensions combine into a single coverage vector:

C = (C_tool, C_path, C_state, C_boundary, C_model)
Example:
C = (0.80, 0.65, 0.72, 0.40, 0.50)
This tells you at a glance:
- ✅ Tool coverage is strong (80%)
- ⚠️ Path coverage is moderate (65%)
- ⚠️ State coverage is moderate (72%)
- ❌ Boundary coverage is weak (40%)
- ❌ Model coverage is limited (50%)
## Overall Score

The overall score is the geometric mean of all five dimensions:
C_overall = (C_tool × C_path × C_state × C_boundary × C_model)^(1/5)
Why geometric mean? It penalizes low scores in any dimension. A test suite with 100% tool coverage but 0% boundary coverage gets a score of 0%, not 80%.
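Applying the formula to the worked example vector above reproduces the overall score reported by the CLI:

```python
import math

# Dimension scores from the worked example: C = (0.80, 0.65, 0.72, 0.40, 0.50)
dims = {"tool": 0.80, "path": 0.65, "state": 0.72, "boundary": 0.40, "model": 0.50}

# Geometric mean: product of all scores, then the 5th root
overall = math.prod(dims.values()) ** (1 / len(dims))
print(f"Overall: {overall:.0%}, weakest: {min(dims, key=dims.get)}")

# The penalty property: zero out any one dimension and the score collapses
assert math.prod([*dims.values()][:-1] + [0.0]) ** (1 / 5) == 0.0
```

The result is roughly 60%, noticeably below the 61% arithmetic mean, because the weak boundary dimension drags the whole score down.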
| Score | Status |
|---|---|
| >= 80% | 🟢 Strong coverage |
| 50-79% | 🟡 Moderate coverage — consider adding scenarios |
| < 50% | 🔴 Weak coverage — significant gaps exist |
## How to Improve Each Dimension

| Dimension | How to Improve |
|---|---|
| Tool | Add scenarios that require different tool combinations |
| Path | Add scenarios with alternative paths (error cases, ambiguous inputs) |
| State | Add multi-step scenarios with varied intermediate states |
| Boundary | Add edge cases: empty inputs, max-length inputs, timeout-inducing tasks |
| Model | Run the same test suite against multiple model backends |
## CLI Usage

```bash
# View coverage for a results file
agentassay coverage --results trials.json --tools search,calculate,write

# Specify known models for model coverage
agentassay coverage --results trials.json --models gpt-4o,claude-opus-4-6
```

Output:
```
====== AgentAssay Coverage Report ======
Tool Coverage: ████████████████░░░░ 80% (4/5 tools)
Path Coverage: █████████████░░░░░░░ 65% (13/20 paths)
State Coverage: ██████████████░░░░░░ 72% (18/25 states)
Boundary Coverage: ████████░░░░░░░░░░░░ 40% (2/5 conditions)
Model Coverage: ██████████░░░░░░░░░░ 50% (1/2 models)
Overall Score: ████████████░░░░░░░░ 60% [MODERATE]
Weakest Dimension: boundary (40%)
Analyzed 50 traces, observed 4 tools, 13 unique paths.
```
## Python API

```python
from agentassay.coverage import AgentCoverageCollector

collector = AgentCoverageCollector(
    known_tools={"search", "calculate", "write_file"},
    known_models={"gpt-4o", "claude-opus-4-6"},
)

# Feed execution traces
for trace in execution_traces:
    collector.update(trace)

# Get coverage snapshot
snapshot = collector.snapshot()
print(f"Overall: {snapshot.overall:.2%}")
print(f"Weakest: {snapshot.weakest}")
print(f"Dimensions: {snapshot.dimensions}")
```

Coverage can be computed from existing traces:
```python
# Load production traces (zero token cost)
traces = load_from_monitoring("production-2026-03/")

# Compute coverage offline
collector = AgentCoverageCollector(known_tools=["search", "book", "cancel"])
coverage = collector.compute(traces)
print(f"Production coverage: {coverage.overall:.1%}")
```

## See Also

- Mutation Testing — Evaluate test suite sensitivity
- Token-Efficient Testing — Combine coverage with adaptive budgeting
- Quick Start — Try coverage analysis in 5 minutes
Part of Qualixar | Author: Varun Pratap Bhardwaj