Test More. Spend Less. Ship Confident.
The first agent testing framework that delivers statistical guarantees WITHOUT burning your token budget.
A Qualixar Research Initiative by Varun Pratap Bhardwaj
Every time you change a prompt, swap a model, or update a tool, you need to know: does my agent still work?
Today, answering that question is painfully expensive. Run 100 trials across 20 scenarios, and you've burned thousands of tokens just to check for a regression. Most teams either:
- Over-test: Run fixed-N trials and waste budget on scenarios that don't need it.
- Under-test: Skip testing because the cost is too high, and ship broken agents.
- Guess: Run a few trials, eyeball the results, and hope for the best.
None of these are engineering. They are gambling.
AgentAssay introduces token-efficient agent testing -- three techniques that deliver the same statistical confidence at a fraction of the cost:
Instead of comparing raw text outputs (high-dimensional, noisy, expensive), AgentAssay extracts behavioral fingerprints -- compact representations of what the agent did rather than what it said. Tool sequences, state transitions, decision patterns. Low-dimensional signals need fewer samples to detect change.
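The fingerprint idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the AgentAssay API: it assumes each trace reduces to an ordered list of tool names, and uses a normalized bigram histogram with L1 distance as the compact representation.

```python
from collections import Counter

def fingerprint(traces):
    """Collapse traces into a low-dimensional behavioral signature:
    a normalized histogram of consecutive tool-call pairs (bigrams)."""
    counts = Counter()
    for tools in traces:  # each trace: tool names in call order
        counts.update(zip(tools, tools[1:]))
    total = sum(counts.values()) or 1
    return {pair: n / total for pair, n in counts.items()}

def distance(fp_a, fp_b):
    """L1 distance between fingerprints; 0.0 means identical behavior."""
    keys = set(fp_a) | set(fp_b)
    return sum(abs(fp_a.get(k, 0.0) - fp_b.get(k, 0.0)) for k in keys)

baseline = fingerprint([["search", "filter", "book"], ["search", "book"]])
current = fingerprint([["search", "book"], ["search", "book"]])
drift = distance(baseline, current)  # > 0: the "filter" step disappeared
```

Because the signal lives in a handful of histogram buckets rather than free-form text, a shift like a dropped tool call shows up after only a few runs.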
No more guessing how many trials to run. AgentAssay runs a small calibration set (5-10 runs), measures behavioral variance, and computes the exact minimum number of trials needed for your target confidence level. High-variance scenarios get more trials. Stable scenarios get fewer. Zero waste.
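The calibration step is standard sample-size mathematics. Here is a minimal sketch (my own, not AgentAssay's internals) using the textbook two-sided formula n = ((z₁₋α/₂ + z₁₋β) · σ / δ)², where σ is the variance measured during calibration and δ is the smallest score shift worth detecting:

```python
import math
from statistics import NormalDist, pstdev

def recommended_n(calibration_scores, min_effect=0.10, alpha=0.05, beta=0.10):
    """Textbook sample size: trials needed to detect a shift of `min_effect`
    in the mean score at significance alpha with power 1 - beta."""
    sigma = pstdev(calibration_scores)  # behavioral variance from calibration runs
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha / 2) + z(1 - beta)) * sigma / min_effect) ** 2
    return max(5, math.ceil(n))  # keep a small floor of 5 trials

stable = [0.9, 1.0, 1.0, 0.8, 1.0, 0.9, 1.0, 1.0, 0.9, 1.0]  # low variance
noisy  = [0.2, 1.0, 0.4, 0.9, 0.1, 0.8, 0.3, 1.0, 0.5, 0.7]  # high variance
print(recommended_n(stable), recommended_n(noisy))  # stable scenario needs far fewer trials
```

The same 10-run calibration budget yields very different recommendations: the stable scenario hits the 5-trial floor, while the noisy one legitimately needs a large N.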
Coverage metrics, contract checks, metamorphic relations, and mutation analysis can all run on production traces you already have -- at zero additional token cost. Why re-run your agent when you can analyze runs that already happened?
Result: Same confidence. 83% less cost.
```shell
pip install agentassay                    # Core (works with CustomAdapter)
pip install agentassay[langgraph]         # + LangGraph support
pip install agentassay[crewai]            # + CrewAI support
pip install agentassay[all]               # All framework adapters
```

AgentAssay works with every major agent framework — zero lock-in, plug-and-play.
| Framework | Install | Adapter |
|---|---|---|
| LangGraph | `pip install agentassay[langgraph]` | `LangGraphAdapter` |
| CrewAI | `pip install agentassay[crewai]` | `CrewAIAdapter` |
| AutoGen | `pip install agentassay[autogen]` | `AutoGenAdapter` |
| OpenAI Agents | `pip install agentassay[openai]` | `OpenAIAgentsAdapter` |
| smolagents | `pip install agentassay[smolagents]` | `SmolAgentsAdapter` |
| Semantic Kernel | `pip install agentassay[semantic-kernel]` | `SemanticKernelAdapter` |
| AWS Bedrock Agents | `pip install agentassay[bedrock]` | `BedrockAgentsAdapter` |
| MCP | `pip install agentassay[mcp]` | `MCPToolsAdapter` |
| Vertex AI Agents | `pip install agentassay[vertex]` | `VertexAIAgentsAdapter` |
| Any custom agent | `pip install agentassay` | `CustomAdapter` |
Don't see your framework? Use `CustomAdapter` -- wrap any callable that returns execution traces.
LangGraph:

```python
from agentassay.integrations import LangGraphAdapter

adapter = LangGraphAdapter(graph=your_graph)
trace = adapter.run({"query": "Book a flight from NYC to London"})
print(f"Steps: {len(trace.steps)}, Cost: ${trace.total_cost_usd:.4f}")
```

CrewAI:

```python
from agentassay.integrations import CrewAIAdapter

adapter = CrewAIAdapter(crew=your_crew)
trace = adapter.run({"task": "Research protein folding"})
print(f"Success: {trace.success}, Duration: {trace.total_duration_ms}ms")
```

Any framework:

```python
from agentassay.integrations import CustomAdapter

def my_agent_fn(input_data):
    # Your agent logic here
    return execution_trace

adapter = CustomAdapter(callable_fn=my_agent_fn)
trace = adapter.run({"query": "Hello world"})
```

Try the demo:
```shell
# See it in action instantly (no config needed)
agentassay demo
```

```python
from agentassay.efficiency import BehavioralFingerprint, AdaptiveBudgetOptimizer
from agentassay.core.runner import TrialRunner
from agentassay.verdicts import VerdictFunction

# Step 1: Calibrate -- run just 10 trials to measure variance
optimizer = AdaptiveBudgetOptimizer(alpha=0.05, beta=0.10)
estimate = optimizer.calibrate(calibration_traces)
print(f"Recommended trials: {estimate.recommended_n}")               # e.g., 17 (not 100)
print(f"Estimated cost: ${estimate.estimated_cost_usd:.2f}")         # e.g., $0.34
print(f"Savings vs fixed-100: {estimate.savings_vs_fixed_100:.0%}")  # e.g., 83%

# Step 2: Run only the trials you need
runner = TrialRunner(agent_fn=my_agent, config=config)
results = runner.run_trials(scenario, n=estimate.recommended_n)

# Step 3: Compare fingerprints for regression detection
baseline_fp = BehavioralFingerprint.from_traces(baseline_traces)
current_fp = BehavioralFingerprint.from_traces(current_traces)
drift = baseline_fp.distance(current_fp)

# Step 4: Get a statistically backed verdict
verdict = VerdictFunction(alpha=0.05).evaluate(results)
print(f"Verdict: {verdict.status}")  # PASS / FAIL / INCONCLUSIVE
print(f"Pass rate: {verdict.pass_rate:.1%} [{verdict.ci_lower:.1%}, {verdict.ci_upper:.1%}]")
```

Token-Efficient Testing Pipeline
```
+-----------------------------------------------------------------+
|                                                                 |
|  Production Traces -----> Trace Store -----> Offline Analysis   |
|  (already paid for)                          (coverage, contracts,|
|                                               metamorphic -- FREE)|
|                                                                 |
|  New Agent Version --> Calibration (5-10 runs) --> Budget Estimate|
|                                                        |        |
|                        Targeted Testing (optimal N) --> Fingerprint|
|                                                        Comparison |
|                                                        |        |
|                                             Statistical Verdict |
|                                             (5-20x cheaper)     |
+-----------------------------------------------------------------+
```
The core insight: most of the information you need to test an agent is already in traces you have collected. AgentAssay extracts maximum signal from minimum runs.
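Tool coverage is the simplest case of trace-first analysis, and it can be sketched in plain Python (illustrative only, not the AgentAssay implementation); the sketch assumes each stored trace reduces to the list of tools it invoked:

```python
def tool_coverage(traces, declared_tools):
    """Fraction of declared tools exercised by at least one recorded run.
    Pure post-hoc analysis of stored traces: no new agent calls, zero tokens."""
    used = {tool for trace in traces for tool in trace}
    return len(used & set(declared_tools)) / len(declared_tools)

traces = [["search", "book"], ["search", "cancel"]]
print(tool_coverage(traces, ["search", "book", "cancel", "refund"]))  # -> 0.75
```

A result below 1.0 tells you which declared tools your production traffic has never exercised ("refund" here), so you can target new test trials at exactly those gaps.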
| Feature | Description |
|---|---|
| Behavioral fingerprinting | Detect regression from behavioral patterns, not raw text. Fewer samples needed. |
| Adaptive budget optimization | Calibrate variance, compute exact minimum N. No over-testing. |
| Trace-first offline analysis | Run coverage, contracts, and metamorphic checks on existing traces. Zero token cost. |
| Multi-fidelity proxy testing | Use cheaper models for initial screening, expensive models only for confirmation. |
| Warm-start sequential testing | Incorporate prior results to reach verdicts faster. |
| Three-valued verdicts | PASS, FAIL, or INCONCLUSIVE -- never a misleading binary answer. |
| Confidence intervals | Know the true pass rate range, not a point estimate. |
| Statistical regression detection | Hypothesis tests catch regressions before production. |
| 5D coverage metrics | Measure tool, path, state, boundary, and model coverage. |
| Mutation testing | Perturb your agent to validate test sensitivity. |
| Metamorphic testing | Verify behavioral invariants across input transformations. |
| Contract oracle | Check behavioral specifications from AgentAssert contracts. |
| Deployment gates | Block broken deployments in CI/CD with statistical evidence. |
| Framework adapters | Works with popular agent frameworks out of the box. |
| pytest integration | Use familiar pytest conventions with statistical assertions. |
| CLI | Commands for demos, trial runs, regression comparison, coverage analysis, mutation testing, and reporting. |
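The three-valued verdict logic can be illustrated with a Wilson score interval, a common choice for binomial confidence intervals (whether AgentAssay uses Wilson internally is an assumption here): a scenario passes only when the entire interval clears the threshold.

```python
from math import sqrt

def verdict(successes, n, threshold=0.80, z=1.96):
    """Three-valued verdict from a Wilson score interval: PASS if the whole
    interval clears the threshold, FAIL if it sits entirely below, and
    INCONCLUSIVE when the interval straddles it (i.e., run more trials)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    lo, hi = center - half, center + half
    if lo >= threshold:
        return "PASS", lo, hi
    if hi < threshold:
        return "FAIL", lo, hi
    return "INCONCLUSIVE", lo, hi

print(verdict(27, 30))  # 90% observed, yet the interval dips below 80%
```

Note that 27/30 looks like a comfortable 90% pass rate, but the lower bound of the interval (~74%) falls below the 80% bar, so the honest answer is INCONCLUSIVE -- exactly the case a binary verdict would paper over.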
| Feature | AgentAssay | deepeval | agentrial | LangSmith |
|---|---|---|---|---|
| Statistical regression testing | ✅ | ❌ | ❌ | ❌ |
| Three-valued verdicts | ✅ | ❌ | ❌ | ❌ |
| Token-efficient testing | ✅ | ❌ | ❌ | ❌ |
| Behavioral fingerprinting | ✅ | ❌ | ❌ | ❌ |
| Adaptive budget optimization | ✅ | ❌ | ❌ | ❌ |
| Trace-first offline analysis | ✅ | ❌ | ❌ | ❌ |
| 5D coverage metrics | ✅ | ❌ | ❌ | ❌ |
| Mutation testing | ✅ | ❌ | ❌ | ❌ |
| Metamorphic testing | ✅ | ❌ | ❌ | ❌ |
| CI/CD deployment gates | ✅ | ❌ | ✅ | ❌ |
| Published research paper | ✅ | ❌ | ❌ | ❌ |
```
+-------------------------------------------------------------------+
| Layer 6: Efficiency                                               |
|   Fingerprinting | Budget Optimization | Trace Analysis           |
|   Multi-Fidelity | Warm-Start Sequential                          |
+-------------------------------------------------------------------+
| Layer 5: Integration                                              |
|   Framework Adapters | pytest Plugin | CLI | Reporting            |
+-------------------------------------------------------------------+
| Layer 4: Analysis                                                 |
|   Coverage (5D) | Mutation | Metamorphic | Contract Oracle        |
+-------------------------------------------------------------------+
| Layer 3: Verdicts                                                 |
|   Stochastic Verdicts | Deployment Gates                          |
+-------------------------------------------------------------------+
| Layer 2: Statistics                                               |
|   Hypothesis Tests | Confidence Intervals | SPRT | Effect Size    |
+-------------------------------------------------------------------+
| Layer 1: Core                                                     |
|   Data Models | Execution Engine | Trace Format                   |
+-------------------------------------------------------------------+
```
Layer 6 (Efficiency) is the differentiator. It sits atop the full statistical testing stack, optimizing how many runs are needed while Layers 1-5 ensure every run produces rigorous results.
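Layer 2's SPRT is the engine behind warm-start sequential testing. A minimal Wald SPRT for a Bernoulli pass rate might look like the sketch below (my own illustration under assumed hypotheses p0/p1, not AgentAssay's internal code); warm-starting then amounts to seeding the log-likelihood ratio with prior evidence instead of zero.

```python
from math import log

def sprt(outcomes, p0=0.70, p1=0.85, alpha=0.05, beta=0.10, prior_llr=0.0):
    """Wald's SPRT for a Bernoulli pass rate: stop as soon as the cumulative
    log-likelihood ratio crosses a decision boundary. `prior_llr` lets a
    warm start carry evidence over from earlier test sessions."""
    upper = log((1 - beta) / alpha)   # accept H1: pass rate >= p1
    lower = log(beta / (1 - alpha))   # accept H0: pass rate <= p0 (regression)
    llr = prior_llr
    for i, ok in enumerate(outcomes, start=1):
        llr += log(p1 / p0) if ok else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "PASS", i   # verdict reached after only i trials
        if llr <= lower:
            return "FAIL", i
    return "INCONCLUSIVE", len(outcomes)

print(sprt([True] * 30))  # a clean streak ends the test well before 30 trials
```

The early-stopping property is where the token savings come from: a clearly passing (or clearly broken) agent triggers a verdict after a fraction of the fixed-N budget, and a nonzero `prior_llr` shortens it further.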
```python
import pytest

@pytest.mark.agentassay(n=30, threshold=0.80)
def test_agent_booking_flow(trial_runner):
    runner = trial_runner(my_agent)
    scenario = TestScenario(
        scenario_id="booking",
        name="Flight booking",
        input_data={"task": "Book a flight from NYC to London"},
        expected_properties={"max_steps": 10, "must_use_tools": ["search", "book"]},
    )
    results = runner.run_trials(scenario)
    assert_pass_rate(results, threshold=0.80, confidence=0.95)
```

```shell
python -m pytest tests/ -v --agentassay
```

AgentAssay provides 8 commands for testing, analysis, and reporting:
```shell
# Try the interactive demo (no setup needed)
agentassay demo

# Run trials with adaptive budget
agentassay run --scenario booking.yaml --budget-mode adaptive

# Compare two versions for regression
agentassay compare --baseline v1.json --current v2.json

# Analyze coverage from existing traces
agentassay coverage --traces production-traces/ --tools search,book,cancel

# Mutation testing
agentassay mutate --scenario booking.yaml --operators prompt,tool,model

# Generate test reports
agentassay test-report --results trials.json --format html

# Generate full HTML report
agentassay report --results trials.json --output report.html

# Check version
agentassay --version
```

- Installation
- Quickstart
- Token-Efficient Testing -- The core concept
- Stochastic Testing -- Why agent testing needs statistics
- Coverage Metrics -- Five-dimensional coverage model
- Architecture Overview
- CLI Reference
AgentAssay is backed by a published research paper with formal definitions, theorems, and proofs.
Paper: arXiv:2603.02601 (cs.AI + cs.SE) DOI: 10.5281/zenodo.18842011
```bibtex
@article{bhardwaj2026agentassay,
  title={AgentAssay: Formal Regression Testing for Non-Deterministic AI Agent Workflows},
  author={Bhardwaj, Varun Pratap},
  journal={arXiv preprint arXiv:2603.02601},
  year={2026},
  doi={10.5281/zenodo.18842011}
}
```

Contributions welcome. See CONTRIBUTING.md for guidelines.
Apache-2.0 — forever free, never paid. See LICENSE.
Part of Qualixar — The Complete Agent Development Platform
A research initiative by Varun Pratap Bhardwaj
qualixar.com · varunpratap.com · arXiv:2603.02601