FAQ
Frequently asked questions about AgentAssay.
AgentAssay is a formal regression testing framework for AI agents that delivers statistical guarantees without burning your token budget. It combines behavioral fingerprinting, adaptive budget optimization, and trace-first offline analysis to achieve 5-20x cost reduction.
- Teams building production AI agents
- Researchers evaluating agent reliability
- Anyone running automated tests on non-deterministic systems
Yes. AgentAssay is open-source under the Apache-2.0 license. Forever free, never paid.
Currently: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, smolagents, and custom agents. See Supported Frameworks for the full list.
Typical savings are 40-83% compared to fixed-100-trial testing, depending on:
- Test suite composition (mix of simple and complex scenarios)
- Agent variance (low-variance agents save more)
- Offline analysis opportunities (coverage, contracts)
Our experiments show 5-20x total cost reduction when all techniques are combined.
No. AgentAssay maintains the same error guarantees (Type I and Type II error rates) as fixed-sample testing. The savings come from not wasting trials on scenarios that converge quickly, not from cutting corners on statistics.
Adaptive budgeting computes the minimum number of trials each scenario needs based on its actual behavioral variance. Low-variance scenarios need fewer trials. High-variance scenarios get more. This eliminates over-testing and under-testing.
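The idea can be illustrated with a simplified sketch (not AgentAssay's actual implementation): the classic normal-approximation sample-size formula, where the required number of trials scales with the binomial variance p(1 − p).

```python
from math import ceil

def min_trials(p_hat: float, margin: float = 0.05, z: float = 1.96) -> int:
    """Normal-approximation sample size to estimate a pass rate
    within +/- margin at ~95% confidence (z = 1.96).
    Binomial variance p*(1-p) peaks at p = 0.5, so borderline
    scenarios need the most trials."""
    variance = p_hat * (1.0 - p_hat)
    return ceil((z * z * variance) / (margin * margin))

# A near-deterministic scenario needs far fewer trials than a
# borderline one:
low_var = min_trials(0.95)   # 73 trials
high_var = min_trials(0.50)  # 385 trials
```

A low-variance scenario (pass rate near 0.95) converges with a fraction of the trials a 50/50 scenario needs, which is where the over-testing savings come from.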
Yes. AgentAssay is designed for CI/CD. See CI/CD Integration for deployment gate setup.
Traditional testing gives binary results: PASS or FAIL. AgentAssay adds a third verdict: INCONCLUSIVE, which means there's not enough statistical evidence to determine PASS or FAIL. This is honest about uncertainty.
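The three-verdict logic can be sketched as follows (a minimal illustration, assuming a confidence interval on the pass rate has already been computed; not AgentAssay's exact API):

```python
def verdict(ci_low: float, ci_high: float, threshold: float) -> str:
    """Three-way verdict from a confidence interval on the pass rate:
    PASS if the whole interval clears the threshold, FAIL if it
    falls entirely below, INCONCLUSIVE if it straddles it."""
    if ci_low >= threshold:
        return "PASS"
    if ci_high < threshold:
        return "FAIL"
    return "INCONCLUSIVE"

verdict(0.85, 0.95, 0.80)  # "PASS"
verdict(0.75, 0.90, 0.80)  # "INCONCLUSIVE" -- interval straddles 0.80
```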
Instead of comparing raw outputs, AgentAssay extracts a compact behavioral signature (tools used, action sequence, states visited, cost, duration) and compares those. This reduces noise and sample size requirements. See Behavioral Fingerprinting.
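A rough sketch of the idea (the trace schema and field names here are hypothetical, not AgentAssay's real data model): collapse each run into a hashable signature so runs that differ only in wording compare equal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fingerprint:
    tools_used: frozenset      # set of tool names invoked
    action_sequence: tuple     # ordered action types
    states_visited: frozenset  # agent states reached

def fingerprint(trace: dict) -> Fingerprint:
    """Collapse a raw execution trace into a compact behavioral
    signature. Two runs with different output text but the same
    tool calls and state path map to the same fingerprint."""
    steps = trace["steps"]
    return Fingerprint(
        tools_used=frozenset(s["tool"] for s in steps if "tool" in s),
        action_sequence=tuple(s["action"] for s in steps),
        states_visited=frozenset(s["state"] for s in steps),
    )
```

Because the signature is small and discrete, two runs either match or they don't, which cuts noise relative to comparing free-form model output.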
- Wilson score intervals for confidence intervals
- Fisher's exact test for regression detection
- Sequential Probability Ratio Test (SPRT) for early stopping
- Hotelling's T² for multivariate fingerprint comparison
- Cohen's h for effect size
See Statistical Methods for details.
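As a concrete example, the Wilson score interval from the list above is straightforward to compute in pure Python (a standalone sketch, not AgentAssay's internal code):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion at ~95%
    confidence. Better behaved than the normal approximation at
    small n and at pass rates near 0 or 1."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)

# 27 passes out of 30 trials: observed 0.90, interval roughly (0.74, 0.97)
lo, hi = wilson_interval(27, 30)
```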
Yes! Coverage analysis, contract checking, and metamorphic relation verification can all run on existing traces at zero token cost. See Token-Efficient Testing.
Mutation testing evaluates how sensitive your test suite is. It perturbs the agent (remove a tool, change a prompt, swap a model) and checks if your tests catch the change. High mutation score = sensitive tests. See Mutation Testing.
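The score itself is simple arithmetic, sketched below (function and variable names are illustrative, not AgentAssay's API):

```python
def mutation_score(mutant_results) -> float:
    """Fraction of mutants 'killed', i.e. mutants for which at
    least one test failed after the perturbation. Each entry is
    True if the suite caught that mutant."""
    if not mutant_results:
        return 0.0
    return sum(mutant_results) / len(mutant_results)

# e.g. 3 of 4 mutants (tool removed, prompt changed, model swapped)
# caught; one survived:
score = mutation_score([True, True, True, False])  # 0.75
```

A surviving mutant (a `False` entry) points at a behavior change your scenarios never exercise.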
Install AgentAssay and use the pytest plugin:
```python
import pytest
from agentassay.plugin.pytest_plugin import assert_pass_rate

@pytest.mark.agentassay(n=30, threshold=0.80)
def test_my_agent(trial_runner):
    runner = trial_runner(my_agent)
    results = runner.run_trials(scenario)
    assert_pass_rate([r.passed for r in results], threshold=0.80)
```

See Pytest Plugin for the full guide.
Yes. Any callable that returns an ExecutionTrace can be tested. See Supported Frameworks for custom adapter examples.
Use the compare command:
```shell
agentassay compare --baseline v1-results.json --current v2-results.json
```

Exit code 0 = no regression, exit code 1 = regression detected.
Yes:
```shell
agentassay report --results trials.json --output report.html
```

The report is self-contained and includes verdicts, confidence intervals, and methodology.
Yes. arXiv:2603.02601 (cs.AI + cs.SE). See arXiv.
Dataset: Zenodo DOI: 10.5281/zenodo.18842011
AgentAssay is developed by Varun Pratap Bhardwaj (Independent Researcher) as part of the Qualixar research initiative. See varunpratap.com and qualixar.com.
Yes:
```bibtex
@article{bhardwaj2026agentassay,
  title={AgentAssay: Token-Efficient Stochastic Testing for AI Agents},
  author={Bhardwaj, Varun Pratap},
  journal={arXiv preprint arXiv:2603.02601},
  year={2026}
}
```

Yes! See CONTRIBUTING.md in the GitHub repository.
INCONCLUSIVE means the confidence interval straddles the threshold. Solutions:
- Increase trials — More samples narrow the confidence interval
- Use adaptive budgeting — Let AgentAssay compute the right N
- Adjust threshold — If threshold is too close to the observed pass rate, verdicts will be borderline
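The first option works because the interval width shrinks roughly as 1/√n, as this small demo shows (using the normal-approximation half-width for simplicity; AgentAssay uses Wilson intervals):

```python
from math import sqrt

def half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate confidence-interval half-width for an observed
    pass rate p over n trials; shrinks as 1/sqrt(n)."""
    return z * sqrt(p * (1 - p) / n)

# Quadrupling the trial count halves the interval width:
w30 = half_width(0.85, 30)    # ~0.128
w120 = half_width(0.85, 120)  # ~0.064
```

So if the observed pass rate sits 0.07 above the threshold, roughly quadrupling the trials is what turns a straddling interval into a clean PASS.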
High sample size requirements indicate high variance. Options:
- Stabilize the agent — High variance often comes from poorly tuned temperature, ambiguous prompts, or flaky tools
- Use behavioral fingerprinting — Reduces variance by focusing on behavior, not raw output
- Multi-fidelity proxy — Screen with a cheaper model first
See Coverage Model for dimension-specific strategies. Generally:
- Tool coverage: Add scenarios that require different tools
- Path coverage: Add edge cases (errors, ambiguous inputs)
- Boundary coverage: Add extreme inputs (empty, max-length, timeouts)
- Model coverage: Test against multiple models
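Each dimension reduces to a covered-over-available ratio; for instance, tool coverage can be sketched as (a hypothetical helper, not AgentAssay's coverage API):

```python
def tool_coverage(tools_available: set, tools_exercised: set) -> float:
    """Fraction of the agent's available tools exercised by at
    least one scenario in the suite."""
    if not tools_available:
        return 1.0
    return len(tools_available & tools_exercised) / len(tools_available)

# Suite exercises 3 of the agent's 4 tools:
cov = tool_coverage({"search", "calc", "email", "db"},
                    {"search", "calc", "db"})  # 0.75
```

A ratio below 1.0 tells you exactly which tools need a new scenario.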
Low mutation score means your tests pass even when the agent is perturbed. Inspect surviving mutants to see what your tests miss, then add scenarios to kill them. See Mutation Testing.
LangGraph (built on LangChain) is supported. Pure LangChain chains can use the custom adapter.
Yes. AgentAssay is model-agnostic — it tests agent behavior, not specific models. It works with any LLM your agent uses.
Yes, but the dashboard requires a persistent server. CLI commands work fine in serverless environments.
- Quick Start — Test your first agent in 5 minutes
- Installation — Get set up
- Token-Efficient Testing — Understand the core innovation
Part of Qualixar | Author: Varun Pratap Bhardwaj