Varun Pratap Bhardwaj edited this page Mar 6, 2026 · 1 revision

FAQ

Frequently asked questions about AgentAssay.

General

What is AgentAssay?

AgentAssay is a formal regression testing framework for AI agents that delivers statistical guarantees without burning your token budget. It combines behavioral fingerprinting, adaptive budget optimization, and trace-first offline analysis to achieve 5-20x cost reduction.

Who is AgentAssay for?

  • Teams building production AI agents
  • Researchers evaluating agent reliability
  • Anyone running automated tests on non-deterministic systems

Is AgentAssay free?

Yes. AgentAssay is open-source under the Apache-2.0 license. Forever free, never paid.

Which agent frameworks does AgentAssay support?

Currently: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, smolagents, and custom agents. See Supported Frameworks for the full list.


Cost & Efficiency

How much can I actually save?

Typical savings are 40-83% compared to fixed-100-trial testing, depending on:

  • Test suite composition (mix of simple and complex scenarios)
  • Agent variance (low-variance agents save more)
  • Offline analysis opportunities (coverage, contracts)

Our experiments show 5-20x total cost reduction when all techniques are combined.

Does AgentAssay reduce statistical confidence?

No. AgentAssay maintains the same error guarantees (Type I and Type II error rates) as fixed-sample testing. The savings come from not wasting trials on scenarios that converge quickly, not from cutting corners on statistics.
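
For intuition on how early stopping can keep error rates bounded, here is a minimal sketch of Wald's Sequential Probability Ratio Test, one of the methods AgentAssay lists. It is a textbook version, not AgentAssay's implementation; the function name `sprt` and the parameters `p0`, `p1`, `alpha`, and `beta` are illustrative assumptions.

```python
import math

def sprt(outcomes, p0=0.7, p1=0.9, alpha=0.05, beta=0.10):
    """Wald's SPRT over a sequence of pass/fail trial outcomes.

    H0: the true pass rate is p0 (unacceptable).
    H1: the true pass rate is p1 (acceptable).
    alpha and beta are the target Type I and Type II error rates
    (Wald's thresholds meet them approximately).
    Returns (decision, trials_used).
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0
    llr = 0.0                              # running log-likelihood ratio
    used = 0
    for passed in outcomes:
        used += 1
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", used
        if llr <= lower:
            return "accept_h0", used
    return "undecided", used
```

With these illustrative defaults, an unbroken run of passes reaches a decision after 12 trials, and an unbroken run of failures after 3, rather than consuming a fixed budget.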

What is adaptive budget optimization?

Adaptive budgeting computes the minimum number of trials each scenario needs based on its actual behavioral variance. Low-variance scenarios need fewer trials. High-variance scenarios get more. This eliminates over-testing and under-testing.
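
To see why variance drives the trial count, the sketch below applies the textbook normal-approximation sample size for estimating a pass rate to within a given margin. This is not AgentAssay's budgeting algorithm; `trials_needed`, `pilot_pass_rate`, and `margin` are hypothetical names chosen for illustration.

```python
import math
from statistics import NormalDist

def trials_needed(pilot_pass_rate, margin=0.10, confidence=0.95):
    """Trials needed to estimate a pass rate to within +/- margin.

    Uses n = z^2 * p * (1 - p) / margin^2, where p * (1 - p) is the
    variance of a single pass/fail trial.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    variance = pilot_pass_rate * (1 - pilot_pass_rate)
    return max(1, math.ceil(z * z * variance / (margin * margin)))

# A stable scenario (95% pass rate in a pilot run) versus a coin-flip scenario.
stable = trials_needed(0.95)
noisy = trials_needed(0.50)
```

Under these assumptions the stable scenario needs 19 trials and the noisy one 97, which is the kind of gap a fixed per-scenario budget cannot exploit.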

Can I use AgentAssay in CI/CD?

Yes. AgentAssay is designed for CI/CD. See CI/CD Integration for deployment gate setup.


Technical

What are three-valued verdicts?

Traditional testing gives binary results: PASS or FAIL. AgentAssay adds a third verdict, INCONCLUSIVE, which means there is not enough statistical evidence to decide between PASS and FAIL. Reporting INCONCLUSIVE makes the uncertainty explicit instead of forcing a PASS or FAIL call.
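
One plausible way to derive such a verdict is to compare a Wilson score interval against the pass-rate threshold. The sketch below assumes that decision rule; the function names and the exact boundary handling are illustrative, not AgentAssay's API.

```python
import math
from statistics import NormalDist

def wilson_interval(passes, trials, confidence=0.95):
    """Wilson score interval for a binomial pass rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def verdict(passes, trials, threshold, confidence=0.95):
    """PASS if the whole interval clears the threshold, FAIL if the whole
    interval sits below it, INCONCLUSIVE if the interval straddles it."""
    low, high = wilson_interval(passes, trials, confidence)
    if low >= threshold:
        return "PASS"
    if high < threshold:
        return "FAIL"
    return "INCONCLUSIVE"
```

For example, with 30 trials and a threshold of 0.80, 29 passes gives PASS, 10 passes gives FAIL, and 25 passes gives INCONCLUSIVE because the interval spans the threshold.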

How does behavioral fingerprinting work?

Instead of comparing raw outputs, AgentAssay extracts a compact behavioral signature (tools used, action sequence, states visited, cost, duration) and compares those. This reduces noise and sample size requirements. See Behavioral Fingerprinting.
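
As a rough picture of what a behavioral signature could look like, the sketch below reduces a trace to the five components named above. The trace layout (a dict with `steps`, each carrying `action`, `tool`, `state`, and `cost`) and the `Fingerprint` class are assumptions made for this example, not AgentAssay's data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fingerprint:
    tools_used: frozenset
    action_sequence: tuple
    states_visited: frozenset
    total_cost: float
    duration_s: float

def fingerprint(trace):
    """Reduce a (hypothetical) trace dict to a compact behavioral signature."""
    steps = trace["steps"]
    return Fingerprint(
        tools_used=frozenset(s["tool"] for s in steps if s.get("tool")),
        action_sequence=tuple(s["action"] for s in steps),
        states_visited=frozenset(s["state"] for s in steps),
        total_cost=sum(s.get("cost", 0.0) for s in steps),
        duration_s=trace.get("duration_s", 0.0),
    )

steps = [
    {"action": "call_tool", "tool": "search", "state": "researching", "cost": 0.002},
    {"action": "call_tool", "tool": "calculator", "state": "computing", "cost": 0.001},
    {"action": "respond", "tool": None, "state": "done", "cost": 0.003},
]
# Two runs whose raw text output differs but whose behavior is the same.
run_a = {"output": "The total is 42.", "duration_s": 2.1, "steps": steps}
run_b = {"output": "Total: 42", "duration_s": 2.4, "steps": steps}
fp_a, fp_b = fingerprint(run_a), fingerprint(run_b)
```

Here `fp_a` and `fp_b` agree on tools, action sequence, and states even though the raw outputs differ, which is the noise reduction the paragraph above describes.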

What statistical methods does AgentAssay use?

  • Wilson score intervals for confidence intervals
  • Fisher's exact test for regression detection
  • Sequential Probability Ratio Test (SPRT) for early stopping
  • Hotelling's T² for multivariate fingerprint comparison
  • Cohen's h for effect size

See Statistical Methods for details.
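
To make two of these concrete, here is a standard-library sketch of a one-sided Fisher's exact test for a drop in pass rate, plus Cohen's h. Both are textbook formulas; the function names and the choice of a one-sided alternative are assumptions, and AgentAssay's own implementation may differ.

```python
import math

def fisher_regression_p(base_pass, base_fail, cur_pass, cur_fail):
    """One-sided Fisher's exact test.

    Returns the probability, under the null of equal pass rates, of seeing
    cur_pass or fewer passes in the current run given the table margins.
    Small values suggest the current version regressed.
    """
    total = base_pass + base_fail + cur_pass + cur_fail
    total_pass = base_pass + cur_pass
    cur_n = cur_pass + cur_fail
    denom = math.comb(total, cur_n)
    p_value = 0.0
    for k in range(cur_pass + 1):
        # Hypergeometric probability of exactly k passes in the current run.
        # math.comb returns 0 for impossible splits, so no bounds check is needed.
        p_value += math.comb(total_pass, k) * math.comb(total - total_pass, cur_n - k) / denom
    return p_value

def cohens_h(p1, p2):
    """Cohen's h effect size between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Baseline: 28/30 passes. Current: 18/30 passes.
p = fisher_regression_p(28, 2, 18, 12)
h = cohens_h(28 / 30, 18 / 30)
```

In this example the p-value is well below 0.01 and h is above 0.8, which Cohen's conventional benchmarks class as a large effect.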

Can I test offline using existing traces?

Yes! Coverage analysis, contract checking, and metamorphic relation verification can all run on existing traces at zero token cost. See Token-Efficient Testing.
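
The idea of checking contracts against stored traces can be sketched without any model calls. The trace layout and the helper names below (`no_forbidden_tool`, `max_steps`, `check_contracts`) are hypothetical and only illustrate the pattern; in practice the traces would be loaded from wherever your runs were saved.

```python
def no_forbidden_tool(forbidden):
    """Contract: the trace never calls the given tool."""
    return lambda trace: all(step.get("tool") != forbidden for step in trace["steps"])

def max_steps(limit):
    """Contract: the trace finishes within a step limit."""
    return lambda trace: len(trace["steps"]) <= limit

def check_contracts(traces, contracts):
    """Return (trace_id, contract_name) for every violated contract."""
    violations = []
    for trace in traces:
        for name, contract in contracts.items():
            if not contract(trace):
                violations.append((trace["id"], name))
    return violations

# Hypothetical stored traces; no agent or LLM is invoked here.
traces = [
    {"id": "t1", "steps": [{"tool": "search"}, {"tool": "calculator"}]},
    {"id": "t2", "steps": [{"tool": "search"}, {"tool": "delete_file"}, {"tool": "search"}]},
]
contracts = {
    "never_deletes_files": no_forbidden_tool("delete_file"),
    "at_most_two_steps": max_steps(2),
}
violations = check_contracts(traces, contracts)
```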

What is mutation testing?

Mutation testing evaluates how sensitive your test suite is. It perturbs the agent (removing a tool, changing a prompt, swapping a model) and checks whether your tests catch the change. A high mutation score means the tests are sensitive. See Mutation Testing.
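
The score itself is commonly defined as the fraction of mutants that the tests catch ("kill"). A minimal sketch under that assumption, with hypothetical mutant names:

```python
def mutation_score(killed_by_mutant):
    """Return (score, surviving_mutants).

    killed_by_mutant maps a mutant name to True if the test suite failed
    on that mutant (killed) or False if the suite still passed (survived).
    """
    if not killed_by_mutant:
        raise ValueError("no mutants were evaluated")
    survivors = sorted(name for name, killed in killed_by_mutant.items() if not killed)
    score = (len(killed_by_mutant) - len(survivors)) / len(killed_by_mutant)
    return score, survivors

results = {
    "remove_search_tool": True,
    "reword_system_prompt": False,   # tests still passed: a gap in the suite
    "swap_model": True,
    "drop_calculator": True,
}
score, survivors = mutation_score(results)
```

Here three of four mutants are killed, so the score is 0.75, and the surviving prompt mutant points at behavior the tests do not yet cover.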


Integration

How do I integrate with pytest?

Install AgentAssay and use the pytest plugin:

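import pytest
from agentassay.plugin.pytest_plugin import assert_pass_rate

# my_agent and scenario are assumed to be defined elsewhere in your test module.

@pytest.mark.agentassay(n=30, threshold=0.80)
def test_my_agent(trial_runner):
    # trial_runner is a pytest fixture; wrap the agent under test with it.
    runner = trial_runner(my_agent)
    results = runner.run_trials(scenario)
    # Check the per-trial pass/fail outcomes against the 0.80 threshold.
    assert_pass_rate([r.passed for r in results], threshold=0.80)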

See Pytest Plugin for the full guide.

Can I use AgentAssay with custom agents?

Yes. Any callable that returns an ExecutionTrace can be tested. See Supported Frameworks for custom adapter examples.

How do I compare baseline vs. current versions?

Use the compare command:

agentassay compare --baseline v1-results.json --current v2-results.json

Exit code 0 means no regression was found; exit code 1 means a regression was detected.
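
If you drive the comparison from a Python script rather than a shell step, the exit code can be interpreted as below. The wrapper name `regression_detected` is hypothetical, and treating any code other than 0 or 1 as a tool error is an assumption, since only those two codes are documented here.

```python
import subprocess

def regression_detected(cmd):
    """Run a comparison command and interpret its exit code.

    0 -> no regression, 1 -> regression detected.
    Any other exit code is raised as an error (an assumption, see above).
    """
    completed = subprocess.run(cmd)
    if completed.returncode == 0:
        return False
    if completed.returncode == 1:
        return True
    raise RuntimeError(f"unexpected exit code {completed.returncode}")

# Example call, using the command shown above:
# regression_detected([
#     "agentassay", "compare",
#     "--baseline", "v1-results.json",
#     "--current", "v2-results.json",
# ])
```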

Can I generate HTML reports?

Yes:

agentassay report --results trials.json --output report.html

The report is self-contained and includes verdicts, confidence intervals, and methodology.


Research & Publication

Is there a paper?

Yes. arXiv:2603.02601 (cs.AI + cs.SE). See arXiv.

Dataset: Zenodo DOI: 10.5281/zenodo.18842011

Who maintains AgentAssay?

AgentAssay is developed by Varun Pratap Bhardwaj (Independent Researcher) as part of the Qualixar research initiative. See varunpratap.com and qualixar.com.

Can I cite AgentAssay in my research?

Yes:

@article{bhardwaj2026agentassay,
  title={AgentAssay: Token-Efficient Stochastic Testing for AI Agents},
  author={Bhardwaj, Varun Pratap},
  journal={arXiv preprint arXiv:2603.02601},
  year={2026}
}

Can I contribute?

Yes! See CONTRIBUTING.md in the GitHub repository.


Troubleshooting

My tests return INCONCLUSIVE. What do I do?

INCONCLUSIVE means the confidence interval straddles the threshold. Solutions:

  1. Increase trials: More samples narrow the confidence interval
  2. Use adaptive budgeting: Let AgentAssay compute the required number of trials
  3. Adjust the threshold: If the threshold is too close to the observed pass rate, verdicts will be borderline
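
To get a feel for how the gap between the observed pass rate and the threshold affects the number of trials, the sketch below searches for the smallest trial count at which a Wilson interval no longer straddles the threshold, holding the observed rate fixed. It is an illustration under that simplifying assumption, not AgentAssay's budget calculation, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def wilson_bounds(rate, n, z):
    """Wilson score interval for an observed rate over n trials."""
    denom = 1 + z * z / n
    center = (rate + z * z / (2 * n)) / denom
    half = z * math.sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def trials_to_resolve(observed_rate, threshold, confidence=0.95, max_trials=10_000):
    """Smallest n at which the interval excludes the threshold, or None."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    for n in range(1, max_trials + 1):
        low, high = wilson_bounds(observed_rate, n, z)
        if low >= threshold or high < threshold:
            return n
    return None

wide_gap = trials_to_resolve(0.90, 0.80)    # observed rate well above threshold
narrow_gap = trials_to_resolve(0.85, 0.80)  # observed rate close to threshold
```

The narrow-gap case needs several times more trials than the wide-gap case, and an observed rate equal to the threshold never resolves, which is why moving the threshold away from the observed rate can be more effective than adding trials.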

AgentAssay says I need 200 trials. That's expensive!

High sample size requirements indicate high variance. Options:

  1. Stabilize the agent: High variance often comes from poorly tuned temperature, ambiguous prompts, or flaky tools
  2. Use behavioral fingerprinting: Reduces variance by focusing on behavior, not raw output
  3. Use a multi-fidelity proxy: Screen with a cheaper model first

Coverage is low. How do I improve it?

See Coverage Model for dimension-specific strategies. Generally:

  • Tool coverage: Add scenarios that require different tools
  • Path coverage: Add edge cases (errors, ambiguous inputs)
  • Boundary coverage: Add extreme inputs (empty, max-length, timeouts)
  • Model coverage: Test against multiple models
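
As an example of one dimension, tool coverage can be pictured as the share of available tools that appear in at least one trace. The sketch below assumes that definition and a hypothetical trace layout; see Coverage Model for how AgentAssay actually defines each dimension.

```python
def tool_coverage(traces, available_tools):
    """Return (coverage_ratio, tools_never_exercised)."""
    available = set(available_tools)
    if not available:
        raise ValueError("available_tools must not be empty")
    used = {
        step["tool"]
        for trace in traces
        for step in trace["steps"]
        if step.get("tool")
    }
    missing = sorted(available - used)
    return len(available & used) / len(available), missing

traces = [
    {"steps": [{"tool": "search"}, {"tool": "calculator"}]},
    {"steps": [{"tool": "search"}, {"tool": None}]},
]
ratio, missing = tool_coverage(traces, ["search", "calculator", "email", "calendar"])
```

In this example half of the tools are exercised, and the returned list (`calendar`, `email`) names the tools that new scenarios should target.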

Mutation score is low. What does that mean?

Low mutation score means your tests pass even when the agent is perturbed. Inspect surviving mutants to see what your tests miss, then add scenarios to kill them. See Mutation Testing.


Platform-Specific

Does AgentAssay work with LangChain?

LangGraph (built on LangChain) is supported. Pure LangChain chains can use the custom adapter.

Does AgentAssay work with GPT-4? Claude? Llama?

Yes. AgentAssay is model-agnostic: it tests agent behavior, not specific models. It works with any LLM your agent uses.

Can I run AgentAssay in AWS Lambda?

Yes, but the dashboard requires a persistent server. CLI commands work fine in serverless environments.




Part of Qualixar | Author: Varun Pratap Bhardwaj
