agent-eval-kit

TypeScript-native eval framework for AI agent workflows. Record once, replay forever, grade instantly.

Testing AI agents is expensive, slow, and non-deterministic. agent-eval-kit fixes this with a record-replay workflow:

Record — capture live agent responses as fixtures (one-time API cost)
Replay — grade recorded outputs instantly at zero cost
Gate — enforce pass rates, cost budgets, and latency limits in CI
Compare — diff two runs to catch regressions
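
The record and replay steps can be pictured with a small sketch. This is not agent-eval-kit's implementation or fixture format: `FixtureStore`, `runCase`, and the `Target` type are hypothetical names, and an in-memory map stands in for fixture files on disk.

```typescript
// Hypothetical sketch of a record-replay loop; not agent-eval-kit's internals.
type Mode = "record" | "replay";
type Target = (prompt: string) => Promise<string>;

// In-memory stand-in for fixture files, keyed by case id (assumption).
class FixtureStore {
  private fixtures = new Map<string, string>();

  save(caseId: string, output: string): void {
    this.fixtures.set(caseId, output);
  }

  load(caseId: string): string {
    const output = this.fixtures.get(caseId);
    if (output === undefined) {
      throw new Error(`No fixture recorded for case "${caseId}"`);
    }
    return output;
  }
}

async function runCase(
  mode: Mode,
  store: FixtureStore,
  caseId: string,
  prompt: string,
  target: Target,
): Promise<string> {
  if (mode === "record") {
    // The only branch that calls the live agent (and pays for it).
    const output = await target(prompt);
    store.save(caseId, output);
    return output;
  }
  // Replay: return the recorded output without touching the agent.
  return store.load(caseId);
}
```

Under these assumptions, graders only ever see the stored output, so replaying the same fixture always grades the same text.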

Quick Start

npm install agent-eval-kit

Requires Node.js 20+. Generate a starter config with agent-eval-kit init, or write one manually:

// eval.config.ts
import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "basic-qa",
      target: async (input) => {
        // myAgent is a placeholder for your own agent entry point; it is not exported by agent-eval-kit.
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.duration };
      },
      cases: [
        {
          id: "capital-france",
          input: { prompt: "What is the capital of France?" },
          expected: { text: "Paris" },
        },
      ],
      defaultGraders: [
        { grader: contains("Paris"), required: true },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.95 },
    },
  ],
});

agent-eval-kit record --suite basic-qa   # record fixtures (live API calls)
agent-eval-kit run --mode replay         # replay the recorded fixtures instantly, $0 cost
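
To make the `gates: { passRate: 0.95 }` entry in the config above concrete, here is a minimal sketch of how a pass-rate gate could be evaluated. `CaseResult`, `checkGates`, and `exitCode` are hypothetical names for illustration, and treating an empty run as a failure is an assumption of this sketch; the library's own gate logic may differ.

```typescript
// Hypothetical sketch of a pass-rate gate; not agent-eval-kit's internals.
interface CaseResult {
  id: string;
  pass: boolean;
}

interface Gates {
  passRate: number; // minimum fraction of passing cases, e.g. 0.95
}

function checkGates(
  results: CaseResult[],
  gates: Gates,
): { ok: boolean; passRate: number } {
  // Assumption: a run with no cases cannot satisfy a gate.
  if (results.length === 0) {
    return { ok: false, passRate: 0 };
  }
  const passed = results.filter((r) => r.pass).length;
  const passRate = passed / results.length;
  return { ok: passRate >= gates.passRate, passRate };
}

// A failing gate maps to a non-zero exit code so that CI can block the change.
const exitCode = (ok: boolean): number => (ok ? 0 : 1);
```

With a 0.95 threshold, a suite of four cases with one failure has a pass rate of 0.75 and fails the gate.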

Features

20 built-in graders — text (contains, regex, exactMatch), tool calls (toolSequence, toolArgsMatch), metrics (latency, cost, tokenCount), safety (safetyKeywords, noHallucinatedNumbers), structured output (jsonSchema), and LLM-as-judge (llmRubric, factuality, llmClassify)
Grader composition — combine with all(), any(), not()
3 execution modes — live (real calls), replay (cached fixtures), judge-only (re-grade with new graders, no re-run)
Quality gates — enforce pass rate, max cost, and p95 latency thresholds; non-zero exit on failure
Run comparison — diff any two runs to surface regressions and improvements
Multi-trial runs — flakiness detection with Wilson score confidence intervals
Watch mode — re-run evals on file changes (--watch)
External cases — load from JSONL or YAML files alongside inline cases
Plugin system — custom graders and lifecycle hooks (beforeRun, afterTrial, afterRun)
4 reporters — console, JSON, JUnit XML, Markdown
MCP server — 8 tools + 3 resources for AI assistant integration
CI-native — JUnit reporter, GitHub Actions Step Summary, git hook installation
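
Grader composition with all(), any(), and not() can be sketched as plain higher-order functions. The `Grader` and `GradeResult` types below are assumptions made for illustration, and `contains` is redefined locally so the example is self-contained; the library's real grader interface is likely richer.

```typescript
// Hypothetical types; the library's real Grader interface may differ.
interface GradeResult {
  pass: boolean;
  reason: string;
}
type Grader = (output: string) => GradeResult;

// Local stand-in for a text grader, so the sketch runs on its own.
const contains = (needle: string): Grader => (output) => ({
  pass: output.includes(needle),
  reason: `contains "${needle}"`,
});

// all: passes only if every inner grader passes.
const all = (...graders: Grader[]): Grader => (output) => {
  const results = graders.map((g) => g(output));
  return {
    pass: results.every((r) => r.pass),
    reason: `all(${results.map((r) => r.reason).join(", ")})`,
  };
};

// any: passes if at least one inner grader passes.
const any = (...graders: Grader[]): Grader => (output) => {
  const results = graders.map((g) => g(output));
  return {
    pass: results.some((r) => r.pass),
    reason: `any(${results.map((r) => r.reason).join(", ")})`,
  };
};

// not: inverts the inner grader's verdict.
const not = (grader: Grader): Grader => (output) => {
  const result = grader(output);
  return { pass: !result.pass, reason: `not(${result.reason})` };
};

// Example: the answer must mention Paris and must not mention Lyon.
const mentionsOnlyParis = all(contains("Paris"), not(contains("Lyon")));
```

Because each combinator returns another grader, compositions nest freely, for example `any(mentionsOnlyParis, contains("I don't know"))`.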

Examples

Example	What it covers	Run it
`quickstart/`	Minimal setup — 1 case, 2 graders	`agent-eval-kit run --config examples/quickstart`
`text-grading/`	Text, safety, metric, composition, and LLM judge graders	`agent-eval-kit run --config examples/text-grading`
`tool-agent/`	Tool call grading, hallucination detection, plugins	`agent-eval-kit run --config examples/tool-agent`

See examples/README.md for setup details.

Documentation

Full docs at flanaganse.github.io/agent-eval-kit:

Quick Start — first eval in 5 minutes
Graders Guide — all graders with examples
CLI Reference — every command and flag
Config Reference — full config schema
Programmatic API — use as a library

Contributing

Contributions welcome — please open an issue first to discuss changes.

pnpm install && pnpm test && pnpm lint

License

MIT
