Agent-EvalKit is an AI-assistant-driven toolkit that automates the complex process of evaluating your AI agents.
Key Features:
1. Create an evaluation plan by analyzing your agent and user requirements
2. Generate test cases for evaluation
3. Add tracing instrumentation to your agent (Optional)
4. Run your agent and collect execution traces
5. Write and run evaluation code to assess performance
6. Generate a report with agent improvement recommendations
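For orientation, the sketch below shows what a single generated test case could look like as a JSON record. This is a minimal illustration, not EvalKit's actual schema; the field names (`input`, `expected_behavior`, `tags`) and the JSONL file name are assumptions.

```python
import json

# Hypothetical shape of one generated test case -- illustrative only;
# the artifacts EvalKit generates for your agent may differ.
test_case = {
    "id": "tc-001",
    "input": "What were Acme Corp's Q3 revenue highlights?",
    "expected_behavior": "Cites at least one retrieved source and states the revenue figure.",
    "tags": ["search", "grounding"],
}

# Test cases are typically stored as JSONL so the run step can iterate over them.
with open("test_cases.jsonl", "w") as f:
    f.write(json.dumps(test_case) + "\n")
```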
System: Linux/macOS • Python 3.11+ • uv • Git
AI Assistant: Currently supports Kiro CLI, Claude Code, and Kilo Code.
```bash
# Install once and use everywhere
uv tool install evalkit --from git+https://github.com/awslabs/Agent-EvalKit.git

# To upgrade later
uv tool install evalkit --force --from git+https://github.com/awslabs/Agent-EvalKit.git
```

Important: Before starting an evaluation with EvalKit, make sure the agent you want to evaluate runs successfully on your machine, with all of its dependencies installed and API keys available locally.
```bash
# Create dedicated evaluation project
evalkit init my-agent-evaluation
cd my-agent-evaluation

# Copy your agent folder into the evaluation project
cp -r /path/to/your/agent-folder .
# This ensures reliable path resolution and artifact management throughout the evaluation process
```
```bash
# Start your AI assistant (example shown for Claude Code)
claude

# When prompted, agree to use Context7 MCP for documentation access
# Type /evalkit to see available commands

# Note: For Kilo Code and Kiro CLI, detailed setup instructions will be shown
# in the terminal after running 'evalkit init my-agent-evaluation'
```

See Complete Example: Check out `examples/qa_agent_evaluation/` for a full evaluation workflow demonstration.
Option A: Guided workflow (recommended for first-time users)

```bash
/evalkit.quick   # user input required
# Example: /evalkit.quick Evaluate my search agent at ./search_agent for final response quality
# This command will guide you through the entire evaluation process
```

Option B: Individual commands (for experienced users)
Step 1: Analyze agent and design evaluation strategy

```bash
/evalkit.plan   # user input required
# Example: /evalkit.plan Evaluate my search agent at ./search_agent for final response quality
```

Step 2: Generate test cases for evaluation

```bash
/evalkit.data   # user input optional
# Example: /evalkit.data Focus on edge cases
```

Step 3: Add tracing to your agent (an illustrative sketch of such instrumentation follows below)

```bash
/evalkit.trace   # user input optional
```
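To give a sense of what tracing instrumentation looks like, here is a minimal hand-written sketch that wraps an agent call in an OpenTelemetry span. It is not the code /evalkit.trace generates; the span name, attribute names, and stubbed agent call are assumptions for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for this sketch; a real setup would send them to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my_agent")

def run_agent(query: str) -> str:
    # Wrap the agent invocation in a span so inputs and outputs end up in the trace.
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.input", query)
        answer = f"stubbed answer for: {query}"  # placeholder for the real agent call
        span.set_attribute("agent.output", answer)
        return answer

if __name__ == "__main__":
    run_agent("What is the capital of France?")
```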
Step 4: Run agent and collect traces

```bash
/evalkit.run_agent   # user input optional
```

Step 5: Write and execute evaluation code over traces

```bash
/evalkit.eval   # user input optional
```

Step 6: Analyze results and provide improvement recommendations

```bash
/evalkit.report   # user input optional
```

EvalKit helps you quickly generate an evaluation pipeline that you can further refine according to your specific requirements:
- Review the code: Check that the generated pipeline works as expected for your agent
- Customize based on your needs: Adapt the generated test cases, evaluation code, and report to your requirements
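As a starting point for that review, the snippet below sketches the kind of trace-level check an evaluation script might perform: load collected trace records and score each final response with a simple keyword heuristic. The file name, record fields, and metric are assumptions for illustration; the code EvalKit writes for your agent will differ.

```python
import json
from pathlib import Path

def keyword_score(response: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords present in the response (a deliberately simple metric)."""
    if not required_keywords:
        return 1.0
    hits = sum(1 for kw in required_keywords if kw.lower() in response.lower())
    return hits / len(required_keywords)

def evaluate(trace_file: str = "traces.jsonl") -> None:
    # Assumed record shape: {"test_id": ..., "final_response": ..., "required_keywords": [...]}
    scores = []
    for line in Path(trace_file).read_text().splitlines():
        record = json.loads(line)
        score = keyword_score(record["final_response"], record.get("required_keywords", []))
        scores.append(score)
        print(f"{record['test_id']}: {score:.2f}")
    if scores:
        print(f"mean score: {sum(scores) / len(scores):.2f}")

if __name__ == "__main__":
    evaluate()
```

In practice you would replace the keyword heuristic with whatever quality criteria your evaluation plan defines.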
CLI commands:

| Command | Description |
|---|---|
| `evalkit init <project-name>` | Initialize new evaluation project |
| `evalkit check` | Check system prerequisites |
AI assistant commands:

| Command | Description |
|---|---|
| `/evalkit.quick` | Step-by-step evaluation guide |
| `/evalkit.plan` | Analyze agent and design evaluation strategy |
| `/evalkit.data` | Generate test cases for evaluation |
| `/evalkit.trace` | Add tracing to your agent |
| `/evalkit.run_agent` | Run agent and collect traces |
| `/evalkit.eval` | Write and execute evaluation code over traces |
| `/evalkit.report` | Analyze results and provide improvement recommendations |
Agent-EvalKit evolved from our autonomous Evaluation Agent project. Inspired by spec-kit, we packaged it as a toolkit compatible with multiple coding assistants.
