
Agent-EvalKit

AI assistant that automates evaluation processes for your AI agents.

Agent-EvalKit demo

Table of Contents

  • Overview
  • Requirements
  • Quick Start
  • What to Expect from EvalKit
  • Reference
  • Acknowledgements

Overview

Agent-EvalKit automates the complex evaluation process for your AI agents.

Key Features:

1. Create an evaluation plan by analyzing your agent and user requirements
2. Generate test cases for evaluation
3. Add tracing instrumentation to your agent (optional)
4. Run your agent and collect execution traces
5. Write and run evaluation code to assess performance
6. Generate a report with agent improvement recommendations

Requirements

System: Linux/macOS • Python 3.11+ • uv • Git

AI Assistant: Currently supports Kiro CLI, Claude Code, and Kilo Code.

Quick Start

1. Install EvalKit

# Install once and use everywhere
uv tool install evalkit --from git+https://github.com/awslabs/Agent-EvalKit.git

# To upgrade later
uv tool install evalkit --force --from git+https://github.com/awslabs/Agent-EvalKit.git

2. Initialize Evaluation Project

Important: Before running an evaluation with EvalKit, make sure the agent you want to evaluate already runs successfully on your machine, with all of its dependencies and API keys available locally.

# Create dedicated evaluation project
evalkit init my-agent-evaluation
cd my-agent-evaluation

# Copy your agent folder into the evaluation project
cp -r /path/to/your/agent-folder .
# This ensures reliable path resolution and artifact management throughout the evaluation process

# Start your AI assistant (example shown for Claude Code)
claude
# When prompted, agree to use Context7 MCP for documentation access
# Type /evalkit to see available commands

# Note: For Kilo Code and Kiro CLI, detailed setup instructions will be shown
# in the terminal after running 'evalkit init my-agent-evaluation'

3. Evaluate Your Agent

Complete example: see examples/qa_agent_evaluation/ for a full evaluation workflow demonstration.

Option A: Guided workflow (recommended for first-time users)

/evalkit.quick  # user input required
# Example: /evalkit.quick Evaluate my search agent at ./search_agent for final response quality
# This command will guide you through the entire evaluation process

Option B: Individual commands (for experienced users)

Step 1: Analyze agent and design evaluation strategy

/evalkit.plan  # user input required
# Example: /evalkit.plan Evaluate my search agent at ./search_agent for final response quality
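
An evaluation plan typically pins down what to measure and how. As a purely hypothetical sketch (the fields below are illustrative assumptions, not EvalKit's actual plan format), a plan for a search agent might capture something like:

# Hypothetical outline of an evaluation plan; EvalKit produces its own artifact.
evaluation_plan = {
    "agent_under_test": "./search_agent",
    "focus": "final response quality",
    "criteria": ["factual correctness", "relevance to the query", "completeness"],
    "test_data": {"num_cases": 20, "include_edge_cases": True},
    "metrics": ["exact/contains match", "LLM-judge score (1-5)"],
}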

Step 2: Generate test cases for evaluation

/evalkit.data  # user input optional
# Example: /evalkit.data Focus on edge cases
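
The generated test cases feed every later step. As a hypothetical illustration only (EvalKit chooses the actual format and field names), test cases for a search agent could amount to a list of input/expected pairs:

# Hypothetical test cases for a search agent; EvalKit generates these for you
# and may use a different file format and field names.
test_cases = [
    {"id": "tc-001", "input": "Who wrote The Old Man and the Sea?",
     "expected": "Ernest Hemingway"},
    {"id": "tc-002", "input": "What is the capital of Australia?",
     "expected": "Canberra"},
    # Edge case: an underspecified query the agent should ask to clarify.
    {"id": "tc-003", "input": "When was the treaty signed?",
     "expected": "asks a clarifying question"},
]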

Step 3: Add tracing to your agent

/evalkit.trace  # user input optional
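
/evalkit.trace writes the instrumentation for you. As a rough sketch of what span-based tracing looks like in Python, here is a minimal OpenTelemetry example with a console exporter; the agent function and attribute names are hypothetical, and EvalKit's generated code may use a different tracing backend entirely.

# Hypothetical tracing sketch using OpenTelemetry; not EvalKit's generated code.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console so every agent run leaves a visible trace.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("search_agent")  # hypothetical agent name

def answer(question: str) -> str:
    # Wrap the agent call in a span and record input/output as attributes.
    with tracer.start_as_current_span("agent.answer") as span:
        span.set_attribute("agent.input", question)
        response = "..."  # your agent's actual logic goes here
        span.set_attribute("agent.output", response)
        return response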

Step 4: Run agent and collect traces

/evalkit.run_agent  # user input optional

Step 5: Write and execute evaluation code over traces

/evalkit.eval  # user input optional
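
The evaluation code reads the collected traces and scores each run against its test case. As a minimal hypothetical sketch (the trace layout and scoring rule below are assumptions, not EvalKit's generated code), a simple contains-the-expected-answer check could look like this:

# Hypothetical evaluation sketch: score each trace by whether the agent's
# final output contains the expected answer.
def score_trace(trace: dict, expected: str) -> float:
    final_output = trace.get("agent.output", "")
    return 1.0 if expected.lower() in final_output.lower() else 0.0

def evaluate(traces: list[dict], test_cases: list[dict]) -> list[dict]:
    # Pair each collected trace with its test case and record a score per case.
    return [
        {"id": case["id"], "score": score_trace(trace, case["expected"])}
        for case, trace in zip(test_cases, traces)
    ]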

Step 6: Analyze results and provide improvement recommendations

/evalkit.report  # user input optional
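
The report turns per-case scores into something actionable. Continuing the hypothetical sketch from the previous step, a bare-bones summary might compute an overall pass rate and list the failing cases, which are the natural starting point for improvement recommendations:

# Hypothetical report summary over the evaluation results from the sketch above.
def summarize(results: list[dict], threshold: float = 1.0) -> dict:
    failures = [r["id"] for r in results if r["score"] < threshold]
    pass_rate = 1.0 - len(failures) / len(results) if results else 0.0
    return {"pass_rate": pass_rate, "failing_cases": failures}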

What to Expect from EvalKit

EvalKit helps you quickly generate an evaluation pipeline that you can further refine according to your specific requirements.

What You Do Next

  • Review the code: Check if it works as expected for your agent
  • Customize based on your needs: Adapt the evaluation pipeline for your specific requirements

Reference

CLI Commands

Command                        Description
evalkit init <project-name>    Initialize new evaluation project
evalkit check                  Check system prerequisites

EvalKit Commands (Available after evalkit init)

Command               Description
/evalkit.quick        Step-by-step evaluation guide
/evalkit.plan         Analyze agent and design evaluation strategy
/evalkit.data         Generate test cases for evaluation
/evalkit.trace        Add tracing to your agent
/evalkit.run_agent    Run agent and collect traces
/evalkit.eval         Write and execute evaluation code over traces
/evalkit.report       Analyze results and provide improvement recommendations

Acknowledgements

Agent-EvalKit evolved from our autonomous Evaluation Agent project. Inspired by spec-kit, we packaged it as a toolkit compatible with multiple coding assistants.
