Agent-EvalKit is an AI-assistant-driven toolkit that automates the complex process of evaluating your AI agents.
Key Features:
1. Create an evaluation plan by analyzing your agent and user requirements
2. Generate test cases for evaluation
3. Add tracing instrumentation to your agent (Optional)
4. Run your agent and collect execution traces
5. Write and run evaluation code to assess performance
6. Generate a report with agent improvement recommendations
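For orientation, the sketch below shows what a single generated test case could look like as a JSON record. This is a minimal illustration, not EvalKit's actual schema; the field names (`input`, `expected_behavior`, `tags`) and the JSONL file name are assumptions.

```python
import json

# Hypothetical shape of one generated test case -- illustrative only;
# the artifacts EvalKit generates for your agent may differ.
test_case = {
    "id": "tc-001",
    "input": "What were Acme Corp's Q3 revenue highlights?",
    "expected_behavior": "Cites at least one retrieved source and states the revenue figure.",
    "tags": ["search", "grounding"],
}

# Test cases are typically stored as JSONL so the run step can iterate over them.
with open("test_cases.jsonl", "w") as f:
    f.write(json.dumps(test_case) + "\n")
```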
System: Linux/macOS • Python 3.11+ • uv • Git
AI Assistant: Currently supports Kiro CLI, Claude Code, and Kilo Code.
```bash
# Install once and use everywhere
uv tool install evalkit --from git+https://github.com/awslabs/Agent-EvalKit.git

# To upgrade later
uv tool install evalkit --force --from git+https://github.com/awslabs/Agent-EvalKit.git
```

Important: Before starting an evaluation with EvalKit, make sure the agent you want to evaluate runs successfully on your machine, with all of its dependencies installed and API keys available locally.
```bash
# Create dedicated evaluation project
evalkit init my-agent-evaluation
cd my-agent-evaluation

# Copy your agent folder into the evaluation project
cp -r /path/to/your/agent-folder .
# This ensures reliable path resolution and artifact management throughout the evaluation process
```
```bash
# Start your AI assistant (example shown for Claude Code)
claude

# When prompted, agree to use Context7 MCP for documentation access
# Type /evalkit to see available commands

# Note: For Kilo Code and Kiro CLI, detailed setup instructions will be shown
# in the terminal after running 'evalkit init my-agent-evaluation'
```

See Complete Example: Check out `examples/qa_agent_evaluation/` for a full evaluation workflow demonstration.
Option A: Guided workflow (recommended for first-time users)

```bash
/evalkit.quick   # user input required
# Example: /evalkit.quick Evaluate my search agent at ./search_agent for final response quality
# This command will guide you through the entire evaluation process
```

Option B: Individual commands (for experienced users)
Step 1: Analyze agent and design evaluation strategy

```bash
/evalkit.plan   # user input required
# Example: /evalkit.plan Evaluate my search agent at ./search_agent for final response quality
```

Step 2: Generate test cases for evaluation

```bash
/evalkit.data   # user input optional
# Example: /evalkit.data Focus on edge cases
```

Step 3: Add tracing to your agent (an illustrative sketch of such instrumentation follows below)

```bash
/evalkit.trace   # user input optional
```
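To give a sense of what tracing instrumentation looks like, here is a minimal hand-written sketch that wraps an agent call in an OpenTelemetry span. It is not the code /evalkit.trace generates; the span name, attribute names, and stubbed agent call are assumptions for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for this sketch; a real setup would send them to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my_agent")

def run_agent(query: str) -> str:
    # Wrap the agent invocation in a span so inputs and outputs end up in the trace.
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.input", query)
        answer = f"stubbed answer for: {query}"  # placeholder for the real agent call
        span.set_attribute("agent.output", answer)
        return answer

if __name__ == "__main__":
    run_agent("What is the capital of France?")
```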
Step 4: Run agent and collect traces

```bash
/evalkit.run_agent   # user input optional
```

Step 5: Write and execute evaluation code over traces

```bash
/evalkit.eval   # user input optional
```

Step 6: Analyze results and provide improvement recommendations

```bash
/evalkit.report   # user input optional
```

EvalKit helps you quickly generate an evaluation pipeline that you can further refine according to your specific requirements:
- Review the code: Check that the generated pipeline works as expected for your agent
- Customize based on your needs: Adapt the generated test cases, evaluation code, and report to your requirements
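As a starting point for that review, the snippet below sketches the kind of trace-level check an evaluation script might perform: load collected trace records and score each final response with a simple keyword heuristic. The file name, record fields, and metric are assumptions for illustration; the code EvalKit writes for your agent will differ.

```python
import json
from pathlib import Path

def keyword_score(response: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords present in the response (a deliberately simple metric)."""
    if not required_keywords:
        return 1.0
    hits = sum(1 for kw in required_keywords if kw.lower() in response.lower())
    return hits / len(required_keywords)

def evaluate(trace_file: str = "traces.jsonl") -> None:
    # Assumed record shape: {"test_id": ..., "final_response": ..., "required_keywords": [...]}
    scores = []
    for line in Path(trace_file).read_text().splitlines():
        record = json.loads(line)
        score = keyword_score(record["final_response"], record.get("required_keywords", []))
        scores.append(score)
        print(f"{record['test_id']}: {score:.2f}")
    if scores:
        print(f"mean score: {sum(scores) / len(scores):.2f}")

if __name__ == "__main__":
    evaluate()
```

In practice you would replace the keyword heuristic with whatever quality criteria your evaluation plan defines.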
CLI commands:

| Command | Description |
|---|---|
| `evalkit init <project-name>` | Initialize new evaluation project |
| `evalkit check` | Check system prerequisites |
AI assistant commands:

| Command | Description |
|---|---|
| `/evalkit.quick` | Step-by-step evaluation guide |
| `/evalkit.plan` | Analyze agent and design evaluation strategy |
| `/evalkit.data` | Generate test cases for evaluation |
| `/evalkit.trace` | Add tracing to your agent |
| `/evalkit.run_agent` | Run agent and collect traces |
| `/evalkit.eval` | Write and execute evaluation code over traces |
| `/evalkit.report` | Analyze results and provide improvement recommendations |
Agent-EvalKit evolved from our autonomous Evaluation Agent project. Inspired by spec-kit, we packaged it as a toolkit compatible with multiple coding assistants.
