Agent Benchmark System

A benchmarking system to compare single-agent vs multi-agent approaches for completing goals using the Strands Agents framework.

DISCLAIMER

This repository contains example code provided for illustrative purposes only. It is not a supported product, and no guarantees are made regarding its functionality, accuracy, security, or fitness for any particular purpose. Use at your own risk. Workato does not provide support, maintenance, or updates for this code. It is not intended for use in production environments without thorough review and testing.

Overview

This system runs the same goal through two different agent architectures and compares their performance:

Single Agent: One agent with full context, using summarization for long conversations
Multi-Agent: An orchestrator that decomposes the goal and delegates to specialized sub-agents

Both approaches are given identical verification requirements to ensure an apples-to-apples comparison.

Workflow

---
config:
  theme: dark
---
flowchart TB
    subgraph Input["📥 Input"]
        G[Goal File]
        MC[MCP Config]
        ENV[.env]
    end

    subgraph Validation["🔍 Validation"]
        CLI[benchmark.py] --> VAL[Goal Validator]
        VAL --> REQ[Requirements Extraction]
    end

    subgraph Execution["⚡ Execution"]
        REQ --> SA[Single Agent]
        REQ --> MA[Multi-Agent]

        SA --> SAW[(single_agent<br/>workspace)]
        MA --> TD[Task Decomposer]
        TD --> OR[Orchestrator]
        OR --> SUB[Sub-Agents]
        SUB --> MAW[(multi_agent<br/>workspace)]
    end

    subgraph Evaluation["📊 Evaluation"]
        SAW --> OV[Output Validator]
        MAW --> OV
        OV --> RG[Rubric Generator]
        RG --> RE[Rubric Evaluator]
    end

    subgraph Output["📈 Output"]
        RE --> RPT[Report Generator]
        RPT --> MD[Markdown Report]
        RPT --> JSON[JSON Report]
    end

    G --> CLI
    MC --> CLI
    ENV --> CLI

Apples-to-Apples Comparison

To ensure fair comparison, both agent types must perform equivalent work:

---
config:
  theme: dark
---
flowchart LR
    SA[Single Agent] --> V1
    OR[Orchestrator] --> V1

    V1[List workspace] --> V2[Read outputs]
    V2 --> V3[Run verification]
    V3 --> V4[Report results]

How It Works

Requirements Extraction: The goal is parsed to identify expected output files
Verification Instructions: Both agents receive mandatory verification steps in their prompts
Enforced Verification: Agents must verify their work before completing
Post-Validation: The benchmark independently validates outputs to confirm compliance
Fair Metrics: Token usage reflects equivalent work performed by both approaches

Single Agent Approach

---
config:
  theme: dark
---
sequenceDiagram
    participant U as User
    participant SA as Single Agent
    participant T as Tools
    participant W as Workspace

    U->>SA: Goal + Verification Requirements
    SA->>T: Execute task
    T->>W: Create outputs
    T-->>SA: Done
    Note over SA: Verification
    SA->>T: List workspace
    SA->>T: Read outputs
    SA->>T: Run verification
    SA-->>U: Result + Summary

Creates one agent with all available tools
Uses SummarizingConversationManager to handle context overflow
Summarization triggers at 80% of model's context window
Must complete verification steps before finishing
Tracks all token usage and tool calls

Multi-Agent Approach

---
config:
  theme: dark
---
sequenceDiagram
    participant U as User
    participant TD as Decomposer
    participant OR as Orchestrator
    participant S as Sub-Agents
    participant T as Tools
    participant W as Workspace

    U->>TD: Goal
    TD-->>OR: Task list
    OR->>S: Invoke tasks
    S->>T: Execute sub-tasks
    T->>W: Create outputs
    S-->>OR: Tasks complete
    Note over OR: Verification
    OR->>T: List workspace
    OR->>T: Read outputs
    OR->>T: Run verification
    OR-->>U: Result + Summary

Task Decomposition: Uses Claude to break down the goal into sub-tasks
- Separates by concern and tool requirements
- Identifies dependencies between tasks
- Documents rationale for separation
Orchestration: A coordinator agent manages sub-agents
- Executes tasks in dependency order
- Passes context between dependent tasks
- Performs verification after all tasks complete
- Aggregates results
Sub-Agents: Specialized agents for each task
- Focused prompts for specific responsibilities
- Independent token tracking
- Isolated execution

Rubric-Based Evaluation

---
config:
  theme: dark
---
flowchart LR
    subgraph Generation
        G[Goal] --> RG[Rubric Generator]
        RG --> R[Rubric<br/>100 points]
    end

    subgraph Collection
        W1[(Single)] --> EC[Evidence Collector]
        W2[(Multi)] --> EC
        EC --> E[File Tree + Contents<br/>+ Test Results]
    end

    subgraph Scoring
        R --> RE[Rubric Evaluator]
        E --> RE
        RE --> S1[Scores + Grades]
    end

The evaluation system:

Rubric Generator: Creates a 100-point rubric with categories and criteria from the goal
Evidence Collector: Gathers file trees, contents, test results, and keyword matches
Rubric Evaluator: LLM-based scoring with detailed reasoning per criterion

Installation

cd agent_benchmark
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt

Configuration

Create a .env file with your Anthropic API key:

ANTHROPIC_API_KEY=your_api_key_here

Environment Variables

Variable	Description
`ANTHROPIC_API_KEY`	Required. Your Anthropic API key
`BENCHMARK_MODEL`	Model ID to use
`BENCHMARK_MCP_CONFIG`	Path to MCP config file
`BENCHMARK_OUTPUT`	Output directory
`BENCHMARK_TEMPERATURE`	Model temperature
`BENCHMARK_TOP_P`	Top-p sampling
`BENCHMARK_TOP_K`	Top-k sampling
`BENCHMARK_MAX_TOKENS`	Max output tokens
`BENCHMARK_SKIP_VALIDATION`	Skip goal validation (true/false)
`BENCHMARK_QUIET`	Disable spinners (true/false)
`BENCHMARK_YES`	Auto-confirm (true/false)

CLI arguments override environment variables.

MCP Configuration

Configure MCP tools in mcp.json:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "./workspace"]
    }
  }
}

Agent Tools

Filesystem Tools (MCP)

The MCP filesystem server (@modelcontextprotocol/server-filesystem) provides sandboxed file operations:

list_allowed_directories - List directories the server can access
list_directory - List contents of a directory
directory_tree - Show directory structure as a tree
read_text_file - Read content of a text file
read_multiple_files - Read multiple files at once
write_file - Create or overwrite files
edit_file - Edit an existing file
create_directory - Create directories
search_files - Search for files matching patterns

Code Execution (Sandbox)

The execute_command tool allows agents to run shell commands within the workspace:

execute_command(command="python -m pytest test_calculator.py -v")
execute_command(command="python calculator.py")
execute_command(command="ls -la")

Commands are executed with a 60-second timeout and are restricted to the workspace directory.

Usage

Basic Usage

python benchmark.py --goal goal.md

Full Options

python benchmark.py \
  --goal goal.md \
  --mcp-config mcp.json \
  --output ./results \
  --model claude-sonnet-4-6 \
  --temperature 0.7 \
  --max-tokens 8192 \
  --skip-validation \
  --quiet \
  --yes

CLI Arguments

Argument	Short	Description	Default
`--goal`	`-g`	Path to goal markdown file	Required
`--mcp-config`	`-m`	Path to MCP config file	`mcp.json`
`--output`	`-o`	Output directory	Auto-generated
`--model`		Model ID to use	Interactive
`--skip-validation`		Skip goal validation step	False
`--temperature`		Model temperature (0.0-1.0)	1.0
`--top-p`		Top-p sampling (nucleus sampling)	None
`--top-k`		Top-k sampling	None
`--max-tokens`		Max output tokens	Model default
`--quiet`	`-q`	Disable spinner/progress indicators	False
`--yes`	`-y`	Auto-confirm prompts (skip confirmation)	False

Output Structure

benchmark_results_YYYYMMDD_HHMMSS/
├── goal.md                      # Copy of the input goal
├── rubric.json                  # Generated assessment rubric
├── benchmark_report.md          # Human-readable comparison report
├── benchmark_report.json        # Machine-readable report data
├── single_agent/
│   ├── master_prompt.md         # System prompt used
│   ├── conversation.jsonl       # Streaming conversation log
│   ├── conversation_meta.json   # Conversation metadata
│   ├── messages.json            # Full message history
│   ├── raw_messages.json        # Raw agent messages for debugging
│   ├── metrics.json             # Token usage and timing metrics
│   ├── tool_calls.json          # Consolidated tool call summary
│   ├── tool_calls.jsonl         # Streaming tool call log
│   └── result.json              # Execution result (success/failure)
├── multi_agent/
│   ├── master_prompt.md         # Orchestrator system prompt
│   ├── task_decomposition.json  # How goal was broken into tasks
│   ├── aggregate_metrics.json   # Combined metrics across all agents
│   ├── result.json              # Overall execution result
│   ├── orchestrator/
│   │   ├── conversation.jsonl   # Orchestrator conversation log
│   │   ├── conversation_meta.json
│   │   ├── messages.json        # Orchestrator message history
│   │   ├── metrics.json         # Orchestrator token/timing metrics
│   │   ├── tool_calls.json      # Tool call summary
│   │   └── tool_calls.jsonl     # Streaming tool calls
│   └── sub_agent_task_N/        # One directory per decomposed task
│       ├── prompt.md            # Sub-agent's task prompt
│       ├── conversation.jsonl   # Sub-agent conversation log
│       ├── conversation_meta.json
│       ├── messages.json        # Sub-agent message history
│       ├── raw_messages.json    # Raw messages for debugging
│       ├── metrics.json         # Sub-agent token/timing metrics
│       ├── tool_calls.json      # Tool call summary
│       ├── tool_calls.jsonl     # Streaming tool calls
│       └── result.json          # Sub-agent execution result
└── workspace/
    ├── single_agent/            # Files created by single agent
    └── multi_agent/             # Files created by multi-agent

Benchmark Report

The generated report includes:

Executive Summary: Quick comparison table with validation status
Output Validation: File existence, syntax checks, test results
Rubric Evaluation: Quality scores, grades, strengths, weaknesses
Token Usage: Detailed breakdown by category with costs
Error Metrics: Error rates, retries, time spent on retries
Execution Metrics: Timing and tool call counts
Task Decomposition Analysis: How the goal was broken down
Post-Mortem Analysis: Output comparisons beyond rubric scoring
Conclusions: When to use each approach

Sample Report Output

| Metric           | Single Agent | Multi-Agent | Winner    |
|------------------|--------------|-------------|-----------|
| Total Tokens     | 29,860       | 159,374     | 🏆 Single |
| Execution Time   | 40.07s       | 240.70s     | 🏆 Single |
| Total Cost       | $0.34        | $3.13       | 🏆 Single |
| Quality Score    | 100/100 (A)  | 100/100 (A) | Tie       |

✅ Apples-to-apples comparison: Both outputs validated successfully

Supported Models

Model	Context Window	Max Output	Input Cost	Output Cost
claude-sonnet-4-6	1,000,000	128,000	$3.00/1M	$15.00/1M
claude-opus-4-6	1,000,000	128,000	$15.00/1M	$75.00/1M
claude-opus-4-5-20251101	200,000	64,000	$15.00/1M	$75.00/1M
claude-sonnet-4-5-20250929	1,000,000	64,000	$3.00/1M	$15.00/1M
claude-haiku-4-5-20251001	200,000	64,000	$0.80/1M	$4.00/1M
claude-opus-4-20250514	200,000	32,000	$15.00/1M	$75.00/1M
claude-sonnet-4-20250514	1,000,000	64,000	$3.00/1M	$15.00/1M

Example

# Run with the example goal (interactive mode)
python benchmark.py --goal examples/goals/simple_goal.md

# Run non-interactively for CI/automation
python benchmark.py --goal examples/goals/simple_goal.md --model claude-sonnet-4-6 --skip-validation --quiet --yes

Future Opportunities

Summarization Strategies - Explore different approaches to context summarization for long-running agents
Optimization Options - Goal-based optimization for speed, quality, or cost; no-summary documents; constrained complexity
Domain Specific Rubrics - Pre-built evaluation rubrics tailored to specific domains (web apps, APIs, data pipelines, etc.)
Task Decomposition Strategies - Alternative approaches to breaking down goals (by layer, by feature, by complexity)
Semantic Tool Selection - Intelligent tool filtering based on goal analysis to reduce context overhead

Contributing

Contributions are welcome! Please submit a Pull Request with your changes.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
examples/goals/coding		examples/goals/coding
src		src
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
benchmark.py		benchmark.py
mcp.json		mcp.json
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Agent Benchmark System

DISCLAIMER

Overview

Workflow

Apples-to-Apples Comparison

How It Works

Single Agent Approach

Multi-Agent Approach

Rubric-Based Evaluation

Installation

Configuration

Environment Variables

MCP Configuration

Agent Tools

Filesystem Tools (MCP)

Code Execution (Sandbox)

Usage

Basic Usage

Full Options

CLI Arguments

Output Structure

Benchmark Report

Sample Report Output

Supported Models

Example

Future Opportunities

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages