# Agent Benchmark System

A benchmarking system that compares single-agent and multi-agent approaches to completing the same goal, built on the Strands Agents framework.

## DISCLAIMER

This repository contains example code provided for illustrative purposes only. It is not a supported product, and no guarantees are made regarding its functionality, accuracy, security, or fitness for any particular purpose. Use at your own risk. Workato does not provide support, maintenance, or updates for this code. It is not intended for use in production environments without thorough review and testing.

## Overview

This system runs the same goal through two different agent architectures and compares their performance:

- **Single Agent**: One agent with full context, using summarization for long conversations
- **Multi-Agent**: An orchestrator that decomposes the goal and delegates to specialized sub-agents

Both approaches are given identical verification requirements to ensure an apples-to-apples comparison.

## Workflow

```mermaid
---
config:
  theme: dark
---
flowchart TB
    subgraph Input["📥 Input"]
        G[Goal File]
        MC[MCP Config]
        ENV[.env]
    end

    subgraph Validation["🔍 Validation"]
        CLI[benchmark.py] --> VAL[Goal Validator]
        VAL --> REQ[Requirements Extraction]
    end

    subgraph Execution["⚡ Execution"]
        REQ --> SA[Single Agent]
        REQ --> MA[Multi-Agent]

        SA --> SAW[(single_agent<br/>workspace)]
        MA --> TD[Task Decomposer]
        TD --> OR[Orchestrator]
        OR --> SUB[Sub-Agents]
        SUB --> MAW[(multi_agent<br/>workspace)]
    end

    subgraph Evaluation["📊 Evaluation"]
        SAW --> OV[Output Validator]
        MAW --> OV
        OV --> RG[Rubric Generator]
        RG --> RE[Rubric Evaluator]
    end

    subgraph Output["📈 Output"]
        RE --> RPT[Report Generator]
        RPT --> MD[Markdown Report]
        RPT --> JSON[JSON Report]
    end

    G --> CLI
    MC --> CLI
    ENV --> CLI
```

## Apples-to-Apples Comparison

To ensure a fair comparison, both agent types must perform equivalent work:

```mermaid
---
config:
  theme: dark
---
flowchart LR
    SA[Single Agent] --> V1
    OR[Orchestrator] --> V1

    V1[List workspace] --> V2[Read outputs]
    V2 --> V3[Run verification]
    V3 --> V4[Report results]
```

## How It Works

1. **Requirements Extraction**: The goal is parsed to identify expected output files
2. **Verification Instructions**: Both agents receive mandatory verification steps in their prompts
3. **Enforced Verification**: Agents must verify their work before completing
4. **Post-Validation**: The benchmark independently validates outputs to confirm compliance
5. **Fair Metrics**: Token usage reflects equivalent work performed by both approaches
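The requirements-extraction step can be sketched as follows. This is a hypothetical illustration, not the benchmark's actual parser: it assumes expected output files are mentioned in the goal as backticked file names.

```python
import re

def extract_expected_files(goal_text: str) -> list[str]:
    """Collect file names mentioned in a goal as expected outputs.

    Hypothetical sketch: matches backticked paths with common extensions.
    """
    pattern = r"`([\w./-]+\.(?:py|md|json|txt|csv))`"
    # Deduplicate while preserving first-seen order.
    seen: dict[str, None] = {}
    for name in re.findall(pattern, goal_text):
        seen.setdefault(name, None)
    return list(seen)

goal = "Create `calculator.py` with tests in `test_calculator.py`; document it in `README.md`."
print(extract_expected_files(goal))
# → ['calculator.py', 'test_calculator.py', 'README.md']
```

The extracted list then drives both the verification instructions injected into each agent's prompt and the independent post-validation step.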

## Single Agent Approach

```mermaid
---
config:
  theme: dark
---
sequenceDiagram
    participant U as User
    participant SA as Single Agent
    participant T as Tools
    participant W as Workspace

    U->>SA: Goal + Verification Requirements
    SA->>T: Execute task
    T->>W: Create outputs
    T-->>SA: Done
    Note over SA: Verification
    SA->>T: List workspace
    SA->>T: Read outputs
    SA->>T: Run verification
    SA-->>U: Result + Summary
```

1. Creates one agent with all available tools
2. Uses `SummarizingConversationManager` to handle context overflow
3. Summarization triggers at 80% of the model's context window
4. Must complete verification steps before finishing
5. Tracks all token usage and tool calls
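The 80% trigger in step 3 amounts to a simple threshold check. A minimal sketch of that rule (not the actual `SummarizingConversationManager` API):

```python
def should_summarize(used_tokens: int, context_window: int, ratio: float = 0.8) -> bool:
    """Trigger summarization once the conversation nears the context limit.

    Illustrative only: the real conversation manager wraps this decision
    in its own bookkeeping.
    """
    return used_tokens >= int(context_window * ratio)

# A 200,000-token window triggers summarization at 160,000 tokens.
print(should_summarize(159_999, 200_000))  # → False
print(should_summarize(160_000, 200_000))  # → True
```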

## Multi-Agent Approach

```mermaid
---
config:
  theme: dark
---
sequenceDiagram
    participant U as User
    participant TD as Decomposer
    participant OR as Orchestrator
    participant S as Sub-Agents
    participant T as Tools
    participant W as Workspace

    U->>TD: Goal
    TD-->>OR: Task list
    OR->>S: Invoke tasks
    S->>T: Execute sub-tasks
    T->>W: Create outputs
    S-->>OR: Tasks complete
    Note over OR: Verification
    OR->>T: List workspace
    OR->>T: Read outputs
    OR->>T: Run verification
    OR-->>U: Result + Summary
```

1. **Task Decomposition**: Uses Claude to break down the goal into sub-tasks
   - Separates by concern and tool requirements
   - Identifies dependencies between tasks
   - Documents the rationale for separation
2. **Orchestration**: A coordinator agent manages sub-agents
   - Executes tasks in dependency order
   - Passes context between dependent tasks
   - Performs verification after all tasks complete
   - Aggregates results
3. **Sub-Agents**: Specialized agents for each task
   - Focused prompts for specific responsibilities
   - Independent token tracking
   - Isolated execution
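"Executes tasks in dependency order" is a topological sort over the decomposed task graph. A sketch with a hypothetical decomposition (the task names and dependency shape here are invented for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition: each task id maps to the tasks it depends on.
tasks = {
    "write_code": [],
    "write_tests": ["write_code"],
    "write_docs": ["write_code"],
    "run_verification": ["write_tests", "write_docs"],
}

def execution_order(task_deps: dict[str, list[str]]) -> list[str]:
    """Return one valid order that respects all task dependencies."""
    return list(TopologicalSorter(task_deps).static_order())

print(execution_order(tasks))
```

Here `write_code` must run first and `run_verification` last; the two middle tasks are independent and could in principle run in either order (or concurrently).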

## Rubric-Based Evaluation

```mermaid
---
config:
  theme: dark
---
flowchart LR
    subgraph Generation
        G[Goal] --> RG[Rubric Generator]
        RG --> R[Rubric<br/>100 points]
    end

    subgraph Collection
        W1[(Single)] --> EC[Evidence Collector]
        W2[(Multi)] --> EC
        EC --> E[File Tree + Contents<br/>+ Test Results]
    end

    subgraph Scoring
        R --> RE[Rubric Evaluator]
        E --> RE
        RE --> S1[Scores + Grades]
    end
```

The evaluation system:

- **Rubric Generator**: Creates a 100-point rubric with categories and criteria derived from the goal
- **Evidence Collector**: Gathers file trees, contents, test results, and keyword matches
- **Rubric Evaluator**: LLM-based scoring with detailed reasoning per criterion
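To make the 100-point structure concrete, here is one plausible rubric shape and the scoring arithmetic behind "scores and grades". The category names, point splits, and grade cutoffs are assumptions for illustration, not the generated rubric's actual schema:

```python
# Hypothetical rubric: categories hold weighted criteria totalling 100 points.
rubric = {
    "Correctness": [("All tests pass", 40), ("Handles edge cases", 20)],
    "Code Quality": [("Idiomatic style", 20), ("Documentation", 20)],
}

def score(awarded: dict[str, int]) -> tuple[int, str]:
    """Sum awarded points per criterion and map the total to a letter grade."""
    total = sum(awarded.values())
    grade = "A" if total >= 90 else "B" if total >= 80 else "C" if total >= 70 else "F"
    return total, grade

print(score({"All tests pass": 40, "Handles edge cases": 15,
             "Idiomatic style": 20, "Documentation": 20}))  # → (95, 'A')
```

In the real system the evaluator LLM assigns the per-criterion points and attaches reasoning; only the aggregation is mechanical.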

## Installation

```shell
cd agent_benchmark
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
```

## Configuration

Create a `.env` file with your Anthropic API key:

```
ANTHROPIC_API_KEY=your_api_key_here
```

### Environment Variables

| Variable | Description |
|----------|-------------|
| `ANTHROPIC_API_KEY` | **Required.** Your Anthropic API key |
| `BENCHMARK_MODEL` | Model ID to use |
| `BENCHMARK_MCP_CONFIG` | Path to MCP config file |
| `BENCHMARK_OUTPUT` | Output directory |
| `BENCHMARK_TEMPERATURE` | Model temperature |
| `BENCHMARK_TOP_P` | Top-p sampling |
| `BENCHMARK_TOP_K` | Top-k sampling |
| `BENCHMARK_MAX_TOKENS` | Max output tokens |
| `BENCHMARK_SKIP_VALIDATION` | Skip goal validation (`true`/`false`) |
| `BENCHMARK_QUIET` | Disable spinners (`true`/`false`) |
| `BENCHMARK_YES` | Auto-confirm (`true`/`false`) |

CLI arguments override environment variables.
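That precedence (CLI argument, then environment variable, then built-in default) can be sketched in a few lines. The helper name is hypothetical:

```python
import os

def resolve(cli_value, env_name: str, default=None):
    """CLI argument wins, then the environment variable, then the default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_name, default)

os.environ["BENCHMARK_TEMPERATURE"] = "0.7"
print(resolve(None, "BENCHMARK_TEMPERATURE", "1.0"))   # → 0.7  (env var used)
print(resolve("0.2", "BENCHMARK_TEMPERATURE", "1.0"))  # → 0.2  (CLI flag wins)
```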

### MCP Configuration

Configure MCP tools in `mcp.json`:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "./workspace"]
    }
  }
}
```
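Reading this file is straightforward JSON parsing; the `mcpServers` key holds a mapping from server name to launch spec. A minimal sketch (the loader function is hypothetical, not the benchmark's actual code):

```python
import json, os, tempfile

config_text = """{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "./workspace"]
    }
  }
}"""

def load_mcp_servers(path: str) -> dict:
    """Read an MCP config file and return the server-name -> spec mapping."""
    with open(path) as fh:
        return json.load(fh).get("mcpServers", {})

# Demo against a temporary copy of the config above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mcp.json")
    with open(path, "w") as fh:
        fh.write(config_text)
    servers = load_mcp_servers(path)
    print(list(servers))                     # → ['filesystem']
    print(servers["filesystem"]["command"])  # → npx
```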

## Agent Tools

### Filesystem Tools (MCP)

The MCP filesystem server (`@modelcontextprotocol/server-filesystem`) provides sandboxed file operations:

- `list_allowed_directories` - List directories the server can access
- `list_directory` - List the contents of a directory
- `directory_tree` - Show a directory structure as a tree
- `read_text_file` - Read the content of a text file
- `read_multiple_files` - Read multiple files at once
- `write_file` - Create or overwrite files
- `edit_file` - Edit an existing file
- `create_directory` - Create directories
- `search_files` - Search for files matching patterns

### Code Execution (Sandbox)

The `execute_command` tool allows agents to run shell commands within the workspace:

```python
execute_command(command="python -m pytest test_calculator.py -v")
execute_command(command="python calculator.py")
execute_command(command="ls -la")
```

Commands are executed with a 60-second timeout and are restricted to the workspace directory.
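A simplified sketch of such a tool, to make the timeout and workspace confinement concrete. Note the hedge in the comments: pinning the working directory is not a full sandbox, and the real tool may enforce stronger restrictions:

```python
import shlex
import subprocess
from pathlib import Path

def execute_command(command: str, workspace: str = "./workspace",
                    timeout: int = 60) -> str:
    """Run a shell command from inside the workspace with a hard timeout.

    Illustrative sketch only. Setting cwd confines the *working directory*,
    not filesystem access in general; a production sandbox needs more
    (e.g. OS-level isolation or an allow-list of commands).
    """
    workdir = Path(workspace).resolve()
    workdir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        shlex.split(command),     # no shell=True: avoids shell injection
        cwd=workdir,              # relative paths resolve inside the workspace
        capture_output=True,
        text=True,
        timeout=timeout,          # raises subprocess.TimeoutExpired when exceeded
    )
    return result.stdout or result.stderr

print(execute_command("echo hello"))
```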

## Usage

### Basic Usage

```shell
python benchmark.py --goal goal.md
```

### Full Options

```shell
python benchmark.py \
  --goal goal.md \
  --mcp-config mcp.json \
  --output ./results \
  --model claude-sonnet-4-6 \
  --temperature 0.7 \
  --max-tokens 8192 \
  --skip-validation \
  --quiet \
  --yes
```

### CLI Arguments

| Argument | Short | Description | Default |
|----------|-------|-------------|---------|
| `--goal` | `-g` | Path to goal markdown file | Required |
| `--mcp-config` | `-m` | Path to MCP config file | `mcp.json` |
| `--output` | `-o` | Output directory | Auto-generated |
| `--model` | | Model ID to use | Interactive |
| `--skip-validation` | | Skip goal validation step | False |
| `--temperature` | | Model temperature (0.0-1.0) | 1.0 |
| `--top-p` | | Top-p sampling (nucleus sampling) | None |
| `--top-k` | | Top-k sampling | None |
| `--max-tokens` | | Max output tokens | Model default |
| `--quiet` | `-q` | Disable spinner/progress indicators | False |
| `--yes` | `-y` | Auto-confirm prompts (skip confirmation) | False |

## Output Structure

```
benchmark_results_YYYYMMDD_HHMMSS/
├── goal.md                      # Copy of the input goal
├── rubric.json                  # Generated assessment rubric
├── benchmark_report.md          # Human-readable comparison report
├── benchmark_report.json        # Machine-readable report data
├── single_agent/
│   ├── master_prompt.md         # System prompt used
│   ├── conversation.jsonl       # Streaming conversation log
│   ├── conversation_meta.json   # Conversation metadata
│   ├── messages.json            # Full message history
│   ├── raw_messages.json        # Raw agent messages for debugging
│   ├── metrics.json             # Token usage and timing metrics
│   ├── tool_calls.json          # Consolidated tool call summary
│   ├── tool_calls.jsonl         # Streaming tool call log
│   └── result.json              # Execution result (success/failure)
├── multi_agent/
│   ├── master_prompt.md         # Orchestrator system prompt
│   ├── task_decomposition.json  # How goal was broken into tasks
│   ├── aggregate_metrics.json   # Combined metrics across all agents
│   ├── result.json              # Overall execution result
│   ├── orchestrator/
│   │   ├── conversation.jsonl   # Orchestrator conversation log
│   │   ├── conversation_meta.json
│   │   ├── messages.json        # Orchestrator message history
│   │   ├── metrics.json         # Orchestrator token/timing metrics
│   │   ├── tool_calls.json      # Tool call summary
│   │   └── tool_calls.jsonl     # Streaming tool calls
│   └── sub_agent_task_N/        # One directory per decomposed task
│       ├── prompt.md            # Sub-agent's task prompt
│       ├── conversation.jsonl   # Sub-agent conversation log
│       ├── conversation_meta.json
│       ├── messages.json        # Sub-agent message history
│       ├── raw_messages.json    # Raw messages for debugging
│       ├── metrics.json         # Sub-agent token/timing metrics
│       ├── tool_calls.json      # Tool call summary
│       ├── tool_calls.jsonl     # Streaming tool calls
│       └── result.json          # Sub-agent execution result
└── workspace/
    ├── single_agent/            # Files created by single agent
    └── multi_agent/             # Files created by multi-agent
```
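Because both runs emit machine-readable metrics (`metrics.json` for the single agent, `aggregate_metrics.json` for the multi-agent run), comparisons can be scripted. The field names below are assumed for illustration; check the actual JSON files for the real schema:

```python
# Hypothetical metric payloads, loosely mirroring the sample report below.
single = {"total_tokens": 29_860, "execution_seconds": 40.07}
multi = {"total_tokens": 159_374, "execution_seconds": 240.70}

def winner(metric: str) -> str:
    """Lower is better for both token usage and execution time."""
    return "single" if single[metric] <= multi[metric] else "multi"

print(winner("total_tokens"))       # → single
print(winner("execution_seconds"))  # → single
```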

## Benchmark Report

The generated report includes:

- **Executive Summary**: Quick comparison table with validation status
- **Output Validation**: File existence, syntax checks, test results
- **Rubric Evaluation**: Quality scores, grades, strengths, weaknesses
- **Token Usage**: Detailed breakdown by category with costs
- **Error Metrics**: Error rates, retries, time spent on retries
- **Execution Metrics**: Timing and tool call counts
- **Task Decomposition Analysis**: How the goal was broken down
- **Post-Mortem Analysis**: Output comparisons beyond rubric scoring
- **Conclusions**: When to use each approach

### Sample Report Output

| Metric           | Single Agent | Multi-Agent | Winner    |
|------------------|--------------|-------------|-----------|
| Total Tokens     | 29,860       | 159,374     | 🏆 Single |
| Execution Time   | 40.07s       | 240.70s     | 🏆 Single |
| Total Cost       | $0.34        | $3.13       | 🏆 Single |
| Quality Score    | 100/100 (A)  | 100/100 (A) | Tie       |

✅ Apples-to-apples comparison: Both outputs validated successfully

## Supported Models

| Model | Context Window | Max Output | Input Cost | Output Cost |
|-------|----------------|------------|------------|-------------|
| claude-sonnet-4-6 | 1,000,000 | 128,000 | $3.00/1M | $15.00/1M |
| claude-opus-4-6 | 1,000,000 | 128,000 | $15.00/1M | $75.00/1M |
| claude-opus-4-5-20251101 | 200,000 | 64,000 | $15.00/1M | $75.00/1M |
| claude-sonnet-4-5-20250929 | 1,000,000 | 64,000 | $3.00/1M | $15.00/1M |
| claude-haiku-4-5-20251001 | 200,000 | 64,000 | $0.80/1M | $4.00/1M |
| claude-opus-4-20250514 | 200,000 | 32,000 | $15.00/1M | $75.00/1M |
| claude-sonnet-4-20250514 | 1,000,000 | 64,000 | $3.00/1M | $15.00/1M |
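Cost figures in the report follow directly from these per-million-token prices. A sketch of the arithmetic (the token counts in the example are invented):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_per_m: float, output_per_m: float) -> float:
    """Cost in dollars given prices quoted per 1M tokens."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Using the $3.00/1M input and $15.00/1M output prices from the table above:
print(round(estimate_cost(20_000, 5_000, 3.00, 15.00), 4))  # → 0.135
```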

## Example

```shell
# Run with the example goal (interactive mode)
python benchmark.py --goal examples/goals/simple_goal.md

# Run non-interactively for CI/automation
python benchmark.py --goal examples/goals/simple_goal.md --model claude-sonnet-4-6 --skip-validation --quiet --yes
```

## Future Opportunities

- **Summarization Strategies**: Explore different approaches to context summarization for long-running agents
- **Optimization Options**: Goal-based optimization for speed, quality, or cost; no-summary documents; constrained complexity
- **Domain-Specific Rubrics**: Pre-built evaluation rubrics tailored to specific domains (web apps, APIs, data pipelines, etc.)
- **Task Decomposition Strategies**: Alternative approaches to breaking down goals (by layer, by feature, by complexity)
- **Semantic Tool Selection**: Intelligent tool filtering based on goal analysis to reduce context overhead

## Contributing

Contributions are welcome! Please submit a Pull Request with your changes.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
