Framework-agnostic CLI tool for benchmarking AI agents across standardized tasks.
- Minimal adapter interface — implement a single async function: `(task: string) => Promise<string>`
- 25 built-in tasks across 5 categories: tool-use, reasoning, code-gen, research, multi-step
- 3 scoring modes — exact-match, regex-match, LLM-judge
- Framework adapters — LangChain, CrewAI, OpenAI Assistants
- HTML reports — standalone reports with Chart.js visualizations
- Agent comparison — compare two agents side-by-side with diff scores
## Quick Start

```bash
# Install
npm install -g @agentbench/cli

# Scaffold an adapter
agentbench init my-agent

# Run the benchmark
agentbench run -a ./my-agent.ts -n "My Agent" --framework openai --model gpt-4o

# Compare two runs
agentbench compare results-a.json results-b.json

# List available tasks
agentbench tasks
```

## Writing an Adapter

The simplest possible interface — your agent just needs to be an async function:
```ts
import type { AgentAdapter } from '@agentbench/cli';

const myAgent: AgentAdapter = async (task: string): Promise<string> => {
  // Call your agent here
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: task }],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
};

export default myAgent;
```
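Because an adapter is just an async function, you can smoke-test it locally before pointing the benchmark at it. A minimal sketch with a stubbed agent — `stubAgent`, the canned answers, and the locally declared `AgentAdapter` alias are all illustrative, not part of the package:

```typescript
// Locally declared alias matching the documented adapter signature.
type AgentAdapter = (task: string) => Promise<string>;

// A stub agent with canned answers; unknown tasks are echoed back.
const stubAgent: AgentAdapter = async (task) => {
  const canned: Record<string, string> = {
    'What is the capital of France?': 'Paris',
  };
  return canned[task] ?? `unhandled: ${task}`;
};

// Exercise it the way the benchmark runner would.
stubAgent('What is the capital of France?').then((answer) => {
  console.log(answer); // → "Paris"
});
```

Swapping the stub for a real API call (as in the example above) changes nothing about how the runner invokes it.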
## Framework Adapters

```ts
// LangChain
import { createLangChainAdapter } from '@agentbench/cli';
const adapter = createLangChainAdapter(myChain);

// CrewAI
import { createCrewAIAdapter } from '@agentbench/cli';
const adapter = createCrewAIAdapter(myCrew);

// OpenAI Assistants
import { createOpenAIAssistantAdapter } from '@agentbench/cli';
const adapter = createOpenAIAssistantAdapter(client, assistantId);
```

## Tasks

| Category | Tasks | Description |
|---|---|---|
| Tool Use | 4 | Calculator, JSON parsing, pattern extraction, unit conversion |
| Reasoning | 5 | Logic puzzles, sequences, syllogisms, analogies, counterfactuals |
| Code Gen | 5 | FizzBuzz, palindromes, API fetch, SQL queries, algorithms |
| Research | 5 | Summarization, fact extraction, comparison, definitions |
| Multi-Step | 6 | Data pipelines, planning, text analysis, code review |
Add your own tasks as YAML files:

```yaml
- id: my-custom-task
  name: My Custom Task
  description: Test a specific capability
  prompt: "What is the capital of France?"
  category: reasoning
  difficulty: easy
  timeout_ms: 10000
  scoring_mode: exact-match
  expected_output: "Paris"
```

```bash
agentbench run -a ./my-agent.ts --task-dir ./my-tasks/
```

## Scoring Modes

- exact-match — Binary pass/fail against `expected_output` (whitespace-normalized)
- regex-match — Binary pass/fail against `expected_pattern`
- llm-judge — LLM scores 0–100 based on `evaluation_criteria` (falls back to a heuristic if no API key is provided)
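The two binary modes can be pictured with a short sketch. This mirrors the documented behavior (whitespace normalization, regex test) but is not the tool's actual source; the helper names are illustrative:

```typescript
// Collapse runs of whitespace and trim before comparing (assumed normalization).
const normalize = (s: string): string => s.replace(/\s+/g, ' ').trim();

// exact-match: normalized strings must be identical.
const exactMatch = (output: string, expected: string): boolean =>
  normalize(output) === normalize(expected);

// regex-match: the expected pattern must match somewhere in the output.
const regexMatch = (output: string, pattern: string): boolean =>
  new RegExp(pattern).test(output);

console.log(exactMatch('  Paris\n', 'Paris'));           // true
console.log(exactMatch('Paris, France', 'Paris'));       // false
console.log(regexMatch('The answer is 42.', '\\b42\\b')); // true
```

This is why exact-match tasks work best with short, unambiguous expected outputs, while regex-match tolerates surrounding prose.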
## Commands

```
agentbench init [name]        Scaffold an adapter file and config
agentbench run                Execute benchmark against an agent
agentbench compare <a> <b>    Compare two result files side-by-side
agentbench tasks              List available benchmark tasks
```
| Flag | Description | Default |
|---|---|---|
| `-a, --adapter <path>` | Path to adapter module | Required |
| `-n, --name <name>` | Agent name | `"unnamed"` |
| `--framework <name>` | Framework identifier | — |
| `--model <name>` | Model identifier | — |
| `-c, --categories <list>` | Comma-separated categories to run | All |
| `-d, --difficulties <list>` | Comma-separated difficulties | All |
| `--concurrency <n>` | Max parallel tasks | `3` |
| `-o, --output <dir>` | Output directory | `./agentbench-results` |
| `--no-html` | Skip HTML report generation | — |
| `--task-dir <path>` | Additional task directory | — |
| `--judge-key <key>` | OpenAI API key for LLM judge | — |
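If you want custom reporting beyond `agentbench compare`, the side-by-side diff can be approximated by hand. The sketch below assumes a hypothetical per-task result shape (`{ id, score }`); the actual `results-*.json` schema may differ:

```typescript
// Hypothetical result shape; the real results JSON schema may differ.
interface TaskResult {
  id: string;
  score: number;
}

// Pair tasks by id and report the score delta (run B minus run A).
const diffScores = (a: TaskResult[], b: TaskResult[]): Map<string, number> => {
  const byId = new Map(a.map((r) => [r.id, r.score]));
  const diffs = new Map<string, number>();
  for (const r of b) {
    const prev = byId.get(r.id);
    if (prev !== undefined) diffs.set(r.id, r.score - prev);
  }
  return diffs;
};

const runA: TaskResult[] = [{ id: 'fizzbuzz', score: 80 }];
const runB: TaskResult[] = [{ id: 'fizzbuzz', score: 95 }];
console.log(diffScores(runA, runB).get('fizzbuzz')); // 15
```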
## Leaderboard

Submit your benchmark results to the public leaderboard at aiagentdirectory.com/benchmark.

## License

MIT