Framework-agnostic CLI tool for benchmarking AI agents across standardized tasks.
- Minimal adapter interface — implement a single async function: `(task: string) => Promise<string>`
- 25 built-in tasks across 5 categories: tool-use, reasoning, code-gen, research, multi-step
- 3 scoring modes — exact-match, regex-match, LLM-judge
- Framework adapters — LangChain, CrewAI, OpenAI Assistants
- HTML reports — standalone reports with Chart.js visualizations
- Agent comparison — compare two agents side-by-side with diff scores
## Quick Start

```bash
# Install
npm install -g @agentbench/cli

# Scaffold an adapter
agentbench init my-agent

# Run the benchmark
agentbench run -a ./my-agent.ts -n "My Agent" --framework openai --model gpt-4o

# Compare two runs
agentbench compare results-a.json results-b.json

# List available tasks
agentbench tasks
```

## Writing an Adapter

The simplest possible interface — your agent just needs to be an async function:
```ts
import type { AgentAdapter } from '@agentbench/cli';

const myAgent: AgentAdapter = async (task: string): Promise<string> => {
  // Call your agent here
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: task }],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
};

export default myAgent;
```
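Because an adapter is just an async function, you can smoke-test it locally before pointing the benchmark at it. A minimal sketch with a stubbed agent — `stubAgent`, the canned answers, and the locally declared `AgentAdapter` alias are all illustrative, not part of the package:

```typescript
// Locally declared alias matching the documented adapter signature.
type AgentAdapter = (task: string) => Promise<string>;

// A stub agent with canned answers; unknown tasks are echoed back.
const stubAgent: AgentAdapter = async (task) => {
  const canned: Record<string, string> = {
    'What is the capital of France?': 'Paris',
  };
  return canned[task] ?? `unhandled: ${task}`;
};

// Exercise it the way the benchmark runner would.
stubAgent('What is the capital of France?').then((answer) => {
  console.log(answer); // → "Paris"
});
```

Swapping the stub for a real API call (as in the example above) changes nothing about how the runner invokes it.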
## Framework Adapters

```ts
// LangChain
import { createLangChainAdapter } from '@agentbench/cli';
const adapter = createLangChainAdapter(myChain);

// CrewAI
import { createCrewAIAdapter } from '@agentbench/cli';
const adapter = createCrewAIAdapter(myCrew);

// OpenAI Assistants
import { createOpenAIAssistantAdapter } from '@agentbench/cli';
const adapter = createOpenAIAssistantAdapter(client, assistantId);
```

## Tasks

| Category | Tasks | Description |
|---|---|---|
| Tool Use | 4 | Calculator, JSON parsing, pattern extraction, unit conversion |
| Reasoning | 5 | Logic puzzles, sequences, syllogisms, analogies, counterfactuals |
| Code Gen | 5 | FizzBuzz, palindromes, API fetch, SQL queries, algorithms |
| Research | 5 | Summarization, fact extraction, comparison, definitions |
| Multi-Step | 6 | Data pipelines, planning, text analysis, code review |
Add your own tasks as YAML files:

```yaml
- id: my-custom-task
  name: My Custom Task
  description: Test a specific capability
  prompt: "What is the capital of France?"
  category: reasoning
  difficulty: easy
  timeout_ms: 10000
  scoring_mode: exact-match
  expected_output: "Paris"
```

```bash
agentbench run -a ./my-agent.ts --task-dir ./my-tasks/
```

## Scoring Modes

- exact-match — Binary pass/fail against `expected_output` (whitespace-normalized)
- regex-match — Binary pass/fail against `expected_pattern`
- llm-judge — LLM scores 0–100 based on `evaluation_criteria` (falls back to a heuristic if no API key is provided)
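The two binary modes can be pictured with a short sketch. This mirrors the documented behavior (whitespace normalization, regex test) but is not the tool's actual source; the helper names are illustrative:

```typescript
// Collapse runs of whitespace and trim before comparing (assumed normalization).
const normalize = (s: string): string => s.replace(/\s+/g, ' ').trim();

// exact-match: normalized strings must be identical.
const exactMatch = (output: string, expected: string): boolean =>
  normalize(output) === normalize(expected);

// regex-match: the expected pattern must match somewhere in the output.
const regexMatch = (output: string, pattern: string): boolean =>
  new RegExp(pattern).test(output);

console.log(exactMatch('  Paris\n', 'Paris'));           // true
console.log(exactMatch('Paris, France', 'Paris'));       // false
console.log(regexMatch('The answer is 42.', '\\b42\\b')); // true
```

This is why exact-match tasks work best with short, unambiguous expected outputs, while regex-match tolerates surrounding prose.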
## Commands

```
agentbench init [name]        Scaffold an adapter file and config
agentbench run                Execute benchmark against an agent
agentbench compare <a> <b>    Compare two result files side-by-side
agentbench tasks              List available benchmark tasks
```
| Flag | Description | Default |
|---|---|---|
| `-a, --adapter <path>` | Path to adapter module | Required |
| `-n, --name <name>` | Agent name | `"unnamed"` |
| `--framework <name>` | Framework identifier | — |
| `--model <name>` | Model identifier | — |
| `-c, --categories <list>` | Comma-separated categories to run | All |
| `-d, --difficulties <list>` | Comma-separated difficulties | All |
| `--concurrency <n>` | Max parallel tasks | `3` |
| `-o, --output <dir>` | Output directory | `./agentbench-results` |
| `--no-html` | Skip HTML report generation | — |
| `--task-dir <path>` | Additional task directory | — |
| `--judge-key <key>` | OpenAI API key for LLM judge | — |
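If you want custom reporting beyond `agentbench compare`, the side-by-side diff can be approximated by hand. The sketch below assumes a hypothetical per-task result shape (`{ id, score }`); the actual `results-*.json` schema may differ:

```typescript
// Hypothetical result shape; the real results JSON schema may differ.
interface TaskResult {
  id: string;
  score: number;
}

// Pair tasks by id and report the score delta (run B minus run A).
const diffScores = (a: TaskResult[], b: TaskResult[]): Map<string, number> => {
  const byId = new Map(a.map((r) => [r.id, r.score]));
  const diffs = new Map<string, number>();
  for (const r of b) {
    const prev = byId.get(r.id);
    if (prev !== undefined) diffs.set(r.id, r.score - prev);
  }
  return diffs;
};

const runA: TaskResult[] = [{ id: 'fizzbuzz', score: 80 }];
const runB: TaskResult[] = [{ id: 'fizzbuzz', score: 95 }];
console.log(diffScores(runA, runB).get('fizzbuzz')); // 15
```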
## Leaderboard

Submit your benchmark results to the public leaderboard at aiagentdirectory.com/benchmark.

## License

MIT