gevals

🧪 Test your MCP servers by having AI agents complete real tasks.

What It Does

gevals validates MCP servers by:

  1. 🔧 Running setup scripts (e.g., create test namespace)
  2. 🤖 Giving an AI agent a task prompt (e.g., "create an nginx pod")
  3. 📝 Recording which MCP tools the agent uses
  4. ✅ Verifying the task succeeded via scripts OR an LLM judge (e.g., the pod is running, or the response contains expected content)
  5. 🔍 Checking assertions (e.g., did the agent call pods_create?)
  6. 🧹 Running cleanup scripts

If agents successfully complete tasks using your MCP server, your tools are well-designed.

Quick Start

# Build
go build -o gevals ./cmd/gevals

# Run the example (requires Kubernetes cluster + MCP server)
./gevals eval examples/kubernetes/eval.yaml

The tool will:

  • Display progress in real-time
  • Save results to gevals-<eval-name>-out.json
  • Show pass/fail summary
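
The example assumes a reachable Kubernetes cluster and a Kubernetes MCP server at http://localhost:8080/mcp (see mcp-config.yaml below). If you need a local cluster, one option is kind; this sketch assumes kind and kubectl are installed and is not part of gevals itself:

# Create a throwaway local cluster (illustrative; the MCP server must be started separately)
kind create cluster --name gevals-test
kubectl cluster-info --context kind-gevals-test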

Example Setup

eval.yaml - Main config:

kind: Eval
metadata:
  name: "kubernetes-test"
config:
  # Option 1: Inline builtin agent (no separate file needed)
  agent:
    type: "builtin.claude-code"

  # Option 2: OpenAI-compatible builtin agent
  # agent:
  #   type: "builtin.openai-agent"
  #   model: "gpt-4"

  # Option 3: Reference a custom agent file
  # agent:
  #   type: "file"
  #   path: agent.yaml

  mcpConfigFile: mcp-config.yaml  # Your MCP server config
  llmJudge:                        # Optional: LLM judge for semantic verification
    env:
      baseUrlKey: JUDGE_BASE_URL   # Env var name for LLM API base URL
      apiKeyKey: JUDGE_API_KEY     # Env var name for LLM API key
      modelNameKey: JUDGE_MODEL_NAME # Env var name for model name
  taskSets:
    - path: tasks/create-pod.yaml
      assertions:
        toolsUsed:
          - server: kubernetes
            toolPattern: "pods_.*"  # Agent must use pod-related tools
        minToolCalls: 1
        maxToolCalls: 10

mcp-config.yaml - MCP server to test:

mcpServers:
  kubernetes:
    type: http
    url: http://localhost:8080/mcp
    enableAllTools: true

agent.yaml - AI agent configuration:

kind: Agent
metadata:
  name: "claude-code"
builtin:
  type: "claude-code"  # Use built-in Claude Code configuration

Or with OpenAI-compatible agents:

kind: Agent
metadata:
  name: "my-agent"
builtin:
  type: "openai-agent"
  model: "gpt-4"
# Set these environment variables:
# export MODEL_BASE_URL="https://api.openai.com/v1"
# export MODEL_KEY="sk-..."

For custom configurations, specify the commands section manually (see "Agent Configuration" below).

tasks/create-pod.yaml - Test task:

kind: Task
metadata:
  name: "create-nginx-pod"
  difficulty: easy
steps:
  setup:
    file: setup.sh      # Creates test namespace
  verify:
    file: verify.sh     # Script-based: Checks pod is running
    # OR use LLM judge (requires llmJudge config in eval.yaml):
    # contains: "pod is running"  # Semantic check: response contains this text
    # exact: "The pod web-server is running"  # Semantic check: exact match
  cleanup:
    file: cleanup.sh    # Deletes pod
  prompt:
    inline: Create an nginx pod named web-server in the test-ns namespace

Note: You must choose either script-based verification (file or inline) OR LLM judge verification (contains or exact), not both.

Assertions

Validate agent behavior:

assertions:
  # Must call these tools
  toolsUsed:
    - server: kubernetes
      tool: pods_create              # Exact tool name
    - server: kubernetes
      toolPattern: "pods_.*"         # Regex pattern

  # Must call at least one of these
  requireAny:
    - server: kubernetes
      tool: pods_create

  # Must NOT call these
  toolsNotUsed:
    - server: kubernetes
      tool: namespaces_delete

  # Call limits
  minToolCalls: 1
  maxToolCalls: 10

  # Resource access
  resourcesRead:
    - server: filesystem
      uriPattern: "/data/.*\\.json$"
  resourcesNotRead:
    - server: filesystem
      uri: /etc/secrets/password

  # Prompt usage
  promptsUsed:
    - server: templates
      prompt: deployment-template

  # Call order (can have other calls between)
  callOrder:
    - type: tool
      server: kubernetes
      name: namespaces_create
    - type: tool
      server: kubernetes
      name: pods_create

  # No duplicate calls
  noDuplicateCalls: true

Test Scripts

Scripts signal success with exit code 0 and failure with any non-zero exit code:

setup.sh - Prepare environment:

#!/usr/bin/env bash
kubectl create namespace test-ns

verify.sh - Check task succeeded:

#!/usr/bin/env bash
kubectl wait --for=condition=Ready pod/web-server -n test-ns --timeout=120s

cleanup.sh - Remove resources:

#!/usr/bin/env bash
kubectl delete pod web-server -n test-ns

Or use inline scripts in the task YAML:

steps:
  setup:
    inline: |-
      #!/usr/bin/env bash
      kubectl create namespace test-ns
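
To debug a script outside of gevals, run it directly and check its exit code (assuming the script is executable):

# A quick manual check of the verification logic
./verify.sh; echo "exit code: $?"   # 0 = pass, non-zero = fail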

LLM Judge Verification

Instead of script-based verification, you can use an LLM judge to semantically evaluate agent responses. This is useful when:

  • You want to check if the agent's response contains specific information (semantic matching)
  • The expected output format may vary but the meaning should be consistent
  • You're testing tasks where the agent provides text responses rather than performing actions

Configuration

First, configure the LLM judge in your eval.yaml:

config:
  llmJudge:
    env:
      baseUrlKey: JUDGE_BASE_URL    # Environment variable for LLM API base URL
      apiKeyKey: JUDGE_API_KEY      # Environment variable for LLM API key
      modelNameKey: JUDGE_MODEL_NAME # Environment variable for model name

Set the required environment variables before running:

export JUDGE_BASE_URL="https://api.openai.com/v1"
export JUDGE_API_KEY="sk-..."
export JUDGE_MODEL_NAME="gpt-4o"

Note: The LLM judge currently supports only OpenAI-compatible APIs (i.e., APIs that follow the OpenAI request/response format). The implementation uses the OpenAI Go SDK with a configurable base URL, so any OpenAI-compatible endpoint works; other API formats are not supported.
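
Because the base URL is configurable, any endpoint that speaks the OpenAI chat-completions format can act as the judge. The values below are purely illustrative (a locally hosted OpenAI-compatible server such as Ollama usually ignores the key, but the variable may still need to be set):

# Illustrative: point the judge at a local OpenAI-compatible endpoint
export JUDGE_BASE_URL="http://localhost:11434/v1"
export JUDGE_API_KEY="placeholder-key"
export JUDGE_MODEL_NAME="llama3.1"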

Evaluation Modes

The LLM judge supports two evaluation modes:

CONTAINS mode (verify.contains):

  • Checks if the agent's response semantically contains all core information from the reference answer
  • Extra, correct, and non-contradictory information is acceptable
  • Format and phrasing differences are ignored (semantic matching)
  • Use when you want to verify the response includes specific information

EXACT mode (verify.exact):

  • Checks if the agent's response is semantically equivalent to the reference answer
  • Simple rephrasing is acceptable (e.g., "Paris is the capital" vs "The capital is Paris")
  • Adding or omitting information will fail
  • Use when you need precise semantic equivalence

Note: Both modes use the same LLM-based semantic evaluation approach. The difference is only in the system prompt instructions given to the judge LLM. See pkg/llmjudge/prompts.go for the implementation details.

Usage in Tasks

In your task YAML, use verify.contains or verify.exact instead of verify.file or verify.inline:

steps:
  verify:
    contains: "mysql:8.0.36"  # Response must contain this information

Or:

steps:
  verify:
    exact: "The pod web-server is running in namespace test-ns"  # Response must be semantically equivalent

Important: You cannot use both script-based verification and LLM judge verification in the same task. Choose one method:

  • Script-based: verify.file or verify.inline (runs a script that returns exit code 0 for success)
  • LLM judge: verify.contains or verify.exact (semantically evaluates the agent's text response)
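
Putting this together, a task that relies entirely on the LLM judge might look like the following sketch (names and values are illustrative, not taken from the examples directory):

kind: Task
metadata:
  name: "report-mysql-version"
  difficulty: easy
steps:
  setup:
    file: setup.sh            # e.g., deploy the mysql pod the agent will inspect
  verify:
    contains: "mysql:8.0.36"  # LLM judge: response must mention the image version
  cleanup:
    file: cleanup.sh
  prompt:
    inline: Which container image does the mysql pod in test-ns use?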

Results

Pass/fail means:

✅ Pass → Your MCP server is well-designed

  • Tools are discoverable
  • Descriptions are clear
  • Schemas work
  • Implementation is correct

❌ Fail → Needs improvement

  • Tool descriptions unclear
  • Schema too complex
  • Missing functionality
  • Implementation bugs

Output

Results saved to gevals-<eval-name>-out.json:

{
  "taskName": "create-nginx-pod",
  "taskPassed": true,
  "allAssertionsPassed": true,
  "assertionResults": {
    "toolsUsed": { "passed": true },
    "minToolCalls": { "passed": true }
  },
  "callHistory": {
    "toolCalls": [
      {
        "serverName": "kubernetes",
        "toolName": "pods_create",
        "timestamp": "2025-01-15T10:30:00Z"
      }
    ]
  }
}
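
Because the output is plain JSON, standard tooling works for quick inspection. For example, assuming the eval above is named kubernetes-test and jq is installed (the exact top-level structure may differ slightly from the excerpt):

# Pretty-print the full results file
jq '.' gevals-kubernetes-test-out.json

# Collect per-task pass/fail fields wherever they appear in the document
jq '[.. | objects | select(has("taskPassed")) | {taskName, taskPassed}]' gevals-kubernetes-test-out.json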

Agent Configuration

Inline vs File-based Configuration

You can configure agents in two ways:

  1. Inline builtin agent (recommended for simple setups):

kind: Eval
config:
  agent:
    type: "builtin.claude-code"

  2. Custom agent file:

kind: Eval
config:
  agent:
    type: "file"
    path: agent.yaml

Use inline configuration for simple setups with built-in agents. Use a separate file when you need custom commands or want to reuse the same agent across multiple evals.

Built-in Agent Types

gevals provides built-in configurations for popular AI agents to eliminate boilerplate:

Claude Code:

kind: Eval
config:
  agent:
    type: "builtin.claude-code"

OpenAI-compatible agents:

kind: Eval
config:
  agent:
    type: "builtin.openai-agent"
    model: "gpt-4"  # or any OpenAI-compatible model

Set environment variables for API access:

# Generic environment variables used by all OpenAI-compatible models
export MODEL_BASE_URL="https://api.openai.com/v1"
export MODEL_KEY="sk-..."

# For other providers (e.g., granite, custom endpoints):
# export MODEL_BASE_URL="https://your-endpoint/v1"
# export MODEL_KEY="your-key"

Available Built-in Types

  • claude-code - Anthropic's Claude Code CLI
  • openai-agent - OpenAI-compatible agents using direct API calls (requires model)

Custom Agent Configuration

For custom setups, specify the commands section:

kind: Agent
metadata:
  name: "custom-agent"
commands:
  useVirtualHome: false
  argTemplateMcpServer: "--mcp {{ .File }}"
  argTemplateAllowedTools: "{{ .ToolName }}"
  runPrompt: |-
    my-agent --mcp-config {{ .McpServerFileArgs }} --prompt "{{ .Prompt }}"

Overriding Built-in Defaults

You can use a built-in type and override specific settings:

kind: Agent
metadata:
  name: "claude-custom"
builtin:
  type: "claude-code"
commands:
  useVirtualHome: true  # Override just this setting

How It Works

The tool creates an MCP proxy that sits between the AI agent and your MCP server:

AI Agent → MCP Proxy (recording) → Your MCP Server

Everything gets recorded:

  • Which tools were called
  • What arguments were passed
  • When calls happened
  • What responses came back

Then assertions validate the recorded behavior matches your expectations.
