🧪 Test your MCP servers by having AI agents complete real tasks.

gevals validates MCP servers by:
- 🔧 Running setup scripts (e.g., create test namespace)
- 🤖 Giving an AI agent a task prompt (e.g., "create an nginx pod")
- 📝 Recording which MCP tools the agent uses
- ✅ Verifying the task succeeded via scripts OR LLM judge (e.g., pod is running, or response contains expected content)
- 🔍 Checking assertions (e.g., did the agent call pods_create?)
- 🧹 Running cleanup scripts

If agents successfully complete tasks using your MCP server, your tools are well-designed.
```bash
# Build
go build -o gevals ./cmd/gevals

# Run the example (requires Kubernetes cluster + MCP server)
./gevals eval examples/kubernetes/eval.yaml
```

The tool will:
- Display progress in real-time
- Save results to gevals-<name>-out.json
- Show pass/fail summary

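To inspect a finished run programmatically, you can query the results JSON with jq. This is only a sketch, assuming the file name follows the gevals-<name>-out.json pattern for the example eval below and the structure shown in the output example later in this README:

```bash
# Print the headline result fields for the example eval (named "kubernetes-test")
jq '{taskName, taskPassed, allAssertionsPassed}' gevals-kubernetes-test-out.json
```
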
eval.yaml - Main config:
```yaml
kind: Eval
metadata:
  name: "kubernetes-test"
config:
  # Option 1: Inline builtin agent (no separate file needed)
  agent:
    type: "builtin.claude-code"

  # Option 2: OpenAI-compatible builtin agent
  # agent:
  #   type: "builtin.openai-agent"
  #   model: "gpt-4"

  # Option 3: Reference a custom agent file
  # agent:
  #   type: "file"
  #   path: agent.yaml

  mcpConfigFile: mcp-config.yaml   # Your MCP server config

  llmJudge:                        # Optional: LLM judge for semantic verification
    env:
      baseUrlKey: JUDGE_BASE_URL       # Env var name for LLM API base URL
      apiKeyKey: JUDGE_API_KEY         # Env var name for LLM API key
      modelNameKey: JUDGE_MODEL_NAME   # Env var name for model name

  taskSets:
    - path: tasks/create-pod.yaml
      assertions:
        toolsUsed:
          - server: kubernetes
            toolPattern: "pods_.*"   # Agent must use pod-related tools
        minToolCalls: 1
        maxToolCalls: 10
```

mcp-config.yaml - MCP server to test:
```yaml
mcpServers:
  kubernetes:
    type: http
    url: http://localhost:8080/mcp
    enableAllTools: true
```

agent.yaml - AI agent configuration:
```yaml
kind: Agent
metadata:
  name: "claude-code"
builtin:
  type: "claude-code"   # Use built-in Claude Code configuration
```

Or with OpenAI-compatible agents:
```yaml
kind: Agent
metadata:
  name: "my-agent"
builtin:
  type: "openai-agent"
  model: "gpt-4"

# Set these environment variables:
# export MODEL_BASE_URL="https://api.openai.com/v1"
# export MODEL_KEY="sk-..."
```

For custom configurations, specify the commands section manually (see "Agent Configuration" below).

tasks/create-pod.yaml - Test task:
```yaml
kind: Task
metadata:
  name: "create-nginx-pod"
  difficulty: easy
steps:
  setup:
    file: setup.sh    # Creates test namespace
  verify:
    file: verify.sh   # Script-based: checks the pod is running
    # OR use LLM judge (requires llmJudge config in eval.yaml):
    # contains: "pod is running"               # Semantic check: response contains this text
    # exact: "The pod web-server is running"   # Semantic check: exact match
  cleanup:
    file: cleanup.sh  # Deletes the pod
prompt:
  inline: Create an nginx pod named web-server in the test-ns namespace
```

Note: You must choose either script-based verification (file or inline) OR LLM judge verification (contains or exact), not both.

Validate agent behavior:
```yaml
assertions:
  # Must call these tools
  toolsUsed:
    - server: kubernetes
      tool: pods_create          # Exact tool name
    - server: kubernetes
      toolPattern: "pods_.*"     # Regex pattern

  # Must call at least one of these
  requireAny:
    - server: kubernetes
      tool: pods_create

  # Must NOT call these
  toolsNotUsed:
    - server: kubernetes
      tool: namespaces_delete

  # Call limits
  minToolCalls: 1
  maxToolCalls: 10

  # Resource access
  resourcesRead:
    - server: filesystem
      uriPattern: "/data/.*\\.json$"
  resourcesNotRead:
    - server: filesystem
      uri: /etc/secrets/password

  # Prompt usage
  promptsUsed:
    - server: templates
      prompt: deployment-template

  # Call order (can have other calls between)
  callOrder:
    - type: tool
      server: kubernetes
      name: namespaces_create
    - type: tool
      server: kubernetes
      name: pods_create

  # No duplicate calls
  noDuplicateCalls: true
```

Scripts return exit 0 for success, non-zero for failure:

setup.sh - Prepare environment:
```bash
#!/usr/bin/env bash
kubectl create namespace test-ns
```

verify.sh - Check task succeeded:

```bash
#!/usr/bin/env bash
kubectl wait --for=condition=Ready pod/web-server -n test-ns --timeout=120s
```

cleanup.sh - Remove resources:

```bash
#!/usr/bin/env bash
kubectl delete pod web-server -n test-ns
```

Or use inline scripts in the task YAML:

```yaml
steps:
  setup:
    inline: |-
      #!/usr/bin/env bash
      kubectl create namespace test-ns
```

Instead of script-based verification, you can use an LLM judge to semantically evaluate agent responses. This is useful when:
- You want to check if the agent's response contains specific information (semantic matching)
- The expected output format may vary but the meaning should be consistent
- You're testing tasks where the agent provides text responses rather than performing actions

First, configure the LLM judge in your eval.yaml:
```yaml
config:
  llmJudge:
    env:
      baseUrlKey: JUDGE_BASE_URL       # Environment variable for LLM API base URL
      apiKeyKey: JUDGE_API_KEY         # Environment variable for LLM API key
      modelNameKey: JUDGE_MODEL_NAME   # Environment variable for model name
```

Set the required environment variables before running:

```bash
export JUDGE_BASE_URL="https://api.openai.com/v1"
export JUDGE_API_KEY="sk-..."
export JUDGE_MODEL_NAME="gpt-4o"
```

Note: The LLM judge currently supports only OpenAI-compatible APIs (APIs that follow the OpenAI request/response format). The implementation uses the OpenAI Go SDK with a configurable base URL, so any OpenAI-compatible endpoint works; APIs with other formats are not supported.

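As a quick way to confirm an endpoint speaks this format, you can send a plain chat-completions request to it yourself. This is only a sanity-check sketch using the same environment variables; the judge's actual prompts and payload differ:

```bash
# A standard OpenAI-style chat completion call; any judge endpoint must accept this shape
curl -sS "$JUDGE_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $JUDGE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "'"$JUDGE_MODEL_NAME"'", "messages": [{"role": "user", "content": "ping"}]}'
```
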
The LLM judge supports two evaluation modes:

CONTAINS mode (verify.contains):
- Checks if the agent's response semantically contains all core information from the reference answer
- Extra, correct, and non-contradictory information is acceptable
- Format and phrasing differences are ignored (semantic matching)
- Use when you want to verify the response includes specific information

EXACT mode (verify.exact):
- Checks if the agent's response is semantically equivalent to the reference answer
- Simple rephrasing is acceptable (e.g., "Paris is the capital" vs "The capital is Paris")
- Adding or omitting information will fail
- Use when you need precise semantic equivalence

For example, if the reference answer is "The pod web-server is running in namespace test-ns", a response that also reports the pod's IP address would pass CONTAINS but fail EXACT.

Note: Both modes use the same LLM-based semantic evaluation approach. The difference is only in the system prompt instructions given to the judge LLM. See pkg/llmjudge/prompts.go for the implementation details.

In your task YAML, use verify.contains or verify.exact instead of verify.file or verify.inline:

```yaml
steps:
  verify:
    contains: "mysql:8.0.36"   # Response must contain this information
```

```yaml
steps:
  verify:
    exact: "The pod web-server is running in namespace test-ns"   # Response must match exactly (semantically)
```

Important: You cannot use both script-based verification and LLM judge verification in the same task. Choose one method:
- Script-based: verify.file or verify.inline (runs a script that returns exit code 0 for success)
- LLM judge: verify.contains or verify.exact (semantically evaluates the agent's text response)

Pass/fail means:

✅ Pass → Your MCP server is well-designed
- Tools are discoverable
- Descriptions are clear
- Schemas work
- Implementation is correct

❌ Fail → Needs improvement
- Tool descriptions unclear
- Schema too complex
- Missing functionality
- Implementation bugs

Results saved to gevals-<eval-name>-out.json:
```json
{
  "taskName": "create-nginx-pod",
  "taskPassed": true,
  "allAssertionsPassed": true,
  "assertionResults": {
    "toolsUsed": { "passed": true },
    "minToolCalls": { "passed": true }
  },
  "callHistory": {
    "toolCalls": [
      {
        "serverName": "kubernetes",
        "toolName": "pods_create",
        "timestamp": "2025-01-15T10:30:00Z"
      }
    ]
  }
}
```

You can configure agents in two ways:
- Inline builtin agent (recommended for simple setups):
```yaml
kind: Eval
config:
  agent:
    type: "builtin.claude-code"
```

- Custom agent file:

```yaml
kind: Eval
config:
  agent:
    type: "file"
    path: agent.yaml
```

Use inline configuration for simple setups with built-in agents. Use a separate file when you need custom commands or want to reuse the same agent across multiple evals.

gevals provides built-in configurations for popular AI agents to eliminate boilerplate:

Claude Code:

```yaml
kind: Eval
config:
  agent:
    type: "builtin.claude-code"
```

OpenAI-compatible agents:

```yaml
kind: Eval
config:
  agent:
    type: "builtin.openai-agent"
    model: "gpt-4"   # or any OpenAI-compatible model
```

Set environment variables for API access:

```bash
# Generic environment variables used by all OpenAI-compatible models
export MODEL_BASE_URL="https://api.openai.com/v1"
export MODEL_KEY="sk-..."

# For other providers (e.g., granite, custom endpoints):
# export MODEL_BASE_URL="https://your-endpoint/v1"
# export MODEL_KEY="your-key"
```

Available built-in types:
- claude-code - Anthropic's Claude Code CLI
- openai-agent - OpenAI-compatible agents using direct API calls (requires model)

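Putting the pieces together, a run with the openai-agent builtin (Option 2 in the eval.yaml example above) might look like this; the URL and key are placeholders:

```bash
# Point the builtin agent at an OpenAI-compatible endpoint, then run the example eval
export MODEL_BASE_URL="https://api.openai.com/v1"
export MODEL_KEY="sk-..."
./gevals eval examples/kubernetes/eval.yaml
```
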
For custom setups, specify the commands section:
```yaml
kind: Agent
metadata:
  name: "custom-agent"
commands:
  useVirtualHome: false
  argTemplateMcpServer: "--mcp {{ .File }}"
  argTemplateAllowedTools: "{{ .ToolName }}"
  runPrompt: |-
    my-agent --mcp-config {{ .McpServerFileArgs }} --prompt "{{ .Prompt }}"
```

You can use a built-in type and override specific settings:

```yaml
kind: Agent
metadata:
  name: "claude-custom"
builtin:
  type: "claude-code"
commands:
  useVirtualHome: true   # Override just this setting
```

The tool creates an MCP proxy that sits between the AI agent and your MCP server:

AI Agent → MCP Proxy (recording) → Your MCP Server

Everything gets recorded:
- Which tools were called
- What arguments were passed
- When calls happened
- What responses came back

Then assertions validate that the recorded behavior matches your expectations.

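Because the full call history is saved in the results JSON, you can also audit the recorded behavior directly. A small jq sketch, with field names taken from the output example above and the file name assumed from the example eval:

```bash
# List every recorded tool call in the order it happened
jq -r '.callHistory.toolCalls[] | "\(.serverName).\(.toolName) @ \(.timestamp)"' gevals-kubernetes-test-out.json
```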