End-to-end testing suite for LLM inference engines. Measures quality consistency and performance across vLLM, llama.cpp, Ollama, and ExLlamaV2.
Full Documentation | CLI Reference
- Multi-engine support — benchmark across vLLM, llama.cpp, Ollama, and ExLlamaV2 with a unified interface
- ARM64 and multi-arch support — platform-aware image selection for ARM64 boards (DGX Spark, Jetson Orin) with automatic KITT-managed builds for engines that lack multi-arch images
- Docker and native engine modes — engines run in Docker containers (default) or as native host processes. DGX Spark defaults to native for Ollama and llama.cpp. Each engine declares which modes it supports
- Engine profiles — named configurations with build flags and runtime args, saveable and selectable when creating benchmarks or campaigns
- Quality benchmarks — MMLU, GSM8K, TruthfulQA, and HellaSwag evaluations
- Performance benchmarks — throughput, latency, memory usage, and warmup analysis
- Hardware fingerprinting — automatic system identification for reproducible results
- KARR results storage — Kitt's AI Results Repository. SQLite (default) or PostgreSQL with queryable schema and full JSON round-tripping
- Docker deployment stacks — composable `docker-compose` stacks via `kitt stack`
- Devon integration — embedded Devon web UI via server-side reverse proxy, with automatic fallback to local Devon
- Model format validation — preflight checks prevent launching containers with incompatible model formats (e.g. safetensors on llama.cpp)
- Web dashboard & REST API — browse results, manage agents, configure engine profiles, and view engine status with TLS and per-agent token auth
- Local model browser — scan and display models from a local directory, with engine and platform compatibility filtering in the quick test form
- Remote agents — deploy thin agents to GPU servers via `curl | bash`; agents copy models from NFS shares, run benchmarks locally, and clean up. Per-agent settings are configurable from the web UI and synced via heartbeat
- Test agents — virtual agents with configurable hardware specs for UI testing without real GPUs. Quick tests and campaigns simulate execution with realistic delays, live log streaming, and fake result generation
- Campaigns — multi-model, multi-engine benchmark sweeps with a step-by-step web wizard and CLI. Campaigns dispatch tests one at a time via the heartbeat mechanism and stream live progress logs
- Configurable engine images — override default Docker images per engine via `~/.kitt/engines.yaml`
- Monitoring — Prometheus + Grafana + InfluxDB stack generation
- Custom benchmarks — define evaluations with YAML configuration files
```shell
# Build the KITT image
docker build -t kitt .

# Run a benchmark (mounts Docker socket for sibling containers)
docker run --rm --network host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /path/to/models:/models:ro \
  -v ./kitt-results:/app/kitt-results \
  kitt run -m /models/llama-7b -e vllm
```
```shell
# Or use docker-compose
MODEL_PATH=/path/to/models docker compose run kitt run -m /models/llama-7b -e vllm
```

Generate a full deployment stack with web UI, database, and monitoring:
```shell
kitt stack generate prod --web --postgres --monitoring
kitt stack start --name prod
```

```shell
poetry install            # core dependencies
eval $(poetry env activate)
poetry install -E all     # optional: install all extras (web, datasets, TUI, devon)
poetry install -E devon   # optional: just remote Devon support (httpx)
```

Requires Python 3.10+, Poetry, and Docker.
```shell
kitt fingerprint               # detect hardware
kitt engines setup vllm        # pull engine Docker image
kitt engines list              # check engine status and supported modes
kitt engines status            # detect native engine binaries
kitt engines profiles list     # list saved engine profiles

kitt run -m /models/llama-7b -e vllm -s standard -o ./results
kitt run -m /models/model.gguf -e llama_cpp --auto-pull   # auto-pull engine image if missing

kitt storage init              # initialize results database
kitt storage list              # browse stored runs
kitt storage stats             # summary statistics

kitt web                       # launch web dashboard
```

KITT validates model format compatibility before launching. Each engine declares the formats it supports:
| Engine | Supported Formats | Modes | Default Mode |
|---|---|---|---|
| vLLM | safetensors, pytorch | docker, native | docker |
| llama.cpp | gguf | docker, native | docker |
| Ollama | gguf | docker, native | docker |
| ExLlamaV2 | gptq, exl2, gguf | docker | docker |
| MLX | mlx, safetensors | native | native |
On DGX Spark, Ollama and llama.cpp default to native mode automatically.
If you attempt to run a safetensors model with llama.cpp (or a GGUF model with vLLM), KITT exits with a clear error before any engine starts. The web UI quick test form also filters models by engine compatibility.
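The preflight check described above can be sketched in a few lines. This is an illustrative reconstruction, not KITT's actual API — the function names and format-detection heuristics are assumptions; only the engine/format table mirrors the documentation.

```python
# Hypothetical sketch of KITT's model-format preflight check.
from pathlib import Path

# Supported formats per engine, mirroring the table above.
ENGINE_FORMATS = {
    "vllm": {"safetensors", "pytorch"},
    "llama_cpp": {"gguf"},
    "ollama": {"gguf"},
    "exllamav2": {"gptq", "exl2", "gguf"},
}

def detect_format(model_path: str) -> str:
    """Guess the model format from the file or directory layout."""
    p = Path(model_path)
    if p.suffix == ".gguf":
        return "gguf"
    if p.is_dir() and any(p.glob("*.safetensors")):
        return "safetensors"
    return "unknown"

def validate(model_path: str, engine: str) -> None:
    """Exit with a clear error before any engine container starts."""
    fmt = detect_format(model_path)
    if fmt not in ENGINE_FORMATS.get(engine, set()):
        raise SystemExit(
            f"Model format '{fmt}' is not supported by {engine} "
            f"(supported: {sorted(ENGINE_FORMATS.get(engine, set()))})"
        )
```

Running `validate("/models/model.gguf", "vllm")` raises before any container launch, which is the behavior the error message above describes.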
When launching a quick test from the web UI, KITT checks whether each engine is compatible with the selected agent's CPU architecture. Engines that lack Docker images or required CUDA kernels for the agent's platform are marked as incompatible and disabled in the engine dropdown.
| Engine | ARM64 (aarch64) | x86_64 |
|---|---|---|
| vLLM | Yes | Yes |
| llama.cpp | Yes | Yes |
| Ollama | Yes | Yes |
| ExLlamaV2 | No | Yes |
To bypass this filtering (for example, to test a custom-built image), check Show all engines (override compatibility filtering) in the quick test form. When the override is active, the API receives a force flag that skips both platform and model-format validation.
| Variable | Default | Description |
|---|---|---|
| `KITT_MODEL_DIR` | `~/.kitt/models` | Directory the Models tab scans for local model files |
| `DEVON_URL` | (none) | Devon server URL (proxied server-side for the Devon tab) |
| `DEVON_API_KEY` | (none) | API key injected server-side when proxying Devon requests |
| `KITT_AUTH_TOKEN` | (none) | Bearer token for web dashboard API authentication |
`KITT_MODEL_DIR`, `DEVON_URL`, and `--results-dir` can also be configured from the web UI Settings page. UI-saved values take priority over environment variables.
Agent authentication uses per-agent tokens provisioned during installation — see Agent Installation below.
| Section | Description |
|---|---|
| Getting Started | Installation, first benchmark tutorial, Docker quickstart |
| Guides | Engines, benchmarks, results, campaigns, deployment, monitoring |
| Reference | CLI reference, config schemas, REST API, environment variables |
| Concepts | Architecture, fingerprinting, results storage, engine lifecycle |
| Variable | Description | Default |
|---|---|---|
| `KITT_MODEL_DIR` | Directory to scan for local model files (Models tab) | `~/.kitt/models` |
| `DEVON_URL` | Devon server URL — proxied server-side for the Devon tab | (none) |
| `DEVON_API_KEY` | API key injected server-side when proxying Devon requests | (none) |
| `KITT_AUTH_TOKEN` | Bearer token for web dashboard API authentication | (none) |
Create `~/.kitt/engines.yaml` to override the default Docker images for any engine:

```yaml
image_overrides:
  vllm: "vllm/vllm-openai:latest"
  llama_cpp: "ghcr.io/ggml-org/llama.cpp:server-cuda"
```

User overrides take the highest priority, followed by KITT's hardware-aware image selection, then the engine's built-in default.
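The three-tier precedence can be sketched as a simple fallback chain. The function names, the hardware rules, and the built-in default shown here are illustrative stand-ins for KITT's internals; only the precedence order and the image names from the platform table are taken from the documentation.

```python
# Illustrative sketch of the image resolution order:
# user override -> hardware-aware selection -> built-in default.
BUILTIN_DEFAULTS = {"vllm": "vllm/vllm-openai:latest"}  # assumed default

def hardware_aware_image(engine, arch, compute_cap):
    """Mirror a subset of the platform table above."""
    if engine == "vllm" and compute_cap >= 10.0:   # Blackwell-class GPU
        return "nvcr.io/nvidia/vllm"
    if engine == "llama_cpp" and arch == "aarch64":
        return "kitt/llama-cpp:arm64"               # built locally
    return None

def resolve_image(engine, user_overrides, arch, compute_cap):
    return (
        user_overrides.get(engine)                       # 1. ~/.kitt/engines.yaml
        or hardware_aware_image(engine, arch, compute_cap)  # 2. platform-aware pick
        or BUILTIN_DEFAULTS.get(engine)                  # 3. engine default
    )
```

With an override in `engines.yaml`, the override wins even on Blackwell hardware; without one, the hardware-aware rule applies before the default.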
KITT automatically selects the best Docker image for the host CPU architecture and GPU compute capability. This is transparent — no user configuration is needed.
| Platform | Engine | Image Selected |
|---|---|---|
| x86_64 + Blackwell (cc >= 10.0) | vLLM | NGC nvcr.io/nvidia/vllm |
| x86_64 + Blackwell | llama.cpp | kitt/llama-cpp:spark (built locally) |
| ARM64 + Blackwell (DGX Spark, Jetson Orin) | llama.cpp | kitt/llama-cpp:arm64 (built locally) |
| Any architecture | Ollama | Default image (bundles its own llama.cpp) |
KITT-managed images (prefixed kitt/) are built locally from Dockerfiles in docker/. The first run on a new platform may take 10-20 minutes to compile. Subsequent runs use the cached image.
The `--auto-pull` flag on `kitt run` automatically pulls (or builds) the engine image if it's not available locally:

```shell
kitt run -m /models/llama-7b -e vllm --auto-pull
```

When running tests via a remote agent, `--auto-pull` is passed automatically.
Engine profiles store named configurations with build flags and runtime args. Create profiles via the web UI (Engines → New Profile) or manage them from the CLI.
```shell
kitt engines profiles list                  # list all profiles
kitt engines profiles list --engine vllm    # filter by engine
kitt engines profiles show my-profile       # show profile details
```

Profiles are selectable in the quick test form and campaign wizard. This lets you test different engine configurations (quantization settings, context sizes, GPU layer counts) without reconfiguring each time.
Check which engines have native binaries installed on the current system:

```shell
kitt engines status
```

This shows each engine's native support, binary location, and installation status.
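Native-binary detection likely amounts to a PATH lookup per engine. The sketch below is an assumption about how such a check could work — the binary names and the returned fields are illustrative, not KITT's actual status schema.

```python
# Hypothetical sketch of native engine detection via PATH lookup.
import shutil

# Assumed binary names per engine (illustrative).
NATIVE_BINARIES = {"llama_cpp": "llama-server", "ollama": "ollama"}

def native_status(engine):
    """Report whether a native binary for the engine is on PATH."""
    path = shutil.which(NATIVE_BINARIES.get(engine, engine))
    return {"engine": engine, "installed": path is not None, "binary": path}
```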
Manage remote GPU servers via SSH for direct campaign execution (separate from the agent-based workflow).
```shell
kitt remote setup user@spark.local        # register a remote host
kitt remote list                          # list configured hosts
kitt remote test spark.local              # test SSH connectivity
kitt remote engines setup vllm --host spark.local                  # pull/build engine image
kitt remote engines setup llama_cpp --host spark.local --dry-run   # dry-run
kitt remote run campaign.yaml --host spark.local --wait   # run a campaign
kitt remote status --host spark.local     # check campaign status
kitt remote logs --host spark.local       # view campaign logs
kitt remote sync --host spark.local -o ./results   # sync results locally
kitt remote remove spark.local            # remove a host
```

KITT can connect to a containerized Devon instance for model management — search, download, list, and delete models on a remote server without installing Devon locally.
Resolution order: Remote Devon (HTTP) → Local DevonBridge (Python import) → Devon CLI (subprocess)
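The resolution order above is a fallback chain. The sketch below shows the shape of such a chain; the module name `devon_bridge`, the CLI binary name, and the return values are assumptions for illustration, not KITT's actual code.

```python
# Illustrative sketch of the three-step Devon backend resolution.
import shutil

def resolve_devon(remote_url=None):
    if remote_url:
        return ("remote", remote_url)    # 1. Remote Devon over HTTP
    try:
        import devon_bridge              # 2. Local DevonBridge (hypothetical module name)
        return ("bridge", devon_bridge)
    except ImportError:
        pass
    if shutil.which("devon"):            # 3. Devon CLI as a subprocess
        return ("cli", "devon")
    raise RuntimeError("No Devon backend available")
```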
KITT proxies all Devon requests server-side at /devon-app/, injecting the Devon API key automatically. The browser never sees or needs Devon's credentials — no cross-origin issues, no re-authentication on page navigation.
Configure the Devon URL and API key via environment variables or the web UI:
```shell
export DEVON_URL="http://192.168.1.50:8000"
export DEVON_API_KEY="your-token"   # omit if Devon has no auth
kitt web
```

Or set them from Settings > Devon Integration in the web dashboard. UI-saved settings override environment variables. You can hide the Devon tab from Settings > Devon Integration > Show Devon Tab.
Add `devon_url` and `devon_api_key` to your campaign YAML:

```yaml
devon_managed: true
devon_url: "http://192.168.1.50:8000"
devon_api_key: "your-token"   # omit if Devon has no auth
```

KITT agents run on remote GPU servers and receive Docker orchestration commands from the KITT server. The agent is installed via curl from the running KITT instance, ensuring version compatibility.
```shell
curl -fL https://your-kitt-server:8080/api/v1/agent/install.sh | bash
```

This creates a virtual environment at `~/.kitt/agent-venv`, downloads the agent package from the KITT server, provisions a unique authentication token, and configures the agent. The agent version always matches the server.
```shell
~/.kitt/agent-venv/bin/kitt-agent start
```

```shell
~/.kitt/agent-venv/bin/kitt-agent update             # download & install latest from server
~/.kitt/agent-venv/bin/kitt-agent update --restart   # update and restart in one step
```

The update command downloads the latest agent package from the KITT server and reinstalls it into the agent's virtual environment. Use `--restart` to automatically stop the running agent and start the new version.
```shell
~/.kitt/agent-venv/bin/kitt-agent service install
```

This generates a systemd unit file, installs it, and starts the service. The agent will survive reboots and restart automatically on failure. Use `kitt-agent service uninstall` to remove it.
Agents automatically recover from transient server issues:
- If a heartbeat receives HTTP 404 (e.g. after server restart or database reset), the agent re-registers and syncs its canonical agent ID
- The server falls back to hostname-based lookup when an agent ID is not found, so heartbeats and results are not lost during recovery
- Engine images are auto-pulled during remote test execution, so agents don't fail on missing images
Each agent receives a unique 256-bit random token during installation:
- The install script provisions the token at download time — each `curl | bash` generates a new token
- The server stores only the SHA-256 hash of each token, never the raw value
- Compromising one agent does not affect other agents
- Tokens can be rotated via the API (`POST /api/v1/agents/<id>/rotate-token`)
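The hash-only storage scheme above can be sketched with the standard library. This is a minimal illustration of the described property (server stores SHA-256 of a 256-bit random token, never the raw value); the function names are not KITT's.

```python
# Sketch of hash-only agent token storage.
import hashlib
import hmac
import secrets

def provision_token():
    """Generate a 256-bit token; only its hash is kept server-side."""
    raw = secrets.token_hex(32)                        # sent to the agent once
    stored = hashlib.sha256(raw.encode()).hexdigest()  # the only value the server keeps
    return raw, stored

def verify(raw, stored_hash):
    """Constant-time comparison of the presented token's hash."""
    candidate = hashlib.sha256(raw.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

Because only the hash is stored, a database leak does not expose usable tokens, which is why compromising one agent does not affect the others.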
Agent configuration is stored at ~/.kitt/agent.yaml and includes the server URL, agent name, port, and token.
```shell
kitt-agent test list                    # list tests for this agent
kitt-agent test list --status running   # filter by status
kitt-agent test stop <test_id>          # cancel a running test
```

The agent is a lightweight daemon (`kitt-agent`) that:
- Registers with the KITT server and sends periodic heartbeats
- Authenticates with its unique per-agent token
- Receives commands via heartbeat dispatch (run benchmark, stop container, cleanup storage, start/stop/install engines)
- Manages native engine lifecycle — discovers binaries, starts/stops processes, reports engine status via heartbeat
- Resolves models from NFS shares, copies to local storage, runs benchmarks, and cleans up
- Runs benchmarks inside a locally-built KITT Docker container (falls back to local CLI)
- Streams container logs back via SSE
- Reports full hardware fingerprint and engine status during registration (GPU, CPU, CPU architecture, RAM, storage, CUDA, driver, environment type, compute capability)
- Handles unified memory architectures (e.g. DGX Spark GB10) where dedicated VRAM is shared with system RAM
- Self-updates from the server via `kitt-agent update`
- Does not install the full KITT Python package — benchmarks run inside a Docker container built from the KITT source
Campaigns run a matrix of models, engines, and benchmarks on an agent, producing a full result set in one operation.
Navigate to Campaigns → Create Campaign in the web dashboard. The step-by-step wizard walks through:
- Basics — campaign name and description
- Agent — select the target agent (determines engine compatibility)
- Engines — pick one or more engines, with format badges, mode selection (docker/native), and optional profile per engine
- Models — searchable multi-select checklist filtered by the selected engines' supported formats
- Settings — suite selection, Devon-managed model toggle, and post-run cleanup option
- Review — compatibility matrix showing which model/engine combinations will run
After creation, click Launch to start the campaign. The detail page shows live log streaming and per-test status updates in real time.
```shell
kitt campaign run campaign.yaml             # run from a YAML config
kitt campaign run campaign.yaml --dry-run   # preview planned runs
kitt campaign run campaign.yaml --resume    # resume a failed campaign
kitt campaign list                          # list all campaigns
kitt campaign status [CAMPAIGN_ID]          # show status (latest if no ID)
kitt campaign create --from-results ./results   # generate config from existing results
kitt campaign wizard                        # interactive TUI builder
kitt campaign schedule campaign.yaml --cron "0 2 * * *"   # schedule recurring runs
kitt campaign cron-status                   # show scheduled campaigns
```

When a campaign launches on a real agent, the server breaks the campaign config into individual quick test rows and queues them one at a time. The agent's heartbeat picks up each queued test, runs it, and reports completion. The campaign executor polls for completion between tests, publishes live progress logs, and handles cancellation and timeouts (30-minute per-test limit).
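The one-test-at-a-time executor described above can be condensed into a queue-then-poll loop. The helpers (`queue_test`, `poll_status`, `log`) are stand-ins for KITT's real internals; only the shape of the loop and the 30-minute per-test limit come from the documentation.

```python
# Condensed sketch of the campaign executor's dispatch loop.
import time

PER_TEST_TIMEOUT = 30 * 60  # 30-minute per-test limit

def run_campaign(tests, queue_test, poll_status, log, poll_interval=0):
    for test in tests:
        queue_test(test)               # the agent's heartbeat picks this up
        started = time.monotonic()
        while True:
            status = poll_status(test)
            if status in ("completed", "failed", "cancelled"):
                log(f"{test}: {status}")   # live progress log
                break
            if time.monotonic() - started > PER_TEST_TIMEOUT:
                log(f"{test}: timed out")
                break
            time.sleep(poll_interval)  # real executor waits between polls
```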
When a campaign launches on a test agent, execution is simulated with realistic delays and fake metrics — see Test Agents below.
Campaign logs are persisted to the database so they survive page refreshes and are available after the campaign completes.
Test agents are virtual agents that simulate benchmark execution without real GPU hardware. They enable end-to-end UI testing — creating campaigns, running quick tests, viewing live logs, and inspecting results — all without a real agent daemon.
Navigate to Agents → Create Test Agent in the web dashboard, or click the "Create Test Agent" button on the agents list page. Configure hardware specs (GPU model, count, CPU, architecture, RAM, environment type) to match your testing scenario. Test agents appear in the agent list with a TEST badge and are always shown as online.
When you launch a quick test or campaign on a test agent:
- The test transitions through the same status lifecycle as a real test (queued → running → completed)
- Log lines stream in real-time over SSE with realistic 0.5–1.5s delays between messages
- Fake but logically consistent benchmark metrics are generated (throughput, latency, memory, accuracy)
- Results are persisted through the normal `ResultStore` pipeline and appear in the Results page
- Test agents never go offline (the stale heartbeat check skips them)
- Storage and NFS settings are hidden on the agent detail page
- No authentication token is provisioned (port is set to 0)
- Benchmark metrics are randomly generated within realistic ranges
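A fake-metric generator of the kind described above might look like the following. The metric names and value ranges are illustrative assumptions — the only documented property is that values are random yet logically consistent (here, latency is derived from throughput rather than drawn independently).

```python
# Sketch of a test agent's fake benchmark metric generator.
import random

def fake_metrics(seed=None):
    """Random but internally consistent metrics within plausible ranges."""
    rng = random.Random(seed)
    throughput = round(rng.uniform(20.0, 120.0), 2)   # tokens/sec
    return {
        "throughput_tps": throughput,
        "latency_ms_per_token": round(1000.0 / throughput, 2),  # tied to throughput
        "memory_gb": round(rng.uniform(4.0, 40.0), 2),
        "accuracy": round(rng.uniform(0.4, 0.9), 4),
    }
```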
```shell
poetry install --with dev
poetry run pytest
poetry run ruff check src/ tests/
```

Apache 2.0