AI agent that investigates Kubernetes clusters and makes their capabilities searchable by natural language. Available as a CLI, MCP server, and REST API — with OpenTelemetry tracing throughout.
An AI agent that lets you ask questions about your Kubernetes cluster in plain English. It investigates using kubectl, searches a vector database of cluster knowledge, and explains what it finds. Available via CLI for direct terminal use, as an MCP server for integration with Claude Code, Cursor, and other MCP clients, or as a REST API for receiving live updates from a Kubernetes controller.
```
$ cluster-whisperer "Why are pods failing in the payments namespace?"

Thinking: I need to list pods in the payments namespace to see their current status...

🔧 Tool: kubectl_get
   Args: {"resource":"pods","namespace":"payments"}
   Result:
   NAME                     READY   STATUS             RESTARTS
   payments-api-7d4f9-x2k   0/1     CrashLoopBackOff   5

Thinking: The pod is in CrashLoopBackOff. Let me check the logs to see why...

🔧 Tool: kubectl_logs
   Args: {"pod":"payments-api-7d4f9-x2k","namespace":"payments"}
   Result:
   Error: Cannot find module '/app/server.js'

────────────────────────────────────────────────────────────
Answer:
The payments-api pod is crashing because it can't find the entrypoint
file '/app/server.js'. This usually means the Docker image was built
incorrectly or the working directory is misconfigured.
```

The agent can also search the cluster's knowledge base to discover what's available — not just what's broken:
```
$ cluster-whisperer "What types of databases can I provision?"

Thinking: This is a question about available capabilities. Let me search
the vector database for database-related resource types...

🔧 Tool: vector_search
   Args: {"query":"managed database provisioning","collection":"capabilities"}
   Result:
   1. PostgreSQL (acid.zalan.do/v1) — Managed PostgreSQL clusters with
      automated failover, backups, and connection pooling. Complexity: high.
   2. SQL (devopstoolkit.live/v1) — Composite resource for managed SQL
      databases across cloud providers. Complexity: medium.

────────────────────────────────────────────────────────────
Answer:
Your cluster has two database-related resource types:
- PostgreSQL (Zalando operator) for managed PostgreSQL with automated
  failover and backups
- SQL (DevOps Toolkit) as a cloud-agnostic abstraction for managed
  databases
```

The first example shows investigation — the agent runs kubectl commands and reasons about what it finds. The second shows discovery — the agent searches pre-indexed cluster knowledge using semantic similarity, finding relevant resources even when the exact words don't match.
This agent uses the ReAct pattern (Reasoning + Acting):
```
Think → Act → Observe → Think → Act → Observe → ... → Answer
```
- Reason - Agent thinks about what to do next
- Act - Agent calls a tool (kubectl or vector search)
- Observe - Agent sees the result
- Repeat until the agent has enough information to answer
Note: "ReAct" is an AI agent pattern from a 2022 research paper. It has nothing to do with the React.js frontend framework.
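The loop can be sketched in a few lines of TypeScript. `callModel` and the toy tool registry below are hypothetical stand-ins for the real LangGraph/Vercel AI SDK wiring, which streams events and runs real kubectl commands:

```typescript
// A minimal sketch of the Think → Act → Observe loop. `callModel` and the
// toy tool registry are hypothetical stand-ins for the real agent wiring.
type Step =
  | { type: "answer"; text: string }
  | { type: "tool"; name: string; args: Record<string, string> };

// Toy tool registry (the real agent exposes kubectl_* and vector_search).
const tools: Record<string, (args: Record<string, string>) => string> = {
  kubectl_get: (args) => `pods in ${args.namespace}: payments-api CrashLoopBackOff`,
};

function runReActLoop(
  callModel: (transcript: string[]) => Step, // Think: decide the next step
  question: string,
  maxSteps = 10
): string {
  const transcript = [`Question: ${question}`];
  for (let i = 0; i < maxSteps; i++) {
    const step = callModel(transcript);
    if (step.type === "answer") return step.text; // enough info: Answer
    const tool = tools[step.name];                // Act: call the tool
    const result = tool ? tool(step.args) : `unknown tool: ${step.name}`;
    transcript.push(`Observation: ${result}`);    // Observe: feed the result back
  }
  return "Stopped after reaching the step limit without an answer.";
}
```

The step cap matters in practice: without it, a confused model can loop on the same tool call forever.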
- CLI Agent - Ask questions directly from the terminal with visible reasoning
- Tool-Set Filtering - Control which tools the agent has with `--tools kubectl,vector,apply` (progressive capability)
- Agent Selection - Switch between agent frameworks with `--agent langgraph` or `--agent vercel`
- Vector Backend Switching - Choose between Chroma and Qdrant with `--vector-backend qdrant`
- Conversation Memory - Multi-turn conversations with `--thread <id>` — the agent remembers prior context
- `kubectl_apply` - Deploy resources from the platform's approved catalog (code-enforced, not prompt-level)
- MCP Server - Use kubectl tools from Claude Code, Cursor, or any MCP-compatible client
- REST API - Receive live instance updates from a Kubernetes controller, keeping the vector database in sync automatically
- Knowledge Pipeline - Pre-index cluster capabilities and running instances into a vector database for semantic search
- Vector Search - Unified search tool with semantic, keyword, and metadata filtering — the agent uses this to discover what your cluster can do
- OpenTelemetry Tracing - Full observability with traces exportable to Datadog, Jaeger, etc.
- Extended Thinking - See the agent's reasoning process as it investigates
- Env Var Support - All CLI flags have `CLUSTER_WHISPERER_*` env var equivalents for demo ergonomics
- Node.js 18+
- kubectl CLI installed and configured (for investigation and sync commands)
- `ANTHROPIC_API_KEY` environment variable (for the investigation agent and capability sync)
- `VOYAGE_API_KEY` environment variable (for vector database embedding)
- Chroma or Qdrant vector database running locally (for knowledge pipeline and vector search)
Not every command needs everything:
| Command | kubectl | Anthropic API Key | Voyage API Key | Chroma |
|---|---|---|---|---|
| `<question>` (investigate) | Yes | Yes | Optional | Optional |
| `sync` (capabilities) | Yes | Yes | Yes | Yes |
| `sync-instances` | Yes | No | Yes | Yes |
| `serve` (REST API) | Optional* | Optional* | Yes | Yes |
*Required for the /api/v1/capabilities/scan endpoint, which runs kubectl api-resources and kubectl explain for discovery, and calls the Anthropic API for inference. Without these, only instance sync is available.
```
npm install
npm run build
```

```
# Run with vals to inject ANTHROPIC_API_KEY (-i inherits PATH so kubectl is found)
vals exec -i -f .vals.yaml -- node dist/index.js "What's running in the default namespace?"

# With specific tools (progressive capability)
vals exec -i -f .vals.yaml -- node dist/index.js --tools kubectl "Why is my app broken?"
vals exec -i -f .vals.yaml -- node dist/index.js --tools kubectl,vector "What database should I use?"
vals exec -i -f .vals.yaml -- node dist/index.js --tools kubectl,vector,apply "Deploy the database"

# With Qdrant instead of Chroma
vals exec -i -f .vals.yaml -- node dist/index.js --vector-backend qdrant "What databases are available?"

# Multi-turn conversation (same thread ID resumes prior context)
vals exec -i -f .vals.yaml -- node dist/index.js --thread demo "What database should I deploy?"
vals exec -i -f .vals.yaml -- node dist/index.js --thread demo "I'm on the You Choose team"
vals exec -i -f .vals.yaml -- node dist/index.js --thread demo "Go ahead and deploy it"

# With the Vercel AI SDK agent (same tools, same output, different framework)
cluster-whisperer --agent vercel --tools kubectl "Why is my app broken?"

# With tracing to Datadog (via local agent)
OTEL_TRACING_ENABLED=true \
OTEL_EXPORTER_TYPE=otlp \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
vals exec -i -f .vals.yaml -- node dist/index.js "Find the broken pod"
```

All CLI flags have environment variable equivalents (set via the `CLUSTER_WHISPERER_*` prefix). This is useful for demos where you set env vars once after an audience vote:
```
export CLUSTER_WHISPERER_TOOLS=kubectl,vector
export CLUSTER_WHISPERER_VECTOR_BACKEND=qdrant
export CLUSTER_WHISPERER_THREAD=demo
cluster-whisperer "What database should I deploy?"
```

The agent can pre-index cluster knowledge into a vector database for faster, more comprehensive answers.
Sync resource capabilities (what resource types exist and what they can do):

```
vals exec -i -f .vals.yaml -- node dist/index.js sync
```

Sync resource instances (what's currently running in the cluster):

```
vals exec -i -f .vals.yaml -- node dist/index.js sync-instances

# Preview what would be synced without writing to the database
vals exec -i -f .vals.yaml -- node dist/index.js sync-instances --dry-run
```

Together these enable the "Semantic Bridge" pattern: capabilities tell the agent what's possible, instances tell it what exists. When a user asks "what databases are running?", the agent searches capabilities to find database-related resource types, then searches instances filtered to those types to find actual running resources.
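The two-step lookup behind the Semantic Bridge can be sketched as follows. The `search` function and the trimmed `Doc` shape are hypothetical stand-ins for the real VectorStore interface:

```typescript
// Sketch of the Semantic Bridge: capability search narrows the resource
// kinds, then instance search is filtered to those kinds. `search` and `Doc`
// are assumed shapes, not the project's actual interface.
interface Doc {
  kind: string;
  name: string;
}

function semanticBridge(
  search: (collection: string, query: string, filter?: { kinds?: string[] }) => Doc[],
  question: string
): Doc[] {
  // Step 1: capabilities tell us which resource *types* relate to the question.
  const capabilities = search("capabilities", question);
  const kinds = [...new Set(capabilities.map((c) => c.kind))];
  // Step 2: instances filtered to those types tell us what is actually running.
  return search("instances", question, { kinds });
}
```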
See docs/capability-inference-pipeline.md and docs/resource-instance-sync.md for details.
Add to your .mcp.json (in project root or ~/.claude/):
```json
{
  "mcpServers": {
    "cluster-whisperer": {
      "command": "node",
      "args": ["/path/to/cluster-whisperer/dist/mcp-server.js"],
      "env": {
        "OTEL_TRACING_ENABLED": "true",
        "OTEL_EXPORTER_TYPE": "otlp",
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318"
      }
    }
  }
}
```

Note: Use an absolute path in `args`. MCP clients spawn the server as a subprocess, and relative paths resolve from the client's working directory.
See docs/agent/mcp-server.md for details on how MCP works.
Start the HTTP server to receive instance sync payloads from a Kubernetes controller:
```
vals exec -i -f .vals.yaml -- node dist/index.js serve

# Custom port
vals exec -i -f .vals.yaml -- node dist/index.js serve --port 8080
```

The server exposes:
| Endpoint | Method | Description |
|---|---|---|
| `/healthz` | GET | Liveness probe — always returns 200 if the process is running |
| `/readyz` | GET | Readiness probe — returns 200 only when Chroma is reachable |
| `/api/v1/instances/sync` | POST | Receives batched instance upserts and deletes |
| `/api/v1/capabilities/scan` | POST | Triggers capability inference for specific CRDs (optional — requires `ANTHROPIC_API_KEY`) |
The instance sync endpoint accepts a JSON payload with two arrays:
```json
{
  "upserts": [
    {
      "id": "default/apps/v1/Deployment/nginx",
      "namespace": "default",
      "name": "nginx",
      "kind": "Deployment",
      "apiVersion": "apps/v1",
      "apiGroup": "apps",
      "labels": {},
      "annotations": {},
      "createdAt": "2025-01-15T10:30:00Z"
    }
  ],
  "deletes": ["default/apps/v1/Deployment/old-nginx"]
}
```

The capability scan endpoint accepts a list of fully qualified CRD resource names:
```json
{
  "upserts": ["certificates.cert-manager.io", "issuers.cert-manager.io"],
  "deletes": ["old-resource.example.io"]
}
```

Unlike instance sync (which returns 200 synchronously), the capability scan returns 202 Accepted immediately and processes in the background — LLM inference takes ~4-6 seconds per resource. See docs/capability-inference-pipeline.md for details.
Both endpoints are designed to work with the k8s-vectordb-sync controller, which watches Kubernetes clusters for resource and CRD changes and pushes them here. Any client can POST to either endpoint — the contract is the JSON schema above.
The server handles graceful shutdown on SIGTERM, making it Kubernetes-deployment friendly.
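The probe semantics and SIGTERM handling can be sketched with Node's built-in `http` module (the real server uses Hono); `checkChroma` is a hypothetical readiness check against the vector database:

```typescript
import http from "node:http";

// Probe semantics sketch: /healthz always answers 200 while the process is
// alive; /readyz answers 200 only when the dependency check passes.
function createProbeServer(checkChroma: () => Promise<boolean>): http.Server {
  const server = http.createServer(async (req, res) => {
    if (req.url === "/healthz") {
      res.writeHead(200).end("ok"); // liveness: the process is running
    } else if (req.url === "/readyz") {
      const ready = await checkChroma(); // readiness: Chroma is reachable
      res.writeHead(ready ? 200 : 503).end(ready ? "ready" : "not ready");
    } else {
      res.writeHead(404).end();
    }
  });
  // Graceful shutdown: stop accepting connections when Kubernetes sends SIGTERM.
  process.on("SIGTERM", () => server.close());
  return server;
}
```

Splitting liveness from readiness lets Kubernetes keep the pod alive but withhold traffic while the vector database is unreachable.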
cluster-whisperer exposes kubectl and vector search tools via three interfaces:
```
User Question → ReAct Agent → [kubectl + vector search tools] → Cluster / Vector DB → Answer
                     ↑                         |
                     └─────────────────────────┘
                       (agent sees result,
                        decides next action)
```
The CLI agent has its own reasoning loop - it decides which tools to call and interprets the results.
```
User Question → [Claude Code / Cursor] → MCP → investigate tool → ReAct Agent → Cluster / Vector DB
                                                                       ↑              |
                                                                       └──────────────┘
                                                                 (agent reasons internally)
```
The MCP server exposes a single investigate tool that wraps the same ReAct agent used by the CLI. This gives MCP clients complete investigations with full tracing - one call captures the entire reasoning chain.
```
k8s-vectordb-sync controller
        |
        ├── POST /api/v1/instances/sync    (resource changes)
        ├── POST /api/v1/capabilities/scan (CRD changes)
        v
cluster-whisperer serve (Hono server) → Vector DB
        ^
        |
Kubernetes cluster ──(watches)──┘
```
The REST API receives pushed data from the k8s-vectordb-sync controller. Instance sync keeps the vector database up-to-date as resources change. Capability scan triggers LLM inference when new CRDs are installed, so the agent discovers new resource types automatically.
CLI Agent: Uses these tools internally during investigation. Which tools are available depends on the --tools flag:
| Tool Group | Tools | Purpose |
|---|---|---|
| `kubectl` | `kubectl_get`, `kubectl_describe`, `kubectl_logs` | Cluster investigation |
| `vector` | `vector_search` | Semantic discovery of cluster capabilities |
| `apply` | `kubectl_apply` | Deploy resources from the approved catalog |

Default: `--tools kubectl,vector` (backwards compatible).
- `kubectl_get` - List resources and their status
- `kubectl_describe` - Get detailed resource information
- `kubectl_logs` - Check container logs
- `vector_search` - Search the vector database with three composable dimensions:
  - Semantic search (`query`) — natural language similarity via embeddings (e.g., "managed database" finds SQL CRDs)
  - Keyword search (`keyword`) — exact substring match, no embedding call (e.g., "backup" finds docs mentioning backup)
  - Metadata filters (`kind`, `apiGroup`, `namespace`, `complexity`) — exact match on structured fields
- `kubectl_apply` - Deploy a Kubernetes resource by applying a YAML manifest. Validates the resource type against the platform's approved catalog before applying — enforcement is in code, not in the prompt. If the resource type isn't in the capabilities collection, the apply is rejected.

The agent uses kubectl tools for investigation ("why is this pod failing?"), vector search for discovery ("what databases can I provision?"), and `kubectl_apply` for deployment ("deploy the database for my team").
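The code-level catalog check can be sketched as a guard that runs before any apply. `lookupCapability` is a hypothetical stand-in for the real query against the capabilities collection:

```typescript
// Sketch of code-level catalog enforcement: before applying a manifest,
// its type must exist in the capabilities collection.
interface Manifest {
  apiVersion: string;
  kind: string;
}

function validateAgainstCatalog(
  manifest: Manifest,
  lookupCapability: (kind: string, apiVersion: string) => boolean
): void {
  if (!lookupCapability(manifest.kind, manifest.apiVersion)) {
    // The rejection happens in code, not in the prompt, so the model
    // cannot talk its way past the check.
    throw new Error(
      `Resource type ${manifest.kind} (${manifest.apiVersion}) is not in the approved catalog`
    );
  }
}
```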
MCP Server: Exposes a single high-level tool:
- `investigate` - Ask a question, get a complete answer (wraps the ReAct agent with all tools above)
OpenTelemetry tracing provides visibility into agent operations. OTel SDK packages are optional peer dependencies — tracing works when installed but everything runs fine without them. See docs/observability/opentelemetry.md for installation and configuration details.
```
cluster-whisperer.investigate (root span)
├── kubectl_get.tool
│   └── kubectl get pods -n default
├── kubectl_describe.tool
│   └── kubectl describe pod broken-pod
├── kubectl_logs.tool
│   └── kubectl logs broken-pod
└── vector_search.tool
    └── query: "managed database provisioning"
```
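The graceful degradation described above (tracing works when the OTel packages are installed, and is skipped when they are not) follows a simple pattern. This is an illustrative sketch, not a copy of src/tracing/optional-deps.ts:

```typescript
// Sketch of graceful optional-dependency loading: a failed import means the
// package is absent, and tracing silently becomes a no-op.
async function tryImport<T>(moduleName: string): Promise<T | null> {
  try {
    return (await import(moduleName)) as T;
  } catch {
    return null; // package not installed
  }
}

async function initTracing(): Promise<boolean> {
  const sdk = await tryImport("@opentelemetry/sdk-node");
  if (!sdk) {
    console.log("OTel SDK not installed; continuing without tracing");
    return false;
  }
  // ...configure exporters here when the SDK is present...
  return true;
}
```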
Environment Variables:
| Variable | Default | Description |
|---|---|---|
| `CLUSTER_WHISPERER_TOOLS` | `kubectl,vector` | Comma-separated tool groups: `kubectl`, `vector`, `apply` |
| `CLUSTER_WHISPERER_AGENT` | `langgraph` | Agent framework: `langgraph` or `vercel` |
| `CLUSTER_WHISPERER_VECTOR_BACKEND` | `chroma` | Vector database: `chroma` or `qdrant` |
| `CLUSTER_WHISPERER_THREAD` | - | Conversation thread ID for multi-turn memory |
| `CLUSTER_WHISPERER_KUBECONFIG` | - | Kubeconfig path passed to kubectl (agent-only cluster access) |
| `CLUSTER_WHISPERER_CHROMA_URL` | `http://localhost:8000` | Chroma vector database URL |
| `CLUSTER_WHISPERER_QDRANT_URL` | `http://localhost:6333` | Qdrant vector database URL |
| `CLUSTER_WHISPERER_QUIET` | `false` | Suppress OTel init messages and Chroma warnings |
| `OTEL_TRACING_ENABLED` | `false` | Enable tracing |
| `OTEL_EXPORTER_TYPE` | `console` | `console` or `otlp` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | - | OTLP collector URL (e.g., `http://localhost:4318`) |
| `OTEL_CAPTURE_AI_PAYLOADS` | `false` | Capture tool inputs/outputs in traces |
| `VOYAGE_API_KEY` | - | Voyage AI API key (required by `sync`, `sync-instances`, and `serve`) |
Schema Validation:
Custom span attributes (cluster_whisperer.*, traceloop.*, gen_ai.*) are formally defined in a Weaver registry at telemetry/registry/attributes.yaml. This is the single source of truth for attribute names, types, and descriptions. Weaver validates the schema and resolves references to OTel semantic conventions:
```
npm run telemetry:check    # Validate registry structure and references
npm run telemetry:resolve  # Resolve all references to flat JSON
```

See docs/observability/tracing-conventions.md for tracing architecture and design rationale, and docs/observability/telemetry-generated/attributes/cluster-whisperer.md for the auto-generated attribute reference.
```
src/
├── index.ts                      # CLI entry point (agent + sync + serve commands)
├── mcp-server.ts                 # MCP server entry point
├── agent/
│   ├── agent-events.ts           # AgentEvent union type (shared between agents)
│   ├── agent-interface.ts        # InvestigationAgent interface
│   ├── investigator.ts           # ReAct agent setup (LangGraph)
│   ├── langgraph-adapter.ts      # Wraps LangGraph agent as InvestigationAgent
│   ├── file-checkpointer.ts      # Persistent conversation memory for LangGraph --thread
│   ├── vercel-agent.ts           # Vercel AI SDK agent implementation
│   └── vercel-thread-store.ts    # Conversation memory for Vercel agent --thread
├── api/                          # REST API for controller-pushed sync
│   ├── server.ts                 # Hono HTTP server with health probes
│   ├── routes/
│   │   ├── instances.ts          # POST /api/v1/instances/sync endpoint
│   │   └── capabilities.ts       # POST /api/v1/capabilities/scan endpoint
│   └── schemas/
│       ├── sync-payload.ts       # Zod validation for instance sync payloads
│       └── scan-payload.ts       # Zod validation for capability scan payloads
├── pipeline/                     # Knowledge sync pipelines
│   ├── discovery.ts              # Resource type discovery (kubectl api-resources)
│   ├── inference.ts              # Capability inference (kubectl explain → LLM)
│   ├── storage.ts                # Capability document storage
│   ├── runner.ts                 # Capability sync orchestrator
│   ├── instance-discovery.ts     # Resource instance discovery (kubectl get)
│   ├── instance-storage.ts       # Instance document storage
│   └── instance-runner.ts        # Instance sync orchestrator
├── vectorstore/                  # Vector database abstraction
│   ├── types.ts                  # VectorStore interface
│   ├── chroma-backend.ts         # Chroma implementation
│   ├── qdrant-backend.ts         # Qdrant implementation
│   ├── multi-backend.ts          # Writes to multiple backends in parallel (for sync)
│   └── embeddings.ts             # Voyage AI embedding provider
├── tools/
│   ├── core/                     # Shared tool logic (schemas, execution)
│   │   ├── kubectl-get.ts
│   │   ├── kubectl-describe.ts
│   │   ├── kubectl-logs.ts
│   │   ├── kubectl-apply.ts      # Deploy with catalog validation
│   │   ├── vector-search.ts      # Unified semantic/keyword/metadata search
│   │   └── format-results.ts     # Search result formatting
│   ├── tool-groups.ts            # Tool group definitions (kubectl, vector, apply)
│   ├── langchain/                # LangGraph tool wrappers
│   ├── vercel/                   # Vercel AI SDK tool wrappers
│   └── mcp/                      # MCP server wrappers
├── tracing/                      # OpenTelemetry instrumentation
│   ├── index.ts                  # OTel initialization, exporter setup
│   ├── context-bridge.ts         # AsyncLocalStorage workaround for LangGraph
│   ├── tool-tracing.ts           # Tool span wrapper
│   ├── tool-definitions-processor.ts  # Adds tool definitions to LLM spans
│   ├── vercel-span-processor.ts  # Enriches Vercel SDK spans for Datadog LLM Obs
│   └── optional-deps.ts          # Graceful loading of optional OTel packages
└── utils/
    └── kubectl.ts                # Shared kubectl execution helper

prompts/
├── investigator.md               # Agent system prompt (investigation behavior)
└── capability-inference.md       # Capability inference prompt (sync pipeline)

telemetry/
└── registry/                     # OpenTelemetry Weaver schema
    ├── attributes.yaml           # Custom attribute definitions
    └── registry_manifest.yaml    # Schema metadata + OTel semconv dependency

scripts/
└── seed-test-data.ts             # Load sample data into Chroma for testing

docs/
├── agentic-loop.md               # How the ReAct agent works
├── capability-inference-pipeline.md  # How capability sync works
├── kubectl-tools.md              # How kubectl tools work
├── langgraph-vs-langchain.md     # LangChain vs LangGraph explained
├── mcp-server.md                 # MCP server architecture
├── opentelemetry.md              # OpenTelemetry implementation guide
├── resource-instance-sync.md     # How instance sync works
├── tracing-conventions.md        # Tracing architecture and design rationale
└── vector-database.md            # Vector database architecture

demo/
├── app/                          # Demo app — intentionally broken prop for KubeCon talk
│   ├── src/                      # Hono server with DATABASE_URL connection logic
│   ├── k8s/                      # Deployment + Service manifests
│   └── Dockerfile                # Multi-stage build
└── cluster/                      # Demo cluster provisioning (GKE)
    ├── setup.sh                  # Create cluster with all demo components
    ├── teardown.sh               # Destroy clusters and clean up kubeconfig
    ├── reset-demo.sh             # Reset between demo runs (cleanup ManagedService, restart app, clear threads)
    ├── kind-config.yaml          # Kind cluster configuration (experimental)
    ├── helm-values/              # Helm values for Crossplane, Chroma, Qdrant, Jaeger, OTel Collector
    └── manifests/                # Crossplane providers, XRDs, Compositions, decoy resources
```
The demo/app/ directory contains a minimal Hono web server that requires a PostgreSQL database. It exists as a prop for the KubeCon "Choose Your Own Adventure" demo — when deployed to Kubernetes without a database, it crashes immediately and enters CrashLoopBackOff. The cluster-whisperer agent then investigates why the app is broken, discovers the missing database, and deploys one.
The app is intentionally simple. It connects to DATABASE_URL on startup: if the connection succeeds, it serves HTTP traffic; if it fails (or the variable is missing), it crashes with a clear, single-line error message designed for the agent to parse from kubectl logs.
```
demo/app/
├── src/
│   ├── index.ts         # Entry point — reads DATABASE_URL, attempts connection, crashes or starts server
│   ├── server.ts        # Hono app factory with GET / (DB status) and GET /healthz (liveness probe)
│   └── server.test.ts   # Unit tests for routes, startup behavior, and error message format
├── k8s/
│   ├── deployment.yaml  # Deployment with DATABASE_URL pointing to a non-existent service
│   └── service.yaml     # ClusterIP service exposing port 80 → 3000
├── Dockerfile           # Multi-stage build (node:22-alpine)
├── package.json
└── tsconfig.json
```
Build the container image:
```
cd demo/app
docker build -t demo-app:latest .
```

Without DATABASE_URL, the app crashes immediately:
```
$ docker run --rm demo-app:latest
[demo-app] Starting server...
[demo-app] FATAL: DATABASE_URL environment variable is required
[demo-app] Exiting with code 1
```

With an unreachable DATABASE_URL, it crashes with a connection error:
```
$ docker run --rm -e DATABASE_URL=postgres://db-service:5432/myapp demo-app:latest
[demo-app] Starting server...
[demo-app] Connecting to database at postgres://db-service:5432/myapp...
[demo-app] FATAL: Cannot connect to database at postgres://db-service:5432/myapp - getaddrinfo ENOTFOUND db-service
[demo-app] Exiting with code 1
```

Both crash modes are intentional — this is the behavior the agent investigates during the demo.
Load the image into a Kind cluster and apply the manifests:
```
kind load docker-image demo-app:latest --name <cluster-name>
kubectl apply -f demo/app/k8s/
```

The Deployment sets DATABASE_URL to `postgres://db-service:5432/myapp` — a service that doesn't exist in the cluster. The app crashes on startup and Kubernetes restarts it, producing CrashLoopBackOff within seconds:
```
$ kubectl get pods -l app=demo-app
NAME                        READY   STATUS             RESTARTS      AGE
demo-app-748c9d8c54-8mngm   0/1     CrashLoopBackOff   41 (4m ago)   3h9m
```

The logs show the same connection error from the Build and Run section:
```
$ kubectl logs --previous -l app=demo-app
[demo-app] Starting server...
[demo-app] Connecting to database at postgres://db-service:5432/myapp...
[demo-app] FATAL: Cannot connect to database at postgres://db-service:5432/myapp - getaddrinfo ENOTFOUND db-service
[demo-app] Exiting with code 1
```

This is what the cluster-whisperer agent sees when it investigates. The error messages are designed to be agent-friendly — single-line, containing the word "database" and the connection target, so the agent can diagnose the missing database from kubectl logs output alone.
The demo/cluster/ directory contains scripts to provision a complete demo environment on GKE. A single command creates a Kubernetes cluster with ~360 Crossplane CRDs, two vector databases, two observability backends, the demo app in CrashLoopBackOff, and a live cluster-whisperer instance — everything needed for the KubeCon "Choose Your Own Adventure" demo.
- Google Cloud SDK (`gcloud`) with `gke-gcloud-auth-plugin`
- Helm 3.x
- kubectl
- Docker (for building container images)
- Node.js 18+ (for the capability inference pipeline)
- API keys in a `.env` file at the repo root (see `.env.example`):
  - `ANTHROPIC_API_KEY` — for capability inference
  - `VOYAGE_API_KEY` — for vector embeddings
  - `DD_API_KEY` — for Datadog trace export (optional)
```
./demo/cluster/setup.sh gcp
```

The script auto-detects the nearest GCP zone (override with `GCP_ZONE=europe-west1-b`). It creates a 3-node GKE cluster, installs all components, runs the capability inference pipeline, and prints a summary when complete:
```
[ok] ==============================================
[ok]  Demo Cluster Ready (gcp mode)
[ok] ==============================================
==>  Mode:            gcp
==>  Cluster:         cluster-whisperer-20260312-155916
==>  KUBECONFIG:      /Users/whitney.lee/.kube/config-cluster-whisperer
==>  CRDs:            1041
==>  Demo app:        CrashLoopBackOff
==>  Chroma:          Running
==>  Qdrant:          Running
==>  Jaeger:          Running
==>  OTel Collector:  Running
==>  Ingress NGINX:   Running
==>  CW serve:        Running
==>  vectordb-sync:   Running
==>  Ingress URLs:
==>    cluster-whisperer: http://cluster-whisperer.34.123.173.28.nip.io
==>    Jaeger UI:         http://jaeger.34.123.173.28.nip.io
==>  To use this cluster:
==>    export KUBECONFIG=/Users/whitney.lee/.kube/config-cluster-whisperer
```
The setup script writes credentials to ~/.kube/config-cluster-whisperer. Your default kubeconfig is also modified during setup; teardown removes those entries.
Setup takes approximately 45-55 minutes on a cold start (GKE creation ~8 min, CRD registration ~23 min, capability inference ~12 min).
| Component | Namespace | Purpose |
|---|---|---|
| GKE cluster (3x n2-standard-4) | — | Kubernetes environment |
| Crossplane + 16 sub-providers | `crossplane-system` | ~360 CRDs for discovery |
| 20 ManagedService XRDs + Compositions | `crossplane-system` | 1 real + 19 decoys — "needle in the haystack" |
| Chroma | `chroma` | Vector database option A (capabilities + instances) |
| Qdrant | `qdrant` | Vector database option B (capabilities + instances) |
| Jaeger v2 | `jaeger` | Trace UI backend |
| OTel Collector | `otel-collector` | OTLP to Jaeger + Datadog fan-out |
| Demo app | `default` | Intentionally broken (CrashLoopBackOff) |
| cluster-whisperer serve | `cluster-whisperer` | REST API for live sync |
| k8s-vectordb-sync | `k8s-vectordb-sync` | Controller pushing resource changes |
| NGINX Ingress | `ingress-nginx` | External access via nip.io DNS |
The setup script also runs the capability inference pipeline, which analyzes all ~360 CRDs via LLM and stores natural-language descriptions in Chroma. This is what enables semantic search — when the agent searches for "PostgreSQL database for my application", it finds the platform Composition among ~360 CRDs because the pipeline generated a description like "Platform-approved PostgreSQL database for application teams."
```
./demo/cluster/teardown.sh
```

Discovers and deletes all cluster-whisperer clusters (both Kind and GKE), removes their kubeconfig entries, and cleans up the dedicated kubeconfig file if empty. GKE clusters incur billing until fully deleted.
MIT