cluster-whisperer

AI agent that investigates Kubernetes clusters and makes their capabilities searchable by natural language. Available as a CLI, MCP server, and REST API — with OpenTelemetry tracing throughout.

What is this?

An AI agent that lets you ask questions about your Kubernetes cluster in plain English. It investigates using kubectl, searches a vector database of cluster knowledge, and explains what it finds. Available via CLI for direct terminal use, as an MCP server for integration with Claude Code, Cursor, and other MCP clients, or as a REST API for receiving live updates from a Kubernetes controller.

$ cluster-whisperer "Why are pods failing in the payments namespace?"

Thinking: I need to list pods in the payments namespace to see their current status...

🔧 Tool: kubectl_get
   Args: {"resource":"pods","namespace":"payments"}
   Result:
   NAME                      READY   STATUS             RESTARTS
   payments-api-7d4f9-x2k    0/1     CrashLoopBackOff   5

Thinking: The pod is in CrashLoopBackOff. Let me check the logs to see why...

🔧 Tool: kubectl_logs
   Args: {"pod":"payments-api-7d4f9-x2k","namespace":"payments"}
   Result:
Error: Cannot find module '/app/server.js'

────────────────────────────────────────────────────────────
Answer:
The payments-api pod is crashing because it can't find the entrypoint
file '/app/server.js'. This usually means the Docker image was built
incorrectly or the working directory is misconfigured.

The agent can also search the cluster's knowledge base to discover what's available — not just what's broken:

$ cluster-whisperer "What types of databases can I provision?"

Thinking: This is a question about available capabilities. Let me search
the vector database for database-related resource types...

🔧 Tool: vector_search
   Args: {"query":"managed database provisioning","collection":"capabilities"}
   Result:
   1. PostgreSQL (acid.zalan.do/v1) — Managed PostgreSQL clusters with
      automated failover, backups, and connection pooling. Complexity: high.
   2. SQL (devopstoolkit.live/v1) — Composite resource for managed SQL
      databases across cloud providers. Complexity: medium.

────────────────────────────────────────────────────────────
Answer:
Your cluster has two database-related resource types:
- PostgreSQL (Zalando operator) for managed PostgreSQL with automated
  failover and backups
- SQL (DevOps Toolkit) as a cloud-agnostic abstraction for managed
  databases

The first example shows investigation — the agent runs kubectl commands and reasons about what it finds. The second shows discovery — the agent searches pre-indexed cluster knowledge using semantic similarity, finding relevant resources even when the exact words don't match.

How it works: The ReAct Pattern

This agent uses the ReAct pattern (Reasoning + Acting):

Think → Act → Observe → Think → Act → Observe → ... → Answer
  1. Reason - Agent thinks about what to do next
  2. Act - Agent calls a tool (kubectl or vector search)
  3. Observe - Agent sees the result
  4. Repeat until the agent has enough information to answer

Note: "ReAct" is an AI agent pattern from a 2022 research paper. It has nothing to do with the React.js frontend framework.
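The four steps above can be sketched as a plain loop. This is a minimal illustration with hypothetical names (`reactLoop`, `Model`, `Tools`), not the project's actual LangGraph implementation in src/agent/investigator.ts:

```typescript
// Illustrative ReAct loop: reason, act, observe, repeat until answered.
type ToolCall = { tool: string; args: Record<string, string> };

interface Model {
  // One reasoning step: either request a tool call or produce the final answer.
  next(history: string[]): { thought: string; action?: ToolCall; answer?: string };
}

type Tools = Record<string, (args: Record<string, string>) => string>;

function reactLoop(question: string, model: Model, tools: Tools, maxSteps = 10): string {
  const history = [`Question: ${question}`];
  for (let i = 0; i < maxSteps; i++) {
    const step = model.next(history);                        // 1. Reason
    history.push(`Thought: ${step.thought}`);
    if (step.answer !== undefined) return step.answer;       // Enough info: answer
    if (!step.action) continue;
    const observation = tools[step.action.tool](step.action.args); // 2. Act
    history.push(`Observation: ${observation}`);             // 3. Observe
  }
  return "Step budget exhausted without an answer.";
}
```

The step budget matters in practice: without a cap, a model that never emits an answer would loop forever.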

Features

  • CLI Agent - Ask questions directly from the terminal with visible reasoning
  • Tool-Set Filtering - Control which tools the agent has with --tools kubectl,vector,apply (progressive capability)
  • Agent Selection - Switch between agent frameworks with --agent langgraph or --agent vercel
  • Vector Backend Switching - Choose between Chroma and Qdrant with --vector-backend qdrant
  • Conversation Memory - Multi-turn conversations with --thread <id> — the agent remembers prior context
  • kubectl_apply - Deploy resources from the platform's approved catalog (code-enforced, not prompt-level)
  • MCP Server - Use kubectl tools from Claude Code, Cursor, or any MCP-compatible client
  • REST API - Receive live instance updates from a Kubernetes controller, keeping the vector database in sync automatically
  • Knowledge Pipeline - Pre-index cluster capabilities and running instances into a vector database for semantic search
  • Vector Search - Unified search tool with semantic, keyword, and metadata filtering — the agent uses this to discover what your cluster can do
  • OpenTelemetry Tracing - Full observability with traces exportable to Datadog, Jaeger, etc.
  • Extended Thinking - See the agent's reasoning process as it investigates
  • Env Var Support - All CLI flags have CLUSTER_WHISPERER_* env var equivalents for demo ergonomics

Prerequisites

  • Node.js 18+
  • kubectl CLI installed and configured (for investigation and sync commands)
  • ANTHROPIC_API_KEY environment variable (for the investigation agent and capability sync)
  • VOYAGE_API_KEY environment variable (for vector database embedding)
  • Chroma or Qdrant vector database running locally (for knowledge pipeline and vector search)

Not every command needs everything:

| Command | kubectl | Anthropic API Key | Voyage API Key | Chroma |
|---|---|---|---|---|
| \<question\> (investigate) | Yes | Yes | Optional | Optional |
| sync (capabilities) | Yes | Yes | Yes | Yes |
| sync-instances | Yes | No | Yes | Yes |
| serve (REST API) | Optional* | Optional* | Yes | Yes |

*Required for the /api/v1/capabilities/scan endpoint, which runs kubectl api-resources and kubectl explain for discovery, and calls the Anthropic API for inference. Without these, only instance sync is available.

Setup

npm install
npm run build

Usage

CLI Agent

# Run with vals to inject ANTHROPIC_API_KEY (-i inherits PATH so kubectl is found)
vals exec -i -f .vals.yaml -- node dist/index.js "What's running in the default namespace?"

# With specific tools (progressive capability)
vals exec -i -f .vals.yaml -- node dist/index.js --tools kubectl "Why is my app broken?"
vals exec -i -f .vals.yaml -- node dist/index.js --tools kubectl,vector "What database should I use?"
vals exec -i -f .vals.yaml -- node dist/index.js --tools kubectl,vector,apply "Deploy the database"

# With Qdrant instead of Chroma
vals exec -i -f .vals.yaml -- node dist/index.js --vector-backend qdrant "What databases are available?"

# Multi-turn conversation (same thread ID resumes prior context)
vals exec -i -f .vals.yaml -- node dist/index.js --thread demo "What database should I deploy?"
vals exec -i -f .vals.yaml -- node dist/index.js --thread demo "I'm on the You Choose team"
vals exec -i -f .vals.yaml -- node dist/index.js --thread demo "Go ahead and deploy it"

# With the Vercel AI SDK agent (same tools, same output, different framework)
cluster-whisperer --agent vercel --tools kubectl "Why is my app broken?"

# With tracing to Datadog (via local agent)
OTEL_TRACING_ENABLED=true \
OTEL_EXPORTER_TYPE=otlp \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
vals exec -i -f .vals.yaml -- node dist/index.js "Find the broken pod"

All CLI flags have environment variable equivalents (set via CLUSTER_WHISPERER_* prefix). This is useful for demos where you set env vars once after an audience vote:

export CLUSTER_WHISPERER_TOOLS=kubectl,vector
export CLUSTER_WHISPERER_VECTOR_BACKEND=qdrant
export CLUSTER_WHISPERER_THREAD=demo
cluster-whisperer "What database should I deploy?"
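The usual precedence for this kind of setup is explicit flag, then env var, then built-in default. A minimal sketch of that fallback chain, with `resolveOption` as a hypothetical helper rather than the actual parser in src/index.ts:

```typescript
// Resolve a CLI option: flag beats env var beats default.
function resolveOption(
  flagValue: string | undefined,
  envName: string,
  fallback: string,
  env: Record<string, string | undefined> = process.env,
): string {
  return flagValue ?? env[envName] ?? fallback;
}
```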

Knowledge Pipeline

The agent can pre-index cluster knowledge into a vector database for faster, more comprehensive answers.

Sync resource capabilities (what resource types exist and what they can do):

vals exec -i -f .vals.yaml -- node dist/index.js sync

Sync resource instances (what's currently running in the cluster):

vals exec -i -f .vals.yaml -- node dist/index.js sync-instances

# Preview what would be synced without writing to the database
vals exec -i -f .vals.yaml -- node dist/index.js sync-instances --dry-run

Together these enable the "Semantic Bridge" pattern: capabilities tell the agent what's possible, instances tell it what exists. When a user asks "what databases are running?", the agent searches capabilities to find database-related resource types, then searches instances filtered to those types to find actual running resources.
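A toy version of that two-phase lookup, with plain substring matching standing in for embedding similarity (all names here are illustrative, not the real search code):

```typescript
// Semantic Bridge sketch: capabilities answer "what kinds relate to the query?",
// instances answer "which of those kinds are actually running?".
type Capability = { kind: string; description: string };
type Instance = { kind: string; name: string; namespace: string };

function semanticBridge(query: string, capabilities: Capability[], instances: Instance[]): Instance[] {
  // Phase 1: find resource types whose descriptions relate to the query.
  const kinds = new Set(
    capabilities
      .filter(c => c.description.toLowerCase().includes(query.toLowerCase()))
      .map(c => c.kind),
  );
  // Phase 2: filter running instances to those types.
  return instances.filter(i => kinds.has(i.kind));
}
```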

See docs/capability-inference-pipeline.md and docs/resource-instance-sync.md for details.

MCP Server (Claude Code, Cursor, etc.)

Add to your .mcp.json (in project root or ~/.claude/):

{
  "mcpServers": {
    "cluster-whisperer": {
      "command": "node",
      "args": ["/path/to/cluster-whisperer/dist/mcp-server.js"],
      "env": {
        "OTEL_TRACING_ENABLED": "true",
        "OTEL_EXPORTER_TYPE": "otlp",
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318"
      }
    }
  }
}

Note: Use an absolute path in args. MCP clients spawn the server as a subprocess, and relative paths resolve from the client's working directory.

See docs/agent/mcp-server.md for details on how MCP works.

REST API

Start the HTTP server to receive instance sync payloads from a Kubernetes controller:

vals exec -i -f .vals.yaml -- node dist/index.js serve

# Custom port
vals exec -i -f .vals.yaml -- node dist/index.js serve --port 8080

The server exposes:

| Endpoint | Method | Description |
|---|---|---|
| /healthz | GET | Liveness probe — always returns 200 if the process is running |
| /readyz | GET | Readiness probe — returns 200 only when Chroma is reachable |
| /api/v1/instances/sync | POST | Receives batched instance upserts and deletes |
| /api/v1/capabilities/scan | POST | Triggers capability inference for specific CRDs (optional — requires ANTHROPIC_API_KEY) |

The instance sync endpoint accepts a JSON payload with two arrays:

{
  "upserts": [
    {
      "id": "default/apps/v1/Deployment/nginx",
      "namespace": "default",
      "name": "nginx",
      "kind": "Deployment",
      "apiVersion": "apps/v1",
      "apiGroup": "apps",
      "labels": {},
      "annotations": {},
      "createdAt": "2025-01-15T10:30:00Z"
    }
  ],
  "deletes": ["default/apps/v1/Deployment/old-nginx"]
}
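As a sketch of what the server checks before accepting a request, here is a hand-rolled type guard for that shape. The real validation uses Zod (src/api/schemas/sync-payload.ts) and is stricter; this deliberately checks only the essentials:

```typescript
// Minimal structural check for the instance-sync payload.
type InstanceUpsert = {
  id: string; namespace: string; name: string; kind: string;
  apiVersion: string; apiGroup: string;
  labels: Record<string, string>; annotations: Record<string, string>;
  createdAt: string;
};
type SyncPayload = { upserts: InstanceUpsert[]; deletes: string[] };

function isSyncPayload(body: unknown): body is SyncPayload {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return (
    Array.isArray(b.upserts) &&
    Array.isArray(b.deletes) &&
    b.upserts.every(u => typeof u === "object" && u !== null && typeof (u as any).id === "string") &&
    b.deletes.every(d => typeof d === "string")
  );
}
```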

The capability scan endpoint accepts a list of fully qualified CRD resource names:

{
  "upserts": ["certificates.cert-manager.io", "issuers.cert-manager.io"],
  "deletes": ["old-resource.example.io"]
}

Unlike instance sync (which returns 200 synchronously), the capability scan returns 202 Accepted immediately and processes in the background — LLM inference takes ~4-6 seconds per resource. See docs/capability-inference-pipeline.md for details.

Both endpoints are designed to work with the k8s-vectordb-sync controller, which watches Kubernetes clusters for resource and CRD changes and pushes them here. Any client can POST to either endpoint — the contract is the JSON schema above.

The server handles graceful shutdown on SIGTERM, making it Kubernetes-deployment friendly.

Architecture

cluster-whisperer exposes kubectl and vector search tools via three interfaces:

CLI Agent

User Question → ReAct Agent → [kubectl + vector search tools] → Cluster / Vector DB → Answer
                    ↑                       |
                    └───────────────────────┘
                   (agent sees result,
                    decides next action)

The CLI agent has its own reasoning loop - it decides which tools to call and interprets the results.

MCP Server

User Question → [Claude Code / Cursor] → MCP → investigate tool → ReAct Agent → Cluster / Vector DB
                                                      ↑                  |
                                                      └──────────────────┘
                                                     (agent reasons internally)

The MCP server exposes a single investigate tool that wraps the same ReAct agent used by the CLI. This gives MCP clients complete investigations with full tracing - one call captures the entire reasoning chain.

REST API

k8s-vectordb-sync controller
        |
        ├── POST /api/v1/instances/sync      (resource changes)
        ├── POST /api/v1/capabilities/scan   (CRD changes)
        v
cluster-whisperer serve (Hono server) → Vector DB
        ^
        |
Kubernetes cluster ──(watches)──┘

The REST API receives pushed data from the k8s-vectordb-sync controller. Instance sync keeps the vector database up-to-date as resources change. Capability scan triggers LLM inference when new CRDs are installed, so the agent discovers new resource types automatically.

Available Tools

CLI Agent: Uses these tools internally during investigation. Which tools are available depends on the --tools flag:

| Tool Group | Tools | Purpose |
|---|---|---|
| kubectl | kubectl_get, kubectl_describe, kubectl_logs | Cluster investigation |
| vector | vector_search | Semantic discovery of cluster capabilities |
| apply | kubectl_apply | Deploy resources from the approved catalog |

Default: --tools kubectl,vector (backwards compatible).

  • kubectl_get - List resources and their status

  • kubectl_describe - Get detailed resource information

  • kubectl_logs - Check container logs

  • vector_search - Search the vector database with three composable dimensions:

    • Semantic search (query) — natural language similarity via embeddings (e.g., "managed database" finds SQL CRDs)
    • Keyword search (keyword) — exact substring match, no embedding call (e.g., "backup" finds docs mentioning backup)
    • Metadata filters (kind, apiGroup, namespace, complexity) — exact match on structured fields
  • kubectl_apply - Deploy a Kubernetes resource by applying a YAML manifest. Validates the resource type against the platform's approved catalog before applying — enforcement is in code, not in the prompt. If the resource type isn't in the capabilities collection, the apply is rejected.

    The agent uses kubectl tools for investigation ("why is this pod failing?"), vector search for discovery ("what databases can I provision?"), and kubectl_apply for deployment ("deploy the database for my team").
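The code-level enforcement behind kubectl_apply can be pictured as a pre-flight check like this. Names are illustrative; the real check consults the capabilities collection in the vector database rather than an in-memory list:

```typescript
// Reject a manifest unless its kind/apiGroup is in the approved catalog.
type CatalogEntry = { kind: string; apiGroup: string };

function validateAgainstCatalog(
  manifest: { kind: string; apiVersion: string },
  catalog: CatalogEntry[],
): { allowed: boolean; reason?: string } {
  // "devopstoolkit.live/v1" -> apiGroup "devopstoolkit.live"; "v1" -> core group ("").
  const apiGroup = manifest.apiVersion.includes("/") ? manifest.apiVersion.split("/")[0] : "";
  const allowed = catalog.some(e => e.kind === manifest.kind && e.apiGroup === apiGroup);
  return allowed
    ? { allowed: true }
    : { allowed: false, reason: `${manifest.kind}.${apiGroup || "core"} is not in the approved catalog` };
}
```

Because the check runs in code before kubectl does, a prompt injection cannot talk the agent into applying an unapproved resource type.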

MCP Server: Exposes a single high-level tool:

  • investigate - Ask a question, get a complete answer (wraps the ReAct agent with all tools above)

Observability

OpenTelemetry tracing provides visibility into agent operations. OTel SDK packages are optional peer dependencies — tracing works when installed but everything runs fine without them. See docs/observability/opentelemetry.md for installation and configuration details.

cluster-whisperer.investigate (root span)
├── kubectl_get.tool
│   └── kubectl get pods -n default
├── kubectl_describe.tool
│   └── kubectl describe pod broken-pod
├── kubectl_logs.tool
│   └── kubectl logs broken-pod
└── vector_search.tool
    └── query: "managed database provisioning"

Environment Variables:

| Variable | Default | Description |
|---|---|---|
| CLUSTER_WHISPERER_TOOLS | kubectl,vector | Comma-separated tool groups: kubectl, vector, apply |
| CLUSTER_WHISPERER_AGENT | langgraph | Agent framework: langgraph or vercel |
| CLUSTER_WHISPERER_VECTOR_BACKEND | chroma | Vector database: chroma or qdrant |
| CLUSTER_WHISPERER_THREAD | - | Conversation thread ID for multi-turn memory |
| CLUSTER_WHISPERER_KUBECONFIG | - | Kubeconfig path passed to kubectl (agent-only cluster access) |
| CLUSTER_WHISPERER_CHROMA_URL | http://localhost:8000 | Chroma vector database URL |
| CLUSTER_WHISPERER_QDRANT_URL | http://localhost:6333 | Qdrant vector database URL |
| CLUSTER_WHISPERER_QUIET | false | Suppress OTel init messages and Chroma warnings |
| OTEL_TRACING_ENABLED | false | Enable tracing |
| OTEL_EXPORTER_TYPE | console | console or otlp |
| OTEL_EXPORTER_OTLP_ENDPOINT | - | OTLP collector URL (e.g., http://localhost:4318) |
| OTEL_CAPTURE_AI_PAYLOADS | false | Capture tool inputs/outputs in traces |
| VOYAGE_API_KEY | - | Voyage AI API key (required by sync, sync-instances, and serve) |

Schema Validation:

Custom span attributes (cluster_whisperer.*, traceloop.*, gen_ai.*) are formally defined in a Weaver registry at telemetry/registry/attributes.yaml. This is the single source of truth for attribute names, types, and descriptions. Weaver validates the schema and resolves references to OTel semantic conventions:

npm run telemetry:check     # Validate registry structure and references
npm run telemetry:resolve   # Resolve all references to flat JSON

See docs/observability/tracing-conventions.md for tracing architecture and design rationale, and docs/observability/telemetry-generated/attributes/cluster-whisperer.md for the auto-generated attribute reference.

Project Structure

src/
├── index.ts               # CLI entry point (agent + sync + serve commands)
├── mcp-server.ts          # MCP server entry point
├── agent/
│   ├── agent-events.ts       # AgentEvent union type (shared between agents)
│   ├── agent-interface.ts    # InvestigationAgent interface
│   ├── investigator.ts       # ReAct agent setup (LangGraph)
│   ├── langgraph-adapter.ts  # Wraps LangGraph agent as InvestigationAgent
│   ├── file-checkpointer.ts  # Persistent conversation memory for LangGraph --thread
│   ├── vercel-agent.ts       # Vercel AI SDK agent implementation
│   └── vercel-thread-store.ts # Conversation memory for Vercel agent --thread
├── api/                   # REST API for controller-pushed sync
│   ├── server.ts          # Hono HTTP server with health probes
│   ├── routes/
│   │   ├── instances.ts   # POST /api/v1/instances/sync endpoint
│   │   └── capabilities.ts # POST /api/v1/capabilities/scan endpoint
│   └── schemas/
│       ├── sync-payload.ts # Zod validation for instance sync payloads
│       └── scan-payload.ts # Zod validation for capability scan payloads
├── pipeline/              # Knowledge sync pipelines
│   ├── discovery.ts       # Resource type discovery (kubectl api-resources)
│   ├── inference.ts       # Capability inference (kubectl explain → LLM)
│   ├── storage.ts         # Capability document storage
│   ├── runner.ts          # Capability sync orchestrator
│   ├── instance-discovery.ts  # Resource instance discovery (kubectl get)
│   ├── instance-storage.ts    # Instance document storage
│   └── instance-runner.ts     # Instance sync orchestrator
├── vectorstore/           # Vector database abstraction
│   ├── types.ts           # VectorStore interface
│   ├── chroma-backend.ts  # Chroma implementation
│   ├── qdrant-backend.ts  # Qdrant implementation
│   ├── multi-backend.ts   # Writes to multiple backends in parallel (for sync)
│   └── embeddings.ts      # Voyage AI embedding provider
├── tools/
│   ├── core/              # Shared tool logic (schemas, execution)
│   │   ├── kubectl-get.ts
│   │   ├── kubectl-describe.ts
│   │   ├── kubectl-logs.ts
│   │   ├── kubectl-apply.ts   # Deploy with catalog validation
│   │   ├── vector-search.ts   # Unified semantic/keyword/metadata search
│   │   └── format-results.ts  # Search result formatting
│   ├── tool-groups.ts     # Tool group definitions (kubectl, vector, apply)
│   ├── langchain/         # LangGraph tool wrappers
│   ├── vercel/            # Vercel AI SDK tool wrappers
│   └── mcp/               # MCP server wrappers
├── tracing/               # OpenTelemetry instrumentation
│   ├── index.ts           # OTel initialization, exporter setup
│   ├── context-bridge.ts  # AsyncLocalStorage workaround for LangGraph
│   ├── tool-tracing.ts    # Tool span wrapper
│   ├── tool-definitions-processor.ts  # Adds tool definitions to LLM spans
│   ├── vercel-span-processor.ts  # Enriches Vercel SDK spans for Datadog LLM Obs
│   └── optional-deps.ts   # Graceful loading of optional OTel packages
└── utils/
    └── kubectl.ts         # Shared kubectl execution helper

prompts/
├── investigator.md        # Agent system prompt (investigation behavior)
└── capability-inference.md # Capability inference prompt (sync pipeline)

telemetry/
└── registry/              # OpenTelemetry Weaver schema
    ├── attributes.yaml    # Custom attribute definitions
    └── registry_manifest.yaml  # Schema metadata + OTel semconv dependency

scripts/
└── seed-test-data.ts      # Load sample data into Chroma for testing

docs/
├── agentic-loop.md                  # How the ReAct agent works
├── capability-inference-pipeline.md # How capability sync works
├── kubectl-tools.md                 # How kubectl tools work
├── langgraph-vs-langchain.md        # LangChain vs LangGraph explained
├── mcp-server.md                    # MCP server architecture
├── opentelemetry.md                 # OpenTelemetry implementation guide
├── resource-instance-sync.md        # How instance sync works
├── tracing-conventions.md           # Tracing architecture and design rationale
└── vector-database.md               # Vector database architecture

demo/
├── app/                   # Demo app — intentionally broken prop for KubeCon talk
│   ├── src/               # Hono server with DATABASE_URL connection logic
│   ├── k8s/               # Deployment + Service manifests
│   └── Dockerfile         # Multi-stage build
└── cluster/               # Demo cluster provisioning (GKE)
    ├── setup.sh           # Create cluster with all demo components
    ├── teardown.sh        # Destroy clusters and clean up kubeconfig
    ├── reset-demo.sh      # Reset between demo runs (cleanup ManagedService, restart app, clear threads)
    ├── kind-config.yaml   # Kind cluster configuration (experimental)
    ├── helm-values/       # Helm values for Crossplane, Chroma, Qdrant, Jaeger, OTel Collector
    └── manifests/         # Crossplane providers, XRDs, Compositions, decoy resources

Demo App

The demo/app/ directory contains a minimal Hono web server that requires a PostgreSQL database. It exists as a prop for the KubeCon "Choose Your Own Adventure" demo — when deployed to Kubernetes without a database, it crashes immediately and enters CrashLoopBackOff. The cluster-whisperer agent then investigates why the app is broken, discovers the missing database, and deploys one.

The app is intentionally simple. It connects to DATABASE_URL on startup: if the connection succeeds, it serves HTTP traffic; if it fails (or the variable is missing), it crashes with a clear, single-line error message designed for the agent to parse from kubectl logs.
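That startup contract can be summarized as a single decision function. This is a reconstruction from the documented log output, not the actual demo/app/src/index.ts; `startupError` is a hypothetical name:

```typescript
// Returns the single-line FATAL message to print before exiting,
// or null when the app should start serving HTTP.
function startupError(databaseUrl: string | undefined, connectErr?: string): string | null {
  if (!databaseUrl) return "[demo-app] FATAL: DATABASE_URL environment variable is required";
  if (connectErr) return `[demo-app] FATAL: Cannot connect to database at ${databaseUrl} - ${connectErr}`;
  return null; // Connection succeeded: start the server.
}
```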

Structure

demo/app/
├── src/
│   ├── index.ts          # Entry point — reads DATABASE_URL, attempts connection, crashes or starts server
│   ├── server.ts         # Hono app factory with GET / (DB status) and GET /healthz (liveness probe)
│   └── server.test.ts    # Unit tests for routes, startup behavior, and error message format
├── k8s/
│   ├── deployment.yaml   # Deployment with DATABASE_URL pointing to a non-existent service
│   └── service.yaml      # ClusterIP service exposing port 80 → 3000
├── Dockerfile            # Multi-stage build (node:22-alpine)
├── package.json
└── tsconfig.json

Build and Run

Build the container image:

cd demo/app
docker build -t demo-app:latest .

Without DATABASE_URL, the app crashes immediately:

$ docker run --rm demo-app:latest
[demo-app] Starting server...
[demo-app] FATAL: DATABASE_URL environment variable is required
[demo-app] Exiting with code 1

With an unreachable DATABASE_URL, it crashes with a connection error:

$ docker run --rm -e DATABASE_URL=postgres://db-service:5432/myapp demo-app:latest
[demo-app] Starting server...
[demo-app] Connecting to database at postgres://db-service:5432/myapp...
[demo-app] FATAL: Cannot connect to database at postgres://db-service:5432/myapp - getaddrinfo ENOTFOUND db-service
[demo-app] Exiting with code 1

Both crash modes are intentional — this is the behavior the agent investigates during the demo.

Deploy to Kubernetes

Load the image into a Kind cluster and apply the manifests:

kind load docker-image demo-app:latest --name <cluster-name>
kubectl apply -f demo/app/k8s/

The Deployment sets DATABASE_URL to postgres://db-service:5432/myapp — a service that doesn't exist in the cluster. The app crashes on startup and Kubernetes restarts it, producing CrashLoopBackOff within seconds:

$ kubectl get pods -l app=demo-app
NAME                        READY   STATUS             RESTARTS        AGE
demo-app-748c9d8c54-8mngm   0/1     CrashLoopBackOff   41 (4m ago)    3h9m

The logs show the same connection error from the Build and Run section:

$ kubectl logs --previous -l app=demo-app
[demo-app] Starting server...
[demo-app] Connecting to database at postgres://db-service:5432/myapp...
[demo-app] FATAL: Cannot connect to database at postgres://db-service:5432/myapp - getaddrinfo ENOTFOUND db-service
[demo-app] Exiting with code 1

This is what the cluster-whisperer agent sees when it investigates. The error messages are designed to be agent-friendly — single-line, containing the word "database" and the connection target, so the agent can diagnose the missing database from kubectl logs output alone.
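Because the connection target appears verbatim on one line, even a simple regex recovers it. A hypothetical sketch of that extraction:

```typescript
// Pull the unreachable database URL out of a kubectl logs line.
function extractDbTarget(logs: string): string | null {
  const m = logs.match(/Cannot connect to database at (\S+)/);
  return m ? m[1] : null;
}
```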

Demo Cluster

The demo/cluster/ directory contains scripts to provision a complete demo environment on GKE. A single command creates a Kubernetes cluster with ~360 Crossplane CRDs, two vector databases, two observability backends, the demo app in CrashLoopBackOff, and a live cluster-whisperer instance — everything needed for the KubeCon "Choose Your Own Adventure" demo.

Prerequisites

  • Google Cloud SDK (gcloud) with gke-gcloud-auth-plugin
  • Helm 3.x
  • kubectl
  • Docker (for building container images)
  • Node.js 18+ (for the capability inference pipeline)
  • API keys in a .env file at the repo root (see .env.example):
    • ANTHROPIC_API_KEY — for capability inference
    • VOYAGE_API_KEY — for vector embeddings
    • DD_API_KEY — for Datadog trace export (optional)

Setup

./demo/cluster/setup.sh gcp

The script auto-detects the nearest GCP zone (override with GCP_ZONE=europe-west1-b). It creates a 3-node GKE cluster, installs all components, runs the capability inference pipeline, and prints a summary when complete:

[ok] ==============================================
[ok] Demo Cluster Ready (gcp mode)
[ok] ==============================================

==> Mode:           gcp
==> Cluster:        cluster-whisperer-20260312-155916
==> KUBECONFIG:     /Users/whitney.lee/.kube/config-cluster-whisperer
==> CRDs:           1041
==> Demo app:       CrashLoopBackOff
==> Chroma:         Running
==> Qdrant:         Running
==> Jaeger:         Running
==> OTel Collector: Running
==> Ingress NGINX:  Running
==> CW serve:       Running
==> vectordb-sync:  Running

==> Ingress URLs:
==>   cluster-whisperer: http://cluster-whisperer.34.123.173.28.nip.io
==>   Jaeger UI:         http://jaeger.34.123.173.28.nip.io

==> To use this cluster:
  export KUBECONFIG=/Users/whitney.lee/.kube/config-cluster-whisperer

The setup script writes credentials to ~/.kube/config-cluster-whisperer. Your default kubeconfig is also modified during setup; teardown removes those entries.

Setup takes approximately 45-55 minutes on a cold start (GKE creation ~8 min, CRD registration ~23 min, capability inference ~12 min).

What Gets Created

| Component | Namespace | Purpose |
|---|---|---|
| GKE cluster (3x n2-standard-4) | — | Kubernetes environment |
| Crossplane + 16 sub-providers | crossplane-system | ~360 CRDs for discovery |
| 20 ManagedService XRDs + Compositions | crossplane-system | 1 real + 19 decoys — "needle in the haystack" |
| Chroma | chroma | Vector database option A (capabilities + instances) |
| Qdrant | qdrant | Vector database option B (capabilities + instances) |
| Jaeger v2 | jaeger | Trace UI backend |
| OTel Collector | otel-collector | OTLP to Jaeger + Datadog fan-out |
| Demo app | default | Intentionally broken (CrashLoopBackOff) |
| cluster-whisperer serve | cluster-whisperer | REST API for live sync |
| k8s-vectordb-sync | k8s-vectordb-sync | Controller pushing resource changes |
| NGINX Ingress | ingress-nginx | External access via nip.io DNS |

The setup script also runs the capability inference pipeline, which analyzes all ~360 CRDs via LLM and stores natural-language descriptions in Chroma. This is what enables semantic search — when the agent searches for "PostgreSQL database for my application", it finds the platform Composition among ~360 CRDs because the pipeline generated a description like "Platform-approved PostgreSQL database for application teams."

Teardown

./demo/cluster/teardown.sh

Discovers and deletes all cluster-whisperer clusters (both Kind and GKE), removes their kubeconfig entries, and cleans up the dedicated kubeconfig file if empty. GKE clusters incur billing until fully deleted.

License

MIT
