# Production-grade LLM inference orchestration (control plane + worker)
A pragmatic, auditable system for routing and executing LLM/VLM inference across heterogeneous backends (local GPU, cloud GPU, external APIs). Built as a control plane (FastAPI) + stateless worker runtime with Redis queues, OTEL instrumentation and Prometheus / Grafana observability.
> **Note:** `docker-compose.dev.yml` in this repo is intended for local development only. It brings up the app, workers, Redis, and a local OTEL/Prometheus/Grafana stack for rapid iteration. Production deployments should use Cloud Run / managed services (see Deploy → GCP).
- ✅ Control plane (FastAPI) with job submission / status endpoints
- ✅ Generic worker runtime with pluggable backend adapter model
- ✅ Redis-backed priority queues (`high`, `normal`, `low`)
- ✅ Structured JSON logging and OTEL metrics + Prometheus integration
- ⚠️ No autoscaling yet (planned)
- ⚠️ File/object storage intentionally out of scope
- Clear separation of concerns: control plane vs workers
- Deterministic view of system state via inspectable Redis primitives
- Cost/latency aware routing is implemented in a testable policy module (pluggable)
- Observability-first: OTEL → Prometheus → Grafana + structured logs
- Minimal cloud migration path (Cloud Run + managed Redis recommended)
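The cost/latency-aware routing mentioned above can be sketched as a small, testable policy module. This is an illustrative sketch only; `Backend`, `RoutingPolicy`, and `CostLatencyPolicy` are hypothetical names, not the repo's actual interface:

```python
# Hypothetical sketch of a pluggable routing policy.
# All class/field names here are illustrative, not the repo's real API.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Backend:
    name: str
    cost_per_1k_tokens: float  # USD, assumed static for this sketch
    p95_latency_s: float       # observed 95th-percentile latency
    multimodal: bool


class RoutingPolicy(Protocol):
    def select(self, backends: list[Backend], requires_multimodal: bool) -> Backend: ...


class CostLatencyPolicy:
    """Pick the cheapest eligible backend whose latency stays under a budget."""

    def __init__(self, latency_budget_s: float = 5.0):
        self.latency_budget_s = latency_budget_s

    def select(self, backends: list[Backend], requires_multimodal: bool) -> Backend:
        eligible = [
            b for b in backends
            if (not requires_multimodal or b.multimodal)
            and b.p95_latency_s <= self.latency_budget_s
        ]
        if not eligible:
            raise RuntimeError("no eligible backend")
        return min(eligible, key=lambda b: b.cost_per_1k_tokens)
```

Because the policy is a plain object behind a small interface, it can be unit-tested in isolation and hot-swapped without touching queue or worker code.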
## Prerequisites
- Docker (with Docker Compose)
- git
Run the full local dev stack (app, worker, Redis, otel-collector, Prometheus, Grafana):

```bash
docker compose -f docker-compose.dev.yml up --build -d
```

Open:
- API: http://localhost:8000 (FastAPI docs at http://localhost:8000/docs)
- Grafana: http://localhost:3001 (default login `admin:admin`)
- Prometheus: http://localhost:9090
## Submit a job (example)
```bash
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: demo-1" \
  -d '{"prompt":"Summarize the page", "priority":"normal", "requires_multimodal": false}'
```

Check status / result:

```bash
curl http://localhost:8000/jobs/<job_id>/result
```

All runtime configuration comes from environment variables (control plane and worker images). Example highlights:
### Worker
- `WORKER_QUEUE` (`high`|`normal`|`low`): the queue the worker consumes
- `MAX_CONCURRENCY`: number of parallel jobs within one worker
- `BACKEND`: which backend adapter the worker uses (e.g., `ollama`, `hf_local`, `dev`)
### General (Control plane & worker)
- `OTEL_EXPORTER_OTLP_ENDPOINT` (collector address)
- `LOG_LEVEL` (`DEBUG`|`INFO`|`WARN`|`ERROR`)
- `REDIS_URL` (e.g. `redis://redis:6379/0`)

See the examples under `docker/` as well.
- All logs are structured JSON and include `event`, `job_id`, `policy`, `backend`, and (when available) `trace_id`.
- The OTEL SDK instruments metrics. Key metrics:
  - `llm_orchestrator_jobs_created_total`
  - `llm_orchestrator_jobs_enqueued_total`
  - `llm_orchestrator_job_latency_seconds` (histogram)
  - `llm_orchestrator_queue_depth`
  - `llm_orchestrator_routing_decisions_total{policy,backend}`
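The structured-log shape described above can be sketched with only the stdlib. The formatter and field-attachment mechanism here are illustrative, not the repo's actual implementation:

```python
# Minimal structured-JSON logging sketch; field names mirror the ones the
# README lists, but the JsonFormatter class itself is hypothetical.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Optional correlation fields, attached via `extra=` at call sites.
        for field in ("job_id", "policy", "backend", "trace_id"):
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("llm_orchestrator")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "event": "job_enqueued", "job_id": "demo-1", ...}
logger.info("job_enqueued", extra={"job_id": "demo-1", "backend": "ollama"})
```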
### Prometheus / Grafana
- The OTEL Collector exposes a Prometheus scrape endpoint at `:9464`.
- Useful Grafana panels (some already prepared in `./observability/grafana/dashboards`): P50/P95/P99 latency, latency heatmap, queue depth, throughput, routing breakdown.
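As one example, a P95 latency panel could use a query along these lines, assuming the collector's Prometheus exporter publishes the histogram buckets as `llm_orchestrator_job_latency_seconds_bucket` (the exact exported name depends on the collector configuration):

```promql
# P95 job latency over the last 5 minutes
histogram_quantile(
  0.95,
  sum(rate(llm_orchestrator_job_latency_seconds_bucket[5m])) by (le)
)
```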
### Logs
- By default, logs print to stdout as JSON so Docker/Cloud environments can ingest them (Cloud Logging, Loki, etc.).
## Deploy → GCP

For demos and small production workloads, managed Cloud Run is recommended (faster to operate than GKE).
### High-level steps
1. Build and push images:

   ```bash
   gcloud builds submit --tag gcr.io/<PROJECT_ID>/llm-orchestrator-control
   gcloud builds submit --tag gcr.io/<PROJECT_ID>/llm-orchestrator-worker
   ```

2. Create a managed Redis (Memorystore) instance (or use a cloud Redis provider) and set `REDIS_URL`.

3. Deploy the control plane to Cloud Run:

   ```bash
   gcloud run deploy llm-orchestrator-control \
     --image gcr.io/<PROJECT_ID>/llm-orchestrator-control \
     --region <REGION> \
     --allow-unauthenticated \
     --set-env-vars REDIS_URL=<redis_url>,OTEL_EXPORTER_OTLP_ENDPOINT=<collector>
   ```

4. Workers: run them on Cloud Run with concurrency=1 (specific CPU/GPU machine types are not supported on Cloud Run; for GPU workers use GCE or GKE), or run GPU workers as GCE instances with the same container and environment.

5. Use Google Cloud Monitoring / Cloud Logging, or configure the OTEL exporter for Google Cloud to push metrics and logs.
### Notes
- For GPU-backed heavy inference, use GCE GPU VMs or GKE node pools with GPUs; workers are the same container image, with `BACKEND` configured to use local GPU resources.
- Keep `docker-compose.dev.yml` marked as dev-only in the README and do NOT use it for production.
- Control plane (FastAPI): validates and enqueues jobs, exposes API and admin endpoints, determines routing policy.
- Redis: coordination (queues, job metadata, idempotency keys, results).
- Workers: stateless runtime that reads one queue, executes jobs via the configured backend adapter, and writes results back.
- OTEL Collector → Prometheus → Grafana: metrics pipeline for monitoring.
- Policy module: pluggable selection logic (cost/latency-aware); policy hot-swap supported.
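The pluggable backend-adapter idea can be illustrated with a minimal sketch. The `BackendAdapter` protocol, `DevBackend` class, and `run_job` helper below are hypothetical names for illustration; the real interface lives under `worker/` and may differ:

```python
# Illustrative sketch of a pluggable backend adapter; not the repo's real API.
from typing import Protocol


class BackendAdapter(Protocol):
    def generate(self, prompt: str) -> str: ...


class DevBackend:
    """Echo backend for local development (no model required)."""

    def generate(self, prompt: str) -> str:
        return f"[dev-echo] {prompt}"


def run_job(adapter: BackendAdapter, job: dict) -> dict:
    """Execute one dequeued job and shape the result to be written back."""
    output = adapter.generate(job["prompt"])
    return {"job_id": job["job_id"], "status": "succeeded", "output": output}
```

Because workers depend only on the protocol, swapping `BACKEND` (e.g. `ollama`, `hf_local`, `dev`) amounts to constructing a different adapter at startup.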
- `app/` – control plane source
- `worker/` – worker runtime & backends
- `docker/` – Dockerfiles (`app.Dockerfile`, `worker.Dockerfile`)
- `docker-compose.dev.yml` – local dev composition (dev-only)
- `observability/` – OTEL Collector / Prometheus configs
- Use the local compose stack to validate the end-to-end flow.
- Generate test load (simple script in `tools/`) to produce latency graphs for the README.
- Unit tests cover policy logic and queue helpers; integration tests cover enqueue→worker execution (requires Redis).
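A load generator along the lines of the `tools/` script might look like the sketch below (hypothetical; the actual script may differ). It posts N jobs concurrently to the local control plane using only the stdlib:

```python
# Hypothetical load-generation sketch; endpoint shape matches the curl
# example in this README, everything else is illustrative.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def make_request(base_url: str, i: int) -> urllib.request.Request:
    """Build one POST /jobs request with a unique idempotency key."""
    body = json.dumps({
        "prompt": f"load test {i}",
        "priority": "normal",
        "requires_multimodal": False,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/jobs",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Idempotency-Key": f"load-{i}",
        },
        method="POST",
    )


def run(base_url: str = "http://localhost:8000", n: int = 100) -> None:
    # Fire requests from a small thread pool to create sustained load.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for req in (make_request(base_url, i) for i in range(n)):
            pool.submit(urllib.request.urlopen, req, timeout=10)
```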
- Non-goals: file/object lifecycle, auth, billing, multi-tenant quotas, training/fine-tuning.
- Idempotency is handled at the API layer via the `Idempotency-Key` header (stored in Redis with a TTL).
- Visibility timeouts / exactly-once semantics are deferred; the current model provides at-least-once execution with explicit job states.
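The idempotency-key pattern maps naturally onto Redis `SET key value NX EX ttl`. The dict-backed stand-in below illustrates the semantics only; the class name and method are hypothetical, and the real system uses Redis:

```python
# In-memory stand-in for the Redis-backed idempotency store, to show the
# at-most-one-job-per-key semantics. Real storage: SET key job_id NX EX ttl.
import time


class IdempotencyStore:
    def __init__(self, ttl_s: float = 3600.0):
        self.ttl_s = ttl_s
        self._entries: dict[str, tuple[str, float]] = {}

    def put_if_absent(self, key: str, job_id: str) -> str:
        """Return the job_id that owns the key: the new one if the key was
        free (or expired), otherwise the previously stored one."""
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self.ttl_s:
            return entry[0]  # duplicate submission: reuse the first job
        self._entries[key] = (job_id, now)
        return job_id
```

A repeated `POST /jobs` with the same `Idempotency-Key` thus resolves to the original job rather than enqueueing a duplicate.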
- Fork & create a feature branch.
- Run unit tests (`pytest`).
- Add or update docs / dashboard JSON if you change metrics.
- Open PR describing design tradeoffs.
MIT
rastorguev2047@gmail.com

