# Production-grade LLM inference orchestration (control plane + worker)
A pragmatic, auditable system for routing and executing LLM/VLM inference across heterogeneous backends (local GPU, cloud GPU, external APIs). Built as a control plane (FastAPI) + stateless worker runtime with Redis queues, OTEL instrumentation and Prometheus / Grafana observability.
> **Note:** `docker-compose.dev.yml` in this repo is intended for local development only. It brings up the app, workers, Redis, and a local OTEL/Prometheus/Grafana stack for rapid iteration. Production deployments should use Cloud Run / managed services (see Deploy → GCP).
- ✅ Control plane (FastAPI) with job submission / status endpoints
- ✅ Generic worker runtime with pluggable backend adapter model
- ✅ Redis-backed priority queues (`high`, `normal`, `low`)
- ✅ Structured JSON logging and OTEL metrics + Prometheus integration
- ⚠️ No autoscaling yet (planned)
- ⚠️ File/object storage intentionally out of scope
- Clear separation of concerns: control plane vs workers
- Deterministic view of system state via inspectable Redis primitives
- Cost/latency aware routing is implemented in a testable policy module (pluggable)
- Observability-first: OTEL → Prometheus → Grafana + structured logs
- Minimal cloud migration path (Cloud Run + managed Redis recommended)
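The cost/latency-aware routing mentioned above can be sketched as a small, testable policy module. This is an illustrative sketch only; `Backend`, `RoutingPolicy`, and `CostLatencyPolicy` are hypothetical names, not the repo's actual interface:

```python
# Hypothetical sketch of a pluggable routing policy.
# All class/field names here are illustrative, not the repo's real API.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Backend:
    name: str
    cost_per_1k_tokens: float  # USD, assumed static for this sketch
    p95_latency_s: float       # observed 95th-percentile latency
    multimodal: bool


class RoutingPolicy(Protocol):
    def select(self, backends: list[Backend], requires_multimodal: bool) -> Backend: ...


class CostLatencyPolicy:
    """Pick the cheapest eligible backend whose latency stays under a budget."""

    def __init__(self, latency_budget_s: float = 5.0):
        self.latency_budget_s = latency_budget_s

    def select(self, backends: list[Backend], requires_multimodal: bool) -> Backend:
        eligible = [
            b for b in backends
            if (not requires_multimodal or b.multimodal)
            and b.p95_latency_s <= self.latency_budget_s
        ]
        if not eligible:
            raise RuntimeError("no eligible backend")
        return min(eligible, key=lambda b: b.cost_per_1k_tokens)
```

Because the policy is a plain object behind a small interface, it can be unit-tested in isolation and hot-swapped without touching queue or worker code.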
## Prerequisites
- Docker (with Docker Compose)
- git
Run the full local dev stack (app, worker, Redis, otel-collector, Prometheus, Grafana):

```bash
docker compose -f docker-compose.dev.yml up --build -d
```

Open:
- API: http://localhost:8000 (FastAPI docs at http://localhost:8000/docs)
- Grafana: http://localhost:3001 (default login `admin:admin`)
- Prometheus: http://localhost:9090
## Submit a job (example)
```bash
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: demo-1" \
  -d '{"prompt":"Summarize the page", "priority":"normal", "requires_multimodal": false}'
```

Check status / result:

```bash
curl http://localhost:8000/jobs/<job_id>/result
```

All runtime configuration comes from environment variables (control plane and worker images). Example highlights:
### Worker
- `WORKER_QUEUE` (`high`|`normal`|`low`): the queue the worker consumes
- `MAX_CONCURRENCY`: number of parallel jobs within one worker
- `BACKEND`: which backend adapter the worker uses (e.g., `ollama`, `hf_local`, `dev`)
### General (Control plane & worker)
- `OTEL_EXPORTER_OTLP_ENDPOINT` (collector address)
- `LOG_LEVEL` (`DEBUG`|`INFO`|`WARN`|`ERROR`)
- `REDIS_URL` (e.g. `redis://redis:6379/0`)

See the examples under `docker/` as well.
- All logs are structured JSON and include `event`, `job_id`, `policy`, `backend`, and (when available) `trace_id`.
- The OTEL SDK instruments metrics. Key metrics:
  - `llm_orchestrator_jobs_created_total`
  - `llm_orchestrator_jobs_enqueued_total`
  - `llm_orchestrator_job_latency_seconds` (histogram)
  - `llm_orchestrator_queue_depth`
  - `llm_orchestrator_routing_decisions_total{policy,backend}`
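The structured-log shape described above can be sketched with only the stdlib. The formatter and field-attachment mechanism here are illustrative, not the repo's actual implementation:

```python
# Minimal structured-JSON logging sketch; field names mirror the ones the
# README lists, but the JsonFormatter class itself is hypothetical.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Optional correlation fields, attached via `extra=` at call sites.
        for field in ("job_id", "policy", "backend", "trace_id"):
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("llm_orchestrator")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "event": "job_enqueued", "job_id": "demo-1", ...}
logger.info("job_enqueued", extra={"job_id": "demo-1", "backend": "ollama"})
```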
### Prometheus / Grafana
- The OTEL Collector exposes a Prometheus scrape endpoint at `:9464`.
- Useful Grafana panels (some already prepared in `./observability/grafana/dashboards`): P50/P95/P99 latency, latency heatmap, queue depth, throughput, routing breakdown.
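As one example, a P95 latency panel could use a query along these lines, assuming the collector's Prometheus exporter publishes the histogram buckets as `llm_orchestrator_job_latency_seconds_bucket` (the exact exported name depends on the collector configuration):

```promql
# P95 job latency over the last 5 minutes
histogram_quantile(
  0.95,
  sum(rate(llm_orchestrator_job_latency_seconds_bucket[5m])) by (le)
)
```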
### Logs
- By default, logs print to stdout as JSON so Docker/Cloud environments can ingest them (Cloud Logging, Loki, etc.).
## Deploy → GCP

For demos and small production workloads, managed Cloud Run is recommended (faster to operate than GKE).
### High-level steps
1. Build and push images:

   ```bash
   gcloud builds submit --tag gcr.io/<PROJECT_ID>/llm-orchestrator-control
   gcloud builds submit --tag gcr.io/<PROJECT_ID>/llm-orchestrator-worker
   ```

2. Create a managed Redis (Memorystore) instance (or use a cloud Redis provider) and set `REDIS_URL`.

3. Deploy the control plane to Cloud Run:

   ```bash
   gcloud run deploy llm-orchestrator-control \
     --image gcr.io/<PROJECT_ID>/llm-orchestrator-control \
     --region <REGION> \
     --allow-unauthenticated \
     --set-env-vars REDIS_URL=<redis_url>,OTEL_EXPORTER_OTLP_ENDPOINT=<collector>
   ```

4. Workers: run them on Cloud Run with concurrency=1 (specific CPU/GPU machine types are not supported on Cloud Run; for GPU workers use GCE or GKE), or run GPU workers as GCE instances with the same container and environment.

5. Use Google Cloud Monitoring / Cloud Logging, or configure the OTEL exporter for Google Cloud to push metrics and logs.
### Notes
- For GPU-backed heavy inference, use GCE GPU VMs or GKE node pools with GPUs; workers are the same container image, with `BACKEND` configured to use local GPU resources.
- Keep `docker-compose.dev.yml` marked as dev-only in the README and do NOT use it for production.
- Control plane (FastAPI): validates and enqueues jobs, exposes API and admin endpoints, determines routing policy.
- Redis: coordination (queues, job metadata, idempotency keys, results).
- Workers: stateless runtime that reads one queue, executes jobs via the configured backend adapter, and writes results back.
- OTEL Collector → Prometheus → Grafana: metrics pipeline for monitoring.
- Policy module: pluggable selection logic (cost/latency-aware); policy hot-swap supported.
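The pluggable backend-adapter idea can be illustrated with a minimal sketch. The `BackendAdapter` protocol, `DevBackend` class, and `run_job` helper below are hypothetical names for illustration; the real interface lives under `worker/` and may differ:

```python
# Illustrative sketch of a pluggable backend adapter; not the repo's real API.
from typing import Protocol


class BackendAdapter(Protocol):
    def generate(self, prompt: str) -> str: ...


class DevBackend:
    """Echo backend for local development (no model required)."""

    def generate(self, prompt: str) -> str:
        return f"[dev-echo] {prompt}"


def run_job(adapter: BackendAdapter, job: dict) -> dict:
    """Execute one dequeued job and shape the result to be written back."""
    output = adapter.generate(job["prompt"])
    return {"job_id": job["job_id"], "status": "succeeded", "output": output}
```

Because workers depend only on the protocol, swapping `BACKEND` (e.g. `ollama`, `hf_local`, `dev`) amounts to constructing a different adapter at startup.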
- `app/` – control plane source
- `worker/` – worker runtime & backends
- `docker/` – Dockerfiles (`app.Dockerfile`, `worker.Dockerfile`)
- `docker-compose.dev.yml` – local dev composition (dev-only)
- `observability/` – OTEL Collector / Prometheus configs
- Use the local compose stack to validate the end-to-end flow.
- Generate test load (simple script in `tools/`) to produce latency graphs for the README.
- Unit tests cover policy logic and queue helpers; integration tests cover enqueue→worker execution (requires Redis).
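A load generator along the lines of the `tools/` script might look like the sketch below (hypothetical; the actual script may differ). It posts N jobs concurrently to the local control plane using only the stdlib:

```python
# Hypothetical load-generation sketch; endpoint shape matches the curl
# example in this README, everything else is illustrative.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def make_request(base_url: str, i: int) -> urllib.request.Request:
    """Build one POST /jobs request with a unique idempotency key."""
    body = json.dumps({
        "prompt": f"load test {i}",
        "priority": "normal",
        "requires_multimodal": False,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/jobs",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Idempotency-Key": f"load-{i}",
        },
        method="POST",
    )


def run(base_url: str = "http://localhost:8000", n: int = 100) -> None:
    # Fire requests from a small thread pool to create sustained load.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for req in (make_request(base_url, i) for i in range(n)):
            pool.submit(urllib.request.urlopen, req, timeout=10)
```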
- Non-goals: file/object lifecycle, auth, billing, multi-tenant quotas, training/fine-tuning.
- Idempotency is handled at the API layer via the `Idempotency-Key` header (stored in Redis with a TTL).
- Visibility timeouts / exactly-once semantics are deferred; the current model provides at-least-once execution with explicit job states.
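The idempotency-key pattern maps naturally onto Redis `SET key value NX EX ttl`. The dict-backed stand-in below illustrates the semantics only; the class name and method are hypothetical, and the real system uses Redis:

```python
# In-memory stand-in for the Redis-backed idempotency store, to show the
# at-most-one-job-per-key semantics. Real storage: SET key job_id NX EX ttl.
import time


class IdempotencyStore:
    def __init__(self, ttl_s: float = 3600.0):
        self.ttl_s = ttl_s
        self._entries: dict[str, tuple[str, float]] = {}

    def put_if_absent(self, key: str, job_id: str) -> str:
        """Return the job_id that owns the key: the new one if the key was
        free (or expired), otherwise the previously stored one."""
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self.ttl_s:
            return entry[0]  # duplicate submission: reuse the first job
        self._entries[key] = (job_id, now)
        return job_id
```

A repeated `POST /jobs` with the same `Idempotency-Key` thus resolves to the original job rather than enqueueing a duplicate.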
- Fork & create a feature branch.
- Run unit tests (`pytest`).
- Add or update docs / dashboard JSON if you change metrics.
- Open PR describing design tradeoffs.
MIT
rastorguev2047@gmail.com

