A-lex-Ra/LLM-orchestrator
llm-orchestrator

Production-grade LLM inference orchestration (control plane + worker)
A pragmatic, auditable system for routing and executing LLM/VLM inference across heterogeneous backends (local GPU, cloud GPU, external APIs). Built as a control plane (FastAPI) + stateless worker runtime with Redis queues, OTEL instrumentation and Prometheus / Grafana observability.

Note: docker-compose.dev.yml in this repo is intended for local development only — it brings up the app, workers, Redis and a local OTEL/Prometheus/Grafana stack for rapid iteration. Production deployment should use Cloud Run / managed services (see Deploy → GCP).


Status

  • ✅ Control plane (FastAPI) with job submission / status endpoints
  • ✅ Generic worker runtime with pluggable backend adapter model
  • ✅ Redis-backed priority queues (high, normal, low)
  • ✅ Structured JSON logging and OTEL metrics + Prometheus integration
  • ⚠️ No autoscaling yet (planned)
  • ⚠️ File/object storage intentionally out of scope
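A worker can drain these priority queues with Redis BRPOP, which checks the given keys in order and blocks until one is non-empty, so `high` is always served before `normal` and `low`. A minimal consumption sketch (the queue key names and JSON payload format here are assumptions, not the repo's actual wire format):

```python
import json

# Priority order matches the queues above: high before normal before low.
PRIORITY_QUEUES = ["queue:high", "queue:normal", "queue:low"]

def next_job(client, timeout=5):
    """Pop the next job, honoring queue priority.

    `client.brpop(keys, timeout)` (redis-py style) checks the keys in the
    order given and returns (queue_name, payload) from the first non-empty
    list, or None when the timeout expires.
    """
    popped = client.brpop(PRIORITY_QUEUES, timeout=timeout)
    if popped is None:
        return None
    queue, payload = popped
    return queue, json.loads(payload)
```

Because BRPOP scans its keys left to right on each wakeup, a job sitting in `queue:high` is always dequeued before anything in `queue:normal`.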

Highlights (why this repo)

  • Clear separation of concerns: control plane vs workers
  • Deterministic, inspectable view of system state via Redis primitives
  • Cost/latency aware routing is implemented in a testable policy module (pluggable)
  • Observability-first: OTEL → Prometheus → Grafana + structured logs
  • Minimal cloud migration path (Cloud Run + managed Redis recommended)

Screenshots (Grafana)

P95 latency (example)

Latency distribution heatmap


Quickstart — Local development

Prerequisites

  • Docker (Docker Compose)
  • git

Run the full local dev stack (this runs app, worker, redis, otel-collector, prometheus, grafana):

docker compose -f docker-compose.dev.yml up --build -d

Open:

  • API: http://localhost:8000 (FastAPI docs http://localhost:8000/docs)
  • Grafana: http://localhost:3001 (default admin:admin)
  • Prometheus: http://localhost:9090

Submit a job (example)

curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: demo-1" \
  -d '{"prompt":"Summarize the page", "priority":"normal", "requires_multimodal": false}'

Check status / result:

curl http://localhost:8000/jobs/<job_id>/result
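A small client can poll the result endpoint until the job reaches a terminal state. A sketch in Python (the only things taken from the API above are the `/jobs/<job_id>/result` route; the `status` field and its `succeeded`/`failed` values are assumptions about the response shape):

```python
import json
import time
import urllib.request

def wait_for_result(base_url, job_id, timeout=60.0, interval=1.0, fetch=None):
    """Poll GET {base_url}/jobs/{job_id}/result until a terminal status.

    `fetch` is injectable for testing; by default it performs an HTTP GET
    and decodes the JSON body.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    url = f"{base_url}/jobs/{job_id}/result"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = fetch(url)
        # Assumed terminal states; adjust to the actual job-state model.
        if body.get("status") in ("succeeded", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {timeout}s")
```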

Configuration (env / runtime)

All runtime configuration comes from environment variables (both the control-plane and worker images). Example highlights:

Worker

  • WORKER_QUEUE (high|normal|low) — the queue the worker consumes
  • MAX_CONCURRENCY — number of parallel jobs within one worker
  • BACKEND — which backend adapter the worker uses (e.g., ollama, hf_local, dev)

General (Control plane & worker)

  • OTEL_EXPORTER_OTLP_ENDPOINT (collector address)
  • LOG_LEVEL (DEBUG|INFO|WARN|ERROR)
  • REDIS_URL (e.g. redis://redis:6379/0)

See docker/* for additional examples.
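The variables above can be loaded into a typed settings object at worker startup, failing fast on invalid values. A minimal sketch (the defaults shown are illustrative assumptions, not the repo's actual defaults):

```python
import os
from dataclasses import dataclass

VALID_QUEUES = {"high", "normal", "low"}

@dataclass(frozen=True)
class WorkerSettings:
    queue: str
    max_concurrency: int
    backend: str
    redis_url: str
    log_level: str

def load_settings(env=os.environ):
    """Read worker configuration from the environment, rejecting bad values early."""
    queue = env.get("WORKER_QUEUE", "normal")
    if queue not in VALID_QUEUES:
        raise ValueError(f"WORKER_QUEUE must be one of {sorted(VALID_QUEUES)}, got {queue!r}")
    return WorkerSettings(
        queue=queue,
        max_concurrency=int(env.get("MAX_CONCURRENCY", "1")),
        backend=env.get("BACKEND", "dev"),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```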


Observability & Metrics

  • All logs are structured JSON and include event, job_id, policy, backend, and (when available) trace_id.

  • Metrics are instrumented via the OTEL SDK. Key metrics:

    • llm_orchestrator_jobs_created_total
    • llm_orchestrator_jobs_enqueued_total
    • llm_orchestrator_job_latency_seconds (histogram)
    • llm_orchestrator_queue_depth
    • llm_orchestrator_routing_decisions_total{policy,backend}

Prometheus / Grafana

  • OTEL Collector exposes Prometheus scrape endpoint at :9464.
  • Useful Grafana panels (some already prepared in ./observability/grafana/dashboards): P50/P95/P99 latency, latency heatmap, queue depth, throughput, routing breakdown.
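The latency and throughput panels can be built directly from the histogram and counters listed above; for example (metric names follow the previous section, the `le`, `policy`, and `backend` label names are assumptions about the exported series):

```promql
# P95 job latency over a 5m window
histogram_quantile(0.95,
  sum(rate(llm_orchestrator_job_latency_seconds_bucket[5m])) by (le))

# Throughput: jobs enqueued per second
sum(rate(llm_orchestrator_jobs_enqueued_total[5m]))

# Routing breakdown by policy and backend
sum(rate(llm_orchestrator_routing_decisions_total[5m])) by (policy, backend)
```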

Logs

  • By default logs print to stdout (JSON) so Docker/Cloud environments can ingest them (Cloud Logging, Loki, etc).
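A minimal JSON formatter along these lines produces the one-object-per-line records described above (field names beyond `event`, `job_id`, `policy`, `backend`, and `trace_id` are assumptions; this is a sketch, not the repo's logging setup):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Attach structured context passed via `extra=` when present.
        for field in ("job_id", "policy", "backend", "trace_id"):
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)

logger = logging.getLogger("worker")
handler = logging.StreamHandler(sys.stdout)  # stdout, for Docker log drivers
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

Usage: `logger.info("job_completed", extra={"job_id": "abc", "backend": "ollama"})`.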

Deploy → Google Cloud (recommended path)

For demos and small production deployments, managed Cloud Run is recommended (simpler to operate than GKE).

High level steps

  1. Build and push images:

    gcloud builds submit --tag gcr.io/<PROJECT_ID>/llm-orchestrator-control
    gcloud builds submit --tag gcr.io/<PROJECT_ID>/llm-orchestrator-worker
  2. Create a managed Redis (Memorystore) instance (or use a cloud Redis provider) and set REDIS_URL.

  3. Deploy control plane to Cloud Run:

    gcloud run deploy llm-orchestrator-control \
      --image gcr.io/<PROJECT_ID>/llm-orchestrator-control \
      --region <REGION> \
      --allow-unauthenticated \
      --set-env-vars REDIS_URL=<redis_url>,OTEL_EXPORTER_OTLP_ENDPOINT=<collector>
  4. For workers: run CPU workers in Cloud Run (with concurrency=1; Cloud Run does not offer specific CPU/GPU machine types), and run GPU workers on GCE instances or GKE node pools using the same container image and environment.

  5. Use Google Cloud Monitoring / Cloud Logging, or configure the OTEL exporter for Google Cloud to push metrics and logs.

Notes

  • For GPU-backed heavy inference, use GCE GPU VMs or GKE GPU node pools; workers run the same container image, with BACKEND configured to use local GPU resources.
  • docker-compose.dev.yml is dev-only; do not use it in production.

Architecture overview (concise)

  • Control plane (FastAPI): validates and enqueues jobs, exposes API and admin endpoints, determines routing policy.
  • Redis: coordination (queues, job metadata, idempotency keys, results).
  • Workers: stateless runtime reading one queue, executing jobs via configured backend adapter, writing results back.
  • OTEL Collector → Prometheus → Grafana: metrics pipeline for monitoring.
  • Policy module: pluggable selection logic (cost/latency-aware); policy hot-swap supported.
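The pluggable policy interface might look like the following sketch (the Protocol name, the backend fields, and the scoring weights are all assumptions; the repo's actual policy module may differ):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class Backend:
    name: str
    est_latency_s: float       # expected per-job latency
    cost_per_1k_tokens: float
    multimodal: bool

@dataclass(frozen=True)
class Job:
    requires_multimodal: bool
    priority: str  # "high" | "normal" | "low"

class RoutingPolicy(Protocol):
    def select(self, job: Job, backends: list[Backend]) -> Backend: ...

class CostLatencyPolicy:
    """Score candidates by a weighted blend of latency and cost; lower wins."""
    def __init__(self, latency_weight: float = 1.0, cost_weight: float = 1.0):
        self.latency_weight = latency_weight
        self.cost_weight = cost_weight

    def select(self, job: Job, backends: list[Backend]) -> Backend:
        # Filter out backends that cannot serve multimodal jobs.
        candidates = [b for b in backends
                      if b.multimodal or not job.requires_multimodal]
        if not candidates:
            raise RuntimeError("no backend satisfies the job's requirements")
        return min(candidates, key=lambda b: (
            self.latency_weight * b.est_latency_s
            + self.cost_weight * b.cost_per_1k_tokens))
```

Because a policy is just an object exposing `select`, the control plane can swap one instance for another at runtime, which is what makes hot-swap cheap.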

File & config layout (important files)

  • app/ – control plane source
  • worker/ – worker runtime & backends
  • docker/ – Dockerfiles (app.Dockerfile, worker.Dockerfile)
  • docker-compose.dev.yml – local dev composition (dev-only)
  • observability/ – OTEL Collector / Prometheus configs

Testing / Validation

  • Use local compose to validate end-to-end flow.
  • Generate test load (a simple script in tools/) to produce the latency graphs shown in this README.
  • Unit tests cover policy logic and queue helpers; integration tests cover enqueue→worker execution (requires Redis).

Operational notes & non-goals

  • Non-goals: file/object lifecycle, auth, billing, multi-tenant quotas, training/fine-tuning.
  • Idempotency handled at API layer via Idempotency-Key header (stored in Redis with TTL).
  • Visibility timeouts / exactly-once semantics are deferred; current model provides at-least-once execution with explicit job states.
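The idempotency reservation can be sketched with Redis SET NX EX, which atomically claims a key only if it does not already exist (the `idem:` key prefix and TTL are assumptions; `client.set(..., nx=True, ex=...)` is the redis-py spelling of `SET key value NX EX ttl`):

```python
def reserve_idempotency_key(client, idem_key, job_id, ttl_s=3600):
    """Atomically claim an Idempotency-Key for a new job.

    Returns (True, job_id) if this request created the reservation, or
    (False, existing_job_id) if a prior request already holds the key.
    """
    key = f"idem:{idem_key}"
    # SET key value NX EX ttl: succeeds only if the key does not exist yet,
    # and expires automatically after ttl_s seconds.
    if client.set(key, job_id, nx=True, ex=ttl_s):
        return True, job_id
    existing = client.get(key)
    return False, existing
```

A duplicate POST with the same `Idempotency-Key` header then resolves to the original job rather than enqueueing a second one.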

Contributing

  1. Fork & create a feature branch.
  2. Run unit tests (pytest).
  3. Add or update docs / dashboard JSON if you change metrics.
  4. Open PR describing design tradeoffs.

License

MIT


Contact

rastorguev2047@gmail.com

About

LLM inference orchestrator for routing requests across heterogeneous backends (local GPU, cloud APIs) with explicit latency, cost, and failure-isolation trade-offs.
