
HPA autoscaling for InferenceService based on llama.cpp metrics #240

@Defilan

Description


InferenceService currently requires manually setting replicas in the spec. There is no integration with Kubernetes Horizontal Pod Autoscaler (HPA) or custom metrics-based scaling. This is the most significant functional gap compared to other Kubernetes inference platforms (KServe, KAITO) and blocks production use for variable-load scenarios.

Goals

  1. HPA integration using llama.cpp's built-in Prometheus metrics (exposed via the --metrics flag, already enabled on all inference pods)
    • Scaling based on request concurrency (llamacpp:requests_processing)
    • Scaling based on queue depth (llamacpp:requests_pending)
    • Scaling based on KV cache utilization (llamacpp:kv_cache_usage_ratio)
  2. CRD fields on InferenceServiceSpec for autoscaling configuration:
    • minReplicas / maxReplicas
    • targetMetric and targetValue (or a structured autoscaling block)
  3. Prometheus Adapter or KEDA integration to bridge llama.cpp metrics into the Kubernetes metrics API for HPA consumption
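One possible shape for the CRD fields in goal 2 is sketched below, following the pattern of the upstream HPA v2 API (min/max replicas plus a per-pod metric target). All type and field names here (`AutoscalingSpec`, `TargetMetric`, etc.) are illustrative assumptions, not the actual LLMKube API:

```go
package main

import "fmt"

// AutoscalingSpec is a hypothetical structured autoscaling block for
// InferenceServiceSpec (goal 2). Field names are illustrative only.
type AutoscalingSpec struct {
	// MinReplicas / MaxReplicas bound the HPA's scaling range.
	MinReplicas int32 `json:"minReplicas,omitempty"`
	MaxReplicas int32 `json:"maxReplicas"`
	// TargetMetric names a llama.cpp Prometheus metric, e.g.
	// "llamacpp:requests_processing" or "llamacpp:requests_pending".
	TargetMetric string `json:"targetMetric"`
	// TargetValue is the per-pod average the HPA should maintain.
	TargetValue string `json:"targetValue"`
}

// InferenceServiceSpec shows how the block could hang off the existing spec.
type InferenceServiceSpec struct {
	// Replicas would be ignored (or defaulted) when Autoscaling is set.
	Replicas    *int32           `json:"replicas,omitempty"`
	Autoscaling *AutoscalingSpec `json:"autoscaling,omitempty"`
}

func main() {
	spec := InferenceServiceSpec{
		Autoscaling: &AutoscalingSpec{
			MinReplicas:  1,
			MaxReplicas:  8,
			TargetMetric: "llamacpp:requests_processing",
			TargetValue:  "4",
		},
	}
	fmt.Println(spec.Autoscaling.TargetMetric, spec.Autoscaling.MaxReplicas)
}
```

The controller would translate this block into an `autoscaling/v2` HorizontalPodAutoscaler referencing the external/pods metric exposed by the adapter in goal 3.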

Non-goals (for now)

Context

llama.cpp already exposes rich metrics on every inference pod. The building blocks are in place; the missing piece is the controller logic and CRD fields to wire them into HPA.

This is the most commonly cited gap when comparing LLMKube to KServe (HPA + Knative KPA) and KAITO (KEDA-based scaling). Closing it is critical for production credibility.
