Open
Labels
- component/api: Related to CRDs and API definitions
- component/controller: Related to the operator controller
- enhancement: New feature or request
- priority/high: High priority
Description
InferenceService currently requires manually setting replicas in the spec. There is no integration with Kubernetes Horizontal Pod Autoscaler (HPA) or custom metrics-based scaling. This is the most significant functional gap compared to other Kubernetes inference platforms (KServe, KAITO) and blocks production use for variable-load scenarios.
Goals
- HPA integration using llama.cpp's built-in Prometheus metrics (exposed via the `--metrics` flag, already enabled on all inference pods)
  - Scaling based on request concurrency (`llamacpp:requests_processing`)
  - Scaling based on queue depth (`llamacpp:requests_pending`)
  - Scaling based on KV cache utilization (`llamacpp:kv_cache_usage_ratio`)
- CRD fields on InferenceServiceSpec for autoscaling configuration:
  - `minReplicas` / `maxReplicas`
  - `targetMetric` and `targetValue` (or a structured `autoscaling` block)
- Prometheus Adapter or KEDA integration to bridge llama.cpp metrics into the Kubernetes metrics API for HPA consumption
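To make the proposed fields concrete, here is a minimal Go sketch of what a structured `autoscaling` block could look like, together with the ratio rule that HPA documents for metric-based scaling. The `AutoscalingSpec` type, its JSON tags, and the `DesiredReplicas` helper are hypothetical names used for illustration only, not the operator's actual API. The sketch also leaves out HPA's tolerance band and stabilization window.

```go
package main

import (
	"fmt"
	"math"
)

// AutoscalingSpec is a hypothetical shape for the proposed autoscaling block
// on InferenceServiceSpec. Field names follow the issue text, not a merged API.
type AutoscalingSpec struct {
	MinReplicas  int32   `json:"minReplicas"`
	MaxReplicas  int32   `json:"maxReplicas"`
	TargetMetric string  `json:"targetMetric"` // e.g. "llamacpp:requests_processing"
	TargetValue  float64 `json:"targetValue"`  // desired per-pod average
}

// clamp bounds n to the inclusive range [lo, hi].
func clamp(n, lo, hi int32) int32 {
	if n < lo {
		return lo
	}
	if n > hi {
		return hi
	}
	return n
}

// DesiredReplicas applies the HPA ratio rule
// ceil(currentReplicas * currentValue / targetValue)
// and clamps the result to the spec bounds. currentValue is the per-pod
// average of the target metric. An unusable target or replica count falls
// back to clamping the current replica count.
func DesiredReplicas(spec AutoscalingSpec, currentReplicas int32, currentValue float64) int32 {
	if spec.TargetValue <= 0 || currentReplicas <= 0 {
		return clamp(currentReplicas, spec.MinReplicas, spec.MaxReplicas)
	}
	ratio := currentValue / spec.TargetValue
	desired := int32(math.Ceil(float64(currentReplicas) * ratio))
	return clamp(desired, spec.MinReplicas, spec.MaxReplicas)
}

func main() {
	spec := AutoscalingSpec{
		MinReplicas:  1,
		MaxReplicas:  10,
		TargetMetric: "llamacpp:requests_processing",
		TargetValue:  4,
	}
	// Two pods averaging 6 in-flight requests against a target of 4.
	fmt.Println(DesiredReplicas(spec, 2, 6))
}
```

In a real deployment the HPA controller performs this calculation itself; the operator would only need to translate the spec into an HPA object (or a KEDA equivalent) that points at the chosen metric.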
Non-goals (for now)
- Scale-to-zero (tracked separately in Cost optimization features #11)
- SLO-driven scaling with model fallback (tracked in Advanced SLO auto-remediation #10)
Context
llama.cpp already exposes rich metrics on every inference pod. The building blocks are in place; the missing piece is the controller logic and CRD fields to wire them into HPA.
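As a rough illustration of how little is needed to read these metrics, the following Go sketch pulls one gauge out of Prometheus text-format output. `ParseGauge` is a hypothetical helper, and the sample payload is made up for the example rather than captured from a real pod. In practice Prometheus Adapter or KEDA would do the scraping, so this only shows the shape of the data.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleMetrics is an illustrative payload, not real server output.
const sampleMetrics = `# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 3
# TYPE llamacpp:requests_pending gauge
llamacpp:requests_pending 7
llamacpp:kv_cache_usage_ratio 0.42
`

// ParseGauge scans Prometheus text-format output and returns the value of the
// first sample whose metric name equals name. Labels are ignored, and the
// sketch assumes label values contain no spaces.
func ParseGauge(body, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		metric := fields[0]
		if i := strings.IndexByte(metric, '{'); i >= 0 {
			metric = metric[:i] // drop the label set
		}
		if metric != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	if v, ok := ParseGauge(sampleMetrics, "llamacpp:requests_pending"); ok {
		fmt.Println("queue depth:", v)
	}
}
```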
This is the most commonly cited gap when comparing LLMKube to KServe (HPA + Knative KPA) and KAITO (KEDA-based scaling). Closing it is critical for production credibility.