
HPA autoscaling for InferenceService based on llama.cpp metrics #240

@Defilan

Description


InferenceService currently requires manually setting replicas in the spec. There is no integration with Kubernetes Horizontal Pod Autoscaler (HPA) or custom metrics-based scaling. This is the most significant functional gap compared to other Kubernetes inference platforms (KServe, KAITO) and blocks production use for variable-load scenarios.

Goals

  1. HPA integration using llama.cpp's built-in Prometheus metrics (exposed via the --metrics flag, already enabled on all inference pods)
    • Scaling based on request concurrency (llamacpp:requests_processing)
    • Scaling based on queue depth (llamacpp:requests_pending)
    • Scaling based on KV cache utilization (llamacpp:kv_cache_usage_ratio)
  2. CRD fields on InferenceServiceSpec for autoscaling configuration:
    • minReplicas / maxReplicas
    • targetMetric and targetValue (or a structured autoscaling block)
  3. Prometheus Adapter or KEDA integration to bridge llama.cpp metrics into the Kubernetes metrics API for HPA consumption
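One possible shape for the CRD fields in goal 2 is sketched below, following the pattern of the upstream HPA v2 API (min/max replicas plus a per-pod metric target). All type and field names here (`AutoscalingSpec`, `TargetMetric`, etc.) are illustrative assumptions, not the actual LLMKube API:

```go
package main

import "fmt"

// AutoscalingSpec is a hypothetical structured autoscaling block for
// InferenceServiceSpec (goal 2). Field names are illustrative only.
type AutoscalingSpec struct {
	// MinReplicas / MaxReplicas bound the HPA's scaling range.
	MinReplicas int32 `json:"minReplicas,omitempty"`
	MaxReplicas int32 `json:"maxReplicas"`
	// TargetMetric names a llama.cpp Prometheus metric, e.g.
	// "llamacpp:requests_processing" or "llamacpp:requests_pending".
	TargetMetric string `json:"targetMetric"`
	// TargetValue is the per-pod average the HPA should maintain.
	TargetValue string `json:"targetValue"`
}

// InferenceServiceSpec shows how the block could hang off the existing spec.
type InferenceServiceSpec struct {
	// Replicas would be ignored (or defaulted) when Autoscaling is set.
	Replicas    *int32           `json:"replicas,omitempty"`
	Autoscaling *AutoscalingSpec `json:"autoscaling,omitempty"`
}

func main() {
	spec := InferenceServiceSpec{
		Autoscaling: &AutoscalingSpec{
			MinReplicas:  1,
			MaxReplicas:  8,
			TargetMetric: "llamacpp:requests_processing",
			TargetValue:  "4",
		},
	}
	fmt.Println(spec.Autoscaling.TargetMetric, spec.Autoscaling.MaxReplicas)
}
```

The controller would translate this block into an `autoscaling/v2` HorizontalPodAutoscaler referencing the external/pods metric exposed by the adapter in goal 3.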

Non-goals (for now)

Context

llama.cpp already exposes rich metrics on every inference pod. The building blocks are in place; the missing piece is the controller logic and CRD fields to wire them into HPA.

This is the most commonly cited gap when comparing LLMKube to KServe (HPA + Knative KPA) and KAITO (KEDA-based scaling). Closing it is critical for production credibility.
