Context
Serving runs raw vLLM behind GIE/agentgateway (ADR-0005, ADR-0013). Replica count is managed by hand today (make vllm-up/down). To become a self-hosted LLM platform the serving tier needs to scale
its replicas on load. The question is what signal to scale on and what mechanism drives it.
GPU/accelerator constraints shape this: under PagedAttention vLLM holds the GPU near 100% util at
all times, so GPU-util is a useless load signal. Request concurrency is also a poor proxy:
in-flight count says nothing about whether the server is keeping up. vLLM exposes a direct
saturation signal instead: vllm:num_requests_waiting (requests queued, not yet running).
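To make the signal concrete, here is a minimal sketch of reading the queue depth out of a vLLM /metrics scrape. The sample scrape text, its values, and the helper name read_gauge are illustrative assumptions, not output captured from this lab; the parser handles only simple sample lines without timestamps.

```python
# Illustrative scrape text (assumed values), in Prometheus text exposition format.
SAMPLE_METRICS = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo-model"} 8.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo-model"} 12.0
"""


def read_gauge(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` across label sets, like PromQL sum()."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]
        if metric_name == name:
            total += float(value)
    return total


waiting = read_gauge(SAMPLE_METRICS, "vllm:num_requests_waiting")
running = read_gauge(SAMPLE_METRICS, "vllm:num_requests_running")
print(f"running={running:.0f} waiting={waiting:.0f}")
```

In this made-up scrape the in-flight count (8) looks unremarkable on its own, while the 12 queued requests show the server is falling behind.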
This lab is single-GPU (GPUS_ALL_REGIONS=1) and runs $0-idle by design: raw-vllm defaults
to replicas: 0, the GPU node scales to zero, and Argo ignores /spec/replicas so make vllm-up/down is the bring-up control. Any autoscaler we add must not break that cost posture.
Decision
Scale vLLM with KEDA on vllm:num_requests_waiting (Prometheus trigger), secondary
vllm:kv_cache_usage_perc. This is the convergent 2026 vendor default (GKE inference
best-practices, Red Hat OpenShift AI Nov-2025, KServe v0.18 Standard-mode, vLLM production-stack).
- Mechanism = KEDA, not a raw HPA + custom-metrics adapter. KEDA’s Prometheus scaler queries
PromQL directly (the ServiceMonitor already lands vLLM metrics in Prometheus) and generates the
HPA for the 1→N range, with no Prometheus Adapter /
external.metrics.k8s.io plumbing to operate.
- Primary signal sum(vllm:num_requests_waiting), threshold ~5/replica (AverageValue → desired = ceil(total_queue / 5)). Secondary vllm:kv_cache_usage_perc (KV-cache pressure; V1 rename of gpu_cache_usage_perc). Never scale on GPU-util.
- KEDA operator is a controller-only, no-cost, auto-sync platform app (chart 2.20.1, wave 1,
its CRDs gate the ScaledObject). The ScaledObject lives with
raw-vllm (manual-sync, wave 4) and ships paused (autoscaling.keda.sh/paused: "true").
make vllm-down. So the ScaledObject is paused by default: KEDA touches no replicas while paused, so
make vllm-up/down behave exactly as before. Autoscaling is a deliberate, session-scoped action
(make keda-demo-up → load test → make keda-demo-down → make vllm-down), consistent with the
manual-sync discipline for paid compute.
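The AverageValue arithmetic from the primary-signal bullet can be sketched as follows. This is a simplified model, not the HPA controller's actual code: it ignores the HPA tolerance band and stabilization windows, and the function name and the max_replicas value of 4 are hypothetical.

```python
import math


def desired_replicas(total_queue: float,
                     per_replica_target: float = 5.0,
                     min_replicas: int = 1,
                     max_replicas: int = 4) -> int:
    """Replica count for an AverageValue target: ceil(total / target), clamped.

    max_replicas=4 is a hypothetical ceiling chosen for illustration only.
    """
    raw = math.ceil(total_queue / per_replica_target)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(12))                  # ceil(12 / 5) = 3
print(desired_replicas(0))                   # empty queue holds at the floor: 1
print(desired_replicas(100))                 # ceil(100 / 5) = 20, clamped to 4
print(desired_replicas(12, max_replicas=1))  # single-GPU ceiling: 1
```

The last call mirrors the single-GPU lab: the computed intent is 3 replicas, but the ceiling holds the result at 1.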
Replica floor is a profile knob (ties to deployment profiles), not a fixed default: prod min 2 (HA) ·
dev min 1 · cost min 0. True scale-to-zero (min 0) needs three more things: the KEDA HTTP Add-on
(a buffering proxy, because GIE returns 503 at 0 endpoints), node min-0, and acceptance of a multi-minute
cold start (gated on the cold-start work). It is deferred, not wired now.
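As a rough sketch of how the profile floor and the paused default could come together, the snippet below builds a ScaledObject manifest as a Python dict and prints it as JSON (which kubectl also accepts). The resource names, the Prometheus serverAddress, and the maxReplicaCount of 4 are assumptions for illustration; the repository's real manifest may differ.

```python
import json

# Replica floors per deployment profile, as stated in the text above.
PROFILE_MIN_REPLICAS = {"prod": 2, "dev": 1, "cost": 0}


def build_scaled_object(profile: str, paused: bool = True) -> dict:
    """Build a KEDA ScaledObject manifest for the given profile."""
    annotations = {}
    if paused:
        # While this annotation is set, KEDA leaves the replica count alone.
        annotations["autoscaling.keda.sh/paused"] = "true"
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": "raw-vllm", "annotations": annotations},
        "spec": {
            "scaleTargetRef": {"name": "raw-vllm"},  # assumed Deployment name
            "minReplicaCount": PROFILE_MIN_REPLICAS[profile],
            "maxReplicaCount": 4,  # hypothetical ceiling
            "triggers": [{
                "type": "prometheus",
                "metricType": "AverageValue",
                "metadata": {
                    # Assumed in-cluster Prometheus URL.
                    "serverAddress": "http://prometheus.monitoring.svc:9090",
                    "query": "sum(vllm:num_requests_waiting)",
                    "threshold": "5",
                },
            }],
        },
    }


manifest = build_scaled_object("dev")
print(json.dumps(manifest, indent=2))
```

Keeping the floor in a single lookup table means a profile switch changes only minReplicaCount; the cost profile's floor of 0 is representable here but, per the text, is not usable until the HTTP Add-on and cold-start work land.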
Alternatives considered
- HPA + Prometheus Adapter (external.metrics.k8s.io). Rejected: an extra adapter to deploy and operate for the same outcome KEDA gives via direct PromQL. KEDA still emits an HPA under the hood.
- GKE custom/external metrics HPA (Stackdriver adapter). Rejected: couples scaling to Cloud Monitoring + the managed adapter; we already run in-cluster Prometheus and want portability.
- Knative / KPA (concurrency or RPS). Rejected: concurrency/RPS are poor load proxies for LLM serving (a slow decode keeps concurrency low while the queue grows), and KPA pulls in Knative Serving, heavy for a single hot model. Queue depth is the truer saturation signal.
- Scale on GPU utilization. Rejected outright: PagedAttention pegs GPU-util ~100%, so it never reflects load.
- llm-d Workload Variant Autoscaler (WVA). Deferred: KV/SLO-aware, per-model scale-to-zero, independent prefill/decode scaling, the future power path; extends our GIE/EPP. Adopt when multi-model / disaggregation earns the complexity (CNCF Sandbox today). Not needed for one model.
- Kueue for serving autoscale. Rejected: Kueue governs batch admission/quota; let KEDA/HPA own serving replicas. They are complementary, not substitutes.
Consequences
- Serving scales on a real saturation signal with no custom-metrics adapter to operate; the only new always-on component is the small KEDA controller (no GPU cost).
- $0-idle and make vllm-up/down are preserved because the ScaledObject is paused by default; the trade-off is that autoscaling is opt-in per session rather than always-on.
- Single-GPU quota caps real scale-out at 1 replica, so true 1→N is validated on the free CPU sim (llm-d-inference-sim, which emits the same vllm:num_requests_waiting); on the real GPU we validate that KEDA reads the metric and computes scale intent. Raising the ceiling needs a GCP GPU quota bump.
- Scale-to-zero is not delivered here (needs KEDA HTTP Add-on + cold-start work); the floor stays a profile decision. Tripwire to watch: vllm:num_requests_swapped > 0 (KV-cache thrash, a sign to scale out sooner or cap concurrency).
- Sources: KEDA on num_requests_waiting (KServe v0.18, Red Hat 2025-09/11), vLLM Sleep Mode, KEDA HTTP Add-on maturity.