Skip to main content
Status: Accepted Date: 2026-06-18

Context

serving/raw-vllm (ADR-0003 benchmarks) is a hand-rolled Deployment+Service+PVC: full control, but every operational concern (rollout, canary, autoscale-to-zero, model lifecycle, URL/ingress, multi-model governance) is ours to wire. The question for a platform: when do we hand that to a serving controller, and which one? KServe is CNCF-incubating and the de-facto K8s model-serving control plane. As of v0.19 it ships two CR families that are easy to conflate (§9):
  • InferenceService + ServingRuntime (serving.kserve.io/v1beta1): the classic path. The ISVC declares a model; a ServingRuntime (or a custom container) serves it; KServe reconciles a Deployment/Service, an ingress route, autoscaling, and canary rollout.
  • LLMInferenceService (newer, ~v0.18+): an LLM-specific CR that natively embeds llm-d + GIE (disaggregated prefill/decode, KV-aware routing, prefix caching).

Decision

Core scope: classic InferenceService with a custom vLLM container, RawDeployment (Standard) mode. Keep raw-vllm as the control benchmark; add serving/kserve serving the same model (Qwen2.5-0.5B-Instruct, same vLLM v0.23.0) so the comparison is engine-identical (raw on GPU, KServe on CPU while GPU is stocked out). Ingress via Gateway API on our existing agentgateway (no Istio). LLMInferenceService was kept out of the core comparison path: it embeds llm-d+GIE, which overlaps the routing layer we already built (ADR-0005); folding it into the core path would have duplicated the agentgateway/GIE work. It is since integrated and tested as its own isolated disaggregated-serving path (serving/llm-d/), not part of the core raw-vLLM-vs-KServe comparison. Use a custom-container predictor (predictor.containers), not model+runtime: KServe’s model+runtime binding force-injects --model=/mnt/models and therefore requires a storageUri
  • storage-initializer; a custom container runs verbatim so vLLM controls its own model path. (This costs the ServingRuntime abstraction; documented in the runbook.)

When raw vLLM vs KServe (the takeaway)

  • Raw vLLM when: one model, one team, you want full control and minimal moving parts, and you’ll own rollout/scale yourself. It’s the right call for a single benchmarked endpoint (ADR-0003).
  • KServe InferenceService when: you want the platform to own model lifecycle: declarative rollout + canary (canaryTrafficPercent), scale-to-zero, a stable URL/ingress per model, and a consistent multi-model contract. The cost is a control plane (controller + cert-manager) and KServe’s opinions (see consequences).
  • LLMInferenceService when (advanced-inference stage): large/multi-node models needing disaggregated serving, KV-aware routing, prefix caching, i.e. llm-d territory.

Consequences

  • + Same model serves raw and via KServe; verified: ISVC Ready, GET /v1/modelsqwen2.5-0.5b, POST /v1/chat/completions → 200 (system_fingerprint vllm-0.23.0) through the agentgateway. KServe adds the lifecycle/ingress/rollout layer raw-vllm lacks.
  • + Reused agentgateway as KServe’s Gateway-API ingress: one gateway spans GIE routing and KServe serving; no Istio.
  • New prereqs: cert-manager (webhook certs) + the KServe controller.
  • KServe injects opinionated defaults that fight you (all → runbook kserve.md): a default cpu limit of 1 (invalidates requests.cpu>1), an Istio VirtualString path unless disableIstioVirtualHost=true, and the model+runtime /mnt/models injection above.
  • Canary not live-demoed at $0-CPU: the new revision is a second predictor pod; our model on an RWO PVC pins both revisions to one node and 2×cpu:2 vLLM-CPU pods exceed an e2-standard-4. The weighted-split concept is already proven live at the GIE layer (ADR-0005, 90/10). A live KServe canary needs a ReadOnlyMany model volume (Filestore), deferred with the advanced-inference stage.
    • 2026-06-20: KServe-native canary is configured but not live-demoed only because RWO pins both revisions to one node. PRODUCT TRIGGER = >1 model version in prod AND a ReadOnlyMany model volume.

Follow-up: model delivery at scale

This work exposed model delivery as its own decision: HF self-download is fragile (the cluster egress IP is HF-429-rate-limited even with a token), so we pre-staged the model onto a PVC and load it offline. At scale the options are: per-pod HF pull, a shared RWX cache, KServe LocalModelCache (node-local pre-pull DaemonSet, v0.17+), or OCI “modelcar” (oci:// weights-in-image, immutable, air-gapped-friendly). Decide deliberately when large models / many replicas / air-gap actually arrive; modelcar is favored for a forkable air-gapped platform.

Update 2026-06-20: KServe promoted to model-delivery substrate

The deferred follow-up above is resolved by ADR-0016: OCI modelcar (digest-pinned oci://) is the default model-delivery primitive, and KServe now owns model delivery (modelcar + later LocalModelCache are KServe-native; raw-vLLM stays the simple/benchmark endpoint). The custom-container objection in the Decision dissolves: modelcar symlinks the weights to /mnt/models, so the old “model+runtime force-injects /mnt/models” complaint is moot: a custom container reads /mnt/models directly with no storage-initializer copy. New ISVC serving/kserve/inferenceservice-modelcar.yaml (qwen-oci, GPU). GPU validation pending (the proof that closes this ADR’s CPU-only gap is the model-delivery-at-scale deliverable; runbook kserve-modelcar.md). See ADR-0016.

Addendum 2026-06-23: KServe LocalModelCache DROPPED (with the OCI modelcar default)

The model-delivery follow-up above listed LocalModelCache (KServe’s node-local pre-pull DaemonSet, v0.17+) as a scale option alongside the OCI modelcar. With modelcar settled as the default (ADR-0016), LocalModelCache is dropped, for two grounded reasons:
  1. No oci:// source. LocalModelCache pre-pulls from a storageUri (HF / object storage / PVC). The modelcar delivers weights as an oci:// artifact resolved by the injected modelcar sidecar + init-container, not by KServe’s storage-initializer: there is no storageUri for LocalModelCache to pre-pull. The two delivery mechanisms do not compose.
  2. Scale-to-zero evicts the node-warm cache. This lab scales the GPU node to zero when idle, which deletes the node and any node-local cache on it, so a node-warm pre-pull buys nothing here. (Per ADR-0000’s north star this is a lab validation caveat, not a product verdict: a real warm multi-node pool is exactly where node-warm caching pays off. The drop is “not viable with this delivery default on this footprint,” not “node caching is useless.”)
Decision: dropped, documented. The modelcar (immutable, digest-pinned, air-gap-clean, layer-cached by the container runtime on a warm node) is the delivery substrate; node-local pre-pull is not pursued on top of it. Revisit only if a future default reintroduces a storageUri-based delivery path on a warm pool. Tracked as A12 / C3 (dropped). See ADR-0016, ADR-0000.

References

  • serving/kserve/, runbook kserve.md, docs/raw-vllm-vs-kserve.md. KServe v0.19.0 (serving.kserve.io/v1beta1); cert-manager v1.20.2; vLLM v0.23.0 (CPU). Supersedes the original ADR-006 stub.