Context
serving/raw-vllm (ADR-0003 benchmarks) is a hand-rolled Deployment+Service+PVC: full control,
but every operational concern (rollout, canary, autoscale-to-zero, model lifecycle, URL/ingress,
multi-model governance) is ours to wire. The question for a platform: when do we hand that to a
serving controller, and which one? KServe is CNCF-incubating and the de-facto K8s model-serving
control plane. As of v0.19 it ships two CR families that are easy to conflate (§9):
InferenceService+ServingRuntime(serving.kserve.io/v1beta1): the classic path. The ISVC declares a model; a ServingRuntime (or a custom container) serves it; KServe reconciles a Deployment/Service, an ingress route, autoscaling, and canary rollout.LLMInferenceService(newer, ~v0.18+): an LLM-specific CR that natively embeds llm-d + GIE (disaggregated prefill/decode, KV-aware routing, prefix caching).
Decision
Core scope: classicInferenceService with a custom vLLM container, RawDeployment (Standard) mode.
Keep raw-vllm as the control benchmark; add serving/kserve serving the same model
(Qwen2.5-0.5B-Instruct, same vLLM v0.23.0) so the comparison is engine-identical (raw on
GPU, KServe on CPU while GPU is stocked out). Ingress via Gateway API on our existing
agentgateway (no Istio). LLMInferenceService was kept out of the core comparison path: it embeds llm-d+GIE,
which overlaps the routing layer we already built (ADR-0005); folding it into the core path
would have duplicated the agentgateway/GIE work. It is since integrated and tested as its own
isolated disaggregated-serving path (serving/llm-d/), not part of the core raw-vLLM-vs-KServe
comparison.
Use a custom-container predictor (predictor.containers), not model+runtime: KServe’s
model+runtime binding force-injects --model=/mnt/models and therefore requires a storageUri
- storage-initializer; a custom container runs verbatim so vLLM controls its own model path. (This costs the ServingRuntime abstraction; documented in the runbook.)
When raw vLLM vs KServe (the takeaway)
- Raw vLLM when: one model, one team, you want full control and minimal moving parts, and you’ll own rollout/scale yourself. It’s the right call for a single benchmarked endpoint (ADR-0003).
- KServe
InferenceServicewhen: you want the platform to own model lifecycle: declarative rollout + canary (canaryTrafficPercent), scale-to-zero, a stable URL/ingress per model, and a consistent multi-model contract. The cost is a control plane (controller + cert-manager) and KServe’s opinions (see consequences). LLMInferenceServicewhen (advanced-inference stage): large/multi-node models needing disaggregated serving, KV-aware routing, prefix caching, i.e. llm-d territory.
Consequences
- + Same model serves raw and via KServe; verified: ISVC
Ready,GET /v1/models→qwen2.5-0.5b,POST /v1/chat/completions→ 200 (system_fingerprint vllm-0.23.0) through the agentgateway. KServe adds the lifecycle/ingress/rollout layer raw-vllm lacks. - + Reused agentgateway as KServe’s Gateway-API ingress: one gateway spans GIE routing and KServe serving; no Istio.
- − New prereqs: cert-manager (webhook certs) + the KServe controller.
- − KServe injects opinionated defaults that fight you (all → runbook
kserve.md): a default cpu limit of 1 (invalidatesrequests.cpu>1), an Istio VirtualString path unlessdisableIstioVirtualHost=true, and the model+runtime/mnt/modelsinjection above. - − Canary not live-demoed at $0-CPU: the new revision is a second predictor pod; our model
on an RWO PVC pins both revisions to one node and 2×
cpu:2vLLM-CPU pods exceed ane2-standard-4. The weighted-split concept is already proven live at the GIE layer (ADR-0005, 90/10). A live KServe canary needs a ReadOnlyMany model volume (Filestore), deferred with the advanced-inference stage.- 2026-06-20: KServe-native canary is configured but not live-demoed only because RWO pins both revisions to one node. PRODUCT TRIGGER = >1 model version in prod AND a ReadOnlyMany model volume.
Follow-up: model delivery at scale
This work exposed model delivery as its own decision: HF self-download is fragile (the cluster egress IP is HF-429-rate-limited even with a token), so we pre-staged the model onto a PVC and load it offline. At scale the options are: per-pod HF pull, a shared RWX cache, KServe LocalModelCache (node-local pre-pull DaemonSet, v0.17+), or OCI “modelcar” (oci://
weights-in-image, immutable, air-gapped-friendly). Decide deliberately
when large models / many replicas / air-gap actually arrive; modelcar is favored for a
forkable air-gapped platform.
Update 2026-06-20: KServe promoted to model-delivery substrate
The deferred follow-up above is resolved by ADR-0016: OCI modelcar (digest-pinnedoci://) is the
default model-delivery primitive, and KServe now owns model delivery (modelcar + later
LocalModelCache are KServe-native; raw-vLLM stays the simple/benchmark endpoint). The custom-container
objection in the Decision dissolves: modelcar symlinks the weights to /mnt/models, so the old
“model+runtime force-injects /mnt/models” complaint is moot: a custom container reads /mnt/models
directly with no storage-initializer copy. New ISVC serving/kserve/inferenceservice-modelcar.yaml
(qwen-oci, GPU). GPU validation pending (the proof that closes this ADR’s CPU-only gap is the
model-delivery-at-scale deliverable; runbook kserve-modelcar.md). See ADR-0016.
Addendum 2026-06-23: KServe LocalModelCache DROPPED (with the OCI modelcar default)
The model-delivery follow-up above listed LocalModelCache (KServe’s node-local pre-pull DaemonSet, v0.17+) as a scale option alongside the OCI modelcar. With modelcar settled as the default (ADR-0016), LocalModelCache is dropped, for two grounded reasons:- No
oci://source. LocalModelCache pre-pulls from astorageUri(HF / object storage / PVC). The modelcar delivers weights as anoci://artifact resolved by the injected modelcar sidecar + init-container, not by KServe’s storage-initializer: there is nostorageUrifor LocalModelCache to pre-pull. The two delivery mechanisms do not compose. - Scale-to-zero evicts the node-warm cache. This lab scales the GPU node to zero when idle, which deletes the node and any node-local cache on it, so a node-warm pre-pull buys nothing here. (Per ADR-0000’s north star this is a lab validation caveat, not a product verdict: a real warm multi-node pool is exactly where node-warm caching pays off. The drop is “not viable with this delivery default on this footprint,” not “node caching is useless.”)
storageUri-based delivery path on a warm
pool. Tracked as A12 / C3 (dropped). See ADR-0016, ADR-0000.
References
serving/kserve/, runbookkserve.md,docs/raw-vllm-vs-kserve.md. KServe v0.19.0 (serving.kserve.io/v1beta1); cert-manager v1.20.2; vLLMv0.23.0(CPU). Supersedes the original ADR-006 stub.