serving/raw-vllm),
KServe InferenceService (serving/kserve), and llm-d LLMInferenceService
(serving/llm-d). They are not stacked: a model is served on one of them. This runbook is how an
operator picks the layer for a model and moves a model between them. The architecture-level comparison
is Raw vLLM vs KServe; the per-layer operational gotchas live in
vllm-serving.md, kserve.md, and kserve-modelcar.md.
Pick the layer
| Use | When | Cost |
|---|---|---|
| Raw vLLM | one model, one team, full control; you own rollout/scale (the benchmarked default) | you write Deployment + Service + PVC + probes by hand; no canary, no auto-ingress |
| KServe InferenceService | lifecycle/canary/scale/ingress/governance handled declaratively across many models | a control plane to run (cert-manager + controller) and opinionated defaults that fight you until known |
| llm-d LLMInferenceService | large/multi-node models needing prefill/decode disaggregation + KV-aware routing | needs >=2 GPUs to show the throughput win; runs in its own namespace with its own GIE so it does not collide with the reference routing layer |
- Single benchmarked endpoint, minimal moving parts: raw vLLM.
- A fleet of models with a uniform CR, stable per-model URLs, declarative canary, native scale-to-zero: KServe.
- A model too big for one GPU, or one that needs KV-aware/disaggregated serving: llm-d (validated on the multi-GPU substrate; on a single GPU the manifest demonstrates the mechanism, with the throughput win observable once a second GPU is present).
What changes between layers (same model, same engine)
| Dimension | Raw vLLM | KServe ISVC | llm-d LLMISVC |
|---|---|---|---|
| Manifest | Deployment + Service + PVC + probes | one InferenceService CR | one LLMInferenceService CR |
| Namespace | serving | kserve | llm-d |
| Weights | pre-staged PVC, initContainer, HF-offline | pre-staged PVC (--model=/models/qwen) or oci:// modelcar | hf:// URI; KServe storage stages the weights |
| Ingress / URL | Service; you wire the gateway/route | KServe auto-creates an HTTPRoute + stable per-ISVC URL | router (route: {}) generates the HTTPRoute on the isolated gateway |
| Rollout / canary | DIY (Recreate, no split) | canaryTrafficPercent (revision-based split) | router-managed |
| Scale-to-zero | manual replicas:0 + Argo ignoreDifferences | native minReplicas:0 | router/pool managed |
| Routing | LiteLLM -> Service (or GIE InferencePool) | gateway HTTPRoute -> predictor | bundled InferencePool + EPP (KV/prefix-aware) |
| Prereqs | none beyond the cluster | cert-manager + KServe controller + network controller | KServe + the kserve-llmisvc CRD/presets, >=2 GPUs |
| Feature gate | serving-core (always on) | kserve: true | llm-d: true (llm-d) |
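
The raw vLLM scale-to-zero cell pairs a manual replicas:0 with Argo ignoreDifferences. A minimal sketch of that fragment follows, assuming the raw-vllm Application manages a Deployment in the serving namespace. It is an illustration, not a copy of the repository's manifest.

```yaml
# Fragment of an Argo CD Application spec (not a complete manifest).
# Assumption: the raw-vllm workload is an apps/v1 Deployment in `serving`.
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      namespace: serving
      jsonPointers:
        - /spec/replicas   # Argo CD stops reporting drift when replicas is changed out of band
```

With this in place, setting replicas to 0 by hand should not show the Application as OutOfSync on that field.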
Feature gates and bring-up
Each layer’s Argo Applications exist only when its feature group is enabled in
environments/ai-dev/config.yaml and then materialized with make resolve-groups && make root
(see staged-bring-up.md). The GPU-bearing apps are manual-sync (the cost gate):
| Layer | Feature group | Applications | Sync |
|---|---|---|---|
| raw vLLM | serving-core (always on) | raw-vllm | manual (make vllm-up/down owns replicas) |
| KServe | kserve | kserve (controller), kserve-demo (qwen-cpu / qwen-oci) | controller auto; kserve-demo manual |
| llm-d | llm-d | llm-d (qwen-llmd) | manual |
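
As a rough illustration of the gates in the table above, the sketch below shows how the two optional groups might be toggled. Only the gate names (kserve, llm-d) come from this runbook; the enclosing `features` key and the overall file layout are hypothetical.

```yaml
# environments/ai-dev/config.yaml (sketch only)
# Assumption: gates sit under a `features` key; the real layout may differ.
features:
  kserve: true    # creates the kserve and kserve-demo Applications
  llm-d: false    # set to true to create the llm-d Application
# serving-core is always on, so it has no gate here.
```

After changing a gate, make resolve-groups && make root materializes the Applications; the GPU-bearing ones still need a manual sync.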
Move a model from raw vLLM to KServe
The same Qwen2.5-0.5B-Instruct is already served on both layers, so this is the worked path. Bring up
the KServe variant; the raw-vLLM Deployment is independent and stays at replicas:0.
- Confirm the gate + controller. kserve: true in config.yaml; the kserve group synced (cert-manager + KServe controller Healthy). The Gateway (serving/kserve/gateway.yaml, gatewayClassName: agentgateway) must exist for the ISVC URL to resolve.
- Use a custom-container predictor. serving/kserve/inferenceservice.yaml runs vLLM verbatim (container named kserve-container) instead of model + runtime. The model + runtime binding force-injects --model=/mnt/models and requires a storageUri + storage-initializer, which the HF-429 egress breaks (kserve.md §4-5). Keep --model=/models/qwen against the pre-staged kserve-model-cache PVC, or switch to the digest-pinned oci:// modelcar for big models (kserve-modelcar.md).
- Set an explicit limits.cpu. KServe defaults a cpu limit of 1 when unset; any requests.cpu > 1 then makes the Deployment invalid and it never updates (kserve.md §3).
- Sync and validate through the gateway.
- Re-point the tenant alias (optional). To route the LiteLLM alias to the KServe path instead of raw vLLM, change api_base in the model’s model_list entry (platform/litellm/values.yaml) to the predictor Service (http://qwen-cpu-predictor.kserve:80) or the gateway, then argocd app sync litellm. The tenant-facing alias and virtual keys are unchanged.
KServe gives native minReplicas:0 scale-to-zero; idle-pause a running predictor with the serving.kserve.io/stop annotation (kserve.md §8), not make vllm-down.
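
The custom-container predictor, the explicit CPU limit, and native scale-to-zero can be sketched in one manifest. This is a hedged illustration, not the contents of serving/kserve/inferenceservice.yaml: the image, the port, the CPU figures, and the mount layout are assumptions, and the name qwen-cpu is inferred from the predictor Service URL.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-cpu
  namespace: kserve
  # annotations:
  #   serving.kserve.io/stop: "true"   # idle-pause; the "true" value is assumed, see kserve.md §8
spec:
  predictor:
    minReplicas: 0                     # native scale-to-zero
    containers:
      - name: kserve-container         # custom container: no model + runtime binding
        image: registry.example/vllm:TAG   # placeholder image, not the real one
        args:
          - --model=/models/qwen       # read weights from the pre-staged PVC
          - --port=8080                # assumed port
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          requests:
            cpu: "2"                   # illustrative value
          limits:
            cpu: "4"                   # explicit, so the defaulted limit of 1 never applies
        volumeMounts:
          - name: model-cache
            mountPath: /models
            readOnly: true
    volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: kserve-model-cache
```

Keeping limits.cpu at or above requests.cpu is the point of the third step: with the limit unset, a request above 1 would produce an invalid Deployment.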
Move a model to llm-d
llm-d is a contained, advanced scale-out path, isolated in the llm-d namespace with its own
gateway/GIE, not a replacement for the reference agentgateway+GIE path.
serving/llm-d/llminferenceservice.yaml serves the same Qwen2.5-0.5B-Instruct with prefill/decode
disaggregation and KV-aware routing.
- Enable + materialize: llm-d: true -> make resolve-groups && make root PROFILE=full so the llm-d Application exists (manual-sync, ServerSideApply=true for the large CRD).
- Confirm GPU capacity. Disaggregation = 1 prefill GPU + 1 decode GPU. On a single GPU (GPUS_ALL_REGIONS=1) only one pool schedules, so the disaggregation/KV-routing throughput win is not yet observable; the manifest demonstrates the mechanism. The throughput benchmark runs on the 2-GPU substrate (see serving/llm-d/README.md).
- Sync the path. Weights come from the hf://Qwen/Qwen2.5-0.5B-Instruct URI (KServe’s storage layer stages them), not a
hand-wired PVC like raw vLLM. The router (scheduler: {} + route: {}) creates the bundled
InferencePool + EPP and the HTTPRoute on the isolated llm-d-gateway.
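
The shape of such a resource can be sketched as below. The field layout is assumed from the upstream KServe v1alpha1 LLMInferenceService API and should be checked against the installed CRD. The replica counts, GPU limits, and container name are illustrative, and the reference to llm-d-gateway is left out because its exact form is not given here.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen-llmd
  namespace: llm-d
spec:
  model:
    uri: hf://Qwen/Qwen2.5-0.5B-Instruct    # staged by KServe's storage layer
    name: Qwen/Qwen2.5-0.5B-Instruct
  replicas: 1                  # decode pool (assumed size)
  template:
    containers:
      - name: main             # assumed container name
        resources:
          limits:
            nvidia.com/gpu: "1"
  prefill:                     # separate prefill pool: this is the disaggregation
    replicas: 1
    template:
      containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    scheduler: {}              # bundled InferencePool + EPP
    route: {}                  # generated HTTPRoute
```

With one GPU per pool, this sketch needs two GPUs for both pools to schedule, which matches the capacity check above.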
Switching back / tearing down a layer
Layers are independent, so “switching off” a layer is scaling its workload to $0 and (optionally) re-pointing the LiteLLM alias back. To remove a layer entirely, disable its feature gate (then make resolve-groups) or delete its
catalog group; full ordered teardown (Gateways before the cluster to avoid orphaned LBs) is in
teardown.md.
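
Re-pointing the LiteLLM alias, in either direction, is a change to a single api_base value. A minimal sketch of the entry follows, assuming LiteLLM's standard model_list format. The alias name and the provider prefix are hypothetical, and where the entry sits inside platform/litellm/values.yaml is not shown.

```yaml
model_list:
  - model_name: qwen                               # hypothetical tenant-facing alias
    litellm_params:
      model: openai/Qwen/Qwen2.5-0.5B-Instruct     # assumed OpenAI-compatible provider prefix
      # KServe path; to switch back, restore the raw-vLLM Service URL here.
      api_base: http://qwen-cpu-predictor.kserve:80
```

Depending on the provider prefix, the URL may need a /v1 suffix. Apply the change with argocd app sync litellm.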