Three serving layers run the same vLLM engine on this platform: raw vLLM (serving/raw-vllm), KServe InferenceService (serving/kserve), and llm-d LLMInferenceService (serving/llm-d). They are not stacked: a model is served on one of them. This runbook is how an operator picks the layer for a model and moves a model between them. The architecture-level comparison is Raw vLLM vs KServe; the per-layer operational gotchas live in vllm-serving.md, kserve.md, and kserve-modelcar.md.

Pick the layer

| Use | When | Cost |
|---|---|---|
| Raw vLLM | one model, one team, full control; you own rollout/scale (the benchmarked default) | you write Deployment + Service + PVC + probes by hand; no canary, no auto-ingress |
| KServe InferenceService | lifecycle/canary/scale/ingress/governance handled declaratively across many models | a control plane to run (cert-manager + controller) and opinionated defaults that fight you until known |
| llm-d LLMInferenceService | large/multi-node models needing prefill/decode disaggregation + KV-aware routing | needs >=2 GPUs to show the throughput win; runs in its own namespace with its own GIE so it does not collide with the reference routing layer |

Decision shortcut:
  • Single benchmarked endpoint, minimal moving parts: raw vLLM.
  • A fleet of models with a uniform CR, stable per-model URLs, declarative canary, native scale-to-zero: KServe.
  • A model too big for one GPU, or one that needs KV-aware/disaggregated serving: llm-d (validated on the multi-GPU substrate; on a single GPU the manifest demonstrates the mechanism, with the throughput win observable once a second GPU is present).
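
The shortcut above can be read as a small precedence rule. The sketch below is one hypothetical way to encode it in shell. The function name pick_layer, its arguments, and the ordering (GPU and KV requirements checked before fleet concerns) are assumptions for illustration and are not part of the platform tooling.

```shell
# Hypothetical helper encoding the decision shortcut.
#   $1 = number of GPUs the model needs (default 1)
#   $2 = "fleet" for many models under a uniform CR, "single" otherwise
#   $3 = "kv" if KV-aware/disaggregated serving is required (optional)
pick_layer() {
  gpus="${1:-1}"
  scope="${2:-single}"
  kv="${3:-}"
  # Assumption: a model that cannot fit on one GPU, or that needs
  # KV-aware/disaggregated serving, goes to llm-d regardless of fleet size.
  if [ "$gpus" -gt 1 ] || [ "$kv" = "kv" ]; then
    echo "llm-d"
  elif [ "$scope" = "fleet" ]; then
    echo "kserve"
  else
    echo "raw-vllm"
  fi
}
```

For example, `pick_layer 1 fleet` prints kserve, and `pick_layer 2 single` prints llm-d.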

What changes between layers (same model, same engine)

| Dimension | Raw vLLM | KServe ISVC | llm-d LLMISVC |
|---|---|---|---|
| Manifest | Deployment + Service + PVC + probes | one InferenceService CR | one LLMInferenceService CR |
| Namespace | serving | kserve | llm-d |
| Weights | pre-staged PVC, initContainer, HF-offline | pre-staged PVC (--model=/models/qwen) or oci:// modelcar | hf:// URI; KServe storage stages the weights |
| Ingress / URL | Service; you wire the gateway/route | KServe auto-creates an HTTPRoute + stable per-ISVC URL | router (route: {}) generates the HTTPRoute on the isolated gateway |
| Rollout / canary | DIY (Recreate, no split) | canaryTrafficPercent (revision-based split) | router-managed |
| Scale-to-zero | manual replicas:0 + Argo ignoreDifferences | native minReplicas:0 | router/pool managed |
| Routing | LiteLLM -> Service (or GIE InferencePool) | gateway HTTPRoute -> predictor | bundled InferencePool + EPP (KV/prefix-aware) |
| Prereqs | none beyond the cluster | cert-manager + KServe controller + network controller | KServe + the kserve-llmisvc CRD/presets, >=2 GPUs |
| Feature gate | serving-core (always on) | kserve: true | llm-d: true (llm-d) |

Feature gates and bring-up

Each layer’s Argo Applications exist only when its feature group is enabled in environments/ai-dev/config.yaml and the groups are then materialized with make resolve-groups && make root (see staged-bring-up.md). The GPU-bearing apps are manual-sync (the cost gate):
features:
  kserve: true             # KServe declarative serving path (serving/kserve)
  llm-d: true              # advanced disaggregated serving (serving/llm-d)

| Layer | Feature group | Applications | Sync |
|---|---|---|---|
| raw vLLM | serving-core (always on) | raw-vllm | manual (make vllm-up/down owns replicas) |
| KServe | kserve | kserve (controller), kserve-demo (qwen-cpu / qwen-oci) | controller auto; kserve-demo manual |
| llm-d | llm-d | llm-d (qwen-llmd) | manual |

make resolve-groups                  # after editing config.yaml features
make root PROFILE=full               # materialize the Applications for the enabled groups
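
Because an Application only exists once its flag is on, it can help to check the flag before materializing. The following is a minimal sketch, assuming the flags appear as indented `name: true` lines as in the snippet above. The helper name feature_enabled is hypothetical and is not part of the repository's Makefile.

```shell
# Hypothetical pre-flight helper.
# feature_enabled FILE NAME -> exit 0 if an indented "NAME: true" line exists in FILE.
# Simplification: it does not verify that the line sits under the "features:" key.
feature_enabled() {
  grep -Eq "^[[:space:]]+$2:[[:space:]]*true([[:space:]]|#|\$)" "$1"
}

# Example use (commented out so that sourcing this file has no side effects):
# feature_enabled environments/ai-dev/config.yaml kserve \
#   && make resolve-groups && make root PROFILE=full
```

A plain grep is used here to avoid depending on a YAML parser; a stricter check would parse the file and read the key under features.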

Move a model from raw vLLM to KServe

The same Qwen2.5-0.5B-Instruct is already served on both layers, so this is the worked path. Bring up the KServe variant; the raw-vLLM Deployment is independent and stays at replicas:0.
  1. Confirm the gate + controller. kserve: true in config.yaml; the kserve group synced (cert-manager + KServe controller Healthy). The Gateway (serving/kserve/gateway.yaml, gatewayClassName: agentgateway) must exist for the ISVC URL to resolve.
  2. Use a custom-container predictor. serving/kserve/inferenceservice.yaml runs vLLM verbatim (container named kserve-container) instead of the model+runtime binding. That binding force-injects --model=/mnt/models and requires a storageUri + storage-initializer, which breaks when Hugging Face egress is rate-limited (HTTP 429; kserve.md §4-5). Keep --model=/models/qwen against the pre-staged kserve-model-cache PVC, or switch to the digest-pinned oci:// modelcar for big models (kserve-modelcar.md).
  3. Set an explicit limits.cpu. KServe defaults a cpu limit of 1 when unset; any requests.cpu>1 then makes the Deployment invalid and it never updates (kserve.md §3).
  4. Sync and validate through the gateway:
    argocd app sync kserve-demo
    kubectl -n kserve get isvc qwen-cpu                 # READY=True
    GW=$(kubectl -n kserve get gateway kserve-ingress-gateway -o jsonpath='{.status.addresses[0].value}')
    curl -s -H "Host: qwen-cpu-kserve.example.com" http://$GW/v1/models
    
  5. Re-point the tenant alias (optional). To route the LiteLLM alias to the KServe path instead of raw vLLM, change api_base in the model’s model_list entry (platform/litellm/values.yaml) to the predictor Service (http://qwen-cpu-predictor.kserve:80) or the gateway, then argocd app sync litellm. The tenant-facing alias and virtual keys are unchanged.
KServe gives native minReplicas:0 scale-to-zero; idle-pause a running predictor with the serving.kserve.io/stop annotation (kserve.md §8), not make vllm-down.
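
The CPU-limit trap in step 3 can be caught before syncing. The sketch below compares a requests.cpu value against limits.cpu, treating an unset limit as 1 as described in that step. The helper names to_millicores and cpu_request_fits are hypothetical, and only the plain and millicore ("500m") quantity forms are handled.

```shell
# Convert a Kubernetes CPU quantity ("2", "1.5", "500m") to integer millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# cpu_request_fits REQUEST [LIMIT]
# Exit 0 if REQUEST <= LIMIT. An empty or missing LIMIT is treated as 1,
# mirroring the default described in step 3.
cpu_request_fits() {
  req=$(to_millicores "$1")
  lim=$(to_millicores "${2:-1}")
  [ "$req" -le "$lim" ]
}
```

For example, `cpu_request_fits 2 ""` fails because the request exceeds the defaulted limit, while `cpu_request_fits 2 4` succeeds. The two values could be read from the manifest or from the live object; the exact field path depends on how the predictor container is declared, so check it against serving/kserve/inferenceservice.yaml.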

Move a model to llm-d

llm-d is a contained, advanced scale-out path, isolated in the llm-d namespace with its own gateway/GIE, not a replacement for the reference agentgateway+GIE path. serving/llm-d/llminferenceservice.yaml serves the same Qwen2.5-0.5B-Instruct with prefill/decode disaggregation and KV-aware routing.
  1. Enable + materialize: llm-d: true -> make resolve-groups && make root PROFILE=full so the llm-d Application exists (manual-sync, ServerSideApply=true for the large CRD).
  2. Confirm GPU capacity. Disaggregation = 1 prefill GPU + 1 decode GPU. On a single GPU (GPUS_ALL_REGIONS=1) only one pool schedules, so the disaggregation/KV-routing throughput win is not yet observable; the manifest demonstrates the mechanism. The throughput benchmark runs on the 2-GPU substrate (see serving/llm-d/README.md).
  3. Sync the path:
    argocd app sync llm-d
    kubectl -n llm-d get llminferenceservice qwen-llmd
    
Weights come from the hf://Qwen/Qwen2.5-0.5B-Instruct URI (KServe’s storage layer stages them), not a hand-wired PVC like raw vLLM. The router (scheduler: {} + route: {}) creates the bundled InferencePool + EPP and the HTTPRoute on the isolated llm-d-gateway.
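
The capacity check in step 2 can be scripted. This is a minimal sketch, assuming GPUs are exposed as the nvidia.com/gpu allocatable resource on each node. The helper name sum_gpus is hypothetical; it only sums one number per input line, so it can be tried without a cluster.

```shell
# Sum one GPU count per input line; blank lines (nodes without GPUs) count as 0.
sum_gpus() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Example use against a cluster (commented out; assumes the nvidia.com/gpu resource name):
# total=$(kubectl get nodes \
#   -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' | sum_gpus)
# [ "$total" -ge 2 ] || echo "only $total GPU(s): disaggregation win not observable yet"
```

A total below 2 does not block the sync; per step 2, the manifest still demonstrates the mechanism with a single pool scheduled.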

Switching back / tearing down a layer

Layers are independent, so “switching off” a layer means scaling its workload to zero (so it costs $0) and, optionally, re-pointing the LiteLLM alias back:
make vllm-down                                          # raw vLLM -> $0 (releases the GPU node)
kubectl -n kserve annotate isvc qwen-cpu serving.kserve.io/stop=true --overwrite   # KServe idle-pause
argocd app delete llm-d --yes                          # drop the llm-d path
To remove a whole layer’s Applications, disable its feature flag and rerun make resolve-groups, or delete its catalog group. The full ordered teardown (Gateways before the cluster, to avoid orphaned LBs) is in teardown.md.