serving/kserve/. KServe v0.19.0 in Standard (RawDeployment) mode (no
Knative/Istio), ingress via Gateway API on agentgateway. Argo apps: cert-manager (wave 0),
kserve-crd (1), kserve (2), kserve-demo (4). See Serving layers compared.
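A minimal sketch of how one of these waves could be declared, assuming an app-of-apps layout where each Argo CD Application manifest carries the sync-wave annotation. The argocd namespace is an assumption, and the spec is omitted.

```yaml
# Sketch only: metadata of the kserve-crd Application, wave 1.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kserve-crd
  namespace: argocd                      # assumption: Argo CD's usual namespace
  annotations:
    argocd.argoproj.io/sync-wave: "1"    # cert-manager 0, kserve 2, kserve-demo 4
```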
1. Smoke test
Check that the ISVC is Ready: kubectl -n kserve get isvc qwen-cpu. The gateway’s catch-all HTTPRoute forwards
the path to the predictor (vLLM serves /v1/* and /health).
2. Install prereqs: cert-manager + deployment mode
KServe webhooks need cert-manager (platform/cert-manager, v1.20.2; crds.enabled=true).
Set Standard mode + reuse agentgateway as the network controller (platform/kserve/values.yaml):
kserve.controller.deploymentMode=Standard, gateway.ingressGateway.enableGatewayApi=true,
gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway. The Gateway
(serving/kserve/gateway.yaml, gatewayClassName: agentgateway) must exist for the ISVC URL to
resolve.
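The settings above might look like this in YAML. This is a sketch: nesting the gateway keys under kserve.controller is an assumption (inferred from the kserve.controller.gateway.* path used in §9), and the Gateway listener is illustrative, not copied from serving/kserve/gateway.yaml.

```yaml
# platform/kserve/values.yaml (sketch; key nesting is an assumption)
kserve:
  controller:
    deploymentMode: Standard
    gateway:
      ingressGateway:
        enableGatewayApi: true
        kserveGateway: kserve/kserve-ingress-gateway   # <namespace>/<name>
---
# serving/kserve/gateway.yaml (sketch; listener details are assumptions)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kserve-ingress-gateway
  namespace: kserve
spec:
  gatewayClassName: agentgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
```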
3. KServe injects a default cpu limit of 1
Symptom: the ISVC reports PredictorReady=False (ReconcileFailed) and the controller logs
Deployment ... invalid: spec.template.spec.containers[0].resources.requests: Invalid value "2": must be less than or equal to cpu limit of 1.
KServe defaults the cpu limit to 1 when it is unset, so any requests.cpu>1 makes the Deployment
invalid; it never updates and the stale pod persists.
Fix: set an explicit limits.cpu >= your request.
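A sketch of the fix as a predictor fragment. The value 2 matches the request in the error above; the rest of the container spec is omitted.

```yaml
# Fragment of the ISVC spec (sketch).
spec:
  predictor:
    containers:
      - name: kserve-container
        resources:
          requests:
            cpu: "2"
          limits:
            cpu: "2"   # explicit and >= requests.cpu, so the default of 1 is never injected
```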
4. predictor.model + runtime forces --model=/mnt/models
With model.modelFormat + runtime (+ storageUri), KServe overrides the runtime’s --model with
/mnt/models (the storage-initializer mount), so this form requires a storageUri. To let vLLM use its
own --model (HF repo id or a local path) and skip the storage-initializer, use a
custom-container predictor (predictor.containers, container named kserve-container). This gives up
the ServingRuntime abstraction, which is worth it for the control.
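A minimal sketch of the custom-container form. The image name is a placeholder, and several details are assumptions: the image's entrypoint is assumed to be the vLLM OpenAI server (so args are passed straight to it), the /models mount path is inferred from the --model path used in §5, and --port=8080 assumes KServe's default predictor port.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-cpu
  namespace: kserve
spec:
  predictor:
    containers:
      - name: kserve-container                 # KServe expects this container name
        image: example.invalid/vllm-cpu:TAG    # placeholder, not a real image
        args:
          - --model=/models/qwen               # vLLM's own flag; no /mnt/models override
          - --port=8080                        # assumption: KServe's default predictor port
        volumeMounts:
          - name: model-cache
            mountPath: /models                 # assumption
    volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: kserve-model-cache
```

No storageUri is set, so no storage-initializer runs.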
5. HuggingFace 429 on the cluster egress IP → pre-stage the model
The GKE egress IP is persistently rate-limited by HuggingFace with HTTP 429 (an IP-level API limit; an HF token does not lift it). Both the storage-initializer and vLLM self-download fail (429, surfaced as
OSError: couldn't connect to huggingface.co). Mitigations, cheapest first:
- Pre-stage to the cache PVC (what we do): download the model on an un-throttled machine and
kubectl cp it onto the kserve-model-cache PVC, then serve --model=/models/qwen offline. vLLM never touches HF. Steps: stage into a pod that mounts the PVC RW (same node if RWO), copy config.json generation_config.json merges.txt model.safetensors tokenizer*.json vocab.json.
- HF token (serving/kserve/externalsecret.yaml, ESO→GSM hf-token): still set, helps on a non-throttled egress; on a throttled IP it doesn’t clear the 429.
- A populated PVC is also resilient: hf_hub falls back to cache when a revalidation HEAD 429s.
- Forkers on un-throttled egress can set --model=Qwen/Qwen2.5-0.5B-Instruct to self-download.
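The staging pod from the pre-stage steps could look like the sketch below. The pod name, image and /models mount path are assumptions. The image only needs a tar binary, which kubectl cp relies on.

```yaml
# Throwaway pod that mounts the cache PVC read-write (sketch).
apiVersion: v1
kind: Pod
metadata:
  name: model-stager            # hypothetical name
  namespace: kserve
spec:
  restartPolicy: Never
  containers:
    - name: stager
      image: busybox:1.36       # assumption: any small image with tar works
      command: ["sleep", "3600"]
      volumeMounts:
        - name: model-cache
          mountPath: /models    # assumption, matches --model=/models/qwen
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: kserve-model-cache
```

Once the pod is Running, kubectl cp the listed files into it under /models/qwen, then delete the pod. With an RWO volume it has to land on the same node as the predictor.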
6. vLLM-CPU specifics
- WorkerProc init crashes at low CPU. VLLM_CPU_OMP_THREADS_BIND default auto needs >=2 cores (it reserves one for the API server and binds workers to the rest); cpu:1 leaves none → WorkerProc initialization failed. Use cpu:2+. Do not set the value to all: vLLM v0.23.0 rejects it (ValueError: invalid literal for int(): 'all'); leave it unset (auto) or give an explicit core-id list.
- --enforce-eager: skip torch.compile/inductor (slow + fragile on CPU).
- --dtype=float32 (CPU has no fp16/bf16 fast path), VLLM_CPU_KVCACHE_SPACE=2 (GiB of RAM KV cache).
- Weights load from the PVC in ~10s; full warmup ~2-4 min on e2-standard-4.
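Put together, the CPU settings above could appear in the kserve-container spec roughly as follows. This is a sketch, not the exact manifest. The model path is taken from §5, and the cpu limit equals the request to satisfy §3.

```yaml
# Fragment of the kserve-container spec (sketch).
args:
  - --model=/models/qwen
  - --dtype=float32          # CPU has no fp16/bf16 fast path
  - --enforce-eager          # skip torch.compile/inductor
env:
  - name: VLLM_CPU_KVCACHE_SPACE
    value: "2"               # GiB of RAM for the KV cache
  # VLLM_CPU_OMP_THREADS_BIND is deliberately left unset (auto); never "all".
resources:
  requests:
    cpu: "2"                 # auto binding needs >=2 cores
  limits:
    cpu: "2"
```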
7. Recreate strategy + the canary constraint
Predictor deploymentStrategy: type: Recreate (a CPU predictor needs ~2 cores; a RollingUpdate
surge pod needs a second free 2-core slot and deadlocks with Insufficient cpu, the same single-slot
reasoning as raw-vllm’s Recreate for the single GPU). Canary (canaryTrafficPercent) runs the
new revision as a second pod; with an RWO model PVC both revisions pin to one node and
2×cpu:2 exceeds an e2-standard-4, so a live canary isn’t feasible at $0-CPU. A real KServe
canary needs a ReadOnlyMany model volume (Filestore) so revisions spread across nodes. The
weighted-split concept is already proven live at the GIE layer (runbook inference-gateway.md §2).
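As a sketch, the Recreate strategy sits on the predictor like this. The container spec is omitted.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-cpu
  namespace: kserve
spec:
  predictor:
    deploymentStrategy:
      type: Recreate      # old pod is removed first, so no second 2-core slot is needed
    # containers: (omitted in this sketch)
```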
8. Idle-pause to save cost (live action, not declarative)
Pausing is an operational action, kept out of the manifest so a fresh clone deploys a running ISVC, not a dead one. To scale the predictor to 0 during idle breaks, set stop=true on the live ISVC (the model stays verified and reloads from the PVC in ~2-3 min). The kserve-demo Argo app is manual-sync (paid GPU, see staged-bring-up.md), so selfHeal won’t
revert a live stop=true: the pause holds until you next argocd app sync kserve-demo.
9. Misc
- VirtualServiceCRDNotFound warning → set kserve.controller.gateway.disableIstioVirtualHost=true (we route via Gateway API, not Istio).
- KServe CRDs are large → kserve-crd app uses ServerSideApply=true (same trap as Kueue/GIE).
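Both settings as YAML, as a sketch. The values file location is assumed to be platform/kserve/values.yaml as in §2, and the Application fragment shows only the sync option.

```yaml
# Helm values (sketch): silence the Istio VirtualService warning.
kserve:
  controller:
    gateway:
      disableIstioVirtualHost: true
---
# Fragment of the kserve-crd Argo CD Application (sketch).
spec:
  syncPolicy:
    syncOptions:
      - ServerSideApply=true   # large CRDs exceed the client-side apply annotation limit
```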