Context
ADR-0006 deferred model delivery as its own decision: HF self-download is fragile (the cluster egress IP is persistently rate-limited by HF with HTTP 429, even with a token), so the qwen-cpu demo pre-stages
weights onto an RWO PVC and serves offline. That pattern doesn’t scale: it pins a model to one node,
needs out-of-band kubectl cp, and says nothing about large models, many replicas, or air-gap. The
candidates ADR-0006 listed: per-pod HF pull, a shared RWX cache, KServe LocalModelCache
(node-local pre-pull, v0.17+), or OCI “modelcar” (oci:// weights-in-image). This decision
settles model delivery at scale, and GPU-validates the KServe path (closing the ADR-0006 CPU-only gap).
This also forces a platform-wide design rule: when do we reach for cloud-specific accelerators
(GKE Hyperdisk ML, GCS-FUSE, Image Streaming, Custom Compute Classes) vs generic K8s primitives? The
platform’s public claim is “GKE-verified / portable-by-design”, so the default
must stay portable.
Decision
Adopt the digest-pinned OCI modelcar (oci://reg/model@sha256:...) as the DEFAULT model-delivery
primitive, on the KServe path.
- Modelcar mechanism (KServe >=0.14, default-on v0.19): the model ships as an OCI image (busybox, COPY data/ /models/). KServe sees storageUri: oci://... and injects a sidecar + init-container, shareProcessNamespace: true, a shared /mnt volume, and an ln -sf so the weights appear at /mnt/models: symlinked, not copied (the big-model win: no 2x disk, fast start). Works with our custom-container predictor because injection keys off the container name kserve-container (which ADR-0006 already uses). vLLM reads --model=/mnt/models.
- Digest-pinned, never :latest. A @sha256: ref defaults the pull policy to IfNotPresent (node-cached after first pull); a :latest or missing tag defaults to Always, which re-resolves the tag against the registry on every pod start and re-pulls the multi-GB image whenever the tag has moved.
- Cloud-agnostic. Any OCI registry works; air-gap via a mirror (supply-chain integrity). This is why modelcar is the default and not a GKE-specific feature.
- Generic-first rule (platform-wide): generic, K8s-native primitives are the default; cloud-specific implementations (Hyperdisk ML, GCS-FUSE, GKE Image Streaming, Custom Compute Classes) are optional, profile-gated, and clearly labelled. Distinction (lab-vs-product): for real models the OCI image is 30-70 GB and image-pull dominates cold start, so fast large-image delivery is a PRODUCT-CRITICAL concern, not optional. What is optional is the GKE-specific tool; the concern itself is tracked as product-critical with a cloud-agnostic baseline (lean/zstd images, in-region registry, digest → IfNotPresent node caching, LocalModelCache node-warm pre-pull, the cloud-agnostic node-warming stage) and per-cloud accelerators layered on top (GKE Image Streaming ~191s→30s, Hyperdisk ML, the per-cloud cold-start accelerators stage profile). We demonstrate it cheaply at 1GB/1-GPU; we do not scope it away because the lab model is small.
- KServe is the model-delivery substrate. Modelcar and (later) LocalModelCache are KServe-native; raw-vLLM stays the simple/benchmark endpoint. This extends KServe’s role beyond ADR-0006’s “lifecycle + ingress” framing, so we amend ADR-0006 (done, dated note).
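The digest-pinning bullet above leans on Kubernetes' imagePullPolicy defaulting. A minimal Python sketch of that rule follows, assuming the injected modelcar container carries no explicit imagePullPolicy so the Kubernetes default applies. The function name default_pull_policy and the example refs are hypothetical, and the ref parsing is deliberately simplified.

```python
def default_pull_policy(image: str) -> str:
    """Return the imagePullPolicy Kubernetes defaults to for an image ref.

    Simplified sketch of the documented defaulting rule:
      - digest ref (name@sha256:...)  -> IfNotPresent
      - tag other than "latest"       -> IfNotPresent
      - ":latest" or no tag at all    -> Always
    """
    # "oci://" is the KServe storageUri scheme, not part of the image ref.
    ref = image.removeprefix("oci://")
    if "@" in ref:
        return "IfNotPresent"
    # Only a ":" in the final path segment is a tag separator; an earlier
    # ":" is a registry port (for example reg:5000/model).
    last_segment = ref.rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
    return "Always" if tag in ("", "latest") else "IfNotPresent"


# Hypothetical refs, for illustration only.
for ref in ("oci://reg/model@sha256:" + "a" * 64, "oci://reg/model:latest"):
    print(ref[:32], "->", default_pull_policy(ref))
```

A check like this could run in CI to reject any storage URI that would not resolve to IfNotPresent; that placement is a suggestion, not something the committed scripts are known to do.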
Alternatives considered
- Per-pod HF self-download. Rejected as default: fragile (egress-IP HF-429), no air-gap story, slow cold start per pod, no caching. Fine for a forker on un-throttled egress (--model=<hf-id>).
- Shared RWX cache (Filestore / NFS). Useful for live canary (revisions spread across nodes, ADR-0006 −) and multi-replica reads, but adds a stateful RWX volume to operate and a copy/stage step; it’s a complement, not the delivery default. Modelcar’s image registry is the simpler portable single source of truth.
- KServe LocalModelCache (now). Not rejected, sequenced to the cloud-agnostic node-warming stage (see above). It layers on top of the OCI-image delivery (warms the node cache), so the delivery mechanism lands first.
- GKE accelerators as the default (Hyperdisk ML / GCS-FUSE / Image Streaming). Rejected as default: cloud-specific, breaks portability. Kept as the optional profile-gated scale-path (the per-cloud cold-start accelerators stage).
- Bake weights into the vLLM image directly (no modelcar sidecar). Rejected: couples model lifecycle to engine lifecycle (re-build/re-push on every vLLM bump), and loses KServe’s storageUri contract + the symlink/no-copy mechanism.
Consequences
- + Cloud-agnostic, air-gap-friendly, big-model-ready delivery with immutable digest provenance; no egress dependency at serve time (HF_HUB_OFFLINE=1). Subsequent pods on a warm node start without pulling the image again.
- + ADR-0006’s custom-container objection dissolves: the old “model+runtime force-injects /mnt/models” complaint is moot because the weights are now at /mnt/models (via the symlink), with a custom container.
- − Node image-layer disk is the new sizing constraint: the full image (30-70 GB for a real model) is pulled to the node’s containerd storage. Size the GPU node boot/image disk (>=200 GB) or hit ImageGCFailed / disk-pressure eviction. (This is the trade-off LocalModelCache / RWX PVC address.)
- − A build+push step enters the model lifecycle (scripts/build-modelcar.sh / Cloud Build); the digest must be re-pinned on every model change.
- RESOLVED 2026-06-20 (scratch-ns smoke): a custom-container modelcar wires /mnt/models via the STORAGE_URI env on kserve-container: KServe then injects the modelcar sidecar + init + the symlink (verified: pod went 1→2 containers + init). Not predictor.model.storageUri (the webhook rejects model + containers together) and not the serving.kserve.io/storageUri annotation (no injection). The committed inferenceservice-modelcar.yaml uses the STORAGE_URI env accordingly.
- − shareProcessNamespace: true (webhook-injected) shares the PID namespace across pod containers; blast radius is the model sidecar + vLLM in a single-tenant pod. Don’t co-locate untrusted sidecars.
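The disk-sizing consequence above is simple arithmetic, sketched here in Python under loudly stated assumptions. The 85% figure is the kubelet's default image-GC high threshold (imageGCHighThresholdPercent). The 30 GB base usage for OS, engine image and logs is a hypothetical placeholder, and the sketch ignores any gap between compressed and unpacked layer size.

```python
def images_that_fit(disk_gb: float, image_gb: float,
                    base_usage_gb: float = 30.0,
                    gc_high_pct: float = 85.0) -> int:
    """Count model images that fit before usage crosses the image-GC high threshold.

    base_usage_gb is an assumed placeholder, not a measured value.
    """
    budget_gb = disk_gb * gc_high_pct / 100.0 - base_usage_gb
    if budget_gb <= 0 or image_gb <= 0:
        return 0
    return int(budget_gb // image_gb)


for disk in (100, 200):
    print(f"{disk} GB disk: {images_that_fit(disk, image_gb=70)} x 70 GB images")
```

Under these assumed numbers a 200 GB disk holds two 70 GB images (for example the old and new digest during a rollover), while a 100 GB disk holds none.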
Validation: live GPU serve (2026-06-22)
Status: GPU-validated end-to-end on the canonical IaC cluster (L4; vLLM v0.23.0, KServe v0.19.0). The qwen-oci ISVC pulled the digest-pinned OCI modelcar, KServe injected the modelcar init+sidecar, vLLM
loaded model=/mnt/models with HF_HUB_OFFLINE=1, and /v1/chat/completions returned 200: weights
served from the image, zero HF egress. Closes the ADR-0006 “KServe never GPU-tested” gap.
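The validated wiring can be sketched as a manifest. The Python dict below shows what such an InferenceService might look like, written as a dict so its shape can be checked programmatically; it is not a copy of the committed inferenceservice-modelcar.yaml. The registry paths, the all-zero digest, and the helper name modelcar_isvc are hypothetical, while serving.kserve.io/v1beta1 is KServe's InferenceService API version.

```python
import json

# Hypothetical values: substitute the real registry paths and digest.
MODELCAR_URI = "oci://registry.example.com/models/qwen@sha256:" + "0" * 64
VLLM_IMAGE = "registry.example.com/vllm-openai:v0.23.0"


def modelcar_isvc(name: str, storage_uri: str, image: str) -> dict:
    """Build an InferenceService whose custom container requests a modelcar."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # Custom container only: no predictor.model block, since the
                # webhook rejects model + containers together.
                "containers": [
                    {
                        # Injection keys off this exact container name.
                        "name": "kserve-container",
                        "image": image,
                        "args": ["--model=/mnt/models"],
                        "env": [
                            # The env var that triggers modelcar injection.
                            {"name": "STORAGE_URI", "value": storage_uri},
                            # Serve from the image, never from the HF hub.
                            {"name": "HF_HUB_OFFLINE", "value": "1"},
                        ],
                    }
                ]
            }
        },
    }


isvc = modelcar_isvc("qwen-oci", MODELCAR_URI, VLLM_IMAGE)
print(json.dumps(isvc, indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be applied directly once the placeholders are replaced.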
Three forker traps, each with a committed fix (a vanilla forker hits each):
- Build: huggingface-cli download is removed in huggingface_hub 1.x (hard-exits, “use hf”), and the hf replacement parses --exclude differently (extra patterns swallowed as positional filenames → empty data/ → COPY data/ fails). Fix: both cloudbuild.yaml and scripts/build-modelcar.sh now call the stable Python API snapshot_download(model, local_dir='data', ignore_patterns=[...]), the same call the raw-vllm pre-stage init uses (which never broke).
- Serve, KeyError: getpwuid(): uid not found: 1010. KServe runs the predictor as a non-root UID with no /etc/passwd entry; vLLM/torch call getpass.getuser(), which falls through to pwd.getpwuid(). Fix: set USER env (read before the passwd lookup).
- Serve, PermissionError: '/.cache'. HOME is unset, so ~/.cache resolves to /.cache (unwritable as non-root); flashinfer + torch.compile mkdir their workspace under $HOME. Fix: set HOME=/tmp.
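The build fix can be sketched as a small staging helper. snapshot_download with repo_id, local_dir and ignore_patterns is the huggingface_hub Python API named in the fix. The ignore patterns and the stage_weights wrapper are hypothetical stand-ins, not the contents of scripts/build-modelcar.sh.

```python
from pathlib import Path

# Hypothetical patterns: skip files the server does not need.
IGNORE_PATTERNS = ["*.pth", "*.gguf", "original/*"]


def stage_weights(model_id: str, local_dir: str = "data", download=None) -> Path:
    """Download a model snapshot into local_dir and fail loudly if it is empty.

    `download` is injectable for dry runs; by default it is
    huggingface_hub.snapshot_download, imported lazily.
    """
    if download is None:
        from huggingface_hub import snapshot_download as download
    download(repo_id=model_id, local_dir=local_dir, ignore_patterns=IGNORE_PATTERNS)
    target = Path(local_dir)
    # Guard against the empty-data/ failure mode before `COPY data/` runs.
    if not target.is_dir() or not any(target.iterdir()):
        raise RuntimeError(f"{local_dir}/ is empty after download of {model_id}")
    return target
```

In a build step this would be called as stage_weights("<hf-id>") before the image build. The emptiness check is an added safeguard that turns the silent empty-directory case into an early, explicit error.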
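Both serve traps share one cause: the predictor runs as a non-root UID with no passwd entry and no HOME. Set the USER and HOME env vars, or runtimes that assume a real user (vLLM, torch, flashinfer, HF) crash before /health ever passes. (raw-vllm avoids this because it runs under a different security context.) Both envs are now in inferenceservice-modelcar.yaml.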
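Why the two env vars are enough can be checked with the Python standard library alone. getpass.getuser() consults LOGNAME, USER, LNAME and USERNAME before falling back to the passwd database, and on POSIX os.path.expanduser resolves ~ from HOME when it is set. The sketch below simulates the predictor environment in-process; the user name vllm is an arbitrary placeholder.

```python
import getpass
import os


def simulate_predictor_env(user: str = "vllm", home: str = "/tmp") -> tuple[str, str]:
    """Apply the USER/HOME fix and report what user-assuming libraries would see."""
    # Clear the other names getpass checks so USER is the one that answers.
    for name in ("LOGNAME", "LNAME", "USERNAME"):
        os.environ.pop(name, None)
    os.environ["USER"] = user
    os.environ["HOME"] = home
    # getuser() returns USER without ever reaching pwd.getpwuid();
    # expanduser() resolves "~" from HOME rather than from a passwd entry.
    return getpass.getuser(), os.path.expanduser("~/.cache")


print(simulate_predictor_env())  # ('vllm', '/tmp/.cache') on POSIX
```

With HOME=/tmp the cache directory lands under a path that is normally writable for a non-root UID, which is the effect the fix relies on.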
References
- Resolves the ADR-0006 follow-up (model delivery at scale); ADR-0006 amended (KServe = delivery substrate). Generic-first gating applies. Replica floor as a profile knob: ADR-0014 (deployment profiles).
- serving/kserve/inferenceservice-modelcar.yaml (qwen-oci), serving/kserve/modelcar/, scripts/build-modelcar.sh, platform/kserve/values.yaml (enableModelcar: true), runbook kserve-modelcar.md.
- KServe OCI/Modelcar (stable v0.14, default-on v0.15.2). vLLM v0.23.0; KServe v0.19.0.