Context
Scale-from-zero (ADR-0014 deployment profiles: replica floor is a profile knob, lab idles at 0) means every wake is a cold start. For a real model the wake is dominated by physics the lab hides: a 0.5B/1GB modelcar boots in seconds, a 30-70 GB image does not. The north star forbids demoting a product concern because the lab artifact (single GPU, 0.5B, $0-idle) makes it look cheap. So cold start is designed as a product concern, demonstrated cheaply. Cold start is not one number: it is four independent latencies that add up, each with its own fix and its own owner. Conflating them (e.g. “just use Image Streaming”) fixes one lever and leaves the other three. The research (2026-06-20-sso-coldstart.md TOPIC 2) separates them:
- Model weight load: pulling/mapping weights into the engine.
- Container image pull: getting the multi-GB OCI image onto the node.
- Node 0→1: provisioning a GPU node when the pool is empty (minutes).
- Runtime / scale-from-zero: what vLLM does between container-start and /health passing.
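Because the four latencies add, the effect of fixing a single lever can be checked with simple arithmetic. The sketch below uses made-up placeholder durations (not measurements from the research) and a hypothetical `ColdStart` type, purely to illustrate why a one-lever fix leaves most of the wake time in place.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColdStart:
    """Seconds spent in each of the four levers (illustrative values only)."""
    node_provision: float  # lever 3: node 0→1
    image_pull: float      # lever 2: container image pull
    weight_load: float     # lever 1: model weight load
    runtime_init: float    # lever 4: container start until /health passes

    def total(self) -> float:
        # The four latencies are independent and add up.
        return (self.node_provision + self.image_pull
                + self.weight_load + self.runtime_init)


# Hypothetical numbers for a large model waking on an empty pool.
baseline = ColdStart(node_provision=300, image_pull=190,
                     weight_load=120, runtime_init=60)

# Fixing only lever 2 (a faster image pull) leaves the other three untouched.
image_only = replace(baseline, image_pull=30)

print(baseline.total())    # 670
print(image_only.total())  # 510
```

With these placeholder values the image-pull fix removes 160 s, yet 510 s of cold start remain, which is the case for treating each lever separately.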
Decision
Treat cold start as four independent levers with a cloud-agnostic BASELINE (the default, ships in every profile) and per-cloud ACCELERATORS (optional, profile-gated). Lever 4 (runtime) is DECIDED and applied now to qwen-oci; levers 1-3 are sequenced (model delivery at scale).
Lever 1: Model weight load (baseline; GKE accelerators optional)
- Baseline (cloud-agnostic): weights ship as the digest-pinned OCI modelcar (ADR-0016), HF fully offline (HF_HUB_OFFLINE=1 + TRANSFORMERS_OFFLINE=1), and --load-format runai_streamer streams weights into GPU memory in parallel (engine-native, no cloud dependency, ≫ tensorizer from object storage). LocalModelCache pre-pulls weights node-warm (ADR-0016, model delivery at scale: product capability, deferred by sequencing only).
- Accelerators (profile-gated, GKE): Hyperdisk ML (RO READ_ONLY_MANY, ~11.9× faster, RO fan-out to thousands of nodes) for very large models; GCS-FUSE as a warm cache (file-cache + parallel-download). Optional; never depended on.
Lever 2: Container image pull (PRODUCT-CRITICAL, not optional)
- For a real model the OCI image is 30-70 GB and image-pull dominates cold start (ADR-0016). The concern is product-critical; only the GKE-specific tool is optional.
- Baseline (cloud-agnostic): lean base + zstd-compressed layers; in-region registry (no cross-region egress); digest-pinned ⇒ IfNotPresent so a warm node never re-pulls (ADR-0016: a tag/:latest forces Always); LocalModelCache node-warm pre-pull so scale-out doesn’t pay first-pull per node.
- Accelerators (profile-gated, GKE): GKE Image Streaming (lazy-pull from Artifact Registry, ~191s→~30s first start); secondary boot disks to preload images/weights for >~20 GB.
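The digest-pinned ⇒ IfNotPresent rule in the baseline can be expressed as a small check, for example in a manifest linter. This is a minimal sketch of the rule as stated in this ADR, not of Kubernetes' own defaulting logic; the function name and the image references are hypothetical.

```python
def pull_policy_for(image: str) -> str:
    """Pick an imagePullPolicy from the image reference, per the baseline rule."""
    # A digest reference (name@sha256:<hex>) is immutable, so a copy cached
    # on a warm node is always the right bytes: never re-pull.
    if "@sha256:" in image:
        return "IfNotPresent"
    # A tag (including :latest) can move, so the node has to go back to
    # the registry on every start.
    return "Always"


# Hypothetical image references for illustration.
pinned = "registry.example/models/qwen@sha256:" + "0" * 64
tagged = "registry.example/models/qwen:latest"

print(pull_policy_for(pinned))  # IfNotPresent
print(pull_policy_for(tagged))  # Always
```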
Lever 3: Node 0→1 (baseline buffer; GKE compute classes optional)
- Baseline (cloud-agnostic): a warm buffer via low-priority pause pods (hold a GPU node warm so a real pod preempts instead of waiting minutes for provisioning). Reservations/DWS for scarce GPUs.
- Accelerators (profile-gated, GKE): Custom Compute Classes (Spot→on-demand fallback + migrate-back, GA); GKE Active Buffer/CapacityBuffer (Preview; GPU support unconfirmed, verify before relying on it). Optional; the pause-pod buffer is the portable default.
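The pause-pod buffer is commonly built from a negative-priority PriorityClass plus a Deployment of pause containers that request a GPU. The sketch below generates such manifests as JSON (which kubectl accepts). All names, the replica count, and the one-GPU-per-node assumption are illustrative and not taken from this repository.

```python
import json

# Negative priority: any pod with the default priority (0) or higher can
# preempt the placeholder. The name is hypothetical.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "gpu-warm-buffer"},
    "value": -10,
    "globalDefault": False,
    "description": "Placeholder pods that hold a GPU node warm.",
}

pause_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "gpu-warm-buffer"},
    "spec": {
        "replicas": 1,  # assumption: one pod holds one single-GPU node
        "selector": {"matchLabels": {"app": "gpu-warm-buffer"}},
        "template": {
            "metadata": {"labels": {"app": "gpu-warm-buffer"}},
            "spec": {
                "priorityClassName": priority_class["metadata"]["name"],
                "terminationGracePeriodSeconds": 0,
                "containers": [{
                    "name": "pause",
                    "image": "registry.k8s.io/pause:3.9",
                    # Requesting the GPU is what keeps the node provisioned.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }],
            },
        },
    },
}

print(json.dumps([priority_class, pause_deployment], indent=2))
```

Depending on the cluster, the placeholder may also need the same tolerations and node selectors as the real serving pod so that it lands on the node type it is meant to hold.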
Lever 4: Runtime (DECIDED, applied to qwen-oci now)
These are the cheap, lab-visible startup traps from the research “Top real issues”. Applied to serving/kserve/inferenceservice-modelcar.yaml:
- /dev/shm tmpfs ≥16Gi. The container default /dev/shm is tiny; NCCL falls back / hangs on init. Mount an emptyDir {medium: Memory, sizeLimit: 16Gi} at /dev/shm.
- NCCL_CUMEM_ENABLE=0. Paired with the tmpfs above: the known-good combo against the NCCL init hang.
- torch.compile / HF cache at /root/.cache. emptyDir in the lab (per-pod, cleared on restart); back it with a persistent volume in production so CUDA-graph compilation is not re-captured on every cold start (>10s per boot).
- --gpu-memory-utilization: 0.7-0.8 is the CEILING, 0.6 conservative-safe. Higher starves the transient startup VRAM (graph capture / activation buffers) and trips startup OOM before /health.
- --enforce-eager only when cold-start beats throughput. CUDA-graph capture adds >10s; skip it to wake faster, keep it when steady-state throughput matters. A per-profile call, not a default.
- vLLM Sleep Mode for multi-model. L1 RAM offload, 18-200x faster wake than a full reload: the right tool when one GPU multiplexes several models (future; not single-model qwen-oci).
Not set here: shareProcessNamespace/securityContext; the KServe webhook injects them (ADR-0016). We add only the volumes + mounts + env.
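The volumes, mounts, env, and engine flag described for lever 4 could look roughly like the following patch against the InferenceService. The actual contents of serving/kserve/inferenceservice-modelcar.yaml are not reproduced in this ADR, so the volume names and the exact field layout here are assumptions; the values (16Gi, NCCL_CUMEM_ENABLE=0, /root/.cache, 0.7) come from the list above.

```python
import json

SHM_SIZE = "16Gi"

# Sketch of the lever-4 additions only; volume names are hypothetical.
predictor_patch = {
    "spec": {
        "predictor": {
            "volumes": [
                # tmpfs-backed /dev/shm so NCCL init does not hang.
                {"name": "dshm",
                 "emptyDir": {"medium": "Memory", "sizeLimit": SHM_SIZE}},
                # Per-pod cache in the lab; a persistent volume in production.
                {"name": "cache", "emptyDir": {}},
            ],
            "model": {
                "env": [{"name": "NCCL_CUMEM_ENABLE", "value": "0"}],
                "volumeMounts": [
                    {"name": "dshm", "mountPath": "/dev/shm"},
                    {"name": "cache", "mountPath": "/root/.cache"},
                ],
                # Within the 0.7-0.8 ceiling stated above.
                "args": ["--gpu-memory-utilization=0.7"],
            },
        }
    }
}

print(json.dumps(predictor_patch, indent=2))
```

Note that shareProcessNamespace and securityContext are deliberately absent, matching the statement that the KServe webhook injects them.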
Alternatives considered
- One-knob fix (e.g. only GKE Image Streaming). Rejected: fixes lever 2 alone, leaves model load, node 0→1, and runtime untouched. The four levers are additive and independent.
- GKE accelerators as the baseline (Hyperdisk ML / Image Streaming / Custom Compute Classes). Rejected as default: cloud-specific, breaks the portability claim (ADR-0016). Kept as optional profile-gated scale-paths.
- Keep a warm replica (minReplicas: 1) to dodge cold start entirely. That is the ADR-0014 deployment-profiles knob, not a cold-start fix: it trades $0-idle for latency. Orthogonal; both can apply.
- Bake torch.compile cache into the image. Rejected: re-bakes on every model/engine bump and couples cache to image lifecycle; the /root/.cache volume decouples it.
- Drop runai_streamer, plain HF load. Rejected for the baseline: serial load, slower; and HF online load reintroduces the egress dependency (ADR-0016 HF-429).
Consequences
- + Cold start is now a tractable, per-lever design with a portable default everywhere and cloud accelerators layered on by profile, no cloud dependency in the baseline (ADR-0016 rule held).
- + The runtime traps (lever 4) are fixed for qwen-oci at near-zero cost and are validated on the 1GB/1-GPU lab, exercising the same code paths a real model hits.
- − The 16Gi /dev/shm tmpfs is charged against the pod memory limit; size the pod memory above shm + engine RSS, or the kernel OOM-kills on shm allocation.
- − The /root/.cache emptyDir is per-pod in the lab, so scale-to-zero re-captures CUDA graphs on the next cold start (a validation caveat of $0-idle, not a reason to skip the cache; production uses a persistent volume).
- − Levers 1-3 baseline (runai_streamer, LocalModelCache, zstd/in-region, pause-pod buffer) are sequenced to model delivery at scale; until then the lab pays the un-optimized model-load + image-pull on a true cold node.
- − --enforce-eager and Sleep Mode are deferred per-profile decisions; choosing wrong trades wake latency for steady-state throughput (or vice versa).
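The shm sizing consequence reduces to a one-line inequality: the pod memory limit must exceed shm + engine RSS. A minimal sketch of that check follows; the 8 GiB engine RSS and the 10% headroom are placeholder assumptions, not measured values.

```python
GIB = 1024 ** 3


def min_pod_memory_limit(shm_bytes: int, engine_rss_bytes: int,
                         headroom_percent: int = 10) -> int:
    """Smallest memory limit that covers tmpfs shm plus engine RSS.

    A Memory-medium emptyDir is charged to the pod's memory limit, so the
    full shm size has to be budgeted alongside the engine's resident set.
    """
    needed = shm_bytes + engine_rss_bytes
    # Integer arithmetic keeps the result exact in bytes.
    return needed * (100 + headroom_percent) // 100


shm = 16 * GIB         # the /dev/shm sizeLimit from lever 4
engine_rss = 8 * GIB   # hypothetical engine resident set size

limit = min_pod_memory_limit(shm, engine_rss)
print(f"{limit / GIB:.1f} GiB")
```

Any limit below `shm + engine_rss` risks the OOM kill described above once the tmpfs fills.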
References
- Builds on the model-delivery default and generic-first rule ADR-0016; the scale-from-zero floor knob ADR-0014; the serving-layer / KServe substrate decision ADR-0006 (cold start is the cost of minReplicas: 0). Cloud-specific accelerators stay profile-gated.
- serving/kserve/inferenceservice-modelcar.yaml (qwen-oci, lever 4 applied). vLLM v0.23.0; KServe v0.19.0.
- vLLM runai_model_streamer, sleep_mode, cuda_graphs; GKE hyperdisk-ml / image-streaming / custom-compute-classes / active-buffer; vllm#24541, vllm#23115, vllm#21051.