The four levers
- Model weight load: pulling and mapping weights into the engine.
- Container image pull: getting the multi-GB OCI image onto the node.
- Node 0→1: provisioning a GPU node when the pool is empty, which takes minutes.
- Runtime / scale-from-zero: what vLLM does between container start and `/health` passing.
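The runtime lever can be measured directly as the time from container start to the first successful `/health` response. The following is a minimal sketch, assuming the engine answers `/health` with HTTP 200 once it is ready; the function names, the injectable `probe`, `clock`, and `sleep` parameters, and the default timeout are illustrative and not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def seconds_until_healthy(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    timeout_s: float = 600.0,   # assumed budget, tune per profile
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until probe(url) succeeds and return the elapsed seconds.

    Raises TimeoutError if the endpoint is not healthy within timeout_s.
    """
    start = clock()
    while True:
        if probe(url):
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"{url} not healthy after {timeout_s}s")
        sleep(interval_s)
```

Starting the timer when the container starts (for example from an init hook) isolates the runtime lever from image pull and node provisioning, which happen earlier.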
Portable baseline, cloud accelerators on top
The design separates a cloud-agnostic baseline (ships in every profile) from per-cloud accelerators (optional, profile-gated). Making cloud-specific tools the default would break the portability claim.
- Weight load. Baseline: weights ship as a digest-pinned OCI modelcar, Hugging Face runs fully offline (`HF_HUB_OFFLINE=1` + `TRANSFORMERS_OFFLINE=1`), and `--load-format runai_streamer` streams the weights into GPU memory in parallel with no cloud dependency. On GKE, Hyperdisk ML (read-only, fan-out to many nodes) is an optional accelerator for very large models.
- Image pull. Baseline: a lean base plus zstd-compressed layers, an in-region registry to avoid cross-region egress, and digest-pinning so the pull policy is `IfNotPresent` and a warm node never re-pulls (an omitted tag or `:latest` defaults to `Always`, and any mutable tag needs `Always` to be safe). On GKE, Image Streaming lazy-pulls from Artifact Registry.
- Node 0→1. Baseline: a warm buffer of low-priority pause pods keeps a node provisioned, so a real pod preempts them instead of waiting on provisioning. On GKE, Custom Compute Classes add Spot→on-demand fallback.
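The baseline weight-load and image-pull settings both land on the serving container, so they can be sketched together. This is a minimal sketch that builds the container spec as a plain dict: the container name `vllm` and the helper names are hypothetical, and the rule "anything not digest-pinned gets `Always`" is this sketch's conservative reading of the policy above, not a Kubernetes default.

```python
import re

# A digest-pinned reference ends in "@sha256:" plus 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True when the image reference is immutable (pinned by digest)."""
    return bool(_DIGEST_RE.search(image_ref))


def pull_policy_for(image_ref: str) -> str:
    """IfNotPresent is only safe for immutable references.

    A tag can be re-pointed, so a warm node could keep serving a stale image;
    this sketch falls back to Always for every tagged reference.
    """
    return "IfNotPresent" if is_digest_pinned(image_ref) else "Always"


def baseline_container(image_ref: str) -> dict:
    """Container fragment with the cloud-agnostic baseline settings."""
    return {
        "name": "vllm",  # hypothetical container name
        "image": image_ref,
        "imagePullPolicy": pull_policy_for(image_ref),
        "env": [
            # Keep Hugging Face libraries from reaching the network at startup.
            {"name": "HF_HUB_OFFLINE", "value": "1"},
            {"name": "TRANSFORMERS_OFFLINE", "value": "1"},
        ],
        # Stream weights into GPU memory in parallel.
        "args": ["--load-format", "runai_streamer"],
    }
```

Deriving the pull policy from the reference, rather than setting it by hand, keeps a profile from pairing a mutable tag with `IfNotPresent` by accident.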
Runtime traps are cheap and load-bearing
The runtime lever is fixed now on the modelcar InferenceService, at near-zero cost, and it exercises the same code paths a real model hits:
- `/dev/shm` tmpfs ≥16Gi. The container default is tiny and NCCL hangs on init. Mount an `emptyDir{medium: Memory, sizeLimit: 16Gi}` at `/dev/shm`, paired with `NCCL_CUMEM_ENABLE=0`: the known-good combo against the NCCL init hang.
- `--gpu-memory-utilization` 0.7-0.8 is the ceiling; 0.6 is conservative-safe. Higher starves the transient startup VRAM (graph capture, activation buffers) and trips a startup OOM before `/health`.
- `--enforce-eager` only when cold-start wins over throughput. CUDA-graph capture adds >10s; skip it to wake faster, keep it when steady-state throughput matters. A per-profile call.
- Back `/root/.cache` with a persistent volume in production so CUDA-graph compilation is not re-captured on every cold start. A per-pod `emptyDir` re-captures on each wake.
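The four runtime settings above can be collected into one pod-spec fragment. This is a minimal sketch that emits the fragment as a plain dict: the volume names `dshm` and `cache`, the PVC name `vllm-cache`, the container name, and the function itself are hypothetical, and the 0.8 guard simply encodes the ceiling stated above.

```python
def runtime_pod_fragment(
    gpu_memory_utilization: float = 0.7,
    enforce_eager: bool = False,
    shm_size: str = "16Gi",
    cache_pvc: str = "vllm-cache",  # hypothetical PVC name
) -> dict:
    """Build the runtime-related parts of a pod spec (not a full manifest)."""
    if not 0.0 < gpu_memory_utilization <= 0.8:
        # Above the ceiling, transient startup VRAM can trip an OOM.
        raise ValueError("gpu_memory_utilization must be in (0, 0.8]")

    args = ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if enforce_eager:
        # Skips CUDA-graph capture: faster wake, lower steady-state throughput.
        args.append("--enforce-eager")

    return {
        "containers": [{
            "name": "vllm",  # hypothetical container name
            "args": args,
            "env": [{"name": "NCCL_CUMEM_ENABLE", "value": "0"}],
            "volumeMounts": [
                {"name": "dshm", "mountPath": "/dev/shm"},
                {"name": "cache", "mountPath": "/root/.cache"},
            ],
        }],
        "volumes": [
            # Memory-backed tmpfs replaces the small default /dev/shm.
            {"name": "dshm",
             "emptyDir": {"medium": "Memory", "sizeLimit": shm_size}},
            # A persistent claim keeps compiled artifacts across cold starts.
            {"name": "cache",
             "persistentVolumeClaim": {"claimName": cache_pvc}},
        ],
    }
```

A cold-start-first profile would call `runtime_pod_fragment(enforce_eager=True)`, while a throughput-first profile keeps the default and pays the graph-capture time once.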
The `/dev/shm` tmpfs counts against the pod memory limit, so set the limit above the shm size plus engine RSS; otherwise the kernel OOM-kills the container on allocation.
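The sizing rule is simple arithmetic, sketched below. Only the 16Gi shm figure comes from the text; the engine RSS value and the additive headroom are hypothetical inputs that would need to be measured per model and profile.

```python
def min_pod_memory_gi(shm_gi: int, engine_rss_gi: int, headroom_gi: int = 2) -> int:
    """Smallest pod memory limit (GiB) that covers shm plus engine RSS.

    headroom_gi is an assumed safety margin, not a measured value.
    """
    if shm_gi < 0 or engine_rss_gi < 0 or headroom_gi < 0:
        raise ValueError("sizes must be non-negative")
    return shm_gi + engine_rss_gi + headroom_gi


def limit_is_safe(limit_gi: int, shm_gi: int, engine_rss_gi: int) -> bool:
    """True when the limit strictly exceeds shm plus engine RSS."""
    return limit_gi > shm_gi + engine_rss_gi


# Worked example with a hypothetical 24 GiB engine RSS:
# 16 (shm) + 24 (RSS) + 2 (headroom) = 42 GiB.
example_limit = min_pod_memory_gi(shm_gi=16, engine_rss_gi=24)
```

A limit equal to the shm size alone would pass schema validation and still be OOM-killed as soon as the engine fills the tmpfs, which is why the check compares against the sum.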