serving/raw-vllm OpenAI-compatible endpoint: scale-to-zero,
GPU-capacity fallback, and the GitOps gotchas hit standing it up.

Operating it (default-off, scale-to-zero)
The Deployment shipsreplicas: 0: no GPU node, $0 idle. Bring it up for a session,
tear it down after:
make vllm-up blocks on kubectl rollout status (up to 18m: a cold node is provision +
~8GB image pull + model download + CUDA-graph capture). The model-cache PVC keeps the
weights across restarts so a warm node skips the HuggingFace re-pull.
1. make vllm-up is reverted to 0 by Argo CD
Symptom: you scale to 1, the pod starts (or even loads the model), then the Deployment
drops back to 0/0 and the pod is deleted within a minute or two.
Cause: the App pins replicas: 0 in git with ignoreDifferences on /spec/replicas.
ignoreDifferences only suppresses the diff display; it does not exclude the field
from the sync patch. So the next auto-sync reapplies replicas: 0 and undoes the manual
scale.
Fix: add RespectIgnoreDifferences=true to the App’s syncOptions (alongside
ignoreDifferences). That makes sync skip the ignored field, so a manual kubectl scale
sticks, the standard pattern for an externally-scaled (HPA-style) Deployment.
2. GPU pod stuck Pending: “GCE out of resources”
Symptom: aftermake vllm-up the pod is Pending; events show
FailedScaleUp ... GCE out of resources and the autoscaler goes into backoff.
Cause: the requested GPU has no stock in the zone right now. L4 (g2) in
us-central1-a is frequently exhausted, on both spot and on-demand.
Fix: the Deployment is GPU-type-agnostic: it requests nvidia.com/gpu: 1 and
tolerates the GPU taint but pins no cloud.google.com/gke-accelerator, so the autoscaler
falls back to whatever GPU pool has capacity. Keep a second pool for this:
Set gpu_node_pool_name = "gpu-t4", gpu_machine_type = "n1-standard-4", and
gpu_accelerator_type = "nvidia-tesla-t4" in infra/gke/terraform/terraform.tfvars, then run
make tf-apply.
Both pools are scale-to-zero ($0 idle). The 0.5B model fits either (L4 23GB / T4 16GB).
Check the live capacity / quota before assuming it’s gone for good:
3. Rollout deadlocks on a template change
Symptom: editing the Deployment (e.g. args) while it’s scaled up leaves a new podPending forever with FailedScaleUp, and the old pod never terminates.
Cause: the project’s GPUS_ALL_REGIONS quota is 1: only one GPU node cluster-wide.
A RollingUpdate surge pod needs a second GPU before the old one frees, which the quota
forbids, so the roll never completes.
Fix: the Deployment uses strategy: Recreate: old pod terminates before the new one
schedules. Correct for single-GPU serving; the brief downtime is unavoidable with one GPU.
4. Auth: /v1/* needs the key, /metrics does not
vLLM enforces VLLM_API_KEY (ESO-provisioned from GCP Secret Manager) on /v1/* but
serves /health and /metrics unauthenticated, so the Prometheus ServiceMonitor scrapes
without credentials. A 401 on a chat request means the vllm-api-key Secret is missing or
stale:
SecretSyncError, the backing Secret Manager value doesn’t exist yet; create it
(see serving/raw-vllm/README.md).