
1. GPU Operator pods never schedule: “insufficient quota to match these scopes”
Symptom: after thegpu-operator Argo CD app syncs, the operator + NFD Deployments stay at
0/1 with zero pods (not even Pending). Events show:
system-node-critical / system-cluster-critical priority classes
to namespaces that explicitly grant them via a scoped ResourceQuota (enforced through a
cluster-level admission config; there is no visible ResourceQuota to find). The GPU Operator’s
components run at system-node-critical, so admission rejects their pods in a fresh namespace.
Fix: grant the gpu-operator namespace a ResourceQuota scoped to those priority classes
(platform/gpu-operator/prereqs/resourcequota.yaml), synced before the operator (sync-wave 1
< the operator’s wave 2). Pods schedule immediately once it exists.
Diagnose:
2. GPU node never provisions: “GCE quota exceeded”
Symptom: a pod requestingnvidia.com/gpu stays Pending; the autoscaler triggers a scale-up
that then fails and gives up (cluster_autoscaler_unhelpable_until: Inf):
GPUS_ALL_REGIONS(global) gates all GPU usage regardless of region. Default 0 → blocks everything.PREEMPTIBLE_CPUS(regional): Spot/preemptible nodes (e.g.g2-standard-8= 8 vCPU) count here. Default 0. (On-demand nodes use the regionalCPUSquota instead.)
NVIDIA_L4_GPUS is necessary but not sufficient: GPUS_ALL_REGIONS must also be ≥ the count.
Fix: request quota increases (Console → IAM & Admin → Quotas, or the Cloud Console quota page):
GPUs (all regions)→ ≥ 1Preemptible CPUsin the region → ≥ 8 (for a single g2-standard-8 spot node), or drop--spotand ensure regionalCPUshas headroom.
Note: an L4 only comes on ag2-*machine, and the smallest (g2-standard-4) is already 4 vCPU. With a 4-vCPU spot default pool, a spot L4 needsPREEMPTIBLE_CPUS≥ 8 region-wide (default-pool 4 + GPU 4). Granting exactly 8 fits only the smallest L4 shape;g2-standard-8would need 12.
3. Self-managed GPU Operator on GKE: driver validates but the nvidia runtime never registers
Symptom: with the GPU Operator (driver.enabled=true) on a GKE Ubuntu node, the
driver itself works (nvidia-smi succeeds inside nvidia-driver-daemonset), but
nvidia-operator-validator, nvidia-device-plugin-daemonset, nvidia-dcgm-exporter,
and gpu-feature-discovery stay in Init / CreateContainerError:
nvidia.com/gpu is never advertised, so GPU pods stay Pending with
Insufficient nvidia.com/gpu.
Two distinct causes, in order:
- driver-validation watches the wrong directory.
hostPaths.driverInstallDirwas set to GKE’s managed-driver path/home/kubernetes/bin/nvidia, but withdriver.enabled=truethe operator installs to the chart default/run/nvidia/driver. Thedriver-validationinit container logsfailed to validate the driveron a 5s loop forever. Fix: remove thehostPaths.driverInstallDiroverride (keeptoolkit.installDiron the writable GKE path). This unblocks validation but exposes cause 2. - the container-toolkit cannot configure GKE’s containerd
(gpu-operator#1679). The
toolkit writes a drop-in to
/etc/containerd/conf.d/99-nvidia.tomland SIGHUPs containerd, but GKE has no/etc/containerd/config.tomlat the standard path and does not import that drop-in dir, so thenvidiaruntime is never registered.driver.enabled=falsedoes not help: the toolkit still runs and still fails.
gpu-driver-version=default, COS image, GKE’s device plugin), and run
dcgm-exporter standalone for metrics.
Diagnose:
4. Own dcgm-exporter crashes on GKE: GKE already runs one (embedded DCGM conflict)
Symptom: the NVIDIAdcgm-exporter Helm chart pod starts then exits 1 immediately on a
GKE GPU node (distroless image → only Starting dcgm-exporter then exit status 1), even though
the driver libs are mounted and nvidia-smi works.
Cause: GKE runs its own dcgm-exporter (DaemonSet in gke-managed-system) with an
embedded DCGM engine on every GPU node. A second embedded engine can’t co-attach to the
same GPUs, so our exporter dies. (Separate, earlier issue: on GKE COS there’s no nvidia
RuntimeClass, so a self-run exporter must mount the host driver
/home/kubernetes/bin/nvidia → /usr/local/nvidia, set LD_LIBRARY_PATH=/usr/local/nvidia/lib64,
and run privileged, but that only gets you to the embedded-engine conflict above.)
Fix: don’t run our own. Scrape GKE’s managed dcgm-exporter into the in-cluster Prometheus
with a PodMonitor (platform/dcgm-metrics). GKE’s exporter already emits DCGM_FI_DEV_* with
pod→GPU attribution on :9400. On a self-managed / GPU-Operator cluster, run the chart instead.
Diagnose: