Pending GPU workloads - Kubernetes LLM Platform

How to tell why a GPU Job isn’t running, plus the gotchas we hit while standing up Kueue. A Job that isn’t running is either suspended by Kueue (waiting on quota, which is working as intended) or admitted by Kueue but Pending at the scheduler (no GPU node available yet). The two cases look different and need different actions.

Is it Kueue (quota) or the scheduler (capacity)?

kubectl get localqueue -A                 # PENDING vs ADMITTED counts per queue
kubectl get workloads -A                  # ADMITTED column: True = past Kueue
kubectl -n <ns> get workload <name> -o jsonpath='{.status.conditions}'

Workload not admitted → Kueue is holding it. The condition message is explicit, e.g. couldn't assign flavors to pod set main: insufficient unused quota for nvidia.com/gpu in flavor gpu-any, 1 more needed. There is no pod yet: the Job is suspended. This is correct behaviour when another workload holds the single GPU’s quota; it admits automatically when quota frees (or on preemption). Nothing to fix.
Workload admitted but the pod is Pending → it’s past Kueue; now it’s a scheduling/capacity problem (GPU node provisioning, or GCE out of resources). See docs/public/guides/vllm-serving.md §2 (GPU-agnostic scheduling + T4 fallback).
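The two cases above can be told apart from the Workload's Admitted condition alone. The following is a minimal sketch, not part of the platform tooling: the function name gpu_job_state is hypothetical, and it assumes the Workload exposes a condition of type Admitted, which is what the ADMITTED column reflects.

```shell
# Hypothetical helper: classify one Workload as "held by Kueue" or "past Kueue".
# usage: gpu_job_state <namespace> <workload-name>
gpu_job_state() {
  local ns="$1" wl="$2" admitted
  # Status of the Admitted condition ("True"/"False"); empty if not set yet.
  admitted="$(kubectl -n "$ns" get workload "$wl" \
    -o jsonpath='{.status.conditions[?(@.type=="Admitted")].status}')"
  if [ "$admitted" = "True" ]; then
    echo "admitted: past Kueue; if the pod is Pending, look at scheduling/capacity"
  else
    echo "not admitted: held by Kueue on quota; read the condition message"
  fi
}
```

Call it as gpu_job_state <ns> <name>, using the Workload name shown by kubectl get workloads -A.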

Check the label first. A Job without the kueue.x-k8s.io/queue-name label isn’t governed by Kueue at all: manageJobsWithoutQueueName is false, so Kueue ignores unlabelled Jobs and never suspends them.
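One way to check the label is to read it directly off the Job. This is a sketch under the assumption that the workload is a batch Job; job_queue_name is a hypothetical helper name. Dots inside the label key have to be escaped in kubectl's jsonpath.

```shell
# Hypothetical helper: print the Kueue queue a Job is labelled with, if any.
# usage: job_queue_name <namespace> <job-name>
job_queue_name() {
  local ns="$1" job="$2" queue
  # Escape the dots in the label key so jsonpath treats it as one field.
  queue="$(kubectl -n "$ns" get job "$job" \
    -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/queue-name}')"
  if [ -n "$queue" ]; then
    echo "queue=$queue"
  else
    echo "no queue-name label: Kueue is not managing this Job"
    return 1
  fi
}
```

A non-zero exit status means the Job bypasses Kueue, so neither quota nor preemption applies to it.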

1. ClusterQueue / ResourceFlavor fail to apply: “no endpoints available” for the webhook

Symptom: on first sync the ClusterQueue and ResourceFlavor error with failed calling webhook "mclusterqueue.kb.io": ... no endpoints available for service "kueue-webhook-service". (LocalQueues/WorkloadPriorityClasses may apply fine.)

Cause: a sync-wave ordering race. The queue config (wave 3) was applied before the Kueue controller’s webhook pod had endpoints. Kueue uses internal cert management (no cert-manager), and the webhook only serves once the controller is Ready.

Fix: no permanent change is needed. Argo CD selfHeal retries and succeeds once the controller is up. To force a retry: kubectl -n argocd annotate application kueue-config argocd.argoproj.io/refresh=hard --overwrite. Confirm the webhook is ready first:

kubectl -n kueue-system rollout status deploy/kueue-controller-manager
kubectl -n kueue-system get endpoints kueue-webhook-service
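If the readiness check needs to be scripted (for example before forcing a refresh), the endpoints lookup can be wrapped in a small polling loop. This is a sketch only: the function names, retry count and delay are arbitrary assumptions, not values used by the platform.

```shell
# Succeeds once kueue-webhook-service has at least one ready address.
webhook_has_endpoints() {
  local ips
  ips="$(kubectl -n kueue-system get endpoints kueue-webhook-service \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  [ -n "$ips" ]
}

# Poll up to <tries> times, <delay> seconds apart (defaults are arbitrary).
# usage: wait_for_webhook [tries] [delay-seconds]
wait_for_webhook() {
  local tries="${1:-30}" delay="${2:-2}" i=0
  while [ "$i" -lt "$tries" ]; do
    webhook_has_endpoints && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

wait_for_webhook returns 0 as soon as an address appears and 1 if none shows up within the retry budget.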

2. ClusterQueue stuck OutOfSync forever (Argo drift)

Symptom: kueue-config never reaches Synced; only ClusterQueue/gpu-cq is OutOfSync.

Cause: the ClusterQueue mutating webhook fills defaults not present in git: queueingStrategy: BestEffortFIFO, stopPolicy: None, flavorFungibility: {whenCanBorrow: MayStopSearch, whenCanPreempt: TryNextFlavor}, preemption.borrowWithinCohort.policy: Never. Argo then sees live ≠ desired every reconcile.

Fix: set those fields explicitly in the manifest so git matches the mutated object (done in platform/kueue-config/clusterqueue.yaml). Preferable to ignoreDifferences: the config stays self-documenting.
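To see exactly what the webhook filled in, the defaulted fields can be read back from the live object and compared with the manifest in git. A minimal sketch, assuming the ClusterQueue is named gpu-cq as above; cq_defaulted_fields is a hypothetical helper name.

```shell
# Print the live values of the fields the mutating webhook defaults.
# usage: cq_defaulted_fields [clusterqueue-name]   (defaults to gpu-cq)
cq_defaulted_fields() {
  local cq="${1:-gpu-cq}"
  # Literal text outside {} is echoed as-is by kubectl's jsonpath output.
  kubectl get clusterqueue "$cq" -o jsonpath='queueingStrategy: {.spec.queueingStrategy}{"\n"}stopPolicy: {.spec.stopPolicy}{"\n"}flavorFungibility: {.spec.flavorFungibility}{"\n"}preemption: {.spec.preemption}{"\n"}'
}
```

Any field printed here but missing from the manifest is a candidate source of the drift.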

3. ClusterQueue Active=False

kubectl get clusterqueue gpu-cq -o jsonpath='{.status.conditions[?(@.type=="Active")]}'

Common reasons: a flavors[].name references a ResourceFlavor that doesn’t exist, or coveredResources omits a resource the flavor lists. Fix the reference; it goes Active within seconds.
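The first reason, a dangling flavor reference, can be checked mechanically by listing the flavor names under spec.resourceGroups and looking each one up. This is a sketch; missing_flavors is a hypothetical helper, and it does not cover the coveredResources mismatch.

```shell
# Print each ResourceFlavor name the ClusterQueue references that does not exist.
# usage: missing_flavors [clusterqueue-name]   (defaults to gpu-cq)
missing_flavors() {
  local cq="${1:-gpu-cq}" flavor
  for flavor in $(kubectl get clusterqueue "$cq" \
      -o jsonpath='{.spec.resourceGroups[*].flavors[*].name}'); do
    # ResourceFlavor is cluster-scoped, so no namespace flag is needed.
    kubectl get resourceflavor "$flavor" >/dev/null 2>&1 || echo "$flavor"
  done
}
```

Empty output means every referenced flavor exists, so the Active=False reason lies elsewhere.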

Verify admission + preemption (the demo)

kubectl apply -f workloads/kueue-demo/job-team-a.yaml   # admitted, holds the GPU
kubectl apply -f workloads/kueue-demo/job-team-b.yaml   # suspended on quota
kubectl get workloads -A                                # team-a ADMITTED=True, team-b pending

Kueue injects the nvidia.com/gpu toleration onto team-a’s pod from the gpu-any flavor (the Job spec carries none), and the autoscaler brings up whichever GPU pool has capacity. Set team-b’s kueue.x-k8s.io/priority-class to high-priority to preempt team-a. Clean up with kubectl -n team-a delete job gpu-job-a / kubectl -n team-b delete job gpu-job-b.
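To confirm that team-a was actually preempted rather than finishing on its own, the Workload's eviction condition can be inspected. This sketch assumes Kueue records preemption as a condition of type Evicted whose reason names the cause; workload_eviction_reason is a hypothetical helper, and the Workload name has to be taken from kubectl get workloads -A.

```shell
# Print why a Workload was evicted, or "not evicted" if no such condition exists.
# usage: workload_eviction_reason <namespace> <workload-name>
workload_eviction_reason() {
  local ns="$1" wl="$2" reason
  reason="$(kubectl -n "$ns" get workload "$wl" \
    -o jsonpath='{.status.conditions[?(@.type=="Evicted")].reason}')"
  echo "${reason:-not evicted}"
}
```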
