Is it Kueue (quota) or the scheduler (capacity)?
- Workload not admitted → Kueue is holding it. The condition message is explicit, e.g.
  couldn't assign flavors to pod set main: insufficient unused quota for nvidia.com/gpu in flavor gpu-any, 1 more needed.
  There is no pod yet: the Job is suspended. This is correct behaviour when another workload
  holds the single GPU’s quota; it admits automatically when quota frees (or on preemption).
  Nothing to fix.
- Workload admitted but the pod is Pending → it’s past Kueue; now it’s a scheduling/capacity
  problem (GPU node provisioning, or GCE out of resources).
  See docs/public/guides/vllm-serving.md §2 (GPU-agnostic scheduling + T4 fallback).
- A Job that Kueue does not manage (no kueue.x-k8s.io/queue-name label) is never
  suspended: manageJobsWithoutQueueName is false, so check the label first.
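The triage above can be sketched as a few shell helpers. This is a minimal sketch, not a definitive tool: the function names are hypothetical, the Workload name has to be looked up first (for example with kubectl -n team-a get workloads.kueue.x-k8s.io), and it assumes a Kueue version that reports the quota message on the QuotaReserved condition.

```shell
# Print the queue-name label of a Job. Empty output means Kueue ignores the Job.
job_queue_name() {
  local ns="$1" job="$2"
  kubectl -n "$ns" get job "$job" \
    -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/queue-name}'
}

# Print one field (status, reason or message) of one Workload condition.
# Assumption: condition types are named Admitted and QuotaReserved.
workload_condition() {
  local ns="$1" wl="$2" ctype="$3" field="$4"
  kubectl -n "$ns" get workloads.kueue.x-k8s.io "$wl" \
    -o jsonpath="{.status.conditions[?(@.type==\"$ctype\")].$field}"
}

# Classify from two observed facts: the Admitted status ("True", "False" or
# empty when the condition is absent) and the pod phase (empty when no pod exists).
classify_wait() {
  local admitted="$1" pod_phase="$2"
  if [ "$admitted" != "True" ]; then
    echo "kueue: waiting for quota"
  elif [ "$pod_phase" = "Pending" ]; then
    echo "scheduler: waiting for capacity"
  else
    echo "ok: admitted and ${pod_phase:-no pod yet}"
  fi
}
```

For example, classify_wait "$(workload_condition team-a <workload> Admitted status)" Pending separates the quota case from the capacity case, and workload_condition team-a <workload> QuotaReserved message prints the explanation Kueue recorded.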
1. ClusterQueue / ResourceFlavor fail to apply: “no endpoints available” for the webhook
Symptom: on first sync the ClusterQueue and ResourceFlavor error with
failed calling webhook "mclusterqueue.kb.io": ... no endpoints available for service "kueue-webhook-service". (LocalQueues/WorkloadPriorityClasses may apply fine.)
Cause: a sync-wave ordering race: the queue config (wave 3) was applied before the Kueue
controller’s webhook pod had endpoints. Kueue uses internal cert management (no
cert-manager), and the webhook only serves once the controller is Ready.
Fix: none needed long-term: Argo CD selfHeal retries and succeeds once the controller is
up. To force it: kubectl -n argocd annotate application kueue-config argocd.argoproj.io/refresh=hard --overwrite.
Confirm the webhook is ready first:
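One way to check readiness, as a sketch: it assumes Kueue is installed with its default names (namespace kueue-system, Deployment kueue-controller-manager), which may differ in this cluster. The helper names are hypothetical; only the Service name kueue-webhook-service comes from the error message above.

```shell
# Assumption: default Kueue install namespace; override with KUEUE_NS if needed.
KUEUE_NS="${KUEUE_NS:-kueue-system}"

# Blocks until the controller Deployment reports Available, or fails after 120s.
kueue_controller_ready() {
  kubectl -n "$KUEUE_NS" wait deployment/kueue-controller-manager \
    --for=condition=Available --timeout=120s
}

# Prints the endpoint IPs behind the webhook Service. Empty output means the
# "no endpoints available" error will persist.
kueue_webhook_endpoints() {
  kubectl -n "$KUEUE_NS" get endpoints kueue-webhook-service \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}
```

If kueue_webhook_endpoints prints at least one IP, the hard refresh should go through.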
2. ClusterQueue stuck OutOfSync forever (Argo drift)
Symptom: kueue-config never reaches Synced; only ClusterQueue/gpu-cq is OutOfSync.
Cause: the ClusterQueue mutating webhook fills defaults not present in git:
queueingStrategy: BestEffortFIFO, stopPolicy: None,
flavorFungibility: {whenCanBorrow: MayStopSearch, whenCanPreempt: TryNextFlavor},
preemption.borrowWithinCohort.policy: Never. Argo then sees live ≠ desired every reconcile.
Fix: set those fields explicitly in the manifest so git matches the mutated object
(done in platform/kueue-config/clusterqueue.yaml). This is preferable to ignoreDifferences
because the config stays self-documenting.
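To see exactly what the webhook wrote, the live object can be read back and compared with git. A small sketch follows; the helper name show_cq_defaults is hypothetical, and the way kubectl prints nested objects depends on the kubectl version.

```shell
# Print the webhook-defaulted fields from the live ClusterQueue, one per line,
# so they can be copied into the manifest in git.
show_cq_defaults() {
  local cq="${1:-gpu-cq}" field
  for field in queueingStrategy stopPolicy flavorFungibility preemption; do
    printf '%s: ' "$field"
    kubectl get clusterqueue "$cq" -o jsonpath="{.spec.$field}"
    echo
  done
}
```

Any field that shows up here but is absent from the manifest is a candidate for the permanent diff.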
3. ClusterQueue Active=False
Cause: flavors[].name references a ResourceFlavor that doesn’t exist, or
coveredResources omits a resource the flavor lists.
Fix: correct the reference; the ClusterQueue goes Active within seconds.
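A quick way to locate the broken reference is to read the Active condition and compare the flavors the ClusterQueue names against the ResourceFlavors that exist. This is a sketch with hypothetical helper names; gpu-cq is the ClusterQueue from the previous section.

```shell
# Print status, reason and message of the ClusterQueue's Active condition.
cq_active_condition() {
  local cq="${1:-gpu-cq}"
  kubectl get clusterqueue "$cq" \
    -o jsonpath='{range .status.conditions[?(@.type=="Active")]}{.status}{" "}{.reason}{": "}{.message}{"\n"}{end}'
}

# Flavor names the ClusterQueue references in its resource groups.
cq_referenced_flavors() {
  local cq="${1:-gpu-cq}"
  kubectl get clusterqueue "$cq" \
    -o jsonpath='{.spec.resourceGroups[*].flavors[*].name}'
}

# ResourceFlavors that actually exist in the cluster.
existing_flavors() {
  kubectl get resourceflavors -o name
}
```

A name printed by cq_referenced_flavors that is missing from existing_flavors is the reference to fix.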
Verify admission + preemption (the demo)
On admission, Kueue injects the nvidia.com/gpu toleration onto team-a’s pod from the gpu-any
flavor (the Job spec carries none), and the autoscaler brings up whichever GPU pool has
capacity. Set team-b’s kueue.x-k8s.io/priority-class to high-priority to preempt team-a.
Clean up with kubectl -n team-a delete job gpu-job-a and kubectl -n team-b delete job gpu-job-b.
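While the demo runs, the admission and preemption state of both teams can be read from their Workloads. The helper below is a sketch with a hypothetical name; it assumes the Workload spec exposes queueName and priorityClassName and that an Evicted condition is set on preemption, which may vary between Kueue versions.

```shell
# Show each team's Workloads with queue, priority class and admission state.
demo_status() {
  local ns
  for ns in team-a team-b; do
    kubectl -n "$ns" get workloads.kueue.x-k8s.io \
      -o custom-columns='NAME:.metadata.name,QUEUE:.spec.queueName,PRIORITY:.spec.priorityClassName,ADMITTED:.status.conditions[?(@.type=="Admitted")].status,EVICTED:.status.conditions[?(@.type=="Evicted")].status'
  done
}
```

After team-b is raised to high-priority, team-a’s Workload should lose ADMITTED=True and team-b’s should gain it.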