Context
GPUs are the scarce resource: this cluster runs one GPU at a time (GPUS_ALL_REGIONS = 1). Multiple workloads (and, conceptually, multiple teams) will want it.
We need to ration it fairly: admit work when capacity is free, queue it when not, and let
urgent work preempt best-effort work, without each submitter busy-waiting or the scheduler
thrashing.
This is an admission/queueing problem, not an autoscaling one. It must not be confused with:
- HPA / KEDA: scale replicas of a service on load.
- Cluster Autoscaler / Karpenter: add/remove nodes.
- Gateway API Inference Extension: route a request to the best replica.
Options
- Plain `ResourceQuota` + `PriorityClass` + scheduler preemption. Hard per-namespace caps and priority preemption with zero add-ons. But over-quota pods are rejected, not queued; there is no borrowing, no fair-sharing, no cohort, no “wait your turn.” Operationally weak for contended GPUs.
- Volcano. A replacement batch scheduler (ex `kube-batch`) with strong gang scheduling: all-or-nothing placement for multi-node distributed training/HPC. Powerful, but it supplants the default scheduler and is heavier than this single-GPU, single-node need.
- Kueue. A quota/admission layer that sits on top of the default scheduler: it suspends
an opted-in Job until its ClusterQueue has free quota, then un-suspends it. Provides nominal
quota, borrowing within a cohort, fair-sharing, preemption, and flavor fungibility.
A `kubernetes-sigs` project; integrates with Job, JobSet, Kubeflow, Ray.
Decision
Use Kueue (v0.18.1, OCI Helm chart) for GPU quota and admission.
- One ResourceFlavor `gpu-any` (no nodeLabels; carries the `nvidia.com/gpu` toleration Kueue injects on admission) and one ClusterQueue `gpu-cq` with an `nvidia.com/gpu` nominal quota of 1, the real physical capacity. Tenant namespaces `team-a` / `team-b` reach it through namespaced LocalQueues. Two WorkloadPriorityClasses drive preemption.
- Manifests are `kueue.x-k8s.io/`**v1beta2** (the served storage version in 0.18; v1beta1 is deprecated). `manageJobsWithoutQueueName: false`: Kueue governs only Jobs that opt in via the `kueue.x-k8s.io/queue-name` label, so the serving stack and system workloads are untouched.
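
A minimal sketch of the objects described above, for orientation only. The names `gpu-any`, `gpu-cq`, `team-a` and the quota of 1 come from this ADR. The LocalQueue name `gpu-lq`, the priority class names `gpu-low` / `gpu-high` and their values, and the toleration's `operator` / `effect` are assumptions made for illustration. Field names follow the Kueue API as commonly documented and should be checked against the installed v1beta2 CRDs.

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-any
spec:
  # No nodeLabels: the flavor is accelerator-agnostic.
  # Kueue adds these tolerations to the pods of an admitted Workload.
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists      # assumed; match the taint on the GPU nodes
      effect: NoSchedule    # assumed
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: gpu-cq
spec:
  namespaceSelector: {}     # accept LocalQueues from any namespace
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-any
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 1   # the real physical capacity
  preemption:
    withinClusterQueue: LowerPriority
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: gpu-lq              # hypothetical name; repeat for team-b
  namespace: team-a
spec:
  clusterQueue: gpu-cq
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: WorkloadPriorityClass
metadata:
  name: gpu-low             # hypothetical name
value: 100                  # hypothetical value
description: Best-effort GPU work; may be preempted.
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: WorkloadPriorityClass
metadata:
  name: gpu-high            # hypothetical name
value: 1000                 # hypothetical value
description: Urgent GPU work; preempts gpu-low.
```

Because the ClusterQueue in this sketch covers only `nvidia.com/gpu`, Jobs submitted to it should request only that resource; covering CPU and memory as well would need additional quota entries.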
On flavor fungibility (deliberately deferred)
The instructive design (list `l4` then `t4` flavors so a job falls back across pools) needs a
GPU budget > 1 to be meaningful (per-flavor quota would otherwise let Kueue admit more jobs
than there are GPUs, pushing the contention down to the scheduler and muddying the queueing
story). With one physical GPU, a single agnostic flavor with nominalQuota: 1 is the honest
model; the l4→t4 fallback already exists one layer down, in the autoscaler (the GPU pods pin no
accelerator, ADR/infra). Revisit when the GPU budget grows.
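
For reference, the deferred two-flavor design might look roughly like the sketch below. It is not applied. The flavor names `gpu-l4` / `gpu-t4`, the queue name `gpu-cq-fungible`, and the GKE-style node label values are assumptions, not part of this ADR. The point is the arithmetic: per-flavor quotas sum to 2 while only 1 GPU exists, so Kueue could admit two Jobs at once.

```yaml
# Hypothetical, deferred design: NOT what this ADR deploys.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-l4                # hypothetical name
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-l4        # assumed label
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-t4                # hypothetical name
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4  # assumed label
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: gpu-cq-fungible       # hypothetical name
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:                # tried in the order listed: l4 first, then t4
        - name: gpu-l4
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 1
        - name: gpu-t4
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 1
      # Total admissible = 1 + 1 = 2 GPUs, but the physical budget is 1.
      # The second admitted Job would then contend at the scheduler,
      # which is the outcome this ADR wants to avoid.
```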
Consequences
- A second GPU job waits in Kueue (`Workload` Pending on quota), not as a failed pod, demonstrated in `workloads/kueue-demo`. A high-priority Workload preempts a running low-priority one.
- Kueue is the batch/offline admission layer. The always-on vLLM serving Deployment is not a natural Kueue object (it never completes); per-request serving scale/overload is a different layer (HPA/KEDA + inference-aware routing), not this ADR.
- Adds one controller (`kueue-system`) with internal cert management, no cert-manager dependency.
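
As an illustration of the first consequence, an opted-in Job could look like the sketch below. This is not the content of `workloads/kueue-demo`: the Job name, the LocalQueue name `gpu-lq`, the priority class `gpu-low`, and the placeholder container are assumptions. Only the two `kueue.x-k8s.io/` label keys and the `nvidia.com/gpu` resource are taken from Kueue and this ADR.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-demo-low                        # hypothetical name
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: gpu-lq       # opt-in; hypothetical LocalQueue
    kueue.x-k8s.io/priority-class: gpu-low  # hypothetical WorkloadPriorityClass
spec:
  suspend: true          # Kueue un-suspends the Job once quota is free
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: hold-gpu
          image: busybox:1.36               # placeholder workload
          command: ["sleep", "300"]         # holds the GPU for five minutes
          resources:
            limits:
              nvidia.com/gpu: 1             # the whole cluster-wide quota
```

Submitting a second copy while the first runs should leave the second Workload pending rather than creating a failing pod. Submitting a copy labeled with a higher-value priority class should evict the running one, assuming the ClusterQueue enables preemption of lower-priority Workloads.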