Skip to main content
Status: Accepted Date: 2026-06-15

Context

GPUs are the scarce resource: this cluster runs one GPU at a time (GPUS_ALL_REGIONS = 1). Multiple workloads (and, conceptually, multiple teams) will want it. We need to ration it fairly: admit work when capacity is free, queue it when not, and let urgent work preempt best-effort work, without each submitter busy-waiting or the scheduler thrashing. This is an admission/queueing problem, not an autoscaling one. It must not be confused with:
  • HPA / KEDA: scale replicas of a service on load.
  • Cluster Autoscaler / Karpenter: add/remove nodes.
  • Gateway API Inference Extension: route a request to the best replica.
Admission decides whether a job may start at all, given a quota; the autoscalers then react to the pods it admits. Core Kubernetes alone can cap and prioritise but cannot queue.

Options

  1. Plain ResourceQuota + PriorityClass + scheduler preemption. Hard per-namespace caps and priority preemption with zero add-ons. But over-quota pods are rejected, not queued; there is no borrowing, no fair-sharing, no cohort, no “wait your turn.” Operationally weak for contended GPUs.
  2. Volcano. A replacement batch scheduler (ex kube-batch) with strong gang scheduling: all-or-nothing placement for multi-node distributed training/HPC. Powerful, but it supplants the default scheduler and is heavier than this single-GPU, single-node need.
  3. Kueue. A quota/admission layer that sits on top of the default scheduler: it suspends an opted-in Job until its ClusterQueue has free quota, then un-suspends it. Provides nominal quota, borrowing within a cohort, fair-sharing, preemption, and flavor fungibility. A kubernetes-sigs project; integrates with Job, JobSet, Kubeflow, Ray.
(Also surveyed: YuniKorn (hierarchical queues, strongest for Spark/data) and Run:ai (commercial/NVIDIA, fractional-GPU). Neither fits a small self-hosted OSS GPU platform better than Kueue here.)

Decision

Use Kueue (v0.18.1, OCI Helm chart) for GPU quota and admission.
  • One ResourceFlavor gpu-any (no nodeLabels; carries the nvidia.com/gpu toleration Kueue injects on admission) and one ClusterQueue gpu-cq with nvidia.com/gpu nominal quota of 1, the real physical capacity. Tenant namespaces team-a/team-b reach it through namespaced LocalQueues. Two WorkloadPriorityClasses drive preemption.
  • Manifests are kueue.x-k8s.io/**v1beta2** (the served storage version in 0.18; v1beta1 is deprecated).
  • manageJobsWithoutQueueName: false: Kueue governs only Jobs that opt in via the kueue.x-k8s.io/queue-name label, so the serving stack and system workloads are untouched.

On flavor fungibility (deliberately deferred)

The instructive design (list l4 then t4 flavors so a job falls back across pools) needs a GPU budget > 1 to be meaningful (per-flavor quota would otherwise let Kueue admit more jobs than there are GPUs, pushing the contention down to the scheduler and muddying the queueing story). With one physical GPU, a single agnostic flavor with nominalQuota: 1 is the honest model; the l4→t4 fallback already exists one layer down, in the autoscaler (the GPU pods pin no accelerator, ADR/infra). Revisit when the GPU budget grows.

Consequences

  • A second GPU job waits in Kueue (Workload Pending on quota), not as a failed pod, demonstrated in workloads/kueue-demo. A high-priority Workload preempts a running low-priority one.
  • Kueue is the batch/offline admission layer. The always-on vLLM serving Deployment is not a natural Kueue object (it never completes); per-request serving scale/overload is a different layer (HPA/KEDA + inference-aware routing), not this ADR.
  • Adds one controller (kueue-system) with internal cert management, no cert-manager dependency.