Skip to main content
A layered LLM-serving platform on Kubernetes. The request path runs top to bottom; cross-cutting concerns (GitOps, secrets, observability) span every layer.

Goal and scope

Serve LLM inference on Kubernetes the way a platform team would. Every operational concern (GPU scheduling, routing, tenancy, secrets, scaling, model delivery, observability) is an explicit, forkable decision rather than a managed black box. The design target is a stack that runs the same on any GPU-capable cluster and grows from a single scale-to-zero GPU to a multi-tenant, HA deployment without rearchitecting. It integrates open-source components; it does not reinvent them. The one hard boundary: infrastructure-as-code owns the cloud substrate; GitOps owns everything inside Kubernetes. Neither crosses into the other’s half. That split keeps a fork portable instead of a pile of provider-specific glue.

The layers

LLM inference platform layer stack showing experience and automation, tenant edge, routing, serving, GPU platform, operations, and cloud substrate
  • Tenant edge: LiteLLM. One OpenAI /v1 facade with virtual keys, per-key/team budgets, rate limits, and a spend ledger. This is where tenancy economics live.
  • Routing: Gateway API + GIE. Inference-aware endpoint selection (KV-cache, queue depth, model/canary splits) via an InferencePool and endpoint-picker, on agentgateway (no Istio). Distinct from the tenant edge: one is economics, the other is request routing.
  • Serving: vLLM / KServe. Raw vLLM is the default (full control, minimal moving parts); KServe InferenceService adds managed lifecycle (canary, scale-to-zero, model governance) where it earns its control plane. See Serving layers compared.
  • GPU platform. The NVIDIA driver, device plugin, and DCGM metrics (managed on GKE; the NVIDIA GPU Operator off GKE), with Kueue for quota/admission and KEDA for queue-depth autoscaling.
  • Substrate & delivery. OpenTofu provisions the cluster and identity; Argo CD reconciles the whole in-cluster stack from git as an app-of-apps. Secrets sync keylessly via External Secrets Operator; observability is Prometheus + Grafana + DCGM + vLLM metrics.
  • Cost visibility. OpenCost attributes real infrastructure cost (node, GPU, memory cost) per workload from the same Prometheus, feature-gated. It complements the LiteLLM spend ledger: token economics from the gateway, infrastructure cost from OpenCost.
For the canonical term-by-term glossary, see the glossary.

Platform topology

The full platform has more than one path through the cluster. /v1 inference requests go through LiteLLM and the inference router. MCP tool traffic is a separate route through agentgateway. n8n is an experience-layer automation surface: its dashboard is SSO-gated, and its AI calls use a LiteLLM virtual key rather than bypassing the tenant edge.
Platform topology showing the inference request path, separate MCP tool path, GitOps, identity, secrets, observability, and OpenTofu substrate

Request path

A single chat completion crosses the tenant edge, the router, and the model server. The edge owns identity and economics; the router owns endpoint selection; the model server owns the GPU. Inference request lifecycle from client through LiteLLM, GIE routing, queueing, batching, prefill, decode, and token streaming

Components and status

Component versions and maturity. Validated means exercised end-to-end on GPU.
LayerComponentVersionState
DeliveryArgo CD (app-of-apps, sync waves)v3.4 / chart 9.5.21✅ Built
SubstrateOpenTofu (cluster, pools, identity, IAM)n/a✅ Built
GPU platformNVIDIA driver + device plugin + DCGM (managed on GKE / GPU Operator off-GKE)n/a✅ Built
GPU platformKueue (ClusterQueue / LocalQueue / ResourceFlavor)0.18.1✅ Built
GPU platformKEDA (queue-depth autoscale)n/a✅ Built
Servingraw vLLM (OpenAI-compatible)v0.23.0✅ Validated
ServingKServe InferenceServicev0.19.0✅ Validated
ServingOCI modelcar delivery (oci://, digest-pinned)n/a✅ Validated
RoutingGateway API + GIE (InferencePool + EPP)agentgateway v1.2.1 / GIE v1.5.0✅ Built
Tenant edgeLiteLLM (virtual keys, budgets, spend)n/a✅ Built
CostOpenCost (per-workload infra cost, feature-gated)chart 2.5.23✅ Built
ExperienceCoding assistant (chat / FIM / agentic, Open WebUI, Tabby)Qwen2.5-Coder 1.5B-14B✅ Validated
AuthSSO (Dex + oauth2-proxy + key-portal)n/a✅ Validated
Observabilitykube-prometheus-stack + DCGM + vLLM ServiceMonitor86.2.3✅ Built
SecretsExternal Secrets Operator + cloud secret manager (keyless)ESO 2.6.0✅ Built
Ingresscert-managerv1.20.2✅ Built
Deferred capabilities: LLM-level tracing (Langfuse), alerting, multi-GPU fair-share, GPU time-slicing, RAG, and MLOps lifecycle. These are tracked with explicit adoption triggers, not designed out.

Operational model: IaC, GitOps, and the one-time steps

The platform is declarative end to end. After a one-time bootstrap, Argo CD reconciles the entire in-cluster stack from git (platform, serving, routing, gateway, experience, and secrets via External Secrets). Day-2 operation is 100% git: change a manifest, Argo applies it. A short list of steps is imperative because they sit below GitOps (they create the cluster and the GitOps engine) or handle values that cannot live in git:
StepCommandWhy it is not reconciled from git
Cluster, IAM, registry, Workload Identitymake tf-applythis is the IaC that creates the substrate
Install Argo CDmake bootstrapinstalls the GitOps engine itself
Seed secretsmake seed-secretssecret values are never committed
Private-repo credentialmake argocd-repoa fork’s read token is supplied at setup
Apply the app-of-appsmake rootstarts reconciliation; everything after is git
The only capability that is genuinely outside infrastructure-as-code is container image builds: the model-delivery image (an OCI modelcar with the weights baked in) is produced by a registry build, not a declarative manifest. Pre-built public images are provided so a fork serves the default model with no build step; building a different model is a documented, opt-in step.

Go deeper