Status: Accepted
Date: 2026-06-16

Context

A plain L7 load balancer (round-robin / least-conn / ring-hash) is structurally blind to what makes an LLM replica slow. vLLM latency is driven by per-replica live state (KV-cache occupancy, the running/waiting queue, prefix-cache hits), none of which an HTTP LB can see. So an LB will happily send a request to a replica whose KV-cache is full while another sits idle, spiking TTFT. It also can’t express LLM-shaped intent: route by model/adapter name (carried in the request body, not the path/header), split traffic across model versions, or admit interactive traffic ahead of batch under contention.

The Gateway API Inference Extension (GIE) addresses exactly this: an InferencePool (GA at inference.networking.k8s.io/v1) groups model-server pods, and an Endpoint Picker (EPP), a gRPC ext-proc the gateway calls per request, picks the endpoint using live vLLM metrics. We need a Gateway API data plane that implements GIE.

This is a distinct layer from ADR-0002 (Kueue = job admission/quota, not per-request routing). Kueue decides whether a batch job starts; GIE decides which replica a live request hits.
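The blindness described above can be shown with a toy comparison. This is a minimal sketch, not the real EPP algorithm: the `Replica` fields, the `queue_weight` parameter, and the scoring formula are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Replica:
    name: str
    kv_cache_usage: float  # fraction of KV-cache in use, 0.0 to 1.0 (assumed metric shape)
    waiting: int           # requests queued on the replica (assumed metric shape)


def round_robin(replicas):
    """State-blind picker: cycles through replicas regardless of load."""
    return cycle(replicas)


def load_aware_pick(replicas, queue_weight=0.1):
    """Hypothetical scorer (lower is better); stands in for an endpoint picker."""
    return min(replicas, key=lambda r: r.kv_cache_usage + queue_weight * r.waiting)


replicas = [
    Replica("vllm-0", kv_cache_usage=0.98, waiting=6),  # saturated
    Replica("vllm-1", kv_cache_usage=0.05, waiting=0),  # idle
]

rr_first = next(round_robin(replicas)).name  # first in rotation: the saturated replica
aware = load_aware_pick(replicas).name       # lowest score: the idle replica
print(rr_first, aware)
```

Round-robin sends the first request to the saturated replica simply because it is first in rotation, while any picker that reads live state avoids it.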

Options (gateway implementation)

  1. Istio (1.28+). Confirmed InferencePool v1 support, GIE chart has first-class provider.name=istio, most mature. But it’s a full service mesh, heavy for a single inference gateway, and we explicitly do not want Istio here.
  2. kgateway (2.1). Supports InferencePool v1, but its Envoy-based inference path is deprecated and slated for removal in 2.2; kgateway steers AI/inference to agentgateway. Pinning it is a dead end. Rejected on version-discipline grounds.
  3. Envoy AI Gateway (0.7). Native GIE plus multi-provider + token/key management. But it requires an Envoy Gateway control plane underneath (two layers), is pre-1.0, and lags GIE releases. Its real value is the multi-provider and token/key management, which is a multi-tenant-gateway concern, not the core serving path.
  4. LiteLLM. Not a Gateway API / InferencePool implementation at all; its “load” awareness is its own observed request counters/latency, never live vLLM KV/queue metrics. It is an API-level multi-provider proxy (virtual keys, budgets, chargeback), a multi-tenant-gateway tool, wrong layer for GIE.
  5. agentgateway (1.2.1). Self-contained Gateway API provider (no separate control plane), post-1.0, native InferencePool v1 + EPP, $0-GPU sim quickstart, vendor-neutral. It is the forward path kgateway routes inference toward.

Decision

Use GIE v1.5.0 (InferencePool v1, EPP via the upstream inferencepool Helm chart) on agentgateway v1.2.1 as the gateway data plane (provider.name=none; agentgateway discovers the pool). Routing chain: Gateway(class=agentgateway) → HTTPRoute → InferencePool → EPP → vLLM pods. InferenceObjective (inference.networking.x-k8s.io/v1alpha2) expresses interactive>batch priority.

Model-aware routing + version canary. OpenAI clients carry the model in the request body, which Gateway API cannot match on. We use GIE’s Body-Based Routing (BBR), implemented natively by agentgateway as an AgentgatewayPolicy (agentgateway.dev/v1alpha1) in the PreRouting phase: a CEL transform (string(json(request.body).model)) lifts the model name into the X-Gateway-Base-Model-Name header. An HTTPRoute then matches that header and sends each model to its own InferencePool (one pool/EPP per model). Passthrough (no LoRA→base alias map) means an unknown model matches no rule and gets a clean 404, which shows the routing is genuinely by model name, not a catch-all.

Version rollout is weighted backendRefs (standard Gateway API) across two pools of the same model (model-b-stable:90 / model-b-canary:10), so each split target still gets the EPP’s inference-aware endpoint pick. This rejects routing the split at the Service layer (loses EPP awareness) and a single multi-version pool (can’t weight versions independently).

Rationale: leanest correct fit for the core serving path (one Helm install, no mesh, no second control plane), on a non-deprecated path, and it doubles as the multi-tenant gateway: agentgateway also does multi-provider LLM routing, per-key auth (JWT/OPA), RBAC, and rate limiting. If the multi-tenant gateway needs OSS per-tenant budgets/chargeback (agentgateway virtual keys are Solo.io Enterprise), LiteLLM slots in front of agentgateway for exactly that. So this choice advances the later multi-tenant-gateway goal instead of discarding work.
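The model-aware routing chain can be modelled in a few lines. This is a sketch of the logic only, not agentgateway code: `ROUTES`, `lift_model_header`, and `route` are hypothetical names, the route table assumes the header values and pool names used in this ADR, and malformed request bodies are not handled.

```python
import json
import random

HEADER = "X-Gateway-Base-Model-Name"

# Hypothetical route table: header value -> weighted InferencePool backends.
ROUTES = {
    "Qwen3-0.6B": [("model-a", 100)],
    "Qwen2.5-1.5B": [("model-b-stable", 90), ("model-b-canary", 10)],
}


def lift_model_header(body: bytes) -> dict:
    """Mimics the CEL transform string(json(request.body).model)."""
    return {HEADER: str(json.loads(body)["model"])}


def route(body: bytes, rng: random.Random):
    """Return (status, pool). Unknown models match no rule and get 404."""
    headers = lift_model_header(body)
    backends = ROUTES.get(headers[HEADER])
    if backends is None:
        return 404, None  # passthrough: no alias map, no catch-all rule
    pools, weights = zip(*backends)
    # Weighted pick between pools; each pool's own EPP would then pick a pod.
    return 200, rng.choices(pools, weights=weights, k=1)[0]


rng = random.Random(0)
print(route(b'{"model": "Qwen3-0.6B"}', rng))
print(route(b'{"model": "no-such-model"}', rng))
```

The weighted pick happens between pools rather than between pods, which is why each side of the canary split keeps its own inference-aware endpoint selection.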

Consequences

  • + EPP picks the least-loaded endpoint per request. Verified at $0-GPU on the llm-d-inference-sim: 12 requests were distributed 3/2/7 across 3 replicas (load-weighted, not round-robin).
  • + Model-aware routing verified: Qwen3-0.6B→model-a pool, Qwen2.5-1.5B→model-b pool, unknown model→404; the model-b version canary held ~90/10 (55 stable / 5 canary over 60 requests). All $0-GPU.
  • One EPP per InferencePool: N models + a canary = N+1 EPPs (each ~200m cpu). Cheap with sims, but on a real multi-GPU fleet this is a per-pool cost to plan for.
  • + One vendor-neutral gateway spans the core serving path (GIE) and the multi-tenant gateway (multi-provider).
  • agentgateway is young (v1.2.x) and CNCF-sandbox-tier; its OSS edition lacks virtual keys.
  • InferenceObjective is still alpha; schema may shift; we pin GIE v1.5.0 and verify the served schema (see runbook).
  • The classic LLMInferenceService/KServe three-way comparison (ADR-0006, pending) is unaffected; GIE is the routing layer, KServe is the serving-lifecycle layer.
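The canary and EPP-count bullets above reduce to simple arithmetic, sketched here. Two things are assumptions: the weighted split is modelled as an independent per-request draw (binomial), and the verified setup is read as 2 models plus 1 canary pool. The function names are hypothetical.

```python
import math


def canary_z_score(canary_hits: int, total: int, target_share: float) -> float:
    """Standard deviations between the observed canary count and the binomial mean."""
    mean = total * target_share
    std = math.sqrt(total * target_share * (1 - target_share))
    return (canary_hits - mean) / std


def epp_footprint(models: int, canary_pools: int = 1, cpu_millicores_per_epp: int = 200):
    """One EPP per InferencePool: return (pool count, total EPP CPU in millicores)."""
    pools = models + canary_pools
    return pools, pools * cpu_millicores_per_epp


# 5 canary hits out of 60 requests against a 10% target share.
z = canary_z_score(canary_hits=5, total=60, target_share=0.10)

# model-a, model-b-stable, model-b-canary at roughly 200m CPU each.
pools, cpu_m = epp_footprint(models=2)
print(round(z, 2), pools, cpu_m)
```

Under that binomial assumption, 5 canary hits out of 60 sits well within one standard deviation of the expected 6, so the observed 55/5 split is consistent with a 90/10 weighting.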

References

  • Runbook inference-gateway.md. GIE v1.5.0; agentgateway v1.2.1; llm-d-inference-sim v0.8.2. Supersedes the original ADR-005 framing (now resolved here).