Context
The stack already routes inference through Gateway API + the Gateway API Inference Extension (GIE) on agentgateway: anInferencePool + Endpoint Picker does inference-aware,
model-aware routing (KV-cache/queue-depth endpoint selection, model/canary splits). What it does
not do is the tenant-facing economics layer: virtual API keys, per-key spend budgets, TPM/RPM
rate limits, a spend ledger, and a single OpenAI /v1 facade across local + commercial providers.
That is the tenant gateway jump to a self-hosted LLM platform. We need to decide how that layer relates to GIE.
Decision
Add LiteLLM as a control-plane layer above the existing GIE data plane, not as a replacement.- GIE / agentgateway = data plane. Inference-aware routing to model-server endpoints. Stays.
- LiteLLM = control plane / economics. Virtual keys, budgets, TPM/RPM, spend,
/v1facade, multi-provider. New.
client → LiteLLM /v1 (key auth, budget check, spend log) → inference-gateway (model-aware) → InferencePool/EPP → vLLM. LiteLLM’s local model targets the in-cluster gateway
(http://inference-gateway.inference.svc/v1). External commercial providers route through the same
gateway (see Amendment (2026-06-22)).
This matches upstream guidance: the GIE docs explicitly position self-hosted models as something to
“integrate alongside MaaS providers in higher-level AI Gateways like LiteLLM”, i.e. the two are
different, complementary layers.
LiteLLM runs from the official Helm chart (ghcr.io/berriai/litellm-helm, pinned 1.89.2), backed
by CloudNativePG Postgres (single instance; allow_requests_on_db_unavailable=true mitigates
the SPOF), no Redis (single replica). Master/salt/DB/upstream secrets come via ESO (ADR-0011); the
salt key is write-once. Schema migration runs as the chart’s Argo PreSync-hook Job with the proxy’s
DISABLE_SCHEMA_UPDATE=true.
Amendment (2026-06-22): agentgateway is the unified external egress
The original decision left external commercial APIs (OpenAI, Anthropic, …) attached directly at LiteLLM. We now route them through agentgateway instead, so all model traffic, local and external, exits through one governed data plane:client → LiteLLM /v1 (keys, budgets, spend) → agentgateway → {GIE/InferencePool/EPP → vLLM | OpenAI | Anthropic | Gemini | Bedrock}
LiteLLM points at one upstream (agentgateway); agentgateway normalizes every provider to an
OpenAI-compatible response (its AIProvider schema-translation orchestrator, first-class OpenAI/
Anthropic/Gemini/Bedrock + any OpenAI-compatible) and returns token usage, so LiteLLM still computes
spend (usage × its pricing table) and enforces budgets unchanged.
agentgateway plays two distinct roles, keep them honest:
- Inference-aware router for self-hosted vLLM (GIE
InferencePool/EPP, KV/queue-aware). Local only. - Normalizing egress proxy for external SaaS (the AIProvider path). GIE does not apply to SaaS (no KV/queue to observe); do not claim “ChatGPT routed by GIE.”
usage round-trips through agentgateway for one external
provider so LiteLLM’s spend/budget math binds (one short spike).
Layering is unchanged: LiteLLM stays the client-facing economics layer on top; only its external
backend moves from the provider URL to agentgateway. This refines, but does not reverse, the rejection of
“LiteLLM as the only gateway” below (GIE stays the local data plane).
Alternatives considered
- Envoy AI Gateway (consolidate economics into the data plane). Rejected for now: pre-GA (1.0 target ~2026-06-30), Redis-required, and token-quota only, with no virtual keys, no $budgets, no spend ledger. It can’t deliver the tenant gateway. Revisit post-GA only to potentially consolidate the data plane; it does not replace LiteLLM’s economics.
- GIE alone (push budgets/keys into routing). Rejected: GIE is deliberately a routing layer; it has no key/budget/spend model. Wrong layer for tenant economics.
- LiteLLM as the only gateway (drop GIE, point LiteLLM straight at vLLM). Rejected: throws away the inference-aware routing that is the platform’s depth differentiator. Keep both layers.
Consequences
- Two layers to operate, but each does one job (economics vs inference-aware routing): clean separation, and each is independently swappable (Envoy AI GW could replace the data plane later; a different economics layer could replace LiteLLM).
- Adds an always-on Postgres + proxy on the CPU node pool (small, non-GPU cost): the self-hosted LLM platform edge is meant to be always available; the GPU model underneath stays scale-to-zero.
- Tenant→key→budget data lives in Postgres (tenant data, not git). SSO/audit (LiteLLM Enterprise) are out of scope; unification with OIDC is part of multi-tenancy/governance (see SSO/auth).
- Sources: GIE↔LiteLLM layering (GIE docs); Envoy AI Gateway maturity/feature gaps; LiteLLM vkeys/budgets/spend.