Operating guide for the enforcement controls: Security enforcement. The assumption these gaps depend on: Trust model.
What IS enforced
Secrets never live in git or in OpenTofu state. Every secret is a name, materialized at runtime by External Secrets Operator (ESO) from GCP Secret Manager. On GKE, ESO authenticates keylessly via Workload Identity, with no static service-account key. A forker seeds secret values once into Secret Manager (see the secret contract); the repo holds only references.

Model endpoints require auth. raw-vLLM enforces an API key (VLLM_API_KEY) on every /v1/* request.
Tenants do not call model servers directly; they call LiteLLM with a virtual key, and LiteLLM
presents the shared upstream key to the serving layer. Virtual keys carry per-key/team budgets, TPM/RPM
limits, and a spend ledger: the tenancy and economic boundary is the LiteLLM layer.
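
As an illustration of the "secret is a name" pattern applied to the upstream model key, here is a minimal sketch of an ExternalSecret that materializes VLLM_API_KEY from Secret Manager. The resource names, the namespace, the store name, and the Secret Manager key are hypothetical and are not taken from this repo.

```yaml
# Sketch only: names, namespace, and the remote key are illustrative assumptions.
apiVersion: external-secrets.io/v1beta1   # newer ESO releases also serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: vllm-api-key
  namespace: serving
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager     # assumed store that reaches Secret Manager via Workload Identity
  target:
    name: vllm-api-key           # Kubernetes Secret that ESO creates at runtime
  data:
    - secretKey: VLLM_API_KEY    # key inside the generated Kubernetes Secret
      remoteRef:
        key: vllm-api-key        # secret name in Secret Manager; the value never enters git
```

Only the reference is committed; the value is seeded into Secret Manager out of band and mounted into the serving pods from the generated Secret.
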
TLS at the public edge. When a domain is configured, the agentgateway edge terminates Let’s Encrypt
TLS (cert-manager) and owns the stable HTTPS issuer URL SSO depends on. In-cluster backends
stay plain HTTP behind that edge today; see “deliberately deferred” below.
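
For orientation, a Let's Encrypt issuer for cert-manager might look like the sketch below. The issuer name, contact email, Gateway name and namespace, and the choice of an HTTP-01 solver routed through the Gateway API are all assumptions; the repo's actual issuer may differ (for example, by using DNS-01).

```yaml
# Sketch only: all names and the solver choice are illustrative assumptions.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com             # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key # Secret holding the ACME account key
    solvers:
      - http01:
          gatewayHTTPRoute:
            parentRefs:
              - kind: Gateway
                name: edge-gateway       # hypothetical name for the agentgateway edge
                namespace: gateway-system
```
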
SSO is shared identity, not shared authorization. Dex is the issuer for Argo CD, Grafana, Open
WebUI, LiteLLM admin UI, and oauth2-proxy forward-auth surfaces. Each application still owns its role
mapping: Argo CD and Grafana map admin@<domain> to admin, LiteLLM maps PROXY_ADMIN_ID, and the key
portal lets any signed-in user manage only keys scoped to that user’s email. n8n is SSO-gated at the
edge but keeps its OSS owner login behind that gate.
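
To make "each application owns its role mapping" concrete, this is roughly how the Argo CD side of that mapping could be expressed in its RBAC ConfigMap. The domain is a placeholder, and whether the repo sets these exact values is an assumption.

```yaml
# Sketch only: example.com stands in for the configured domain.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  scopes: "[email]"        # match RBAC subjects against the OIDC email claim from Dex
  policy.default: ""       # signed-in users get no role unless explicitly mapped
  policy.csv: |
    g, admin@example.com, role:admin
```

Dex only proves who the user is; this ConfigMap is what decides that the identity is an Argo CD admin. Grafana, LiteLLM, and the key portal each need their own equivalent.
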
Optional, off-by-default enforcement (the security group): when multi-tenancy warrants it, two
logical controls become enforced ones:
- SR1, NetworkPolicy (native networking.k8s.io/v1, CNI-portable): default-deny + allow-list so LiteLLM is the sole authorized caller of the model servers and GIE gateway. Removes the network-bypass around virtual keys and budgets. Needs a NetworkPolicy-enforcing dataplane (on GKE, Dataplane V2, set at cluster creation).
- SR2, Kyverno admission: rejects any pod in a managed namespace that requests nvidia.com/gpu but lacks the Kueue queue-name label, so a pod cannot bypass GPU quota/fair-share.
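
The two controls above can be sketched as the manifests below. This is a minimal illustration, not the repo's actual policy set: the namespaces, pod labels, port, and the `tenant-*` namespace pattern are hypothetical, and the GPU check covers only regular containers' limits.

```yaml
# SR1 (sketch): deny all ingress in the serving namespace by default...
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: serving                 # hypothetical namespace
spec:
  podSelector: {}                    # selects every pod in the namespace
  policyTypes:
    - Ingress
---
# ...then allow only LiteLLM pods to reach the model servers.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-litellm-to-model-servers
  namespace: serving
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: model-server   # assumed label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: litellm  # assumed namespace name
          podSelector:
            matchLabels:
              app.kubernetes.io/name: litellm       # assumed label
      ports:
        - protocol: TCP
          port: 8000                                # assumed serving port
---
# SR2 (sketch): reject GPU pods that do not name a Kueue queue.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-kueue-queue-for-gpu
spec:
  validationFailureAction: Enforce   # newer Kyverno releases prefer validate.failureAction per rule
  background: false
  rules:
    - name: gpu-pods-need-queue-name
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - "tenant-*"         # hypothetical pattern for managed namespaces
      preconditions:
        all:
          # Apply the rule only when at least one container sets a GPU limit.
          - key: "{{ request.object.spec.containers[].resources.limits.\"nvidia.com/gpu\" || `[]` | length(@) }}"
            operator: GreaterThan
            value: 0
      validate:
        message: "Pods requesting nvidia.com/gpu must carry the kueue.x-k8s.io/queue-name label."
        pattern:
          metadata:
            labels:
              kueue.x-k8s.io/queue-name: "?*"   # any non-empty value
```

A real allow-list would also need to admit any other legitimate callers of these pods, and a complete admission rule would likely check init containers as well.
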
What is DELIBERATELY deferred
Each item below is decided and tracked, with the trigger that pulls it forward. None is an accident.

| Not enforced today | Why deferred | Enabled when |
|---|---|---|
| TLS everywhere (backend / mTLS) | The public edge terminates TLS; in-cluster hops stay HTTP. A single trusted-tenant lab gains little from intra-cluster mTLS, and it adds a cert-rotation surface. | Untrusted in-cluster workloads, or a compliance requirement for encryption-in-transit between pods. |
| Full RBAC / tenant isolation | The tenancy boundary is LiteLLM (virtual keys/budgets), not the K8s namespace; tenants consume a /v1 endpoint, not a namespace. Per-tenant namespace + ResourceQuota + RBAC + NetworkPolicy is the baseline, but multi-team RBAC, SCIM, and audit are not wired. | More than one team shares the cluster; the multi-tenancy/governance milestone (home SSO + a self-service key portal). |
| Rate limiting at the gateway | LiteLLM enforces per-key TPM/RPM and budgets (the economic limit). A separate edge rate-limit (e.g. agentgateway/MCP) is not configured. | A public-facing or MCP-exposed surface needs abuse protection independent of per-key budgets. |
| Fail-closed budgets (SR3) | The lab runs allow_requests_on_db_unavailable: true (availability-first): if Postgres is down, requests pass un-metered. Fail-closed is a profile knob requiring CloudNativePG HA (instances ≥ 3) + a Redis-backed budget/rate cache. | The prod profile, where overspend risk outweighs availability; rides the HA hardening work. |
| SLO breach alerting | SLOs are recorded and dashboarded but Alertmanager is disabled; the SLOs are baseline-only (one model, one GPU). | First multi-replica/multi-model serving ⇒ per-model SLOs + alerting. |
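
For the fail-closed budgets row (SR3), the availability-first behavior comes down to a single LiteLLM setting, and the HA prerequisite to a replica count on the Postgres cluster. The fragment below is a sketch of both; the cluster name, namespace, and storage size are hypothetical, and the Redis-backed cache that the row also requires is not shown.

```yaml
# LiteLLM proxy config fragment (sketch): today's availability-first lab profile.
general_settings:
  allow_requests_on_db_unavailable: true   # if Postgres is down, requests pass un-metered
---
# Separate manifest (sketch): the CloudNativePG HA prerequisite for a fail-closed profile.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: litellm-db          # hypothetical name
  namespace: litellm        # hypothetical namespace
spec:
  instances: 3              # satisfies the "instances ≥ 3" requirement
  storage:
    size: 10Gi              # placeholder size
```

A fail-closed profile would presumably flip the first setting to false once the database is highly available, so that a Postgres outage blocks requests instead of letting them through unbudgeted.
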