Context
LiteLLM (ADR-0013) is the tenant economics layer (virtual keys / budgets / spend). The tenant gateway requires LLM-level observability distinct from the infra metrics already in Prometheus/Grafana: (a) a spend/budget view (per key/team/model cost + remaining budget), and (b) prompt/token/cost tracing per request. We chose Langfuse (self-hosted) for tracing. Two facts force a split decision rather than “deploy Langfuse and call it done”:
- Langfuse v3 is heavy. Self-hosted v3 requires Postgres + ClickHouse + Redis/Valkey + an S3-compatible blob store (the official Helm chart’s bundled subcharts are single-replica smoke-test only). Realistic footprint is ~8GB+ RAM, disproportionate for this ~$30-35/mo single-cluster lab and likely forces a larger/extra CPU node.
- LiteLLM’s Prometheus metrics are Enterprise-gated. LiteLLM moved its Prometheus integration (litellm_spend_metric, token/budget metrics) to Enterprise ($250/mo) in Sept 2024. The obvious “scrape LiteLLM /metrics into Grafana” path is not available to an OSS reference platform.
LiteLLM OSS does, however, log spend to its own Postgres database (the LiteLLM_SpendLogs and LiteLLM_VerificationToken tables, which are what make budget enforcement work, as verified by the raw-vLLM GPU proof). That data is sufficient for the spend/budget view.
Decision
Split the observability work: ship the spend dashboard now from Postgres; defer Langfuse tracing.
- Spend dashboard (now). A Grafana dashboard backed by a Postgres datasource reading LiteLLM’s spend tables in the existing CNPG litellm-pg. Access via a dedicated least-privilege read-only role (grafana_ro, CNPG managed role with the PG-builtin pg_read_all_data); its password flows GSM → ESO. The datasource is delivered as an ESO-rendered Secret labelled grafana_datasource (keeps the password out of git) and imported by the Grafana datasource sidecar. Reuses Grafana + CNPG; no new infrastructure. Panels: spend over time, spend by model, spend by key, tokens, remaining budget per key.
- Langfuse tracing (deferred). Langfuse stays the chosen tracing tool but lands as its own task when its v3 footprint is justified/affordable (reuse CNPG for its Postgres; bring ClickHouse + Redis + MinIO). LiteLLM → Langfuse is a one-line success_callback: ["langfuse"] when we get there.
  - 2026-06-20: Langfuse tracing deferred on lab-footprint grounds only (not product value). PRODUCT TRIGGER = multi-tenant request tracing needed OR budget allows the v3 ClickHouse+Redis+MinIO footprint.
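The "spend by model" and "remaining budget per key" panels listed above could be backed by aggregations along the following lines. This is a minimal sketch, not the dashboard's actual SQL: the column names (model, spend, total_tokens, key_alias, max_budget) are assumptions about LiteLLM's schema that this ADR does not confirm, the sample rows are invented, and an in-memory SQLite database stands in for litellm-pg purely so the queries can be exercised anywhere.

```python
import sqlite3

# Assumed columns (not confirmed by this ADR): model, spend, total_tokens on
# "LiteLLM_SpendLogs"; key_alias, spend, max_budget on "LiteLLM_VerificationToken".
SPEND_BY_MODEL = '''
SELECT model, SUM(spend) AS spend_usd, SUM(total_tokens) AS tokens
FROM "LiteLLM_SpendLogs"
GROUP BY model
ORDER BY spend_usd DESC
'''

REMAINING_BUDGET = '''
SELECT key_alias, max_budget - spend AS remaining_usd
FROM "LiteLLM_VerificationToken"
WHERE max_budget IS NOT NULL
ORDER BY remaining_usd ASC
'''


def demo_db():
    """Build an in-memory stand-in for litellm-pg with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript('''
        CREATE TABLE "LiteLLM_SpendLogs" (
            request_id TEXT PRIMARY KEY, api_key TEXT, model TEXT,
            spend REAL, total_tokens INTEGER);
        CREATE TABLE "LiteLLM_VerificationToken" (
            token TEXT PRIMARY KEY, key_alias TEXT, spend REAL, max_budget REAL);
    ''')
    conn.executemany('INSERT INTO "LiteLLM_SpendLogs" VALUES (?,?,?,?,?)', [
        ("r1", "k1", "model-a", 0.25, 100),
        ("r2", "k1", "model-b", 1.0, 300),
        ("r3", "k2", "model-a", 0.25, 200),
    ])
    conn.executemany('INSERT INTO "LiteLLM_VerificationToken" VALUES (?,?,?,?)', [
        ("k1", "team-a", 1.25, 5.0),
        ("k2", "team-b", 0.25, 1.0),
        ("k3", "unlimited", 0.0, None),  # no budget set, excluded from the panel
    ])
    return conn


def run_panels(conn):
    """Run both panel queries and return their rows keyed by panel name."""
    return {
        "spend_by_model": conn.execute(SPEND_BY_MODEL).fetchall(),
        "remaining_budget": conn.execute(REMAINING_BUDGET).fetchall(),
    }
```

Calling run_panels(demo_db()) returns one row per model and one row per budgeted key. In Grafana, the same SELECT statements would additionally need a time filter on the spend-log timestamp column for the "spend over time" panel, which is omitted here because the column name is not stated in this ADR.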
Alternatives considered
- LiteLLM /metrics → Prometheus → Grafana. Rejected: Enterprise-gated; unusable for an OSS reference. (The free 3-month migration license is temporary and not forkable, wrong for a reference platform.)
- Deploy full Langfuse v3 now for both tracing and spend. Rejected for now: footprint disproportionate to the lab; the spend view (the higher-value, user-visible economics signal) is achievable at near-zero cost without it.
- Langfuse v2 (Postgres-only, lighter). Rejected: v2 is EOL/unmaintained and missing v3 features, wrong signal for a platform meant to reflect SOTA.
- Reuse the LiteLLM app DB user for Grafana. Rejected: over-privileged (read-write) for a dashboard; a dedicated read-only role is the least-privilege choice.
Consequences
- + The tenant gateway economics story is observable (keys/budgets/spend) with no added infra cost, reusing Grafana + CNPG.
- + Strict least-privilege: Grafana can only SELECT (via pg_read_all_data); it can never mutate LiteLLM data. Password never in git (GSM → ESO).
- − No request-level prompt/token tracing yet (that is Langfuse tracing).
- − The dashboard SQL is coupled to LiteLLM’s table/column names ("LiteLLM_SpendLogs", "LiteLLM_VerificationToken"); a LiteLLM schema change could require dashboard SQL updates.
- − Single Postgres instance → the datasource targets litellm-pg-rw (the -ro service has no endpoints without replicas); switch to -ro when HA lands (later multi-tenancy/HA hardening).
- The dashboard and datasource live in the llm-gateway deployment layer (grafana-litellm app, dashboards/litellm/) so the base platform profile does not require LiteLLM secrets.
- New secret: GSM litellm-grafana-ro-password (secrets-inventory).
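One way to catch the schema coupling noted above before a dashboard breaks is a small drift check that compares the columns the panels rely on with the columns that actually exist. The sketch below is illustrative only: the EXPECTED column set is a hypothetical guess at what the dashboard SQL references, and the information_schema query is shown as a string for a Postgres client to run (for example as grafana_ro) rather than executed here.

```python
# Query a Postgres client could run to list the live columns (not executed here).
LIVE_COLUMNS_SQL = """
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_name IN ('LiteLLM_SpendLogs', 'LiteLLM_VerificationToken')
"""

# Hypothetical: the columns the dashboard SQL is assumed to reference.
EXPECTED = {
    "LiteLLM_SpendLogs": {"model", "spend", "total_tokens"},
    "LiteLLM_VerificationToken": {"key_alias", "spend", "max_budget"},
}


def missing_columns(rows, expected=EXPECTED):
    """Return {table: sorted missing columns} given (table_name, column_name) rows."""
    live = {}
    for table, column in rows:
        live.setdefault(table, set()).add(column)
    report = {}
    for table, columns in expected.items():
        gone = columns - live.get(table, set())
        if gone:
            report[table] = sorted(gone)
    return report
```

An empty result means every expected column is still present; a non-empty result names the tables and columns whose panels would need updating after a LiteLLM upgrade.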
References
- Langfuse v3 self-hosting (Postgres + ClickHouse + Redis + S3): langfuse.com/self-hosting.
- LiteLLM Prometheus → Enterprise (Sept 2024): BerriAI/litellm discussion #5163; LiteLLM pricing.
- LiteLLM spend logging to Postgres (OSS) + Langfuse success_callback: docs.litellm.ai.
- ADR-0013 (LiteLLM), ADR-0011 (secrets/ESO).