Context
ADR-0003 sets latency/throughput SLOs and poses the question a serving platform must answer: what is
the maximum sustainable request rate at which the SLOs still hold? That is the latency-vs-throughput
SLO frontier. The original baseline used vLLM’s built-in vllm bench serve over a closed-loop
concurrency sweep (--request-rate inf, stepping --max-concurrency through a fixed set of values).
That harness emits the right percentiles (TTFT, ITL, TPOT, throughput) but only backs into the
frontier: a closed-loop client offers exactly as much load as the server can absorb, so it never asks
“at a fixed arrival rate, does the server keep up?”, which is the question that actually defines
capacity-at-SLO.
GuideLLM (vllm-project/guidellm) is the ecosystem-standard SLO/capacity tool: it backs the Red Hat
inference-benchmark articles and llm-d/llm-d-benchmark. Its open-loop sweep rate-type drives a
fixed arrival rate independent of server speed, so it yields the latency-vs-throughput frontier
(max-RPS-at-SLO) directly. A Red Hat AI-Catalyst-style comparison requires the “max-RPS-at-SLO”
story, so the earlier “GuideLLM not adopted” position flips.
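
The difference between the two loop models can be illustrated with a toy simulation. The sketch below is not how either tool is implemented. It assumes a single FIFO server with a constant, hypothetical service time of 0.1 s (a capacity of 10 req/s) and evenly spaced arrivals; the function names and parameters are illustrative only.

```python
import heapq


def open_loop(rate, service_time, n_requests):
    """Open loop: request i arrives at i / rate, regardless of server speed."""
    finish_prev = 0.0
    latencies = []
    for i in range(n_requests):
        arrival = i / rate
        start = max(arrival, finish_prev)  # wait if the server is still busy
        finish_prev = start + service_time
        latencies.append(finish_prev - arrival)
    return latencies


def closed_loop(concurrency, service_time, n_requests):
    """Closed loop: each client sends its next request only after the last one returns."""
    ready = [0.0] * concurrency  # time at which each client submits its next request
    heapq.heapify(ready)
    finish_prev = 0.0
    latencies = []
    for _ in range(n_requests):
        arrival = heapq.heappop(ready)  # earliest waiting client goes first (FIFO)
        start = max(arrival, finish_prev)
        finish_prev = start + service_time
        latencies.append(finish_prev - arrival)
        heapq.heappush(ready, finish_prev)  # client re-submits on completion
    return latencies


if __name__ == "__main__":
    service_time = 0.1  # hypothetical: server capacity is 10 req/s
    for rate in (8, 12):
        lat = open_loop(rate, service_time, 1000)
        print(f"open loop   rate={rate:>3} req/s  last latency={lat[-1]:.2f} s")
    for conc in (4, 64):
        lat = closed_loop(conc, service_time, 1000)
        print(f"closed loop conc={conc:>3}        max latency={max(lat):.2f} s")
```

In this model the closed-loop latency is capped at roughly concurrency × service time, because the clients throttle themselves to the server's pace. The open-loop run above capacity shows latency growing with every request, which is the "does the server keep up?" signal the frontier depends on.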
Decision
Adopt GuideLLM as the standard serving benchmark. Its open-loop sweep is the SLO-frontier source
of record going forward. Demote vllm bench serve to an optional, zero-dependency smoke check, kept
because it ships inside the vLLM image (no extra pull) and gives a quick “is the endpoint alive and
roughly sane” signal, but no longer the lineage for published numbers.
Do not maintain both as first-class. One canonical harness avoids two divergent number sets that a
reader would have to reconcile.
- benchmarks/guidellm-job.yaml runs in-cluster as a Job, image ghcr.io/vllm-project/guidellm:v0.5.0 (pinned: GuideLLM is pre-1.0 and CLI/output shapes shift across tags). Run with make bench-guidellm.
- Canonical invocation: guidellm benchmark run --target $BASE_URL --rate-type sweep --data 'kind=synthetic_text,prompt_tokens=256,output_tokens=128' …, matching the ADR-0003 256-in/128-out request shape so the frontier is comparable to the baseline’s request profile.
- Target-agnostic, reusing the existing bench-target/bench-target-auth contract (same as the vllm bench serve Job): point it at raw-vLLM, KServe, or LiteLLM by ConfigMap; a LiteLLM virtual key is sourced from bench-target-auth. No new auth path.
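
As a rough illustration of how a wrapper might assemble the canonical invocation, the sketch below builds the argument list from a base URL. It uses only the subcommand and flags quoted in this ADR, which have not been checked here against the pinned tag. The flags elided by "…" in the canonical invocation are deliberately left out. The helper name `guidellm_argv` and the fallback URL are hypothetical.

```python
import os
import shlex


def guidellm_argv(base_url, prompt_tokens=256, output_tokens=128):
    """Build the argv for the sweep described in this ADR (flags as quoted there)."""
    data = (
        f"kind=synthetic_text,"
        f"prompt_tokens={prompt_tokens},"
        f"output_tokens={output_tokens}"
    )
    return [
        "guidellm", "benchmark", "run",
        "--target", base_url,
        "--rate-type", "sweep",
        "--data", data,
        # Any further flags elided in the ADR are intentionally not guessed here.
    ]


if __name__ == "__main__":
    # Hypothetical fallback; in-cluster the URL would come from the bench-target ConfigMap.
    base_url = os.environ.get("BASE_URL", "http://localhost:8000")
    print(shlex.join(guidellm_argv(base_url)))
```

Keeping the request shape as function parameters with 256/128 defaults makes it harder to drift from the ADR-0003 profile by accident.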
Parity gate (done before demoting vllm bench serve)
GuideLLM was run on the same L4 / Qwen2.5-0.5B-Instruct baseline to confirm it reports the same
qualitative shape before the older harness was demoted. Open-loop numbers differ from the closed-loop
baseline by design (different loop model + request shaping), so they are not expected to match
point-for-point; only the saturation behavior must agree.
Recorded, 2026-06-23, L4 / Qwen2.5-0.5B-Instruct via KServe modelcar (vLLM v0.23.0), synthetic
256-in/128-out:
| strategy | req/s | total tok/s | TTFT mdn | ITL mdn |
|---|---|---|---|---|
| synchronous (1 stream) | 1.3 | ~550 | 31 ms | 5.7 ms |
| sustainable (~SLO knee) | ~12 | ~5.3k | ~70 ms | ~9 ms |
| saturation (throughput) | ~15 | ~6.1k | 3.7 s (saturated) | 86 ms |
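
To show how a max-RPS-at-SLO figure could be read off points like these, the sketch below picks the highest request rate whose latencies stay within a pair of thresholds. The thresholds (200 ms TTFT, 20 ms ITL) are hypothetical placeholders, not the ADR-0003 SLOs. The points are the rounded medians from the table, whereas a real SLO check would more likely use tail percentiles from the full sweep output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepPoint:
    req_per_s: float
    ttft_ms: float
    itl_ms: float


# Rounded median values copied from the table above (approximate).
POINTS = [
    SweepPoint(req_per_s=1.3, ttft_ms=31.0, itl_ms=5.7),     # synchronous
    SweepPoint(req_per_s=12.0, ttft_ms=70.0, itl_ms=9.0),    # sustainable
    SweepPoint(req_per_s=15.0, ttft_ms=3700.0, itl_ms=86.0), # saturation
]


def max_rate_at_slo(points, ttft_slo_ms, itl_slo_ms):
    """Return the highest-rate point that meets both thresholds, or None."""
    passing = [
        p for p in points
        if p.ttft_ms <= ttft_slo_ms and p.itl_ms <= itl_slo_ms
    ]
    return max(passing, key=lambda p: p.req_per_s, default=None)


if __name__ == "__main__":
    # Hypothetical thresholds for illustration only.
    best = max_rate_at_slo(POINTS, ttft_slo_ms=200.0, itl_slo_ms=20.0)
    print(best)
```

With these placeholder thresholds the function selects the ~12 req/s row, matching the "sustainable" label in the table; the saturation row fails on both TTFT and ITL.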
Consequences
- + The latency-vs-throughput SLO frontier (ADR-0003’s actual question) is read directly, not inferred from a closed-loop matrix.
- + Aligns with the ecosystem-standard tool, so forkers’ numbers are comparable to published llm-d / Red Hat results.
- + Becomes the obvious harness for later multi-replica / disaggregation comparisons where “max-RPS-at-SLO across topologies” is the headline.
- − A second image to pull (mitigated: pinned digest/tag; the smoke path still needs no extra image).
- − GuideLLM is pre-1.0; the tag pin is load-bearing and must be bumped deliberately.
- The single-GPU GuideLLM baseline above is recorded; multi-GPU sweeps (HA + disaggregation) are tracked and labeled [needs-gpu].
See also: benchmarks/README.md and the benchmarking guide in the docs site.