Baseline serving benchmark for serving/raw-vllm: a concurrency sweep measuring latency (TTFT / ITL / end-to-end) and throughput, with GPU and KV-cache utilisation read from the same Prometheus the dashboards use. Harness: vllm bench serve (the repo’s benchmarks/ Job).

Run context

Commit: 8e78aaa
GPU: 1× NVIDIA L4 (23 GB), g2-standard-4, on-demand
Driver / CUDA: 580.126.20 / 13.0 (GKE-managed, COS)
Model: Qwen/Qwen2.5-0.5B-Instruct (BF16)
vLLM: v0.23.0 (V1 engine)
Server args: --gpu-memory-utilization 0.6 --max-model-len 8192
Kubernetes: GKE v1.36.0-gke.2684000
Request shape: 256-token input / 128-token output, --ignore-eos, seed 42
Load pattern: closed-loop (--request-rate inf) at fixed max-concurrency
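
As a rough illustration of how the run context above maps onto the harness, the sketch below builds one `vllm bench serve` command line per concurrency level. It is not the repo's actual Job spec: the base URL, the `random` dataset choice, and the number of prompts per level are assumptions made for the example, and only the request shape, seed, `--ignore-eos`, `--request-rate inf`, and the concurrency levels come from this page.

```python
import shlex

# Concurrency levels taken from the sweep on this page.
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32]


def bench_argv(concurrency, base_url="http://raw-vllm:8000", prompts_per_slot=20):
    """Build one `vllm bench serve` invocation for a given concurrency level.

    base_url and prompts_per_slot are hypothetical values for illustration.
    """
    return [
        "vllm", "bench", "serve",
        "--model", "Qwen/Qwen2.5-0.5B-Instruct",
        "--base-url", base_url,              # assumed Service address
        "--dataset-name", "random",          # assumption: synthetic fixed-length prompts
        "--random-input-len", "256",         # 256-token input
        "--random-output-len", "128",        # 128-token output
        "--ignore-eos",                      # always generate the full output length
        "--seed", "42",
        "--request-rate", "inf",             # closed loop: no pacing between requests
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(concurrency * prompts_per_slot),  # assumed run length
    ]


if __name__ == "__main__":
    for level in CONCURRENCY_LEVELS:
        print(shlex.join(bench_argv(level)))
```

Scaling the prompt count with concurrency keeps the number of requests per slot constant across levels, which is one way to make the percentiles comparable; the real Job may size runs differently.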

Results: concurrency sweep

Latencies in milliseconds. TTFT = time to first token (prefill), ITL = inter-token latency (decode), E2E = end-to-end request latency.
[Figure: benchmark concurrency sweep showing output token throughput rising while TTFT and E2E latency climb.]
concurrency  req/s  out tok/s  TTFT p50  TTFT p95  ITL p50  ITL p95  E2E p50  E2E p95
1              1.3        168      27.0     115.5      5.1      6.1      679      765
2              2.9        367      37.8      48.2      5.2      6.3      698      705
4              5.4        698      50.1      85.4      5.3      6.4      722      760
8             10.2       1312      60.9      88.4      5.5      7.4      772      816
16            17.8       2277     103.8     135.7      6.0      8.9      880      947
32            29.5       3773     135.3     191.8      7.1     12.0     1082     1128
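
Because the load is closed-loop at a fixed concurrency, the request rate and E2E latency in the sweep should roughly satisfy Little's law (requests in flight ≈ req/s × time per request). The sketch below is a sanity check on the published figures, using E2E p50 as a stand-in for mean latency, which is an assumption since the sweep reports percentiles rather than means.

```python
# (concurrency, measured req/s, E2E p50 in ms), copied from the sweep results.
ROWS = [
    (1, 1.3, 679),
    (2, 2.9, 698),
    (4, 5.4, 722),
    (8, 10.2, 772),
    (16, 17.8, 880),
    (32, 29.5, 1082),
]


def predicted_rps(concurrency, e2e_ms):
    """Little's law for a closed loop: req/s = requests in flight / seconds per request."""
    return concurrency / (e2e_ms / 1000.0)


def relative_gap(measured, predicted):
    """Absolute difference between prediction and measurement, relative to the measurement."""
    return abs(predicted - measured) / measured


for concurrency, rps, e2e_ms in ROWS:
    pred = predicted_rps(concurrency, e2e_ms)
    gap = relative_gap(rps, pred)
    print(f"c={concurrency:>2}  measured={rps:5.1f} req/s  predicted={pred:5.1f} req/s  gap={gap:.1%}")
```

From concurrency 2 upward the prediction lands within a few percent of the measured rate. The gap is widest at concurrency 1, where E2E p95 sits well above p50, so the median probably understates the mean there.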
Peak resource usage over the run (Prometheus / DCGM):
GPU utilisation (DCGM_FI_DEV_GPU_UTIL): 100 %
GPU memory used (DCGM_FI_DEV_FB_USED): ~13.9 GiB (the 0.6 reservation, not workload pressure)
KV-cache usage (vllm:kv_cache_usage_perc): 1.1 %
Requests running / waiting (vllm:num_requests_*): 31 / 0
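
One way such peaks could be read back is with an instant query against the Prometheus HTTP API, wrapping each metric in `max_over_time`. The sketch below only builds the PromQL and the request URL; the Prometheus address and the 30-minute window are assumptions for illustration, not values taken from the run.

```python
from urllib.parse import urlencode

# Metric names as listed in the resource-peaks table.
PEAK_METRICS = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "vllm:kv_cache_usage_perc",
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
]


def peak_query(metric, window="30m"):
    """PromQL for the highest sample of a metric over the window, across all series."""
    return f"max(max_over_time({metric}[{window}]))"


def instant_query_url(base_url, promql):
    """URL for Prometheus' instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical in-cluster Prometheus address.
    for metric in PEAK_METRICS:
        print(instant_query_url("http://prometheus:9090", peak_query(metric)))
```

The window should cover the whole sweep; a shorter window would report the peak of only the last concurrency level.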

What the numbers say

  • Compute-bound, not KV-memory-bound. GPU utilisation pins at 100 % under load while KV-cache usage never exceeds ~1 % and nothing queues in the scheduler (waiting = 0). For a 0.5B model the KV footprint per request is tiny, so on one L4 you run out of compute long before you run out of KV-cache memory. Adding pods/replicas on the same GPU would not help: that GPU is already the bottleneck.
  • TTFT degrades first. As concurrency rises 1 → 32, TTFT p50 grows ~5× (27 → 135 ms), while ITL p50 grows only ~1.4× (5.1 → 7.1 ms). More in-flight requests compete for prefill, and that contention is paid at the front of each request, before its first token. TTFT is the first SLI to watch under load.
  • Decode (ITL) stays cheap and nearly flat. ITL p50 holds at ~5-6 ms through concurrency 16 (p95 up to 8.9 ms) and reaches 7.1 ms p50 / 12.0 ms p95 only at 32. Continuous batching keeps per-token decode efficient; the small model means decode is never the constraint here.
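  • Throughput scales near-linearly, then flattens toward saturation. Output throughput rises 168 → 3773 tok/s (≈22×) from concurrency 1 → 32, tracking GPU utilisation toward 100 %. Scaling is close to linear up to concurrency 8 (1312 tok/s, ≈7.8×) and drops to about 1.7× per doubling after that. Near saturation, more concurrency buys throughput only by trading latency (TTFT/E2E climb).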
  • E2E is decode-dominated. End-to-end is ≈ TTFT + 127×ITL: the first of the 128 output tokens is covered by TTFT, and each of the remaining 127 costs one ITL. At ~5-7 ms per token, decode accounts for most of the ~0.7-1.1 s total, so E2E tracks ITL more than TTFT.
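
The scaling and decomposition claims in the list above can be checked directly against the sweep figures. The sketch below is one way to do that arithmetic; it uses p50 values throughout, and since percentiles do not add exactly, the E2E estimate is only expected to be approximate.

```python
# (concurrency, out tok/s, TTFT p50 ms, ITL p50 ms, E2E p50 ms), copied from the sweep.
ROWS = [
    (1, 168, 27.0, 5.1, 679),
    (2, 367, 37.8, 5.2, 698),
    (4, 698, 50.1, 5.3, 722),
    (8, 1312, 60.9, 5.5, 772),
    (16, 2277, 103.8, 6.0, 880),
    (32, 3773, 135.3, 7.1, 1082),
]
OUTPUT_TOKENS = 128
BASE_TOK_S = ROWS[0][1]  # throughput at concurrency 1


def scaling_efficiency(concurrency, tok_s):
    """Measured throughput as a fraction of perfect linear scaling from concurrency 1."""
    return tok_s / (BASE_TOK_S * concurrency)


def e2e_estimate(ttft_ms, itl_ms):
    """TTFT covers the first token; each of the remaining tokens costs one ITL."""
    return ttft_ms + (OUTPUT_TOKENS - 1) * itl_ms


def decode_share(ttft_ms, itl_ms):
    """Fraction of the estimated E2E latency spent in decode."""
    return (OUTPUT_TOKENS - 1) * itl_ms / e2e_estimate(ttft_ms, itl_ms)


for c, tok_s, ttft, itl, e2e in ROWS:
    print(
        f"c={c:>2}  efficiency={scaling_efficiency(c, tok_s):.2f}  "
        f"E2E est={e2e_estimate(ttft, itl):6.1f} ms (measured {e2e})  "
        f"decode share={decode_share(ttft, itl):.0%}"
    )
```

On these figures, scaling efficiency is roughly 0.98 at concurrency 8 and roughly 0.70 at 32, and decode stays above 85 % of the estimated E2E at every level.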
These shapes are model- and GPU-specific: a larger model (bigger KV per token, heavier prefill) would shift the first bottleneck toward KV-cache/memory and make TTFT far costlier. The same harness re-run on that model is the next data point.
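
To make the "KV per token" point concrete, here is a back-of-the-envelope estimate of the KV-cache footprint for this run. The architecture numbers (24 layers, 2 KV heads, head dimension 64) are an assumption taken from the published Qwen2.5-0.5B configuration rather than from this benchmark, and the estimate ignores block-level rounding in the cache allocator.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one K and one V vector per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed from the published Qwen2.5-0.5B config; BF16 means 2 bytes per value.
QWEN25_05B = {"num_layers": 24, "num_kv_heads": 2, "head_dim": 64}

TOKENS_PER_REQUEST = 256 + 128   # input + output length used in the sweep
IN_FLIGHT = 32                   # highest concurrency in the sweep

per_token = kv_bytes_per_token(**QWEN25_05B)
total_mib = per_token * TOKENS_PER_REQUEST * IN_FLIGHT / 2**20

print(f"KV per token:   {per_token / 1024:.0f} KiB")
print(f"KV at c={IN_FLIGHT}:     {total_mib:.0f} MiB")
```

Under these assumptions the whole sweep needs on the order of 150 MiB of KV cache even at concurrency 32, which fits the observation that KV-cache usage stays around 1 %. The same function applied to a model with more layers, more KV heads, or a larger head dimension shows how quickly that footprint grows.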

Live view

The vLLM Serving Grafana dashboard (dashboards/vllm-serving-dashboard.json) shows the same signals in real time: TTFT/ITL/E2E percentiles, prompt vs generation throughput, running vs waiting requests, KV-cache usage, and GPU util/mem (DCGM).