Skip to main content
Both serve the same model (Qwen2.5-0.5B-Instruct) on the same engine (vLLM v0.23.0), so this compares the serving layer, not the engine. raw vLLM is serving/raw-vllm (see Benchmarks); KServe is serving/kserve. Both expose the OpenAI API, and the KServe path is fronted by the same agentgateway as the GIE routing layer. Operational gotchas: the KServe guide. Serving layer choice comparing raw vLLM, KServe InferenceService, and llm-d paths for the same model
DimensionRaw vLLM (serving/raw-vllm)KServe InferenceService (serving/kserve)
What you writeDeployment + Service + PVC + probes, by handOne InferenceService CR; KServe reconciles the rest
Model lifecycleManual (make vllm-up/down, edit image/args)Declarative; revisions tracked by the controller
Rollout / canaryDIY (Recreate; no traffic split)canaryTrafficPercent (revision-based weighted split)
Scale-to-zeroManual replicas:0; Argo ignoreDifferencesNative (minReplicas:0)
Ingress / URLService; you wire the gateway/routeKServe auto-creates an HTTPRoute + stable URL per ISVC
Multi-model contractNone; each model is bespokeUniform CR across all models (governance)
Control / blast radiusTotal control, minimal moving partsController + cert-manager + KServe’s opinionated defaults
Prereqsnone beyond the clustercert-manager, KServe controller, a network controller

What KServe buys (and what it costs)

Buys: the platform owns model lifecycle: a stable per-model URL, declarative canary, native scale-to-zero, and one CR contract across every model a team ships. That is the difference between “I deployed a model” and “we operate a model-serving platform.” Costs: a control plane to run (cert-manager + controller) and a set of opinionated behaviours that fight you until you know them: a default cpu limit of 1, the model+runtime /mnt/models injection, an Istio VirtualService path unless disabled (all in the runbook). raw vLLM has none of that surface because it does none of that work.

When to use which

  • Raw vLLM: one model, one team, full control, you own rollout/scale. The right call for a single benchmarked endpoint.
  • KServe InferenceService: you want lifecycle/canary/scale/ingress/governance handled declaratively across many models.
  • LLMInferenceService: a future path for large/multi-node models needing disaggregated serving and KV-aware routing (llm-d); it overlaps the GIE routing layer.

Confirmed behavior

The same model answers an OpenAI chat.completions request on both layers. KServe path: the InferenceService reaches Ready, GET /v1/models returns qwen2.5-0.5b, and POST /v1/chat/completions returns 200 (system_fingerprint vllm-0.23.0) through the agentgateway, identical to the raw vLLM endpoint. Weights load from a pre-staged volume rather than a live Hugging Face pull, which is also the more air-gap-friendly pattern.