serving/coder-chat is the worked “added model” example
to copy from. Both paths end at the same place: a budgeted, keyed model behind the LiteLLM /v1
facade. The contrast between serving layers (raw vLLM vs KServe vs llm-d) is in
Switch the serving layer; this guide stays on the raw-vLLM path.
The model contract (what every served model wires)
A served model on this path is five things, all in serving/<model>/ except the tenant registration:
| Piece | File | What it sets |
|---|---|---|
| Engine + args | deployment.yaml | --model, --served-model-name, quantization, context, GPU/CPU resources |
| Pre-staged weights | deployment.yaml initContainer | offline-first HF snapshot to the cache PVC |
| Weight cache | pvc.yaml | one RWO PVC per model on the cluster default StorageClass (no RWO contention) |
| In-cluster endpoint | service.yaml | OpenAI /v1 Service the gateway/LiteLLM routes to |
| Tenant registration | platform/litellm/values.yaml model_list | the keyed/budgeted tenant alias |
--served-model-name is the engine-facing name (what GET /v1/models returns and what a direct
request sends in "model"). The LiteLLM model_name is the tenant-facing alias. Keep them distinct:
clients hit the alias, the gateway routes to the served name.
A. Change the model an existing endpoint serves
Use this to swap the weights/args of an endpoint that already exists (for example, move coder-chat
to a different Qwen tier). The Service, PVC, and gateway wiring stay; only the engine args and the
pre-staged weights change.
1. Edit the engine args
In serving/<model>/deployment.yaml, change the model id (both the initContainer snapshot and the
serving --model) and the served name. Re-check that the new weights fit the GPU and host RAM.
Two sizing traps (see coder-stack.md and the model catalog):
- Host RAM caps model size, not VRAM. vLLM stages weights through host RAM, so a model whose weights exceed the GPU node’s RAM is OOM-killed during load even if it fits in VRAM. The 14B-AWQ tier (18 GB) needs a g2-standard-8 (32 GB RAM); 7B-AWQ fits g2-standard-4.
- A bigger model needs a bigger PVC. If the new weights exceed the PVC, bump resources.requests.storage in pvc.yaml (an already-bound PVC cannot shrink; growing it is fine).
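The fields this step touches can be sketched as the fragment below. It assumes the stock vllm/vllm-openai image, whose entrypoint accepts these flags as container args. The container name, model id, served name, context length, and memory figure are illustrative placeholders, not the repository's actual values.

```yaml
# serving/coder-chat/deployment.yaml (fragment; all values illustrative)
spec:
  template:
    spec:
      containers:
        - name: vllm                          # assumed container name
          image: vllm/vllm-openai:latest      # assumed image; pin a version in practice
          args:
            # Must match the model id the initContainer snapshots onto the PVC.
            - --model=Qwen/Qwen2.5-Coder-14B-Instruct-AWQ
            # Engine-facing name: what GET /v1/models returns.
            - --served-model-name=qwen-coder-14b-awq
            - --quantization=awq
            # Context window; a larger value reserves more VRAM for the KV cache.
            - --max-model-len=16384
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
          resources:
            requests:
              # Host RAM has to cover the weights staged during load,
              # so size this against the node type, not only VRAM.
              memory: 24Gi
            limits:
              nvidia.com/gpu: 1
```

When changing tiers, the model id, the served name, and the memory request usually move together; the PVC size in pvc.yaml is the fourth value to re-check.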
2. Re-stage the weights
The serving container runs with HF_HUB_OFFLINE=1, so it will only boot if the new weights are already on
the cache PVC. The initContainer handles this automatically on the next pod start: a warm cache
contacts Hugging Face zero times; a cache that lacks the new model falls back to a one-time online
pull, then serves offline. Hugging Face rate-limits the GKE egress IP (HTTP 429), so if the online
fallback fails, pre-stage out-of-band (download on an un-throttled machine and kubectl cp the files onto
the PVC; see kserve.md §5).
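One possible shape for an offline-first initContainer is sketched below. It is not the repository's actual initContainer: the container name, image, mount path, and model id are assumptions. It relies on huggingface_hub's snapshot_download, whose local_files_only=True mode resolves from the cache without network calls and raises when the snapshot is missing.

```yaml
# deployment.yaml initContainer (hypothetical sketch, not the repo's manifest)
initContainers:
  - name: stage-weights                   # hypothetical name
    image: vllm/vllm-openai:latest        # assumption: any image that ships huggingface_hub
    command: ["python3", "-c"]
    args:
      - |
        from huggingface_hub import snapshot_download

        repo = "Qwen/Qwen2.5-Coder-7B-Instruct-AWQ"  # illustrative model id
        try:
            # Warm cache: resolve the snapshot locally, zero calls to Hugging Face.
            snapshot_download(repo_id=repo, local_files_only=True)
        except Exception:
            # Cold cache: one-time online pull onto the PVC.
            snapshot_download(repo_id=repo)
    env:
      - name: HF_HOME                     # cache root; must match the serving container
        value: /cache
    volumeMounts:
      - name: model-cache
        mountPath: /cache                 # assumed mount path
```

If the online branch hits the rate limit, the initContainer fails and the pod never reaches the serving container, which is the signal to pre-stage out-of-band.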
3. Update the tenant registration
If --served-model-name changed, update the matching model_list entry in
platform/litellm/values.yaml so the alias still routes.
A per-token cost is required. A self-hosted model has no default price; if you omit the cost fields,
every call is priced at $0 and budgets never bind. Set illustrative prices (or your GPU amortization).
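A minimal sketch of such an entry follows, using LiteLLM's model_list schema. The served name, the environment variable holding the key, and the prices are illustrative assumptions, and where model_list nests inside the chart values depends on the Helm chart, so that nesting is omitted here.

```yaml
# platform/litellm/values.yaml (fragment; chart-level nesting omitted)
model_list:
  - model_name: coder-chat                    # tenant-facing alias clients send in "model"
    litellm_params:
      # Provider prefix + the engine's --served-model-name (illustrative).
      model: hosted_vllm/qwen-coder-14b-awq
      # In-cluster Service in the serving namespace, port 8000.
      api_base: http://coder-chat.serving.svc.cluster.local:8000/v1
      api_key: os.environ/VLLM_API_KEY        # assumed env var carrying the shared vLLM key
      # Illustrative prices in USD per token; without them spend is always $0.
      input_cost_per_token: 0.0000002
      output_cost_per_token: 0.0000006
```

Only the part of model after the prefix has to track --served-model-name; the alias in model_name can stay stable across weight swaps so clients do not change.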
4. Commit, sync, bring up
Serving apps are manual-sync (the cost gate, see staged-bring-up.md), so a commit does not deploy a GPU pod. After committing the manifest + values change, sync the app and bring the Deployment up by hand. The Deployment uses strategy: Recreate (one GPU, no surge slot), so the old pod terminates before the
new one schedules. First boot on a new model re-stages weights and re-captures CUDA graphs (slow).
5. Validate, then scale to $0
B. Add a new model endpoint
Use this to stand up a model that does not exist yet. serving/coder-chat is the reference: copy its
four manifests, rename, and wire it into LiteLLM and the GitOps catalog.
1. Create the manifest directory
Copy the closest existing model as the template (coder-chat for a GPU chat model, embeddings for a
CPU model), then rename the resources. A new model coder-mini (illustrative):
In serving/coder-mini/:
- pvc.yaml: rename the PVC (coder-mini-model-cache); one PVC per model so chat/fim/raw-vllm never contend on a single RWO volume. Size requests.storage to the weights.
- deployment.yaml: rename metadata.name, the app.kubernetes.io/name label (used by the selector, the Service, and wait), the initContainer model id, the serving --model / --served-model-name, and the claimName on the model-cache volume. Keep the shared vllm-api-key secret reference (one key for the whole serving namespace) and the HF_HUB_OFFLINE=1 / TRANSFORMERS_OFFLINE=1 offline env.
- service.yaml: rename metadata.name + label; keep port: 8000, targetPort: http.
- kustomization.yaml: no path edits needed (it references the three files by relative name); it reuses the vllm-api-key ExternalSecret created by the raw-vllm app.

Keep replicas: 0 (a new GPU model starts at $0 idle) and strategy: Recreate (single GPU).
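The renames above can be sketched as follows for the illustrative coder-mini model. Only the renamed fields are shown; the storage size is a placeholder, and the Deployment's initContainer and serving container are omitted because they carry over from the copied template, so this fragment is not applicable as-is.

```yaml
# serving/coder-mini/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: coder-mini-model-cache
spec:
  accessModes: [ReadWriteOnce]        # no storageClassName: uses the cluster default
  resources:
    requests:
      storage: 20Gi                   # illustrative; size to the weights
---
# serving/coder-mini/deployment.yaml (renamed fields only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder-mini
  labels:
    app.kubernetes.io/name: coder-mini
spec:
  replicas: 0                         # starts at $0 idle
  strategy:
    type: Recreate                    # single GPU, no surge slot
  selector:
    matchLabels:
      app.kubernetes.io/name: coder-mini
  template:
    metadata:
      labels:
        app.kubernetes.io/name: coder-mini
    spec:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: coder-mini-model-cache
      # initContainers / containers: copied from the template, with the
      # model id, --model and --served-model-name changed.
---
# serving/coder-mini/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: coder-mini
  labels:
    app.kubernetes.io/name: coder-mini
spec:
  selector:
    app.kubernetes.io/name: coder-mini
  ports:
    - name: http
      port: 8000
      targetPort: http                # named container port on the vLLM container
```

The app.kubernetes.io/name label appears in four places; missing any one of them leaves the Service without endpoints even though the pod is healthy.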
2. Register it in LiteLLM
Add a model_list entry in platform/litellm/values.yaml pointing at the new Service. This is what
makes the model keyed and budgeted (the “as-a-service” contract).
A key can call the new alias only if the key's models list includes it;
re-mint or scope keys per coder-stack.md §1. To register a model at runtime without
a values edit, LiteLLM’s store_model_in_db: true allows POST /model/new (the coder-agent pattern,
coder-stack.md §4b); the committed model_list is the durable, reviewable path.
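Both registration paths can be sketched in one values fragment. The served name, environment variable, and prices are illustrative assumptions; store_model_in_db is a LiteLLM general_settings option and needs the proxy's database to be configured.

```yaml
# platform/litellm/values.yaml (fragment; chart-level nesting omitted)
model_list:
  - model_name: coder-mini                      # alias that keys are scoped to
    litellm_params:
      model: hosted_vllm/coder-mini-served      # hypothetical --served-model-name
      api_base: http://coder-mini.serving.svc.cluster.local:8000/v1
      api_key: os.environ/VLLM_API_KEY          # assumed env var carrying the shared vLLM key
      input_cost_per_token: 0.0000001           # illustrative, USD per token
      output_cost_per_token: 0.0000003

general_settings:
  # Optional: allows runtime registration through POST /model/new.
  # Models added that way live in the database, not in this file.
  store_model_in_db: true
```

The committed entry survives a redeploy and shows up in review; a runtime-registered model does neither, which is why it suits short-lived experiments.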
3. Wire it into the GitOps catalog
An Argo Application per model makes it exist in the cluster. Copy the existing one and re-point its
path.
In coder-mini.yaml: set metadata.name, source.path: serving/coder-mini, and the
ignoreDifferences Deployment name to coder-mini (so make vllm-up/down-style replica scaling
never shows OutOfSync). Keep manual-sync (RespectIgnoreDifferences=true, no automated), the cost
gate for a GPU workload.
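A sketch of the resulting Application, using the standard Argo CD schema. The Argo CD namespace, project, and repoURL are placeholders, and /spec/replicas is an assumed ignore target inferred from the replica-scaling note above; copy the real values from the existing Application rather than from here.

```yaml
# coder-mini.yaml (sketch; namespace, project, and repoURL are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: coder-mini
  namespace: argocd                   # assumed Argo CD namespace
spec:
  project: default                    # assumed project
  source:
    repoURL: https://example.com/your-org/your-repo.git   # placeholder
    targetRevision: HEAD
    path: serving/coder-mini
  destination:
    server: https://kubernetes.default.svc
    namespace: serving
  ignoreDifferences:
    - group: apps
      kind: Deployment
      name: coder-mini
      jsonPointers:
        - /spec/replicas              # manual scaling never shows as OutOfSync
  syncPolicy:
    # No "automated" block here: sync stays manual (the cost gate).
    syncOptions:
      - RespectIgnoreDifferences=true
```

RespectIgnoreDifferences=true matters at sync time: without it, a manual sync would reset replicas to the committed 0 and stop a running model.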
The coding-assistant group is gated by the coding-assistant: true feature flag in
environments/ai-dev/config.yaml. The ApplicationSet (clusters/ai-dev/appsets/serving.yaml) recurses
the group’s catalogPath, so a new Application file in an enabled group’s directory is picked up
automatically; no appset edit is needed. If you add the model under a disabled group, enable the flag
and run make resolve-groups first (see staged-bring-up.md).
4. Apply, sync, bring up, validate
One GPU = one model at a time. The reference deployment runs with GPUS_ALL_REGIONS=1 (a single L4). Each GPU model requests a whole nvidia.com/gpu, so a new GPU model cannot run concurrently with another until GPU time-slicing or a second GPU lands. Scale one down before bringing the next up.
After adding a model
- Record it in the model catalog (backend, GPU, context, cost, eval).
- Benchmark it on your GPU and capture the run context.
- Tear the GPU back to $0 (make vllm-down or scale --replicas=0) when done; an idle GPU bills.