staged-bring-up.md), so most
sessions only need the cost-stop fast path. Full teardown is for when
you’re done with the cluster entirely.
Cluster facts (from environments/ai-dev/config.yaml + infra/gke/terraform/terraform.tfvars):
project <your-project>, cluster ai-dev, location us-central1-a. Export them once so the
blocks below paste clean:
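A minimal sketch of those exports. The variable names `PROJECT`, `CLUSTER`, and `LOCATION` are an assumption for illustration; the values are the cluster facts stated above, with the project ID left as a placeholder to fill in.

```shell
# Cluster coordinates from the facts above; replace the placeholder project ID.
# The variable names are illustrative, not taken from the repo.
export PROJECT="<your-project>"
export CLUSTER="ai-dev"
export LOCATION="us-central1-a"
```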
⚠️ Cost footgun: four things that survive a naive kubectl delete / cluster delete
A deleted cluster (or kubectl delete ns ...) does not reclaim these. They keep billing on
their own line items until you delete them explicitly. Audit for all four at the end (§C.7):
| Survivor | Created by | Reclaim with |
|---|---|---|
| External IPs + forwarding rules | the agentgateway data-plane LB behind each Gateway (kserve-ingress-gateway, inference-gateway) | delete the Gateway objects before the cluster (§C.3) |
| Persistent disks (PDs) | the model-cache PVCs vllm-model-cache (serving), kserve-model-cache (kserve), bound on the GKE default SC standard-rwo | delete the PVCs, then verify no orphan disks (§C.4) |
| GPU node pool | OpenTofu (gpu-l4, optional extra pools) | scale to 0 (fast) or delete via OpenTofu (§B / §C.2) |
| Secret Manager secrets + GSA | secret values seeded manually; ESO GSA/IAM from OpenTofu | optional cleanup §C.6 (free at rest, but delete if abandoning the project) |
A. Validation checklist (is it actually healthy?)
Run these before trusting a deployment. Each step states a concrete pass condition.
1. Argo apps Synced + Healthy. Platform apps are auto-sync; serving/demo are manual. Expect the manual apps (raw-vllm, kserve-demo, inference-demo) to show their workload state
only after an explicit argocd app sync.
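One way to check step 1 from the command line is to list every Argo CD Application that is not both Synced and Healthy. This is a sketch, not the repo's own check: it assumes Argo CD runs in the `argocd` namespace (pass a different one as the first argument), and the function name is illustrative.

```shell
# Print name, sync status, and health status for every Argo CD Application
# that is not both Synced and Healthy.
# Assumption: Argo CD lives in the "argocd" namespace unless $1 says otherwise.
argo_unhealthy_apps() {
  kubectl -n "${1:-argocd}" get applications.argoproj.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.sync.status}{"\t"}{.status.health.status}{"\n"}{end}' |
    awk -F'\t' '$2 != "Synced" || $3 != "Healthy"'
}
```

Run `argo_unhealthy_apps`; empty output means everything is Synced and Healthy. Manual apps that have not been synced yet will appear here, which is expected.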
2. GPU node present only when scaled up. With serving at replicas:0 there should be no
GPU node (that’s the $0-idle state, not a fault). After make vllm-up:
If the pod stays Pending, see pending-gpu-workloads.md / gpu-debugging.md (quota, taints).
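A quick way to check the GPU-node expectation in step 2 is to count the nodes in the GPU pool. This sketch assumes GKE's standard `cloud.google.com/gke-nodepool` node label and the pool name `gpu-l4` from the table above; the function name is illustrative.

```shell
# Count nodes in a GKE node pool (default: gpu-l4).
# Prints 0 when the pool is scaled to zero, which is the idle state.
gpu_node_count() {
  kubectl get nodes -l "cloud.google.com/gke-nodepool=${1:-gpu-l4}" --no-headers 2>/dev/null |
    wc -l | tr -d ' '
}
```

With serving at replicas 0, `gpu_node_count` should print 0; after `make vllm-up` it should become non-zero once the autoscaler has added a node.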
3. vLLM smoke returns 200. Authenticated OpenAI chat round-trip (after make vllm-up):
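A hedged sketch of such a smoke test, assuming the endpoint speaks the OpenAI-compatible `/v1/chat/completions` route that vLLM serves. The base URL, API key, and model name are caller-supplied because this section does not state them; the function name is illustrative.

```shell
# Usage: vllm_smoke_status BASE_URL API_KEY MODEL
# Sends one tiny chat completion and prints only the HTTP status code.
vllm_smoke_status() {
  curl -sS -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${2:?api key required}" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"${3:?model required}\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 8}" \
    "${1:?base url required}/v1/chat/completions"
}
```

The pass condition is that the function prints 200. A 401 or 403 points at the key, and a connection failure points at the gateway or at a pod that is still starting.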
raw-vllm ServiceMonitor (ns serving) and DCGM metrics feed
the in-cluster Prometheus.
Bound, none Pending/Terminating.
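The PVC condition can be checked by listing any claim whose status is not Bound. This is a sketch with an illustrative function name; it relies on the STATUS column being third in `kubectl get pvc -A` output.

```shell
# List PVCs in all namespaces whose STATUS column is anything other than Bound
# (for example Pending, Lost, or Terminating).
pvcs_not_bound() {
  kubectl get pvc -A --no-headers 2>/dev/null | awk '$3 != "Bound"'
}
```

Empty output from `pvcs_not_bound` is the pass condition.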
B. Cost-stop fast path
Keeps the cluster + GitOps intact, drops GPU spend to ~$0. This is the everyday action.
make vllm-down is the fastest “stop paying” lever: it scales deploy/raw-vllm to 0, the GPU pod
terminates, and the autoscaler returns gpu-l4 to 0 nodes (the pool’s --min-nodes=0). Argo won’t
revert it (ignoreDifferences on /spec/replicas).
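For reference, the scale-down described above can be approximated by hand with `kubectl scale`. This is a sketch, not the Makefile's actual recipe: it assumes the deployment lives in the `serving` namespace (where the raw-vllm ServiceMonitor and model-cache PVC are), and the function name is illustrative.

```shell
# Usage: vllm_scale REPLICAS
# Manually scale deploy/raw-vllm; 0 stops GPU spend once the autoscaler
# removes the now-empty GPU node.
# Assumption: the deployment is in the "serving" namespace.
vllm_scale() {
  kubectl -n serving scale deploy/raw-vllm --replicas="${1:?replica count required}"
}
```

`vllm_scale 0` mirrors the cost-stop. Prefer the make target when it is available, since it may do more than this one call.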
Pause the KServe CPU predictor too, if it’s running:
default-pool cost remains. PVCs
(PDs) persist: cheap, and they keep model caches warm. To reclaim those too, do a full teardown.
Cheaper still without losing GitOps: scaling default-pool to 0 is not safe here. Argo,
the agentgateway/KServe controllers, ESO and Prometheus run there. To go fully $0, do §C.
C. Full teardown (ordered to avoid orphans)
Order matters: workloads → GPU pool → LBs/Gateways → PVCs → cluster → optional secrets → audit. Deleting the cluster first strands forwarding rules and disks (the footgun above).
C.1 Scale workloads to 0 / delete serving apps
PROFILE does not prune wider layers. To remove a
layer, delete its per-layer ApplicationSet (its group-roots + child apps prune with it):
demos → llm-gateway → routing → serving → platform); keep platform
last (Argo CD self-manages there). For a full cluster teardown, continue with C.2-C.7 below.
C.2 Delete the GPU node pool
Scale-to-0 already costs ~$0, so for a short pause just run make vllm-down. For a full
OpenTofu-created cluster teardown, delete Gateways/PVCs first, then let make tf-destroy remove the
pool with the rest of the substrate in C.5. If you need to remove only the GPU pool manually:
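A sketch of the manual removal using `gcloud container node-pools delete`. It assumes `PROJECT`, `CLUSTER`, and `LOCATION` are set in the environment to the cluster facts at the top of this document; those variable names and the function name are illustrative.

```shell
# Usage: delete_gpu_pool [POOL_NAME]   (default pool name: gpu-l4)
# Deletes a single node pool and leaves the rest of the cluster untouched.
# Assumption: PROJECT, CLUSTER and LOCATION are exported in the environment.
delete_gpu_pool() {
  gcloud container node-pools delete "${1:-gpu-l4}" \
    --cluster "${CLUSTER:?set CLUSTER}" \
    --location "${LOCATION:?set LOCATION}" \
    --project "${PROJECT:?set PROJECT}"
}
```

gcloud asks for confirmation before deleting. If the pool is OpenTofu-managed, removing it by hand leaves the state out of step with reality, so this fits script-created pools best.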
Scale-to-0 vs delete: the pool at 0 nodes has no compute cost, so keep it if you’ll serve again soon (this avoids re-running the create script and reinstalling the driver). Delete it only as part of tearing the whole cluster down; deleting the cluster removes the pool anyway.
C.3 Delete LoadBalancer Services / Gateways BEFORE the cluster
This is the bill footgun. Each agentgateway-backed Gateway provisions a data-plane LB → a GCP
forwarding rule + external IP that outlives a deleted cluster. Delete the Gateways (and let
the controller deprovision the LB) while the cluster + controller still exist:
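A sketch of that sequence. It assumes the Gateways are Gateway API resources (`gateways.gateway.networking.k8s.io`). Their namespaces are not stated here, so list them first and pass the namespace explicitly. The function names are illustrative.

```shell
# Show every Gateway with its namespace.
list_gateways() {
  kubectl get gateways.gateway.networking.k8s.io -A
}

# Usage: delete_gateway NAMESPACE NAME
# Deletes one Gateway and waits for the API object to be gone.
delete_gateway() {
  kubectl -n "${1:?namespace required}" delete gateways.gateway.networking.k8s.io \
    "${2:?gateway name required}" --wait=true
}

# List Services of type LoadBalancer that still exist (TYPE is column 3).
remaining_lbs() {
  kubectl get svc -A --no-headers 2>/dev/null | awk '$3 == "LoadBalancer"'
}
```

Deleting the Gateway object does not guarantee the cloud LB is already gone, so re-run `remaining_lbs` until it prints nothing before moving on.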
No Service type: LoadBalancer exists in the manifests; the only LBs are the two agentgateway data-plane Services the controller creates per Gateway. They live in the gateway’s namespace; find them with kubectl get svc -A | grep -i loadbalancer if you need the exact name.
C.4 Delete PVCs and confirm the backing PDs are gone
On GKE these PVCs each bind a GCE PD via the default SC standard-rwo. Deleting the namespace/PVC
should delete the PD (reclaimPolicy Delete on standard-rwo), but verify: a Retain/stuck PV
strands the disk.
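A sketch of the delete-then-verify step, using the PVC names and namespaces from the survivor table. It assumes `PROJECT` and `LOCATION` are set in the environment; the function names are illustrative.

```shell
# Delete the two model-cache PVCs named in the survivor table.
delete_model_cache_pvcs() {
  kubectl -n serving delete pvc vllm-model-cache --ignore-not-found
  kubectl -n kserve delete pvc kserve-model-cache --ignore-not-found
}

# Any PV still listed here may be holding a disk (check its reclaim policy).
leftover_pvs() {
  kubectl get pv --no-headers 2>/dev/null
}

# List disks in the cluster's zone; an empty "users" column means unattached.
# Assumption: PROJECT and LOCATION are exported in the environment.
leftover_disks() {
  gcloud compute disks list --project "${PROJECT:?set PROJECT}" \
    --filter="zone:(${LOCATION:?set LOCATION})" \
    --format="table(name,sizeGb,users)"
}
```

Node boot disks will also show up in `leftover_disks` while the cluster exists; the rows to look for are PVC-backed disks that no instance is using.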
C.5 Delete the cluster / OpenTofu substrate
Only after Gateways/LBs and PVCs/PDs are confirmed gone (so nothing is stranded), run make tf-destroy.
For an OpenTofu-created cluster, this removes default-pool, GPU pool, node service account/IAM, Artifact Registry
repo, and ESO IAM.
For an older script-created cluster:
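A sketch of the script-created path using `gcloud container clusters delete`. It assumes `PROJECT`, `CLUSTER`, and `LOCATION` are set in the environment to the cluster facts at the top of this document; the function name is illustrative.

```shell
# Delete the whole GKE cluster (asks for confirmation interactively).
# Assumption: PROJECT, CLUSTER and LOCATION are exported in the environment.
delete_cluster() {
  gcloud container clusters delete "${CLUSTER:?set CLUSTER}" \
    --location "${LOCATION:?set LOCATION}" \
    --project "${PROJECT:?set PROJECT}"
}
```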
Deleting the cluster with gcloud container clusters delete removes default-pool and any remaining node pools.
C.6 Optional: remove Secret Manager values / legacy GSA
Secrets at rest in Secret Manager are effectively free, so keep them if you’ll redeploy. Remove them only when abandoning the project. OpenTofu removes the ESO GSA/IAM in C.5; the GSA commands below are only for older script-created clusters.
C.7 Final audit: confirm nothing paid is left
Run all four. Every one should return no rows tied to ai-dev (other unrelated project
resources may legitimately appear):
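One plausible set of four listings, as a sketch: forwarding rules, reserved addresses, disks, and clusters. The exact four commands the audit intends are not shown in this section, so treat the selection as an assumption. `PROJECT` is expected in the environment and the function name is illustrative.

```shell
# Print the four resource listings that most often keep billing after teardown.
# Assumption: PROJECT is exported in the environment.
audit_leftovers() {
  p="${PROJECT:?set PROJECT}"
  echo "== forwarding rules =="
  gcloud compute forwarding-rules list --project "$p"
  echo "== external addresses =="
  gcloud compute addresses list --project "$p"
  echo "== disks =="
  gcloud compute disks list --project "$p"
  echo "== clusters =="
  gcloud container clusters list --project "$p"
}
```

Forwarding rules created by the cluster do not necessarily carry the cluster name, so read each listing rather than filtering on `ai-dev` alone.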
D. Validate the OpenTofu substrate: two paths
The GKE substrate is now OpenTofu-managed, but the live ai-dev
cluster was created by the old shell scripts: OpenTofu has no state for it and does not track it. So the
first real tofu apply is unproven. Pick a path by intent.
D.1 Parallel substrate validation (recommended, zero downtime)
Stand up a second, throwaway cluster from OpenTofu next to the running one, in the same project, and prove the IaC end-to-end without touching production. This is the safe way to validate the substrate while real work keeps running on ai-dev.
Use isolated resource names: do NOT reuse/import the live ESO GSA or Artifact Registry repo. Those are
project-level singletons the live cluster depends on; if the throwaway state owned them, tofu destroy would
delete them out from under production. Isolated names cost one var-file and let OpenTofu create and cleanly
destroy a parallel GSA / AR repo / node SA. (The secretAccessor grant rides along, so ESO still works on the
validation cluster.)
- Shared GPU quota. Both clusters draw the same regional L4 quota (1). Keep the validation GPU pool at min 0 and don’t run make vllm-up on it while ai-dev is serving on GPU, or the second one hits a stockout.
- Optional full profile bring-up (deeper proof): make bootstrap && make argocd-repo && make root PROFILE=platform with ./kubeconfig pointed at the ai-dev-tf cluster. ESO’s K8s SA annotation comes from git and points at the live external-secrets@… GSA, so via Workload Identity the validation cluster reads secrets through the live GSA; the tf-created eso-aidevtf GSA is created and destroyed purely as substrate proof. Fine for validation; don’t promote this cluster.
- Restore the dedicated kubeconfig afterwards: gcloud container clusters get-credentials ai-dev --location us-central1-a --project <your-project> --kubeconfig=$PWD/kubeconfig.
D.2 Retire ai-dev and rebuild from OpenTofu (the real cutover)
Do this only when you actually want the gcloud-created cluster gone and the OpenTofu one to become live. This is
destructive, and the cluster created by the next tofu apply becomes your production cluster.
- Back up anything not in git / Terraform / Secret Manager first. The one piece of live state that is in none of those is the LiteLLM Postgres (litellm-pg, CNPG instances: 1, no backups): it holds the virtual keys and spend history. A teardown destroys its PD permanently. If that data matters, dump it before tearing down. (Today’s keys/spend are demo/validation data and are re-mintable; skip the dump if you don’t care.)
- Ordered teardown of ai-dev via §C (Gateways → PVCs/PDs → cluster). The live cluster is script-created, so delete it with the §C.5 gcloud container clusters delete path, not make tf-destroy (empty state). Run the §C.7 audit and confirm there are no stranded forwarding rules / IPs / disks.
- First real apply: this creates the live cluster from IaC. Secrets in Secret Manager are retained (free at rest); ESO re-materializes them, so you do not re-seed them. Note that the rebuilt cluster’s nodes use the dedicated node SA (gke-ai-dev-nodes), not the old default compute SA. This is intended (least-privilege), but node identity is not byte-identical to the retired cluster.
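For the LiteLLM Postgres backup in the first step, a hedged sketch of a logical dump taken by exec-ing into the CNPG instance pod. Several things here are assumptions: the namespace and database name are not stated in this section and must be supplied, the pod name defaults to `litellm-pg-1` following CNPG's `<cluster>-<n>` naming, and the function name is illustrative.

```shell
# Usage: dump_litellm_pg NAMESPACE DATABASE [POD] > litellm.dump
# Streams a custom-format pg_dump from the CNPG instance pod to stdout.
# Assumptions: the pod is named litellm-pg-1 unless $3 says otherwise, and
# pg_dump runs inside the "postgres" container of that pod.
dump_litellm_pg() {
  kubectl -n "${1:?namespace required}" exec "${3:-litellm-pg-1}" -c postgres -- \
    pg_dump -Fc -d "${2:?database name required}"
}
```

Check that the resulting file is non-empty before starting the teardown; it can later be loaded with `pg_restore` into the rebuilt database.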