
# Operating on Progressive Delivery
This is the operational companion to *Progressive Delivery with Argo Rollouts*. That post explains the architecture and deployment. This one is the day-to-day runbook for promoting rollouts, interpreting analysis results, and recovering from stuck or failed states.
## What “Healthy” Looks Like
The progressive delivery stack is healthy when:
- The `argo-rollouts` controller pod is running in the `argo-rollouts` namespace
- The LiteLLM Rollout shows `phase: Healthy` with pods serving at `192.168.55.206:4000`
- The Sympozium Rollout shows `phase: Healthy` with pods serving at `192.168.55.207:8080`
- No `Degraded` or `Paused` rollouts exist (unless you’re mid-rollout)
## Observing State
### Controller Health
```bash
# Check the controller is running
kubectl get pods -n argo-rollouts

# Check the Cilium plugin loaded successfully (look for plugin registration in logs)
kubectl logs -n argo-rollouts deploy/argo-rollouts --tail=20 | grep -i plugin
```

### Rollout Status (All Namespaces)
```bash
# Quick overview of all rollouts
kubectl get rollout -A

# Detailed status with the kubectl plugin
kubectl argo rollouts get rollout litellm -n litellm
kubectl argo rollouts get rollout sympozium-apiserver -n sympozium-system

# Watch a rollout in real-time (live-updating dashboard)
kubectl argo rollouts get rollout litellm -n litellm --watch
```

```
$ kubectl argo rollouts get rollout litellm -n litellm
Name:            litellm
Namespace:       litellm
Status:          ◌ Progressing
Message:         waiting for rollout spec update to be observed
Strategy:        Canary
  Step:          0/6
  SetWeight:     20
  ActualWeight:  0
Replicas:
  Desired:       1
  Current:       0
  Updated:       0
  Ready:         0
  Available:     0

NAME                            KIND        STATUS         AGE  INFO
⟳ litellm                       Rollout     ◌ Progressing  25d
└──# revision:1
   └──⧉ litellm-79db46b9fc      ReplicaSet  • ScaledDown   25d  canary
```
### Analysis Results
```bash
# List recent analysis runs
kubectl get analysisrun -n litellm --sort-by=.metadata.creationTimestamp

# Check a specific analysis run's results
kubectl get analysisrun <name> -n litellm -o yaml | grep -A20 "status:"

# Check if AnalysisTemplates exist
kubectl get analysistemplate -A
```

## Canary Operations (LiteLLM)
### Triggering a Canary
A canary starts automatically when the LiteLLM Deployment spec changes. The typical trigger is bumping the image tag in `apps/litellm/values.yaml`:
```yaml
image:
  tag: "main-v1.83.0-stable"  # was main-v1.82.3-stable
```

Commit, push, and ArgoCD syncs the Deployment. The Rollout controller detects the spec change and begins the canary.
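For reference, the Rollout's canary strategy that produces this behavior might look roughly like the sketch below. The step weights match the flow described in this post, but the AnalysisTemplate name is an assumption, not taken from the real manifests:

```yaml
# Sketch of a six-step canary strategy matching the promotion flow in this
# post. The AnalysisTemplate name "error-rate" is assumed.
strategy:
  canary:
    steps:
    - setWeight: 20        # shift 20% of traffic to the canary
    - pause: {}            # wait for a manual `kubectl argo rollouts promote`
    - analysis:
        templates:
        - templateName: error-rate   # assumed name
    - setWeight: 50        # shift 50% of traffic
    - pause: {}            # second manual gate
    - analysis:
        templates:
        - templateName: error-rate
```

Six steps here lines up with the `Step: 0/6` counter shown in the CLI output above.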
### Promoting Through Steps
The canary follows this sequence:

1. 20% traffic → pause → 5-min VictoriaMetrics analysis
2. 50% traffic → pause → 5-min analysis
3. 100% (full promotion)

Each pause step waits for manual promotion:
```bash
# Advance past the current pause step
kubectl argo rollouts promote litellm -n litellm

# Skip ALL remaining steps and promote to 100% immediately
kubectl argo rollouts promote litellm -n litellm --full
```

### Handling Inconclusive Analysis
In a homelab with bursty traffic, the VictoriaMetrics error-rate query often returns NaN (zero requests in the window). Argo Rollouts treats NaN as inconclusive — it matches neither the success nor failure condition. After 3 consecutive inconclusive results (15 minutes), the analysis aborts.
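An AnalysisTemplate with this NaN-as-inconclusive behavior might look like the following sketch; the metric name, query, address, and thresholds are illustrative assumptions, not the real template:

```yaml
# Illustrative sketch only: metric name, query, address, and thresholds
# are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
  - name: error-rate
    interval: 5m
    count: 3
    inconclusiveLimit: 3   # give up after 3 inconclusive measurements
    # NaN (no traffic in the window) satisfies neither condition below,
    # so the measurement is recorded as Inconclusive
    successCondition: result[0] < 0.05
    failureCondition: result[0] >= 0.05
    provider:
      prometheus:
        address: http://vmsingle.monitoring.svc:8429   # assumed address
        query: |
          sum(rate(http_requests_total{job="litellm",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="litellm"}[5m]))
```

With a 5-minute interval, three consecutive inconclusive measurements is the 15-minute window mentioned above.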
When analysis is inconclusive:
```bash
# Check the analysis run status
kubectl get analysisrun -n litellm -l rollouts-pod-template-hash --sort-by=.metadata.creationTimestamp | tail -3

# Option 1: Generate some traffic, then promote to re-trigger analysis
curl http://192.168.55.206:4000/v1/models -H "Authorization: Bearer $LITELLM_KEY"
kubectl argo rollouts promote litellm -n litellm

# Option 2: Force promote if you're confident the release is fine
kubectl argo rollouts promote litellm -n litellm --full
```

### Aborting a Canary
```bash
# Abort — reverts traffic to 100% stable, scales down canary pods
kubectl argo rollouts abort litellm -n litellm

# After aborting, the Rollout is in a "Degraded" state. To retry:
kubectl argo rollouts retry rollout litellm -n litellm
```

## Blue-Green Operations (Sympozium)
### Triggering a Blue-Green
Like the canary, a blue-green starts when the Deployment spec changes. Bump the image tag in `apps/sympozium/values.yaml` or update the chart `targetRevision` in `apps/root/templates/sympozium.yaml`.
### Promotion Flow
1. Argo Rollouts creates the green (preview) ReplicaSet
2. Pre-promotion analysis runs (HTTP health check on `/healthz` via the preview service)
3. If health passes → the Rollout waits for manual promotion
4. You promote → traffic switches atomically from blue to green
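The blue-green strategy behind these steps plausibly looks something like this sketch; the service names follow the `sympozium-apiserver` convention used in the commands in this post, while the AnalysisTemplate name is an assumption:

```yaml
# Sketch of the blue-green strategy described above. The AnalysisTemplate
# name is assumed; service names follow this post's conventions.
strategy:
  blueGreen:
    activeService: sympozium-apiserver            # receives live traffic (blue)
    previewService: sympozium-apiserver-preview   # fronts the green ReplicaSet
    autoPromotionEnabled: false                   # step 3: wait for manual promote
    prePromotionAnalysis:                         # step 2: /healthz check on preview
      templates:
      - templateName: healthz-check               # assumed name
```

Setting `autoPromotionEnabled: false` is what makes the traffic switch wait for an explicit `promote`.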
```bash
# Watch the rollout (shows blue/green ReplicaSets and analysis state)
kubectl argo rollouts get rollout sympozium-apiserver -n sympozium-system --watch

# Smoke-test the preview stack before promoting
kubectl port-forward svc/sympozium-apiserver-preview -n sympozium-system 9090:8080
# Visit http://localhost:9090 — this hits the green stack only

# Promote green to active
kubectl argo rollouts promote sympozium-apiserver -n sympozium-system
```

### Aborting a Blue-Green
```bash
# Abort — keeps blue as active, tears down green ReplicaSet
kubectl argo rollouts abort sympozium-apiserver -n sympozium-system
```

## Troubleshooting
### Rollout Stuck in “Degraded”
This usually means the Rollout spec references something that doesn’t exist:
```bash
# Check the Rollout status message
kubectl get rollout <name> -n <ns> -o yaml | grep -A5 "phase:"

# Common causes:
# - AnalysisTemplate not found (ArgoCD hasn't synced it yet)
# - Service not found (preview service missing)
# - workloadRef Deployment not found
```

Fix: ensure all referenced resources exist; the controller then self-heals.
### ArgoCD Shows Deployment at 0 Replicas
This is expected behavior when using `workloadRef`. The Rollout controller scales the Helm chart’s Deployment to 0 and manages pods directly. The `ignoreDifferences` on `spec.replicas` prevents ArgoCD from fighting this.
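The `workloadRef` wiring looks roughly like this (a sketch; the Deployment name is assumed to match the chart):

```yaml
# Sketch: the Rollout adopts the chart's Deployment instead of embedding
# its own pod template. Names are assumed.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: litellm
spec:
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm        # the Helm chart's Deployment, scaled to 0
```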
If ArgoCD shows the Deployment as `OutOfSync` on replicas, check that `ignoreDifferences` is configured in the Application CR.
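If it is missing, the entry to check for is along these lines (a sketch; the Application name here is hypothetical):

```yaml
# Sketch of the ignoreDifferences entry in the ArgoCD Application CR.
# The Application name is hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: litellm
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas
```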
### Rollout Pods Not Starting
```bash
# Check the Rollout's ReplicaSets
kubectl get rs -n <ns> -l rollouts-pod-template-hash

# Check pod events
kubectl describe pod -n <ns> -l rollouts-pod-template-hash=<hash>

# For canary: check if the canary service exists and has the right selector
kubectl get svc litellm-canary -n litellm -o yaml | grep -A5 selector
```

### CiliumEnvoyConfig Not Created (Canary)
```bash
# Check if the Cilium plugin is loaded
kubectl logs -n argo-rollouts deploy/argo-rollouts | grep -i cilium

# Check for CiliumEnvoyConfig objects
kubectl get ciliumenvoyconfig -A

# Check RBAC — the controller needs access to cilium.io CRDs
kubectl auth can-i create ciliumenvoyconfigs --as=system:serviceaccount:argo-rollouts:argo-rollouts
```
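If the `can-i` check fails, the missing grant would look something like this sketch. The ClusterRole name is an assumption, and in a real setup the rule may instead be folded into the controller's existing aggregated roles; it still needs a binding to the `argo-rollouts` ServiceAccount:

```yaml
# Sketch of the extra RBAC the Cilium plugin needs; the name is assumed.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-rollouts-cilium
rules:
- apiGroups: ["cilium.io"]
  resources: ["ciliumenvoyconfigs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```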