Canary to Production Promotion Checklist
Use this checklist when promoting a canary deployment to production.
Quick Reference Commands
# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig
# Production deployment update
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:VERSION -n mcp
# Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp
# Rollback if needed
kubectl rollout undo deployment/zai-proxy -n mcp
Phase 1: Pre-Promotion Validation
Canary Health Check
- Canary pods are Running and Ready
    kubectl get pods -n mcp -l app=zai-proxy,variant=test
  Expected: All pods Running and 1/1 Ready
- No canary-specific Prometheus alerts firing
    # Check Grafana or AlertManager for:
    # - ZaiProxyCanaryDeploymentDown
    # - ZaiProxyCanaryErrorRateHigherThanProduction
    # - ZaiProxyCanaryHighErrorRate
    # - ZaiProxyCanaryLatencyDegraded
- Canary is receiving traffic (if using split traffic)
    curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
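The "Running and 1/1 Ready" expectation can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name all_pods_ready is hypothetical and not part of the repository.

```shell
# Reads `kubectl get pods --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every pod is Running
# with all of its containers ready (READY column of the form n/n).
all_pods_ready() {
  awk '
    NF == 0 { next }                        # ignore blank lines
    {
      seen++
      split($2, ready, "/")                 # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { if (seen > 0 && bad == 0) exit 0; exit 1 }
  '
}

# Usage (same selector as the checklist command):
#   kubectl get pods -n mcp -l app=zai-proxy,variant=test --no-headers | all_pods_ready \
#     && echo "canary pods ready"
```

An empty pod list is treated as a failure on purpose, so a wrong label selector does not pass silently.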
Functional Testing
- Health endpoint responds
    curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health
  Expected: {"status":"ok"}
- Token counting is working
    kubectl logs -n mcp deployment/zai-proxy-test --tail=50 | grep "Token usage"
  Expected: Log entries showing Token usage: input=X, output=Y
- Metrics are being exported
    curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_tokens_total
  Expected: Metrics with variant="test" label
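For scripted runs, the health check can be reduced to an exit code. This is a small sketch that assumes the response body is the JSON shown above; health_ok is a hypothetical helper name, and the pattern tolerates optional whitespace around the colon.

```shell
# Reads the /health response body on stdin and succeeds only if it
# contains "status":"ok" (whitespace around the colon is allowed).
health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage:
#   curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health | health_ok \
#     || echo "canary health check failed"
```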
Worker Testing
- At least one worker has tested canary endpoint
    # Check worker logs for canary endpoint usage
    grep -r "zai-proxy-test" ~/.beads-workers/*.log
- Worker token counting verified
    # Query Prometheus for test variant token metrics
    # Should show token counts from worker activity
Version Confirmation
- Canary version is confirmed
    kubectl get deployment/zai-proxy-test -n mcp \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
  Note the version (e.g., 1.2.1-canary)
- Stable version is determined
  Example: 1.2.1-canary → 1.2.1
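Deriving the stable version by hand is error-prone, so a small helper may be useful. This sketch assumes the canary tag is always the stable version plus a -canary suffix, as in the example; stable_version is a hypothetical name.

```shell
# Prints the stable version for a canary tag or full image reference.
# Assumption: canary tags look like "<stable version>-canary".
stable_version() {
  image_or_tag=$1
  tag=${image_or_tag##*:}           # drop "repository:" if a full image reference was passed
  printf '%s\n' "${tag%-canary}"    # drop the "-canary" suffix if present
}

# Usage (the kubectl command is the one from the checklist):
#   VERSION=$(stable_version "$(kubectl get deployment/zai-proxy-test -n mcp \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')")
#   echo "Promoting version $VERSION"
```

A tag that has no -canary suffix is returned unchanged, so the helper is safe to run on an already stable tag.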
Phase 2: Promotion Execution
Image Update
- Production image updated to canary version (without -canary suffix)
    kubectl set image deployment/zai-proxy \
      proxy=ronaldraygun/zai-proxy:VERSION -n mcp
  Replace VERSION with the stable version number
- Image update verified
    kubectl get deployment/zai-proxy -n mcp \
      -o jsonpath='{.spec.template.spec.containers[0].image}' && echo
  Expected: New version tag
Rollout Initiation
- Rollout status checked
    kubectl rollout status deployment/zai-proxy -n mcp
- Pod replacement monitored
    kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
  Watch until all pods are running the new version
Phase 3: During Rollout Monitoring
Pod Health
- New pods are becoming Ready
    kubectl get pods -n mcp -l app=zai-proxy,variant=production
  Expected: All pods Running and 1/1 Ready
- No pod crash loops
    kubectl get pods -n mcp -l app=zai-proxy,variant=production \
      -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
  Expected: RESTARTS = 0 or 1
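The restart-count expectation can be turned into a filter over the custom-columns output. A minimal sketch, assuming the two-column NAME/RESTARTS table produced by the command above; pods_over_restart_limit is a hypothetical helper.

```shell
# Reads the NAME/RESTARTS table on stdin. Prints every pod whose restart
# count exceeds the limit (default 1, the checklist's tolerance) and
# returns non-zero if any such pod was found.
pods_over_restart_limit() {
  awk -v max="${1:-1}" '
    NR == 1 && $2 == "RESTARTS" { next }       # skip the header row
    $2 + 0 > max + 0 { print $1; bad = 1 }     # non-numeric counts are treated as 0
    END { exit bad + 0 }
  '
}

# Usage:
#   kubectl get pods -n mcp -l app=zai-proxy,variant=production \
#     -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount \
#     | pods_over_restart_limit || echo "possible crash loop, see pods listed above"
```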
Error Rate
- Error rate below threshold (<5%)
    sum(rate(zai_proxy_requests_total{variant="production",status=~"5.."}[5m]))
    /
    sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
  Expected: < 0.05 (5%)
- No spike in 5xx errors
  Check logs:
    kubectl logs -n mcp deployment/zai-proxy --tail=100
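Comparing the query result against the 0.05 threshold needs floating-point arithmetic, which plain sh lacks. The sketch below delegates the comparison to awk; ratio_below is a hypothetical helper, and how the ratio is fetched (Grafana, the Prometheus UI, or an API call) is left open.

```shell
# ratio_below VALUE THRESHOLD
# Exit 0 if VALUE < THRESHOLD, 1 if not, 2 if VALUE is not a plain number
# (for example NaN or an empty result when there was no traffic).
ratio_below() {
  awk -v v="$1" -v t="$2" 'BEGIN {
    if (v !~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/) exit 2   # NaN, empty, or garbage
    exit !(v + 0 < t + 0)
  }'
}

# Usage, with the value copied from the error-rate query above:
#   ratio_below 0.012 0.05 && echo "error rate within threshold"
```

Returning a distinct code for non-numeric input keeps "no data" from being mistaken for either a pass or a failure.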
Latency
- P95 latency hasn't regressed (no increase of more than 50%)
    histogram_quantile(0.95,
      sum(rate(zai_proxy_request_duration_seconds_bucket{variant="production"}[5m])) by (le)
    )
  Expected: Similar to pre-rollout baseline
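"Similar to baseline" is easier to judge as a percentage. This sketch computes the relative change between a baseline P95 recorded before the rollout and the current value; pct_change is a hypothetical helper and both inputs are in seconds.

```shell
# pct_change BASELINE CURRENT
# Prints the change of CURRENT relative to BASELINE in percent, one decimal.
# Exits with 2 if the baseline is zero or negative.
pct_change() {
  awk -v b="$1" -v c="$2" 'BEGIN {
    if (b + 0 <= 0) exit 2                    # no usable baseline
    printf "%.1f\n", (c - b) / b * 100
  }'
}

# Usage: baseline P95 of 0.200 s, current P95 of 0.260 s
#   pct_change 0.200 0.260     # prints 30.0, below the 50% regression limit
```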
Token Counting
- Token metrics are being exported
    curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
      grep zai_proxy_tokens_total | grep production
- Token counting latency is acceptable (<100ms p99)
    histogram_quantile(0.99,
      sum(rate(zai_proxy_token_count_duration_seconds_bucket{variant="production"}[5m])) by (le)
    )
  Expected: < 0.1 seconds
Phase 4: Post-Rollout Verification
Version Verification
- All pods running new version
    kubectl get pods -n mcp -l app=zai-proxy,variant=production \
      -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version
  Expected: All pods show new version
- Deployment shows updated image
    kubectl describe deployment/zai-proxy -n mcp | grep Image:
  Expected: New image tag
Worker Verification
- Workers are successfully making requests
    # Check production logs for worker activity
    kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep "Token usage"
  Expected: Token usage log entries
- Token counting is working for workers
    curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
      grep zai_proxy_tokens_total | grep production
  Expected: Incrementing counters
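"Incrementing counters" can only be confirmed by sampling the endpoint twice. The sketch below sums one metric from the Prometheus text format so that two samples can be compared; sum_metric is a hypothetical helper, and it assumes sample lines carry no trailing timestamp.

```shell
# sum_metric NAME [FILTER]
# Reads /metrics output on stdin and prints the sum of all samples of NAME
# whose line contains FILTER (for example variant="production").
sum_metric() {
  awk -v name="$1" -v filter="${2:-}" '
    /^#/ { next }                                   # skip HELP/TYPE comments
    {
      n = length(name)
      next_char = substr($0, n + 1, 1)              # "{" or " " right after an exact name match
      if (substr($0, 1, n) == name && (next_char == "{" || next_char == " ") &&
          (filter == "" || index($0, filter) > 0))
        total += $NF                                # sample value is the last field
    }
    END { printf "%.0f\n", total }
  '
}

# Usage: take two samples a minute apart and compare them.
#   url=http://zai-proxy.mcp.svc.cluster.local:8080/metrics
#   before=$(curl -s "$url" | sum_metric zai_proxy_tokens_total 'variant="production"')
#   sleep 60
#   after=$(curl -s "$url" | sum_metric zai_proxy_tokens_total 'variant="production"')
#   [ "$after" -gt "$before" ] && echo "token counters are incrementing"
```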
Log Verification
- No errors in production logs
    kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -i error
  Expected: No error entries (or only expected transient errors)
- Startup logs show correct configuration
    kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -E "(Token counting|DEPLOYMENT_VARIANT)"
  Expected: DEPLOYMENT_VARIANT=production, Token counting enabled
Metrics Verification
- Request rate is healthy
    sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
  Expected: Non-zero, similar to pre-rollout
- Success rate is high (>95%)
    sum(rate(zai_proxy_requests_total{variant="production",status=~"2.."}[5m]))
    /
    sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
  Expected: > 0.95
Phase 5: Finalization
VERSION File Update
- VERSION file updated (removed -canary suffix)
    cd /home/coder/ardenone-cluster/containers/zai-proxy
    echo "VERSION" > VERSION
    cat VERSION
  Replace the quoted "VERSION" with the stable version number; the file name VERSION stays as is
Git Commit and Tag
- VERSION change committed
    cd /home/coder/ardenone-cluster/containers/zai-proxy
    git add VERSION
    git commit -m "chore: release zai-proxy vVERSION"
- Git tag created
    git tag -a vVERSION -m "Release zai-proxy vVERSION"
- Tag pushed to remote
    git push origin vVERSION
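Because the commands above use VERSION as a placeholder, it is easy to commit or tag the literal string by mistake. A guard like the following may help; it assumes versions follow the MAJOR.MINOR.PATCH form seen in the examples, and is_stable_version is a hypothetical helper.

```shell
# Succeeds only for a plain MAJOR.MINOR.PATCH version such as 1.2.1.
# Rejects canary tags, empty input, and the literal placeholder "VERSION".
is_stable_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Usage, run from the zai-proxy directory before committing and tagging:
#   v=$(cat VERSION)
#   is_stable_version "$v" || { echo "refusing to tag: '$v' is not a stable version" >&2; exit 1; }
#   git tag -a "v$v" -m "Release zai-proxy v$v"
```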
Optional Cleanup
- Canary deployment scaled down (if no longer needed)
    kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
- Canary manifest updated to next development version (if applicable)
    cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
    # Edit zai-proxy-test.yml to next version
Rollback Triggers
Roll back immediately if ANY of these conditions occur:
- Error rate exceeds 10% for more than 2 minutes
- P95 latency increases by >100% for more than 2 minutes
- More than 50% of pods are NotReady
- Pods are crash looping
- Token counting stops working
- Workers cannot connect or experience high failure rates
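The first two triggers are about sustained breaches, not single spikes. One way to evaluate them is to sample the metric at a fixed interval and look for a run of consecutive samples above the threshold. The sketch below does this under assumptions that are not in the original: sustained_above is a hypothetical helper, and the sampling interval is up to the operator (with one sample every 30 seconds, 5 consecutive samples span two minutes).

```shell
# sustained_above THRESHOLD COUNT
# Reads one numeric sample per line on stdin, oldest first.
# Succeeds if at least COUNT consecutive samples are above THRESHOLD.
sustained_above() {
  awk -v t="$1" -v needed="$2" '
    $1 + 0 > t + 0 { run++; if (run >= needed) hit = 1; next }
    { run = 0 }                                # a sample at or below the threshold resets the run
    END { exit !hit }
  '
}

# Usage: error-rate samples collected every 30 seconds in samples.txt
#   sustained_above 0.10 5 < samples.txt && echo "rollback trigger: error rate above 10%"
```

A single dip below the threshold resets the count, which matches the intent of ignoring short spikes.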
Rollback Commands
# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp
# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp
# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
kubectl scale deployment/zai-proxy -n mcp --replicas=1
Post-Promotion Monitoring (First Hour)
Monitor these metrics for the first hour after promotion:
- Request rate remains stable
- Error rate stays below 5%
- P95 latency doesn't increase by >20%
- Token counting metrics are incrementing
- No new Prometheus alerts firing
- Worker logs show no unexpected errors
Commands for Post-Promotion Monitoring
# Watch pod status
watch kubectl get pods -n mcp -l app=zai-proxy,variant=production
# Stream logs
kubectl logs -f -n mcp deployment/zai-proxy
# Check metrics every minute
watch -n 60 'curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total'
Sign-off
Promotion completed by: _____________________ Date: ____________
Verification completed by: _____________________ Date: ____________
Notes/Issues: