# Canary to Production Promotion Checklist

Use this checklist when promoting a canary deployment to production.

## Quick Reference Commands

```bash
# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Production deployment update
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:VERSION -n mcp

# Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp

# Rollback if needed
kubectl rollout undo deployment/zai-proxy -n mcp
```

---

## Phase 1: Pre-Promotion Validation

### Canary Health Check
- [ ] Canary pods are Running and Ready
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=test
```
Expected: All pods `Running` and `1/1` Ready

- [ ] No canary-specific Prometheus alerts firing
```bash
# Check Grafana or Alertmanager for:
# - ZaiProxyCanaryDeploymentDown
# - ZaiProxyCanaryErrorRateHigherThanProduction
# - ZaiProxyCanaryHighErrorRate
# - ZaiProxyCanaryLatencyDegraded
```

- [ ] Canary is receiving traffic (if using split traffic)
```bash
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
```
Expected: Request counters that increase between successive checks

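To turn the traffic check into a single number that can be compared between two runs, the counter samples can be summed. This is a minimal sketch, assuming the endpoint serves the standard Prometheus text format without trailing timestamps; `sum_metric` is a hypothetical helper name, not part of zai-proxy.

```bash
# Hypothetical helper: sum every sample of one metric from Prometheus text
# exposition read on stdin. Assumes sample lines carry no trailing timestamp.
sum_metric() {
  awk -v m="$1" '
    # Match the exact metric name followed by "{" (labels) or a space (no labels).
    index($0, m) == 1 && substr($0, length(m) + 1, 1) ~ /[{ ]/ { sum += $NF }
    END { printf "%d\n", sum }
  '
}

# Usage: sample twice, a minute or so apart, and compare the two totals.
# curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | sum_metric zai_proxy_requests_total
```

A total that grows between samples suggests the canary is serving requests; a flat total suggests no traffic is reaching it.
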
### Functional Testing
- [ ] Health endpoint responds
```bash
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health
```
Expected: `{"status":"ok"}`

- [ ] Token counting is working
```bash
kubectl logs -n mcp deployment/zai-proxy-test --tail=50 | grep "Token usage"
```
Expected: Log entries showing `Token usage: input=X, output=Y`

- [ ] Metrics are being exported
```bash
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_tokens_total
```
Expected: Metrics with `variant="test"` label

### Worker Testing
- [ ] At least one worker has tested the canary endpoint
```bash
# Check worker logs for canary endpoint usage
grep -r "zai-proxy-test" ~/.beads-workers/*.log
```

- [ ] Worker token counting verified
```bash
# Query Prometheus for test variant token metrics, for example:
#   sum(zai_proxy_tokens_total{variant="test"})
# Should show token counts from worker activity
```

### Version Confirmation
- [ ] Canary version is confirmed
```bash
kubectl get deployment/zai-proxy-test -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```
Note the version (e.g., `1.2.1-canary`)

- [ ] Stable version is determined
Example: `1.2.1-canary` → `1.2.1`

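The canary-to-stable mapping above can be scripted so the version is not retyped by hand. This is a sketch only: `stable_version` is a hypothetical helper, and it assumes the image reference ends in `:<tag>` and that canary tags end in `-canary`.

```bash
# Hypothetical helper: derive the stable version from a canary image reference.
# Assumes the reference ends in ":<tag>" and canary tags end in "-canary".
stable_version() {
  local tag="${1##*:}"            # text after the last ":" (the tag)
  printf '%s\n' "${tag%-canary}"  # drop a trailing "-canary" if present
}

# Usage against the live canary deployment:
# VERSION=$(stable_version "$(kubectl get deployment/zai-proxy-test -n mcp \
#   -o jsonpath='{.spec.template.spec.containers[0].image}')")
```

A tag that is already stable passes through unchanged, so the helper is safe to run more than once.
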
---

## Phase 2: Promotion Execution

### Image Update
- [ ] Production image updated to canary version (without -canary suffix)
```bash
kubectl set image deployment/zai-proxy \
  proxy=ronaldraygun/zai-proxy:VERSION -n mcp
```
Replace `VERSION` with the stable version number

- [ ] Image update verified
```bash
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}' && echo
```
Expected: New version tag

### Rollout Initiation
- [ ] Rollout status checked
```bash
kubectl rollout status deployment/zai-proxy -n mcp
```

- [ ] Pod replacement monitored
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
```
Watch until all pods are running the new version

---

## Phase 3: During Rollout Monitoring

### Pod Health
- [ ] New pods are becoming Ready
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=production
```
Expected: All pods `Running` and `1/1` Ready

- [ ] No pod crash loops
```bash
# Quote the column spec so the shell does not treat [0] as a glob pattern
kubectl get pods -n mcp -l app=zai-proxy,variant=production \
  -o 'custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'
```
Expected: RESTARTS = 0 or 1

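The restart check can be made pass/fail instead of read by eye. A minimal sketch, assuming the two-column `NAME RESTARTS` output produced by the command above; `flag_restarts` is a hypothetical helper name.

```bash
# Hypothetical helper: read "NAME RESTARTS" rows (header on line 1) from stdin,
# print pods whose restart count exceeds the limit, and fail if any were found.
flag_restarts() {
  awk -v max="${1:-1}" '
    NR > 1 && $2 + 0 > max { print $1; bad = 1 }
    END { exit bad }
  '
}

# Usage: pipe the custom-columns output from the crash-loop check into it.
# kubectl get pods -n mcp -l app=zai-proxy,variant=production \
#   -o 'custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' \
#   | flag_restarts 1
```

The default limit of 1 mirrors the expected value above; a non-zero exit status means at least one pod restarted more often than that.
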
### Error Rate
- [ ] Error rate below threshold (<5%)
```promql
sum(rate(zai_proxy_requests_total{variant="production",status=~"5.."}[5m]))
/
sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
```
Expected: < 0.05 (5%)

- [ ] No spike in 5xx errors
Check logs: `kubectl logs -n mcp deployment/zai-proxy --tail=100`

### Latency
- [ ] P95 latency has not regressed (less than a 50% increase over the pre-rollout baseline)
```promql
histogram_quantile(0.95,
  sum(rate(zai_proxy_request_duration_seconds_bucket{variant="production"}[5m])) by (le)
)
```
Expected: Similar to pre-rollout baseline

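The latency thresholds in this checklist are expressed as a percentage increase over a baseline, so it helps to compute that figure explicitly. A small sketch; `pct_change` is a hypothetical helper, and both arguments are assumed to be P95 values in seconds read from the query above.

```bash
# Hypothetical helper: percentage change of a current value against a baseline.
pct_change() {
  awk -v base="$1" -v cur="$2" 'BEGIN {
    if (base + 0 == 0) { print "baseline must be non-zero" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (cur - base) / base * 100
  }'
}

# Usage: baseline P95 of 0.8 s before the rollout, 1.0 s during it.
pct_change 0.8 1.0   # prints 25.0
```

In this example the increase is 25%, which stays inside the 50% limit for this phase.
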
### Token Counting
- [ ] Token metrics are being exported
```bash
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
  grep zai_proxy_tokens_total | grep production
```

- [ ] Token counting latency is acceptable (<100ms p99)
```promql
histogram_quantile(0.99,
  sum(rate(zai_proxy_token_count_duration_seconds_bucket{variant="production"}[5m])) by (le)
)
```
Expected: < 0.1 seconds

---

## Phase 4: Post-Rollout Verification

### Version Verification
- [ ] All pods running new version
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=production \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version
```
Expected: All pods show new version

- [ ] Deployment shows updated image
```bash
kubectl describe deployment/zai-proxy -n mcp | grep Image:
```
Expected: New image tag

### Worker Verification
- [ ] Workers are successfully making requests
```bash
# Check production logs for worker activity
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep "Token usage"
```
Expected: Token usage log entries

- [ ] Token counting is working for workers
```bash
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
  grep zai_proxy_tokens_total | grep production
```
Expected: Incrementing counters

### Log Verification
- [ ] No errors in production logs
```bash
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -i error
```
Expected: No error entries (or only expected transient errors)

- [ ] Startup logs show correct configuration
```bash
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -E "(Token counting|DEPLOYMENT_VARIANT)"
```
Expected: `DEPLOYMENT_VARIANT=production`, `Token counting enabled`

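Because the error check tolerates expected transient errors, a filter with an allow-list keeps the result unambiguous. This is a sketch under assumptions: `unexpected_errors` is a hypothetical helper, and the `context canceled` pattern in the usage line is only an illustration of a transient message, not something taken from the zai-proxy logs.

```bash
# Hypothetical helper: print log lines from stdin that mention "error" but do
# not match an allow-list regex of known transient errors; fail if any remain.
unexpected_errors() {
  local allow="${1:-^\$}"   # default pattern matches only empty lines
  if grep -i 'error' | grep -Ev -- "$allow"; then
    return 1                # at least one unexpected error line was printed
  fi
  return 0
}

# Usage (the allow-list pattern is an example; supply whatever your team accepts):
# kubectl logs -n mcp deployment/zai-proxy --tail=100 | unexpected_errors 'context canceled'
```

An empty output with exit status 0 corresponds to the expected result above.
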
### Metrics Verification
- [ ] Request rate is healthy
```promql
sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
```
Expected: Non-zero, similar to pre-rollout

- [ ] Success rate is high (>95%)
```promql
sum(rate(zai_proxy_requests_total{variant="production",status=~"2.."}[5m]))
/
sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
```
Expected: > 0.95

---

## Phase 5: Finalization

### VERSION File Update
- [ ] VERSION file updated (removed -canary suffix)
```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy
echo "VERSION" > VERSION
cat VERSION
```
Replace the quoted `VERSION` with the stable version number; the file name itself stays `VERSION`

### Git Commit and Tag
- [ ] VERSION change committed
```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy
git add VERSION
git commit -m "chore: release zai-proxy vVERSION"
```

- [ ] Git tag created
```bash
git tag -a vVERSION -m "Release zai-proxy vVERSION"
```

- [ ] Tag pushed to remote
```bash
git push origin vVERSION
```

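Since `vVERSION` is a placeholder in the commands above, a small guard can build the tag name and refuse a value that still carries a canary suffix. A sketch only: `release_tag` is a hypothetical helper, and it assumes stable versions are plain `MAJOR.MINOR.PATCH` numbers such as `1.2.1`.

```bash
# Hypothetical helper: print the git tag for a stable version, refusing
# anything that is not plain MAJOR.MINOR.PATCH (for example a -canary build).
release_tag() {
  if printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
    printf 'v%s\n' "$1"
  else
    echo "not a stable version: $1" >&2
    return 1
  fi
}

# Usage: build the tag from the VERSION file, then tag only if it is valid.
# TAG=$(release_tag "$(cat VERSION)") && git tag -a "$TAG" -m "Release zai-proxy $TAG"
```
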
### Optional Cleanup
- [ ] Canary deployment scaled down (if no longer needed)
```bash
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
```

- [ ] Canary manifest updated to next development version (if applicable)
```bash
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
# Edit zai-proxy-test.yml to next version
```

---

## Rollback Triggers

**Roll back immediately if ANY of these conditions occur:**

- [ ] Error rate exceeds 10% for more than 2 minutes
- [ ] P95 latency increases by >100% for more than 2 minutes
- [ ] More than 50% of pods are NotReady
- [ ] Pods are crash looping
- [ ] Token counting stops working
- [ ] Workers cannot connect or experience high failure rates

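The three numeric triggers in the list above can be evaluated together once the values have been measured. This is a sketch under stated assumptions: `should_rollback` is a hypothetical helper, it checks a single sample, and confirming that a condition has held for more than 2 minutes is left to the operator.

```bash
# Hypothetical helper: apply the numeric rollback triggers to values already
# measured. Arguments: error rate (0-1), P95 latency increase in percent,
# NotReady pod count, total pod count. Exit status 0 means "roll back".
should_rollback() {
  awk -v err="$1" -v lat="$2" -v notready="$3" -v total="$4" 'BEGIN {
    if (err > 0.10) { print "error rate above 10%"; hit = 1 }
    if (lat > 100)  { print "P95 latency up more than 100%"; hit = 1 }
    if (total > 0 && notready / total > 0.5) { print "more than 50% of pods NotReady"; hit = 1 }
    exit (hit ? 0 : 1)
  }'
}

# Usage: 12% errors, latency up 10%, 0 of 3 pods NotReady.
# should_rollback 0.12 10 0 3 && kubectl rollout undo deployment/zai-proxy -n mcp
```

The remaining triggers (crash loops, token counting, worker connectivity) are not numeric and still need the manual checks from the earlier phases.
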
### Rollback Commands

```bash
# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
kubectl scale deployment/zai-proxy -n mcp --replicas=1
```

---

## Post-Promotion Monitoring (First Hour)

Monitor these metrics for the first hour after promotion:

- [ ] Request rate remains stable
- [ ] Error rate stays below 5%
- [ ] P95 latency doesn't increase by >20%
- [ ] Token counting metrics are incrementing
- [ ] No new Prometheus alerts firing
- [ ] Worker logs show no unexpected errors

### Commands for Post-Promotion Monitoring

```bash
# Watch pod status
watch kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Stream logs
kubectl logs -f -n mcp deployment/zai-proxy

# Check metrics every minute
watch -n 60 'curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total'
```

---

## Sign-off

**Promotion completed by:** _____________________ **Date:** ____________

**Verification completed by:** _____________________ **Date:** ____________

**Notes/Issues:**

_________________________________________________________________________

_________________________________________________________________________

_________________________________________________________________________