zai-proxy/docs/notes/CANARY_PROMOTION_CHECKLIST.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00

345 lines
8.9 KiB
Markdown

# Canary to Production Promotion Checklist
Use this checklist when promoting a canary deployment to production.
## Quick Reference Commands
```bash
# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig
# Production deployment update
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:VERSION -n mcp
# Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp
# Rollback if needed
kubectl rollout undo deployment/zai-proxy -n mcp
```
---
## Phase 1: Pre-Promotion Validation
### Canary Health Check
- [ ] Canary pods are Running and Ready
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=test
```
Expected: All pods `Running` and `1/1` Ready
- [ ] No canary-specific Prometheus alerts firing
```bash
# Check Grafana or AlertManager for:
# - ZaiProxyCanaryDeploymentDown
# - ZaiProxyCanaryErrorRateHigherThanProduction
# - ZaiProxyCanaryHighErrorRate
# - ZaiProxyCanaryLatencyDegraded
```
- [ ] Canary is receiving traffic (if using split traffic)
```bash
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
```
### Functional Testing
- [ ] Health endpoint responds
```bash
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health
```
Expected: `{"status":"ok"}`
- [ ] Token counting is working
```bash
kubectl logs -n mcp deployment/zai-proxy-test --tail=50 | grep "Token usage"
```
Expected: Log entries showing `Token usage: input=X, output=Y`
- [ ] Metrics are being exported
```bash
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_tokens_total
```
Expected: Metrics with `variant="test"` label
### Worker Testing
- [ ] At least one worker has tested canary endpoint
```bash
# Check worker logs for canary endpoint usage
grep -r "zai-proxy-test" ~/.beads-workers/*.log
```
- [ ] Worker token counting verified
```bash
# Query Prometheus for test variant token metrics
# Should show token counts from worker activity
```
### Version Confirmation
- [ ] Canary version is confirmed
```bash
kubectl get deployment/zai-proxy-test -n mcp \
-o jsonpath='{.spec.template.spec.containers[0].image}'
```
Note the version (e.g., `1.2.1-canary`)
- [ ] Stable version is determined
Example: `1.2.1-canary` → `1.2.1`
---
## Phase 2: Promotion Execution
### Image Update
- [ ] Production image updated to canary version (without -canary suffix)
```bash
kubectl set image deployment/zai-proxy \
proxy=ronaldraygun/zai-proxy:VERSION -n mcp
```
Replace `VERSION` with the stable version number
- [ ] Image update verified
```bash
kubectl get deployment/zai-proxy -n mcp \
-o jsonpath='{.spec.template.spec.containers[0].image}' && echo
```
Expected: New version tag
### Rollout Initiation
- [ ] Rollout status checked
```bash
kubectl rollout status deployment/zai-proxy -n mcp
```
- [ ] Pod replacement monitored
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
```
Watch until all pods are new version
---
## Phase 3: During Rollout Monitoring
### Pod Health
- [ ] New pods are becoming Ready
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=production
```
Expected: All pods `Running` and `1/1` Ready
- [ ] No pod crash loops
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=production \
-o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```
Expected: RESTARTS = 0 or 1
### Error Rate
- [ ] Error rate below threshold (<5%)
```promql
sum(rate(zai_proxy_requests_total{variant="production",status=~"5.."}[5m]))
/
sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
```
Expected: < 0.05 (5%)
- [ ] No spike in 5xx errors
Check logs: `kubectl logs -n mcp deployment/zai-proxy --tail=100`
### Latency
- [ ] P95 latency hasn't regressed (>50% increase)
```promql
histogram_quantile(0.95,
sum(rate(zai_proxy_request_duration_seconds_bucket{variant="production"}[5m])) by (le)
)
```
Expected: Similar to pre-rollout baseline
### Token Counting
- [ ] Token metrics are being exported
```bash
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
grep zai_proxy_tokens_total | grep production
```
- [ ] Token counting latency is acceptable (<100ms p99)
```promql
histogram_quantile(0.99,
sum(rate(zai_proxy_token_count_duration_seconds_bucket{variant="production"}[5m])) by (le)
)
```
Expected: < 0.1 seconds
---
## Phase 4: Post-Rollout Verification
### Version Verification
- [ ] All pods running new version
```bash
kubectl get pods -n mcp -l app=zai-proxy,variant=production \
-o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version
```
Expected: All pods show new version
- [ ] Deployment shows updated image
```bash
kubectl describe deployment/zai-proxy -n mcp | grep Image:
```
Expected: New image tag
### Worker Verification
- [ ] Workers are successfully making requests
```bash
# Check production logs for worker activity
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep "Token usage"
```
Expected: Token usage log entries
- [ ] Token counting is working for workers
```bash
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
grep zai_proxy_tokens_total | grep production
```
Expected: Incrementing counters
### Log Verification
- [ ] No errors in production logs
```bash
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -i error
```
Expected: No error entries (or only expected transient errors)
- [ ] Startup logs show correct configuration
```bash
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -E "(Token counting|DEPLOYMENT_VARIANT)"
```
Expected: `DEPLOYMENT_VARIANT=production`, `Token counting enabled`
### Metrics Verification
- [ ] Request rate is healthy
```promql
sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
```
Expected: Non-zero, similar to pre-rollout
- [ ] Success rate is high (>95%)
```promql
sum(rate(zai_proxy_requests_total{variant="production",status=~"2.."}[5m]))
/
sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
```
Expected: > 0.95
---
## Phase 5: Finalization
### VERSION File Update
- [ ] VERSION file updated (removed -canary suffix)
```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy
echo "VERSION" > VERSION
cat VERSION
```
Replace `VERSION` with stable version number
### Git Commit and Tag
- [ ] VERSION change committed
```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy
git add VERSION
git commit -m "chore: release zai-proxy vVERSION"
```
- [ ] Git tag created
```bash
git tag -a vVERSION -m "Release zai-proxy vVERSION"
```
- [ ] Tag pushed to remote
```bash
git push origin vVERSION
```
### Optional Cleanup
- [ ] Canary deployment scaled down (if no longer needed)
```bash
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
```
- [ ] Canary manifest updated to next development version (if applicable)
```bash
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
# Edit zai-proxy-test.yml to next version
```
---
## Rollback Triggers
**Immediately rollback if ANY of these conditions occur:**
- [ ] Error rate exceeds 10% for more than 2 minutes
- [ ] P95 latency increases by >100% for more than 2 minutes
- [ ] More than 50% of pods are NotReady
- [ ] Pods are crash looping
- [ ] Token counting stops working
- [ ] Workers cannot connect or experience high failure rates
### Rollback Commands
```bash
# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp
# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp
# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
kubectl scale deployment/zai-proxy -n mcp --replicas=1
```
---
## Post-Promotion Monitoring (First Hour)
Monitor these metrics for the first hour after promotion:
- [ ] Request rate remains stable
- [ ] Error rate stays below 5%
- [ ] P95 latency doesn't increase by >20%
- [ ] Token counting metrics are incrementing
- [ ] No new Prometheus alerts firing
- [ ] Worker logs show no unexpected errors
### Commands for Post-Promotion Monitoring
```bash
# Watch pod status
watch kubectl get pods -n mcp -l app=zai-proxy,variant=production
# Stream logs
kubectl logs -f -n mcp deployment/zai-proxy
# Check metrics every minute
watch -n 60 'curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total'
```
---
## Sign-off
**Promotion completed by:** _____________________ **Date:** ____________
**Verification completed by:** _____________________ **Date:** ____________
**Notes/Issues:**
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________