zai-proxy/docs/notes/CANARY_PROMOTION_CHECKLIST.md

# Canary to Production Promotion Checklist

Use this checklist when promoting a canary deployment to production.

## Quick Reference Commands

```bash
# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Production deployment update
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:VERSION -n mcp

# Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp

# Rollback if needed
kubectl rollout undo deployment/zai-proxy -n mcp
```

---

## Phase 1: Pre-Promotion Validation

### Canary Health Check
- [ ] Canary pods are Running and Ready
  ```bash
  kubectl get pods -n mcp -l app=zai-proxy,variant=test
  ```
  Expected: All pods `Running` and `1/1` Ready

- [ ] No canary-specific Prometheus alerts firing
  ```bash
  # Check Grafana or AlertManager for:
  # - ZaiProxyCanaryDeploymentDown
  # - ZaiProxyCanaryErrorRateHigherThanProduction
  # - ZaiProxyCanaryHighErrorRate
  # - ZaiProxyCanaryLatencyDegraded
  ```

- [ ] Canary is receiving traffic (if using split traffic)
  ```bash
  curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
  ```

### Functional Testing
- [ ] Health endpoint responds
  ```bash
  curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health
  ```
  Expected: `{"status":"ok"}`

- [ ] Token counting is working
  ```bash
  kubectl logs -n mcp deployment/zai-proxy-test --tail=50 | grep "Token usage"
  ```
  Expected: Log entries showing `Token usage: input=X, output=Y`

- [ ] Metrics are being exported
  ```bash
  curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_tokens_total
  ```
  Expected: Metrics with `variant="test"` label

### Worker Testing
- [ ] At least one worker has tested canary endpoint
  ```bash
  # Check worker logs for canary endpoint usage
  grep -r "zai-proxy-test" ~/.beads-workers/*.log
  ```

- [ ] Worker token counting verified
  ```bash
  # Query Prometheus for test variant token metrics
  # Should show token counts from worker activity
  ```

### Version Confirmation
- [ ] Canary version is confirmed
  ```bash
  kubectl get deployment/zai-proxy-test -n mcp \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
  ```
  Note the version (e.g., `1.2.1-canary`)

- [ ] Stable version is determined
  Example: `1.2.1-canary` → `1.2.1`

---

## Phase 2: Promotion Execution

### Image Update
- [ ] Production image updated to canary version (without -canary suffix)
  ```bash
  kubectl set image deployment/zai-proxy \
    proxy=ronaldraygun/zai-proxy:VERSION -n mcp
  ```
  Replace `VERSION` with the stable version number

- [ ] Image update verified
  ```bash
  kubectl get deployment/zai-proxy -n mcp \
    -o jsonpath='{.spec.template.spec.containers[0].image}' && echo
  ```
  Expected: New version tag

### Rollout Initiation
- [ ] Rollout status checked
  ```bash
  kubectl rollout status deployment/zai-proxy -n mcp
  ```

- [ ] Pod replacement monitored
  ```bash
  kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
  ```
  Watch until all pods are new version

---

## Phase 3: During Rollout Monitoring

### Pod Health
- [ ] New pods are becoming Ready
  ```bash
  kubectl get pods -n mcp -l app=zai-proxy,variant=production
  ```
  Expected: All pods `Running` and `1/1` Ready

- [ ] No pod crash loops
  ```bash
  kubectl get pods -n mcp -l app=zai-proxy,variant=production \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
  ```
  Expected: RESTARTS = 0 or 1

### Error Rate
- [ ] Error rate below threshold (<5%)
  ```promql
  sum(rate(zai_proxy_requests_total{variant="production",status=~"5.."}[5m]))
  /
  sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
  ```
  Expected: < 0.05 (5%)

- [ ] No spike in 5xx errors
  Check logs: `kubectl logs -n mcp deployment/zai-proxy --tail=100`

### Latency
- [ ] P95 latency hasn't regressed (>50% increase)
  ```promql
  histogram_quantile(0.95,
    sum(rate(zai_proxy_request_duration_seconds_bucket{variant="production"}[5m])) by (le)
  )
  ```
  Expected: Similar to pre-rollout baseline

### Token Counting
- [ ] Token metrics are being exported
  ```bash
  curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
    grep zai_proxy_tokens_total | grep production
  ```

- [ ] Token counting latency is acceptable (<100ms p99)
  ```promql
  histogram_quantile(0.99,
    sum(rate(zai_proxy_token_count_duration_seconds_bucket{variant="production"}[5m])) by (le)
  )
  ```
  Expected: < 0.1 seconds

---

## Phase 4: Post-Rollout Verification

### Version Verification
- [ ] All pods running new version
  ```bash
  kubectl get pods -n mcp -l app=zai-proxy,variant=production \
    -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version
  ```
  Expected: All pods show new version

- [ ] Deployment shows updated image
  ```bash
  kubectl describe deployment/zai-proxy -n mcp | grep Image:
  ```
  Expected: New image tag

### Worker Verification
- [ ] Workers are successfully making requests
  ```bash
  # Check production logs for worker activity
  kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep "Token usage"
  ```
  Expected: Token usage log entries

- [ ] Token counting is working for workers
  ```bash
  curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
    grep zai_proxy_tokens_total | grep production
  ```
  Expected: Incrementing counters

### Log Verification
- [ ] No errors in production logs
  ```bash
  kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -i error
  ```
  Expected: No error entries (or only expected transient errors)

- [ ] Startup logs show correct configuration
  ```bash
  kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -E "(Token counting|DEPLOYMENT_VARIANT)"
  ```
  Expected: `DEPLOYMENT_VARIANT=production`, `Token counting enabled`

### Metrics Verification
- [ ] Request rate is healthy
  ```promql
  sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
  ```
  Expected: Non-zero, similar to pre-rollout

- [ ] Success rate is high (>95%)
  ```promql
  sum(rate(zai_proxy_requests_total{variant="production",status=~"2.."}[5m]))
  /
  sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
  ```
  Expected: > 0.95

---

## Phase 5: Finalization

### VERSION File Update
- [ ] VERSION file updated (removed -canary suffix)
  ```bash
  cd /home/coder/ardenone-cluster/containers/zai-proxy
  echo "VERSION" > VERSION
  cat VERSION
  ```
  Replace `VERSION` with stable version number

### Git Commit and Tag
- [ ] VERSION change committed
  ```bash
  cd /home/coder/ardenone-cluster/containers/zai-proxy
  git add VERSION
  git commit -m "chore: release zai-proxy vVERSION"
  ```

- [ ] Git tag created
  ```bash
  git tag -a vVERSION -m "Release zai-proxy vVERSION"
  ```

- [ ] Tag pushed to remote
  ```bash
  git push origin vVERSION
  ```

### Optional Cleanup
- [ ] Canary deployment scaled down (if no longer needed)
  ```bash
  kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
  ```

- [ ] Canary manifest updated to next development version (if applicable)
  ```bash
  cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
  # Edit zai-proxy-test.yml to next version
  ```

---

## Rollback Triggers

**Immediately rollback if ANY of these conditions occur:**

- [ ] Error rate exceeds 10% for more than 2 minutes
- [ ] P95 latency increases by >100% for more than 2 minutes
- [ ] More than 50% of pods are NotReady
- [ ] Pods are crash looping
- [ ] Token counting stops working
- [ ] Workers cannot connect or experience high failure rates

### Rollback Commands

```bash
# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
kubectl scale deployment/zai-proxy -n mcp --replicas=1
```

---

## Post-Promotion Monitoring (First Hour)

Monitor these metrics for the first hour after promotion:

- [ ] Request rate remains stable
- [ ] Error rate stays below 5%
- [ ] P95 latency doesn't increase by >20%
- [ ] Token counting metrics are incrementing
- [ ] No new Prometheus alerts firing
- [ ] Worker logs show no unexpected errors

### Commands for Post-Promotion Monitoring

```bash
# Watch pod status
watch kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Stream logs
kubectl logs -f -n mcp deployment/zai-proxy

# Check metrics every minute
watch -n 60 'curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total'
```

---

## Sign-off

**Promotion completed by:** _____________________ **Date:** ____________

**Verification completed by:** _____________________ **Date:** ____________

**Notes/Issues:**

_________________________________________________________________________

_________________________________________________________________________

_________________________________________________________________________