zai-proxy/docs/notes/CANARY_PROMOTION_CHECKLIST.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00

Canary to Production Promotion Checklist

Use this checklist when promoting a canary deployment to production.

Quick Reference Commands

# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Production deployment update
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:VERSION -n mcp

# Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp

# Rollback if needed
kubectl rollout undo deployment/zai-proxy -n mcp

Phase 1: Pre-Promotion Validation

Canary Health Check

  • Canary pods are Running and Ready

    kubectl get pods -n mcp -l app=zai-proxy,variant=test
    

    Expected: All pods Running and 1/1 Ready

  • No canary-specific Prometheus alerts firing

    # Check Grafana or AlertManager for:
    # - ZaiProxyCanaryDeploymentDown
    # - ZaiProxyCanaryErrorRateHigherThanProduction
    # - ZaiProxyCanaryHighErrorRate
    # - ZaiProxyCanaryLatencyDegraded
    
  • Canary is receiving traffic (if using split traffic)

    curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
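The pod readiness check in this section can be scripted. The following is a minimal sketch, assuming the default `kubectl get pods` column layout (NAME READY STATUS RESTARTS AGE). The helper name `unready_pods` is hypothetical and not part of the repo.

```shell
# unready_pods: hypothetical helper, not part of zai-proxy.
# Reads default `kubectl get pods` output on stdin, prints every pod that is
# not Running or not fully Ready, and exits non-zero if any are found.
# An empty listing passes, so confirm the pod count separately.
unready_pods() {
  awk 'NR > 1 {
         split($2, ready, "/")            # READY column looks like "1/1"
         if ($3 != "Running" || ready[1] != ready[2]) { print $1; bad = 1 }
       }
       END { exit bad + 0 }'
}

# Usage:
#   kubectl get pods -n mcp -l app=zai-proxy,variant=test | unready_pods
```

No output and a zero exit status mean every listed pod is Running and Ready.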
    

Functional Testing

  • Health endpoint responds

    curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health
    

    Expected: {"status":"ok"}

  • Token counting is working

    kubectl logs -n mcp deployment/zai-proxy-test --tail=50 | grep "Token usage"
    

    Expected: Log entries showing Token usage: input=X, output=Y

  • Metrics are being exported

    curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_tokens_total
    

    Expected: Metrics with variant="test" label

Worker Testing

  • At least one worker has tested canary endpoint

    # Check worker logs for canary endpoint usage
    grep -r "zai-proxy-test" ~/.beads-workers/*.log
    
  • Worker token counting verified

    # Query Prometheus for test variant token metrics
    # Should show token counts from worker activity
    

Version Confirmation

  • Canary version is confirmed

    kubectl get deployment/zai-proxy-test -n mcp \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
    

    Note the version (e.g., 1.2.1-canary)

  • Stable version is determined

    Example: 1.2.1-canary becomes 1.2.1
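Deriving the stable tag by hand is easy to get wrong, so it can be done with shell parameter expansion. This is a sketch only. `stable_version` is a hypothetical helper name, and it assumes the canary tag ends in exactly `-canary`.

```shell
# stable_version: hypothetical helper, not part of zai-proxy.
# Takes an image reference (or a bare tag) and prints the tag with a
# trailing "-canary" removed. A reference without a colon is treated as
# a bare tag.
stable_version() {
  local image="$1"
  local tag="${image##*:}"          # keep only what follows the last colon
  printf '%s\n' "${tag%-canary}"    # strip the suffix if it is present
}

# Usage:
#   stable_version "$(kubectl get deployment/zai-proxy-test -n mcp \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
```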


Phase 2: Promotion Execution

Image Update

  • Production image updated to canary version (without -canary suffix)

    kubectl set image deployment/zai-proxy \
      proxy=ronaldraygun/zai-proxy:VERSION -n mcp
    

    Replace VERSION with the stable version number

  • Image update verified

    kubectl get deployment/zai-proxy -n mcp \
      -o jsonpath='{.spec.template.spec.containers[0].image}' && echo
    

    Expected: New version tag

Rollout Initiation

  • Rollout status checked

    kubectl rollout status deployment/zai-proxy -n mcp
    
  • Pod replacement monitored

    kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
    

    Watch until all pods are new version


Phase 3: During Rollout Monitoring

Pod Health

  • New pods are becoming Ready

    kubectl get pods -n mcp -l app=zai-proxy,variant=production
    

    Expected: All pods Running and 1/1 Ready

  • No pod crash loops

    kubectl get pods -n mcp -l app=zai-proxy,variant=production \
      -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'
    

    Expected: RESTARTS = 0 or 1
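The restart check can be turned into a pass/fail filter. This sketch assumes a two-column listing of pod name and restart count, as produced by the custom-columns command in this section. `restarting_pods` is a hypothetical name.

```shell
# restarting_pods: hypothetical helper, not part of zai-proxy.
# Reads a "NAME RESTARTS" table on stdin (header on the first line), prints
# pods with more than one restart, and exits non-zero if any are found.
# A non-numeric count such as "<none>" is treated as 0.
restarting_pods() {
  awk 'NR > 1 && $2 + 0 > 1 { print $1; bad = 1 }
       END { exit bad + 0 }'
}

# Usage:
#   kubectl get pods -n mcp -l app=zai-proxy,variant=production \
#     -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' \
#     | restarting_pods
```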

Error Rate

  • Error rate below threshold (<5%)

    sum(rate(zai_proxy_requests_total{variant="production",status=~"5.."}[5m]))
    /
    sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
    

    Expected: < 0.05 (5%)

  • No spike in 5xx errors

    Check logs: kubectl logs -n mcp deployment/zai-proxy --tail=100
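When Prometheus is not reachable, a rough error ratio can be computed straight from the `/metrics` text. This is a sketch under stated assumptions, not a substitute for the PromQL query: it uses cumulative counters since process start rather than a 5-minute rate, and it assumes samples carry no trailing timestamp and no spaces in label values. `error_ratio` is a hypothetical name.

```shell
# error_ratio: hypothetical helper, not part of zai-proxy.
# Reads Prometheus text-format metrics on stdin and prints the fraction of
# production requests whose status label is 5xx, or "no-data" when no
# matching samples exist.
error_ratio() {
  awk '
    /^zai_proxy_requests_total[{]/ && /variant="production"/ {
      value = $NF                                   # sample value is the last field
      total += value
      if ($0 ~ /status="5[0-9][0-9]"/) errors += value
    }
    END {
      if (total == 0) { print "no-data"; exit }
      printf "%.4f\n", errors / total
    }'
}

# Usage:
#   curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | error_ratio
```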

Latency

  • P95 latency has not regressed (a regression is an increase of more than 50% over the pre-rollout baseline)

    histogram_quantile(0.95,
      sum(rate(zai_proxy_request_duration_seconds_bucket{variant="production"}[5m])) by (le)
    )
    

    Expected: Similar to pre-rollout baseline
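Comparing the current P95 against the baseline is simple arithmetic, but the shell cannot do floating-point math natively. A small sketch using awk follows. `pct_increase` is a hypothetical helper, and both arguments are assumed to be values in seconds read off the query above.

```shell
# pct_increase: hypothetical helper, not part of zai-proxy.
# Prints the percentage change from baseline to current, one decimal place.
# Exits with status 2 if the baseline is not a positive number.
pct_increase() {
  local baseline="$1" current="$2"
  awk -v b="$baseline" -v c="$current" 'BEGIN {
    if (b + 0 <= 0) { print "baseline must be positive"; exit 2 }
    printf "%.1f\n", (c - b) / b * 100
  }'
}

# Usage: pct_increase <baseline_p95_seconds> <current_p95_seconds>
```

A result above 50 means the rollout has crossed the regression threshold for this phase.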

Token Counting

  • Token metrics are being exported

    curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
      grep zai_proxy_tokens_total | grep production
    
  • Token counting latency is acceptable (<100ms p99)

    histogram_quantile(0.99,
      sum(rate(zai_proxy_token_count_duration_seconds_bucket{variant="production"}[5m])) by (le)
    )
    

    Expected: < 0.1 seconds


Phase 4: Post-Rollout Verification

Version Verification

  • All pods running new version

    kubectl get pods -n mcp -l app=zai-proxy,variant=production \
      -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image'
    

    Expected: All pods show the new image tag

  • Deployment shows updated image

    kubectl describe deployment/zai-proxy -n mcp | grep Image:
    

    Expected: New image tag
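Checking every pod against the expected image can be automated. This sketch assumes a two-column listing of pod name and container image, such as `-o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image'` produces. `pods_off_image` is a hypothetical name.

```shell
# pods_off_image: hypothetical helper, not part of zai-proxy.
# Reads a "NAME IMAGE" table on stdin (header on the first line), prints
# every pod whose image differs from the expected one, and exits non-zero
# if any are found.
pods_off_image() {
  local expected="$1"
  awk -v want="$expected" '
    NR > 1 && $2 != want { print $1 " " $2; bad = 1 }
    END { exit bad + 0 }'
}

# Usage:
#   kubectl get pods -n mcp -l app=zai-proxy,variant=production \
#     -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image' \
#     | pods_off_image ronaldraygun/zai-proxy:VERSION
```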

Worker Verification

  • Workers are successfully making requests

    # Check production logs for worker activity
    kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep "Token usage"
    

    Expected: Token usage log entries

  • Token counting is working for workers

    curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
      grep zai_proxy_tokens_total | grep production
    

    Expected: Incrementing counters
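"Incrementing" can be verified by summing the counter in two snapshots taken some time apart. This is a minimal sketch. `sum_metric` is a hypothetical helper, and it assumes samples carry no trailing timestamp.

```shell
# sum_metric: hypothetical helper, not part of zai-proxy.
# Reads Prometheus text-format metrics on stdin and prints the sum of all
# samples of the named metric, skipping "# HELP" and "# TYPE" lines.
sum_metric() {
  local name="$1"
  awk -v m="$name" '
    index($0, m "{") == 1 || index($0, m " ") == 1 { sum += $NF }
    END { printf "%.0f\n", sum }'
}

# Usage (two snapshots, 60 seconds apart; the second should be larger):
#   url=http://zai-proxy.mcp.svc.cluster.local:8080/metrics
#   before=$(curl -s "$url" | grep production | sum_metric zai_proxy_tokens_total)
#   sleep 60
#   after=$(curl -s "$url" | grep production | sum_metric zai_proxy_tokens_total)
#   [ "$after" -gt "$before" ] && echo "counters incrementing"
```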

Log Verification

  • No errors in production logs

    kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -i error
    

    Expected: No error entries (or only expected transient errors)

  • Startup logs show correct configuration

    kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep -E "(Token counting|DEPLOYMENT_VARIANT)"
    

    Expected: DEPLOYMENT_VARIANT=production, Token counting enabled

Metrics Verification

  • Request rate is healthy

    sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
    

    Expected: Non-zero, similar to pre-rollout

  • Success rate is high (>95%)

    sum(rate(zai_proxy_requests_total{variant="production",status=~"2.."}[5m]))
    /
    sum(rate(zai_proxy_requests_total{variant="production"}[5m]))
    

    Expected: > 0.95


Phase 5: Finalization

VERSION File Update

  • VERSION file updated (removed -canary suffix)

    cd /home/coder/ardenone-cluster/containers/zai-proxy
    echo "VERSION" > VERSION
    cat VERSION
    

    Replace the quoted VERSION with the stable version number; the file name stays VERSION

Git Commit and Tag

  • VERSION change committed

    cd /home/coder/ardenone-cluster/containers/zai-proxy
    git add VERSION
    git commit -m "chore: release zai-proxy vVERSION"
    
  • Git tag created

    git tag -a vVERSION -m "Release zai-proxy vVERSION"
    
  • Tag pushed to remote

    git push origin vVERSION
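Before committing and tagging, it may help to guard against a leftover `-canary` suffix. This sketch assumes stable versions are plain MAJOR.MINOR.PATCH strings, as in the examples in this checklist. `is_stable_version` is a hypothetical name.

```shell
# is_stable_version: hypothetical helper, not part of zai-proxy.
# Succeeds only if the argument is a bare MAJOR.MINOR.PATCH version with no
# suffix such as "-canary" and no leading "v".
is_stable_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Usage:
#   is_stable_version "$(cat VERSION)" || echo "VERSION is not a stable version"
```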
    

Optional Cleanup

  • Canary deployment scaled down (if no longer needed)

    kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
    
  • Canary manifest updated to next development version (if applicable)

    cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
    # Edit zai-proxy-test.yml to next version
    

Rollback Triggers

Roll back immediately if ANY of these conditions occur:

  • Error rate exceeds 10% for more than 2 minutes
  • P95 latency increases by >100% for more than 2 minutes
  • More than 50% of pods are NotReady
  • Pods are crash looping
  • Token counting stops working
  • Workers cannot connect or experience high failure rates
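The "for more than 2 minutes" qualifier means a single bad sample should not trigger a rollback. One way to express that is to require a run of consecutive breaching samples. This is a sketch only. `sustained_breach` is a hypothetical helper, and the sampling interval and number of samples are assumptions to choose for your own setup.

```shell
# sustained_breach: hypothetical helper, not part of zai-proxy.
# Reads one numeric sample per line on stdin (oldest first) and succeeds
# only if the most recent <needed> samples all exceed <threshold>.
sustained_breach() {
  local threshold="$1" needed="$2"
  awk -v t="$threshold" -v n="$needed" '
    { if ($1 + 0 > t + 0) streak++; else streak = 0 }   # reset on any good sample
    END { exit (streak >= n + 0) ? 0 : 1 }'
}

# Usage (error-rate samples taken every 30 s; 5 in a row span 2 minutes):
#   sustained_breach 0.10 5 < error_rate_samples.txt && echo "rollback trigger met"
```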

Rollback Commands

# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
# Restore the deployment's normal replica count (1 shown here)
kubectl scale deployment/zai-proxy -n mcp --replicas=1

Post-Promotion Monitoring (First Hour)

Monitor these metrics for the first hour after promotion:

  • Request rate remains stable
  • Error rate stays below 5%
  • P95 latency doesn't increase by >20%
  • Token counting metrics are incrementing
  • No new Prometheus alerts firing
  • Worker logs show no unexpected errors

Commands for Post-Promotion Monitoring

# Watch pod status
watch kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Stream logs
kubectl logs -f -n mcp deployment/zai-proxy

# Check metrics every minute
watch -n 60 'curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total'

Sign-off

Promotion completed by: _____________________ Date: ____________

Verification completed by: _____________________ Date: ____________

Notes/Issues: