zai-proxy/docs/notes/CANARY_PROMOTION_PROCEDURE.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
2026-05-16 15:53:52 -04:00


Canary to Production Promotion Procedure

This document describes the procedure to promote a canary deployment to production after successful testing.

Overview

The zai-proxy deployment uses a dual-deployment strategy:

  • Production deployment (zai-proxy): Live traffic
  • Canary deployment (zai-proxy-test): Testing new versions

After a canary version has been validated, it can be promoted to production using this procedure.

Prerequisites

Before promoting canary to production, ensure:

  1. Canary is healthy: No alerts firing in Prometheus
  2. Testing complete: Functional tests pass on canary endpoint
  3. Metrics validated: Token counting, rate limiting, and error rates are acceptable
  4. Workers tested: At least one worker has successfully used the canary endpoint
  5. Version ready: Canary image tag is known and available

Architecture Reference

                    ┌─────────────────────────────────────┐
                    │    Canary (zai-proxy-test)          │
                    │    Image: X.Y.Z-canary              │
                    └──────────────┬──────────────────────┘
                                   │ Promote
                                   ▼
                    ┌─────────────────────────────────────┐
                    │    Production (zai-proxy)           │
                    │    Image: X.Y.Z (no -canary)        │
                    └─────────────────────────────────────┘

Procedure

Step 1: Update Production Deployment Image Tag

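Update the production deployment's image tag to the validated canary version, dropping the -canary suffix if present. The unsuffixed tag must already exist in the registry, otherwise the new pods will fail to pull the image.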

Example: Promoting 1.2.1-canary to 1.2.1

# Using kubectl set image (fastest method)
kubectl set image deployment/zai-proxy \
  proxy=ronaldraygun/zai-proxy:1.2.1 \
  -n mcp
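
The tag rename in the example above can be sketched as a small shell helper. This is a minimal illustration, not part of the zai-proxy tooling: promote_image is a hypothetical name, and it assumes the only difference between the canary and production references is the trailing -canary suffix.

```shell
# Sketch: derive the production image reference from a canary reference.
# promote_image is a hypothetical helper name, not part of zai-proxy.
promote_image() {
  canary_ref=$1
  case "$canary_ref" in
    *:*) ;;
    *) echo "no tag in image reference: $canary_ref" >&2; return 1 ;;
  esac
  # Strip a trailing "-canary"; references without the suffix pass through unchanged.
  printf '%s\n' "${canary_ref%-canary}"
}

promote_image "ronaldraygun/zai-proxy:1.2.1-canary"
# prints: ronaldraygun/zai-proxy:1.2.1
```

The result could then be passed to the command above, for example proxy="$(promote_image ronaldraygun/zai-proxy:1.2.1-canary)".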

Alternative: Edit deployment manifest

# Edit the deployment directly
kubectl edit deployment/zai-proxy -n mcp

# Change: image: ronaldraygun/zai-proxy:1.2.0
# To:    image: ronaldraygun/zai-proxy:1.2.1

For GitOps (ArgoCD):

  1. Update the manifest in Git:
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
  2. Edit zai-proxy.yml:
# Change from:
image: ronaldraygun/zai-proxy:1.2.0

# To:
image: ronaldraygun/zai-proxy:1.2.1
  3. Commit and push:
git add zai-proxy.yml
git commit -m "chore: promote zai-proxy to v1.2.1"
git push origin main
  4. ArgoCD will automatically sync the change

Step 2: Choose Rollout Strategy

Option A: Rolling Update (Default)

Kubernetes performs a rolling update automatically when the image changes:

# Monitor the rollout
kubectl rollout status deployment/zai-proxy -n mcp

# Watch pods being replaced
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w

Characteristics:

  • Gradual replacement of pods
  • Zero downtime (old pods serve traffic until new pods are ready)
  • Manual rollback if the rollout fails (Kubernetes does not undo a failed rollout on its own): kubectl rollout undo deployment/zai-proxy -n mcp
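
One way to get rollback-on-failure behavior is to chain the two commands: wait for the rollout with a deadline, and undo it if the deadline passes. The sketch below is an illustration only. rollout_or_undo is a hypothetical helper, and the 120s default is an assumed value rather than a documented setting for zai-proxy.

```shell
# Sketch: wait for the rollout and undo it if it does not finish in time.
# rollout_or_undo is a hypothetical helper; the 120s default is an assumed value.
rollout_or_undo() {
  deployment=$1
  namespace=$2
  timeout=${3:-120s}
  # "kubectl rollout status" exits non-zero when the timeout expires or the rollout fails.
  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    echo "rollout of $deployment completed"
  else
    echo "rollout of $deployment failed; rolling back" >&2
    kubectl rollout undo "deployment/$deployment" -n "$namespace"
    return 1
  fi
}

# Usage (requires cluster access):
#   rollout_or_undo zai-proxy mcp 180s
```

The function returns non-zero after a rollback, so a calling script can stop before the later finalization steps.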

Option B: Blue-Green Switch

For an instant cutover (requires manual scaling):

# Step 1: Scale production to 0 (cut off traffic)
kubectl scale deployment/zai-proxy -n mcp --replicas=0

# Step 2: Scale canary to production replica count
kubectl scale deployment/zai-proxy-test -n mcp --replicas=1

# Step 3: Update production image (with no traffic)
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:1.2.1 -n mcp

# Step 4: Scale production back up
kubectl scale deployment/zai-proxy -n mcp --replicas=1

Note: This approach is not recommended for zai-proxy as it causes brief downtime.

Step 3: Monitor Production Metrics During Rollout

While the rollout is in progress, monitor these metrics:

Prometheus Queries:

# 1. Check new pods are serving traffic
sum by (variant) (rate(zai_proxy_requests_total{variant="production"}[1m]))

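# 2. Verify error rate is not elevated
# (aggregate each side before dividing; dividing the raw series only
#  matches 5xx series against themselves and always yields 1)
sum by (variant) (rate(zai_proxy_requests_total{variant="production",status=~"5.."}[5m]))
/
sum by (variant) (rate(zai_proxy_requests_total{variant="production"}[5m]))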

# 3. Check latency hasn't regressed
histogram_quantile(0.95,
  sum(rate(zai_proxy_request_duration_seconds_bucket{variant="production"}[5m])) by (le)
)

# 4. Verify token counting is working
sum by (variant) (rate(zai_proxy_tokens_total{variant="production"}[5m]))
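
The error-rate check in query 2 can be turned into a pass/fail gate for scripts. The sketch below assumes the 5xx rate and the total rate have already been read from Prometheus by some other means. error_rate_ok is a hypothetical helper, and the 5% default mirrors the threshold in the rollout checklist.

```shell
# Sketch: compare an error ratio against a percentage threshold.
# error_rate_ok is a hypothetical helper; its arguments are request rates
# (or counts) assumed to have been read from Prometheus beforehand.
error_rate_ok() {
  errors=$1
  total=$2
  max_percent=${3:-5}
  awk -v e="$errors" -v t="$total" -v m="$max_percent" 'BEGIN {
    if (t <= 0) exit 2                  # no traffic: the ratio is undefined
    if ((e / t) * 100 < m) exit 0       # below threshold
    exit 1                              # at or above threshold
  }'
}

# 0.3 errors/s out of 12.5 requests/s is 2.4%, below the 5% threshold.
if error_rate_ok 0.3 12.5 5; then
  echo "error rate within threshold"
fi
```

Exit status 2 is kept separate from 1 so that "no traffic yet" is not mistaken for a healthy or an unhealthy rollout.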

Real-time monitoring:

# Watch pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w

# Stream production logs
kubectl logs -f -n mcp deployment/zai-proxy --tail=100

# Check metrics endpoint directly
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy

Step 4: Verify Workers Successfully Use New Version

After rollout completes, verify workers are using the new version:

1. Check pod versions:

# Get current version label
kubectl get pods -n mcp \
  -l app=zai-proxy,variant=production \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version

# Expected output:
# NAME                             VERSION
# zai-proxy-7d8f9c6b5-x2k4p        1.2.1
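
The visual check above can be automated by piping the same table into a small filter. This is a sketch under the assumption that the output keeps the two-column NAME/VERSION layout shown. all_pods_at_version is a hypothetical helper name.

```shell
# Sketch: check that every pod row reports the expected version label.
# all_pods_at_version is a hypothetical helper. It reads the NAME/VERSION
# table on stdin, skips the header row, and fails on any mismatch.
all_pods_at_version() {
  expected=$1
  awk -v want="$expected" '
    NR == 1 { next }                      # header row: NAME VERSION
    NF == 0 { next }                      # ignore blank lines
    {
      seen++
      # Append "" to force string comparison of the version fields.
      if (($2 "") != (want "")) { print "mismatch: " $1 " is at " $2; bad++ }
    }
    END {
      if (seen == 0) exit 1               # no pods listed at all
      if (bad > 0) exit 1
      exit 0
    }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -n mcp -l app=zai-proxy,variant=production \
#     -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version \
#     | all_pods_at_version 1.2.1
```

A pod without a version label shows up as <none> in the table and is reported as a mismatch.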

2. Verify worker connectivity:

# Confirm workers are generating traffic
# (kubectl logs on a deployment reads from a single pod)
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep "Token usage"

# The logs should show activity from workers
# Expected: Token usage: input=X, output=Y

3. Check worker token counting metrics:

# Read token counters directly from the proxy's /metrics endpoint
# Non-zero production counters show that workers are using the new version
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
  grep zai_proxy_tokens_total | \
  grep 'variant="production"'

4. Verify image tag in deployment:

kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Expected output: ronaldraygun/zai-proxy:1.2.1

Step 5: Update VERSION File

Remove the -canary suffix from the VERSION file in the repository:

cd /home/coder/ardenone-cluster/containers/zai-proxy

# View current version
cat VERSION
# Current: 1.2.1-canary

# Update to stable version
echo "1.2.1" > VERSION

# Verify
cat VERSION
# Should show: 1.2.1
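
The manual edit above can also be scripted with a guard against unexpected contents. The sketch below assumes the VERSION file holds a single X.Y.Z or X.Y.Z-canary string. finalize_version_file is a hypothetical helper name.

```shell
# Sketch: strip "-canary" from a VERSION file in place.
# finalize_version_file is a hypothetical helper. It leaves the file
# untouched unless the result is a plain X.Y.Z version.
finalize_version_file() {
  file=${1:-VERSION}
  current=$(cat "$file") || return 1
  stable=${current%-canary}
  # Accept only three dot-separated numeric fields.
  if ! printf '%s\n' "$stable" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "unexpected version string: $current" >&2
    return 1
  fi
  printf '%s\n' "$stable" > "$file"
  printf '%s\n' "$stable"
}

# Usage, from the directory containing VERSION:
#   finalize_version_file VERSION
```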

Step 6: Tag Git Commit as Release

Create a git tag for the release:

cd /home/coder/ardenone-cluster/containers/zai-proxy

# Ensure all changes are committed
git status
git add VERSION
git commit -m "chore: release zai-proxy v1.2.1"

# Create annotated tag
git tag -a v1.2.1 -m "Release zai-proxy v1.2.1

- [Description of changes in this version]
- Validated via canary testing
- Promoted from canary to production"

# Push the release commit, then the tag
git push origin main
git push origin v1.2.1

# Optionally push all local tags
git push origin --tags

View release tags:

# List all tags
git tag -l

# Show tag details
git show v1.2.1

Rollout Checklist

Use this checklist during the promotion process:

Pre-Promotion

  • Canary deployment is healthy (kubectl get pods -n mcp -l app=zai-proxy,variant=test)
  • No Prometheus alerts firing for canary
  • Functional tests pass on canary endpoint
  • Token counting verified on canary
  • At least one worker tested with canary endpoint
  • Canary image tag is confirmed (e.g., 1.2.1-canary)

Promotion

  • Production deployment image updated to new version
  • Rollout initiated (kubectl set image or manifest update)
  • Rollout status monitored (kubectl rollout status)

During Rollout

  • New pods are becoming Ready
  • Old pods are terminating gracefully
  • Error rate remains below threshold (<5%)
  • P95 latency hasn't regressed significantly
  • Token counting metrics are being exported

Post-Rollout

  • All production pods running new version (kubectl describe deployment)
  • Workers are successfully making requests (check logs/metrics)
  • Token counting is working (check logs for "Token usage")
  • No errors in production logs
  • Metrics show healthy traffic patterns

Finalization

  • VERSION file updated (removed -canary suffix)
  • Git commit created for VERSION change
  • Git tag created for release (e.g., v1.2.1)
  • Tag pushed to remote repository
  • Canary deployment can be scaled down (optional)

Cleanup (Optional)

  • Scale down canary deployment if no longer needed: kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
  • Update canary manifest to next development version

Rollback Procedure

If issues are detected after promotion, use this rollback procedure:

Immediate Rollback (kubectl)

# Rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# Verify rollback completed
kubectl get pods -n mcp -l app=zai-proxy,variant=production

Rollback to Specific Version

# View rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# Rollback to specific revision
kubectl rollout undo deployment/zai-proxy -n mcp --to-revision=2
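
The revision number in the example is a placeholder; the right value comes from the history output. As a sketch, the second-highest revision can be picked out of that output with a small filter. previous_revision is a hypothetical helper, and it assumes the usual layout in which each history row starts with the revision number.

```shell
# Sketch: print the second-highest revision from "kubectl rollout history".
# previous_revision is a hypothetical helper; it reads the history on stdin.
previous_revision() {
  awk '
    $1 ~ /^[0-9]+$/ {
      r = $1 + 0
      if (r > latest) { prev = latest; latest = r }
      else if (r > prev) { prev = r }
      n++
    }
    END {
      if (n < 2) exit 1        # fewer than two revisions: nothing to go back to
      print prev
    }
  '
}

previous_revision <<'EOF'
deployment.apps/zai-proxy
REVISION  CHANGE-CAUSE
1         <none>
2         <none>
3         <none>
EOF
# prints: 2
```

Revision numbers are not always contiguous, because an undo re-registers the restored revision under a new, higher number. Check the history before trusting the second-highest entry after an earlier rollback.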

GitOps Rollback (ArgoCD)

# Revert the commit that changed the image tag
# (use that commit's hash in place of HEAD if later commits exist)
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
git revert HEAD
git push origin main

# ArgoCD will automatically sync the revert

kubectl Commands Reference

Image Management

# Update production image
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:X.Y.Z -n mcp

# Update canary image
kubectl set image deployment/zai-proxy-test proxy=ronaldraygun/zai-proxy:X.Y.Z-canary -n mcp

# View current image
kubectl get deployment/zai-proxy -n mcp -o jsonpath='{.spec.template.spec.containers[0].image}'

Rollout Management

# Check rollout status
kubectl rollout status deployment/zai-proxy -n mcp

# View rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# Pause rollout
kubectl rollout pause deployment/zai-proxy -n mcp

# Resume rollout
kubectl rollout resume deployment/zai-proxy -n mcp

# Undo last rollout
kubectl rollout undo deployment/zai-proxy -n mcp

# Undo to specific revision
kubectl rollout undo deployment/zai-proxy -n mcp --to-revision=2

Pod Management

# List production pods
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# List canary pods
kubectl get pods -n mcp -l app=zai-proxy,variant=test

# Watch pod changes
kubectl get pods -n mcp -l app=zai-proxy -w

# Delete specific pod (forces restart)
kubectl delete pod <pod-name> -n mcp

# Restart all pods in deployment
kubectl rollout restart deployment/zai-proxy -n mcp

Scaling

# Scale production
kubectl scale deployment/zai-proxy -n mcp --replicas=1

# Scale canary
kubectl scale deployment/zai-proxy-test -n mcp --replicas=1

# View replica status
kubectl get deployment/zai-proxy -n mcp

Verification

# Get deployment details
kubectl describe deployment/zai-proxy -n mcp

# View pod logs
kubectl logs -n mcp deployment/zai-proxy --tail=100
kubectl logs -f -n mcp deployment/zai-proxy

# Check pod versions
kubectl get pods -n mcp -l app=zai-proxy,variant=production \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version

# Get pod resource usage
kubectl top pods -n mcp -l app=zai-proxy

Metrics and Health

# Check service endpoints
kubectl get endpoints -n mcp | grep zai-proxy

# Test health endpoint
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health

# Port-forward for local testing
kubectl port-forward -n mcp deployment/zai-proxy 8080:8080
curl http://localhost:8080/metrics
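
During a rollout the health endpoint may fail for a short time, so a single curl can give a misleading result. The sketch below retries a few times before giving up. wait_for_health is a hypothetical helper, and the attempt count and delay defaults are assumed values.

```shell
# Sketch: poll a health URL until it responds successfully or attempts run out.
# wait_for_health is a hypothetical helper; the defaults are assumed values.
wait_for_health() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses (4xx/5xx).
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts: $url" >&2
  return 1
}

# Usage, with the port-forward above running:
#   wait_for_health http://localhost:8080/health 10 3
```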

Troubleshooting

Rollout Stuck

# Check rollout status
kubectl rollout status deployment/zai-proxy -n mcp

# If stuck, pause and inspect
kubectl rollout pause deployment/zai-proxy -n mcp
kubectl describe deployment/zai-proxy -n mcp

# Resume when ready
kubectl rollout resume deployment/zai-proxy -n mcp

Pods Not Ready

# Describe pod to see events
kubectl describe pod <pod-name> -n mcp

# Check pod logs
kubectl logs <pod-name> -n mcp

# Common issues:
# - Image pull errors: Check image name and registry secrets
# - Resource limits: Check pod resource requests/limits
# - Health check failures: Check /health endpoint

Workers Not Connecting

# Check service is accessible
kubectl get svc -n mcp | grep zai-proxy

# Test from devpod
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/health

# Check worker logs
tail -f ~/.beads-workers/*.log