# Canary to Production Promotion Procedure
This document describes the procedure to promote a canary deployment to production after successful testing.
## Overview
The zai-proxy deployment uses a dual-deployment strategy:
- **Production deployment** (`zai-proxy`): Live traffic
- **Canary deployment** (`zai-proxy-test`): Testing new versions
After a canary version has been validated, it can be promoted to production using this procedure.
## Prerequisites
Before promoting canary to production, ensure:
1. **Canary is healthy**: No alerts firing in Prometheus
2. **Testing complete**: Functional tests pass on canary endpoint
3. **Metrics validated**: Token counting, rate limiting, and error rates are acceptable
4. **Workers tested**: At least one worker has successfully used the canary endpoint
5. **Version ready**: Canary image tag is known and available
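
The health prerequisite can be partly checked from the command line. The sketch below is a hypothetical helper, `canary_ready`, that assumes the canary runs as `deployment/zai-proxy-test` in the `mcp` namespace (the names used throughout this document) and compares ready replicas against desired replicas. It does not look at Prometheus alerts, which still need to be checked separately.

```bash
# Hypothetical helper: succeeds only when every desired canary replica is Ready.
# Deployment name and namespace default to the values used in this document.
canary_ready() {
  local deploy ns out ready want
  deploy="${1:-zai-proxy-test}"
  ns="${2:-mcp}"
  # Prints "<ready>/<desired>"; the ready part is empty when no pod is Ready.
  out="$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')" || return 1
  ready="${out%%/*}"
  want="${out##*/}"
  [ -n "$ready" ] && [ "${want:-0}" -gt 0 ] && [ "$ready" = "$want" ]
}
```

A possible use is `canary_ready || echo "canary is not fully Ready; do not promote"`.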
## Architecture Reference
```
┌─────────────────────────────────────┐
│  Canary (zai-proxy-test)            │
│  Image: X.Y.Z-canary                │
└──────────────┬──────────────────────┘
               │ Promote
┌─────────────────────────────────────┐
│  Production (zai-proxy)             │
│  Image: X.Y.Z (no -canary)          │
└─────────────────────────────────────┘
```
## Procedure
### Step 1: Update Production Deployment Image Tag
Set the production deployment's image to the canary version, dropping the `-canary` suffix if present. The stable tag (e.g., `1.2.1`) must already exist in the registry; otherwise the new pods fail to pull the image.
**Example: Promoting 1.2.1-canary to 1.2.1**
```bash
# Using kubectl set image (fastest method)
kubectl set image deployment/zai-proxy \
  proxy=ronaldraygun/zai-proxy:1.2.1 \
  -n mcp
```
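Dropping the suffix by hand is easy to get wrong under time pressure, so it can be scripted. This is a minimal sketch: `stable_tag` and `stable_image` are hypothetical helper names, and the only inputs taken from this document are the `-canary` suffix and the `ronaldraygun/zai-proxy` repository.

```bash
# Hypothetical helper: derive the production tag from a canary tag.
# "1.2.1-canary" -> "1.2.1"; a tag without the suffix is returned unchanged.
stable_tag() {
  printf '%s\n' "${1%-canary}"
}

# Build the image reference that would be passed to `kubectl set image`.
stable_image() {
  printf 'ronaldraygun/zai-proxy:%s\n' "$(stable_tag "$1")"
}
```

For example, `kubectl set image deployment/zai-proxy proxy="$(stable_image 1.2.1-canary)" -n mcp` would then set the production image to `ronaldraygun/zai-proxy:1.2.1`.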
**Alternative: Edit deployment manifest**
```bash
# Edit the deployment directly
kubectl edit deployment/zai-proxy -n mcp
# Change: image: ronaldraygun/zai-proxy:1.2.0
# To: image: ronaldraygun/zai-proxy:1.2.1
```
**For GitOps (ArgoCD):**
1. Update the manifest in Git:
```bash
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
```
2. Edit `zai-proxy.yml`:
```yaml
# Change from:
image: ronaldraygun/zai-proxy:1.2.0
# To:
image: ronaldraygun/zai-proxy:1.2.1
```
3. Commit and push:
```bash
git add zai-proxy.yml
git commit -m "chore: promote zai-proxy to v1.2.1"
git push origin main
```
4. ArgoCD will automatically sync the change
### Step 2: Choose Rollout Strategy
#### Option A: Rolling Update (Default)
Kubernetes performs a rolling update automatically when the image changes:
```bash
# Monitor the rollout
kubectl rollout status deployment/zai-proxy -n mcp
# Watch pods being replaced
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
```
**Characteristics:**
- Gradual replacement of pods
- Zero downtime (old pods serve traffic until new pods are ready)
- Manual rollback if the rollout fails (Kubernetes does not undo it automatically): `kubectl rollout undo deployment/zai-proxy -n mcp`
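
Because a Deployment does not undo a failed rollout by itself, the wait-then-undo sequence can be wrapped in a small function. The sketch below is an assumption-laden example: `rollout_or_undo` is a hypothetical name and the `120s` timeout is an arbitrary default, not a value from this deployment.

```bash
# Hypothetical wrapper: wait for the rollout and undo it if it does not finish in time.
rollout_or_undo() {
  local deploy ns timeout
  deploy="${1:-zai-proxy}"
  ns="${2:-mcp}"
  timeout="${3:-120s}"   # assumed value; tune to the proxy's startup time
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout of $deploy completed"
  else
    echo "rollout of $deploy failed or timed out; rolling back" >&2
    kubectl rollout undo "deployment/$deploy" -n "$ns"
    return 1
  fi
}
```

Calling `rollout_or_undo` right after the image update returns non-zero when a rollback was triggered, so it can gate the later steps of a script.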
#### Option B: Blue-Green Switch
For an instant cutover (requires manual scaling):
```bash
# Step 1: Scale production to 0 (production stops serving traffic)
kubectl scale deployment/zai-proxy -n mcp --replicas=0
# Step 2: Scale canary to production replica count
kubectl scale deployment/zai-proxy-test -n mcp --replicas=1
# Step 3: Update production image (with no traffic)
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:1.2.1 -n mcp
# Step 4: Scale production back up
kubectl scale deployment/zai-proxy -n mcp --replicas=1
```
**Note:** This approach is not recommended for zai-proxy as it causes brief downtime.
### Step 3: Monitor Production Metrics During Rollout
While the rollout is in progress, monitor these metrics:
**Prometheus Queries:**
```promql
# 1. Check new pods are serving traffic
sum by (variant) (rate(zai_proxy_requests_total{variant="production"}[1m]))
# 2. Verify error rate is not elevated (5xx share of all requests)
sum by (variant) (rate(zai_proxy_requests_total{variant="production",status=~"5.."}[5m]))
/
sum by (variant) (rate(zai_proxy_requests_total{variant="production"}[5m]))
# 3. Check latency hasn't regressed
histogram_quantile(0.95,
  sum(rate(zai_proxy_request_duration_seconds_bucket{variant="production"}[5m])) by (le)
)
# 4. Verify token counting is working
sum by (variant) (rate(zai_proxy_tokens_total{variant="production"}[5m]))
```
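The error-rate query can also be evaluated from a script and compared against the 5% threshold used in the rollout checklist. The following is a sketch under stated assumptions: `PROM_URL` (the Prometheus base URL) is not given anywhere in this document, `prom_value` and `error_rate_ok` are hypothetical helper names, and `jq` is assumed to be installed. The query goes through the standard Prometheus HTTP API endpoint `/api/v1/query`.

```bash
# Hypothetical helper: succeed when ratio <= threshold (both plain decimals).
error_rate_ok() {
  awk -v r="$1" -v t="${2:-0.05}" 'BEGIN { exit !(r + 0 <= t + 0) }'
}

# Hypothetical helper: run an instant query against the Prometheus HTTP API
# and print the first sample value ("0" when the result set is empty).
# PROM_URL is an assumption and must be set by the operator.
prom_value() {
  curl -sG "${PROM_URL:?set PROM_URL}/api/v1/query" \
    --data-urlencode "query=$1" | jq -r '.data.result[0].value[1] // "0"'
}
```

With the error-rate expression stored in a shell variable `q`, a check such as `error_rate_ok "$(prom_value "$q")" 0.05 || echo "error rate above 5%"` could run in a loop during the rollout.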
**Real-time monitoring:**
```bash
# Watch pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
# Stream production logs
kubectl logs -f -n mcp deployment/zai-proxy --tail=100
# Check metrics endpoint directly
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy
```
### Step 4: Verify Workers Successfully Use New Version
After rollout completes, verify workers are using the new version:
**1. Check pod versions:**
```bash
# Get current version label
kubectl get pods -n mcp \
  -l app=zai-proxy,variant=production \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version
# Expected output:
# NAME                        VERSION
# zai-proxy-7d8f9c6b5-x2k4p   1.2.1
```
**2. Verify worker connectivity:**
```bash
# Check which pods workers are connecting to
kubectl logs -n mcp deployment/zai-proxy --tail=100 | grep "Token usage"
# The logs should show activity from workers
# Expected: Token usage: input=X, output=Y
```
**3. Check worker token counting metrics:**
```bash
# Read token counters straight from the proxy's metrics endpoint
# Non-zero, increasing values indicate workers are using the new version
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | \
  grep zai_proxy_tokens_total | \
  grep 'variant="production"'
```
**4. Verify image tag in deployment:**
```bash
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# Expected output: ronaldraygun/zai-proxy:1.2.1
```
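The deployment spec shows the intended image, but the running pods are what workers actually reach. As a complementary sketch, the hypothetical helper `pods_on_image` lists the image of every production pod (using the `app=zai-proxy,variant=production` labels from this document) and fails unless all of them match. It assumes the proxy is the first container in each pod.

```bash
# Hypothetical check: every production pod must run exactly the expected image.
pods_on_image() {
  local expected images
  expected="$1"
  # One image per line, de-duplicated; assumes the proxy is the first container.
  images="$(kubectl get pods -n mcp -l app=zai-proxy,variant=production \
    -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}' | sort -u)"
  if [ "$images" = "$expected" ]; then
    echo "all pods on $expected"
  else
    echo "unexpected images: $(printf '%s' "$images" | tr '\n' ' ')" >&2
    return 1
  fi
}
```

Running `pods_on_image ronaldraygun/zai-proxy:1.2.1` while old pods are still terminating is expected to fail, so it is best run after `kubectl rollout status` reports completion.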
### Step 5: Update VERSION File
Remove the `-canary` suffix from the VERSION file in the repository:
```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy
# View current version
cat VERSION
# Current: 1.2.1-canary
# Update to stable version
echo "1.2.1" > VERSION
# Verify
cat VERSION
# Should show: 1.2.1
```
### Step 6: Tag Git Commit as Release
Create a git tag for the release:
```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy
# Ensure all changes are committed
git status
git add VERSION
git commit -m "chore: release zai-proxy v1.2.1"
# Create annotated tag
git tag -a v1.2.1 -m "Release zai-proxy v1.2.1
- [Description of changes in this version]
- Validated via canary testing
- Promoted from canary to production"
# Push the release commit, then the tag
git push origin main
git push origin v1.2.1
# Optionally push all tags
git push origin --tags
```
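A guard before tagging helps avoid releasing a tag while the VERSION file still carries the `-canary` suffix. The sketch below uses a hypothetical helper, `release_tag_from_version`, and assumes release versions are plain `X.Y.Z` with the `v` tag prefix shown above.

```bash
# Hypothetical guard: print the release tag for a VERSION file, or fail if the
# file still carries a suffix such as "-canary" or is not plain X.Y.Z.
release_tag_from_version() {
  local file version
  file="${1:-VERSION}"
  version="$(tr -d '[:space:]' < "$file")" || return 1
  if printf '%s\n' "$version" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
    printf 'v%s\n' "$version"
  else
    echo "VERSION is '$version'; expected X.Y.Z with no suffix" >&2
    return 1
  fi
}
```

One way to use it is `tag="$(release_tag_from_version VERSION)" && git tag -a "$tag" -m "Release zai-proxy $tag"`, which stops before tagging when the file has not been updated.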
**View release tags:**
```bash
# List all tags
git tag -l
# Show tag details
git show v1.2.1
```
## Rollout Checklist
Use this checklist during the promotion process:
### Pre-Promotion
- [ ] Canary deployment is healthy (`kubectl get pods -n mcp -l variant=test`)
- [ ] No Prometheus alerts firing for canary
- [ ] Functional tests pass on canary endpoint
- [ ] Token counting verified on canary
- [ ] At least one worker tested with canary endpoint
- [ ] Canary image tag is confirmed (e.g., `1.2.1-canary`)
### Promotion
- [ ] Production deployment image updated to new version
- [ ] Rollout initiated (`kubectl set image` or manifest update)
- [ ] Rollout status monitored (`kubectl rollout status`)
### During Rollout
- [ ] New pods are becoming Ready
- [ ] Old pods are terminating gracefully
- [ ] Error rate remains below threshold (<5%)
- [ ] P95 latency hasn't regressed significantly
- [ ] Token counting metrics are being exported
### Post-Rollout
- [ ] All production pods running new version (`kubectl describe deployment`)
- [ ] Workers are successfully making requests (check logs/metrics)
- [ ] Token counting is working (check logs for "Token usage")
- [ ] No errors in production logs
- [ ] Metrics show healthy traffic patterns
### Finalization
- [ ] VERSION file updated (removed `-canary` suffix)
- [ ] Git commit created for VERSION change
- [ ] Git tag created for release (e.g., `v1.2.1`)
- [ ] Tag pushed to remote repository
- [ ] Canary deployment can be scaled down (optional)
### Cleanup (Optional)
- [ ] Scale down canary deployment if no longer needed: `kubectl scale deployment/zai-proxy-test -n mcp --replicas=0`
- [ ] Update canary manifest to next development version
## Rollback Procedure
If issues are detected after promotion, use this rollback procedure:
### Immediate Rollback (kubectl)
```bash
# Rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp
# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp
# Verify rollback completed
kubectl get pods -n mcp -l app=zai-proxy,variant=production
```
### Rollback to Specific Version
```bash
# View rollout history
kubectl rollout history deployment/zai-proxy -n mcp
# Rollback to specific revision
kubectl rollout undo deployment/zai-proxy -n mcp --to-revision=2
```
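The revision number in the example above is a placeholder; the right value depends on the rollout history. As a rough sketch, the hypothetical helper `previous_revision` reads `kubectl rollout history` output on stdin and prints the second-highest revision. It assumes the usual table layout in which each data row starts with a numeric revision, and it fails when fewer than two revisions exist.

```bash
# Hypothetical helper: read `kubectl rollout history` output on stdin and print
# the second-highest revision number, i.e. the one before the current rollout.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ {
         r = $1 + 0
         if (r > max) { prev = max; max = r } else if (r > prev) { prev = r }
         n++
       }
       END { if (n < 2) exit 1; print prev }'
}
```

For instance, `kubectl rollout history deployment/zai-proxy -n mcp | previous_revision` prints the revision a plain `kubectl rollout undo` would return to, which is useful to record before rolling back.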
### GitOps Rollback (ArgoCD)
```bash
# Revert the commit that changed the image tag
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
# HEAD is correct only if the promotion was the latest commit; otherwise pass its hash
git revert HEAD
git push origin main
# ArgoCD will automatically sync the revert
```
## kubectl Commands Reference
### Image Management
```bash
# Update production image
kubectl set image deployment/zai-proxy proxy=ronaldraygun/zai-proxy:X.Y.Z -n mcp
# Update canary image
kubectl set image deployment/zai-proxy-test proxy=ronaldraygun/zai-proxy:X.Y.Z-canary -n mcp
# View current image
kubectl get deployment/zai-proxy -n mcp -o jsonpath='{.spec.template.spec.containers[0].image}'
```
### Rollout Management
```bash
# Check rollout status
kubectl rollout status deployment/zai-proxy -n mcp
# View rollout history
kubectl rollout history deployment/zai-proxy -n mcp
# Pause rollout
kubectl rollout pause deployment/zai-proxy -n mcp
# Resume rollout
kubectl rollout resume deployment/zai-proxy -n mcp
# Undo last rollout
kubectl rollout undo deployment/zai-proxy -n mcp
# Undo to specific revision
kubectl rollout undo deployment/zai-proxy -n mcp --to-revision=2
```
### Pod Management
```bash
# List production pods
kubectl get pods -n mcp -l app=zai-proxy,variant=production
# List canary pods
kubectl get pods -n mcp -l app=zai-proxy,variant=test
# Watch pod changes
kubectl get pods -n mcp -l app=zai-proxy -w
# Delete specific pod (forces restart)
kubectl delete pod <pod-name> -n mcp
# Restart all pods in deployment
kubectl rollout restart deployment/zai-proxy -n mcp
```
### Scaling
```bash
# Scale production
kubectl scale deployment/zai-proxy -n mcp --replicas=1
# Scale canary
kubectl scale deployment/zai-proxy-test -n mcp --replicas=1
# View replica status
kubectl get deployment/zai-proxy -n mcp
```
### Verification
```bash
# Get deployment details
kubectl describe deployment/zai-proxy -n mcp
# View pod logs
kubectl logs -n mcp deployment/zai-proxy --tail=100
kubectl logs -f -n mcp deployment/zai-proxy
# Check pod versions
kubectl get pods -n mcp -l app=zai-proxy,variant=production \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version
# Get pod resource usage
kubectl top pods -n mcp -l app=zai-proxy
```
### Metrics and Health
```bash
# Check service endpoints
kubectl get endpoints -n mcp | grep zai-proxy
# Test health endpoint
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health
# Port-forward for local testing (blocks; run curl from a second terminal)
kubectl port-forward -n mcp deployment/zai-proxy 8080:8080
curl http://localhost:8080/metrics
```
## Troubleshooting
### Rollout Stuck
```bash
# Check rollout status
kubectl rollout status deployment/zai-proxy -n mcp
# If stuck, pause and inspect
kubectl rollout pause deployment/zai-proxy -n mcp
kubectl describe deployment/zai-proxy -n mcp
# Resume when ready
kubectl rollout resume deployment/zai-proxy -n mcp
```
### Pods Not Ready
```bash
# Describe pod to see events
kubectl describe pod <pod-name> -n mcp
# Check pod logs
kubectl logs <pod-name> -n mcp
# Common issues:
# - Image pull errors: Check image name and registry secrets
# - Resource limits: Check pod resource requests/limits
# - Health check failures: Check /health endpoint
```
### Workers Not Connecting
```bash
# Check service is accessible
kubectl get svc -n mcp | grep zai-proxy
# Test from devpod
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/health
# Check worker logs
tail -f ~/.beads-workers/*.log
```
## Related Documentation
- [DEPLOYMENT.md](DEPLOYMENT.md) - Worker configuration and dual-deployment workflow
- [TOKEN_COUNTING.md](TOKEN_COUNTING.md) - Token counting implementation and monitoring
- [REGRESSION_TESTING.md](REGRESSION_TESTING.md) - Running regression tests before promotion