# Canary Rollback Procedure

This document describes the procedure to roll back a failed canary deployment or revert a promoted canary from production.

## Overview

The zai-proxy deployment uses a dual-deployment strategy:

- **Production deployment** (`zai-proxy`): Live traffic
- **Canary deployment** (`zai-proxy-test`): Testing new versions

When a canary deployment fails or a promoted version causes issues in production, use this rollback procedure to restore service.

## Architecture Reference

```
┌─────────────────────────────────────┐
│  Canary (zai-proxy-test)            │
│  Image: X.Y.Z-canary                │
└──────────────┬──────────────────────┘
               │ Fails Testing
               ▼
┌─────────────────────────────────────┐
│  Production (zai-proxy)             │
│  Image: X.Y.Z (no -canary)          │
│  UNCHANGED - Still serving          │
└─────────────────────────────────────┘
```

## Prerequisites

Before rolling back, ensure you have:

1. **kubectl access** to the apexalgo-iad cluster
2. **kubeconfig** mounted at `/home/coder/.kube/apexalgo-iad.kubeconfig`
3. **Current deployment status** information
4. **Root cause understanding** (if available)

## Quick Rollback Commands

```bash
# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
kubectl scale deployment/zai-proxy -n mcp --replicas=1
```

---

## Part 1: Canary Deployment Rollback

### Scenario: Canary Testing Reveals Critical Issues

When canary deployment fails testing, keep production unchanged and clean up canary resources.

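Because a canary rollback relies on production staying untouched, it can help to record the production image before cleanup and compare it afterwards. The following is a minimal sketch, not part of the documented procedure: the helper names `deployment_image` and `assert_image_unchanged` and the `KUBECTL` override variable are hypothetical, while the `jsonpath` query is the one used elsewhere in this document.

```bash
# Hypothetical helper: print the image a deployment is configured to run.
# KUBECTL may be overridden (for example to point at a wrapper); defaults to kubectl.
deployment_image() {
  local name="$1"
  local namespace="${2:-mcp}"
  "${KUBECTL:-kubectl}" get "deployment/${name}" -n "${namespace}" \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Hypothetical helper: succeed only if the image recorded before the canary
# cleanup equals the image observed afterwards.
assert_image_unchanged() {
  local before="$1"
  local after="$2"
  if [ "${before}" = "${after}" ]; then
    echo "OK: production image unchanged (${after})"
  else
    echo "WARNING: production image changed: ${before} -> ${after}" >&2
    return 1
  fi
}

# Usage (commented out so that sourcing this file has no side effects):
#   before=$(deployment_image zai-proxy)
#   ... perform the canary cleanup steps ...
#   assert_image_unchanged "${before}" "$(deployment_image zai-proxy)"
```

The comparison is kept separate from the `kubectl` call so it can be exercised without cluster access.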
### Rollback Triggers

**Roll back immediately if ANY of these conditions occur:**

- [ ] Error rate exceeds 10% for more than 2 minutes
- [ ] P95 latency increases by >100% for more than 2 minutes
- [ ] More than 50% of canary pods are NotReady or CrashLoopBackOff
- [ ] Token counting stops working or shows incorrect values
- [ ] Workers report high failure rates or timeouts
- [ ] Security vulnerabilities detected in canary image
- [ ] Data corruption or incorrect behavior observed

### Step 1: Verify Current State

```bash
# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Check canary pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=test

# Check production pod status (should be healthy)
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Get canary deployment details
kubectl describe deployment/zai-proxy-test -n mcp

# Check recent canary logs
kubectl logs -n mcp deployment/zai-proxy-test --tail=100
```

### Step 2: Document Issues Found

Create an incident report or bead to track the issues:

```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy

# Create bead for tracking the rollback
br create "Investigate canary deployment failure - vVERSION" \
  --type bug \
  --priority P0 \
  --description "Canary deployment VERSION failed testing with: [describe symptoms]

Symptoms:
- [List observed issues]

Root Cause (if known):
- [Describe root cause]

Impact:
- Production UNCHANGED and serving normally
- Canary deployment isolated from traffic
" \
  --labels bug,canary,rollback,urgent

# Note the bead ID for blocking future deployments
```

### Step 3: Delete Canary Resources

```bash
# Collect diagnostics first (see Step 7): canary logs and metrics
# cannot be read once the canary has no running pods

# Scale canary deployment to 0
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# Verify canary is scaled down
kubectl get deployment/zai-proxy-test -n mcp

# Optionally delete canary deployment entirely (only if you're sure)
kubectl delete deployment/zai-proxy-test -n mcp
```

**Note:** Keep the canary deployment manifest but scale to 0 if you
want to quickly redeploy after fixing issues.

### Step 4: Revert Code Changes (If Needed)

If the canary failure was due to code issues:

```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy

# View recent commits
git log --oneline -5

# Revert the problematic commit
git revert

# OR reset to previous commit (if not yet pushed)
git reset --hard HEAD~1

# Push the revert
git push origin main
```

### Step 5: Verify Production Unchanged

```bash
# Verify production is still serving
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Check production health
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health

# Verify production metrics
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
```

### Step 6: Clean Up Canary Resources

```bash
# Verify canary is not receiving traffic
kubectl get svc -n mcp | grep zai-proxy

# If using zai-proxy-canary service, ensure workers are using zai-proxy service instead
# Check worker configuration points to production service

# Clean up failed canary image (optional)
# Only delete if you're sure you won't need to debug
docker rmi ronaldraygun/zai-proxy:VERSION-canary
```

### Step 7: Collect Diagnostic Information

```bash
# Export canary logs before deletion
kubectl logs -n mcp deployment/zai-proxy-test --tail=1000 > \
  /tmp/canary-failure-logs-$(date +%Y%m%d-%H%M%S).txt

# Export canary metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics > \
  /tmp/canary-failure-metrics-$(date +%Y%m%d-%H%M%S).txt

# Export deployment state
kubectl get deployment/zai-proxy-test -n mcp -o yaml > \
  /tmp/canary-failure-deployment-$(date +%Y%m%d-%H%M%S).yaml

# Export pod events
kubectl describe pods -n mcp -l app=zai-proxy,variant=test > \
  /tmp/canary-failure-events-$(date +%Y%m%d-%H%M%S).txt
```

---

## Part 2: Production Rollback After Promotion

### Scenario: Promoted Canary Causes Production Issues

When a canary version has been promoted
to production but causes issues, roll back production immediately.

### Step 1: Immediate Rollback (kubectl)

```bash
# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# QUICK ROLLBACK - Undo to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# Watch pods being replaced
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
```

### Step 2: Rollback to Specific Version

```bash
# View rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# Roll back to a specific revision
kubectl rollout undo deployment/zai-proxy -n mcp --to-revision=2

# Verify the rollback
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}' && echo
```

### Step 3: GitOps Rollback (ArgoCD)

If using GitOps with ArgoCD:

```bash
# Navigate to cluster configuration
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp

# View recent commits
git log --oneline -5

# Revert the promotion commit; --no-commit stages the revert
# so it can be committed with a descriptive message
git revert --no-commit HEAD

# Commit and push the revert
git commit -m "fix: rollback zai-proxy to previous stable version"
git push origin main

# ArgoCD will automatically sync the revert
```

### Step 4: Emergency Rollback (If kubectl fails)

```bash
# If standard rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
sleep 5
kubectl scale deployment/zai-proxy -n mcp --replicas=1

# Monitor recovery
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
```

### Step 5: Verify Rollback Complete

```bash
# Check all pods are running
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Verify image version
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check health endpoint
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health

# Verify metrics are being exported
curl -s \
  http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
```

### Step 6: Document the Rollback

```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy

# Create incident report
br create "Production rollback after vVERSION promotion" \
  --type bug \
  --priority P0 \
  --description "Production rollback from VERSION to PREVIOUS_VERSION

Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)

Symptoms:
- [List observed issues in production]

Rollback Action:
- kubectl rollout undo deployment/zai-proxy -n mcp
- Rolled back to revision X

Impact:
- Brief service interruption during rollback
- Production now running on PREVIOUS_VERSION

Next Steps:
- Investigate root cause
- Fix canary issues
- Re-test before re-promotion
" \
  --labels bug,production,rollback,critical
```

---

## Part 3: Troubleshooting Guide

### Common Failure Scenarios

#### Scenario 1: Canary Pods CrashLoopBackOff

**Symptoms:**

- Canary pods in CrashLoopBackOff state
- Production pods healthy
- Can't access canary logs (RBAC blocked)

**Rollback Procedure:**

```bash
# 1. Verify production is healthy
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# 2. Scale down canary
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# 3. Check canary deployment image
kubectl get deployment/zai-proxy-test -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# 4. Verify image exists on Docker Hub
curl -s "https://registry.hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/" | \
  jq '.results[] | select(.name == "VERSION-canary")'

# 5.
# If image issue, rebuild and redeploy
# See: /home/coder/ardenone-cluster/containers/zai-proxy/docs/DEPLOYMENT.md
```

**Prevention:**

- Always test images locally before pushing
- Validate image exists on Docker Hub before deployment
- Check image pull secrets are configured

#### Scenario 2: Canary High Error Rate

**Symptoms:**

- Canary pods running but returning 5xx errors
- Prometheus alert `ZaiProxyCanaryHighErrorRate` firing
- Error rate > 5%

**Rollback Procedure:**

```bash
# 1. Check error rate in Prometheus
# Query: sum(rate(zai_proxy_requests_total{variant="test",status_code=~"5.."}[5m])) / sum(rate(zai_proxy_requests_total{variant="test"}[5m]))

# 2. Check canary logs for errors
kubectl logs -n mcp deployment/zai-proxy-test --tail=100 | grep -i error

# 3. If critical, scale down canary immediately
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# 4. Check if production can handle the load
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# 5. Document the issue
cd /home/coder/ardenone-cluster/containers/zai-proxy
br create "Fix canary high error rate - vVERSION" \
  --type bug --priority P1 --labels bug,canary,errors
```

**Prevention:**

- Run regression tests before deployment
- Monitor canary metrics continuously
- Set up alerts for error rates

#### Scenario 3: Canary Latency Degraded

**Symptoms:**

- Canary p90/p95 latency > 1.5x production
- Prometheus alert `ZaiProxyCanaryLatencyDegraded` firing
- Slow response times on canary endpoint

**Rollback Procedure:**

```bash
# 1. Check latency in Prometheus
# Query: histogram_quantile(0.90, sum(rate(zai_proxy_request_duration_seconds_bucket{variant="test"}[5m])) by (le))

# 2. Check token counting overhead
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | \
  grep zai_proxy_token_count_duration_seconds

# 3. If token counting is slow (>100ms p99), disable it temporarily
kubectl set env deployment/zai-proxy-test -n mcp \
  ENABLE_TOKEN_COUNTING=false

# 4.
# Restart canary to pick up new config
kubectl rollout restart deployment/zai-proxy-test -n mcp

# 5. Monitor recovery
kubectl rollout status deployment/zai-proxy-test -n mcp
```

**Prevention:**

- Profile token counting performance
- Set appropriate timeouts
- Use caching for token counting results

#### Scenario 4: Production Rollout Stuck

**Symptoms:**

- Production rollout not progressing
- New pods not becoming Ready
- Old pods still serving traffic

**Rollback Procedure:**

```bash
# 1. Check rollout status
kubectl rollout status deployment/zai-proxy -n mcp

# 2. If stuck, pause rollout
kubectl rollout pause deployment/zai-proxy -n mcp

# 3. Describe deployment to see issues
kubectl describe deployment/zai-proxy -n mcp

# 4. Describe failing pods
kubectl describe pod -n mcp

# 5. If critical, rollback immediately
kubectl rollout undo deployment/zai-proxy -n mcp

# 6. Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp
```

**Prevention:**

- Use rolling update strategy with appropriate thresholds
- Set resource limits appropriately
- Monitor pod health during rollout

#### Scenario 5: Production Image Crash Loop

**Symptoms:**

- Production pods entering CrashLoopBackOff
- Recent image promotion caused crashes
- Service disruption

**Emergency Rollback:**

```bash
# 1. IMMEDIATE ROLLBACK - Use kubectl
kubectl rollout undo deployment/zai-proxy -n mcp

# 2. If undo fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
sleep 5
kubectl scale deployment/zai-proxy -n mcp --replicas=1

# 3. Verify pods are coming up
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w

# 4. Check ReplicaSets to find working version
kubectl get replicasets -n mcp -l app=zai-proxy,variant=production

# 5. Patch deployment to use working version
kubectl patch deployment zai-proxy -n mcp \
  -p '{"spec":{"template":{"metadata":{"labels":{"version":"WORKING_VERSION"}}}}}'

# 6.
# Set image to working version
kubectl set image deployment/zai-proxy -n mcp \
  proxy=ronaldraygun/zai-proxy:WORKING_VERSION
```

**Prevention:**

- Always test canary thoroughly before promotion
- Use proper health checks
- Monitor crash counts

#### Scenario 6: ArgoCD Sync Delay

**Symptoms:**

- Git revert pushed but ArgoCD not syncing
- Production still running failed version
- Manual intervention needed

**Rollback Procedure:**

```bash
# 1. Force immediate rollback via kubectl (bypass ArgoCD)
kubectl rollout undo deployment/zai-proxy -n mcp

# 2. Check ArgoCD sync status
# In ArgoCD UI: https://argocd./application/zai-proxy

# 3. If sync stuck, manually sync in ArgoCD UI
# Or use argocd CLI:
argocd app sync zai-proxy

# 4. Verify sync completed
argocd app get zai-proxy

# 5. Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp

# 6. Once stable, ArgoCD will reconcile with Git
# The kubectl change may be overwritten, so update Git to match:
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
# Edit zai-proxy.yml to match the rolled-back version
git add zai-proxy.yml
git commit -m "fix: sync git with rolled-back version"
git push origin main
```

**Prevention:**

- Monitor ArgoCD sync status
- Use ArgoCD sync waves if needed
- Have manual rollback ready as backup

#### Scenario 7: Workers Not Connecting After Rollback

**Symptoms:**

- Rollback completed but workers not connecting
- Worker logs showing connection errors
- No metrics from production

**Rollback Procedure:**

```bash
# 1. Check service endpoints
kubectl get endpoints -n mcp | grep zai-proxy

# 2. Test service from devpod
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/health

# 3. Check worker configuration
grep -r "zai-proxy" ~/.beads-workers/*.log

# 4. If workers pointing to canary service, update them
# Workers should use: http://zai-proxy.mcp.svc.cluster.local:8080
# NOT: http://zai-proxy-canary.mcp.svc.cluster.local:8080

# 5.
# Restart affected workers
# Find worker session
tlist

# Kill and restart worker
tkill

# 6. Verify worker connectivity
tail -f ~/.beads-workers/.log
```

**Prevention:**

- Use service discovery correctly
- Document worker configuration
- Test worker connectivity after changes

---

## Part 4: Rollback Verification Checklist

Use this checklist after performing any rollback:

### Canary Rollback Verification

- [ ] Canaries scaled to 0 (kubectl scale)
- [ ] Production pods still healthy
- [ ] Production serving traffic normally
- [ ] No Prometheus alerts for production
- [ ] Incident report/bead created
- [ ] Code changes reverted (if needed)
- [ ] Root cause documented
- [ ] Fix plan created

### Production Rollback Verification

- [ ] Rollback command executed
- [ ] Rollout status shows completion
- [ ] All production pods Ready
- [ ] Pods running previous image version
- [ ] Health endpoint responding
- [ ] Workers connecting successfully
- [ ] Metrics being exported
- [ ] Error rate below threshold
- [ ] Latency back to baseline
- [ ] Incident report/bead created
- [ ] Git revert pushed (if GitOps)
- [ ] ArgoCD synced (if applicable)

---

## Part 5: Rollback Dry-Run Test

### Testing Rollback Procedure

To verify rollback procedures work, perform a dry-run test:

```bash
# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# 1. Save current state
kubectl get deployment/zai-proxy -n mcp -o yaml > /tmp/zai-proxy-before.yml
kubectl get deployment/zai-proxy-test -n mcp -o yaml > /tmp/zai-proxy-test-before.yml

# 2. Check current image
current_image=$(kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current image: $current_image"

# 3. Check rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# 4. Test rollback command (dry-run)
kubectl rollout undo deployment/zai-proxy -n mcp --dry-run=server

# 5.
# Test scaling to 0 (don't actually do it)
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0 --dry-run=server

# 6. Verify you can access logs
kubectl logs -n mcp deployment/zai-proxy --tail=10

# 7. Verify you can access metrics
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | head -20

# 8. Check service endpoints
kubectl get endpoints -n mcp zai-proxy

# 9. Restore state (if needed)
# kubectl apply -f /tmp/zai-proxy-before.yml
```

### Automated Rollback Test Script

```bash
#!/bin/bash
# Test rollback procedures

set -e

NAMESPACE="mcp"
PRODUCTION_DEPLOYMENT="zai-proxy"
CANARY_DEPLOYMENT="zai-proxy-test"

echo "=== Testing Canary Rollback Procedure ==="

# Test 1: Can we scale canary to 0?
echo "Test 1: Scale canary to 0"
kubectl scale deployment/$CANARY_DEPLOYMENT -n $NAMESPACE --replicas=0 --dry-run=server

# Test 2: Can we undo production rollout?
echo "Test 2: Undo production rollout"
kubectl rollout undo deployment/$PRODUCTION_DEPLOYMENT -n $NAMESPACE --dry-run=server

# Test 3: Can we get rollout history?
echo "Test 3: Get rollout history"
kubectl rollout history deployment/$PRODUCTION_DEPLOYMENT -n $NAMESPACE

# Test 4: Can we check pod status?
echo "Test 4: Check pod status"
kubectl get pods -n $NAMESPACE -l app=zai-proxy,variant=production

# Test 5: Can we access logs?
echo "Test 5: Access logs"
kubectl logs -n $NAMESPACE deployment/$PRODUCTION_DEPLOYMENT --tail=10

# Test 6: Can we access metrics?
echo "Test 6: Access metrics"
curl -s http://zai-proxy.$NAMESPACE.svc.cluster.local:8080/metrics | head -5

# Test 7: Can we check health?
echo "Test 7: Check health"
kubectl exec -n $NAMESPACE deployment/$PRODUCTION_DEPLOYMENT -- \
  curl -s http://localhost:8080/health

echo "=== All rollback tests passed ==="
```

---

## Part 6: Post-Rollback Actions

### After Rolling Back Canary

1. **Fix the issues:**
   - Investigate root cause
   - Fix code or configuration
   - Add regression tests

2.
   **Re-test canary:**
   - Deploy fixed version to canary
   - Run functional tests
   - Monitor metrics
   - Validate with worker traffic

3. **Re-promote when ready:**
   - Follow promotion procedure
   - Monitor production metrics
   - Have rollback plan ready

### After Rolling Back Production

1. **Stabilize service:**
   - Verify production is healthy
   - Monitor metrics continuously
   - Check worker connectivity

2. **Investigate failure:**
   - Analyze logs from failed version
   - Identify root cause
   - Document findings

3. **Fix and re-test:**
   - Fix issues in canary
   - Thoroughly test fixes
   - Consider extended canary testing

4. **Re-promote carefully:**
   - Use smaller traffic split initially
   - Monitor continuously
   - Have rollback command ready

---

## Part 7: kubectl Rollback Commands Reference

### Deployment Rollback

```bash
# Undo to previous version
kubectl rollout undo deployment/ -n

# Undo to specific revision
kubectl rollout undo deployment/ -n --to-revision=

# View rollout history
kubectl rollout history deployment/ -n

# Check rollout status
kubectl rollout status deployment/ -n

# Pause rollout
kubectl rollout pause deployment/ -n

# Resume rollout
kubectl rollout resume deployment/ -n

# Restart deployment
kubectl rollout restart deployment/ -n
```

### Scaling Operations

```bash
# Scale deployment to 0
kubectl scale deployment/ -n --replicas=0

# Scale deployment up
kubectl scale deployment/ -n --replicas=

# Scale multiple deployments
kubectl scale deployment/ deployment/ -n --replicas=0
```

### Image Management

```bash
# Set image
kubectl set image deployment/ -n \
  =:

# Get current image
kubectl get deployment/ -n \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Patch deployment with new image
kubectl patch deployment/ -n \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"","image":":"}]}}}}'
```

### Verification Commands

```bash
# Get pods
kubectl get pods -n -l app=

# Watch pod changes
kubectl get pods -n -l app= -w

# Describe deployment
kubectl describe deployment/ -n

#
# Describe pod
kubectl describe pod/ -n

# View logs
kubectl logs -n deployment/ --tail=100

# Stream logs
kubectl logs -f -n deployment/

# Get endpoints
kubectl get endpoints -n | grep

# Test health endpoint
kubectl exec -n deployment/ -- \
  curl -s http://localhost:8080/health
```

---

## Part 8: Rollback Decision Flowchart

```
                ┌─────────────────────┐
                │   Canary Testing    │
                └──────────┬──────────┘
                           │
                ┌──────────▼──────────┐
                │  Issues Detected?   │
                └──────────┬──────────┘
                           │
          ┌────────────────┴────────────────┐
          │ No                              │ Yes
          ▼                                 ▼
┌───────────────────┐            ┌─────────────────────┐
│ Continue Testing  │            │   Critical Issue?   │
└───────────────────┘            └──────────┬──────────┘
                                            │
                                 ┌──────────┴──────────┐
                                 │ Yes                 │ No
                                 ▼                     ▼
                       ┌───────────────────┐ ┌───────────────────┐
                       │ Immediate Rollback│ │ Document & Monitor│
                       └─────────┬─────────┘ └───────────────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 │                               │
                 ▼                               ▼
       ┌───────────────────┐           ┌───────────────────┐
       │ Scale Canary to 0 │           │Collect Diagnostics│
       └─────────┬─────────┘           └─────────┬─────────┘
                 │                               │
                 ▼                               ▼
       ┌───────────────────┐           ┌───────────────────┐
       │ Delete Canary     │◄──────────│ Create Failure    │
       │ Resources         │           │ Report            │
       └─────────┬─────────┘           └───────────────────┘
                 │
                 ▼
       ┌───────────────────┐
       │ Verify Production │
       │ Still Healthy     │
       └─────────┬─────────┘
                 │
                 ▼
       ┌───────────────────┐
       │ Document Lessons  │
       │ Learned           │
       └───────────────────┘
```

---

## Part 9: RBAC Considerations

### Important: Read-Only Access from Devpods

When running rollback procedures from devpods using the `devpod-observer` ServiceAccount:

**Available Operations (Read-Only):**

- `kubectl get pods` - View pod status
- `kubectl get deployments` - View deployment status
- `kubectl get svc` - View service status
- `kubectl rollout history` - View rollout history
- `kubectl logs` - View pod logs
- `kubectl describe` - View resource details

**NOT Available (Requires Write Permissions):**

- `kubectl scale` - Cannot scale deployments
- `kubectl rollout undo` - Cannot rollback deployments
- `kubectl delete` - Cannot delete
  resources
- `kubectl set image` - Cannot update images
- `kubectl patch` - Cannot patch resources
- `kubectl exec` - Cannot execute commands in pods

### Rollback with Read-Only Access

When you only have read-only access (e.g., from devpods), use these alternative approaches:

**Option 1: GitOps Rollback (Recommended for ArgoCD-managed deployments)**

```bash
# Navigate to cluster configuration
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp

# Revert the problematic commit; --no-commit stages the revert
# so it can be committed with a descriptive message
git log --oneline -5
git revert --no-commit HEAD

# Commit and push the revert
git commit -m "fix: rollback zai-proxy to previous stable version"
git push origin main

# ArgoCD will automatically sync the revert
```

**Option 2: Request Rollback via Human Intervention**

```bash
# Create a bead to request rollback
cd /home/coder/ardenone-cluster/containers/zai-proxy

br create "URGENT: Request production rollback for zai-proxy" \
  --type bug \
  --priority P0 \
  --description "CRITICAL: Production rollback requested

Current Issues:
- [Describe symptoms]

Requested Action:
- kubectl rollout undo deployment/zai-proxy -n mcp
- OR: Scale to 0 and back up

Verified via Read-Only:
- Production pods: $(kubectl get pods -n mcp -l app=zai-proxy,variant=production)
- Current image: $(kubectl get deployment/zai-proxy -n mcp -o jsonpath='{.spec.template.spec.containers[0].image}')
- Rollout history available
" \
  --labels critical,rollback,production,human-required
```

**Option 3: Direct Cluster Access (If Available)**

If you have direct kubectl access with admin permissions (not via devpod-observer):

```bash
# Use local kubeconfig or admin credentials
kubectl rollout undo deployment/zai-proxy -n mcp
```

### Verification with Read-Only Access

You CAN verify the cluster state even with read-only access:

```bash
# Check deployment status
kubectl get deployment/zai-proxy -n mcp

# Check pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# View recent logs
kubectl logs -n mcp \
  deployment/zai-proxy --tail=50

# Check rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# View service endpoints
kubectl get endpoints -n mcp zai-proxy
```

---

## Related Documentation

- [CANARY_PROMOTION_PROCEDURE.md](CANARY_PROMOTION_PROCEDURE.md) - Promoting canary to production
- [CANARY_PROMOTION_CHECKLIST.md](CANARY_PROMOTION_CHECKLIST.md) - Promotion checklist
- [DEPLOYMENT.md](DEPLOYMENT.md) - Worker configuration and dual-deployment workflow
- [TOKEN_COUNTING.md](TOKEN_COUNTING.md) - Token counting implementation
- [REGRESSION_TESTING.md](REGRESSION_TESTING.md) - Running regression tests
- [README-traffic-splitting.md](../../cluster-configuration/apexalgo-iad/mcp/README-traffic-splitting.md) - Traffic splitting options

---

## Recovery Timeline

| Action | Time | Notes |
|--------|------|-------|
| Scale canary to 0 | <10s | Immediate stop |
| Delete canary resources | <30s | Full cleanup |
| Verify production healthy | <1min | Confirm no impact |
| Production rollback | <2min | Full rollout undo |
| Collect diagnostics | <5min | For analysis |
| Document failure | <10min | Postmortem |
| **Canary rollback time** | **<15min** | Production unaffected |
| **Production rollback time** | **<5min** | Brief interruption |

**Key Point:** Production is never modified during canary rollback, so downtime is zero.

---

**Document Version:** 2.1.0
**Last Updated:** 2026-02-08
**Maintained By:** Claude Code Workers
**Related Bead:** bd-2s5

---

## Important RBAC Note

**When accessing from devpods via kubectl-proxy:** The `devpod-observer` ServiceAccount has **limited permissions** and cannot perform write operations like `kubectl rollout undo` or `kubectl scale`.

**For devpod access, use GitOps rollback (Option 1) instead of direct kubectl commands.**

**Direct kubectl rollback commands work when:**

- Running from within the apexalgo-iad cluster directly
- Using a ServiceAccount with deployment edit permissions
- The deployment is not managed by ArgoCD (if it is, use Git revert instead, because ArgoCD may overwrite the kubectl change)
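
To find out which of the two paths applies before an incident, the current credentials can be probed with `kubectl auth can-i`. The snippet below is a minimal sketch under stated assumptions: the function name `can_rollback_directly` and the `KUBECTL` override variable are hypothetical, and it assumes that `patch` on deployments is the permission that matters for `kubectl rollout undo`; scaling may additionally depend on access to the scale subresource, which is not checked here.

```bash
# Hypothetical helper: report whether the current credentials appear able to
# modify deployments in the given namespace (default: mcp).
# `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
can_rollback_directly() {
  local namespace="${1:-mcp}"
  local answer
  answer=$("${KUBECTL:-kubectl}" auth can-i patch deployments -n "${namespace}" 2>/dev/null) || true
  if [ "${answer}" = "yes" ]; then
    echo "write access confirmed in ${namespace}: direct kubectl rollback is possible"
  else
    echo "no patch permission on deployments in ${namespace}: use the GitOps rollback"
    return 1
  fi
}

# Usage (commented out so that sourcing this file has no side effects):
#   can_rollback_directly mcp || echo "fall back to Option 1 (Git revert)"
```

Running the check from a devpod with the `devpod-observer` ServiceAccount would be expected to take the second branch, given the restrictions described above.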