# Canary Rollback Procedure

This document describes the procedure to roll back a failed canary deployment or revert a promoted canary from production.

## Overview

The zai-proxy deployment uses a dual-deployment strategy:
- **Production deployment** (`zai-proxy`): Live traffic
- **Canary deployment** (`zai-proxy-test`): Testing new versions

When a canary deployment fails or a promoted version causes issues in production, use this rollback procedure to restore service.

## Architecture Reference

```
┌─────────────────────────────────────┐
│ Canary (zai-proxy-test)             │
│ Image: X.Y.Z-canary                 │
└──────────────┬──────────────────────┘
               │ Fails Testing
               ▼
┌─────────────────────────────────────┐
│ Production (zai-proxy)              │
│ Image: X.Y.Z (no -canary)           │
│ UNCHANGED - Still serving           │
└─────────────────────────────────────┘
```

## Prerequisites

Before rolling back, ensure you have:
1. **kubectl access** to the apexalgo-iad cluster
2. **kubeconfig** mounted at `/home/coder/.kube/apexalgo-iad.kubeconfig`
3. **Current deployment status** information
4. **Root cause understanding** (if available)

## Quick Rollback Commands

```bash
# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
kubectl scale deployment/zai-proxy -n mcp --replicas=1
```

---

## Part 1: Canary Deployment Rollback

### Scenario: Canary Testing Reveals Critical Issues

When a canary deployment fails testing, keep production unchanged and clean up the canary resources.

### Rollback Triggers

**Roll back immediately if ANY of these conditions occur:**

- [ ] Error rate exceeds 10% for more than 2 minutes
- [ ] P95 latency increases by >100% for more than 2 minutes
- [ ] More than 50% of canary pods are NotReady or CrashLoopBackOff
- [ ] Token counting stops working or shows incorrect values
- [ ] Workers report high failure rates or timeouts
- [ ] Security vulnerabilities detected in canary image
- [ ] Data corruption or incorrect behavior observed
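
The three numeric triggers above can be checked mechanically once the values have been read from monitoring. The following is a minimal sketch, not part of the existing tooling: `should_rollback` and its argument order are hypothetical, and the "for more than 2 minutes" condition is assumed to be enforced by whatever samples the values.

```bash
# Hypothetical helper: exit status 0 means "roll back", 1 means "keep testing".
# Arguments (assumptions of this sketch):
#   $1 error rate as a fraction (0.12 = 12%)
#   $2 current canary P95 latency, seconds
#   $3 baseline P95 latency, seconds
#   $4 canary pods that are NotReady or in CrashLoopBackOff
#   $5 total canary pods
should_rollback() {
  awk -v err="$1" -v p95="$2" -v base="$3" -v bad="$4" -v total="$5" 'BEGIN {
    if (err > 0.10) exit 0                       # error rate above 10%
    if (base > 0 && p95 > 2 * base) exit 0       # P95 up by more than 100%
    if (total > 0 && bad / total > 0.5) exit 0   # more than 50% of pods unhealthy
    exit 1
  }'
}
```

For example, `should_rollback 0.12 1.0 1.0 0 2 && echo "roll back"` prints the message because the error rate exceeds 10%. The remaining triggers (token counting, security findings, data corruption) still need human judgment.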

### Step 1: Verify Current State

```bash
# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Check canary pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=test

# Check production pod status (should be healthy)
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Get canary deployment details
kubectl describe deployment/zai-proxy-test -n mcp

# Check recent canary logs
kubectl logs -n mcp deployment/zai-proxy-test --tail=100
```

### Step 2: Document Issues Found

Create an incident report or bead to track the issues:

```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy

# Create bead for tracking the rollback
br create "Investigate canary deployment failure - vVERSION" \
  --type bug \
  --priority P0 \
  --description "Canary deployment VERSION failed testing with: [describe symptoms]

Symptoms:
- [List observed issues]

Root Cause (if known):
- [Describe root cause]

Impact:
- Production UNCHANGED and serving normally
- Canary deployment isolated from traffic
" \
  --labels bug,canary,rollback,urgent

# Note the bead ID for blocking future deployments
```

### Step 3: Delete Canary Resources

```bash
# Scale canary deployment to 0
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# Verify canary is scaled down
kubectl get deployment/zai-proxy-test -n mcp

# Optionally delete canary deployment entirely (only if you're sure)
kubectl delete deployment/zai-proxy-test -n mcp
```

**Note:** To redeploy quickly after fixing the issues, keep the canary deployment and leave it scaled to 0 rather than deleting it. Collect logs and metrics (Step 7) before scaling down or deleting, because `kubectl logs` and `kubectl exec` against the deployment need a running pod.

### Step 4: Revert Code Changes (If Needed)

If the canary failure was due to code issues:

```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy

# View recent commits
git log --oneline -5

# Revert the problematic commit
git revert <commit-hash>

# OR reset to previous commit (if not yet pushed)
git reset --hard HEAD~1

# Push the revert
git push origin main
```

### Step 5: Verify Production Unchanged

```bash
# Verify production is still serving
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Check production health
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health

# Verify production metrics
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
```

### Step 6: Clean Up Canary Resources

```bash
# Verify canary is not receiving traffic
kubectl get svc -n mcp | grep zai-proxy

# If using zai-proxy-canary service, ensure workers are using zai-proxy service instead
# Check worker configuration points to production service

# Clean up failed canary image (optional)
# Only delete if you're sure you won't need to debug
docker rmi ronaldraygun/zai-proxy:VERSION-canary
```

### Step 7: Collect Diagnostic Information

```bash
# Export canary logs before deletion
kubectl logs -n mcp deployment/zai-proxy-test --tail=1000 > \
  /tmp/canary-failure-logs-$(date +%Y%m%d-%H%M%S).txt

# Export canary metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics > \
  /tmp/canary-failure-metrics-$(date +%Y%m%d-%H%M%S).txt

# Export deployment state
kubectl get deployment/zai-proxy-test -n mcp -o yaml > \
  /tmp/canary-failure-deployment-$(date +%Y%m%d-%H%M%S).yaml

# Export pod events
kubectl describe pods -n mcp -l app=zai-proxy,variant=test > \
  /tmp/canary-failure-events-$(date +%Y%m%d-%H%M%S).txt
```
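
Each command in Step 7 evaluates `date` separately, so the four files can end up with slightly different timestamps. One possible variation, sketched here with a hypothetical `diag_path` helper, captures the timestamp once so that all artifacts from a single incident share a suffix:

```bash
# Hypothetical helper: build an artifact path from one shared timestamp.
#   $1 artifact kind, $2 file extension, $3 timestamp, $4 directory (default /tmp)
diag_path() {
  local kind="$1" ext="$2" ts="$3" dir="${4:-/tmp}"
  printf '%s/canary-failure-%s-%s.%s\n' "$dir" "$kind" "$ts" "$ext"
}

# Capture the timestamp once, then reuse it for every artifact.
ts="$(date +%Y%m%d-%H%M%S)"
diag_path logs txt "$ts"
diag_path metrics txt "$ts"
diag_path deployment yaml "$ts"
diag_path events txt "$ts"

# Usage with the Step 7 commands, for example:
#   kubectl logs -n mcp deployment/zai-proxy-test --tail=1000 > "$(diag_path logs txt "$ts")"
```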

---

## Part 2: Production Rollback After Promotion

### Scenario: Promoted Canary Causes Production Issues

When a canary version has been promoted to production but causes issues, roll back production immediately.

### Step 1: Immediate Rollback (kubectl)

```bash
# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# QUICK ROLLBACK - Undo to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# Watch pods being replaced
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
```

### Step 2: Rollback to Specific Version

```bash
# View rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# Rollback to specific revision
kubectl rollout undo deployment/zai-proxy -n mcp --to-revision=2

# Verify the rollback
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}' && echo
```
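
The `--to-revision=2` value above is only an example; the right number comes from the rollout history. A small sketch of picking the revision just before the current one, assuming the usual two-column `REVISION  CHANGE-CAUSE` layout of `kubectl rollout history` output (`previous_revision` is a hypothetical helper):

```bash
# Hypothetical helper: read `kubectl rollout history` output on stdin and
# print the second-highest revision number. Exits 1 if fewer than two exist.
previous_revision() {
  awk '
    $1 ~ /^[0-9]+$/ { revs[n++] = $1 + 0 }   # keep only numeric revision rows
    END {
      if (n < 2) exit 1
      max = -1; second = -1
      for (i = 0; i < n; i++) {
        if (revs[i] > max) { second = max; max = revs[i] }
        else if (revs[i] > second) { second = revs[i] }
      }
      print second
    }'
}
```

Usage would look like `kubectl rollout history deployment/zai-proxy -n mcp | previous_revision`. Inspect the chosen revision with `kubectl rollout history ... --revision=<n>` before undoing to it.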

### Step 3: GitOps Rollback (ArgoCD)

If using GitOps with ArgoCD:

```bash
# Navigate to cluster configuration
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp

# View recent commits
git log --oneline -5

# Revert the promotion commit
# (--no-commit stages the revert so it can be committed with a custom message)
git revert --no-commit HEAD

# Commit and push the revert
git commit -m "fix: rollback zai-proxy to previous stable version"
git push origin main

# ArgoCD will automatically sync the revert
```

### Step 4: Emergency Rollback (If kubectl fails)

```bash
# If standard rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
sleep 5
kubectl scale deployment/zai-proxy -n mcp --replicas=1

# Monitor recovery
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w
```

### Step 5: Verify Rollback Complete

```bash
# Check all pods are running
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Verify image version
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check health endpoint
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health

# Verify metrics are being exported
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
```
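
The "Verify image version" step prints an image reference that still has to be compared by eye. As an illustration only, the tag can be extracted and checked against the `X.Y.Z` / `X.Y.Z-canary` convention from the architecture section. `image_tag` and `is_canary_image` are hypothetical helpers, and the version numbers in the usage note are example values:

```bash
# Hypothetical helper: print the tag of an image reference such as
# ronaldraygun/zai-proxy:X.Y.Z. Digest references (@sha256:...) are not handled.
image_tag() {
  local image="$1"
  local last="${image##*/}"              # drop registry and namespace
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;  # text after the final colon
    *)   printf 'latest\n' ;;            # an untagged reference resolves to "latest"
  esac
}

# Hypothetical helper: succeed when the image tag ends in -canary.
is_canary_image() {
  case "$(image_tag "$1")" in
    *-canary) return 0 ;;
    *)        return 1 ;;
  esac
}
```

After a production rollback, `is_canary_image "$current_image"` should fail, and `image_tag "$current_image"` should print the previous stable version (for instance `1.9.0` if the failed promotion was `1.10.0`).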

### Step 6: Document the Rollback

```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy

# Create incident report
br create "Production rollback after vVERSION promotion" \
  --type bug \
  --priority P0 \
  --description "Production rollback from VERSION to PREVIOUS_VERSION

Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)

Symptoms:
- [List observed issues in production]

Rollback Action:
- kubectl rollout undo deployment/zai-proxy -n mcp
- Rolled back to revision X

Impact:
- Brief service interruption during rollback
- Production now running on PREVIOUS_VERSION

Next Steps:
- Investigate root cause
- Fix canary issues
- Re-test before re-promotion
" \
  --labels bug,production,rollback,critical
```

---

## Part 3: Troubleshooting Guide

### Common Failure Scenarios

#### Scenario 1: Canary Pods CrashLoopBackOff

**Symptoms:**
- Canary pods in CrashLoopBackOff state
- Production pods healthy
- Can't access canary logs (RBAC blocked)

**Rollback Procedure:**

```bash
# 1. Verify production is healthy
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# 2. Scale down canary
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# 3. Check canary deployment image
kubectl get deployment/zai-proxy-test -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# 4. Verify image exists on Docker Hub
curl -s "https://registry.hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/" | \
  jq '.results[] | select(.name == "VERSION-canary")'

# 5. If image issue, rebuild and redeploy
# See: /home/coder/ardenone-cluster/containers/zai-proxy/docs/DEPLOYMENT.md
```

**Prevention:**
- Always test images locally before pushing
- Validate image exists on Docker Hub before deployment
- Check image pull secrets are configured

#### Scenario 2: Canary High Error Rate

**Symptoms:**
- Canary pods running but returning 5xx errors
- Prometheus alert `ZaiProxyCanaryHighErrorRate` firing
- Error rate > 5%

**Rollback Procedure:**

```bash
# 1. Check error rate in Prometheus
# Query: sum(rate(zai_proxy_requests_total{variant="test",status_code=~"5.."}[5m])) / sum(rate(zai_proxy_requests_total{variant="test"}[5m]))

# 2. Check canary logs for errors
kubectl logs -n mcp deployment/zai-proxy-test --tail=100 | grep -i error

# 3. If critical, scale down canary immediately
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# 4. Check if production can handle the load
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# 5. Document the issue
cd /home/coder/ardenone-cluster/containers/zai-proxy
br create "Fix canary high error rate - vVERSION" \
  --type bug --priority P1 --labels bug,canary,errors
```

**Prevention:**
- Run regression tests before deployment
- Monitor canary metrics continuously
- Set up alerts for error rates
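
When Prometheus itself is unreachable, a rough error ratio can be estimated from the raw `/metrics` output. The sketch below is an approximation under stated assumptions: it sums counters since process start rather than computing a 5-minute `rate()`, it assumes each sample line ends with the value (no trailing timestamp), and `error_ratio` is a hypothetical helper. The metric and label names are taken from the query in step 1.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the fraction of zai_proxy_requests_total samples with a 5xx status_code.
error_ratio() {
  awk '
    /^zai_proxy_requests_total[{ ]/ {
      value = $NF + 0                                   # last field is the sample value
      total += value
      if ($0 ~ /status_code="5[0-9][0-9]"/) errors += value
    }
    END {
      if (total == 0) { print "0.0000"; exit }
      printf "%.4f\n", errors / total
    }'
}
```

Usage, from a host that can reach the canary service: `curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | error_ratio`.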

#### Scenario 3: Canary Latency Degraded

**Symptoms:**
- Canary p90/p95 latency > 1.5x production
- Prometheus alert `ZaiProxyCanaryLatencyDegraded` firing
- Slow response times on canary endpoint

**Rollback Procedure:**

```bash
# 1. Check latency in Prometheus
# Query: histogram_quantile(0.90, sum(rate(zai_proxy_request_duration_seconds_bucket{variant="test"}[5m])) by (le))

# 2. Check token counting overhead
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | \
  grep zai_proxy_token_count_duration_seconds

# 3. If token counting is slow (>100ms p99), disable it temporarily
kubectl set env deployment/zai-proxy-test -n mcp \
  ENABLE_TOKEN_COUNTING=false

# 4. Restart canary to pick up new config
kubectl rollout restart deployment/zai-proxy-test -n mcp

# 5. Monitor recovery
kubectl rollout status deployment/zai-proxy-test -n mcp
```

**Prevention:**
- Profile token counting performance
- Set appropriate timeouts
- Use caching for token counting results

#### Scenario 4: Production Rollout Stuck

**Symptoms:**
- Production rollout not progressing
- New pods not becoming Ready
- Old pods still serving traffic

**Rollback Procedure:**

```bash
# 1. Check rollout status
kubectl rollout status deployment/zai-proxy -n mcp

# 2. If stuck, pause rollout
kubectl rollout pause deployment/zai-proxy -n mcp

# 3. Describe deployment to see issues
kubectl describe deployment/zai-proxy -n mcp

# 4. Describe failing pods
kubectl describe pod <pod-name> -n mcp

# 5. If critical, rollback immediately
kubectl rollout undo deployment/zai-proxy -n mcp

# 6. Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp
```

**Prevention:**
- Use rolling update strategy with appropriate thresholds
- Set resource limits appropriately
- Monitor pod health during rollout

#### Scenario 5: Production Image Crash Loop

**Symptoms:**
- Production pods entering CrashLoopBackOff
- Recent image promotion caused crashes
- Service disruption

**Emergency Rollback:**

```bash
# 1. IMMEDIATE ROLLBACK - Use kubectl
kubectl rollout undo deployment/zai-proxy -n mcp

# 2. If undo fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
sleep 5
kubectl scale deployment/zai-proxy -n mcp --replicas=1

# 3. Verify pods are coming up
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w

# 4. Check ReplicaSets to find working version
kubectl get replicasets -n mcp -l app=zai-proxy,variant=production

# 5. Patch deployment to use working version
kubectl patch deployment zai-proxy -n mcp \
  -p '{"spec":{"template":{"metadata":{"labels":{"version":"WORKING_VERSION"}}}}}'

# 6. Set image to working version
kubectl set image deployment/zai-proxy -n mcp \
  proxy=ronaldraygun/zai-proxy:WORKING_VERSION
```

**Prevention:**
- Always test canary thoroughly before promotion
- Use proper health checks
- Monitor crash counts

#### Scenario 6: ArgoCD Sync Delay

**Symptoms:**
- Git revert pushed but ArgoCD not syncing
- Production still running failed version
- Manual intervention needed

**Rollback Procedure:**

```bash
# 1. Force immediate rollback via kubectl (bypass ArgoCD)
kubectl rollout undo deployment/zai-proxy -n mcp

# 2. Check ArgoCD sync status
# In ArgoCD UI: https://argocd.<domain>/application/zai-proxy

# 3. If sync stuck, manually sync in ArgoCD UI
# Or use argocd CLI:
argocd app sync zai-proxy

# 4. Verify sync completed
argocd app get zai-proxy

# 5. Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp

# 6. Once stable, ArgoCD will reconcile with Git
# The kubectl change may be overwritten, so update Git to match:
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
# Edit zai-proxy.yml to match the rolled-back version
git add zai-proxy.yml
git commit -m "fix: sync git with rolled-back version"
git push origin main
```

**Prevention:**
- Monitor ArgoCD sync status
- Use ArgoCD sync waves if needed
- Have manual rollback ready as backup

#### Scenario 7: Workers Not Connecting After Rollback

**Symptoms:**
- Rollback completed but workers not connecting
- Worker logs showing connection errors
- No metrics from production

**Rollback Procedure:**

```bash
# 1. Check service endpoints
kubectl get endpoints -n mcp | grep zai-proxy

# 2. Test service from devpod
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/health

# 3. Check worker configuration
grep -r "zai-proxy" ~/.beads-workers/*.log

# 4. If workers pointing to canary service, update them
# Workers should use: http://zai-proxy.mcp.svc.cluster.local:8080
# NOT: http://zai-proxy-canary.mcp.svc.cluster.local:8080

# 5. Restart affected workers
# Find worker session
tlist

# Kill and restart worker
tkill <session-name>

# 6. Verify worker connectivity
tail -f ~/.beads-workers/<session-name>.log
```

**Prevention:**
- Use service discovery correctly
- Document worker configuration
- Test worker connectivity after changes
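
Step 4 of Scenario 7 amounts to classifying the URL each worker is configured with. A minimal sketch of that check follows. `classify_proxy_url` is a hypothetical helper, and treating `zai-proxy-test` as a canary endpoint alongside `zai-proxy-canary` is an assumption based on the service names used elsewhere in this document.

```bash
# Hypothetical helper: report whether a proxy URL targets production or canary.
classify_proxy_url() {
  case "$1" in
    *zai-proxy-canary.mcp.svc.cluster.local*|*zai-proxy-test.mcp.svc.cluster.local*)
      echo canary ;;
    *zai-proxy.mcp.svc.cluster.local*)
      echo production ;;
    *)
      echo unknown ;;
  esac
}

# Example: flag a worker that still points at the canary service.
url="http://zai-proxy-canary.mcp.svc.cluster.local:8080"
if [ "$(classify_proxy_url "$url")" != "production" ]; then
  echo "worker URL needs updating: $url"
fi
```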

---

## Part 4: Rollback Verification Checklist

Use this checklist after performing any rollback:

### Canary Rollback Verification

- [ ] Canaries scaled to 0 (kubectl scale)
- [ ] Production pods still healthy
- [ ] Production serving traffic normally
- [ ] No Prometheus alerts for production
- [ ] Incident report/bead created
- [ ] Code changes reverted (if needed)
- [ ] Root cause documented
- [ ] Fix plan created

### Production Rollback Verification

- [ ] Rollback command executed
- [ ] Rollout status shows completion
- [ ] All production pods Ready
- [ ] Pods running previous image version
- [ ] Health endpoint responding
- [ ] Workers connecting successfully
- [ ] Metrics being exported
- [ ] Error rate below threshold
- [ ] Latency back to baseline
- [ ] Incident report/bead created
- [ ] Git revert pushed (if GitOps)
- [ ] ArgoCD synced (if applicable)

---

## Part 5: Rollback Dry-Run Test

### Testing Rollback Procedure

To verify rollback procedures work, perform a dry-run test:

```bash
# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# 1. Save current state
kubectl get deployment/zai-proxy -n mcp -o yaml > /tmp/zai-proxy-before.yml
kubectl get deployment/zai-proxy-test -n mcp -o yaml > /tmp/zai-proxy-test-before.yml

# 2. Check current image
current_image=$(kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current image: $current_image"

# 3. Check rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# 4. Test rollback command (dry-run)
kubectl rollout undo deployment/zai-proxy -n mcp --dry-run=server

# 5. Test scaling to 0 (don't actually do it)
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0 --dry-run=server

# 6. Verify you can access logs
kubectl logs -n mcp deployment/zai-proxy --tail=10

# 7. Verify you can access metrics
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | head -20

# 8. Check service endpoints
kubectl get endpoints -n mcp zai-proxy

# 9. Restore state (if needed)
# kubectl apply -f /tmp/zai-proxy-before.yml
```

### Automated Rollback Test Script

```bash
#!/bin/bash
# Test rollback procedures

set -e

NAMESPACE="mcp"
PRODUCTION_DEPLOYMENT="zai-proxy"
CANARY_DEPLOYMENT="zai-proxy-test"

echo "=== Testing Canary Rollback Procedure ==="

# Test 1: Can we scale canary to 0?
echo "Test 1: Scale canary to 0"
kubectl scale deployment/$CANARY_DEPLOYMENT -n $NAMESPACE --replicas=0 --dry-run=server

# Test 2: Can we undo production rollout?
echo "Test 2: Undo production rollout"
kubectl rollout undo deployment/$PRODUCTION_DEPLOYMENT -n $NAMESPACE --dry-run=server

# Test 3: Can we get rollout history?
echo "Test 3: Get rollout history"
kubectl rollout history deployment/$PRODUCTION_DEPLOYMENT -n $NAMESPACE

# Test 4: Can we check pod status?
echo "Test 4: Check pod status"
kubectl get pods -n $NAMESPACE -l app=zai-proxy,variant=production

# Test 5: Can we access logs?
echo "Test 5: Access logs"
kubectl logs -n $NAMESPACE deployment/$PRODUCTION_DEPLOYMENT --tail=10

# Test 6: Can we access metrics?
echo "Test 6: Access metrics"
curl -s http://zai-proxy.$NAMESPACE.svc.cluster.local:8080/metrics | head -5

# Test 7: Can we check health?
echo "Test 7: Check health"
kubectl exec -n $NAMESPACE deployment/$PRODUCTION_DEPLOYMENT -- \
  curl -s http://localhost:8080/health

echo "=== All rollback tests passed ==="
```
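
Because the script above uses `set -e`, it stops at the first failing command, so one run reveals only one problem. An alternative sketch, using hypothetical `run_check` and `summary` helpers, records every result and reports a total at the end:

```bash
# Counters shared by the helpers below.
passed=0
failed=0

# Hypothetical helper: run a command quietly and record PASS or FAIL.
#   $1 is a label; the remaining arguments are the command to run.
run_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    passed=$((passed + 1)); echo "PASS: $name"
  else
    failed=$((failed + 1)); echo "FAIL: $name"
  fi
}

# Hypothetical helper: print totals; non-zero exit status if anything failed.
summary() {
  echo "passed=$passed failed=$failed"
  [ "$failed" -eq 0 ]
}

# Usage with the dry-run commands from the script above, for example:
#   run_check "scale canary" kubectl scale deployment/zai-proxy-test -n mcp --replicas=0 --dry-run=server
#   run_check "undo production" kubectl rollout undo deployment/zai-proxy -n mcp --dry-run=server
#   summary
```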

---

## Part 6: Post-Rollback Actions

### After Rolling Back Canary

1. **Fix the issues:**
   - Investigate root cause
   - Fix code or configuration
   - Add regression tests

2. **Re-test canary:**
   - Deploy fixed version to canary
   - Run functional tests
   - Monitor metrics
   - Validate with worker traffic

3. **Re-promote when ready:**
   - Follow promotion procedure
   - Monitor production metrics
   - Have rollback plan ready

### After Rolling Back Production

1. **Stabilize service:**
   - Verify production is healthy
   - Monitor metrics continuously
   - Check worker connectivity

2. **Investigate failure:**
   - Analyze logs from failed version
   - Identify root cause
   - Document findings

3. **Fix and re-test:**
   - Fix issues in canary
   - Thoroughly test fixes
   - Consider extended canary testing

4. **Re-promote carefully:**
   - Use smaller traffic split initially
   - Monitor continuously
   - Have rollback command ready

---

## Part 7: kubectl Rollback Commands Reference

### Deployment Rollback

```bash
# Undo to previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Undo to specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<n>

# View rollout history
kubectl rollout history deployment/<name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>

# Pause rollout
kubectl rollout pause deployment/<name> -n <namespace>

# Resume rollout
kubectl rollout resume deployment/<name> -n <namespace>

# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>
```

### Scaling Operations

```bash
# Scale deployment to 0
kubectl scale deployment/<name> -n <namespace> --replicas=0

# Scale deployment up
kubectl scale deployment/<name> -n <namespace> --replicas=<n>

# Scale multiple deployments
kubectl scale deployment/<name1> deployment/<name2> -n <namespace> --replicas=0
```

### Image Management

```bash
# Set image
kubectl set image deployment/<name> -n <namespace> \
  <container>=<image>:<tag>

# Get current image
kubectl get deployment/<name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Patch deployment with new image
kubectl patch deployment/<name> -n <namespace> \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","image":"<image>:<tag>"}]}}}}'
```

### Verification Commands

```bash
# Get pods
kubectl get pods -n <namespace> -l app=<app>

# Watch pod changes
kubectl get pods -n <namespace> -l app=<app> -w

# Describe deployment
kubectl describe deployment/<name> -n <namespace>

# Describe pod
kubectl describe pod/<pod-name> -n <namespace>

# View logs
kubectl logs -n <namespace> deployment/<name> --tail=100

# Stream logs
kubectl logs -f -n <namespace> deployment/<name>

# Get endpoints
kubectl get endpoints -n <namespace> | grep <service>

# Test health endpoint
kubectl exec -n <namespace> deployment/<name> -- \
  curl -s http://localhost:8080/health
```

---

## Part 8: Rollback Decision Flowchart

```
                ┌─────────────────────┐
                │   Canary Testing    │
                └──────────┬──────────┘
                           │
                ┌──────────▼──────────┐
                │  Issues Detected?   │
                └──────────┬──────────┘
                           │
          ┌────────────────┴────────────────┐
          │ No                              │ Yes
          ▼                                 ▼
┌───────────────────┐            ┌─────────────────────┐
│ Continue Testing  │            │   Critical Issue?   │
└───────────────────┘            └──────────┬──────────┘
                                            │
                                 ┌──────────┴──────────┐
                                 │ Yes                 │ No
                                 ▼                     ▼
                       ┌───────────────────┐ ┌───────────────────┐
                       │ Immediate Rollback│ │ Document & Monitor│
                       └─────────┬─────────┘ └───────────────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 │                               │
                 ▼                               ▼
       ┌───────────────────┐          ┌─────────────────────┐
       │ Scale Canary to 0 │          │ Collect Diagnostics │
       └─────────┬─────────┘          └──────────┬──────────┘
                 │                               │
                 ▼                               ▼
       ┌───────────────────┐          ┌─────────────────────┐
       │ Delete Canary     │◄─────────│ Create Failure      │
       │ Resources         │          │ Report              │
       └─────────┬─────────┘          └─────────────────────┘
                 │
                 ▼
       ┌───────────────────┐
       │ Verify Production │
       │ Still Healthy     │
       └─────────┬─────────┘
                 │
                 ▼
       ┌───────────────────┐
       │ Document Lessons  │
       │ Learned           │
       └───────────────────┘
```

---

## Part 9: RBAC Considerations

### Important: Read-Only Access from Devpods

When running rollback procedures from devpods using the `devpod-observer` ServiceAccount:

**Available Operations (Read-Only):**
- `kubectl get pods` - View pod status
- `kubectl get deployments` - View deployment status
- `kubectl get svc` - View service status
- `kubectl rollout history` - View rollout history
- `kubectl logs` - View pod logs
- `kubectl describe` - View resource details

**NOT Available (Requires Write Permissions):**
- `kubectl scale` - Cannot scale deployments
- `kubectl rollout undo` - Cannot rollback deployments
- `kubectl delete` - Cannot delete resources
- `kubectl set image` - Cannot update images
- `kubectl patch` - Cannot patch resources
- `kubectl exec` - Cannot execute commands in pods
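
Rather than discovering missing permissions midway through a rollback, the current credentials can be probed up front with `kubectl auth can-i`, which exits 0 when the action is allowed and non-zero otherwise. The sketch below uses hypothetical helper names and assumes that permission to `patch` deployments is a reasonable proxy for being able to run `kubectl rollout undo`:

```bash
# Hypothetical helper: succeed if the current credentials may patch
# deployments in the given namespace (default: mcp).
can_write_deployments() {
  local ns="${1:-mcp}"
  kubectl auth can-i patch deployments -n "$ns" >/dev/null 2>&1
}

# Hypothetical helper: print which rollback path to follow.
rollback_path() {
  if can_write_deployments "$1"; then
    echo kubectl
  else
    echo gitops
  fi
}
```

Running `rollback_path mcp` from a devpod using `devpod-observer` would be expected to print `gitops`, which points to the read-only options in the next section.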

### Rollback with Read-Only Access

When you only have read-only access (e.g., from devpods), use these alternative approaches:

**Option 1: GitOps Rollback (Recommended for ArgoCD-managed deployments)**

```bash
# Navigate to cluster configuration
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp

# Revert the problematic commit
# (--no-commit stages the revert so it can be committed with a custom message)
git log --oneline -5
git revert --no-commit HEAD

# Commit and push the revert
git commit -m "fix: rollback zai-proxy to previous stable version"
git push origin main

# ArgoCD will automatically sync the revert
```

**Option 2: Request Rollback via Human Intervention**

```bash
# Create a bead to request rollback
cd /home/coder/ardenone-cluster/containers/zai-proxy

br create "URGENT: Request production rollback for zai-proxy" \
  --type bug \
  --priority P0 \
  --description "CRITICAL: Production rollback requested

Current Issues:
- [Describe symptoms]

Requested Action:
- kubectl rollout undo deployment/zai-proxy -n mcp
- OR: Scale to 0 and back up

Verified via Read-Only:
- Production pods: $(kubectl get pods -n mcp -l app=zai-proxy,variant=production)
- Current image: $(kubectl get deployment/zai-proxy -n mcp -o jsonpath='{.spec.template.spec.containers[0].image}')
- Rollout history available
" \
  --labels critical,rollback,production,human-required
```

**Option 3: Direct Cluster Access (If Available)**

If you have direct kubectl access with admin permissions (not via devpod-observer):

```bash
# Use local kubeconfig or admin credentials
kubectl rollout undo deployment/zai-proxy -n mcp
```

### Verification with Read-Only Access

You CAN verify the cluster state even with read-only access:

```bash
# Check deployment status
kubectl get deployment/zai-proxy -n mcp

# Check pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# View recent logs
kubectl logs -n mcp deployment/zai-proxy --tail=50

# Check rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# View service endpoints
kubectl get endpoints -n mcp zai-proxy
```

---

## Related Documentation

- [CANARY_PROMOTION_PROCEDURE.md](CANARY_PROMOTION_PROCEDURE.md) - Promoting canary to production
- [CANARY_PROMOTION_CHECKLIST.md](CANARY_PROMOTION_CHECKLIST.md) - Promotion checklist
- [DEPLOYMENT.md](DEPLOYMENT.md) - Worker configuration and dual-deployment workflow
- [TOKEN_COUNTING.md](TOKEN_COUNTING.md) - Token counting implementation
- [REGRESSION_TESTING.md](REGRESSION_TESTING.md) - Running regression tests
- [README-traffic-splitting.md](../../cluster-configuration/apexalgo-iad/mcp/README-traffic-splitting.md) - Traffic splitting options

---

## Recovery Timeline

| Action | Time | Notes |
|--------|------|-------|
| Scale canary to 0 | <10s | Immediate stop |
| Delete canary resources | <30s | Full cleanup |
| Verify production healthy | <1min | Confirm no impact |
| Production rollback | <2min | Full rollout undo |
| Collect diagnostics | <5min | For analysis |
| Document failure | <10min | Postmortem |
| **Canary rollback time** | **<15min** | Production unaffected |
| **Production rollback time** | **<5min** | Brief interruption |

**Key Point:** Production is never modified during canary rollback, so downtime is zero.

---

**Document Version:** 2.1.0
**Last Updated:** 2026-02-08
**Maintained By:** Claude Code Workers
**Related Bead:** bd-2s5

---

## Important RBAC Note

**When accessing from devpods via kubectl-proxy:** The `devpod-observer` ServiceAccount has **limited permissions** and cannot perform write operations like `kubectl rollout undo` or `kubectl scale`.

**For devpod access, use GitOps rollback (Option 1) instead of direct kubectl commands.**

**Direct kubectl rollback commands work when:**
- Running from within the apexalgo-iad cluster directly
- Using a ServiceAccount with deployment edit permissions
- The deployment is not managed by ArgoCD (for ArgoCD-managed deployments, use Git revert instead)