jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00

Canary Rollback Procedure

This document describes the procedure to roll back a failed canary deployment or revert a promoted canary from production.

Overview

The zai-proxy deployment uses a dual-deployment strategy:

Production deployment (zai-proxy): Live traffic
Canary deployment (zai-proxy-test): Testing new versions

When a canary deployment fails or a promoted version causes issues in production, use this rollback procedure to restore service.

Architecture Reference

                    ┌─────────────────────────────────────┐
                    │    Canary (zai-proxy-test)          │
                    │    Image: X.Y.Z-canary              │
                    └──────────────┬──────────────────────┘
                                   │ Fails Testing
                                   ▼
                    ┌─────────────────────────────────────┐
                    │    Production (zai-proxy)           │
                    │    Image: X.Y.Z (no -canary)        │
                    │    UNCHANGED - Still serving        │
                    └─────────────────────────────────────┘

Prerequisites

Before rolling back, ensure you have:

kubectl access to the apexalgo-iad cluster
kubeconfig mounted at /home/coder/.kube/apexalgo-iad.kubeconfig
Current deployment status information
Root cause understanding (if available)

Quick Rollback Commands

# Set kubeconfig for apexalgo-iad cluster
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Quick rollback to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# If rollback fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
kubectl scale deployment/zai-proxy -n mcp --replicas=1

Part 1: Canary Deployment Rollback

Scenario: Canary Testing Reveals Critical Issues

When a canary deployment fails testing, leave production unchanged and clean up the canary resources.

Rollback Triggers

Roll back immediately if ANY of these conditions occur:

Error rate exceeds 10% for more than 2 minutes
P95 latency increases by >100% for more than 2 minutes
More than 50% of canary pods are NotReady or CrashLoopBackOff
Token counting stops working or shows incorrect values
Workers report high failure rates or timeouts
Security vulnerabilities detected in canary image
Data corruption or incorrect behavior observed
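
The numeric triggers above can be encoded as a small shell check. This is a minimal sketch, not part of the zai-proxy tooling: the function name `should_rollback` and its argument order are hypothetical, and the caller is assumed to have already confirmed that the condition held for more than 2 minutes.

```shell
# should_rollback ERR_RATE LATENCY_INCREASE_PCT BAD_PODS TOTAL_PODS
#   ERR_RATE              error rate as a fraction (0.12 means 12%)
#   LATENCY_INCREASE_PCT  P95 latency increase over baseline, in percent
#   BAD_PODS              canary pods that are NotReady or in CrashLoopBackOff
#   TOTAL_PODS            total number of canary pods
# Returns 0 (roll back) if any threshold from the trigger list is exceeded.
should_rollback() {
  awk -v err="$1" -v lat="$2" -v bad="$3" -v total="$4" 'BEGIN {
    if (err > 0.10) exit 0                          # error rate above 10%
    if (lat > 100) exit 0                           # P95 latency up by more than 100%
    if (total > 0 && bad / total > 0.5) exit 0      # more than 50% of pods unhealthy
    exit 1
  }'
}

# Example: 12% errors, no latency change, all pods healthy
#   should_rollback 0.12 0 0 2 && echo "roll back"
```

The non-numeric triggers (token counting, security findings, data corruption) still need a human decision and are deliberately left out of this sketch.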

Step 1: Verify Current State

# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# Check canary pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=test

# Check production pod status (should be healthy)
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Get canary deployment details
kubectl describe deployment/zai-proxy-test -n mcp

# Check recent canary logs
kubectl logs -n mcp deployment/zai-proxy-test --tail=100
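
To turn the pod listing into a number that can be compared against the "more than 50% of canary pods" trigger, the output can be summarized with awk. This is a sketch under the assumption that `kubectl get pods --no-headers` prints its default columns (NAME, READY, STATUS, ...); `count_unready` is an illustrative name.

```shell
# count_unready: reads `kubectl get pods --no-headers` output on stdin and
# prints how many pods are not fully ready (STATUS is not Running, or the
# READY column such as 0/1 shows fewer ready containers than expected).
count_unready() {
  awk '{
    split($2, r, "/")
    if ($3 != "Running" || r[1] != r[2]) n++
  } END { print n + 0 }'
}

# Example use (requires cluster access):
#   kubectl get pods -n mcp -l app=zai-proxy,variant=test --no-headers | count_unready
```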

Step 2: Document Issues Found

Create an incident report or bead to track the issues:

cd /home/coder/ardenone-cluster/containers/zai-proxy

# Create bead for tracking the rollback
br create "Investigate canary deployment failure - vVERSION" \
  --type bug \
  --priority P0 \
  --description "Canary deployment VERSION failed testing with: [describe symptoms]

  Symptoms:
  - [List observed issues]

  Root Cause (if known):
  - [Describe root cause]

  Impact:
  - Production UNCHANGED and serving normally
  - Canary deployment isolated from traffic
  " \
  --labels bug,canary,rollback,urgent

# Note the bead ID for blocking future deployments

Step 3: Delete Canary Resources

# Scale canary deployment to 0
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# Verify canary is scaled down
kubectl get deployment/zai-proxy-test -n mcp

# Optionally delete canary deployment entirely (only if you're sure)
kubectl delete deployment/zai-proxy-test -n mcp

Note: To redeploy quickly after fixing the issues, keep the canary deployment and leave it scaled to 0 instead of deleting it.

Step 4: Revert Code Changes (If Needed)

If the canary failure was due to code issues:

cd /home/coder/ardenone-cluster/containers/zai-proxy

# View recent commits
git log --oneline -5

# Revert the problematic commit
git revert <commit-hash>

# OR reset to previous commit (only if not yet pushed; discards uncommitted changes)
git reset --hard HEAD~1

# Push the revert (not needed after a local reset)
git push origin main

Step 5: Verify Production Unchanged

# Verify production is still serving
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Check production health
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health

# Verify production metrics
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total

Step 6: Clean Up Canary Resources

# Verify canary is not receiving traffic
kubectl get svc -n mcp | grep zai-proxy

# If using zai-proxy-canary service, ensure workers are using zai-proxy service instead
# Check worker configuration points to production service

# Remove the local copy of the failed canary image (optional)
# This does not delete the tag from the registry
# Only delete if you're sure you won't need to debug
docker rmi ronaldraygun/zai-proxy:VERSION-canary

Step 7: Collect Diagnostic Information

# Export canary logs before deletion
# Run the log and metrics exports before Step 3: they need running canary pods
kubectl logs -n mcp deployment/zai-proxy-test --tail=1000 > \
  /tmp/canary-failure-logs-$(date +%Y%m%d-%H%M%S).txt

# Export canary metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics > \
  /tmp/canary-failure-metrics-$(date +%Y%m%d-%H%M%S).txt

# Export deployment state
kubectl get deployment/zai-proxy-test -n mcp -o yaml > \
  /tmp/canary-failure-deployment-$(date +%Y%m%d-%H%M%S).yaml

# Export pod events
kubectl describe pods -n mcp -l app=zai-proxy,variant=test > \
  /tmp/canary-failure-events-$(date +%Y%m%d-%H%M%S).txt
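
Each export above calls `date` on its own, so the four files can end up with slightly different suffixes. A minimal sketch that computes the timestamp once is shown here; `STAMP` and `diag_path` are illustrative names, not part of the original procedure.

```shell
# Compute the timestamp once so every diagnostic file shares the same suffix.
# An existing STAMP value is kept, which makes reruns write to the same files.
STAMP="${STAMP:-$(date +%Y%m%d-%H%M%S)}"

# diag_path KIND EXT -> /tmp/canary-failure-KIND-STAMP.EXT
diag_path() {
  printf '/tmp/canary-failure-%s-%s.%s\n' "$1" "$STAMP" "$2"
}

# Example use (requires cluster access):
#   kubectl logs -n mcp deployment/zai-proxy-test --tail=1000 > "$(diag_path logs txt)"
#   kubectl get deployment/zai-proxy-test -n mcp -o yaml > "$(diag_path deployment yaml)"
```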

Part 2: Production Rollback After Promotion

Scenario: Promoted Canary Causes Production Issues

When a canary version has been promoted to production but causes issues, roll back production immediately.

Step 1: Immediate Rollback (kubectl)

# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# QUICK ROLLBACK - Undo to previous version
kubectl rollout undo deployment/zai-proxy -n mcp

# Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

# Watch pods being replaced
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w

Step 2: Rollback to Specific Version

# View rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# Rollback to specific revision
kubectl rollout undo deployment/zai-proxy -n mcp --to-revision=2

# Verify the rollback
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}' && echo
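
The revision number `2` above is only an example. One way to pick the revision just before the current one is to parse the history listing, as in this sketch. It assumes the usual `kubectl rollout history` layout (a header followed by lines that start with the revision number); `previous_revision` is a hypothetical helper name.

```shell
# previous_revision: reads `kubectl rollout history` output on stdin and prints
# the second-highest revision number. If only one revision exists, that single
# revision is printed, so check the result before using it.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -n | tail -n 2 | head -n 1
}

# Example use (requires cluster access):
#   rev=$(kubectl rollout history deployment/zai-proxy -n mcp | previous_revision)
#   kubectl rollout undo deployment/zai-proxy -n mcp --to-revision="$rev"
```

Keep in mind that an undo re-applies the old template as a new, highest revision, so the numbers shift after each rollback; re-read the history rather than reusing a remembered number.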

Step 3: GitOps Rollback (ArgoCD)

If using GitOps with ArgoCD:

# Navigate to cluster configuration
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp

# View recent commits
git log --oneline -5

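# Revert the promotion commit (assumes it is the most recent commit; otherwise pass its hash)
# --no-commit stages the revert so it can be committed with the message below
git revert --no-commit HEAD

# Commit and push the revert
git commit -m "fix: rollback zai-proxy to previous stable version"
git push origin main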

# ArgoCD will automatically sync the revert

Step 4: Emergency Rollback (If kubectl fails)

# If standard rollback fails, scale to 0 and back up
# This restarts the pods with the current pod template; it does not change the image
kubectl scale deployment/zai-proxy -n mcp --replicas=0
sleep 5
kubectl scale deployment/zai-proxy -n mcp --replicas=1

# Monitor recovery
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w

Step 5: Verify Rollback Complete

# Check all pods are running
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# Verify image version
kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check health endpoint
kubectl exec -n mcp deployment/zai-proxy -- curl -s http://localhost:8080/health

# Verify metrics are being exported
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | grep zai_proxy_requests_total
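
The image check can be made explicit. Per the architecture reference, production images carry no `-canary` suffix, so a quick sanity test on the image string is possible. This is a sketch; `is_production_image` is an illustrative name, and the pattern does not handle registries with a port and no tag.

```shell
# is_production_image IMAGE
# Returns 0 if IMAGE has a tag that does not end in -canary.
is_production_image() {
  case "$1" in
    *:*-canary) return 1 ;;   # canary tag, should not be in production
    *:*)        return 0 ;;   # tagged, non-canary image
    *)          return 1 ;;   # no tag at all: treat as unexpected
  esac
}

# Example use (requires cluster access):
#   image=$(kubectl get deployment/zai-proxy -n mcp \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   is_production_image "$image" || echo "unexpected image: $image"
```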

Step 6: Document the Rollback

cd /home/coder/ardenone-cluster/containers/zai-proxy

# Create incident report
br create "Production rollback after vVERSION promotion" \
  --type bug \
  --priority P0 \
  --description "Production rollback from VERSION to PREVIOUS_VERSION

  Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)

  Symptoms:
  - [List observed issues in production]

  Rollback Action:
  - kubectl rollout undo deployment/zai-proxy -n mcp
  - Rolled back to revision X

  Impact:
  - Brief service interruption during rollback
  - Production now running on PREVIOUS_VERSION

  Next Steps:
  - Investigate root cause
  - Fix canary issues
  - Re-test before re-promotion
  " \
  --labels bug,production,rollback,critical

Part 3: Troubleshooting Guide

Common Failure Scenarios

Scenario 1: Canary Pods CrashLoopBackOff

Symptoms:

Canary pods in CrashLoopBackOff state
Production pods healthy
Can't access canary logs (RBAC blocked)

Rollback Procedure:

# 1. Verify production is healthy
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# 2. Scale down canary
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# 3. Check canary deployment image
kubectl get deployment/zai-proxy-test -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# 4. Verify image exists on Docker Hub
curl -s "https://registry.hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/" | \
  jq '.results[] | select(.name == "VERSION-canary")'

# 5. If image issue, rebuild and redeploy
# See: /home/coder/ardenone-cluster/containers/zai-proxy/docs/DEPLOYMENT.md

Prevention:

Always test images locally before pushing
Validate image exists on Docker Hub before deployment
Check image pull secrets are configured

Scenario 2: Canary High Error Rate

Symptoms:

Canary pods running but returning 5xx errors
Prometheus alert ZaiProxyCanaryHighErrorRate firing
Error rate > 5%

Rollback Procedure:

# 1. Check error rate in Prometheus
# Query: sum(rate(zai_proxy_requests_total{variant="test",status_code=~"5.."}[5m])) / sum(rate(zai_proxy_requests_total{variant="test"}[5m]))

# 2. Check canary logs for errors
kubectl logs -n mcp deployment/zai-proxy-test --tail=100 | grep -i error

# 3. If critical, scale down canary immediately
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0

# 4. Check if production can handle the load
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# 5. Document the issue
cd /home/coder/ardenone-cluster/containers/zai-proxy
br create "Fix canary high error rate - vVERSION" \
  --type bug --priority P1 --labels bug,canary,errors
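
When Prometheus is not reachable, a rough error ratio can be derived from a raw `/metrics` scrape. The sketch below assumes the exposition lines look like `zai_proxy_requests_total{...,status_code="500",...} <value>` with no trailing timestamp, which matches the labels used in the query above but is otherwise an assumption; `error_ratio` is a hypothetical helper. Counters are cumulative, so this is a lifetime ratio, not the 5-minute rate the Prometheus query computes.

```shell
# error_ratio: reads Prometheus text exposition on stdin and prints the share
# of zai_proxy_requests_total samples whose status_code label is 5xx.
error_ratio() {
  awk '
    index($0, "zai_proxy_requests_total{") == 1 {
      total += $NF
      if ($0 ~ /status_code="5[0-9][0-9]"/) errors += $NF
    }
    END {
      if (total > 0) printf "%.4f\n", errors / total
      else print "0.0000"
    }
  '
}

# Example use (from a host that can reach the canary service):
#   curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | error_ratio
```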

Prevention:

Run regression tests before deployment
Monitor canary metrics continuously
Set up alerts for error rates

Scenario 3: Canary Latency Degraded

Symptoms:

Canary p90/p95 latency > 1.5x production
Prometheus alert ZaiProxyCanaryLatencyDegraded firing
Slow response times on canary endpoint

Rollback Procedure:

# 1. Check latency in Prometheus
# Query: histogram_quantile(0.90, sum(rate(zai_proxy_request_duration_seconds_bucket{variant="test"}[5m])) by (le))

# 2. Check token counting overhead
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/metrics | \
  grep zai_proxy_token_count_duration_seconds

# 3. If token counting is slow (>100ms p99), disable it temporarily
kubectl set env deployment/zai-proxy-test -n mcp \
  ENABLE_TOKEN_COUNTING=false

# 4. Restart canary to pick up new config
kubectl rollout restart deployment/zai-proxy-test -n mcp

# 5. Monitor recovery
kubectl rollout status deployment/zai-proxy-test -n mcp
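
The "latency > 1.5x production" symptom can be checked with a one-line comparison once both percentile values are known, for example from the Prometheus query above run for each variant. This is a sketch; `latency_degraded` is an illustrative name and both arguments are assumed to be in the same unit.

```shell
# latency_degraded CANARY_LATENCY PRODUCTION_LATENCY
# Returns 0 if the canary latency is more than 1.5x the production latency.
latency_degraded() {
  awk -v c="$1" -v p="$2" 'BEGIN { exit !(p > 0 && c > 1.5 * p) }'
}

# Example: canary p90 of 0.8s against production p90 of 0.5s
#   latency_degraded 0.8 0.5 && echo "canary latency degraded"
```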

Prevention:

Profile token counting performance
Set appropriate timeouts
Use caching for token counting results

Scenario 4: Production Rollout Stuck

Symptoms:

Production rollout not progressing
New pods not becoming Ready
Old pods still serving traffic

Rollback Procedure:

# 1. Check rollout status
kubectl rollout status deployment/zai-proxy -n mcp

# 2. If stuck, pause rollout
kubectl rollout pause deployment/zai-proxy -n mcp

# 3. Describe deployment to see issues
kubectl describe deployment/zai-proxy -n mcp

# 4. Describe failing pods
kubectl describe pod <pod-name> -n mcp

# 5. If critical, rollback immediately
kubectl rollout undo deployment/zai-proxy -n mcp

# 6. Monitor rollback
kubectl rollout status deployment/zai-proxy -n mcp

Prevention:

Use rolling update strategy with appropriate thresholds
Set resource limits appropriately
Monitor pod health during rollout

Scenario 5: Production Image Crash Loop

Symptoms:

Production pods entering CrashLoopBackOff
Recent image promotion caused crashes
Service disruption

Emergency Rollback:

# 1. IMMEDIATE ROLLBACK - Use kubectl
kubectl rollout undo deployment/zai-proxy -n mcp

# 2. If undo fails, scale to 0 and back up
kubectl scale deployment/zai-proxy -n mcp --replicas=0
sleep 5
kubectl scale deployment/zai-proxy -n mcp --replicas=1

# 3. Verify pods are coming up
kubectl get pods -n mcp -l app=zai-proxy,variant=production -w

# 4. Check ReplicaSets to find working version
kubectl get replicasets -n mcp -l app=zai-proxy,variant=production

# 5. Patch deployment to use working version
kubectl patch deployment zai-proxy -n mcp \
  -p '{"spec":{"template":{"metadata":{"labels":{"version":"WORKING_VERSION"}}}}}'

# 6. Set image to working version
kubectl set image deployment/zai-proxy -n mcp \
  proxy=ronaldraygun/zai-proxy:WORKING_VERSION

Prevention:

Always test canary thoroughly before promotion
Use proper health checks
Monitor crash counts

Scenario 6: ArgoCD Sync Delay

Symptoms:

Git revert pushed but ArgoCD not syncing
Production still running failed version
Manual intervention needed

Rollback Procedure:

# 1. Force immediate rollback via kubectl (bypass ArgoCD)
kubectl rollout undo deployment/zai-proxy -n mcp

# 2. Check ArgoCD sync status
# In ArgoCD UI: https://argocd.<domain>/application/zai-proxy

# 3. If sync stuck, manually sync in ArgoCD UI
# Or use argocd CLI:
argocd app sync zai-proxy

# 4. Verify sync completed
argocd app get zai-proxy

# 5. Monitor rollout
kubectl rollout status deployment/zai-proxy -n mcp

# 6. Once stable, ArgoCD will reconcile with Git
# The kubectl change may be overwritten, so update Git to match:
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp
# Edit zai-proxy.yml to match the rolled-back version
git add zai-proxy.yml
git commit -m "fix: sync git with rolled-back version"
git push origin main

Prevention:

Monitor ArgoCD sync status
Use ArgoCD sync waves if needed
Have manual rollback ready as backup

Scenario 7: Workers Not Connecting After Rollback

Symptoms:

Rollback completed but workers not connecting
Worker logs showing connection errors
No metrics from production

Rollback Procedure:

# 1. Check service endpoints
kubectl get endpoints -n mcp | grep zai-proxy

# 2. Test service from devpod
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/health

# 3. Check worker configuration
grep -r "zai-proxy" ~/.beads-workers/*.log

# 4. If workers pointing to canary service, update them
# Workers should use: http://zai-proxy.mcp.svc.cluster.local:8080
# NOT: http://zai-proxy-canary.mcp.svc.cluster.local:8080

# 5. Restart affected workers
# Find worker session
tlist

# Kill and restart worker
tkill <session-name>

# 6. Verify worker connectivity
tail -f ~/.beads-workers/<session-name>.log

Prevention:

Use service discovery correctly
Document worker configuration
Test worker connectivity after changes

Part 4: Rollback Verification Checklist

Use this checklist after performing any rollback:

Canary Rollback Verification

Canary scaled to 0 (kubectl scale)
Production pods still healthy
Production serving traffic normally
No Prometheus alerts for production
Incident report/bead created
Code changes reverted (if needed)
Root cause documented
Fix plan created

Production Rollback Verification

Rollback command executed
Rollout status shows completion
All production pods Ready
Pods running previous image version
Health endpoint responding
Workers connecting successfully
Metrics being exported
Error rate below threshold
Latency back to baseline
Incident report/bead created
Git revert pushed (if GitOps)
ArgoCD synced (if applicable)

Part 5: Rollback Dry-Run Test

Testing Rollback Procedure

To verify rollback procedures work, perform a dry-run test:

# Set kubeconfig
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# 1. Save current state
kubectl get deployment/zai-proxy -n mcp -o yaml > /tmp/zai-proxy-before.yml
kubectl get deployment/zai-proxy-test -n mcp -o yaml > /tmp/zai-proxy-test-before.yml

# 2. Check current image
current_image=$(kubectl get deployment/zai-proxy -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current image: $current_image"

# 3. Check rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# 4. Test rollback command (dry-run)
kubectl rollout undo deployment/zai-proxy -n mcp --dry-run=server

# 5. Test scaling to 0 (don't actually do it)
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0 --dry-run=server

# 6. Verify you can access logs
kubectl logs -n mcp deployment/zai-proxy --tail=10

# 7. Verify you can access metrics
curl -s http://zai-proxy.mcp.svc.cluster.local:8080/metrics | head -20

# 8. Check service endpoints
kubectl get endpoints -n mcp zai-proxy

# 9. Restore state (if needed)
# kubectl apply -f /tmp/zai-proxy-before.yml

Automated Rollback Test Script

#!/bin/bash
# Test rollback procedures

set -e

NAMESPACE="mcp"
PRODUCTION_DEPLOYMENT="zai-proxy"
CANARY_DEPLOYMENT="zai-proxy-test"

echo "=== Testing Canary Rollback Procedure ==="

# Test 1: Can we scale canary to 0?
echo "Test 1: Scale canary to 0"
kubectl scale deployment/$CANARY_DEPLOYMENT -n $NAMESPACE --replicas=0 --dry-run=server

# Test 2: Can we undo production rollout?
echo "Test 2: Undo production rollout"
kubectl rollout undo deployment/$PRODUCTION_DEPLOYMENT -n $NAMESPACE --dry-run=server

# Test 3: Can we get rollout history?
echo "Test 3: Get rollout history"
kubectl rollout history deployment/$PRODUCTION_DEPLOYMENT -n $NAMESPACE

# Test 4: Can we check pod status?
echo "Test 4: Check pod status"
kubectl get pods -n $NAMESPACE -l app=zai-proxy,variant=production

# Test 5: Can we access logs?
echo "Test 5: Access logs"
kubectl logs -n $NAMESPACE deployment/$PRODUCTION_DEPLOYMENT --tail=10

# Test 6: Can we access metrics?
echo "Test 6: Access metrics"
curl -s http://zai-proxy.$NAMESPACE.svc.cluster.local:8080/metrics | head -5

# Test 7: Can we check health?
echo "Test 7: Check health"
kubectl exec -n $NAMESPACE deployment/$PRODUCTION_DEPLOYMENT -- \
  curl -s http://localhost:8080/health

echo "=== All rollback tests passed ==="
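
Because the script above uses `set -e`, it stops at the first failing command and never reports the remaining tests. An alternative sketch that runs every check and tallies the failures is shown here; `run_check` and `failures` are hypothetical names, and the example commands mirror the tests above.

```shell
failures=0

# run_check NAME COMMAND [ARGS...]
# Runs COMMAND quietly, prints PASS or FAIL with NAME, and counts failures.
run_check() {
  _name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $_name"
  else
    echo "FAIL: $_name"
    failures=$((failures + 1))
  fi
}

# Example use (requires cluster access):
#   run_check "rollout history" kubectl rollout history deployment/zai-proxy -n mcp
#   run_check "pod status" kubectl get pods -n mcp -l app=zai-proxy,variant=production
#   [ "$failures" -eq 0 ] || exit 1
```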

Part 6: Post-Rollback Actions

After Rolling Back Canary

Fix the issues:
- Investigate root cause
- Fix code or configuration
- Add regression tests
Re-test canary:
- Deploy fixed version to canary
- Run functional tests
- Monitor metrics
- Validate with worker traffic
Re-promote when ready:
- Follow promotion procedure
- Monitor production metrics
- Have rollback plan ready

After Rolling Back Production

Stabilize service:
- Verify production is healthy
- Monitor metrics continuously
- Check worker connectivity
Investigate failure:
- Analyze logs from failed version
- Identify root cause
- Document findings
Fix and re-test:
- Fix issues in canary
- Thoroughly test fixes
- Consider extended canary testing
Re-promote carefully:
- Use smaller traffic split initially
- Monitor continuously
- Have rollback command ready

Part 7: kubectl Rollback Commands Reference

Deployment Rollback

# Undo to previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Undo to specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<n>

# View rollout history
kubectl rollout history deployment/<name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>

# Pause rollout
kubectl rollout pause deployment/<name> -n <namespace>

# Resume rollout
kubectl rollout resume deployment/<name> -n <namespace>

# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

Scaling Operations

# Scale deployment to 0
kubectl scale deployment/<name> -n <namespace> --replicas=0

# Scale deployment up
kubectl scale deployment/<name> -n <namespace> --replicas=<n>

# Scale multiple deployments
kubectl scale deployment/<name1> deployment/<name2> -n <namespace> --replicas=0

Image Management

# Set image
kubectl set image deployment/<name> -n <namespace> \
  <container>=<image>:<tag>

# Get current image
kubectl get deployment/<name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Patch deployment with new image
kubectl patch deployment/<name> -n <namespace> \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","image":"<image>:<tag>"}]}}}}'

Verification Commands

# Get pods
kubectl get pods -n <namespace> -l app=<app>

# Watch pod changes
kubectl get pods -n <namespace> -l app=<app> -w

# Describe deployment
kubectl describe deployment/<name> -n <namespace>

# Describe pod
kubectl describe pod/<pod-name> -n <namespace>

# View logs
kubectl logs -n <namespace> deployment/<name> --tail=100

# Stream logs
kubectl logs -f -n <namespace> deployment/<name>

# Get endpoints
kubectl get endpoints -n <namespace> | grep <service>

# Test health endpoint
kubectl exec -n <namespace> deployment/<name> -- \
  curl -s http://localhost:8080/health

Part 8: Rollback Decision Flowchart

                    ┌─────────────────────┐
                    │  Canary Testing     │
                    └──────────┬──────────┘
                               │
                    ┌──────────▼──────────┐
                    │  Issues Detected?   │
                    └──────────┬──────────┘
                               │
              ┌────────────────┴────────────────┐
              │ No                              │ Yes
              ▼                                 ▼
    ┌───────────────────┐          ┌─────────────────────┐
    │ Continue Testing  │          │ Critical Issue?     │
    └───────────────────┘          └──────────┬──────────┘
                                             │
                                   ┌─────────┴─────────┐
                                   │ Yes                │ No
                                   ▼                    ▼
                    ┌───────────────────┐  ┌───────────────────┐
                    │ Immediate Rollback│  │ Document & Monitor│
                    └─────────┬─────────┘  └───────────────────┘
                              │
              ┌───────────────┴───────────────┐
              │                               │
              ▼                               ▼
    ┌───────────────────┐          ┌───────────────────┐
    │ Scale Canary to 0 │          │ Collect Diagnostics│
    └─────────┬─────────┘          └─────────┬─────────┘
              │                               │
              ▼                               ▼
    ┌───────────────────┐          ┌───────────────────┐
    │ Delete Canary     │◄─────────│ Create Failure    │
    │ Resources         │          │ Report            │
    └─────────┬─────────┘          └───────────────────┘
              │
              ▼
    ┌───────────────────┐
    │ Verify Production │
    │ Still Healthy     │
    └─────────┬─────────┘
              │
              ▼
    ┌───────────────────┐
    │ Document Lessons  │
    │ Learned           │
    └───────────────────┘

Part 9: RBAC Considerations

Important: Read-Only Access from Devpods

When running rollback procedures from devpods using the devpod-observer ServiceAccount:

Available Operations (Read-Only):

kubectl get pods - View pod status
kubectl get deployments - View deployment status
kubectl get svc - View service status
kubectl rollout history - View rollout history
kubectl logs - View pod logs
kubectl describe - View resource details

NOT Available (Requires Write Permissions):

kubectl scale - Cannot scale deployments
kubectl rollout undo - Cannot rollback deployments
kubectl delete - Cannot delete resources
kubectl set image - Cannot update images
kubectl patch - Cannot patch resources
kubectl exec - Cannot execute commands in pods
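
Before starting a rollback, it can help to confirm which side of this split the current credentials fall on. `kubectl auth can-i` is a standard kubectl subcommand that exits 0 when the action is allowed. The wrapper below is a sketch: `can_roll_back` is a hypothetical name, and checking only `patch` on Deployments is a rough proxy for the write operations listed above.

```shell
# can_roll_back NAMESPACE
# Returns 0 if the current identity may patch Deployments in NAMESPACE.
can_roll_back() {
  kubectl auth can-i patch deployments -n "$1" >/dev/null 2>&1
}

# Example use:
#   if can_roll_back mcp; then
#     kubectl rollout undo deployment/zai-proxy -n mcp
#   else
#     echo "read-only access: use the GitOps rollback instead"
#   fi
```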

Rollback with Read-Only Access

When you only have read-only access (e.g., from devpods), use these alternative approaches:

Option 1: GitOps Rollback (Recommended for ArgoCD-managed deployments)

# Navigate to cluster configuration
cd /home/coder/ardenone-cluster/cluster-configuration/apexalgo-iad/mcp

# Revert the problematic commit (assumes it is the most recent commit; otherwise pass its hash)
git log --oneline -5
git revert --no-commit HEAD

# Commit and push the revert
git commit -m "fix: rollback zai-proxy to previous stable version"
git push origin main

# ArgoCD will automatically sync the revert

Option 2: Request Rollback via Human Intervention

# Create a bead to request rollback
cd /home/coder/ardenone-cluster/containers/zai-proxy

br create "URGENT: Request production rollback for zai-proxy" \
  --type bug \
  --priority P0 \
  --description "CRITICAL: Production rollback requested

  Current Issues:
  - [Describe symptoms]

  Requested Action:
  - kubectl rollout undo deployment/zai-proxy -n mcp
  - OR: Scale to 0 and back up

  Verified via Read-Only:
  - Production pods: $(kubectl get pods -n mcp -l app=zai-proxy,variant=production)
  - Current image: $(kubectl get deployment/zai-proxy -n mcp -o jsonpath='{.spec.template.spec.containers[0].image}')
  - Rollout history available
  " \
  --labels critical,rollback,production,human-required

Option 3: Direct Cluster Access (If Available)

If you have direct kubectl access with admin permissions (not via devpod-observer):

# Use local kubeconfig or admin credentials
kubectl rollout undo deployment/zai-proxy -n mcp

Verification with Read-Only Access

You CAN verify the cluster state even with read-only access:

# Check deployment status
kubectl get deployment/zai-proxy -n mcp

# Check pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=production

# View recent logs
kubectl logs -n mcp deployment/zai-proxy --tail=50

# Check rollout history
kubectl rollout history deployment/zai-proxy -n mcp

# View service endpoints
kubectl get endpoints -n mcp zai-proxy

Related Documentation

CANARY_PROMOTION_PROCEDURE.md - Promoting canary to production
CANARY_PROMOTION_CHECKLIST.md - Promotion checklist
DEPLOYMENT.md - Worker configuration and dual-deployment workflow
TOKEN_COUNTING.md - Token counting implementation
REGRESSION_TESTING.md - Running regression tests
README-traffic-splitting.md - Traffic splitting options

Recovery Timeline

Action	Time	Notes
Scale canary to 0	<10s	Immediate stop
Delete canary resources	<30s	Full cleanup
Verify production healthy	<1min	Confirm no impact
Production rollback	<2min	Full rollout undo
Collect diagnostics	<5min	For analysis
Document failure	<10min	Postmortem
Canary rollback time	<15min	Production unaffected
Production rollback time	<5min	Brief interruption

Key Point: Production is never modified during canary rollback, so downtime is zero.

Document Version: 2.1.0
Last Updated: 2026-02-08
Maintained By: Claude Code Workers
Related Bead: bd-2s5

Important RBAC Note

When accessing from devpods via kubectl-proxy: The devpod-observer ServiceAccount has limited permissions and cannot perform write operations like kubectl rollout undo or kubectl scale.

For devpod access, use GitOps rollback (Option 1) instead of direct kubectl commands.

Direct kubectl rollback commands work when:

Running from within the apexalgo-iad cluster directly
Using a ServiceAccount with deployment edit permissions

If the deployment is managed by ArgoCD, use a Git revert instead: ArgoCD may overwrite a change made with kubectl when it next reconciles with Git.
