zai-proxy/docs/notes/zai-proxy-blue-green-switchover.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00

Z.AI Proxy Blue-Green Deployment - Traffic Switchover

Current Status

  • V1 (Old): zai-proxy deployment running ronaldraygun/zai-proxy:1.1.0
  • V2 (New): zai-proxy-v2 deployment running ronaldraygun/zai-proxy:1.3.0
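  • Service: Currently selects app=zai-proxy with no version label, which is intended to route to V1. A selector without a version key matches every pod labeled app=zai-proxy, so V2 pods carrying that label (as the Step 2 selector assumes) would also match.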

Switchover Procedure

Step 1: Verify V2 is Running and Healthy

kubectl get deployment zai-proxy-v2 -n devpod
kubectl get pods -n devpod -l version=v2
kubectl logs -n devpod -l version=v2 --tail=20

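# Test V2 directly, bypassing the service (run from inside the cluster; pod IPs are typically not reachable from outside)
POD_IP=$(kubectl get pod -n devpod -l version=v2 -o jsonpath='{.items[0].status.podIP}')
# -f makes curl exit non-zero on an HTTP error; -sS hides the progress meter but keeps error messages
curl -fsS "http://$POD_IP:8080/health"
curl -fsS "http://$POD_IP:8080/metrics" | grep zai_proxy_rate_limit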
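
The manual checks above can be gated on the rollout status so that later steps only run once V2 is ready. A minimal sketch, assuming the deployment name and namespace used in this note; `wait_for_v2` is a hypothetical helper name and the 120-second default timeout is an arbitrary choice, not a value from this procedure:

```shell
# Hypothetical helper: block until the zai-proxy-v2 rollout has finished.
# The 120s default timeout is an assumption; adjust it to the real startup time.
wait_for_v2() {
  local timeout="${1:-120s}"
  if kubectl rollout status deployment/zai-proxy-v2 -n devpod --timeout="$timeout"; then
    echo "v2 ready"
  else
    echo "v2 not ready" >&2
    return 1
  fi
}
```

Calling `wait_for_v2 && echo "safe to continue"` before Step 2 avoids switching traffic to a deployment whose pods are still starting.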

Step 2: Update Service Selector to Route to V2

kubectl patch service zai-proxy -n devpod --type=merge -p '
{
  "spec": {
    "selector": {
      "app": "zai-proxy",
      "version": "v2"
    }
  }
}'
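
The same patch can be previewed before it is applied. The sketch below is one possible way to do that, assuming the cluster supports server-side dry run; `selector_patch` and `switch_to` are hypothetical helper names, not part of the proxy tooling:

```shell
# Hypothetical helper: build the merge patch that pins the Service to one version.
selector_patch() {
  printf '{"spec":{"selector":{"app":"zai-proxy","version":"%s"}}}' "$1"
}

# Preview the patched Service with a server-side dry run (nothing is persisted),
# then apply the same patch for real only if the preview succeeded.
switch_to() {
  kubectl patch service zai-proxy -n devpod --type=merge \
    --dry-run=server -o yaml -p "$(selector_patch "$1")" &&
  kubectl patch service zai-proxy -n devpod --type=merge \
    -p "$(selector_patch "$1")"
}
```

Running `switch_to v2` prints the Service as it would look after the change, then performs the switch.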

Step 3: Verify Traffic is Flowing to V2

# Check service endpoints
kubectl get endpoints zai-proxy -n devpod

# Test through service
curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/health
curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/metrics | grep "deployment_variant"

# Expected: matching lines include deployment_variant="v2"
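
Reading the endpoint list by eye can be replaced with a comparison against the V2 pod IPs. This is a sketch under the assumption that V2 pods carry the version=v2 label used earlier in this note; `endpoints_are_v2` is a hypothetical helper name:

```shell
# Hypothetical check: succeed only if every ready Service endpoint IP is a v2 pod IP.
endpoints_are_v2() {
  local ep_ips v2_ips ip
  # Space-separated IPs of the ready addresses behind the Service
  ep_ips=$(kubectl get endpoints zai-proxy -n devpod \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  # Space-separated IPs of all pods labeled version=v2
  v2_ips=$(kubectl get pods -n devpod -l version=v2 \
    -o jsonpath='{.items[*].status.podIP}')
  [ -n "$ep_ips" ] || { echo "no ready endpoints" >&2; return 1; }
  for ip in $ep_ips; do
    case " $v2_ips " in
      *" $ip "*) ;;  # this endpoint is a v2 pod
      *) echo "endpoint $ip is not a v2 pod" >&2; return 1 ;;
    esac
  done
  echo "all endpoints are v2"
}
```

A failure here after Step 2 suggests the selector did not change or that a non-V2 pod also matches it.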

Step 4: Monitor Metrics in Grafana

Confirm that the dashboards now show the metrics introduced in V2:

  • Current Rate Limit
  • Token counting metrics
  • Adaptive rate limit adjustments

Step 5: Delete Old V1 Deployment (Optional - Keep for Rollback)

Option A: Keep V1 for Quick Rollback (Recommended for 24h)

# Scale V1 to 0 replicas but keep deployment
kubectl scale deployment zai-proxy -n devpod --replicas=0

Option B: Delete V1 Completely

kubectl delete deployment zai-proxy -n devpod

Rollback Procedure (If Needed)

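If V2 has issues, roll back to V1. The switch is immediate only while V1 pods are still running; if V1 was scaled to 0, its pods must start before they can serve traffic.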

# If V1 is scaled to 0
kubectl scale deployment zai-proxy -n devpod --replicas=1

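# Switch service back to V1 by removing the version key.
# In a merge patch, omitting a key leaves it unchanged; it must be set to null to be removed.
kubectl patch service zai-proxy -n devpod --type=merge -p '
{
  "spec": {
    "selector": {
      "app": "zai-proxy",
      "version": null
    }
  }
}'

# Equivalent alternative: remove the key with a JSON patch
kubectl patch service zai-proxy -n devpod --type=json -p='[
  {"op": "remove", "path": "/spec/selector/version"}
]'

# Note: without a version key the selector matches any pod labeled app=zai-proxy,
# including V2 pods that are still running.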
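
After a rollback it is worth confirming that the version key is really gone from the selector. A minimal sketch, assuming the service name and namespace from this note; `selector_version` and `confirm_rollback` are hypothetical helper names:

```shell
# Hypothetical helper: print the version the Service selector is pinned to.
# Prints nothing when the selector has no version key.
selector_version() {
  kubectl get service zai-proxy -n devpod \
    -o jsonpath='{.spec.selector.version}'
}

# Succeed only when the version key is absent from the selector.
confirm_rollback() {
  local version
  version=$(selector_version)
  if [ -n "$version" ]; then
    echo "selector still pinned to version=$version" >&2
    return 1
  fi
  echo "selector has no version key"
}
```

If `confirm_rollback` fails, the service is still routing by version and the patch should be reapplied.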

Benefits of This Approach

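  1. Zero Downtime: V2 is running before traffic leaves V1
  2. Fast Rollback: V1 stays deployed (running or scaled to 0), so reverting is a selector change plus, at most, a scale-up
  3. Verification Before Cutover: V2 can be tested directly by pod IP before the service is switched
  4. Safe: Direct testing of V2 does not go through the service that users depend on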

Worker Impact

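  • Workers continue using the proxy through the same service address, with no extended interruption expected
  • Existing connections may be briefly reset during the service selector change, so in-flight requests may need a retry
  • Rate limiting state does not carry over: V2 starts from its initial value (RATE_LIMIT_INITIAL=2)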

Monitoring Checklist

  • V2 pod is Running
  • V2 health check passes
  • V2 metrics endpoint accessible
  • Service endpoints point to V2 pod
  • Workers can make requests successfully
  • Grafana shows new metrics
  • No 429 or 502 errors in V2 logs
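
The last checklist item can be approximated by counting log lines that mention either status code. This is only a sketch: it assumes the status appears in the logs as a standalone number, which this note does not confirm, and `count_error_lines` is a hypothetical helper name:

```shell
# Hypothetical helper: count stdin lines that contain 429 or 502 as a whole word.
# Assumption: the proxy logs the HTTP status as a standalone token; adapt the
# pattern to the real log format.
count_error_lines() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the helper usable under "set -e"
  grep -cEw '429|502' || true
}

# Usage (the --tail value is an arbitrary choice):
#   kubectl logs -n devpod -l version=v2 --tail=500 | count_error_lines
```

A result of 0 is consistent with the checklist item; any other value points to lines worth reading in full.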