zai-proxy/docs/notes/zai-proxy-blue-green-switchover.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00


# Z.AI Proxy Blue-Green Deployment - Traffic Switchover
## Current Status
- **V1 (Old)**: `zai-proxy` deployment running `ronaldraygun/zai-proxy:1.1.0`
- **V2 (New)**: `zai-proxy-v2` deployment running `ronaldraygun/zai-proxy:1.3.0`
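- **Service**: Currently routes to V1 (`selector: app=zai-proxy`, with no `version` label). A selector matches every pod that carries all of its labels, so this one also matches V2 pods if they are labelled `app=zai-proxy`.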
## Switchover Procedure
### Step 1: Verify V2 is Running and Healthy
```bash
kubectl get deployment zai-proxy-v2 -n devpod
kubectl get pods -n devpod -l version=v2
kubectl logs -n devpod -l version=v2 --tail=20
# Test V2 directly, bypassing the Service (run from inside the cluster, where pod IPs are routable)
POD_IP=$(kubectl get pod -n devpod -l version=v2 -o jsonpath='{.items[0].status.podIP}')
curl -fsS "http://$POD_IP:8080/health"
curl -fsS "http://$POD_IP:8080/metrics" | grep zai_proxy_rate_limit
```
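The deployment check in Step 1 can also be scripted. The following is a minimal sketch, not part of the original procedure: `deployment_ready` is a hypothetical helper name, and it assumes "healthy" means that the number of ready replicas equals the desired replica count.

```bash
# Hypothetical helper: succeed only when a Deployment has at least one desired
# replica and all desired replicas are ready.
deployment_ready() {
  local name="$1" ns="$2" want ready
  want=$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.spec.replicas}')
  # .status.readyReplicas is omitted when no replica is ready, so default to 0.
  ready=$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.status.readyReplicas}')
  [ "${want:-0}" -gt 0 ] && [ "${ready:-0}" -eq "$want" ]
}
```

Usage: `deployment_ready zai-proxy-v2 devpod && echo "V2 is ready"`.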
### Step 2: Update Service Selector to Route to V2
```bash
kubectl patch service zai-proxy -n devpod --type=merge -p '
{
  "spec": {
    "selector": {
      "app": "zai-proxy",
      "version": "v2"
    }
  }
}'
```
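The patch above only routes traffic if at least one pod carries both labels; otherwise the Service is left with no endpoints. A pre-flight check along these lines may help. It is a sketch under assumptions: the helper names are hypothetical, and it assumes V2 pods are labelled `app=zai-proxy` and `version=v2`, as the selector implies. It counts matching pods only and does not check readiness.

```bash
# Hypothetical helper: count pods matching the selector the Service is about to use.
count_matching_pods() {
  local ns="$1" selector="$2"
  kubectl get pods -n "$ns" -l "$selector" -o name | wc -l | tr -d ' '
}

# Refuse to continue when nothing would back the Service after the patch.
require_matching_pods() {
  local n
  n=$(count_matching_pods "$1" "$2")
  if [ "$n" -eq 0 ]; then
    echo "no pods match '$2' in namespace $1; do not patch the Service" >&2
    return 1
  fi
  echo "$n pod(s) match '$2'"
}
```

Usage: run `require_matching_pods devpod app=zai-proxy,version=v2` and apply the patch only if it succeeds.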
### Step 3: Verify Traffic is Flowing to V2
```bash
# Check service endpoints
kubectl get endpoints zai-proxy -n devpod
# Test through service
curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/health
curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/metrics | grep "deployment_variant"
# Expect: deployment_variant="v2"
```
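The visual check on `deployment_variant` can be turned into a pass/fail test. This is a minimal sketch with hypothetical helper names; it assumes the label appears in the Prometheus text output exactly as `deployment_variant="..."`.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# each distinct deployment_variant label value, one per line.
variants_from_metrics() {
  grep -o 'deployment_variant="[^"]*"' | cut -d'"' -f2 | sort -u
}

# Succeeds only when the metrics report exactly the expected variant and no other.
assert_only_variant() {
  local expected="$1" seen
  seen=$(variants_from_metrics)
  [ "$seen" = "$expected" ]
}
```

Usage: `curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/metrics | assert_only_variant v2`. A single request reaches only one pod, so repeat the call a few times if the Service has several endpoints.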
### Step 4: Monitor Metrics in Grafana
Confirm that the metrics introduced in V2 now appear in Grafana:
- Current Rate Limit
- Token counting metrics
- Adaptive rate limit adjustments
### Step 5: Retire the Old V1 Deployment (Optional)
**Option A: Keep V1 for Quick Rollback (Recommended for the First 24 Hours)**
```bash
# Scale V1 to 0 replicas but keep deployment
kubectl scale deployment zai-proxy -n devpod --replicas=0
```
**Option B: Delete V1 Completely**
```bash
kubectl delete deployment zai-proxy -n devpod
```
## Rollback Procedure (If Needed)
If V2 has issues, roll back to V1 by removing the `version` label from the Service selector:
```bash
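# If V1 was scaled to 0, bring it back and wait until it is ready
kubectl scale deployment zai-proxy -n devpod --replicas=1
kubectl rollout status deployment/zai-proxy -n devpod
# Switch the Service back to V1. A merge patch removes a key only when it is
# set to null; omitting "version" would leave the existing value in place.
kubectl patch service zai-proxy -n devpod --type=merge -p '
{
  "spec": {
    "selector": {
      "app": "zai-proxy",
      "version": null
    }
  }
}'
# Or remove the version label with a JSON patch instead (run one or the other;
# the remove op fails if the label is already gone)
kubectl patch service zai-proxy -n devpod --type=json -p='[
  {"op": "remove", "path": "/spec/selector/version"}
]'
# Without a version label the selector matches every pod labelled app=zai-proxy,
# so scale V2 down if it must stop receiving traffic
kubectl scale deployment zai-proxy-v2 -n devpod --replicas=0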
```
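After a rollback it is worth confirming that the Service selector no longer pins a version. A minimal sketch, with hypothetical helper names:

```bash
# Hypothetical helper: print the version label in a Service's selector.
# Prints nothing when the selector has no version label.
service_version_label() {
  kubectl get service "$1" -n "$2" -o jsonpath='{.spec.selector.version}'
}

# Succeeds once the zai-proxy Service selector carries no version label.
rollback_complete() {
  local v
  v=$(service_version_label zai-proxy devpod)
  if [ -n "$v" ]; then
    echo "selector still pinned to version=$v" >&2
    return 1
  fi
  echo "selector has no version label"
}
```

Usage: `rollback_complete && kubectl get endpoints zai-proxy -n devpod`.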
## Benefits of This Approach
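1. **Zero Downtime**: V2 is running and verified before the Service stops sending traffic to V1
2. **Fast Rollback**: V1 stays deployed (running or scaled to 0), so rolling back is a selector change plus a scale-up if V1 was scaled down
3. **Verification Before Switchover**: V2 can be tested directly by pod IP before the Service selector is changed
4. **Safe**: Direct pod tests bypass the Service, so they do not touch user traffic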
## Worker Impact
- New requests from workers continue through the same Service without interruption
- Existing connections may be briefly reset during the Service selector change
- Rate limiting state is not carried over: V2 starts from its initial value (`RATE_LIMIT_INITIAL=2`)
## Monitoring Checklist
- [ ] V2 pod is Running
- [ ] V2 health check passes
- [ ] V2 metrics endpoint accessible
- [ ] Service endpoints point to V2 pod
- [ ] Workers can make requests successfully
- [ ] Grafana shows new metrics
- [ ] No 429 or 502 errors in V2 logs
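
The last checklist item can be approximated with a log scan. This is a sketch only: `count_error_lines` is a hypothetical helper, and it assumes the status code appears as a standalone token in each log line, which may not match the proxy's real log format.

```bash
# Hypothetical helper: count log lines on stdin that mention a 429 or 502 status.
# -w matches whole tokens only, so "14290" is not counted as 429.
count_error_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
  grep -cwE '429|502' || true
}
```

Usage: `kubectl logs -n devpod -l version=v2 --tail=500 | count_error_lines`.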