jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00


Z.AI Proxy Blue-Green Deployment - Traffic Switchover

Current Status

V1 (Old): zai-proxy deployment running ronaldraygun/zai-proxy:1.1.0
V2 (New): zai-proxy-v2 deployment running ronaldraygun/zai-proxy:1.3.0
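Service: zai-proxy, currently routing to V1 (selector: app=zai-proxy, with no version label)
Note: if the V2 pods also carry app=zai-proxy, as the Step 2 selector implies, this selector already matches them once they are Ready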

Switchover Procedure

Step 1: Verify V2 is Running and Healthy

kubectl get deployment zai-proxy-v2 -n devpod
kubectl get pods -n devpod -l version=v2
kubectl logs -n devpod -l version=v2 --tail=20

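# Test V2 directly (bypass the service); run this from a host that can reach pod IPs
POD_IP=$(kubectl get pod -n devpod -l version=v2 -o jsonpath='{.items[0].status.podIP}')
# -f fails on HTTP errors, -sS hides the progress meter but still prints errors
curl -fsS "http://${POD_IP}:8080/health"
curl -fsS "http://${POD_IP}:8080/metrics" | grep zai_proxy_rate_limit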
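
The Step 1 checks can be wrapped in one function that stops at the first failure. This is a minimal sketch, assuming the namespace, label, and port used above; the function name `verify_v2` and the 120-second timeout are illustrative and not part of the original procedure. It has to run somewhere that can reach pod IPs, for example from another pod in the cluster.

```shell
# Hypothetical helper: verify V2 before switching traffic.
# Assumes namespace devpod, label version=v2, and port 8080, as in Step 1.
verify_v2() {
  ns="devpod"

  # Block until the rollout finishes or the (arbitrary) timeout expires.
  kubectl rollout status deployment/zai-proxy-v2 -n "$ns" --timeout=120s || return 1

  # Look up the IP of the first V2 pod and bail out if none was found.
  pod_ip=$(kubectl get pod -n "$ns" -l version=v2 \
    -o jsonpath='{.items[0].status.podIP}') || return 1
  [ -n "$pod_ip" ] || { echo "no V2 pod IP found" >&2; return 1; }

  # -f makes curl return non-zero on HTTP error responses.
  curl -fsS "http://${pod_ip}:8080/health" >/dev/null || return 1
  echo "V2 healthy at ${pod_ip}"
}
```

Calling `verify_v2 && echo "safe to switch"` before Step 2 keeps the selector change from running against an unhealthy V2.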

Step 2: Update Service Selector to Route to V2

kubectl patch service zai-proxy -n devpod --type=merge -p '
{
  "spec": {
    "selector": {
      "app": "zai-proxy",
      "version": "v2"
    }
  }
}'
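
The selector patch can also be generated rather than typed by hand, which keeps the switch and the rollback symmetrical. The sketch below assumes the `app=zai-proxy` and `version` labels shown above; the helper name `selector_patch` is hypothetical. It relies on the JSON merge patch rule that a key set to `null` is removed, while a key that is simply omitted is left unchanged.

```shell
# Hypothetical helper: print a JSON merge patch for the service selector.
# With a version argument it pins the selector to that version.
# With no argument it sets "version" to null, which a merge patch
# interprets as "remove this key".
selector_patch() {
  version="${1:-}"
  if [ -n "$version" ]; then
    printf '{"spec":{"selector":{"app":"zai-proxy","version":"%s"}}}\n' "$version"
  else
    printf '{"spec":{"selector":{"app":"zai-proxy","version":null}}}\n'
  fi
}

# Usage (not executed here):
#   kubectl patch service zai-proxy -n devpod --type=merge -p "$(selector_patch v2)"
#   kubectl patch service zai-proxy -n devpod --type=merge -p "$(selector_patch)"
```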

Step 3: Verify Traffic is Flowing to V2

# Check service endpoints
kubectl get endpoints zai-proxy -n devpod

# Test through the service
curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/health
curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/metrics | grep "deployment_variant"

# Expected: deployment_variant="v2"
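
To turn the visual check on `deployment_variant` into a pass/fail result, the metrics output can be piped through a small filter. This is a sketch only: the helper name `has_variant` is hypothetical, and it assumes the label appears in the metrics text exactly as `deployment_variant="v2"`, as stated above.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# succeed only if at least one line carries deployment_variant="<expected>".
has_variant() {
  expected="${1:?usage: has_variant <variant>}"
  grep -q "deployment_variant=\"${expected}\""
}

# Usage (not executed here):
#   curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/metrics | has_variant v2 \
#     && echo "service is answering from V2"
```

Each request through the service reaches a single pod, so repeating the check a few times gives more confidence that every backing pod is V2.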

Step 4: Monitor Metrics in Grafana

Check that new metrics are now available:

Current Rate Limit
Token counting metrics
Adaptive rate limit adjustments

Step 5: Retire the Old V1 Deployment (Optional)

Option A: Keep V1 for Quick Rollback (Recommended for the First 24 Hours)

# Scale V1 to 0 replicas but keep deployment
kubectl scale deployment zai-proxy -n devpod --replicas=0

Option B: Delete V1 Completely

kubectl delete deployment zai-proxy -n devpod

Rollback Procedure (If Needed)

If V2 has issues, roll back to V1 immediately:

# If V1 is scaled to 0
kubectl scale deployment zai-proxy -n devpod --replicas=1

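# Switch the service back to V1 by removing the version key.
# In a merge patch a key must be set to null to be removed; omitting it leaves it unchanged.
kubectl patch service zai-proxy -n devpod --type=merge -p '
{
  "spec": {
    "selector": {
      "app": "zai-proxy",
      "version": null
    }
  }
}'

# Or remove the version key with a JSON patch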
kubectl patch service zai-proxy -n devpod --type=json -p='[
  {"op": "remove", "path": "/spec/selector/version"}
]'
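
The rollback steps can be sketched as a single function so they run in a fixed order under pressure. The function name `rollback_to_v1` and the timeout are illustrative. The last step is an addition based on an assumption: since the Step 2 selector matches V2 pods on `app=zai-proxy`, those pods would still match a selector without a version key, so the sketch scales V2 to 0 to take them out of rotation.

```shell
# Hypothetical helper: roll traffic back to V1.
# Assumes the names used above: deployments zai-proxy (V1) and zai-proxy-v2 (V2).
rollback_to_v1() {
  ns="devpod"

  # 1. Bring V1 back and wait until it is ready to serve.
  kubectl scale deployment zai-proxy -n "$ns" --replicas=1 || return 1
  kubectl rollout status deployment/zai-proxy -n "$ns" --timeout=120s || return 1

  # 2. Drop the version key so the selector is app=zai-proxy only.
  kubectl patch service zai-proxy -n "$ns" --type=json \
    -p='[{"op": "remove", "path": "/spec/selector/version"}]' || return 1

  # 3. Assumption: V2 pods also carry app=zai-proxy and would still match.
  #    Scaling V2 to 0 removes them from the service endpoints.
  kubectl scale deployment zai-proxy-v2 -n "$ns" --replicas=0
}
```

Waiting for the V1 rollout before patching the selector avoids pointing the service at a deployment with no ready pods.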

Benefits of This Approach

Zero Downtime: V2 starts before V1 stops
Instant Rollback: Keep V1 running or scaled to 0
Gradual Verification: Test V2 directly before switching traffic
Safe: Can test without affecting users

Worker Impact

Workers keep using the same service address (zai-proxy), so no worker-side changes are needed
Existing connections may be briefly reset during the service selector change, so in-flight requests may need a retry
Rate limiting resets to its initial value on V2 (RATE_LIMIT_INITIAL=2)
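
Because connections may be briefly reset during the switch, a simple retry on the caller side can absorb the blip. How the workers actually call the proxy is not described here, so the following is only a generic sketch; the helper name `with_retry` and its arguments are hypothetical.

```shell
# Hypothetical helper: retry a command a fixed number of times with a pause.
# Usage: with_retry <attempts> <delay_seconds> <command...>
with_retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is used.
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (not executed here):
#   with_retry 3 1 curl -fsS http://zai-proxy.devpod.svc.cluster.local:8080/health
```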

Monitoring Checklist

V2 pod is Running
V2 health check passes
V2 metrics endpoint accessible
Service endpoints point to V2 pod
Workers can make requests successfully
Grafana shows new metrics
No 429 or 502 errors in V2 logs
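
The last checklist item can be checked mechanically by counting log lines that mention those status codes. This is a sketch under a loud assumption: the proxy's log format is not shown here, so the helper (hypothetically named `count_error_statuses`) simply matches 429 or 502 as standalone numbers and may need adjusting to the real log layout.

```shell
# Hypothetical helper: count log lines on stdin that mention 429 or 502.
# Assumption: status codes appear as standalone numbers in each log line.
count_error_statuses() {
  # grep -c prints the count but exits 1 when it is zero, hence "|| true".
  grep -cE '(^|[^0-9])(429|502)([^0-9]|$)' || true
}

# Usage (not executed here):
#   kubectl logs -n devpod -l version=v2 --tail=200 | count_error_statuses
```

A result of 0 satisfies the checklist item; any other value is a prompt to read the matching lines before retiring V1.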
