jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00

Z.AI Proxy Metrics and Autoscaling

Overview

The zai-proxy exposes Prometheus metrics and is scaled by a HorizontalPodAutoscaler, so that the pods together make as much use of the z.ai coding subscription as its limits allow.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Z.AI Proxy Cluster                      │
│                                                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  zai-proxy  │  │  zai-proxy  │  │  zai-proxy  │          │
│  │   Pod 1     │  │   Pod 2     │  │   Pod N     │          │
│  │             │  │             │  │             │          │
│  │ MAX_WORKERS │  │ MAX_WORKERS │  │ MAX_WORKERS │          │
│  │     = 20    │  │     = 20    │  │     = 20    │          │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘          │
│         │                │                │                  │
│         └────────────────┴────────────────┘                  │
│                          │                                   │
│                   /metrics endpoint                          │
│                          │                                   │
│         ┌────────────────▼────────────────┐                  │
│         │      ServiceMonitor             │                  │
│         │   (scrapes every 15s)           │                  │
│         └────────────────┬────────────────┘                  │
│                          │                                   │
│         ┌────────────────▼────────────────┐                  │
│         │       Prometheus                │                  │
│         │  (stores time-series data)      │                  │
│         └────────────────┬────────────────┘                  │
│                          │                                   │
│         ┌────────────────▼────────────────┐                  │
│         │  HorizontalPodAutoscaler        │                  │
│         │  - CPU > 70%: scale up          │                  │
│         │  - Memory > 80%: scale up       │                  │
│         │  - Worker util > 80%: scale up  │                  │
│         │  Min: 1, Max: 5 replicas        │                  │
│         └─────────────────────────────────┘                  │
│                                                               │
│  ┌───────────────────────────────────────────────────────┐   │
│  │              Grafana Dashboard                        │   │
│  │  - Worker utilization gauge                           │   │
│  │  - Request rate by status code                        │   │
│  │  - Concurrent requests vs max workers                 │   │
│  │  - Request duration percentiles (p50, p90, p99)       │   │
│  │  - Request/response size metrics                      │   │
│  │  - Upstream error tracking                            │   │
│  └───────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Metrics Exposed

Request Metrics

zai_proxy_requests_total (Counter)
- Total number of requests by method, path, and status code
- Labels: method, path, status_code
- Example: zai_proxy_requests_total{method="POST",path="/v1/messages",status_code="200"}
zai_proxy_request_duration_seconds (Histogram)
- Request duration in seconds
- Buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 30s, 60s, 120s, 300s
- Labels: method, path, status_code
- Useful for: p50, p90, p99 latency calculations
zai_proxy_request_size_bytes (Histogram)
- Request payload size in bytes
- Exponential buckets: 100, 1000, 10000, ...
- Labels: method, path
zai_proxy_response_size_bytes (Histogram)
- Response payload size in bytes
- Exponential buckets: 100, 1000, 10000, ...
- Labels: method, path, status_code
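
As a rough illustration of the exponential size buckets listed above, the following sketch generates bounds that start at 100 and grow by a factor of 10, then finds the bucket a payload falls into. The function names and the bucket count of 6 are assumptions made for this example; the source only states the first three bounds.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step (100, 1000, 10000, ... for 100 and 10).
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	bound := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, bound)
		bound *= factor
	}
	return buckets
}

// bucketIndex returns the index of the first bucket whose upper bound is
// >= size, or len(buckets) when the value only fits the implicit +Inf bucket.
func bucketIndex(buckets []float64, size float64) int {
	for i, upper := range buckets {
		if size <= upper {
			return i
		}
	}
	return len(buckets)
}

func main() {
	// The count of 6 is an assumption; the proxy may use a different number.
	buckets := exponentialBuckets(100, 10, 6)
	fmt.Println(buckets)
	fmt.Println(bucketIndex(buckets, 2048)) // a 2 KiB payload lands in the 10000 bucket
}
```

With a factor of 10, each bucket spans one order of magnitude, so percentile estimates for payload sizes are coarse within a bucket.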

Worker Metrics

zai_proxy_concurrent_requests (Gauge)
- Number of requests currently being processed
- Real-time view of active connections
zai_proxy_max_workers (Gauge)
- Maximum number of concurrent workers allowed per pod
- Set via MAX_WORKERS environment variable (default: 20)
zai_proxy_worker_utilization_ratio (Gauge)
- Current worker utilization ratio (concurrent_requests / max_workers)
- Range: 0.0 to 1.0+
- Key metric for autoscaling decisions
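
The relationship between the three worker gauges can be sketched as follows. This is a minimal illustration using only the standard library, not the proxy's actual code; the `workerStats` type and its method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// workerStats is a hypothetical holder for the values behind the
// concurrent_requests, max_workers, and worker_utilization_ratio gauges.
type workerStats struct {
	concurrent atomic.Int64 // requests currently in flight
	maxWorkers int64        // value of MAX_WORKERS
}

func (s *workerStats) begin() { s.concurrent.Add(1) }
func (s *workerStats) end()   { s.concurrent.Add(-1) }

// utilization returns concurrent_requests / max_workers.
func (s *workerStats) utilization() float64 {
	if s.maxWorkers <= 0 {
		return 0
	}
	return float64(s.concurrent.Load()) / float64(s.maxWorkers)
}

func main() {
	stats := &workerStats{maxWorkers: 20}
	for i := 0; i < 5; i++ {
		stats.begin()
	}
	fmt.Printf("%.2f\n", stats.utilization()) // 5 of 20 workers busy
	stats.end()
	fmt.Printf("%.2f\n", stats.utilization())
}
```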

Error Metrics

zai_proxy_upstream_errors_total (Counter)
- Total number of upstream errors by type
- Labels: error_type
- Error types:
  - request_creation - Failed to create upstream request
  - upstream_connection - Failed to connect to z.ai API
  - read_error - Error reading response from z.ai
  - write_error - Error writing response to client

Configuration

Environment Variables

Both deployments (ardenone-cluster/devpod and apexalgo-iad/mcp) support:

env:
  - name: ZAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: zai-api-key
        key: api-key

  - name: MAX_WORKERS
    value: "20"  # Adjust based on subscription limits

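MAX_WORKERS: Controls the maximum number of concurrent requests a single pod will handle. Once a pod is at this limit, the proxy rejects further requests with 503 Service Unavailable. The 503 itself does not trigger scaling: the HPA reacts to the CPU, memory, and worker-utilization metrics described below.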
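
One common way to enforce such a per-pod limit is a counting semaphore around the handler. The sketch below assumes that approach; the `limiter` type, its methods, and the error message are hypothetical and may differ from what the proxy does internally.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// limiter is a counting semaphore backed by a buffered channel.
type limiter struct{ slots chan struct{} }

func newLimiter(maxWorkers int) *limiter {
	return &limiter{slots: make(chan struct{}, maxWorkers)}
}

// tryAcquire claims a worker slot without blocking.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l *limiter) release() { <-l.slots }

// wrap rejects requests with 503 when every worker slot is taken.
func (l *limiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			http.Error(w, "all workers busy", http.StatusServiceUnavailable)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}

func main() {
	l := newLimiter(1) // MAX_WORKERS=1 keeps the demo short
	h := l.wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))

	l.tryAcquire() // simulate one request already in flight
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/v1/messages", nil))
	fmt.Println(rec.Code) // 503: the only slot is taken

	l.release()
	rec = httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/v1/messages", nil))
	fmt.Println(rec.Code) // 200: the slot is free again
}
```

Rejecting immediately rather than queueing keeps latency bounded on a saturated pod and leaves retry decisions to the client.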

Autoscaling Behavior

Scale Up:

Stabilization window: 30 seconds
Policies:
- Can double pod count instantly (100% increase)
- Or add 2 pods at a time
- Uses the most aggressive policy

Scale Down:

Stabilization window: 300 seconds (5 minutes)
Policies:
- Maximum 25% reduction at a time
- Slow scale-down to avoid thrashing

Replica Limits:

Minimum: 1 pod
Maximum: 5 pods
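
The effect of these policies on a single scaling step can be sketched as below. The scale-up side takes the larger of the two policies (doubling or adding 2 pods) and clamps to the replica maximum. The scale-down rounding is an assumption of this sketch and the real controller may round differently; the function names are illustrative.

```go
package main

import "fmt"

// scaleUpLimit returns the highest replica count one scale-up step may reach:
// the larger of a 100% increase and +2 pods, capped at maxReplicas.
func scaleUpLimit(current, maxReplicas int) int {
	byPercent := current * 2 // 100% increase
	byPods := current + 2    // add 2 pods
	limit := byPercent
	if byPods > limit {
		limit = byPods
	}
	if limit > maxReplicas {
		limit = maxReplicas
	}
	return limit
}

// scaleDownLimit returns the lowest replica count one scale-down step may
// reach with a 25% reduction policy. Rounding toward fewer pods is an
// assumption made for this example.
func scaleDownLimit(current, minReplicas int) int {
	limit := current * 75 / 100
	if limit < minReplicas {
		limit = minReplicas
	}
	return limit
}

func main() {
	for current := 1; current <= 5; current++ {
		fmt.Printf("replicas=%d up<=%d down>=%d\n",
			current, scaleUpLimit(current, 5), scaleDownLimit(current, 1))
	}
}
```

With a maximum of 5 replicas, the "+2 pods" policy dominates at 1 replica and the cap dominates from 3 replicas upward.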

Scaling Triggers

CPU Utilization > 70%
Memory Utilization > 80%
Worker Utilization > 80% (requires prometheus-adapter - see below)
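
To see how these triggers combine, the sketch below applies the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentValue / targetValue), to each metric and takes the largest result. Tolerance, stabilization windows, and missing-metric handling are deliberately left out, and the observed values are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// trigger is one scaling metric with its observed and target value (percent).
type trigger struct {
	name    string
	current float64
	target  float64
}

// desiredReplicas applies ceil(replicas * current / target) for one metric.
func desiredReplicas(replicas int, current, target float64) int {
	return int(math.Ceil(float64(replicas) * current / target))
}

// recommend takes the largest per-metric result, clamped to [min, max].
func recommend(replicas, minReplicas, maxReplicas int, triggers []trigger) int {
	want := minReplicas
	for _, t := range triggers {
		if d := desiredReplicas(replicas, t.current, t.target); d > want {
			want = d
		}
	}
	if want > maxReplicas {
		want = maxReplicas
	}
	return want
}

func main() {
	// Hypothetical observations: CPU and memory are calm, workers are busy.
	triggers := []trigger{
		{"cpu", 35, 70},
		{"memory", 40, 80},
		{"workers", 95, 80},
	}
	fmt.Println(recommend(2, 1, 5, triggers))
}
```

Because the largest recommendation wins, a saturated worker pool can drive a scale-up even while CPU and memory stay well under their targets.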

Prometheus Adapter Configuration (Optional)

To enable custom metric-based autoscaling (worker utilization), configure prometheus-adapter:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'zai_proxy_worker_utilization_ratio'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)$"
        as: "zai_proxy_worker_utilization_ratio"
      metricsQuery: 'avg(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

Then uncomment the custom metric section in the HPA manifests.

Querying Metrics

Useful PromQL Queries

Request rate (req/s):

sum(rate(zai_proxy_requests_total[5m]))

Request rate by status code:

sum(rate(zai_proxy_requests_total[5m])) by (status_code)

p99 latency:

histogram_quantile(0.99, sum(rate(zai_proxy_request_duration_seconds_bucket[5m])) by (le))
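
For intuition about what histogram_quantile returns, the sketch below reproduces its basic estimate for classic histograms: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus code; the type and function names are hypothetical, and it assumes sorted cumulative buckets ending in +Inf and 0 < q <= 1.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket (the `le` label and its count).
type bucket struct {
	upperBound float64 // math.Inf(1) for the +Inf bucket
	count      float64 // cumulative observations <= upperBound
}

// quantile estimates the q-quantile by linear interpolation within a bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q <= 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	i := 0
	for i < len(buckets)-1 && buckets[i].count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls into +Inf: report the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}
	lower, prev := 0.0, 0.0
	if i > 0 {
		lower, prev = buckets[i-1].upperBound, buckets[i-1].count
	}
	return lower + (buckets[i].upperBound-lower)*(rank-prev)/(buckets[i].count-prev)
}

func main() {
	// Hypothetical bucket counts, in seconds.
	buckets := []bucket{{0.5, 50}, {1, 75}, {2, 100}, {math.Inf(1), 100}}
	fmt.Println(quantile(0.5, buckets))
	fmt.Println(quantile(0.875, buckets))
}
```

The estimate can never be more precise than the bucket boundaries, which is why slow requests beyond the largest finite bucket all report that bucket's bound.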

Worker utilization (current, averaged across pods):

avg(zai_proxy_worker_utilization_ratio)

Total concurrent capacity:

sum(zai_proxy_max_workers)

Error rate:

sum(rate(zai_proxy_upstream_errors_total[5m])) by (error_type)

Success rate (non-5xx):

sum(rate(zai_proxy_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))

Grafana Dashboard

A pre-configured Grafana dashboard is deployed to the monitoring namespace.

Panels:

Worker Utilization Gauge - Real-time utilization percentage
Request Rate by Status Code - Time-series of req/s grouped by HTTP status
Concurrent Requests vs Max Workers - Visual capacity tracking
Request Duration Percentiles - p50, p90, p99 latency trends
Request/Response Size (p90) - Bandwidth usage
Upstream Errors - Error rate by type

Access:

Navigate to Grafana (check IngressRoute for URL)
Search for "Z.AI Proxy Metrics" dashboard

Deployment Workflow

1. Build New Container Image

cd /home/coder/ardenone-cluster

# Version is already bumped to 1.1.0
git add containers/zai-proxy/
git commit -m "feat(zai-proxy): add Prometheus metrics and worker pool management

- Add comprehensive Prometheus metrics for requests, durations, sizes
- Track concurrent requests and worker utilization
- Add MAX_WORKERS environment variable for capacity control
- Expose /metrics endpoint for Prometheus scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

git push origin main

Wait for GitHub Actions to complete (~5 minutes). Check:

https://github.com/ardenone/ardenone-cluster/actions

2. Deploy ServiceMonitors and HPAs

git add cluster-configuration/
git commit -m "feat(zai-proxy): add ServiceMonitors, HPAs, and Grafana dashboard

- Add ServiceMonitor for both ardenone-cluster and apexalgo-iad
- Configure HorizontalPodAutoscaler with CPU/memory/worker metrics
- Deploy Grafana dashboard for visualization
- Update deployments with MAX_WORKERS=20 and metrics port

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

git push origin main

ArgoCD will automatically sync these changes.

3. Update Deployment to v1.1.0

ONLY AFTER GitHub Actions build succeeds:

# Update image version in both deployments
sed -i 's|ronaldraygun/zai-proxy:1.0.0|ronaldraygun/zai-proxy:1.1.0|g' \
  cluster-configuration/ardenone-cluster/devpod/zai-proxy.yml \
  cluster-configuration/apexalgo-iad/mcp/zai-proxy.yml

git add cluster-configuration/
git commit -m "chore(zai-proxy): bump to v1.1.0 with metrics support

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

git push origin main

4. Verify Deployment

Check pods:

kubectl get pods -n devpod -l app=zai-proxy
kubectl get pods -n mcp -l app=zai-proxy --kubeconfig=/home/coder/.kube/apexalgo-iad.kubeconfig

Check metrics endpoint:

kubectl port-forward -n devpod svc/zai-proxy 8080:8080 &
sleep 2  # give the port-forward a moment to start listening
curl -s http://localhost:8080/metrics | grep zai_proxy
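
If the same check needs to be automated, the output of the /metrics endpoint can be filtered in code. The sketch below parses a response body in the Prometheus text format and keeps only samples with a given prefix. It assumes lines without timestamps, and the sample body and its values are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleMetrics is a hypothetical excerpt of a /metrics response.
const sampleMetrics = `# HELP zai_proxy_concurrent_requests Requests in flight
# TYPE zai_proxy_concurrent_requests gauge
zai_proxy_concurrent_requests 5
zai_proxy_max_workers 20
zai_proxy_requests_total{method="POST",path="/v1/messages",status_code="200"} 42
go_goroutines 12
`

// parseSamples returns series -> value for every sample line starting with
// prefix. Comment lines, blank lines, and unparsable values are skipped.
func parseSamples(body, prefix string) map[string]float64 {
	samples := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, prefix) {
			continue
		}
		idx := strings.LastIndex(line, " ") // the value follows the last space
		if idx < 0 {
			continue
		}
		value, err := strconv.ParseFloat(line[idx+1:], 64)
		if err != nil {
			continue
		}
		samples[line[:idx]] = value
	}
	return samples
}

func main() {
	samples := parseSamples(sampleMetrics, "zai_proxy_")
	fmt.Println(len(samples), "zai_proxy series found")
	fmt.Println(samples["zai_proxy_concurrent_requests"] / samples["zai_proxy_max_workers"])
}
```

In a real check, the body would come from an HTTP GET against the forwarded port instead of the constant.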

Check HPA status:

kubectl get hpa -n devpod zai-proxy
kubectl describe hpa -n devpod zai-proxy

Check ServiceMonitor:

kubectl get servicemonitor -n devpod zai-proxy

Tuning for Maximum Subscription Utilization

Strategy 1: Fixed Worker Pool

Set MAX_WORKERS based on your z.ai subscription limits:

If subscription allows 50 concurrent requests:
- Set MAX_WORKERS=10 with maxReplicas=5 (10 * 5 = 50 total)
- Or MAX_WORKERS=25 with maxReplicas=2 (25 * 2 = 50 total)
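
The arithmetic behind these combinations can be sketched as a small helper that lists every MAX_WORKERS and maxReplicas pair whose product matches a subscription limit exactly. The `plan` type and `plans` function are hypothetical names for this example, and the limit of 50 is the same illustrative figure used above.

```go
package main

import "fmt"

// plan is one way to split a concurrency limit across pods.
type plan struct {
	maxWorkers  int
	maxReplicas int
}

// plans returns every split with at most replicaCap pods where
// maxWorkers * maxReplicas equals limit exactly.
func plans(limit, replicaCap int) []plan {
	var out []plan
	for replicas := 1; replicas <= replicaCap; replicas++ {
		if limit%replicas == 0 {
			out = append(out, plan{maxWorkers: limit / replicas, maxReplicas: replicas})
		}
	}
	return out
}

func main() {
	for _, p := range plans(50, 5) {
		fmt.Printf("MAX_WORKERS=%d maxReplicas=%d\n", p.maxWorkers, p.maxReplicas)
	}
}
```

Fewer, larger pods waste less capacity at low traffic, while more, smaller pods give the autoscaler finer steps; which split fits better depends on the traffic pattern.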

Strategy 2: Dynamic Scaling

Monitor zai_proxy_worker_utilization_ratio in Grafana
If consistently below 0.5 (50%), reduce MAX_WORKERS or maxReplicas
If frequently hitting 1.0 (100%), increase MAX_WORKERS or maxReplicas

Strategy 3: Cost Optimization

Low-traffic periods: Set minReplicas=1
High-traffic periods: Use aggressive scale-up policies
Balance: Slow scale-down (5 min stabilization) prevents thrashing, at the cost of keeping extra pods running briefly after traffic drops

Alerting Rules (Optional)

Add Prometheus alerting rules to get notified of issues:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zai-proxy-alerts
  namespace: monitoring
spec:
  groups:
  - name: zai-proxy
    interval: 30s
    rules:
    - alert: ZaiProxyHighErrorRate
      expr: |
        sum(rate(zai_proxy_requests_total{status_code=~"5.."}[5m]))
        /
        sum(rate(zai_proxy_requests_total[5m]))
        > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Z.AI Proxy error rate > 5%"

    - alert: ZaiProxyHighUtilization
      expr: avg(zai_proxy_worker_utilization_ratio) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Z.AI Proxy worker utilization > 90% for 10 minutes"

    - alert: ZaiProxyAtMaxCapacity
      expr: |
        sum(zai_proxy_concurrent_requests)
        >=
        sum(zai_proxy_max_workers)
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Z.AI Proxy at maximum capacity - requests being rejected"

Troubleshooting

Metrics not appearing in Prometheus

Check ServiceMonitor is deployed:
```
kubectl get servicemonitor -n devpod
```

Check Prometheus is scraping:

kubectl logs -n monitoring -l app=prometheus

Verify metrics endpoint is accessible:

kubectl exec -n devpod deploy/zai-proxy -- wget -O- http://localhost:8080/metrics

HPA not scaling

Check HPA status:

kubectl describe hpa -n devpod zai-proxy

Verify metrics-server is running:

kubectl get pods -n kube-system -l k8s-app=metrics-server

Check current metrics:

kubectl get hpa -n devpod zai-proxy -o yaml

Pods stuck at capacity (503 errors)

Check worker utilization:

avg(zai_proxy_worker_utilization_ratio)

Increase MAX_WORKERS or maxReplicas in HPA

Verify HPA is allowed to scale up:

kubectl get hpa -n devpod zai-proxy
# Current replicas should be < maxReplicas

Next Steps

Monitor for 1-2 weeks - Collect baseline metrics
Tune MAX_WORKERS - Adjust based on actual utilization
Enable custom metrics - Configure prometheus-adapter for worker-based autoscaling
Set up alerts - Get notified of capacity issues
Cost analysis - Measure subscription utilization vs pod costs

References

Prometheus Operator: https://prometheus-operator.dev/
HPA documentation: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
Prometheus adapter: https://github.com/kubernetes-sigs/prometheus-adapter
