Z.AI Proxy Metrics and Autoscaling
Overview
The zai-proxy exposes Prometheus metrics and is configured for Kubernetes autoscaling, with the goal of maximizing utilization of the z.ai coding subscription.
Architecture
┌─────────────────────────────────────────────────────────────┐
│                     Z.AI Proxy Cluster                      │
│                                                             │
│     ┌─────────────┐   ┌─────────────┐   ┌─────────────┐     │
│     │ zai-proxy   │   │ zai-proxy   │   │ zai-proxy   │     │
│     │   Pod 1     │   │   Pod 2     │   │   Pod N     │     │
│     │             │   │             │   │             │     │
│     │ MAX_WORKERS │   │ MAX_WORKERS │   │ MAX_WORKERS │     │
│     │    = 20     │   │    = 20     │   │    = 20     │     │
│     └──────┬──────┘   └──────┬──────┘   └──────┬──────┘     │
│            │                 │                 │            │
│            └─────────────────┼─────────────────┘            │
│                              │                              │
│                      /metrics endpoint                      │
│                              │                              │
│             ┌────────────────▼────────────────┐             │
│             │         ServiceMonitor          │             │
│             │       (scrapes every 15s)       │             │
│             └────────────────┬────────────────┘             │
│                              │                              │
│             ┌────────────────▼────────────────┐             │
│             │           Prometheus            │             │
│             │    (stores time-series data)    │             │
│             └────────────────┬────────────────┘             │
│                              │                              │
│             ┌────────────────▼────────────────┐             │
│             │     HorizontalPodAutoscaler     │             │
│             │  - CPU > 70%: scale up          │             │
│             │  - Memory > 80%: scale up       │             │
│             │  - Worker util > 80%: scale up  │             │
│             │  Min: 1, Max: 5 replicas        │             │
│             └─────────────────────────────────┘             │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Grafana Dashboard                   │  │
│  │  - Worker utilization gauge                           │  │
│  │  - Request rate by status code                        │  │
│  │  - Concurrent requests vs max workers                 │  │
│  │  - Request duration percentiles (p50, p90, p99)       │  │
│  │  - Request/response size metrics                      │  │
│  │  - Upstream error tracking                            │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
Metrics Exposed
Request Metrics
- zai_proxy_requests_total (Counter): Total number of requests by method, path, and status code
  - Labels: method, path, status_code
  - Example: zai_proxy_requests_total{method="POST",path="/v1/messages",status_code="200"}
- zai_proxy_request_duration_seconds (Histogram): Request duration in seconds
  - Buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 30s, 60s, 120s, 300s
  - Labels: method, path, status_code
  - Useful for: p50, p90, p99 latency calculations
- zai_proxy_request_size_bytes (Histogram): Request payload size in bytes
  - Exponential buckets: 100, 1000, 10000, ...
  - Labels: method, path
- zai_proxy_response_size_bytes (Histogram): Response payload size in bytes
  - Exponential buckets: 100, 1000, 10000, ...
  - Labels: method, path, status_code
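The size histograms above use exponential buckets (100, 1000, 10000, ...). As a minimal sketch of how such bounds can be generated, the helper below multiplies a starting value by a fixed factor. The name expBuckets and the bucket count of 6 are assumptions for illustration; the actual count used by the proxy is not stated here.

```go
package main

import "fmt"

// expBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step (hypothetical helper).
func expBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, 0, count)
	v := start
	for i := 0; i < count; i++ {
		bounds = append(bounds, v)
		v *= factor
	}
	return bounds
}

func main() {
	// Assumed: factor 10 and 6 buckets, matching the "100, 1000, 10000, ..." pattern.
	fmt.Println(expBuckets(100, 10, 6))
}
```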
Worker Metrics
- zai_proxy_concurrent_requests (Gauge): Number of requests currently being processed
  - Real-time view of active connections
- zai_proxy_max_workers (Gauge): Maximum number of concurrent workers allowed per pod
  - Set via the MAX_WORKERS environment variable (default: 20)
- zai_proxy_worker_utilization_ratio (Gauge): Current worker utilization ratio (concurrent_requests / max_workers)
  - Range: 0.0 to 1.0+
  - Key metric for autoscaling decisions
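The utilization ratio is simply the in-flight request count divided by the configured worker limit. The sketch below shows that relationship with an atomic counter. The type workerStats and its fields are hypothetical stand-ins for the three gauges, not the proxy's actual code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// workerStats is a hypothetical stand-in for the worker gauges.
type workerStats struct {
	concurrent atomic.Int64 // corresponds to zai_proxy_concurrent_requests
	maxWorkers int64        // corresponds to zai_proxy_max_workers
}

// utilization returns concurrent / maxWorkers, the value reported as
// zai_proxy_worker_utilization_ratio.
func (s *workerStats) utilization() float64 {
	if s.maxWorkers <= 0 {
		return 0
	}
	return float64(s.concurrent.Load()) / float64(s.maxWorkers)
}

func main() {
	s := &workerStats{maxWorkers: 20}
	s.concurrent.Add(15) // 15 requests in flight
	fmt.Printf("%.2f\n", s.utilization()) // 0.75
}
```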
Error Metrics
- zai_proxy_upstream_errors_total (Counter): Total number of upstream errors by type
  - Labels: error_type
  - Error types:
    - request_creation: Failed to create upstream request
    - upstream_connection: Failed to connect to z.ai API
    - read_error: Error reading response from z.ai
    - write_error: Error writing response to client
Configuration
Environment Variables
Both deployments (ardenone-cluster/devpod and apexalgo-iad/mcp) support:
env:
  - name: ZAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: zai-api-key
        key: api-key
  - name: MAX_WORKERS
    value: "20"  # Adjust based on subscription limits
MAX_WORKERS: Controls the maximum number of concurrent requests a single pod will handle. When the limit is reached, the proxy rejects additional requests with 503 Service Unavailable; sustained high worker utilization is what drives the HPA to add pods.
Autoscaling Behavior
Scale Up:
- Stabilization window: 30 seconds
- Policies:
  - Can double pod count instantly (100% increase)
  - Or add 2 pods at a time
  - Uses the most aggressive policy
Scale Down:
- Stabilization window: 300 seconds (5 minutes)
- Policies:
  - Maximum 25% reduction at a time
  - Slow scale-down to avoid thrashing
Replica Limits:
- Minimum: 1 pod
- Maximum: 5 pods
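To make the policy arithmetic concrete, the sketch below computes how far a single scaling step may move the replica count under the policies and limits listed above. The function names are hypothetical, stabilization windows are not modeled, and rounding the 25% scale-down limit downward is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"math"
)

const (
	minReplicas = 1 // from "Replica Limits"
	maxReplicas = 5
)

// scaleUpLimit returns the highest replica count a single scale-up step
// may reach: the larger of "+100%" and "+2 pods", capped at maxReplicas.
func scaleUpLimit(current int) int {
	limit := current * 2 // 100% increase
	if byPods := current + 2; byPods > limit {
		limit = byPods
	}
	if limit > maxReplicas {
		limit = maxReplicas
	}
	return limit
}

// scaleDownLimit returns the lowest replica count a single scale-down step
// may reach under the 25% policy. Rounding down is an assumption here.
func scaleDownLimit(current int) int {
	limit := int(math.Floor(float64(current) * 0.75))
	if limit < minReplicas {
		limit = minReplicas
	}
	return limit
}

func main() {
	for n := minReplicas; n <= maxReplicas; n++ {
		fmt.Printf("replicas=%d up<=%d down>=%d\n", n, scaleUpLimit(n), scaleDownLimit(n))
	}
}
```

With one pod running, the "+2 pods" policy wins (1 to 3); from three pods upward, doubling would exceed the cap, so the limit is 5.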
Scaling Triggers
- CPU Utilization > 70%
- Memory Utilization > 80%
- Worker Utilization > 80% (requires prometheus-adapter - see below)
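For each trigger, the HPA derives a desired replica count from the ratio of the observed metric to its target, rounding up. The sketch below applies that rule to the worker-utilization trigger with the limits from this document. The function name is hypothetical, and the HPA's tolerance band and stabilization windows are deliberately not modeled.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies ceil(current * value / target) and clamps the
// result to [minR, maxR].
func desiredReplicas(current int, value, target float64, minR, maxR int) int {
	d := int(math.Ceil(float64(current) * value / target))
	if d < minR {
		d = minR
	}
	if d > maxR {
		d = maxR
	}
	return d
}

func main() {
	// Two pods with every worker busy (ratio 1.0) against the 0.8 target.
	fmt.Println(desiredReplicas(2, 1.0, 0.8, 1, 5)) // 3
}
```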
Prometheus Adapter Configuration (Optional)
To enable custom metric-based autoscaling (worker utilization), configure prometheus-adapter:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'zai_proxy_worker_utilization_ratio'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)$"
          as: "zai_proxy_worker_utilization_ratio"
        metricsQuery: 'avg_over_time(zai_proxy_worker_utilization_ratio[2m])'
Then uncomment the custom metric section in the HPA manifests.
Querying Metrics
Useful PromQL Queries
Request rate (req/s):
sum(rate(zai_proxy_requests_total[5m]))
Request rate by status code:
sum(rate(zai_proxy_requests_total[5m])) by (status_code)
p99 latency:
histogram_quantile(0.99, sum(rate(zai_proxy_request_duration_seconds_bucket[5m])) by (le))
Worker utilization (current):
avg(zai_proxy_worker_utilization_ratio)
Total concurrent capacity:
sum(zai_proxy_max_workers)
Error rate:
sum(rate(zai_proxy_upstream_errors_total[5m])) by (error_type)
Success rate (non-5xx):
sum(rate(zai_proxy_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
Grafana Dashboard
A pre-configured Grafana dashboard is deployed to the monitoring namespace:
Panels:
- Worker Utilization Gauge - Real-time utilization percentage
- Request Rate by Status Code - Time-series of req/s grouped by HTTP status
- Concurrent Requests vs Max Workers - Visual capacity tracking
- Request Duration Percentiles - p50, p90, p99 latency trends
- Request/Response Size (p90) - Bandwidth usage
- Upstream Errors - Error rate by type
Access:
- Navigate to Grafana (check IngressRoute for URL)
- Search for "Z.AI Proxy Metrics" dashboard
Deployment Workflow
1. Build New Container Image
cd /home/coder/ardenone-cluster
# Version is already bumped to 1.1.0
git add containers/zai-proxy/
git commit -m "feat(zai-proxy): add Prometheus metrics and worker pool management
- Add comprehensive Prometheus metrics for requests, durations, sizes
- Track concurrent requests and worker utilization
- Add MAX_WORKERS environment variable for capacity control
- Expose /metrics endpoint for Prometheus scraping
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
git push origin main
Wait for GitHub Actions to complete (~5 minutes), then check that the build succeeded.
2. Deploy ServiceMonitors and HPAs
git add cluster-configuration/
git commit -m "feat(zai-proxy): add ServiceMonitors, HPAs, and Grafana dashboard
- Add ServiceMonitor for both ardenone-cluster and apexalgo-iad
- Configure HorizontalPodAutoscaler with CPU/memory/worker metrics
- Deploy Grafana dashboard for visualization
- Update deployments with MAX_WORKERS=20 and metrics port
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
git push origin main
ArgoCD will automatically sync these changes.
3. Update Deployment to v1.1.0
ONLY AFTER GitHub Actions build succeeds:
# Update image version in both deployments
sed -i 's|ronaldraygun/zai-proxy:1.0.0|ronaldraygun/zai-proxy:1.1.0|g' \
cluster-configuration/ardenone-cluster/devpod/zai-proxy.yml \
cluster-configuration/apexalgo-iad/mcp/zai-proxy.yml
git add cluster-configuration/
git commit -m "chore(zai-proxy): bump to v1.1.0 with metrics support
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
git push origin main
4. Verify Deployment
Check pods:
kubectl get pods -n devpod -l app=zai-proxy
kubectl get pods -n mcp -l app=zai-proxy --kubeconfig=/home/coder/.kube/apexalgo-iad.kubeconfig
Check metrics endpoint:
kubectl port-forward -n devpod svc/zai-proxy 8080:8080 &
curl http://localhost:8080/metrics | grep zai_proxy
Check HPA status:
kubectl get hpa -n devpod zai-proxy
kubectl describe hpa -n devpod zai-proxy
Check ServiceMonitor:
kubectl get servicemonitor -n devpod zai-proxy
Tuning for Maximum Subscription Utilization
Strategy 1: Fixed Worker Pool
Set MAX_WORKERS based on your z.ai subscription limits:
- If subscription allows 50 concurrent requests:
  - Set MAX_WORKERS=10 with maxReplicas=5 (10 * 5 = 50 total)
  - Or MAX_WORKERS=25 with maxReplicas=2 (25 * 2 = 50 total)
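The same arithmetic can be checked for any worker setting. The sketch below computes the largest maxReplicas that keeps the combined worker pool within a concurrency budget; the function name is hypothetical, and the budget of 50 is the example figure used above, not a known subscription limit.

```go
package main

import "fmt"

// maxReplicasFor returns the largest replica count whose combined worker
// pool (replicas * maxWorkers) stays within the concurrency budget.
func maxReplicasFor(budget, maxWorkers int) int {
	if maxWorkers <= 0 {
		return 0
	}
	return budget / maxWorkers
}

func main() {
	const budget = 50 // example subscription limit from the text
	for _, w := range []int{10, 20, 25} {
		r := maxReplicasFor(budget, w)
		fmt.Printf("MAX_WORKERS=%d maxReplicas=%d total=%d unused=%d\n", w, r, r*w, budget-r*w)
	}
}
```

Under this example budget, the default MAX_WORKERS=20 fits only two pods (40 workers) and leaves 10 slots unused, which is why the worker count is best chosen as a divisor of the subscription limit.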
Strategy 2: Dynamic Scaling
- Monitor zai_proxy_worker_utilization_ratio in Grafana
- If consistently below 0.5 (50%), reduce MAX_WORKERS or maxReplicas
- If frequently hitting 1.0 (100%), increase MAX_WORKERS or maxReplicas
Strategy 3: Cost Optimization
- Low-traffic periods: Set minReplicas=1
- High-traffic periods: Use aggressive scale-up policies
- Balance: Slow scale-down (5 min stabilization) avoids thrashing, at the cost of briefly keeping extra pods after a traffic spike
Alerting Rules (Optional)
Add Prometheus alerting rules to get notified of issues:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zai-proxy-alerts
  namespace: monitoring
spec:
  groups:
    - name: zai-proxy
      interval: 30s
      rules:
        - alert: ZaiProxyHighErrorRate
          expr: |
            sum(rate(zai_proxy_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(zai_proxy_requests_total[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Z.AI Proxy error rate > 5%"
        # avg (not sum) keeps the ratio in the 0.0-1.0 range when several pods are running
        - alert: ZaiProxyHighUtilization
          expr: avg(zai_proxy_worker_utilization_ratio) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Z.AI Proxy worker utilization > 90% for 10 minutes"
        - alert: ZaiProxyAtMaxCapacity
          expr: |
            sum(zai_proxy_concurrent_requests)
            >=
            sum(zai_proxy_max_workers)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Z.AI Proxy at maximum capacity - requests being rejected"
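The alert conditions reduce to simple comparisons on aggregated values. The sketch below restates two of them in Go to show the edge cases, in particular that an error ratio is undefined when there is no traffic. The function names and sample numbers are hypothetical, and the "for:" hold durations are not modeled.

```go
package main

import "fmt"

// errorRatio mirrors sum(rate(5xx)) / sum(rate(all)). ok is false when
// there was no traffic, where the division has no meaningful result.
func errorRatio(rate5xx, rateAll float64) (ratio float64, ok bool) {
	if rateAll <= 0 {
		return 0, false
	}
	return rate5xx / rateAll, true
}

// atMaxCapacity mirrors sum(concurrent_requests) >= sum(max_workers),
// with one slice element per pod.
func atMaxCapacity(concurrent, maxWorkers []float64) bool {
	var c, m float64
	for _, v := range concurrent {
		c += v
	}
	for _, v := range maxWorkers {
		m += v
	}
	return c >= m
}

func main() {
	if r, ok := errorRatio(0.6, 10); ok {
		fmt.Println(r > 0.05) // true: 6% of requests returned 5xx
	}
	// Two pods, 38 of 40 worker slots in use.
	fmt.Println(atMaxCapacity([]float64{20, 18}, []float64{20, 20})) // false
}
```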
Troubleshooting
Metrics not appearing in Prometheus
- Check ServiceMonitor is deployed:
  kubectl get servicemonitor -n devpod
- Check Prometheus is scraping:
  kubectl logs -n monitoring -l app=prometheus
- Verify metrics endpoint is accessible:
  kubectl exec -n devpod deploy/zai-proxy -- wget -O- http://localhost:8080/metrics
HPA not scaling
- Check HPA status:
  kubectl describe hpa -n devpod zai-proxy
- Verify metrics-server is running:
  kubectl get pods -n kube-system -l k8s-app=metrics-server
- Check current metrics:
  kubectl get hpa -n devpod zai-proxy -o yaml
Pods stuck at capacity (503 errors)
- Check worker utilization:
  avg(zai_proxy_worker_utilization_ratio)
- Increase MAX_WORKERS or maxReplicas in HPA
- Verify HPA is allowed to scale up:
  kubectl get hpa -n devpod zai-proxy  # Current replicas should be < maxReplicas
Next Steps
- Monitor for 1-2 weeks - Collect baseline metrics
- Tune MAX_WORKERS - Adjust based on actual utilization
- Enable custom metrics - Configure prometheus-adapter for worker-based autoscaling
- Set up alerts - Get notified of capacity issues
- Cost analysis - Measure subscription utilization vs pod costs
References
- Prometheus Operator: https://prometheus-operator.dev/
- HPA documentation: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- Prometheus adapter: https://github.com/kubernetes-sigs/prometheus-adapter