# Z.AI Proxy Metrics and Autoscaling

## Overview

The zai-proxy has been enhanced with comprehensive Prometheus metrics and autoscaling capabilities to maximize utilization of the z.ai coding subscription.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Z.AI Proxy Cluster                      │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  zai-proxy  │  │  zai-proxy  │  │  zai-proxy  │          │
│  │    Pod 1    │  │    Pod 2    │  │    Pod N    │          │
│  │             │  │             │  │             │          │
│  │ MAX_WORKERS │  │ MAX_WORKERS │  │ MAX_WORKERS │          │
│  │    = 20     │  │    = 20     │  │    = 20     │          │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘          │
│         │                │                │                 │
│         └────────────────┴────────────────┘                 │
│                          │                                  │
│                  /metrics endpoint                          │
│                          │                                  │
│         ┌────────────────▼────────────────┐                 │
│         │         ServiceMonitor          │                 │
│         │       (scrapes every 15s)       │                 │
│         └────────────────┬────────────────┘                 │
│                          │                                  │
│         ┌────────────────▼────────────────┐                 │
│         │           Prometheus            │                 │
│         │    (stores time-series data)    │                 │
│         └────────────────┬────────────────┘                 │
│                          │                                  │
│         ┌────────────────▼────────────────┐                 │
│         │     HorizontalPodAutoscaler     │                 │
│         │  - CPU > 70%: scale up          │                 │
│         │  - Memory > 80%: scale up       │                 │
│         │  - Worker util > 80%: scale up  │                 │
│         │  Min: 1, Max: 5 replicas        │                 │
│         └─────────────────────────────────┘                 │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Grafana Dashboard                   │  │
│  │  - Worker utilization gauge                           │  │
│  │  - Request rate by status code                        │  │
│  │  - Concurrent requests vs max workers                 │  │
│  │  - Request duration percentiles (p50, p90, p99)       │  │
│  │  - Request/response size metrics                      │  │
│  │  - Upstream error tracking                            │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## Metrics Exposed

### Request Metrics

1. **`zai_proxy_requests_total`** (Counter)
   - Total number of requests by method, path, and status code
   - Labels: `method`, `path`, `status_code`
   - Example: `zai_proxy_requests_total{method="POST",path="/v1/messages",status_code="200"}`

2. **`zai_proxy_request_duration_seconds`** (Histogram)
   - Request duration in seconds
   - Buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 30s, 60s, 120s, 300s
   - Labels: `method`, `path`, `status_code`
   - Useful for: p50, p90, p99 latency calculations

3. **`zai_proxy_request_size_bytes`** (Histogram)
   - Request payload size in bytes
   - Exponential buckets: 100, 1000, 10000, ...
   - Labels: `method`, `path`

4. **`zai_proxy_response_size_bytes`** (Histogram)
   - Response payload size in bytes
   - Exponential buckets: 100, 1000, 10000, ...
   - Labels: `method`, `path`, `status_code`

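To make the histogram metrics above more concrete, here is a minimal Python sketch of how observations land in cumulative `le` buckets, which is the data `histogram_quantile` later works from. The bucket bounds are copied from the duration list above; the name `cumulative_counts` is hypothetical, and this is an illustration rather than the proxy's actual Go implementation.

```python
from bisect import bisect_left

# Upper bounds in seconds, taken from the duration bucket list above.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]


def cumulative_counts(durations, bounds=BUCKETS):
    """Return one cumulative count per `le` bound, followed by the +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)  # last slot stands for le="+Inf"
    for d in durations:
        # First bound that is >= d; an index of len(bounds) means only +Inf matches.
        per_bucket[bisect_left(bounds, d)] += 1

    # Prometheus buckets are cumulative: each `le` also counts all smaller buckets.
    cumulative, running = [], 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return cumulative


counts = cumulative_counts([0.03, 0.2, 400])
print(dict(zip([*BUCKETS, "+Inf"], counts)))
```

A 30 ms request is counted in every bucket from `le="0.05"` upward, while a 400 s request only appears in `+Inf`, which is why very slow requests cap percentile estimates at the largest finite bound.
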
### Worker Metrics

5. **`zai_proxy_concurrent_requests`** (Gauge)
   - Number of requests currently being processed
   - Real-time view of active connections

6. **`zai_proxy_max_workers`** (Gauge)
   - Maximum number of concurrent workers allowed per pod
   - Set via `MAX_WORKERS` environment variable (default: 20)

7. **`zai_proxy_worker_utilization_ratio`** (Gauge)
   - Current worker utilization ratio (concurrent_requests / max_workers)
   - Range: 0.0 to 1.0+
   - **Key metric for autoscaling decisions**

### Error Metrics

8. **`zai_proxy_upstream_errors_total`** (Counter)
   - Total number of upstream errors by type
   - Labels: `error_type`
   - Error types:
     - `request_creation` - Failed to create upstream request
     - `upstream_connection` - Failed to connect to z.ai API
     - `read_error` - Error reading response from z.ai
     - `write_error` - Error writing response to client

## Configuration

### Environment Variables

Both deployments (`ardenone-cluster/devpod` and `apexalgo-iad/mcp`) support:

```yaml
env:
  - name: ZAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: zai-api-key
        key: api-key

  - name: MAX_WORKERS
    value: "20"  # Adjust based on subscription limits
```

**MAX_WORKERS**: Controls the maximum number of concurrent requests a single pod will handle. Once the limit is reached, the proxy rejects additional requests with `503 Service Unavailable`. The HPA does not react to the 503 responses themselves; it scales on the CPU, memory, and worker utilization metrics that rise as pods fill up.

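The admission behavior described above can be sketched as a non-blocking gate. This is a minimal Python illustration under stated assumptions, not the proxy's Go code: `WorkerGate`, `handle`, and the `forward` callback are hypothetical names, and the comments only indicate which exposed metric each value would correspond to.

```python
import threading


class WorkerGate:
    """Hypothetical per-pod admission gate bounded by MAX_WORKERS."""

    def __init__(self, max_workers: int):
        self.max_workers = max_workers  # would back zai_proxy_max_workers
        self._slots = threading.BoundedSemaphore(max_workers)
        self._lock = threading.Lock()
        self.in_flight = 0  # would back zai_proxy_concurrent_requests

    def try_acquire(self) -> bool:
        # Non-blocking: a full pod rejects immediately instead of queueing.
        if not self._slots.acquire(blocking=False):
            return False
        with self._lock:
            self.in_flight += 1
        return True

    def release(self) -> None:
        with self._lock:
            self.in_flight -= 1
        self._slots.release()

    def utilization(self) -> float:
        # Would back zai_proxy_worker_utilization_ratio.
        with self._lock:
            return self.in_flight / self.max_workers


def handle(gate: WorkerGate, forward) -> int:
    """Return 503 when the pod is full, otherwise the status from `forward`."""
    if not gate.try_acquire():
        return 503
    try:
        return forward()
    finally:
        gate.release()
```

Rejecting instead of queueing keeps latency bounded on a saturated pod and leaves it to the client or load balancer to retry against another replica.
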
### Autoscaling Behavior

**Scale Up:**
- Stabilization window: 30 seconds
- Policies:
  - Can double pod count instantly (100% increase)
  - Or add 2 pods at a time
  - Uses the most aggressive policy

**Scale Down:**
- Stabilization window: 300 seconds (5 minutes)
- Policies:
  - Maximum 25% reduction at a time
  - Slow scale-down to avoid thrashing

**Replica Limits:**
- Minimum: 1 pod
- Maximum: 5 pods

### Scaling Triggers

1. **CPU Utilization > 70%**
2. **Memory Utilization > 80%**
3. **Worker Utilization > 80%** (requires prometheus-adapter - see below)

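As a rough illustration of how these triggers turn into a replica count, the sketch below applies the formula from the Kubernetes HPA documentation, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, to the three targets listed above and takes the largest proposal. It is a simplified model: the function names are hypothetical, metric values are expressed as fractions, and the controller's tolerance band, stabilization windows, and scaling policies are deliberately left out.

```python
import math

# Targets and replica limits copied from this document.
TARGETS = {"cpu": 0.70, "memory": 0.80, "worker": 0.80}
MIN_REPLICAS, MAX_REPLICAS = 1, 5


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Apply the documented HPA formula and clamp to the replica limits."""
    ratio = current_value / target_value
    raw = math.ceil(current_replicas * ratio)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))


def recommendation(current_replicas: int, observed: dict) -> int:
    """With several metrics, the largest per-metric proposal wins."""
    return max(
        desired_replicas(current_replicas, observed[name], target)
        for name, target in TARGETS.items()
    )


# Two pods, idle CPU and memory, but workers 95% busy on average.
print(recommendation(2, {"cpu": 0.35, "memory": 0.40, "worker": 0.95}))
```

In this example the worker metric alone pushes the recommendation above the current two replicas, which is the situation the prometheus-adapter setup is meant to cover: request slots can run out long before CPU or memory do.
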
## Prometheus Adapter Configuration (Optional)

To enable custom metric-based autoscaling (worker utilization), configure prometheus-adapter:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'zai_proxy_worker_utilization_ratio'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)$"
          as: "zai_proxy_worker_utilization_ratio"
        # The template placeholders scope the query to the namespace and pods
        # the HPA asks about; a hard-coded query would return every series.
        metricsQuery: 'avg(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Then uncomment the custom metric section in the HPA manifests.

## Querying Metrics

### Useful PromQL Queries

**Request rate (req/s):**
```promql
sum(rate(zai_proxy_requests_total[5m]))
```

**Request rate by status code:**
```promql
sum(rate(zai_proxy_requests_total[5m])) by (status_code)
```

**p99 latency:**
```promql
histogram_quantile(0.99, sum(rate(zai_proxy_request_duration_seconds_bucket[5m])) by (le))
```

**Worker utilization (current, averaged across pods):**
```promql
avg(zai_proxy_worker_utilization_ratio)
```

The ratio is reported per pod, so it is averaged rather than summed; a sum would exceed 1.0 as soon as more than one pod is busy.

**Total concurrent capacity:**
```promql
sum(zai_proxy_max_workers)
```

**Error rate:**
```promql
sum(rate(zai_proxy_upstream_errors_total[5m])) by (error_type)
```

**Success rate (non-5xx):**
```promql
sum(rate(zai_proxy_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
```

## Grafana Dashboard

A pre-configured Grafana dashboard is deployed to the `monitoring` namespace:

**Panels:**
1. **Worker Utilization Gauge** - Real-time utilization percentage
2. **Request Rate by Status Code** - Time-series of req/s grouped by HTTP status
3. **Concurrent Requests vs Max Workers** - Visual capacity tracking
4. **Request Duration Percentiles** - p50, p90, p99 latency trends
5. **Request/Response Size (p90)** - Bandwidth usage
6. **Upstream Errors** - Error rate by type

**Access:**
- Navigate to Grafana (check IngressRoute for URL)
- Search for "Z.AI Proxy Metrics" dashboard

## Deployment Workflow

### 1. Build New Container Image

```bash
cd /home/coder/ardenone-cluster

# Version is already bumped to 1.1.0
git add containers/zai-proxy/
git commit -m "feat(zai-proxy): add Prometheus metrics and worker pool management

- Add comprehensive Prometheus metrics for requests, durations, sizes
- Track concurrent requests and worker utilization
- Add MAX_WORKERS environment variable for capacity control
- Expose /metrics endpoint for Prometheus scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

git push origin main
```

**Wait for GitHub Actions to complete** (~5 minutes). Check:
- https://github.com/ardenone/ardenone-cluster/actions

### 2. Deploy ServiceMonitors and HPAs

```bash
git add cluster-configuration/
git commit -m "feat(zai-proxy): add ServiceMonitors, HPAs, and Grafana dashboard

- Add ServiceMonitor for both ardenone-cluster and apexalgo-iad
- Configure HorizontalPodAutoscaler with CPU/memory/worker metrics
- Deploy Grafana dashboard for visualization
- Update deployments with MAX_WORKERS=20 and metrics port

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

git push origin main
```

ArgoCD will automatically sync these changes.

### 3. Update Deployment to v1.1.0

**ONLY AFTER GitHub Actions build succeeds:**

```bash
# Update image version in both deployments
sed -i 's|ronaldraygun/zai-proxy:1.0.0|ronaldraygun/zai-proxy:1.1.0|g' \
  cluster-configuration/ardenone-cluster/devpod/zai-proxy.yml \
  cluster-configuration/apexalgo-iad/mcp/zai-proxy.yml

git add cluster-configuration/
git commit -m "chore(zai-proxy): bump to v1.1.0 with metrics support

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

git push origin main
```

### 4. Verify Deployment

**Check pods:**
```bash
kubectl get pods -n devpod -l app=zai-proxy
kubectl get pods -n mcp -l app=zai-proxy --kubeconfig=/home/coder/.kube/apexalgo-iad.kubeconfig
```

**Check metrics endpoint:**
```bash
kubectl port-forward -n devpod svc/zai-proxy 8080:8080 &
curl http://localhost:8080/metrics | grep zai_proxy
```

**Check HPA status:**
```bash
kubectl get hpa -n devpod zai-proxy
kubectl describe hpa -n devpod zai-proxy
```

**Check ServiceMonitor:**
```bash
kubectl get servicemonitor -n devpod zai-proxy
```

## Tuning for Maximum Subscription Utilization

### Strategy 1: Fixed Worker Pool

Set `MAX_WORKERS` based on your z.ai subscription limits:

- **If subscription allows 50 concurrent requests:**
  - Set `MAX_WORKERS=10` with `maxReplicas=5` (10 * 5 = 50 total)
  - Or `MAX_WORKERS=25` with `maxReplicas=2` (25 * 2 = 50 total)

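The arithmetic behind this strategy is small enough to sketch directly. The helper names below are hypothetical, and the 50-request limit is the example figure from the list above, not a confirmed subscription value.

```python
def per_pod_workers(subscription_limit: int, max_replicas: int) -> int:
    """Largest MAX_WORKERS for which max_replicas pods stay within the limit."""
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    if subscription_limit < max_replicas:
        raise ValueError("limit is too small to give every pod one worker")
    return subscription_limit // max_replicas


def total_capacity(max_workers: int, max_replicas: int) -> int:
    """Worst-case concurrent requests sent upstream when fully scaled out."""
    return max_workers * max_replicas


for replicas in (2, 3, 5):
    workers = per_pod_workers(50, replicas)
    print(replicas, workers, total_capacity(workers, replicas))
```

When the limit does not divide evenly, as with three replicas, rounding down leaves a few upstream slots unused rather than risking requests beyond the subscription limit.
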
### Strategy 2: Dynamic Scaling

1. Monitor `zai_proxy_worker_utilization_ratio` in Grafana
2. If consistently below 0.5 (50%), reduce `MAX_WORKERS` or `maxReplicas`
3. If frequently hitting 1.0 (100%), increase `MAX_WORKERS` or `maxReplicas`

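The rule of thumb above can be written down as a small decision helper. This is a sketch under explicit assumptions: `tuning_hint` is a hypothetical name, "consistently below 0.5" is read as every sample being below 0.5, and "frequently hitting 1.0" is read as at least 10% of samples being saturated. Both readings are illustrative choices, not values from the original setup.

```python
def tuning_hint(samples, low=0.5, high=1.0, frequent=0.10):
    """Classify utilization samples from zai_proxy_worker_utilization_ratio.

    `low` and `high` come from the thresholds above; `frequent` is an assumed
    fraction of saturated samples that counts as "frequently".
    """
    if not samples:
        return "no data"
    saturated = sum(1 for s in samples if s >= high) / len(samples)
    if saturated >= frequent:
        return "increase"  # raise MAX_WORKERS or maxReplicas
    if max(samples) < low:
        return "reduce"  # lower MAX_WORKERS or maxReplicas
    return "keep"


print(tuning_hint([0.10, 0.30, 0.45]))
print(tuning_hint([0.60, 1.00, 1.00, 0.90]))
print(tuning_hint([0.40, 0.70, 0.80]))
```

Saturation is checked first so that a workload with long idle stretches and short bursts at 100% is still flagged for more capacity.
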
### Strategy 3: Cost Optimization

- **Low-traffic periods:** Set `minReplicas=1`
- **High-traffic periods:** Use aggressive scale-up policies
- **Balance:** The 5-minute scale-down stabilization window prevents thrashing, at the cost of keeping extra pods running briefly after traffic drops

## Alerting Rules (Optional)

Add Prometheus alerting rules to get notified of issues:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zai-proxy-alerts
  namespace: monitoring
spec:
  groups:
    - name: zai-proxy
      interval: 30s
      rules:
        - alert: ZaiProxyHighErrorRate
          expr: |
            sum(rate(zai_proxy_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(zai_proxy_requests_total[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Z.AI Proxy error rate > 5%"

        - alert: ZaiProxyHighUtilization
          # Averaged, not summed: the ratio is per pod, so a sum across
          # several pods would cross 0.9 at low real utilization.
          expr: avg(zai_proxy_worker_utilization_ratio) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Z.AI Proxy worker utilization > 90% for 10 minutes"

        - alert: ZaiProxyAtMaxCapacity
          expr: |
            sum(zai_proxy_concurrent_requests)
            >=
            sum(zai_proxy_max_workers)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Z.AI Proxy at maximum capacity - requests being rejected"
```

## Troubleshooting

### Metrics not appearing in Prometheus

1. Check ServiceMonitor is deployed:
   ```bash
   kubectl get servicemonitor -n devpod
   ```

2. Check Prometheus is scraping:
   ```bash
   kubectl logs -n monitoring -l app=prometheus
   ```

3. Verify metrics endpoint is accessible:
   ```bash
   kubectl exec -n devpod deploy/zai-proxy -- wget -O- http://localhost:8080/metrics
   ```

### HPA not scaling

1. Check HPA status:
   ```bash
   kubectl describe hpa -n devpod zai-proxy
   ```

2. Verify metrics-server is running:
   ```bash
   kubectl get pods -n kube-system -l k8s-app=metrics-server
   ```

3. Check current metrics:
   ```bash
   kubectl get hpa -n devpod zai-proxy -o yaml
   ```

### Pods stuck at capacity (503 errors)

1. Check worker utilization (averaged across pods, since the ratio is reported per pod):
   ```promql
   avg(zai_proxy_worker_utilization_ratio)
   ```

2. Increase `MAX_WORKERS` or `maxReplicas` in HPA

3. Verify HPA is allowed to scale up:
   ```bash
   kubectl get hpa -n devpod zai-proxy
   # Current replicas should be < maxReplicas
   ```

## Next Steps

1. **Monitor for 1-2 weeks** - Collect baseline metrics
2. **Tune MAX_WORKERS** - Adjust based on actual utilization
3. **Enable custom metrics** - Configure prometheus-adapter for worker-based autoscaling
4. **Set up alerts** - Get notified of capacity issues
5. **Cost analysis** - Measure subscription utilization vs pod costs

## References

- Prometheus Operator: https://prometheus-operator.dev/
- HPA documentation: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- Prometheus adapter: https://github.com/kubernetes-sigs/prometheus-adapter