zai-proxy/docs/notes/MONITORING_SETUP.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00

# ZAI-Proxy Monitoring Setup - Dual Deployment
## Overview
This document describes the monitoring configuration for the zai-proxy dual deployment (production + canary). The setup consists of two ServiceMonitors for Prometheus scraping, a PrometheusRule for canary alerting, and a Grafana dashboard for side-by-side visualization.
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                      apexalgo-iad Cluster                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────────┐          ┌────────────────────┐         │
│  │ Production         │          │ Canary             │         │
│  │ (mcp namespace)    │          │ (devpod namespace) │         │
│  │                    │          │                    │         │
│  │ zai-proxy:1.0.0    │          │ zai-proxy:1.2.0    │         │
│  │ TOKEN_COUNTING     │          │ TOKEN_COUNTING     │         │
│  │   = false          │          │   = true           │         │
│  └─────────┬──────────┘          └─────────┬──────────┘         │
│            │                               │                    │
│            │ /metrics                      │ /metrics           │
│            │ variant="production"          │ variant="canary"   │
│            ▼                               ▼                    │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Prometheus (monitoring namespace)                         │  │
│  │                                                           │  │
│  │ ServiceMonitor: zai-proxy-production                      │  │
│  │   selector: app=zai-proxy, version=production             │  │
│  │   namespace: mcp                                          │  │
│  │   relabels: deployment_variant=production                 │  │
│  │                                                           │  │
│  │ ServiceMonitor: zai-proxy-canary                          │  │
│  │   selector: app=zai-proxy-canary, version=canary          │  │
│  │   namespace: devpod                                       │  │
│  │   relabels: deployment_variant=canary                     │  │
│  │                                                           │  │
│  │ PrometheusRules: zai-proxy-canary-alerts                  │  │
│  │   - Canary-specific alerts                                │  │
│  │   - Comparison alerts vs production                       │  │
│  └─────────┬───────────────────────────────┬─────────────────┘  │
│            │                               │                    │
│            ▼                               ▼                    │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Grafana Dashboard                                         │  │
│  │ zai-proxy-dual-deployment.json                            │  │
│  │                                                           │  │
│  │ Panels:                                                   │  │
│  │   - Worker Utilization (gauge)                            │  │
│  │   - Request Rate (timeseries)                             │  │
│  │   - Error Rate (timeseries)                               │  │
│  │   - Latency Comparison (P50/P95)                          │  │
│  │   - Current Rate Limit                                    │  │
│  │   - Upstream Errors                                       │  │
│  │   - Concurrent Requests (gauge)                           │  │
│  │   - Token Throughput (canary only)                        │  │
│  │   - Token Counting Duration (canary only)                 │  │
│  │   - Rate Limit Adjustments (canary only)                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## File Structure
```
k8s/
├── production/
│   ├── deployment.yml                    # Production deployment (mcp namespace)
│   └── service.yml                       # Production service with version label
├── canary/
│   ├── deployment.yml                    # Canary deployment (devpod namespace)
│   └── service.yml                       # Canary service with version label
└── monitoring/
    ├── servicemonitor-production.yml     # Prometheus scraping for production
    ├── servicemonitor-canary.yml         # Prometheus scraping for canary
    ├── prometheus-rules.yml              # Canary-specific alerts
    └── grafana-dashboard-configmap.yml   # Grafana dashboard JSON
```
## ServiceMonitor Configuration
### Production ServiceMonitor
**File**: `k8s/monitoring/servicemonitor-production.yml`
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zai-proxy-production
  namespace: monitoring
  labels:
    app: zai-proxy
    release: kube-prometheus-stack-arde
    variant: production
spec:
  selector:
    matchLabels:
      app: zai-proxy
      version: production
  namespaceSelector:
    matchNames:
      - mcp
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      relabelings:
        - sourceLabels: [__meta_kubernetes_service_label_version]
          targetLabel: deployment_variant
```
### Canary ServiceMonitor
**File**: `k8s/monitoring/servicemonitor-canary.yml`
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zai-proxy-canary
  namespace: monitoring
  labels:
    app: zai-proxy
    release: kube-prometheus-stack-arde
    variant: canary
spec:
  selector:
    matchLabels:
      app: zai-proxy-canary
      version: canary
  namespaceSelector:
    matchNames:
      - devpod
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      relabelings:
        - sourceLabels: [__meta_kubernetes_service_label_version]
          targetLabel: deployment_variant
```
**Key Points**:
- Both ServiceMonitors copy the Service's `version` label into a `deployment_variant` label via relabeling
- Production scrapes from `mcp` namespace
- Canary scrapes from `devpod` namespace
- Scrape interval: 30 seconds
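
For the selectors above to match, each Service needs the `app` and `version` labels and a port named `http`. A minimal sketch of what the production Service might look like follows. The name `zai-proxy` and port 8080 come from the `kubectl` commands elsewhere in this document; the pod selector and `ClusterIP` type are assumptions, not the contents of `k8s/production/service.yml`.

```yaml
# Hypothetical sketch of k8s/production/service.yml; the real manifest may differ.
apiVersion: v1
kind: Service
metadata:
  name: zai-proxy
  namespace: mcp
  labels:
    app: zai-proxy        # matched by spec.selector.matchLabels in the ServiceMonitor
    version: production   # copied into deployment_variant by the relabeling
spec:
  type: ClusterIP         # assumption
  selector:
    app: zai-proxy        # assumption: selects the production pods
  ports:
    - name: http          # must equal the ServiceMonitor's `port: http`
      port: 8080
      targetPort: 8080    # assumption: the container listens on 8080
```

The canary Service would presumably differ only in name (`zai-proxy-canary`), namespace (`devpod`), and label values.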
## Metrics
### Application Metrics (from `metrics.go`)
| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `zai_proxy_requests_total` | Counter | method, path, status_code, variant | Total requests |
| `zai_proxy_request_duration_seconds` | Histogram | method, path, status_code, variant | Request latency |
| `zai_proxy_concurrent_requests` | Gauge | variant | Active requests |
| `zai_proxy_worker_utilization_ratio` | Gauge | variant | Worker utilization (concurrent requests / max workers) |
| `zai_proxy_rate_limit_requests_per_second` | Gauge | variant | Current rate limit |
| `zai_proxy_tokens_total` | Counter | direction, model, variant | Token counts |
| `zai_proxy_token_count_duration_seconds` | Histogram | variant | Token counting time |
| `zai_proxy_build_info` | Gauge | version, variant, commit, build_time | Build metadata |
### Label Mapping
| Application Label | ServiceMonitor Relabel | Dashboard Query |
|------------------|----------------------|-----------------|
| `variant="production"` | `deployment_variant="production"` | `deployment_variant="production"` |
| `variant="canary"` | `deployment_variant="canary"` | `deployment_variant="canary"` |
## Grafana Dashboard
**File**: `k8s/monitoring/grafana-dashboard-configmap.yml`
**Dashboard**: "ZAI Proxy - Production vs Canary"
### Panels
1. **Worker Utilization** (Gauge)
    - Shows concurrent requests vs max workers
    - Separate gauges for production and canary
2. **Request Rate** (Time Series)
    - `sum(rate(zai_proxy_requests_total{deployment_variant="production"}[5m]))`
    - `sum(rate(zai_proxy_requests_total{deployment_variant="canary"}[5m]))`
3. **Error Rate** (Time Series)
    - 5xx errors as percentage of total requests
    - Separate lines for production and canary
4. **Latency Comparison** (Time Series)
    - P50 and P95 percentiles
    - Separate lines for each deployment
5. **Current Rate Limit** (Time Series)
    - `zai_proxy_rate_limit_requests_per_second`
6. **Upstream Errors** (Time Series)
    - `zai_proxy_upstream_errors_total` by error_type
7. **Concurrent Requests** (Gauge)
    - Current active requests per deployment
8. **Token Throughput** (Time Series) - Canary Only
    - `zai_proxy_tokens_total` by direction and model
    - Only for canary since production has token counting disabled
9. **Token Counting Duration** (Time Series) - Canary Only
    - P95 of `zai_proxy_token_count_duration_seconds`
10. **Rate Limit Adjustments** (Time Series) - Canary Only
    - `zai_proxy_rate_limit_adjustments_total` by direction
### Dashboard Labels
```json
"tags": ["zai-proxy", "canary", "production", "monitoring"]
```
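
To show how the dashboard JSON reaches Grafana, here is a rough sketch of the ConfigMap wrapper. The ConfigMap name, namespace, JSON file name, title, and tags are taken from this document. The `grafana_dashboard: "1"` label is an assumption based on the usual kube-prometheus-stack sidecar default, and the empty `panels` array is a placeholder for the ten panels listed above.

```yaml
# Sketch only: the sidecar label and the truncated JSON body are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: zai-proxy-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # assumed label watched by the Grafana dashboard sidecar
data:
  zai-proxy-dual-deployment.json: |
    {
      "title": "ZAI Proxy - Production vs Canary",
      "tags": ["zai-proxy", "canary", "production", "monitoring"],
      "panels": []
    }
```

If the sidecar in this cluster is configured with a different label or watches a different namespace, the dashboard will not be picked up, so that is worth checking first when the dashboard does not load.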
## Prometheus Rules - Canary Alerts
**File**: `k8s/monitoring/prometheus-rules.yml`
### Alert Rules
| Alert Name | Severity | Condition | Duration |
|-----------|----------|-----------|----------|
| `ZaiProxyCanaryHighErrorRate` | warning | Error rate > 5% | 5min |
| `ZaiProxyCanaryHighLatency` | warning | P95 > 10s | 5min |
| `ZaiProxyCanaryCrashLooping` | critical | Restart rate > 0 | 5min |
| `ZaiProxyCanaryNotReady` | critical | 0 ready pods | 2min |
| `ZaiProxyCanaryDegradedVsProduction` | warning | 2x error rate vs production | 10min |
| `ZaiProxyCanarySlowerThanProduction` | warning | 50% higher P95 vs production | 10min |
| `ZaiProxyCanaryTokenCountingSlow` | warning | Token counting P95 > 100ms | 5min |
| `ZaiProxyCanaryRateLimitAdjustingDown` | info | Rate limit decreasing | 5min |
### Alert Examples
**High Error Rate**:
```promql
(
  sum(rate(zai_proxy_requests_total{deployment_variant="canary",status_code=~"5.."}[5m]))
  /
  sum(rate(zai_proxy_requests_total{deployment_variant="canary"}[5m]))
) > 0.05
```
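
As a sketch of how the expression above might be wrapped in `k8s/monitoring/prometheus-rules.yml`, the resource name, group name, severity, and duration below come from this document. The `release` label (assumed to be what the Prometheus rule selector matches) and the annotation wording are assumptions.

```yaml
# Sketch only: the `release` label and annotation text are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zai-proxy-canary-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack-arde   # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: zai_proxy_canary_alerts
      rules:
        - alert: ZaiProxyCanaryHighErrorRate
          expr: |
            (
              sum(rate(zai_proxy_requests_total{deployment_variant="canary",status_code=~"5.."}[5m]))
              /
              sum(rate(zai_proxy_requests_total{deployment_variant="canary"}[5m]))
            ) > 0.05
          for: 5m                          # "Duration" column in the alert table
          labels:
            severity: warning
          annotations:
            summary: "zai-proxy canary 5xx rate above 5%"   # hypothetical wording
```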
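**Degraded vs Production** (because of the `+ 0.01` offset, the canary must exceed twice the production error rate plus 2 percentage points, so a near-zero production error rate does not trigger the alert on noise):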
```promql
(
  sum(rate(zai_proxy_requests_total{deployment_variant="canary",status_code=~"5.."}[10m]))
  /
  sum(rate(zai_proxy_requests_total{deployment_variant="canary"}[10m]))
) > 2 * (
  sum(rate(zai_proxy_requests_total{deployment_variant="production",status_code=~"5.."}[10m]))
  /
  sum(rate(zai_proxy_requests_total{deployment_variant="production"}[10m])) + 0.01
)
```
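
The two latency alerts from the table are not spelled out in this document. One plausible way to write them as rule entries is sketched below. The expressions are reconstructed from the table (P95 > 10s for 5 minutes; 50% higher P95 than production for 10 minutes) and assume the standard `_bucket` series of the `zai_proxy_request_duration_seconds` histogram. They are not copied from `prometheus-rules.yml`.

```yaml
# Sketch only: expressions are reconstructed from the alert table, not from the real rules file.
- alert: ZaiProxyCanaryHighLatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{deployment_variant="canary"}[5m]))
    ) > 10
  for: 5m
  labels:
    severity: warning
- alert: ZaiProxyCanarySlowerThanProduction
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{deployment_variant="canary"}[10m]))
    )
    > 1.5 *
    histogram_quantile(0.95,
      sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{deployment_variant="production"}[10m]))
    )
  for: 10m
  labels:
    severity: warning
```

Aggregating with `sum by (le)` before `histogram_quantile` leaves both sides of the comparison without labels, so the canary and production results match one-to-one.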
## Deployment Labels
### Production Deployment
```yaml
# k8s/production/deployment.yml
spec:
  template:
    metadata:
      labels:
        app: zai-proxy
        version: production
    spec:
      containers:
        - name: proxy
          env:
            - name: DEPLOYMENT_VARIANT
              value: "production"
            - name: TOKEN_COUNTING_ENABLED
              value: "false"
```
### Canary Deployment
```yaml
# k8s/canary/deployment.yml
spec:
  template:
    metadata:
      labels:
        app: zai-proxy-canary
        version: canary
    spec:
      containers:
        - name: proxy
          env:
            - name: DEPLOYMENT_VARIANT
              value: "canary"
            - name: TOKEN_COUNTING_ENABLED
              value: "true"
```
## Verification
Run the verification script to check the monitoring setup:
```bash
./scripts/verify-monitoring.sh
```
This will check:
- Monitoring namespace exists
- ServiceMonitors are configured correctly
- PrometheusRules are deployed
- Grafana dashboard exists
- Relabel configs are correct
## Manual Metrics Testing
To verify metrics are being exported correctly:
```bash
# Production metrics endpoint (port-forward blocks, so run it in a separate terminal)
kubectl port-forward -n mcp deployment/zai-proxy 8080:8080
curl -s http://localhost:8080/metrics | grep zai_proxy
# Canary metrics endpoint (local port 8081 avoids clashing with the production forward)
kubectl port-forward -n devpod deployment/zai-proxy-canary 8081:8080
curl -s http://localhost:8081/metrics | grep zai_proxy
```
## Prometheus Queries
### Compare request rates
```promql
sum(rate(zai_proxy_requests_total{deployment_variant="production"}[5m]))
sum(rate(zai_proxy_requests_total{deployment_variant="canary"}[5m]))
```
### Check token counting (canary only)
```promql
sum(rate(zai_proxy_tokens_total{deployment_variant="canary"}[5m])) by (direction, model)
```
### Compare error rates
```promql
sum by (deployment_variant) (rate(zai_proxy_requests_total{status_code=~"5.."}[5m])) /
sum by (deployment_variant) (rate(zai_proxy_requests_total[5m]))
```
## Troubleshooting
### Metrics not appearing
1. Check ServiceMonitor exists:
   ```bash
   kubectl get servicemonitor -n monitoring | grep zai-proxy
   ```
2. Check Service labels match ServiceMonitor selector:
   ```bash
   kubectl get service -n mcp zai-proxy -o jsonpath='{.metadata.labels}'
   kubectl get service -n devpod zai-proxy-canary -o jsonpath='{.metadata.labels}'
   ```
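3. Check Prometheus is scraping both targets via its targets API (run the port-forward in a separate terminal, since it blocks):
   ```bash
   kubectl port-forward -n monitoring prometheus-kube-prometheus-prometheus-0 9090:9090
   curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool | test("zai-proxy")) | {scrapePool, health, lastError}'
   ```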
### Dashboard not loading
1. Check ConfigMap exists:
   ```bash
   kubectl get configmap -n monitoring zai-proxy-grafana-dashboard
   ```
2. Verify dashboard JSON is valid:
   ```bash
   kubectl get configmap -n monitoring zai-proxy-grafana-dashboard -o jsonpath='{.data.*}' | jq .
   ```
### Alerts not firing
1. Check PrometheusRule exists:
   ```bash
   kubectl get prometheusrule -n monitoring zai-proxy-canary-alerts
   ```
2. Verify rules are loaded:
   ```bash
   # port-forward blocks, so run it in a separate terminal before the curl
   kubectl port-forward -n monitoring prometheus-kube-prometheus-prometheus-0 9090:9090
   curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="zai_proxy_canary_alerts")'
   ```
## Summary
The monitoring setup provides:
- ✅ Separate scraping for production and canary deployments
- ✅ Version labels for metric filtering
- ✅ Grafana dashboard with side-by-side comparison
- ✅ Token counting metrics (canary only)
- ✅ Request rates, error rates, latency for both
- ✅ Canary-specific alerts that don't affect production
- ✅ Comparison alerts (canary vs production)
All monitoring resources are managed via GitOps (ArgoCD) by committing manifests to the repository.