ZAI Proxy Prometheus Metrics Documentation

Overview

zai-proxy exports Prometheus metrics for monitoring token consumption, request performance, rate limiting, and system health. All metrics are exposed at the /metrics endpoint in the Prometheus text exposition format.

Metrics Endpoint

# Access metrics
curl http://zai-proxy:8080/metrics

# Query from within Kubernetes cluster
curl http://zai-proxy.mcp.svc.cluster.local:8080/metrics
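
For ad-hoc inspection outside Prometheus, the text format returned by the endpoint can be parsed with a few lines of Python. The following is a minimal sketch using only the standard library; the function names `parse_metrics` and `fetch_metrics` are illustrative and not part of zai-proxy, and the parser ignores HELP/TYPE comments and does not unescape label values.

```python
import re
from urllib.request import urlopen

# Matches: name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://zai-proxy:8080/metrics"):
    """Fetch and parse the endpoint (needs network access to the proxy)."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics()` from a host that can reach the proxy returns one tuple per sample line, which can then be filtered by metric name or by the `variant` label.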

Token Consumption Metrics

zai_proxy_tokens_total

Type: Counter

Description: Total number of tokens processed, tracking both input (prompt) and output (completion) tokens separately.

Labels:

  • direction - Token direction: input (prompt tokens) or output (completion tokens)
  • model - Tokenizer model name (e.g., glm-4, claude-3)
  • variant - Deployment variant: stable (production) or canary (testing)

Example values:

zai_proxy_tokens_total{direction="input",model="glm-4",variant="stable"} 1250000
zai_proxy_tokens_total{direction="output",model="glm-4",variant="stable"} 3500000
zai_proxy_tokens_total{direction="input",model="glm-4",variant="canary"} 15000
zai_proxy_tokens_total{direction="output",model="glm-4",variant="canary"} 42000

Use cases:

  • Track total token consumption over time
  • Calculate cost based on token usage
  • Compare input vs output token ratios
  • Monitor canary deployment token usage separately from production
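
As a worked example of the cost calculation, the sketch below applies per-direction prices to the stable counter values shown above. The prices ($0.01 per 1,000 input tokens, $0.03 per 1,000 output tokens) are the same illustrative assumption used by the cost-estimate panel in this document, not actual pricing, and `estimate_cost` is a hypothetical helper.

```python
# Illustrative prices only, in USD per 1,000 tokens (an assumption, not real pricing).
PRICE_PER_1K = {"input": 0.01, "output": 0.03}


def estimate_cost(tokens_by_direction, price_per_1k=PRICE_PER_1K):
    """Return the estimated cost in USD for a {direction: token_count} mapping."""
    return sum(
        count / 1000 * price_per_1k[direction]
        for direction, count in tokens_by_direction.items()
    )


# Counter values from the stable example above.
stable = {"input": 1_250_000, "output": 3_500_000}
print(f"${estimate_cost(stable):.2f}")  # $117.50
```

Because the counter is cumulative since process start, a real cost report would normally feed this function the increase over a billing window rather than the raw counter value.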

zai_proxy_token_rate_seconds

Type: Histogram

Description: Duration of a single tokenization operation, in seconds. Lower values indicate faster tokenization.

Labels:

  • direction - Token direction: input or output
  • model - Tokenizer model name
  • variant - Deployment variant: stable or canary

Buckets: [.00001, .00005, .0001, .0005, .001, .005, .01, .05, .1] (seconds)

Example values:

zai_proxy_token_rate_seconds_bucket{direction="input",model="glm-4",variant="stable",le="0.001"} 9500
zai_proxy_token_rate_seconds_bucket{direction="input",model="glm-4",variant="stable",le="0.005"} 9980
zai_proxy_token_rate_seconds_bucket{direction="input",model="glm-4",variant="stable",le="+Inf"} 10000
zai_proxy_token_rate_seconds_sum{direction="input",model="glm-4",variant="stable"} 8.234
zai_proxy_token_rate_seconds_count{direction="input",model="glm-4",variant="stable"} 10000

Use cases:

  • Monitor tokenization performance
  • Detect tokenizer slowdowns
  • Compare performance between models
  • Alert on slow tokenization (>10ms P95)
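
To make the P95 alert threshold concrete, the sketch below estimates quantiles from cumulative bucket counts using linear interpolation within a bucket, which is how Prometheus's `histogram_quantile` behaves for classic histograms. It treats the three example buckets above as if they were the whole histogram, which is a simplification since the metric defines more buckets than the example shows.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower_bound = buckets[index - 1][0] if index > 0 else 0.0
    lower_count = buckets[index - 1][1] if index > 0 else 0
    upper_bound, upper_count = buckets[index]
    in_bucket = upper_count - lower_count
    if in_bucket == 0:
        return lower_bound
    fraction = (rank - lower_count) / in_bucket
    return lower_bound + (upper_bound - lower_bound) * fraction


# Cumulative counts taken from the example values above.
example = [(0.001, 9500), (0.005, 9980), (math.inf, 10000)]
p95 = histogram_quantile(0.95, example)  # lands exactly on the 1 ms bucket bound
p99 = histogram_quantile(0.99, example)  # interpolated between 1 ms and 5 ms
```

With these numbers the P95 sits at the 1 ms boundary, well under the 10 ms alert threshold. The estimate can never be more precise than the bucket boundaries allow.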

zai_proxy_token_rate

Type: Histogram

Description: Token processing throughput in tokens per second. Higher values indicate faster processing. Measures how many tokens are processed per unit time.

Labels:

  • direction - Token direction: input or output
  • model - Tokenizer model name
  • variant - Deployment variant: stable or canary

Buckets: [10, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000, 50000, 100000] (tokens/second)

Example values:

zai_proxy_token_rate_bucket{direction="input",model="glm-4",variant="stable",le="1000"} 120
zai_proxy_token_rate_bucket{direction="input",model="glm-4",variant="stable",le="5000"} 850
zai_proxy_token_rate_bucket{direction="input",model="glm-4",variant="stable",le="+Inf"} 1000
zai_proxy_token_rate_sum{direction="input",model="glm-4",variant="stable"} 2500000
zai_proxy_token_rate_count{direction="input",model="glm-4",variant="stable"} 1000

Use cases:

  • Monitor tokenization throughput
  • Compare throughput between input and output tokenization
  • Identify performance bottlenecks
  • Capacity planning based on tokens/second
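
The average throughput over a window is the change in `_sum` divided by the change in `_count`, which is what dividing the two `rate()` expressions computes (the window length cancels out). A minimal sketch with a hypothetical helper name, assuming each histogram observation is the throughput of one tokenization operation:

```python
def average_over_window(sum_start, count_start, sum_end, count_end):
    """Mean observation between two scrapes of a histogram's _sum and _count.

    Counter resets are not handled in this sketch.
    """
    observations = count_end - count_start
    if observations <= 0:
        return None  # no new observations (or a counter reset)
    return (sum_end - sum_start) / observations


# Using the example _sum and _count above, measured from a zero baseline.
print(average_over_window(0, 0, 2_500_000, 1000))  # 2500.0
```

This is a mean of per-operation rates, so every tokenization call is weighted equally regardless of how many tokens it processed.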

zai_proxy_token_count_duration_seconds

Type: Histogram

Description: Overall duration of token counting operations, including both tokenization and any overhead.

Labels:

  • variant - Deployment variant: stable or canary

Buckets: [.0001, .0005, .001, .005, .01, .025, .05, .1] (seconds)

Use cases:

  • Monitor total token counting overhead
  • Ensure token counting latency stays below target (<5ms P95)
  • Compare performance between stable and canary deployments

Request Performance Metrics

zai_proxy_requests_total

Type: Counter

Description: Total number of requests processed.

Labels:

  • method - HTTP method (GET, POST, etc.)
  • path - Request path
  • status_code - HTTP status code
  • variant - Deployment variant

Example query:

# Request rate by status code
sum by (status_code) (rate(zai_proxy_requests_total{variant="stable"}[5m]))

# Error rate
rate(zai_proxy_requests_total{variant="stable",status_code=~"5.."}[5m])
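
The error-rate query is usually divided by the total request rate to get a ratio. The sketch below shows the same arithmetic on two hypothetical counter snapshots keyed by status code; `error_ratio` is an illustrative helper, and the pattern is applied with `fullmatch` because PromQL regex matchers are fully anchored.

```python
import re

SERVER_ERROR = re.compile(r"5..")  # same pattern as status_code=~"5.."


def error_ratio(before, after):
    """Fraction of requests in a window that returned a 5xx status.

    `before` and `after` map status_code -> counter value at the start and
    end of the window. Counter resets are not handled in this sketch.
    """
    deltas = {code: after[code] - before.get(code, 0) for code in after}
    total = sum(deltas.values())
    if total == 0:
        return None  # no traffic in the window
    errors = sum(n for code, n in deltas.items() if SERVER_ERROR.fullmatch(code))
    return errors / total


# Hypothetical snapshots: 95 new 200s and 5 new 500s during the window.
print(error_ratio({"200": 100, "500": 0}, {"200": 195, "500": 5}))  # 0.05
```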

zai_proxy_request_duration_seconds

Type: Histogram

Description: Request duration from start to completion.

Labels:

  • method, path, status_code, variant

Buckets: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30, 60, 120, 300] (seconds)

Example query:

# P95 latency (aggregated across method, path, and status_code)
histogram_quantile(0.95, sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{variant="stable"}[5m])))

# Average latency
rate(zai_proxy_request_duration_seconds_sum{variant="stable"}[5m]) /
rate(zai_proxy_request_duration_seconds_count{variant="stable"}[5m])

zai_proxy_request_size_bytes / zai_proxy_response_size_bytes

Type: Histogram

Description: Request and response body sizes in bytes.

Labels:

  • Request: method, path, variant
  • Response: method, path, status_code, variant

Buckets: Exponential (100, 1000, 10000, ...)

Example query:

# Average response size
rate(zai_proxy_response_size_bytes_sum{variant="stable"}[5m]) /
rate(zai_proxy_response_size_bytes_count{variant="stable"}[5m])

Concurrency & Worker Metrics

zai_proxy_concurrent_requests

Type: Gauge

Description: Number of requests currently being processed.

Labels:

  • variant - Deployment variant

Example query:

# Current load
zai_proxy_concurrent_requests{variant="stable"}

zai_proxy_max_workers

Type: Gauge

Description: Maximum number of concurrent workers allowed (configured limit).

Labels:

  • variant - Deployment variant

zai_proxy_worker_utilization_ratio

Type: Gauge

Description: Worker utilization ratio (concurrent_requests / max_workers). Value ranges from 0.0 to 1.0 (or higher if overloaded).

Labels:

  • variant - Deployment variant

Example query:

# Worker utilization percentage
zai_proxy_worker_utilization_ratio{variant="stable"} * 100

# Alert when utilization exceeds 80%
zai_proxy_worker_utilization_ratio{variant="stable"} > 0.8
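
The ratio and its thresholds can be sketched as follows. The function names are hypothetical; the 0.8 threshold comes from the query above and the 0.9 threshold from the WorkerCapacityExhausted alert in this document.

```python
def worker_utilization(concurrent_requests, max_workers):
    """Return concurrent_requests / max_workers, or None if no limit is set."""
    if max_workers <= 0:
        return None
    return concurrent_requests / max_workers


def utilization_level(ratio, warn=0.8, critical=0.9):
    """Classify a utilization ratio against the warning and critical thresholds."""
    if ratio > critical:
        return "critical"
    if ratio > warn:
        return "warning"
    return "ok"


ratio = worker_utilization(concurrent_requests=45, max_workers=50)
print(utilization_level(ratio))  # warning
```

Both comparisons are strict, matching the `>` operators in the PromQL expressions, so a ratio of exactly 0.8 does not trigger the warning level.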

Rate Limiting Metrics

zai_proxy_rate_limit_requests_per_second

Type: Gauge

Description: Current rate limit in requests per second. This value adjusts automatically based on upstream 429 responses.

Labels:

  • variant - Deployment variant

zai_proxy_rate_limit_wait_seconds

Type: Histogram

Description: Time spent waiting for rate limiter before processing request.

Labels:

  • variant - Deployment variant

Buckets: [.001, .005, .01, .025, .05, .1, .25, .5, 1, 2, 5, 10] (seconds)

zai_proxy_rate_limit_adjustments_total

Type: Counter

Description: Number of times the rate limit was adjusted (increased or decreased).

Labels:

  • direction - Adjustment direction: increase or decrease
  • variant - Deployment variant

Example query:

# Rate limit adjustments over time
rate(zai_proxy_rate_limit_adjustments_total{variant="stable"}[10m])

zai_proxy_rate_limit_rejections_total

Type: Counter

Description: Number of requests rejected due to rate limiting.

Labels:

  • variant - Deployment variant

Error Metrics

zai_proxy_upstream_errors_total

Type: Counter

Description: Total number of upstream errors by error type.

Labels:

  • error_type - Error type: request_creation, upstream_connection, read_error, write_error
  • variant - Deployment variant

Example query:

# Error rate by type
sum by (error_type) (rate(zai_proxy_upstream_errors_total{variant="stable"}[5m]))

zai_proxy_retry_attempts_total

Type: Counter

Description: Total number of retry attempts.

Labels:

  • reason - Retry reason: 429 (rate limited), network_error, or retry (general)
  • variant - Deployment variant

Build Info Metric

zai_proxy_build_info

Type: Gauge (always 1)

Description: Build information including version, variant, commit hash, and build time. This metric always has value 1 and exists solely to export build metadata as labels.

Labels:

  • version - Version number (e.g., v1.3.0)
  • variant - Deployment variant: stable or canary
  • commit - Git commit hash
  • build_time - Build timestamp

Example query:

# View current deployed version
zai_proxy_build_info{variant="stable"}

Example Prometheus Queries

Token Consumption Analysis

# Total tokens processed per hour (input + output)
sum(increase(zai_proxy_tokens_total{variant="stable"}[1h]))

# Input vs output token ratio
sum(rate(zai_proxy_tokens_total{direction="input",variant="stable"}[5m])) /
sum(rate(zai_proxy_tokens_total{direction="output",variant="stable"}[5m]))

# Token usage by model
sum by (model) (rate(zai_proxy_tokens_total{variant="stable"}[5m]))

# Compare stable vs canary token usage
sum(rate(zai_proxy_tokens_total{variant="stable"}[5m])) by (direction)
sum(rate(zai_proxy_tokens_total{variant="canary"}[5m])) by (direction)

Tokenization Performance

# P95 tokenization latency (input)
histogram_quantile(0.95,
  rate(zai_proxy_token_rate_seconds_bucket{direction="input",variant="stable"}[5m]))

# Average tokenization throughput (tokens/second)
rate(zai_proxy_token_rate_sum{direction="input",variant="stable"}[5m]) /
rate(zai_proxy_token_rate_count{direction="input",variant="stable"}[5m])

# Slow tokenization alert (P95 > 10ms)
histogram_quantile(0.95,
  rate(zai_proxy_token_rate_seconds_bucket{variant="stable"}[5m])) > 0.01

Request Performance

# Requests per second
sum(rate(zai_proxy_requests_total{variant="stable"}[5m]))

# P95 request latency (aggregated across method, path, and status_code)
histogram_quantile(0.95,
  sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{variant="stable"}[5m])))

# Error rate (5xx responses)
sum(rate(zai_proxy_requests_total{variant="stable",status_code=~"5.."}[5m])) /
sum(rate(zai_proxy_requests_total{variant="stable"}[5m]))

Canary vs Production Comparison

# Token processing rate comparison
sum(rate(zai_proxy_tokens_total{variant="stable"}[5m])) by (direction)
sum(rate(zai_proxy_tokens_total{variant="canary"}[5m])) by (direction)

# Latency comparison (P95), aggregated across method, path, and status_code
histogram_quantile(0.95,
  sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{variant="stable"}[5m])))
histogram_quantile(0.95,
  sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{variant="canary"}[5m])))

# Error rate comparison
sum(rate(zai_proxy_requests_total{variant="stable",status_code=~"5.."}[5m])) /
sum(rate(zai_proxy_requests_total{variant="stable"}[5m]))
sum(rate(zai_proxy_requests_total{variant="canary",status_code=~"5.."}[5m])) /
sum(rate(zai_proxy_requests_total{variant="canary"}[5m]))

Capacity Planning

# Worker utilization trend
zai_proxy_worker_utilization_ratio{variant="stable"}

# Concurrent requests vs max workers
zai_proxy_concurrent_requests{variant="stable"}
zai_proxy_max_workers{variant="stable"}

# Rate limiting pressure
rate(zai_proxy_rate_limit_adjustments_total{direction="decrease",variant="stable"}[10m])

Grafana Dashboard Suggestions

Dashboard 1: Token Consumption Overview

Panels:

  1. Total Tokens Processed (Time Series)

    sum(rate(zai_proxy_tokens_total{variant="stable"}[5m])) by (direction)
    
  2. Token Rate by Model (Time Series)

    sum(rate(zai_proxy_tokens_total{variant="stable"}[5m])) by (model, direction)
    
  3. Token Cost Estimate (Stat Panel)

    # Assuming $0.01 per 1000 input tokens, $0.03 per 1000 output tokens
    (sum(increase(zai_proxy_tokens_total{direction="input",variant="stable"}[24h])) * 0.00001) +
    (sum(increase(zai_proxy_tokens_total{direction="output",variant="stable"}[24h])) * 0.00003)
    
  4. Tokenization Latency (Heatmap)

    sum(rate(zai_proxy_token_rate_seconds_bucket{variant="stable"}[5m])) by (le)
    
  5. Input vs Output Token Ratio (Gauge)

    sum(rate(zai_proxy_tokens_total{direction="input",variant="stable"}[5m])) /
    sum(rate(zai_proxy_tokens_total{direction="output",variant="stable"}[5m]))
    

Dashboard 2: Performance & Health

Panels:

  1. Request Rate (Time Series)

    sum(rate(zai_proxy_requests_total{variant="stable"}[5m])) by (status_code)
    
  2. Request Latency Percentiles (Time Series)

    histogram_quantile(0.50, sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{variant="stable"}[5m])))
    histogram_quantile(0.95, sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{variant="stable"}[5m])))
    histogram_quantile(0.99, sum by (le) (rate(zai_proxy_request_duration_seconds_bucket{variant="stable"}[5m])))
    
  3. Error Rate (Time Series)

    sum(rate(zai_proxy_requests_total{variant="stable",status_code=~"5.."}[5m])) /
    sum(rate(zai_proxy_requests_total{variant="stable"}[5m]))
    
  4. Worker Utilization (Gauge)

    zai_proxy_worker_utilization_ratio{variant="stable"} * 100
    
  5. Concurrent Requests (Time Series)

    zai_proxy_concurrent_requests{variant="stable"}
    zai_proxy_max_workers{variant="stable"}
    
  6. Upstream Errors (Time Series)

    sum(rate(zai_proxy_upstream_errors_total{variant="stable"}[5m])) by (error_type)
    

Dashboard 3: Canary Deployment Comparison

Panels:

  1. Token Usage: Stable vs Canary (Time Series)

    sum(rate(zai_proxy_tokens_total[5m])) by (variant, direction)
    
  2. Latency Comparison (Time Series)

    histogram_quantile(0.95, sum by (le, variant) (rate(zai_proxy_request_duration_seconds_bucket[5m])))
    
  3. Error Rate Comparison (Time Series)

    sum(rate(zai_proxy_requests_total{status_code=~"5.."}[5m])) by (variant) /
    sum(rate(zai_proxy_requests_total[5m])) by (variant)
    
  4. Tokenization Performance Comparison (Time Series)

    histogram_quantile(0.95, sum by (le, variant) (rate(zai_proxy_token_rate_seconds_bucket[5m])))
    
  5. Request Rate Comparison (Time Series)

    sum(rate(zai_proxy_requests_total[5m])) by (variant)
    

Dashboard 4: Rate Limiting & Capacity

Panels:

  1. Current Rate Limit (Gauge)

    zai_proxy_rate_limit_requests_per_second{variant="stable"}
    
  2. Rate Limit Adjustments (Time Series)

    sum by (direction) (rate(zai_proxy_rate_limit_adjustments_total{variant="stable"}[5m]))
    
  3. Rate Limit Wait Time (Heatmap)

    sum(rate(zai_proxy_rate_limit_wait_seconds_bucket{variant="stable"}[5m])) by (le)
    
  4. Retry Attempts (Time Series)

    sum by (reason) (rate(zai_proxy_retry_attempts_total{variant="stable"}[5m]))
    

Alerting Rules

Critical Alerts

# High error rate
- alert: HighErrorRate
  expr: |
    sum(rate(zai_proxy_requests_total{status_code=~"5.."}[5m])) /
    sum(rate(zai_proxy_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected (>5%)"

# Worker capacity exhausted
- alert: WorkerCapacityExhausted
  expr: zai_proxy_worker_utilization_ratio > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Worker utilization above 90%"

Warning Alerts

# Slow tokenization
- alert: SlowTokenization
  expr: |
    histogram_quantile(0.95,
      rate(zai_proxy_token_rate_seconds_bucket[5m])) > 0.01
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 tokenization latency above 10ms"

# Frequent rate limit adjustments
- alert: FrequentRateLimitAdjustments
  expr: |
    rate(zai_proxy_rate_limit_adjustments_total{direction="decrease"}[10m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Frequent rate limit decreases detected"

# High retry rate
- alert: HighRetryRate
  expr: rate(zai_proxy_retry_attempts_total[5m]) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High retry attempt rate"

Configuration

Token counting metrics can be configured via environment variables:

# Enable/disable token counting (default: true)
TOKEN_COUNTING_ENABLED=true

# Tokenizer model name for metrics labels (default: glm-4)
TOKENIZER_MODEL=glm-4

# Deployment variant (default: production)
DEPLOYMENT_VARIANT=stable  # or "canary"

Notes

  • Histogram bucket ranges are matched to each metric's expected values (for example, 10 µs to 100 ms for tokenization and 5 ms to 300 s for request duration)
  • Metrics are designed to support dual-deployment monitoring (stable + canary)
  • Token metrics track both count and processing rate for comprehensive analysis
  • Labels allow filtering by deployment variant to isolate canary testing from production
  • Build info metric enables version tracking across deployments