jedarden 225f7cfe51 docs(dashboard): add comprehensive README.md

- Architecture overview with component diagram
- Quick start for local, frontend dev, Docker, and Kubernetes
- Configuration environment variables reference
- Complete API endpoints documentation (REST + SSE)
- Data model and storage schema explanation
- Development setup and testing instructions
- Troubleshooting guide
- Performance characteristics

Co-Authored-By: Claude <noreply@anthropic.com>
Bead-Id: bf-2o7

2026-06-21 09:56:17 -04:00


Z.AI Proxy Dashboard

Real-time web dashboard for monitoring zai-proxy metrics, token usage, and request history.

Features

✅ Real-time Metrics - Live updates via Server-Sent Events (SSE)
✅ Prometheus Scraping - Collects metrics from zai-proxy endpoints
✅ SQLite Storage - Efficient data storage with automatic downsampling
✅ Multi-Variant Support - Monitor production and canary deployments side-by-side
✅ Token Tracking - Visualize input/output token rates and totals
✅ Request Analytics - Latency percentiles, error rates, throughput
✅ React Frontend - Modern UI built with React, Vite, and Tailwind CSS

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              Dashboard                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────┐      ┌─────────────┐      ┌──────────────┐              │
│  │   Collector  │─────▶│   Storage   │─────▶│   SSE Hub    │              │
│  │              │      │             │      │              │              │
│  │ Scrapes      │      │ SQLite      │      │ Broadcasts   │              │
│  │ Prometheus   │      │ metrics_5s  │      │ live updates │              │
│  │ endpoints    │      │ metrics_1m  │      │ to clients   │              │
│  └──────────────┘      └─────────────┘      └──────────────┘              │
│         │                                        │                          │
│         │                                        │                          │
│         ▼                                        ▼                          │
│  ┌──────────────┐                         ┌──────────┐                    │
│  │  zai-proxy   │                         │  React   │                    │
│  │ :8080/metrics│                         │ Frontend │                    │
│  └──────────────┘                         └──────────┘                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Components

Collector - Scrapes Prometheus metrics from configured targets every 5 seconds
Storage - SQLite database with dual-resolution storage:
- metrics_5s - High-resolution data (24h retention)
- metrics_1m - Downsampled averages (7d retention)
SSE Hub - Real-time broadcast of new snapshots to connected web clients
API Router - REST endpoints for historical data queries
Frontend - React SPA with live charts and status displays
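
The SSE Hub's fan-out role can be illustrated with a small sketch. This is not the code in api/sse.go; the `Hub` type, its method names, the buffer size of 16, and the choice to drop messages for slow clients are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a hypothetical fan-out broadcaster; the real api/sse.go may differ.
type Hub struct {
	mu      sync.Mutex
	clients map[chan string]struct{}
}

func NewHub() *Hub {
	return &Hub{clients: make(map[chan string]struct{})}
}

// Subscribe registers a client and returns its buffered message channel.
func (h *Hub) Subscribe() chan string {
	ch := make(chan string, 16) // assumed buffer size
	h.mu.Lock()
	h.clients[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Unsubscribe removes a client and closes its channel.
func (h *Hub) Unsubscribe(ch chan string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.clients[ch]; ok {
		delete(h.clients, ch)
		close(ch)
	}
}

// Broadcast offers msg to every client without blocking and returns how many
// clients accepted it. A client whose buffer is full misses the message.
func (h *Hub) Broadcast(msg string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.clients {
		select {
		case ch <- msg:
			delivered++
		default: // slow client: drop rather than stall the collector
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	a, b := hub.Subscribe(), hub.Subscribe()
	fmt.Println(hub.Broadcast(`{"type":"metrics"}`))
	fmt.Println(<-a, <-b)
}
```

The non-blocking send keeps one stalled browser tab from delaying updates to everyone else, at the cost of that tab missing a snapshot.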

Quick Start

Run Locally

# From dashboard directory
cd dashboard/

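# Environment variables are all optional. The values below suit local
# development; see Configuration for the built-in defaults.
export SCRAPE_TARGETS="http://localhost:8080/metrics"
# If zai-proxy already listens on localhost:8080, pick a different port here
export LISTEN_ADDR=":8080"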
export DB_PATH="/tmp/dashboard.db"

# Build and run
go run .

# Dashboard available at http://localhost:8080
# Metrics API at http://localhost:8080/api/metrics
# SSE stream at http://localhost:8080/api/events

Frontend Development

cd dashboard/frontend/

# Install dependencies
npm install

# Run dev server (proxies API to :8080)
npm run dev

# Build for production
npm run build

# Run tests
npm run test

Docker Deployment

# Build image
docker build -t zai-proxy-dashboard:latest .

# Run container
docker run -p 8080:8080 \
  -v dashboard-data:/data \
  -e SCRAPE_TARGETS="http://zai-proxy:8080/metrics" \
  zai-proxy-dashboard:latest

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy-dashboard
  namespace: mcp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zai-proxy-dashboard
  template:
    metadata:
      labels:
        app: zai-proxy-dashboard
    spec:
      containers:
      - name: dashboard
        image: ronaldraygun/zai-proxy-dashboard:latest
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: SCRAPE_TARGETS
          value: "http://zai-proxy.mcp.svc.cluster.local:8080/metrics"
        - name: DB_PATH
          value: "/data/dashboard.db"
        volumeMounts:
        - name: data
          mountPath: /data
        resources:
          requests:
            cpu: 100m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 256Mi
      volumes:
      - name: data
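        emptyDir: {}  # ephemeral: metric history is lost when the pod is deleted; use a PersistentVolumeClaim to keep it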
---
apiVersion: v1
kind: Service
metadata:
  name: zai-proxy-dashboard
  namespace: mcp
spec:
  selector:
    app: zai-proxy-dashboard
  ports:
  - port: 8080
    targetPort: 8080

Configuration

Environment Variables

Variable	Type	Default	Description
`LISTEN_ADDR`	String	`:8080`	HTTP listen address
`SCRAPE_TARGETS`	String	`http://zai-proxy.mcp.svc.cluster.local:8080/metrics`	Comma-separated Prometheus endpoints to scrape
`SCRAPE_INTERVAL`	Duration	`5s`	Scrape interval
`SCRAPE_TIMEOUT`	Duration	`3s`	HTTP timeout for each scrape
`DB_PATH`	String	`/data/dashboard.db`	SQLite database file path
`RETENTION_5S`	Duration	`24h`	Retention for high-resolution data
`RETENTION_1M`	Duration	`168h` (7d)	Retention for downsampled data
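
As a rough illustration of how these variables could be read, the sketch below loads each one with the default from the table. The `Config` struct and the helper names are hypothetical, not taken from the dashboard source. Duration values are parsed with Go's `time.ParseDuration`, which accepts `5s` and `168h` but has no day unit, so `7d` would not parse here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// Config is a hypothetical container for the variables in the table above.
type Config struct {
	ListenAddr     string
	ScrapeTargets  []string
	ScrapeInterval time.Duration
	ScrapeTimeout  time.Duration
	DBPath         string
	Retention5s    time.Duration
	Retention1m    time.Duration
}

// envString returns the variable's value, or def when it is unset or empty.
func envString(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// envDuration parses the variable as a Go duration, falling back to def
// when it is unset or malformed.
func envDuration(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

// splitTargets splits a comma-separated list and drops empty entries.
func splitTargets(s string) []string {
	var out []string
	for _, t := range strings.Split(s, ",") {
		if t = strings.TrimSpace(t); t != "" {
			out = append(out, t)
		}
	}
	return out
}

func loadConfig() Config {
	return Config{
		ListenAddr:     envString("LISTEN_ADDR", ":8080"),
		ScrapeTargets:  splitTargets(envString("SCRAPE_TARGETS", "http://zai-proxy.mcp.svc.cluster.local:8080/metrics")),
		ScrapeInterval: envDuration("SCRAPE_INTERVAL", 5*time.Second),
		ScrapeTimeout:  envDuration("SCRAPE_TIMEOUT", 3*time.Second),
		DBPath:         envString("DB_PATH", "/data/dashboard.db"),
		Retention5s:    envDuration("RETENTION_5S", 24*time.Hour),
		Retention1m:    envDuration("RETENTION_1M", 168*time.Hour),
	}
}

func main() {
	fmt.Printf("%+v\n", loadConfig())
}
```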

Variant Detection

The dashboard automatically detects deployment variants from scrape target URLs:

Production - Default variant for any target without "test" or "canary" in the URL
Canary - Detected if URL contains "test" or "canary"

# Single production instance
SCRAPE_TARGETS="http://zai-proxy:8080/metrics"

# Multiple instances (auto-detected variants)
SCRAPE_TARGETS="http://zai-proxy:8080/metrics,http://zai-proxy-canary:8080/metrics"
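
The detection rule amounts to a substring check. A minimal sketch follows; the function name `detectVariant` is hypothetical, and the case-insensitive match is an assumption, since the rule above does not say whether case matters.

```go
package main

import (
	"fmt"
	"strings"
)

// detectVariant is a hypothetical helper that applies the rule described
// above: a URL containing "test" or "canary" is a canary target, and
// anything else is production. Lowercasing first is an assumption.
func detectVariant(target string) string {
	lower := strings.ToLower(target)
	if strings.Contains(lower, "test") || strings.Contains(lower, "canary") {
		return "canary"
	}
	return "production"
}

func main() {
	targets := "http://zai-proxy:8080/metrics,http://zai-proxy-canary:8080/metrics"
	for _, t := range strings.Split(targets, ",") {
		t = strings.TrimSpace(t)
		fmt.Printf("%s -> %s\n", t, detectVariant(t))
	}
}
```

Because the check looks at the whole URL, a production host whose name happens to contain "test" (for example `latest-proxy`) would be classified as canary under this rule.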

API Endpoints

REST API

`GET /healthz`

Health check endpoint.

Response: {"status":"ok"}

`GET /api/status`

Returns current health summary for all variants.

Response:

{
  "production": {
    "healthy": true,
    "last_scrape": "2026-06-21T10:30:00Z",
    "req_rate": 45.2,
    "error_rate_pct": 0.1,
    "latency_p50_ms": 120,
    "concurrent": 12,
    "worker_utilization": 0.24,
    "rate_limit_rps": 50.0,
    "token_rate_in": 15000,
    "token_rate_out": 45000
  },
  "canary": {
    "healthy": true,
    "last_scrape": "2026-06-21T10:30:00Z",
    "req_rate": 5.1,
    "error_rate_pct": 0.0,
    "latency_p50_ms": 115,
    "concurrent": 2,
    "worker_utilization": 0.10,
    "rate_limit_rps": 10.0,
    "token_rate_in": 1500,
    "token_rate_out": 4800
  }
}
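
A client can decode this response into a map keyed by variant name. The sketch below is one way to do that. The `VariantStatus` type and the `unhealthy` helper are hypothetical names, and the field list simply mirrors the JSON keys shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// VariantStatus mirrors the per-variant object in the example response.
type VariantStatus struct {
	Healthy           bool      `json:"healthy"`
	LastScrape        time.Time `json:"last_scrape"`
	ReqRate           float64   `json:"req_rate"`
	ErrorRatePct      float64   `json:"error_rate_pct"`
	LatencyP50Ms      float64   `json:"latency_p50_ms"`
	Concurrent        float64   `json:"concurrent"`
	WorkerUtilization float64   `json:"worker_utilization"`
	RateLimitRps      float64   `json:"rate_limit_rps"`
	TokenRateIn       float64   `json:"token_rate_in"`
	TokenRateOut      float64   `json:"token_rate_out"`
}

// parseStatus decodes a /api/status body into a map keyed by variant name.
func parseStatus(body []byte) (map[string]VariantStatus, error) {
	var out map[string]VariantStatus
	err := json.Unmarshal(body, &out)
	return out, err
}

// unhealthy returns the sorted names of variants reporting healthy=false.
func unhealthy(status map[string]VariantStatus) []string {
	var names []string
	for name, s := range status {
		if !s.Healthy {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	body := []byte(`{
	  "production": {"healthy": true,  "last_scrape": "2026-06-21T10:30:00Z", "req_rate": 45.2},
	  "canary":     {"healthy": false, "last_scrape": "2026-06-21T10:30:00Z", "req_rate": 5.1}
	}`)
	status, err := parseStatus(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("unhealthy variants:", unhealthy(status))
}
```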

`GET /api/metrics`

Returns historical metrics for a time range.

Query Parameters:

range - Time range: 5m, 15m, 1h, 6h, 24h, 7d (default: 1h)
variant - Variant filter: production, canary, all (default: all)

Response: JSON array of MetricSnapshot objects

[
  {
    "timestamp": 1708500000000,
    "variant": "production",
    "requests_2xx": 1000,
    "requests_4xx": 10,
    "requests_5xx": 1,
    "tokens_input": 50000,
    "tokens_output": 150000,
    "tokens_cache_read": 10000,
    "tokens_cache_write": 8000,
    "concurrent_requests": 12,
    "max_workers": 50,
    "rate_limit_rps": 50.0,
    "rate_limit_rejections": 0,
    "req_rate": 45.2,
    "token_rate_in": 15000,
    "token_rate_out": 45000,
    "latency_p50": 120,
    "latency_p95": 250,
    "latency_p99": 450,
    "error_rate_pct": 0.1,
    "worker_utilization": 0.24,
    "upstream_errors": 0,
    "retry_attempts": 2
  }
]
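
The `range` values include `7d`, which Go's `time.ParseDuration` does not understand, so a lookup table is one simple way to handle them. The sketch below validates a range and builds a request URL. `parseRange` and `buildMetricsURL` are hypothetical helpers, not functions from the dashboard codebase.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// allowedRanges lists the documented range values and their durations.
var allowedRanges = map[string]time.Duration{
	"5m":  5 * time.Minute,
	"15m": 15 * time.Minute,
	"1h":  time.Hour,
	"6h":  6 * time.Hour,
	"24h": 24 * time.Hour,
	"7d":  7 * 24 * time.Hour,
}

// parseRange maps a range string to a duration, using the documented
// default of 1h when the parameter is empty.
func parseRange(s string) (time.Duration, error) {
	if s == "" {
		s = "1h"
	}
	d, ok := allowedRanges[s]
	if !ok {
		return 0, fmt.Errorf("unsupported range %q", s)
	}
	return d, nil
}

// buildMetricsURL assembles a /api/metrics URL with encoded query parameters.
func buildMetricsURL(base, rng, variant string) string {
	q := url.Values{}
	q.Set("range", rng)
	q.Set("variant", variant)
	return base + "/api/metrics?" + q.Encode()
}

func main() {
	d, err := parseRange("7d")
	fmt.Println(d, err)
	fmt.Println(buildMetricsURL("http://localhost:8080", "6h", "production"))
}
```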

`GET /api/config`

Returns dashboard configuration.

Response:

{
  "scrape_interval": 5,
  "targets": ["http://zai-proxy.mcp.svc.cluster.local:8080/metrics"]
}

SSE Endpoint

`GET /api/events`

Server-Sent Events stream for real-time metric updates.

Headers:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

Message Format:

data: {"type":"connected","scrape_interval":5,"variants":["production","canary"]}

data: {"type":"metrics","data":{"timestamp":1708500000000,"variant":"production",...}}

: heartbeat

Message Types:

connected - Initial connection confirmation
metrics - New metric snapshot from collector
: heartbeat - Keep-alive comment line sent every 30s; it carries no data, and SSE clients ignore lines that begin with a colon
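
A client outside the browser has to separate `data:` lines from comment lines itself. The following sketch parses a stream in the format shown above. It assumes each event is a single `data:` line, which matches the examples here but is narrower than what the SSE format allows. The `event` type and function names are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// event holds the envelope fields used in the examples above.
type event struct {
	Type string          `json:"type"`
	Data json.RawMessage `json:"data,omitempty"`
}

// readEvents decodes every "data:" line from an SSE stream. Blank lines and
// comment lines (those starting with ':', such as the heartbeat) are skipped.
func readEvents(r io.Reader) ([]event, error) {
	var events []event
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if line == "" || strings.HasPrefix(line, ":") {
			continue
		}
		if !strings.HasPrefix(line, "data:") {
			continue // other SSE fields (event:, id:, retry:) are ignored here
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		var ev event
		if err := json.Unmarshal([]byte(payload), &ev); err != nil {
			return events, err
		}
		events = append(events, ev)
	}
	return events, sc.Err()
}

// parseStream is a convenience wrapper for in-memory input.
func parseStream(s string) ([]event, error) {
	return readEvents(strings.NewReader(s))
}

func main() {
	stream := "data: {\"type\":\"connected\",\"scrape_interval\":5,\"variants\":[\"production\",\"canary\"]}\n\n" +
		": heartbeat\n\n" +
		"data: {\"type\":\"metrics\",\"data\":{\"variant\":\"production\"}}\n\n"
	events, err := parseStream(stream)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, ev := range events {
		fmt.Println(ev.Type, string(ev.Data))
	}
}
```

Against a live server, the same `readEvents` function could be given the body of an HTTP response from `/api/events`.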

Data Model

MetricSnapshot

Represents a single point-in-time collection of metrics from a zai-proxy instance.

type MetricSnapshot struct {
    Timestamp             int64                   // Unix timestamp (ms)
    Variant               string                  // "production" or "canary"
    Requests2xx           float64                 // Total 2xx requests
    Requests4xx           float64                 // Total 4xx requests
    Requests5xx           float64                 // Total 5xx requests
    TokensInput           float64                 // Total input tokens
    TokensOutput          float64                 // Total output tokens
    TokensCacheRead       float64                 // Total cache-read tokens
    TokensCacheWrite      float64                 // Total cache-write tokens
    ConcurrentRequests    float64                 // Current concurrent requests
    MaxWorkers            float64                 // Maximum workers
    RateLimitRps          float64                 // Current rate limit (req/s)
    RateLimitRejections   float64                 // Total rate limit rejections
    RateLimitAdjIncrease  float64                 // Total rate limit increases
    RateLimitAdjDecrease  float64                 // Total rate limit decreases
    UpstreamErrors        float64                 // Total upstream errors
    RetryAttempts         float64                 // Total retry attempts
    LatencyP50            float64                 // Request latency p50 (ms)
    LatencyP95            float64                 // Request latency p95 (ms)
    LatencyP99            float64                 // Request latency p99 (ms)
    RequestSizeAvg        float64                 // Average request size (bytes)
    ResponseSizeAvg       float64                 // Average response size (bytes)
    TokenRateIn           float64                 // Input token rate (tokens/s)
    TokenRateOut          float64                 // Output token rate (tokens/s)
    TokenRateCacheRead    float64                 // Cache-read token rate (tokens/s)
    TokenRateCacheWrite   float64                 // Cache-write token rate (tokens/s)
    ReqRate               float64                 // Request rate (req/s)
    ErrorRatePct          float64                 // Error rate percentage
    WorkerUtilization     float64                 // Worker utilization ratio (0-1)
    StatusCodeRates       map[string]float64      // Per-status-code rates (req/s)
}
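
The struct mixes cumulative totals (for example `Requests2xx`) with derived values (for example `ReqRate` and `WorkerUtilization`). This README does not say how the derived values are computed. The sketch below shows the usual approach, under the assumptions that rates are the change in a counter between two scrapes divided by the elapsed time, and that utilization is `ConcurrentRequests / MaxWorkers`. The second assumption is at least consistent with the example values 12, 50, and 0.24. Both function names are hypothetical.

```go
package main

import "fmt"

// ratePerSecond derives a per-second rate from two readings of a cumulative
// counter taken at prevMs and curMs (Unix milliseconds). It returns 0 when
// the interval is not positive or the counter went backwards (for example
// after a proxy restart).
func ratePerSecond(prev, cur float64, prevMs, curMs int64) float64 {
	dt := float64(curMs-prevMs) / 1000.0
	if dt <= 0 || cur < prev {
		return 0
	}
	return (cur - prev) / dt
}

// utilization returns concurrent/maxWorkers as a ratio, guarding against a
// zero or negative worker count.
func utilization(concurrent, maxWorkers float64) float64 {
	if maxWorkers <= 0 {
		return 0
	}
	return concurrent / maxWorkers
}

func main() {
	// 226 new requests over a 5-second scrape interval.
	fmt.Println(ratePerSecond(1000, 1226, 1708500000000, 1708500005000))
	fmt.Println(utilization(12, 50))
}
```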

Storage

Database Schema

SQLite database with two resolution levels:

metrics_5s - High-resolution data

5-second intervals
24-hour retention
Raw metric snapshots

metrics_1m - Downsampled data

1-minute intervals (averaged from 5s data)
7-day retention
Created by background downsample job
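
For a sense of scale, the retention and interval figures above fix the steady-state row count per variant in each table. The helper below is a hypothetical illustration of that arithmetic, not part of the storage package.

```go
package main

import (
	"fmt"
	"time"
)

// rowsPerVariant estimates how many rows one variant keeps in a table once
// the retention window is full: retention divided by the sample interval.
func rowsPerVariant(retention, interval time.Duration) int64 {
	if interval <= 0 {
		return 0
	}
	return int64(retention / interval)
}

func main() {
	fmt.Println("metrics_5s:", rowsPerVariant(24*time.Hour, 5*time.Second)) // 17280
	fmt.Println("metrics_1m:", rowsPerVariant(168*time.Hour, time.Minute))  // 10080
}
```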

Automatic Downsampling

Every 10 minutes, the dashboard:

Reads new 5s data since last downsample
Groups by minute bucket and variant
Computes averages for all numeric fields
Writes to metrics_1m table
Cleans up data beyond retention periods
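
The grouping and averaging steps above can be sketched in memory as follows. This version averages a single field for brevity, while the description says every numeric field is averaged. The `sample` and `bucketKey` types are hypothetical, and the real job presumably works against SQLite rather than slices.

```go
package main

import "fmt"

// sample is a trimmed-down stand-in for a row in metrics_5s.
type sample struct {
	TimestampMs int64
	Variant     string
	ReqRate     float64
}

// bucketKey identifies one output row in metrics_1m.
type bucketKey struct {
	MinuteMs int64 // start of the minute, Unix milliseconds
	Variant  string
}

// downsample groups samples by (minute, variant) and averages ReqRate.
func downsample(samples []sample) map[bucketKey]float64 {
	sums := make(map[bucketKey]float64)
	counts := make(map[bucketKey]int)
	for _, s := range samples {
		k := bucketKey{
			MinuteMs: s.TimestampMs - s.TimestampMs%60000, // floor to the minute
			Variant:  s.Variant,
		}
		sums[k] += s.ReqRate
		counts[k]++
	}
	avgs := make(map[bucketKey]float64, len(sums))
	for k, sum := range sums {
		avgs[k] = sum / float64(counts[k])
	}
	return avgs
}

func main() {
	avgs := downsample([]sample{
		{1708500000000, "production", 40},
		{1708500005000, "production", 50},
		{1708500060000, "production", 10},
		{1708500000000, "canary", 5},
	})
	for k, v := range avgs {
		fmt.Printf("%d %s avg=%.1f\n", k.MinuteMs, k.Variant, v)
	}
}
```

One thing to keep in mind with this approach is that averaging cumulative totals and percentile values gives an approximation, not an exact per-minute figure.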

Query Routing

The API automatically selects the appropriate table based on query range:

≤ 1 hour → queries metrics_5s for detailed data
> 1 hour → queries metrics_1m for performance
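
Expressed as code, the routing rule is a single comparison. The function name `tableForRange` is hypothetical. The boundary follows the rule above, so a range of exactly one hour still reads the 5-second table.

```go
package main

import (
	"fmt"
	"time"
)

// tableForRange picks the table to query for a requested time range:
// up to and including one hour uses metrics_5s, anything longer metrics_1m.
func tableForRange(d time.Duration) string {
	if d <= time.Hour {
		return "metrics_5s"
	}
	return "metrics_1m"
}

func main() {
	for _, d := range []time.Duration{15 * time.Minute, time.Hour, 6 * time.Hour} {
		fmt.Printf("%v -> %s\n", d, tableForRange(d))
	}
}
```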

Development

Backend Tests

cd dashboard/

# Run all tests
go test -v ./...

# Run specific package tests
go test -v ./collector
go test -v ./storage
go test -v ./api

# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

Frontend Tests

cd dashboard/frontend/

# Run tests once
npm run test

# Watch mode
npm run test:watch

# Coverage
npm run test -- --coverage

Building

# Backend binary
go build -o zai-proxy-dashboard .

# Frontend only
cd frontend/
npm run build

# Full Docker image
docker build -t zai-proxy-dashboard:latest .

Project Structure

dashboard/
├── main.go                   # Entry point, server setup
├── go.mod                    # Go dependencies
├── go.sum                    # Go dependency checksums
├── Dockerfile                # Multi-stage container build
├── VERSION                   # Version string
├── api/
│   ├── router.go            # HTTP route handlers
│   ├── middleware.go        # Logging, CORS, recovery
│   └── sse.go               # SSE hub implementation
├── collector/
│   ├── collector.go         # Prometheus scraper
│   └── parser.go            # Prometheus text format parser
├── frontend/
│   ├── package.json         # Node.js dependencies
│   ├── vite.config.ts       # Vite build config
│   ├── tailwind.config.js   # Tailwind CSS config
│   └── src/
│       ├── main.tsx         # React entry point
│       ├── App.tsx          # Main app component
│       └── ...              # Components, hooks, utils
├── logger/
│   └── logger.go            # Structured logging
├── model/
│   └── metrics.go           # Data structures
└── storage/
    ├── storage.go           # SQLite storage layer
    └── schema.go            # Database schema, config

Troubleshooting

Dashboard not showing data

Check collector is scraping:

kubectl logs -f deployment/zai-proxy-dashboard -n mcp | grep -E "scrape|collector"

Expected output:

collector initialized with targets: [http://zai-proxy:8080/metrics]

Check if proxy is reachable:

kubectl exec -n mcp deployment/zai-proxy-dashboard -- wget -O- http://zai-proxy.mcp.svc.cluster.local:8080/metrics

SSE connection drops

Check network connectivity:

# Test SSE endpoint
curl -N http://localhost:8080/api/events

Common causes:

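Proxy timeouts - an ingress or load balancer closing idle streams (raise its idle/read timeout); SCRAPE_TIMEOUT only governs scrapes of zai-proxy, so increase it if slow scrapes are leaving the stream without metrics events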
Network policies blocking connections
Client not handling keep-alive heartbeats

Database errors

Check disk space:

kubectl exec -n mcp deployment/zai-proxy-dashboard -- df -h /data

Verify database file:

kubectl exec -n mcp deployment/zai-proxy-dashboard -- sqlite3 /data/dashboard.db ".schema"

High memory usage

Adjust retention periods:

kubectl set env deployment/zai-proxy-dashboard -n mcp \
  RETENTION_5S=12h \
  RETENTION_1M=72h

Check database size:

kubectl exec -n mcp deployment/zai-proxy-dashboard -- du -sh /data/dashboard.db

Performance

Metric	Target	Typical
Scrape latency	<100ms	20-50ms
Storage write latency	<10ms	1-3ms
Query latency (1h)	<500ms	50-200ms
Query latency (7d)	<2s	500ms-1s
Memory per variant	<50MB	20-30MB
Disk usage (per day)	<100MB	40-60MB

Note: These figures depend on the scrape interval, the number of variants, and request volume.

Monitoring

Logs

# View all logs
kubectl logs -f deployment/zai-proxy-dashboard -n mcp

# Component-specific logs
kubectl logs -f deployment/zai-proxy-dashboard -n mcp | grep "collector"
kubectl logs -f deployment/zai-proxy-dashboard -n mcp | grep "sse"
kubectl logs -f deployment/zai-proxy-dashboard -n mcp | grep "storage"

Health Checks

# Kubernetes: list the Service endpoints (only ready pods appear)
kubectl get endpoints zai-proxy-dashboard -n mcp

# Manual health check
curl http://dashboard-url/healthz

Metrics

The dashboard itself does not export Prometheus metrics (it's a consumer, not a producer). Monitor via:

Container resource usage (CPU, memory)
Database file size
SSE client connection count (logs)

License

See repository license.

Contributing

Contributions welcome! Please:

Write tests for new features
Update documentation
Follow existing code style
Test frontend and backend changes

Support

Documentation: Check parent README.md and docs/ directory
Issues: File in repository
Logs: kubectl logs -f deployment/zai-proxy-dashboard -n mcp
