jedarden 9799d75d2b feat(dashboard): add cache token tracking and running totals panel

Adds cache_read and cache_write token directions throughout the
observability stack so Anthropic prompt-cache billing is visible.

- model/metrics.go: TokensCacheRead, TokensCacheWrite, TokenRateCacheRead,
  TokenRateCacheWrite fields on MetricSnapshot
- collector: reads direction=cache_read/cache_write from
  zai_proxy_tokens_total Prometheus metric
- frontend types.ts: matching TS fields
- TokenPanel: rewritten to show all 4 directions (input, output,
  cache_read, cache_write) on the rate chart; running-total summary
  strip above the chart shows window totals (e.g. "5h window: 1.2M
  input / 340k output / 89k cache_read / 12k cache_write")

Also updates docs/plan/plan.md to accurately document the full
dashboard architecture (backend API, storage schema, SSE hub,
frontend panels, Grafana layer, env vars).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 23:08:28 -04:00


ZAI Proxy Ecosystem — Plan

Last updated: 2026-05-16
Version: proxy/1.10.0, dashboard/1.0.0

Objective

Provide a stable, observable endpoint for LLM agents to access the Z.AI API without exposing the Z.AI API key to calling processes. The proxy is the sole keeper of the credential; agents reach it via cluster-internal DNS — isolation is enforced at the network layer, not via per-agent authentication.

Security Model

Threat	Mitigation
Agent exfiltrates Z.AI key	Key never leaves proxy pod; agents reach the proxy only via cluster-internal DNS (not public); key is not in agent env, logs, or metrics
Network path to proxy compromised	Proxy is not reachable outside the cluster except via Tailscale ingress; no public IP
Log scraping leaks key	Z.AI key is never logged; incoming Authorization header is overwritten before forwarding, never echoed
Metric label leakage	No credential values in metric labels
Runaway agent burns quota	Global adaptive rate limiter + 429 backoff + `MAX_WORKERS` concurrency cap
Z.AI quota exhaustion	429 counter triggers alerts before quota fully consumed
Malformed upstream response	Proxy validates response body before committing; retries on empty/truncated JSON

What the proxy does NOT do:

Validate per-agent credentials (no proxy-key authentication). Any pod that can reach the proxy via cluster DNS is treated as authorized. Access control is the cluster's responsibility.
Cache or store responses.
Load-balance across multiple Z.AI accounts.

Architecture

LLM Agent (Claude Code, NEEDLE worker, etc.)
    │
    │  POST /v1/messages  (or any path)
    │  Authorization: Bearer <any-value>     ← overwritten; not validated
    ▼
┌─────────────────────────────────────────────────────┐
│                    zai-proxy                        │
│                                                     │
│  • Overwrites Authorization → Bearer <zai-api-key>  │
│  • Enforces concurrency cap (MAX_WORKERS)           │
│  • Global adaptive AIMD rate limiter                │
│  • Counts tokens (tiktoken / API-reported)          │
│  • Validates response body; retries on truncation   │
│  • Records metrics (Prometheus)                     │
│  • TranslateRequest: no-op (Z.AI is Claude-native)  │
│                                                     │
└──────────────────┬──────────────────────────────────┘
                   │  HTTPS
                   ▼
           api.z.ai  (Z.AI upstream)

The Z.AI API key lives only as a Kubernetes Secret (sealed-secrets encrypted at rest, injected as an env var into the proxy pod only). No agent process, worker, or tool ever sees the upstream key.
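The header overwrite at the heart of this design can be sketched in a few lines of Go. This is a minimal illustration, not the proxy's actual code: the function name `withUpstreamCredential` and the example key are hypothetical, and the real proxy reads the key from `ZAI_API_KEY`.

```go
package main

import (
	"fmt"
	"net/http"
)

// withUpstreamCredential returns a copy of the incoming headers in which the
// Authorization header is replaced by the upstream key. The caller's value is
// never inspected, validated, or echoed. (Hypothetical helper name.)
func withUpstreamCredential(in http.Header, upstreamKey string) http.Header {
	out := in.Clone()
	if out == nil {
		out = http.Header{}
	}
	// Set replaces every existing value, so nothing the agent sent survives.
	out.Set("Authorization", "Bearer "+upstreamKey)
	return out
}

func main() {
	incoming := http.Header{}
	incoming.Set("Authorization", "Bearer anything-the-agent-sent")

	outgoing := withUpstreamCredential(incoming, "example-upstream-key")
	fmt.Println(outgoing.Get("Authorization"))
}
```

Using `Set` rather than `Add` matters here: `Add` would append a second Authorization value and forward the agent's original header alongside the real key.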

Components

proxy/ — Reverse Proxy (Go)

The core component. Handles:

Credential injection: overwrites the incoming Authorization header with Bearer <ZAI_API_KEY>. No incoming credential is validated — access is controlled entirely by network policy (cluster-internal DNS + Tailscale boundary).
Concurrency cap: MAX_WORKERS (default 10) bounds the number of in-flight requests. Requests beyond the cap receive 503 immediately.
Global adaptive rate limiter (AIMD/EWMA): A single token-bucket limiter serves all traffic. Every 30-second window it inspects the 429 rate from the upstream and adjusts:
- If 429-rate > 5 %: updates the estimated ceiling via EWMA (alpha = 0.3 by default), then drops to ceiling × (1 − hold_margin).
- If 429-rate < 1 %: converges toward the hold position in 50 % steps per window; after probe_interval clean windows, probes above the ceiling to detect upward shifts.
- Rate is bounded by [RATE_LIMIT_MIN, RATE_LIMIT_MAX] (defaults: 1–50 req/s).
- Parameters tunable via env: RATE_LIMIT_CEILING_ALPHA, RATE_LIMIT_HOLD_MARGIN, RATE_LIMIT_PROBE_INTERVAL.
- Reset endpoint: POST /admin/reset-rate-limit resets to initial rate (unauthenticated).
Retry logic: on network error, 429, or truncated/empty response body, the proxy retries up to MAX_RETRIES times (default 3) with exponential backoff (1 s, 2 s, 4 s). If a 429 carries Retry-After, that delay is honoured before the next attempt.
Response validation:
- Non-streaming: reads the full body before committing; retries if empty or invalid JSON.
- Streaming: peeks the first 4 KiB; retries if the stream opens with zero bytes.
- 422 responses are not retried — they indicate a structural request problem. Full request/response bodies are logged for diagnosis.
Token counting: prefers API-reported usage from the response body (usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, usage.cache_creation_input_tokens). Falls back to tiktoken cl100k_base local counting if the response carries no usage block; further falls back to SimpleTokenCounter if tiktoken fails to initialise. Enabled via TOKEN_COUNTING_ENABLED (default true).
Request translation: TranslateRequest is a documented no-op. Z.AI natively accepts the Anthropic Claude wire format (including thinking, cache_control, system arrays). Prior field-stripping translations caused 422 errors and were removed.
Prometheus metrics: exposes /metrics with request counts, latency histograms, token usage by direction and pricing tier, rate-limiter state, retry counts, and build info.
Deployment variants: DEPLOYMENT_VARIANT env distinguishes metric streams from production and canary pods. All Prometheus metrics carry a variant label.
Canary support: two Deployments share the devpod namespace. The canary (zai-proxy-v2) currently carries all production traffic; the original zai-proxy Deployment is scaled to 0. A zai-proxy-canary Service enables weighted traffic splits for testing new versions.
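The per-window adjustment of the adaptive rate limiter described above can be sketched as follows. This is an illustration under stated assumptions, not the proxy's implementation: the type and field names are hypothetical, the EWMA sample is assumed to be the rate in effect when the 429s were observed, the probe is assumed to step 5 % above the ceiling, and windows with a 429 rate between 1 % and 5 % are assumed to leave the rate unchanged.

```go
package main

import "fmt"

// limiter holds the state the window adjustment needs. All names are
// hypothetical; defaults in main mirror the documented env defaults.
type limiter struct {
	rate, ceiling     float64 // current rate and estimated ceiling (req/s)
	min, max          float64 // RATE_LIMIT_MIN / RATE_LIMIT_MAX
	alpha, holdMargin float64 // RATE_LIMIT_CEILING_ALPHA / RATE_LIMIT_HOLD_MARGIN
	probeInterval     int     // RATE_LIMIT_PROBE_INTERVAL
	cleanWindows      int     // consecutive windows with 429-rate < 1 %
}

// adjust is called once per window with the observed 429 fraction (0..1).
func (l *limiter) adjust(rate429 float64) {
	hold := func() float64 { return l.ceiling * (1 - l.holdMargin) }
	switch {
	case rate429 > 0.05:
		// Assumption: the EWMA sample is the rate that triggered the 429s.
		l.ceiling = l.alpha*l.rate + (1-l.alpha)*l.ceiling
		l.rate = hold()
		l.cleanWindows = 0
	case rate429 < 0.01:
		l.cleanWindows++
		if l.cleanWindows >= l.probeInterval {
			l.rate = l.ceiling * 1.05 // Assumption: probe size is 5 %.
			l.cleanWindows = 0
		} else {
			l.rate += (hold() - l.rate) / 2 // 50 % step toward the hold position
		}
	}
	// Clamp to [min, max].
	if l.rate < l.min {
		l.rate = l.min
	}
	if l.rate > l.max {
		l.rate = l.max
	}
}

func main() {
	l := &limiter{rate: 40, ceiling: 30, min: 1, max: 50,
		alpha: 0.3, holdMargin: 0.02, probeInterval: 10}
	l.adjust(0.10) // a window with 10 % of responses being 429s
	fmt.Printf("ceiling=%.2f rate=%.2f\n", l.ceiling, l.rate)
}
```

With the defaults above, a noisy window at 40 req/s pulls the ceiling estimate from 30 to 33 and the rate settles 2 % below it.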

dashboard/ — Metrics Dashboard (Go + React)

The observability layer. Three subsystems work together:

zai-proxy /metrics
      │
      │  HTTP scrape every 5 s (per SCRAPE_TARGETS)
      ▼
┌──────────────────────────────────────────────┐
│  Collector (goroutine per target)            │
│  • Parses Prometheus text format             │
│  • Computes per-interval rates (req/s etc.)  │
│  • Infers variant from target URL            │
│    ("test"/"canary" → canary, else prod)     │
│  • Handles counter resets                    │
└──────────┬───────────────────────────────────┘
           │ MetricSnapshot channel
    ┌──────┴──────┐
    ▼             ▼
┌────────┐   ┌─────────────────────────────────┐
│Storage │   │  SSE Hub (broadcast to clients) │
│        │   │  • "connected" event on join     │
│5s/24h  │   │    (scrape_interval, variants)   │
│1m/7d   │   │  • 30 s keepalive heartbeat      │
│SQLite  │   │  • Drops slow consumers          │
│WAL     │   └─────────────────────────────────┘
└────────┘
      │
      ▼
REST API
  GET /api/events              SSE stream (live)
  GET /api/metrics?range=&variant=  Historical snapshots
  GET /api/status              Latest snapshot per variant
  GET /api/config              Scrape interval + targets
  GET /healthz                 Health check
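The "drops slow consumers" behaviour of the SSE hub is commonly built on a non-blocking channel send. The sketch below shows that pattern; it is an assumption about how such a hub could work, not the dashboard's code, and the names `hub`, `subscribe`, and `broadcast` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// hub fans events out to subscribers, each with its own buffered channel.
type hub struct {
	mu      sync.Mutex
	clients map[chan string]struct{}
}

func newHub() *hub { return &hub{clients: make(map[chan string]struct{})} }

// subscribe registers a client with the given buffer size.
func (h *hub) subscribe(buffer int) chan string {
	ch := make(chan string, buffer)
	h.mu.Lock()
	h.clients[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// broadcast sends event to every client without blocking. A client whose
// buffer is full is removed and its channel closed, so one stalled browser
// tab cannot hold up the collector. Returns the number of clients dropped.
func (h *hub) broadcast(event string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.clients {
		select {
		case ch <- event:
		default:
			delete(h.clients, ch) // deleting during range is safe in Go
			close(ch)
			dropped++
		}
	}
	return dropped
}

func main() {
	h := newHub()
	fast, slow := h.subscribe(4), h.subscribe(1)
	h.broadcast("snapshot-1")
	<-fast // the fast client keeps up; the slow one never reads
	dropped := h.broadcast("snapshot-2")
	fmt.Println("dropped:", dropped, "remaining:", len(h.clients))
	_ = slow
}
```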

Storage schema (SQLite, WAL mode):

Table	Resolution	Retention
`metrics_5s`	5 s	24 h
`metrics_1m`	1 min averages	7 d

QueryRange automatically selects the table: metrics_5s for ranges ≤ 1 h, metrics_1m for longer ranges. Downsampling and the retention purge each run every 10 minutes.
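The table selection rule can be illustrated with a small lookup keyed by the range labels the REST API accepts. This is a sketch only: `tableForRange` and `rangeDurations` are hypothetical names, and the real QueryRange may take a duration or timestamps rather than a label.

```go
package main

import (
	"fmt"
	"time"
)

// rangeDurations maps the API's range labels to durations. "7d" is listed
// explicitly because time.ParseDuration has no day unit.
var rangeDurations = map[string]time.Duration{
	"5m":  5 * time.Minute,
	"15m": 15 * time.Minute,
	"1h":  time.Hour,
	"6h":  6 * time.Hour,
	"24h": 24 * time.Hour,
	"7d":  7 * 24 * time.Hour,
}

// tableForRange picks the high-resolution table for ranges up to and
// including one hour, and the downsampled table for anything longer.
func tableForRange(label string) (string, error) {
	d, ok := rangeDurations[label]
	if !ok {
		return "", fmt.Errorf("unsupported range %q", label)
	}
	if d <= time.Hour {
		return "metrics_5s", nil
	}
	return "metrics_1m", nil
}

func main() {
	for _, label := range []string{"15m", "1h", "6h", "7d"} {
		table, err := tableForRange(label)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-3s -> %s\n", label, table)
	}
}
```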

Note: The deployment uses emptyDir for /data — dashboard history is lost on pod restart. A PVC is commented out in the manifest for future use.

REST API parameters:

GET /api/metrics?range={5m,15m,1h,6h,24h,7d}&variant={production,canary,all}
Returns a JSON array of MetricSnapshot objects

Snapshot fields computed by collector:

Field	Description
`req_rate`	Requests per second (counter rate over interval)
`token_rate_in/out`	Input/output tokens per second
`error_rate_pct`	`5xx / total * 100`
`latency_p50/p95/p99`	Histogram quantiles (ms)
`request_size_avg` / `response_size_avg`	Histogram mean (bytes)
`status_code_rates`	Per-status-code req/s map
`rate_limit_rps`	Current limiter rate
`rate_limit_adj_increase/decrease`	AIMD adjustment counters
`worker_utilization`	`concurrent / max_workers`
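Several of these fields are rates derived from Prometheus counters, which only ever increase until the proxy restarts. The sketch below shows one way to compute a per-interval rate that tolerates a counter reset, plus the `error_rate_pct` formula from the table. The function names are hypothetical, and the reset handling (treating the new value as the whole increase, as Prometheus's own `rate()` does) is an assumption about the collector.

```go
package main

import "fmt"

// counterRate returns the per-second increase of a counter between two
// scrapes taken intervalSec seconds apart.
func counterRate(prev, curr, intervalSec float64) float64 {
	if intervalSec <= 0 {
		return 0
	}
	delta := curr - prev
	if curr < prev {
		// Counter reset (e.g. pod restart): assume it restarted from zero,
		// so everything counted since then is the increase.
		delta = curr
	}
	return delta / intervalSec
}

// errorRatePct implements 5xx / total * 100, guarding against an idle
// interval with no requests.
func errorRatePct(fiveXX, total float64) float64 {
	if total == 0 {
		return 0
	}
	return fiveXX / total * 100
}

func main() {
	fmt.Println(counterRate(100, 150, 5)) // normal interval
	fmt.Println(counterRate(150, 20, 5))  // counter reset between scrapes
	fmt.Println(errorRatePct(1, 4))
}
```

Without the reset branch, a pod restart would show up as a large negative spike in every rate panel.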

Frontend (React/Vite/Tailwind, embedded in binary via //go:embed):

Six panels in a 2×3 responsive grid, each wrapped in an error boundary:

Panel	What it shows
Request Rate	req/s time series
Latency	p50 / p95 / p99 (ms) time series
Tokens	Input, output, cache_read, and cache_write token rates (tokens/s), with a running-total summary strip for the selected window
Concurrency	In-flight requests vs MAX_WORKERS
Rate Limiter	Current rate, AIMD adjustments, rejections
Errors	Error rate %, upstream errors by type

Global controls:

Variant toggle: Production / Canary / Both — filters all panels
Time range selector: 5 m / 15 m / 1 h / 6 h / 24 h
Theme toggle: Dark / Light
Status bar: connection state, req/s, p50, token rate, error %, workers; stale-data indicators per variant
Loading skeleton: shown until first SSE data arrives
Auto-reconnect: exponential backoff with countdown timer + manual reconnect button
History backfill: on connect, fetches REST history for the current time range before live SSE data arrives

Dashboard environment variables:

Variable	Default	Description
`SCRAPE_TARGETS`	`http://zai-proxy.mcp.svc.cluster.local:8080/metrics`	Comma-separated scrape URLs
`SCRAPE_INTERVAL`	`5s`	How often to scrape
`SCRAPE_TIMEOUT`	`3s`	Per-scrape HTTP timeout
`LISTEN_ADDR`	`:8080`	Dashboard listen address
`DB_PATH`	`/data/dashboard.db`	SQLite file path
`RETENTION_5S`	`24h`	High-resolution data retention
`RETENTION_1M`	`168h` (7d)	Downsampled data retention

The default SCRAPE_TARGETS hardcodes the mcp namespace. In deployments where the proxy runs in a different namespace (e.g., devpod), override it via env.
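As an illustration of how SCRAPE_TARGETS might be consumed, the sketch below splits the comma-separated value and infers a variant from each URL using the "test"/"canary" rule described in the collector diagram. The function names are hypothetical, and plain case-insensitive substring matching is an assumption about how the collector applies that rule.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTargets splits a comma-separated SCRAPE_TARGETS value, trimming
// whitespace and skipping empty entries.
func parseTargets(env string) []string {
	var targets []string
	for _, part := range strings.Split(env, ",") {
		if t := strings.TrimSpace(part); t != "" {
			targets = append(targets, t)
		}
	}
	return targets
}

// inferVariant tags a target as canary when its URL mentions "test" or
// "canary", and as production otherwise.
func inferVariant(target string) string {
	lower := strings.ToLower(target)
	if strings.Contains(lower, "test") || strings.Contains(lower, "canary") {
		return "canary"
	}
	return "production"
}

func main() {
	env := "http://zai-proxy.devpod.svc.cluster.local:8080/metrics, " +
		"http://zai-proxy-test.devpod.svc.cluster.local:8080/metrics"
	for _, target := range parseTargets(env) {
		fmt.Printf("%s -> %s\n", target, inferVariant(target))
	}
}
```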

Grafana — Prometheus Dashboard (separate from the React dashboard)

A Grafana dashboard ConfigMap lives at k8s/ardenone-cluster/monitoring/grafana-dashboard-zai-proxy.yml and queries Prometheus directly. Panels:

Panel	Query
Total Requests (1h)	`increase(zai_proxy_requests_total[1h])`
Error Rate	`rate(4xx+5xx) / rate(total)`
429 Errors (1h)	`increase(requests_total{status_code="429"}[1h])`
Response Time p90	`histogram_quantile(0.90, ...)`
Worker Utilization	`sum(zai_proxy_worker_utilization_ratio)`
Rate Limit (current)	`zai_proxy_rate_limit_requests_per_second`
Concurrent Requests	`sum(zai_proxy_concurrent_requests)`
Success Rate	`rate(2xx) / rate(total)`
Request Rate by Status	by `status_code` label
Concurrent vs Max Workers	concurrent + max_workers overlay
Duration Percentiles	p50 / p90 / p99
Request/Response Size p90	histogram_quantile on size histograms
Upstream Errors	by `error_type` label
Rate Limit Behavior	retries by reason + adjustments by direction
Token panels	total / input / output `increase(...[1h])`

Telemetry & Metrics

Token counting

The proxy records token usage after every request. API-reported counts are preferred; tiktoken is the fallback.
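Reading API-reported usage from a non-streaming response body might look like the sketch below. The JSON field names are the ones listed in this document; the Go type, the function name `parseUsage`, and the mapping of `cache_read_input_tokens` to `direction=cache_read` and `cache_creation_input_tokens` to `direction=cache_write` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage mirrors the usage block of a Claude-format response.
type usage struct {
	InputTokens              int `json:"input_tokens"`
	OutputTokens             int `json:"output_tokens"`
	CacheReadInputTokens     int `json:"cache_read_input_tokens"`
	CacheCreationInputTokens int `json:"cache_creation_input_tokens"`
}

// parseUsage returns the API-reported usage and true, or false when the body
// is not valid JSON or has no usage block. A false result is the signal to
// fall back to local token counting.
func parseUsage(body []byte) (usage, bool) {
	var resp struct {
		Usage *usage `json:"usage"`
	}
	if err := json.Unmarshal(body, &resp); err != nil || resp.Usage == nil {
		return usage{}, false
	}
	return *resp.Usage, true
}

// byDirection maps usage fields onto metric direction labels (assumed mapping).
func (u usage) byDirection() map[string]int {
	return map[string]int{
		"input":       u.InputTokens,
		"output":      u.OutputTokens,
		"cache_read":  u.CacheReadInputTokens,
		"cache_write": u.CacheCreationInputTokens,
	}
}

func main() {
	body := []byte(`{"usage":{"input_tokens":120,"output_tokens":45,"cache_read_input_tokens":900}}`)
	if u, ok := parseUsage(body); ok {
		fmt.Println(u.byDirection())
	} else {
		fmt.Println("no usage block; fall back to local counting")
	}
}
```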

Metric	Labels
`zai_proxy_tokens_total`	`direction={input,output,cache_read,cache_write}`, `model`, `variant`, `pricing_tier={peak,off_peak}`
`zai_proxy_request_duration_seconds`	`method`, `path`, `status_code`, `variant`
`zai_proxy_requests_total`	`method`, `path`, `status_code`, `variant`
`zai_proxy_request_size_bytes`	`method`, `path`, `variant`
`zai_proxy_response_size_bytes`	`method`, `path`, `status_code`, `variant`
`zai_proxy_concurrent_requests`	`variant`
`zai_proxy_max_workers`	`variant`
`zai_proxy_worker_utilization_ratio`	`variant`
`zai_proxy_token_count_duration_seconds`	`variant`
`zai_proxy_token_rate_seconds`	`direction`, `model`, `variant`
`zai_proxy_token_rate`	`direction`, `model`, `variant`
`zai_proxy_build_info`	`version`, `variant`, `commit`, `build_time`

Pricing tier: GetPricingTier() returns peak between 02:00 and 06:00 ET (the Z.AI 2× pricing window) and off_peak otherwise. The tier is applied to all tokensTotal observations.
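A pricing-tier check along these lines could be written as below. This is a sketch rather than the actual GetPricingTier(): the function names are hypothetical, "ET" is assumed to mean the America/New_York zone, and the window is assumed to be half-open (02:00 inclusive, 06:00 exclusive), which the document does not specify.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the zone database so LoadLocation works in minimal images
)

// pricingTierForHour classifies an hour of the day in Eastern Time.
// Assumption: the peak window is [02:00, 06:00).
func pricingTierForHour(hourET int) string {
	if hourET >= 2 && hourET < 6 {
		return "peak"
	}
	return "off_peak"
}

// pricingTier converts t to Eastern Time before classifying, so the result
// does not depend on the pod's local time zone or on daylight saving.
func pricingTier(t time.Time, eastern *time.Location) string {
	return pricingTierForHour(t.In(eastern).Hour())
}

func main() {
	eastern, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}
	// 07:30 UTC on 2026-05-16 is 03:30 EDT.
	t := time.Date(2026, 5, 16, 7, 30, 0, 0, time.UTC)
	fmt.Println(pricingTier(t, eastern))
}
```

Converting through a named location rather than a fixed UTC offset keeps the window correct across daylight saving transitions.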

Token header: input token count is also set in X-Token-Input response header so agents can track their own consumption without querying the dashboard.

Rate-limiter metrics

Metric	Labels	Description
`zai_proxy_rate_limit_requests_per_second`	`variant`	Current limiter rate
`zai_proxy_rate_limit_wait_seconds`	`variant`	Time waiting in the limiter
`zai_proxy_rate_limit_adjustments_total`	`direction={increase,decrease,probe}`, `variant`	Algorithm decisions
`zai_proxy_rate_limit_rejections_total`	`variant`	Requests rejected (capacity)
`zai_proxy_retry_attempts_total`	`reason={retry,network_error,429,truncated_response,empty_streaming}`, `variant`	Retry causes
`zai_proxy_upstream_errors_total`	`error_type={422,429,truncated_response,empty_streaming,upstream_connection,write_error,read_error,request_creation}`, `variant`	Error taxonomy

Error classification

Upstream condition	Proxy action
429 + Retry-After	Wait header delay, then retry (up to MAX_RETRIES)
429 no header	Exponential backoff retry
422	Log bodies, no retry, return 422 to client
Empty/invalid JSON body (2xx)	Retry; 502 after MAX_RETRIES
Empty streaming response	Retry; 502 after MAX_RETRIES
Network error	Retry; 502 after MAX_RETRIES
Other 4xx/5xx	Pass through; no retry
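The delay before each retry in the table above can be sketched as a single function: honour Retry-After when present, otherwise back off exponentially (1 s, 2 s, 4 s). The function name `retryDelay` is hypothetical, and this sketch only handles Retry-After given as a number of seconds; the HTTP-date form the header also allows is treated as absent.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// retryDelay returns how long to wait before retry number attempt
// (0-based). retryAfter is the raw Retry-After header value, or "" if the
// response carried none.
func retryDelay(attempt int, retryAfter string) time.Duration {
	if secs, err := strconv.Atoi(strings.TrimSpace(retryAfter)); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	if attempt < 0 {
		attempt = 0
	}
	// 1 s, 2 s, 4 s, ... doubling on every attempt.
	return time.Second << uint(attempt)
}

func main() {
	const maxRetries = 3 // mirrors the documented MAX_RETRIES default
	for attempt := 0; attempt < maxRetries; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt+1, retryDelay(attempt, ""))
	}
	fmt.Println("with Retry-After: 7 ->", retryDelay(0, "7"))
}
```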

Dashboard alerting targets (future)

429 rate from Z.AI > 5 % over 5 m → alert (quota pressure)
p95 latency > 10 s → alert (upstream degradation)
Error rate > 2 % → alert

Environment Variables

See docs/notes/ENVIRONMENT_VARIABLES.md for the full reference. Key variables:

Variable	Default	Description
`ZAI_API_KEY`	required	Upstream Z.AI API key
`DEPLOYMENT_VARIANT`	`production`	Metric stream tag
`MAX_WORKERS`	`10`	Concurrency cap
`TOKEN_COUNTING_ENABLED`	`true`	Enable/disable token counting
`TOKENIZER_MODEL`	`glm-4`	Model label for token metrics
`RATE_LIMIT_INITIAL`	`10.0`	Starting rate (req/s)
`RATE_LIMIT_MIN`	`1.0`	Floor rate
`RATE_LIMIT_MAX`	`50.0`	Ceiling cap
`RATE_LIMIT_CEILING_ALPHA`	`0.3`	EWMA smoothing factor
`RATE_LIMIT_HOLD_MARGIN`	`0.02`	Fraction to hold below the estimated ceiling (0.02 = 2 %)
`RATE_LIMIT_PROBE_INTERVAL`	`10`	Probe above ceiling every N clean windows
`MAX_RETRIES`	`3`	Max retry attempts
`ZAI_TARGET_URL`	`https://api.z.ai/api/anthropic`	Upstream URL

Repository Layout

zai-proxy/                          (git.ardenone.com/jedarden/zai-proxy)
├── proxy/                          Go module: git.ardenone.com/jedarden/zai-proxy
│   ├── main.go                     HTTP server, routing, rate limiter, retry logic
│   ├── translator.go               No-op (Z.AI natively speaks the Claude wire format)
│   ├── bodyparser.go               Body parsing, streaming capture, usage injection
│   ├── tokenizer.go                Token counting (tiktoken cl100k_base + GLM fallback)
│   ├── metrics.go                  Prometheus instrumentation + pricing tier logic
│   ├── evaluation/                 Offline eval harness (token count accuracy vs Anthropic API)
│   ├── cmd/evaluate/               CLI for batch evaluation
│   ├── cmd/demo-eval/              Demo evaluation runner
│   ├── scripts/                    Load test, canary integration, benchmarks
│   ├── tests/                      Integration and regression test suites
│   └── Dockerfile                  Production image
├── dashboard/                      Go module: git.ardenone.com/jedarden/zai-proxy/dashboard
│   ├── main.go                     HTTP server + SSE broadcaster
│   ├── collector/                  Prometheus scraper + parser
│   ├── api/                        REST + SSE handlers
│   ├── storage/                    SQLite persistence layer
│   ├── model/                      Shared metric data types
│   ├── logger/                     Structured logger
│   └── frontend/                   React/Vite/Tailwind dashboard UI
└── docs/
    ├── plan/plan.md                This document
    ├── notes/                      Deployment, operations, canary procedures
    └── research/                   Tokenizer research, metrics references

CI/CD

Build templates live in jedarden/declarative-config → k8s/iad-ci/argo-workflows/:

Template	Builds	Pushes to
`zai-proxy-build`	`proxy/`	`ronaldraygun/zai-proxy:{VERSION}`
`zai-proxy-dashboard-build`	`dashboard/`	`ronaldraygun/zai-proxy-dashboard:{VERSION}`

Both templates clone from git.ardenone.com/jedarden/zai-proxy (no auth required). Versions are read from proxy/VERSION and dashboard/VERSION respectively.

Triggering a build:

kubectl --kubeconfig=/home/coding/.kube/iad-ci.kubeconfig create -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: zai-proxy-build-manual-
  namespace: argo-workflows
spec:
  workflowTemplateRef:
    name: zai-proxy-build
EOF

Deployment

Both components deploy to the devpod namespace on ardenone-cluster via ArgoCD from jedarden/declarative-config.

Key manifests:

k8s/ardenone-cluster/devpod/zai-proxy.yml — original Deployment (currently replicas=0)
k8s/ardenone-cluster/devpod/zai-proxy-v2.yml — active production Deployment
k8s/ardenone-cluster/devpod/zai-proxy-canary-deployment.yml — canary config
k8s/ardenone-cluster/devpod/zai-proxy-canary-service.yml — weighted traffic split
k8s/ardenone-cluster/devpod/zai-proxy-tailscale.yml — Tailscale ingress
k8s/ardenone-cluster/devpod/zai-proxy-servicemonitor.yml — Prometheus scrape target
k8s/ardenone-cluster/monitoring/grafana-dashboard-zai-proxy.yml — Grafana dashboard

The Z.AI API key flows: OpenBao → ESO ExternalSecret → K8s Secret → proxy pod env (read once at startup; never written to any metric, log, or response).

Workers reach the proxy via cluster-internal DNS:

Production: http://zai-proxy.devpod.svc.cluster.local:8080/api/anthropic
Canary: http://zai-proxy-test.devpod.svc.cluster.local:8080/api/anthropic

Operations

Document	What it covers
`docs/notes/ENVIRONMENT_VARIABLES.md`	Full env var reference
`docs/notes/DEPLOYMENT.md`	Production/canary dual-deploy workflow
`docs/notes/CANARY_PROMOTION_PROCEDURE.md`	Step-by-step canary promotion
`docs/notes/CANARY_PROMOTION_CHECKLIST.md`	Go/no-go checklist
`docs/notes/CANARY_ROLLBACK_PROCEDURE.md`	Rollback triggers and steps
`docs/notes/CANARY_TROUBLESHOOTING_GUIDE.md`	Common canary issues
`docs/notes/REGRESSION_TESTING.md`	Regression test suite overview
`docs/notes/REGRESSION_TEST_GUIDE.md`	Running regression tests
`docs/notes/TOKEN_COUNTING.md`	Token counting design and validation
`docs/notes/TOKENIZER_CONFIGURATION.md`	Tokenizer tuning
`docs/notes/MONITORING_SETUP.md`	Grafana + Prometheus setup
`docs/notes/zai-proxy-rate-limiting.md`	Adaptive rate limiter deep-dive
`docs/notes/TROUBLESHOOTING.md`	General troubleshooting

Migration Status

Source extracted from ardenone-cluster/containers/zai-proxy → proxy/
Source extracted from ardenone-cluster/containers/zai-proxy-dashboard → dashboard/
Go module paths updated to git.ardenone.com/jedarden/zai-proxy[/dashboard]
Argo Workflow templates created (zai-proxy-build, zai-proxy-dashboard-build)
Push new workflow templates to declarative-config (triggers ArgoCD sync)
Update CLAUDE.md / ardenone-cluster README to point to new repo
Retire ardenone-cluster/containers/zai-proxy and containers/zai-proxy-dashboard once builds verified from new repo
