Z.AI Proxy
A production-ready HTTP proxy for the Z.AI API with token counting, adaptive rate limiting, and comprehensive observability.
Features
- ✅ Token Counting - Accurate input/output token tracking using tiktoken
- ✅ Adaptive Rate Limiting - Automatically adjusts to API limits
- ✅ Prometheus Metrics - Full observability with detailed metrics
- ✅ Streaming Support - Handles SSE (Server-Sent Events) streaming responses
- ✅ Graceful Degradation - Never fails requests due to token counting errors
- ✅ Production Ready - Thread-safe, tested, and battle-hardened
Quick Start
Run Locally
# Set required environment variables
export ZAI_API_KEY="your-api-key-here"
# Run the proxy
go run main.go tokenizer.go
# Proxy listens on :8080
# Metrics available at :8080/metrics
Docker Deployment
# Build image
docker build -t zai-proxy:latest .
# Run container
docker run -p 8080:8080 \
-e ZAI_API_KEY="your-api-key" \
-e TOKEN_COUNTING_ENABLED=true \
zai-proxy:latest
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: zai-proxy
  template:
    metadata:
      labels:
        app: zai-proxy
    spec:
      containers:
        - name: zai-proxy
          image: ghcr.io/ardenone/zai-proxy:latest
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: ZAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: zai-api-key
                  key: api-key
            - name: TOKEN_COUNTING_ENABLED
              value: "true"
            - name: TOKENIZER_MODEL
              value: "glm-4"
            - name: MAX_WORKERS
              value: "50"
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              cpu: 2000m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  selector:
    app: zai-proxy
  ports:
    - port: 8080
      targetPort: 8080
Configuration
Environment Variables
| Variable | Type | Default | Description |
|---|---|---|---|
| ZAI_API_KEY | String | Required | Z.AI API key for upstream authentication |
| TOKEN_COUNTING_ENABLED | Boolean | true | Enable/disable token counting |
| TOKENIZER_MODEL | String | glm-4 | Model name for Prometheus metrics labels |
| MAX_WORKERS | Integer | 10 | Maximum concurrent requests |
| RATE_LIMIT_INITIAL | Float | 10.0 | Initial rate limit (requests/second) |
| RATE_LIMIT_MIN | Float | 1.0 | Minimum rate limit (requests/second) |
| RATE_LIMIT_MAX | Float | 50.0 | Maximum rate limit (requests/second) |
| MAX_RETRIES | Integer | 3 | Maximum retry attempts for failed requests |
See docs/ENVIRONMENT_VARIABLES.md for the complete reference.
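As an illustration of how the variables and defaults in the table could be read at startup, here is a minimal Go sketch. The `Config` struct, `lookupFunc`, and `loadConfig` are hypothetical names chosen for this example and are not taken from main.go; only the variable names and default values come from the table.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Config holds the settings from the table above. The struct, field, and
// function names are hypothetical; only the variable names and defaults
// come from this README.
type Config struct {
	APIKey               string
	TokenCountingEnabled bool
	TokenizerModel       string
	MaxWorkers           int
	RateLimitInitial     float64
	RateLimitMin         float64
	RateLimitMax         float64
	MaxRetries           int
}

// lookupFunc has the same shape as os.LookupEnv, so a map-backed fake can
// stand in for the real environment.
type lookupFunc func(key string) (string, bool)

// loadConfig starts from the documented defaults and overrides them with any
// variables that are set. A malformed value is reported as an error.
func loadConfig(get lookupFunc) (Config, error) {
	cfg := Config{
		TokenCountingEnabled: true,
		TokenizerModel:       "glm-4",
		MaxWorkers:           10,
		RateLimitInitial:     10.0,
		RateLimitMin:         1.0,
		RateLimitMax:         50.0,
		MaxRetries:           3,
	}
	var ok bool
	if cfg.APIKey, ok = get("ZAI_API_KEY"); !ok || cfg.APIKey == "" {
		return cfg, errors.New("ZAI_API_KEY is required")
	}
	if v, ok := get("TOKENIZER_MODEL"); ok {
		cfg.TokenizerModel = v
	}
	var err error
	if v, ok := get("TOKEN_COUNTING_ENABLED"); ok {
		if cfg.TokenCountingEnabled, err = strconv.ParseBool(v); err != nil {
			return cfg, fmt.Errorf("TOKEN_COUNTING_ENABLED: %w", err)
		}
	}
	ints := map[string]*int{"MAX_WORKERS": &cfg.MaxWorkers, "MAX_RETRIES": &cfg.MaxRetries}
	for name, dst := range ints {
		if v, ok := get(name); ok {
			if *dst, err = strconv.Atoi(v); err != nil {
				return cfg, fmt.Errorf("%s: %w", name, err)
			}
		}
	}
	floats := map[string]*float64{
		"RATE_LIMIT_INITIAL": &cfg.RateLimitInitial,
		"RATE_LIMIT_MIN":     &cfg.RateLimitMin,
		"RATE_LIMIT_MAX":     &cfg.RateLimitMax,
	}
	for name, dst := range floats {
		if v, ok := get(name); ok {
			if *dst, err = strconv.ParseFloat(v, 64); err != nil {
				return cfg, fmt.Errorf("%s: %w", name, err)
			}
		}
	}
	return cfg, nil
}

func main() {
	// A fake environment keeps the example deterministic; a real program
	// would pass os.LookupEnv instead.
	env := map[string]string{"ZAI_API_KEY": "your-api-key-here", "MAX_WORKERS": "50"}
	cfg, err := loadConfig(func(k string) (string, bool) { v, ok := env[k]; return v, ok })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("rate limit: %.1f req/s (min %.1f, max %.1f), workers: %d\n",
		cfg.RateLimitInitial, cfg.RateLimitMin, cfg.RateLimitMax, cfg.MaxWorkers)
}
```

Passing the lookup function as a parameter is a design choice for testability; it avoids mutating process-wide environment state in tests.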
Token Counting
The proxy automatically counts input and output tokens for all requests using tiktoken's cl100k_base encoding. cl100k_base is an OpenAI encoding, so counts for Claude 3 and GLM-4 models are approximations (see Known Limitations).
How It Works
┌─────────────┐
│ Client │
└──────┬──────┘
│ Request
↓
┌─────────────────────────────────────┐
│ Proxy: Count Input Tokens │
│ • Parse request messages │
│ • Tokenize using tiktoken │
│ • Metric: zai_proxy_tokens_total │
└──────┬──────────────────────────────┘
│
↓
┌─────────────┐
│ Z.AI API │
└──────┬──────┘
│ Response (streaming)
↓
┌─────────────────────────────────────┐
│ Proxy: Stream + Capture │
│ • Stream to client (zero-copy) │
│ • Capture content in background │
│ • Count output tokens after stream │
│ • Metric: zai_proxy_tokens_total │
└──────┬──────────────────────────────┘
│
↓
┌─────────────┐
│ Client │
└─────────────┘
Quick Configuration
# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true
export TOKENIZER_MODEL=glm-4
# Disable token counting
export TOKEN_COUNTING_ENABLED=false
Monitoring Token Usage
View logs:
kubectl logs -f deployment/zai-proxy -n mcp | grep "Token usage"
# Output: Token usage: input=123, output=456
Query Prometheus:
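# Total tokens per minute (summed across directions, models, and pods)
sum(rate(zai_proxy_tokens_total[5m])) * 60
# Output-to-input token ratio (aggregated so the two sides share labels)
sum(rate(zai_proxy_tokens_total{direction="output"}[5m])) /
sum(rate(zai_proxy_tokens_total{direction="input"}[5m]))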
# Token counting latency (should be <1ms)
histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))
See docs/TOKEN_COUNTING.md for a comprehensive guide.
Prometheus Metrics
The proxy exports metrics at :8080/metrics:
Request Metrics
| Metric | Type | Description |
|---|---|---|
| zai_proxy_requests_total | Counter | Total requests by method, path, status |
| zai_proxy_request_duration_seconds | Histogram | Request duration |
| zai_proxy_concurrent_requests | Gauge | Active concurrent requests |
| zai_proxy_upstream_errors_total | Counter | Upstream errors by type |
Token Metrics
| Metric | Type | Description |
|---|---|---|
| zai_proxy_tokens_total | Counter | Total tokens by direction (input/output) and model |
| zai_proxy_token_count_duration_seconds | Histogram | Token counting latency |
| zai_proxy_token_rate | Histogram | Token processing rate (tokens/second) |
Rate Limiting Metrics
| Metric | Type | Description |
|---|---|---|
| zai_proxy_rate_limit_requests_per_second | Gauge | Current rate limit |
| zai_proxy_rate_limit_wait_seconds | Histogram | Rate limiter wait time |
| zai_proxy_rate_limit_adjustments_total | Counter | Rate limit adjustments (increase/decrease) |
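This README does not describe the adjustment policy behind these metrics. One common approach is additive increase, multiplicative decrease (AIMD), sketched below in Go purely as an assumption about how a rate could move between RATE_LIMIT_MIN and RATE_LIMIT_MAX. The type `AdaptiveLimit`, its methods, the +1.0 step, and the 0.5 factor are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// AdaptiveLimit tracks a requests-per-second value bounded by [lo, hi].
// The AIMD policy here is an assumption, not the proxy's documented behavior.
type AdaptiveLimit struct {
	mu     sync.Mutex
	rate   float64
	lo, hi float64
}

// NewAdaptiveLimit starts at initial, clamped into [lo, hi].
func NewAdaptiveLimit(initial, lo, hi float64) *AdaptiveLimit {
	return &AdaptiveLimit{rate: clamp(initial, lo, hi), lo: lo, hi: hi}
}

func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// OnSuccess raises the rate by a fixed step (additive increase).
func (a *AdaptiveLimit) OnSuccess() float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.rate = clamp(a.rate+1.0, a.lo, a.hi)
	return a.rate
}

// OnThrottled halves the rate (multiplicative decrease), for example after
// the upstream API answers with HTTP 429.
func (a *AdaptiveLimit) OnThrottled() float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.rate = clamp(a.rate*0.5, a.lo, a.hi)
	return a.rate
}

// Rate returns the current limit; a gauge could be set from this value.
func (a *AdaptiveLimit) Rate() float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.rate
}

func main() {
	// Values taken from the documented defaults: initial 10, min 1, max 50.
	limit := NewAdaptiveLimit(10.0, 1.0, 50.0)
	fmt.Println("after throttle:", limit.OnThrottled())
	fmt.Println("after success:", limit.OnSuccess())
}
```

Under this policy, each call to `OnSuccess` or `OnThrottled` that changes the rate would correspond to one increment of an adjustments counter, labeled by direction.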
Usage Example
# Make a request through the proxy
curl -X POST http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: $ZAI_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-3-sonnet",
"messages": [
{"role": "user", "content": "Hello, Claude!"}
],
"max_tokens": 100,
"stream": true
}'
# Check token usage in logs
# Output: Token usage: input=5, output=12
# Query metrics
curl http://localhost:8080/metrics | grep zai_proxy_tokens_total
# zai_proxy_tokens_total{direction="input",model="glm-4"} 5
# zai_proxy_tokens_total{direction="output",model="glm-4"} 12
Development
Running Tests
# Run all tests
go test -v ./...
# Run token counting tests
go test -v -run TestTikToken
# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
Building
# Build binary
go build -o zai-proxy main.go tokenizer.go
# Build Docker image (use GitHub Actions for devpod environments)
docker build -t zai-proxy:dev .
Note: Docker builds in devpod environments may fail with overlayfs errors. See docs/DEVPOD_DOCKER_BUILD_LIMITATION.md for details and the recommended GitHub Actions build workflow.
Project Structure
zai-proxy/
├── main.go # Proxy server
├── tokenizer.go # Token counting implementation
├── tokenizer_test.go # Token counting tests
├── main_test.go # Integration tests
├── docs/
│ ├── TOKEN_COUNTING.md # Token counting guide (comprehensive)
│ ├── ENVIRONMENT_VARIABLES.md # Environment variable reference
│ ├── TOKENIZER_CONFIGURATION.md # Tokenizer configuration
│ └── ...
├── RESPONSE_TOKEN_COUNTING.md # Implementation notes
├── TOKEN_COUNTING_WORKFLOW.md # Development workflow
├── go.mod # Go dependencies
└── Dockerfile # Container image
Documentation
- TOKEN_COUNTING.md - Comprehensive token counting guide
  - How it works internally (architecture)
  - Response format specification
  - Configuration options
  - Prometheus metrics reference
  - Code examples and usage
  - Known limitations
  - Troubleshooting guide
- ENVIRONMENT_VARIABLES.md - Environment variable reference
- TOKENIZER_CONFIGURATION.md - Tokenizer configuration
- DEVPOD_DOCKER_BUILD_LIMITATION.md - Devpod Docker build limitations and GitHub Actions workaround
- RESPONSE_TOKEN_COUNTING.md - Implementation notes
- TOKEN_COUNTING_WORKFLOW.md - Development workflow
Troubleshooting
Token counting not working
Check startup logs:
kubectl logs deployment/zai-proxy -n mcp | grep -i token
Expected output:
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
If disabled:
Token counting disabled (TOKEN_COUNTING_ENABLED=false)
Fix:
kubectl set env deployment/zai-proxy -n mcp TOKEN_COUNTING_ENABLED=true
kubectl rollout restart deployment/zai-proxy -n mcp
Token counts seem inaccurate
Check if fallback tokenizer is active:
kubectl logs deployment/zai-proxy -n mcp | grep -i fallback
If you see:
Falling back to SimpleTokenCounter
This means tiktoken failed to initialize. The fallback uses word count approximation (~30% variance).
Resolution: Rebuild with tiktoken dependencies
High token counting latency
Query latency:
histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))
Expected: <1ms for 99th percentile
If >5ms: Increase CPU limits or reduce concurrent requests
See docs/TOKEN_COUNTING.md#troubleshooting-guide for the complete guide.
Known Limitations
- No usage injection - Token counts are logged and exported as metrics but not added to response bodies
  - Workaround: Check logs or query Prometheus
  - Future enhancement planned
- Hardcoded model label - The TOKENIZER_MODEL env var applies to all requests
  - Workaround: Use separate proxy instances per model
  - Future: Extract model from request body dynamically
- Tiktoken assumptions - Uses cl100k_base encoding for all models
  - Works well for Claude 3 (<3% variance)
  - May have variance for GLM-4 (<10% expected)
See docs/TOKEN_COUNTING.md#known-limitations for details.
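The planned dynamic model extraction could look roughly like the Go sketch below, which reads the `model` field from a JSON request body and falls back to the configured label. This is a speculative illustration of the future enhancement, not existing behavior; the `modelLabel` function and the allow-list parameter are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelLabel returns the "model" field of a JSON request body when it is
// present and allowed, and the configured fallback otherwise. Restricting
// labels to a known set keeps client-supplied values from creating an
// unbounded number of Prometheus time series.
func modelLabel(body []byte, allowed map[string]bool, fallback string) string {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return fallback
	}
	if req.Model == "" || !allowed[req.Model] {
		return fallback
	}
	return req.Model
}

func main() {
	allowed := map[string]bool{"glm-4": true, "claude-3-sonnet": true}
	body := []byte(`{"model": "claude-3-sonnet", "max_tokens": 100}`)
	fmt.Println(modelLabel(body, allowed, "glm-4"))
}
```

Because the handler must still forward the request upstream, the body would need to be read into memory once and then re-wrapped (for example with `bytes.NewReader`) before proxying.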
Performance
| Metric | Target | Typical |
|---|---|---|
| Request latency overhead | <5ms | <1ms |
| Token counting latency | <1ms | 0.3-0.8ms |
| Streaming overhead | 0ms | 0ms (zero-copy) |
| Memory per request | <5KB | ~2KB |
Output token counting happens after streaming completes, so it does not add to end-user latency. Input token counting runs before the request is forwarded and typically takes under 1ms.
License
See repository license.
Contributing
Contributions welcome! Please:
- Read existing documentation
- Write tests for new features
- Update documentation
- Follow existing code style
Support
- Documentation: Check docs/ directory
- Issues: File in repository
- Logs: kubectl logs -f deployment/zai-proxy -n mcp
- Metrics: http://zai-proxy.mcp.svc.cluster.local:8080/metrics