jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00


Z.AI Proxy

A production-ready HTTP proxy for the Z.AI API with token counting, adaptive rate limiting, and comprehensive observability.

Features

✅ Token Counting - Accurate input/output token tracking using tiktoken
✅ Adaptive Rate Limiting - Automatically adjusts to API limits
✅ Prometheus Metrics - Full observability with detailed metrics
✅ Streaming Support - Handles SSE (Server-Sent Events) streaming responses
✅ Graceful Degradation - Never fails requests due to token counting errors
✅ Production Ready - Thread-safe, tested, and battle-hardened

Quick Start

Run Locally

# Set required environment variables
export ZAI_API_KEY="your-api-key-here"

# Run the proxy
go run main.go tokenizer.go

# Proxy listens on :8080
# Metrics available at :8080/metrics

Docker Deployment

# Build image
docker build -t zai-proxy:latest .

# Run container
docker run -p 8080:8080 \
  -e ZAI_API_KEY="your-api-key" \
  -e TOKEN_COUNTING_ENABLED=true \
  zai-proxy:latest

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: zai-proxy
  template:
    metadata:
      labels:
        app: zai-proxy
    spec:
      containers:
      - name: zai-proxy
        image: ghcr.io/ardenone/zai-proxy:latest
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: ZAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: zai-api-key
              key: api-key
        - name: TOKEN_COUNTING_ENABLED
          value: "true"
        - name: TOKENIZER_MODEL
          value: "glm-4"
        - name: MAX_WORKERS
          value: "50"
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
          limits:
            cpu: 2000m
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  selector:
    app: zai-proxy
  ports:
  - port: 8080
    targetPort: 8080

Configuration

Environment Variables

Variable	Type	Default	Description
`ZAI_API_KEY`	String	Required	Z.AI API key for upstream authentication
`TOKEN_COUNTING_ENABLED`	Boolean	`true`	Enable/disable token counting
`TOKENIZER_MODEL`	String	`glm-4`	Model name for Prometheus metrics labels
`MAX_WORKERS`	Integer	`10`	Maximum concurrent requests
`RATE_LIMIT_INITIAL`	Float	`10.0`	Initial rate limit (requests/second)
`RATE_LIMIT_MIN`	Float	`1.0`	Minimum rate limit (requests/second)
`RATE_LIMIT_MAX`	Float	`50.0`	Maximum rate limit (requests/second)
`MAX_RETRIES`	Integer	`3`	Maximum retry attempts for failed requests

See docs/ENVIRONMENT_VARIABLES.md for the complete reference.
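As an illustration of how the variables and defaults in the table fit together, here is a minimal Go sketch of a loader. The `Config` struct, `LoadConfig`, and the `min <= initial <= max` check are assumptions made for this example and are not taken from the proxy's source.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config mirrors the table above. Type and function names are illustrative.
type Config struct {
	APIKey               string
	TokenCountingEnabled bool
	TokenizerModel       string
	MaxWorkers           int
	MaxRetries           int
	RateInitial          float64
	RateMin              float64
	RateMax              float64
}

// LoadConfig reads settings through getenv (os.Getenv in production) and
// applies the documented default whenever a variable is unset or empty.
func LoadConfig(getenv func(string) string) (Config, error) {
	get := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}
	var c Config
	var err error
	if c.APIKey = getenv("ZAI_API_KEY"); c.APIKey == "" {
		return c, fmt.Errorf("ZAI_API_KEY is required")
	}
	if c.TokenCountingEnabled, err = strconv.ParseBool(get("TOKEN_COUNTING_ENABLED", "true")); err != nil {
		return c, fmt.Errorf("TOKEN_COUNTING_ENABLED: %w", err)
	}
	c.TokenizerModel = get("TOKENIZER_MODEL", "glm-4")
	if c.MaxWorkers, err = strconv.Atoi(get("MAX_WORKERS", "10")); err != nil {
		return c, fmt.Errorf("MAX_WORKERS: %w", err)
	}
	if c.MaxRetries, err = strconv.Atoi(get("MAX_RETRIES", "3")); err != nil {
		return c, fmt.Errorf("MAX_RETRIES: %w", err)
	}
	floats := []struct {
		key, def string
		dst      *float64
	}{
		{"RATE_LIMIT_INITIAL", "10.0", &c.RateInitial},
		{"RATE_LIMIT_MIN", "1.0", &c.RateMin},
		{"RATE_LIMIT_MAX", "50.0", &c.RateMax},
	}
	for _, f := range floats {
		if *f.dst, err = strconv.ParseFloat(get(f.key, f.def), 64); err != nil {
			return c, fmt.Errorf("%s: %w", f.key, err)
		}
	}
	// Assumed sanity check: the initial rate must lie within [min, max].
	if c.RateMin > c.RateInitial || c.RateInitial > c.RateMax {
		return c, fmt.Errorf("rate limits must satisfy min <= initial <= max")
	}
	return c, nil
}

func main() {
	cfg, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("workers=%d rate=%.1f [%.1f, %.1f]\n",
		cfg.MaxWorkers, cfg.RateInitial, cfg.RateMin, cfg.RateMax)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the loader testable with a plain map.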

Token Counting

The proxy automatically counts input and output tokens for all requests using tiktoken's cl100k_base encoding. cl100k_base is an OpenAI encoding, so counts for Claude 3 and GLM-4 are approximations (see Known Limitations).

How It Works

┌─────────────┐
│   Client    │
└──────┬──────┘
       │ Request
       ↓
┌─────────────────────────────────────┐
│  Proxy: Count Input Tokens          │
│  • Parse request messages           │
│  • Tokenize using tiktoken          │
│  • Metric: zai_proxy_tokens_total   │
└──────┬──────────────────────────────┘
       │
       ↓
┌─────────────┐
│  Z.AI API   │
└──────┬──────┘
       │ Response (streaming)
       ↓
┌─────────────────────────────────────┐
│  Proxy: Stream + Capture            │
│  • Stream to client (zero-copy)     │
│  • Capture content in background    │
│  • Count output tokens after stream │
│  • Metric: zai_proxy_tokens_total   │
└──────┬──────────────────────────────┘
       │
       ↓
┌─────────────┐
│   Client    │
└─────────────┘
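The input-side step in the diagram (parse request messages, then tokenize) could look roughly like the sketch below. It assumes each message's `content` is a plain string, as in the usage example later in this README, and it takes the tokenizer as a function argument so the example runs without tiktoken. `CountInputTokens` and `chatRequest` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// chatRequest covers only the fields this sketch needs.
// Assumption: message content is a plain string, not a list of content blocks.
type chatRequest struct {
	Messages []struct {
		Role    string `json:"role"`
		Content string `json:"content"`
	} `json:"messages"`
}

// CountInputTokens parses the request body and sums count(content) over all
// messages. On a body it cannot parse it returns 0 and the error, so the
// caller can log the problem and still forward the request unchanged.
func CountInputTokens(body []byte, count func(string) int) (int, error) {
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return 0, err
	}
	total := 0
	for _, m := range req.Messages {
		total += count(m.Content)
	}
	return total, nil
}

func main() {
	body := []byte(`{"model":"claude-3-sonnet","messages":[{"role":"user","content":"Hello, Claude!"}]}`)
	// Stand-in tokenizer: whitespace-separated words instead of tiktoken.
	words := func(s string) int { return len(strings.Fields(s)) }
	n, err := CountInputTokens(body, words)
	if err != nil {
		fmt.Println("token counting skipped:", err)
		return
	}
	fmt.Println("input tokens (approximate):", n)
}
```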

Quick Configuration

# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true
export TOKENIZER_MODEL=glm-4

# Disable token counting
export TOKEN_COUNTING_ENABLED=false

Monitoring Token Usage

View logs:

kubectl logs -f deployment/zai-proxy -n mcp | grep "Token usage"
# Output: Token usage: input=123, output=456

Query Prometheus:

# Total tokens per minute (all directions and models)
sum(rate(zai_proxy_tokens_total[5m])) * 60

# Output-to-input token ratio
sum(rate(zai_proxy_tokens_total{direction="output"}[5m])) /
sum(rate(zai_proxy_tokens_total{direction="input"}[5m]))

# Token counting latency (should be <1ms)
histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))

See docs/TOKEN_COUNTING.md for a comprehensive guide.

Prometheus Metrics

The proxy exports metrics at :8080/metrics:

Request Metrics

Metric	Type	Description
`zai_proxy_requests_total`	Counter	Total requests by method, path, status
`zai_proxy_request_duration_seconds`	Histogram	Request duration
`zai_proxy_concurrent_requests`	Gauge	Active concurrent requests
`zai_proxy_upstream_errors_total`	Counter	Upstream errors by type

Token Metrics

Metric	Type	Description
`zai_proxy_tokens_total`	Counter	Total tokens by direction (input/output) and model
`zai_proxy_token_count_duration_seconds`	Histogram	Token counting latency
`zai_proxy_token_rate`	Histogram	Token processing rate (tokens/second)

Rate Limiting Metrics

Metric	Type	Description
`zai_proxy_rate_limit_requests_per_second`	Gauge	Current rate limit
`zai_proxy_rate_limit_wait_seconds`	Histogram	Rate limiter wait time
`zai_proxy_rate_limit_adjustments_total`	Counter	Rate limit adjustments (increase/decrease)
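This README does not describe the adjustment algorithm behind these metrics. One common approach is additive increase, multiplicative decrease (AIMD): raise the rate slowly on success, halve it when the upstream returns HTTP 429, and clamp to the configured minimum and maximum. The sketch below illustrates that idea only. The `AdaptiveLimit` type, the step of 1 request/second, and the halving factor are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// AdaptiveLimit holds a requests-per-second value bounded by [lo, hi].
// Hypothetical type; the proxy's real limiter may work differently.
type AdaptiveLimit struct {
	mu           sync.Mutex
	rate, lo, hi float64
}

func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// NewAdaptiveLimit starts at initial, clamped into [lo, hi].
func NewAdaptiveLimit(initial, lo, hi float64) *AdaptiveLimit {
	return &AdaptiveLimit{rate: clamp(initial, lo, hi), lo: lo, hi: hi}
}

// OnResponse updates the rate from an upstream status code and returns it:
// 429 halves the rate, 2xx adds 1 request/second, anything else is ignored.
func (a *AdaptiveLimit) OnResponse(status int) float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	switch {
	case status == 429:
		a.rate = clamp(a.rate/2, a.lo, a.hi)
	case status >= 200 && status < 300:
		a.rate = clamp(a.rate+1, a.lo, a.hi)
	}
	return a.rate
}

func main() {
	// Defaults from the Environment Variables table: initial 10, min 1, max 50.
	limit := NewAdaptiveLimit(10.0, 1.0, 50.0)
	for _, status := range []int{200, 200, 429, 200} {
		fmt.Printf("status=%d rate=%.2f\n", status, limit.OnResponse(status))
	}
}
```

Under this scheme, each change to the rate would correspond to one increment of `zai_proxy_rate_limit_adjustments_total`, and the returned value to `zai_proxy_rate_limit_requests_per_second`.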

Usage Example

# Make a request through the proxy
curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ZAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-3-sonnet",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ],
    "max_tokens": 100,
    "stream": true
  }'

# Check token usage in logs
# Output: Token usage: input=5, output=12

# Query metrics
curl http://localhost:8080/metrics | grep zai_proxy_tokens_total
# zai_proxy_tokens_total{direction="input",model="glm-4"} 5
# zai_proxy_tokens_total{direction="output",model="glm-4"} 12
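For scripted checks, the `grep` step above can be replaced by a small parser. The following Go sketch sums `zai_proxy_tokens_total` samples by `direction` from Prometheus text exposition output. It is a simplified parser written for this example (it assumes label values contain no commas or escaped quotes), and `TokenTotals` is a hypothetical helper, not part of the proxy.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

const metricPrefix = "zai_proxy_tokens_total{"

// labelValue returns the value of one label from a `k="v",k2="v2"` string.
// Simplification: values must not contain commas or escaped quotes.
func labelValue(labels, name string) string {
	for _, pair := range strings.Split(labels, ",") {
		k, v, ok := strings.Cut(pair, "=")
		if ok && strings.TrimSpace(k) == name {
			return strings.Trim(strings.TrimSpace(v), `"`)
		}
	}
	return ""
}

// TokenTotals sums zai_proxy_tokens_total samples per direction label.
// Comment lines and other metrics are skipped.
func TokenTotals(exposition string) (map[string]float64, error) {
	totals := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metricPrefix) {
			continue
		}
		end := strings.LastIndex(line, "}")
		if end < len(metricPrefix) {
			continue
		}
		labels := line[len(metricPrefix):end]
		fields := strings.Fields(line[end+1:]) // value, then optional timestamp
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad sample %q: %w", line, err)
		}
		totals[labelValue(labels, "direction")] += v
	}
	return totals, sc.Err()
}

func main() {
	// Sample taken from the metrics output shown above.
	sample := `zai_proxy_tokens_total{direction="input",model="glm-4"} 5
zai_proxy_tokens_total{direction="output",model="glm-4"} 12
`
	totals, err := TokenTotals(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("input=%v output=%v\n", totals["input"], totals["output"])
}
```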

Development

Running Tests

# Run all tests
go test -v ./...

# Run token counting tests
go test -v -run TestTikToken

# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

Building

# Build binary
go build -o zai-proxy main.go tokenizer.go

# Build Docker image (use GitHub Actions for devpod environments)
docker build -t zai-proxy:dev .

Note: Docker builds in devpod environments may fail with overlayfs errors. See docs/DEVPOD_DOCKER_BUILD_LIMITATION.md for details and the recommended GitHub Actions build workflow.

Project Structure

zai-proxy/
├── main.go                   # Proxy server
├── tokenizer.go              # Token counting implementation
├── tokenizer_test.go         # Token counting tests
├── main_test.go              # Integration tests
├── docs/
│   ├── TOKEN_COUNTING.md              # Token counting guide (comprehensive)
│   ├── ENVIRONMENT_VARIABLES.md       # Environment variable reference
│   ├── TOKENIZER_CONFIGURATION.md     # Tokenizer configuration
│   └── ...
├── RESPONSE_TOKEN_COUNTING.md # Implementation notes
├── TOKEN_COUNTING_WORKFLOW.md # Development workflow
├── go.mod                    # Go dependencies
└── Dockerfile                # Container image

Documentation

TOKEN_COUNTING.md - Comprehensive token counting guide
- How it works internally (architecture)
- Response format specification
- Configuration options
- Prometheus metrics reference
- Code examples and usage
- Known limitations
- Troubleshooting guide
ENVIRONMENT_VARIABLES.md - Environment variable reference
TOKENIZER_CONFIGURATION.md - Tokenizer configuration
DEVPOD_DOCKER_BUILD_LIMITATION.md - Devpod Docker build limitations and GitHub Actions workaround
RESPONSE_TOKEN_COUNTING.md - Implementation notes
TOKEN_COUNTING_WORKFLOW.md - Development workflow

Troubleshooting

Token counting not working

Check startup logs:

kubectl logs deployment/zai-proxy -n mcp | grep -i token

Expected output:

Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)

If disabled:

Token counting disabled (TOKEN_COUNTING_ENABLED=false)

Fix:

kubectl set env deployment/zai-proxy -n mcp TOKEN_COUNTING_ENABLED=true
kubectl rollout restart deployment/zai-proxy -n mcp

Token counts seem inaccurate

Check if fallback tokenizer is active:

kubectl logs deployment/zai-proxy -n mcp | grep -i fallback

If you see:

Falling back to SimpleTokenCounter

This means tiktoken failed to initialize. The fallback uses word count approximation (~30% variance).

Resolution: Rebuild with tiktoken dependencies
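To make the fallback behavior concrete, here is a minimal Go sketch of a counter that tries a primary tokenizer and degrades to a word-count approximation without ever failing the request. The `Counter` interface, `wordCounter`, and `SafeCount` are hypothetical names for this example. The README only states that the fallback "uses word count approximation", so splitting on whitespace is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Counter is a hypothetical tokenizer interface for this sketch.
type Counter interface {
	Count(text string) (int, error)
}

// wordCounter approximates tokens as whitespace-separated words.
type wordCounter struct{}

func (wordCounter) Count(text string) (int, error) {
	return len(strings.Fields(text)), nil
}

// failing simulates a tokenizer that did not initialize.
type failing struct{}

func (failing) Count(string) (int, error) {
	return 0, errors.New("tokenizer not initialized")
}

// SafeCount returns a count from primary when it works. If primary is nil,
// returns an error, or panics, the fallback is used instead. The function
// never returns an error, so a counting problem cannot fail a request.
func SafeCount(primary, fallback Counter, text string) (n int, usedFallback bool) {
	defer func() {
		if r := recover(); r != nil {
			n, _ = fallback.Count(text)
			usedFallback = true
		}
	}()
	if primary != nil {
		if c, err := primary.Count(text); err == nil {
			return c, false
		}
	}
	c, _ := fallback.Count(text)
	return c, true
}

func main() {
	n, fb := SafeCount(failing{}, wordCounter{}, "count these four words")
	fmt.Printf("tokens=%d fallback=%v\n", n, fb)
}
```

A caller could log the "Falling back" message whenever `usedFallback` is true, which is what the `grep -i fallback` check above would then find.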

High token counting latency

Query latency:

histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))

Expected: <1ms for 99th percentile

If >5ms: Increase CPU limits or reduce concurrent requests (lower MAX_WORKERS)

See docs/TOKEN_COUNTING.md#troubleshooting-guide for the complete guide.

Known Limitations

No usage injection - Token counts are logged and exported as Prometheus metrics but not added to response bodies
- Workaround: Check logs or query Prometheus
- Future enhancement planned
Hardcoded model label - TOKENIZER_MODEL env var applies to all requests
- Workaround: Use separate proxy instances per model
- Future: Extract model from request body dynamically
Tiktoken assumptions - Uses cl100k_base encoding for all models
- Works well for Claude 3 (<3% variance)
- May have variance for GLM-4 (<10% expected)
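The planned enhancement for the hardcoded model label (extracting the model from the request body) might be approached as in the Go sketch below. This is speculative: `ModelLabel` and the allow-list are assumptions, not the project's design. The allow-list is included because an unbounded set of label values would inflate Prometheus series cardinality.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ModelLabel returns the "model" field of a JSON request body when it is in
// the known set. For invalid JSON, a missing field, or an unknown model it
// returns fallback (for example the TOKENIZER_MODEL value).
func ModelLabel(body []byte, known map[string]bool, fallback string) string {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return fallback
	}
	if !known[req.Model] {
		return fallback
	}
	return req.Model
}

func main() {
	// Hypothetical allow-list; the real set of models is deployment-specific.
	known := map[string]bool{"glm-4": true, "claude-3-sonnet": true}
	body := []byte(`{"model":"claude-3-sonnet","max_tokens":100}`)
	fmt.Println("metrics label:", ModelLabel(body, known, "glm-4"))
}
```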

See docs/TOKEN_COUNTING.md#known-limitations for details.

Performance

Metric	Target	Typical
Request latency overhead	<5ms	<1ms
Token counting latency	<1ms	0.3-0.8ms
Streaming overhead	0ms	0ms (zero-copy)
Memory per request	<5KB	~2KB

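Output token counting happens after streaming completes, so it does not add to the latency seen by the client. Input token counting runs before the request is forwarded; its cost is the token counting latency listed above.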

License

See repository license.

Contributing

Contributions welcome! Please:

Read existing documentation
Write tests for new features
Update documentation
Follow existing code style

Support

Documentation: Check docs/ directory
Issues: File in repository
Logs: kubectl logs -f deployment/zai-proxy -n mcp
Metrics: http://zai-proxy.mcp.svc.cluster.local:8080/metrics
