zai-proxy/proxy/README.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00


# Z.AI Proxy
A production-ready HTTP proxy for the Z.AI API with token counting, adaptive rate limiting, and comprehensive observability.
## Features
- **Token Counting** - Accurate input/output token tracking using tiktoken
- **Adaptive Rate Limiting** - Automatically adjusts to API limits
- **Prometheus Metrics** - Full observability with detailed metrics
- **Streaming Support** - Handles SSE (Server-Sent Events) streaming responses
- **Graceful Degradation** - Never fails requests due to token counting errors
- **Production Ready** - Thread-safe, tested, and battle-hardened
## Quick Start
### Run Locally
```bash
# Set required environment variables
export ZAI_API_KEY="your-api-key-here"
# Run the proxy
go run main.go tokenizer.go
# Proxy listens on :8080
# Metrics available at :8080/metrics
```
### Docker Deployment
```bash
# Build image
docker build -t zai-proxy:latest .
# Run container
docker run -p 8080:8080 \
  -e ZAI_API_KEY="your-api-key" \
  -e TOKEN_COUNTING_ENABLED=true \
  zai-proxy:latest
```
### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: zai-proxy
  template:
    metadata:
      labels:
        app: zai-proxy
    spec:
      containers:
        - name: zai-proxy
          image: ghcr.io/ardenone/zai-proxy:latest
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: ZAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: zai-api-key
                  key: api-key
            - name: TOKEN_COUNTING_ENABLED
              value: "true"
            - name: TOKENIZER_MODEL
              value: "glm-4"
            - name: MAX_WORKERS
              value: "50"
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              cpu: 2000m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  selector:
    app: zai-proxy
  ports:
    - port: 8080
      targetPort: 8080
```
## Configuration
### Environment Variables
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `ZAI_API_KEY` | String | **Required** | Z.AI API key for upstream authentication |
| `TOKEN_COUNTING_ENABLED` | Boolean | `true` | Enable/disable token counting |
| `TOKENIZER_MODEL` | String | `glm-4` | Model name for Prometheus metrics labels |
| `MAX_WORKERS` | Integer | `10` | Maximum concurrent requests |
| `RATE_LIMIT_INITIAL` | Float | `10.0` | Initial rate limit (requests/second) |
| `RATE_LIMIT_MIN` | Float | `1.0` | Minimum rate limit (requests/second) |
| `RATE_LIMIT_MAX` | Float | `50.0` | Maximum rate limit (requests/second) |
| `MAX_RETRIES` | Integer | `3` | Maximum retry attempts for failed requests |

**See [docs/ENVIRONMENT_VARIABLES.md](docs/ENVIRONMENT_VARIABLES.md) for the complete reference.**
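The table above can be read as a small configuration loader. The following is a minimal sketch, not the code in `main.go`: the `Config` type, its field names, the `loadConfig`, `intEnv`, and `floatEnv` helpers, and the final min/initial/max sanity check are all assumptions made for illustration. Only the variable names and defaults come from the table.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// Config holds the documented settings. The type and field names are
// illustrative and are not taken from main.go.
type Config struct {
	APIKey               string
	TokenCountingEnabled bool
	TokenizerModel       string
	MaxWorkers           int
	MaxRetries           int
	RateInitial          float64
	RateMin              float64
	RateMax              float64
}

// intEnv parses an integer variable, returning def when it is unset.
func intEnv(getenv func(string) string, key string, def int) (int, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return n, nil
}

// floatEnv parses a float variable, returning def when it is unset.
func floatEnv(getenv func(string) string, key string, def float64) (float64, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	f, err := strconv.ParseFloat(v, 64)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return f, nil
}

// loadConfig reads every variable through getenv (os.Getenv in production)
// and applies the defaults listed in the table.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{APIKey: getenv("ZAI_API_KEY"), TokenCountingEnabled: true, TokenizerModel: "glm-4"}
	if cfg.APIKey == "" {
		return cfg, errors.New("ZAI_API_KEY is required")
	}
	var err error
	if v := getenv("TOKEN_COUNTING_ENABLED"); v != "" {
		if cfg.TokenCountingEnabled, err = strconv.ParseBool(v); err != nil {
			return cfg, fmt.Errorf("TOKEN_COUNTING_ENABLED: %w", err)
		}
	}
	if v := getenv("TOKENIZER_MODEL"); v != "" {
		cfg.TokenizerModel = v
	}
	if cfg.MaxWorkers, err = intEnv(getenv, "MAX_WORKERS", 10); err != nil {
		return cfg, err
	}
	if cfg.MaxRetries, err = intEnv(getenv, "MAX_RETRIES", 3); err != nil {
		return cfg, err
	}
	if cfg.RateInitial, err = floatEnv(getenv, "RATE_LIMIT_INITIAL", 10.0); err != nil {
		return cfg, err
	}
	if cfg.RateMin, err = floatEnv(getenv, "RATE_LIMIT_MIN", 1.0); err != nil {
		return cfg, err
	}
	if cfg.RateMax, err = floatEnv(getenv, "RATE_LIMIT_MAX", 50.0); err != nil {
		return cfg, err
	}
	// Assumed sanity check; the real proxy may validate differently.
	if cfg.RateMin > cfg.RateInitial || cfg.RateInitial > cfg.RateMax {
		return cfg, errors.New("expected RATE_LIMIT_MIN <= RATE_LIMIT_INITIAL <= RATE_LIMIT_MAX")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("workers=%d retries=%d rate=%.1f req/s\n", cfg.MaxWorkers, cfg.MaxRetries, cfg.RateInitial)
}
```

Passing `getenv` as a parameter keeps the loader testable without touching the process environment.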
## Token Counting
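The proxy automatically counts input and output tokens for all requests using the tiktoken `cl100k_base` encoding. `cl100k_base` is an OpenAI encoding, so the proxy uses it as an approximation for Claude 3 and GLM-4 (see Known Limitations for the expected variance).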
### How It Works
```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │ Request
       ▼
┌─────────────────────────────────────┐
│  Proxy: Count Input Tokens          │
│  • Parse request messages           │
│  • Tokenize using tiktoken          │
│  • Metric: zai_proxy_tokens_total   │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────┐
│  Z.AI API   │
└──────┬──────┘
       │ Response (streaming)
       ▼
┌─────────────────────────────────────┐
│  Proxy: Stream + Capture            │
│  • Stream to client (zero-copy)     │
│  • Capture content in background    │
│  • Count output tokens after stream │
│  • Metric: zai_proxy_tokens_total   │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────┐
│   Client    │
└─────────────┘
```
### Quick Configuration
```bash
# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true
export TOKENIZER_MODEL=glm-4
# Disable token counting
export TOKEN_COUNTING_ENABLED=false
```
### Monitoring Token Usage
**View logs:**
```bash
kubectl logs -f deployment/zai-proxy -n mcp | grep "Token usage"
# Output: Token usage: input=123, output=456
```
**Query Prometheus:**
```promql
# Total tokens per minute (summed across all series)
sum(rate(zai_proxy_tokens_total[5m])) * 60
# Output-to-input ratio (each side is aggregated so the differing direction labels do not prevent matching)
sum(rate(zai_proxy_tokens_total{direction="output"}[5m])) /
sum(rate(zai_proxy_tokens_total{direction="input"}[5m]))
# Token counting latency (should be <1ms)
histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))
```
**See [docs/TOKEN_COUNTING.md](docs/TOKEN_COUNTING.md) for comprehensive guide.**
## Prometheus Metrics
The proxy exports metrics at `:8080/metrics`:
### Request Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `zai_proxy_requests_total` | Counter | Total requests by method, path, status |
| `zai_proxy_request_duration_seconds` | Histogram | Request duration |
| `zai_proxy_concurrent_requests` | Gauge | Active concurrent requests |
| `zai_proxy_upstream_errors_total` | Counter | Upstream errors by type |
### Token Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `zai_proxy_tokens_total` | Counter | Total tokens by direction (input/output) and model |
| `zai_proxy_token_count_duration_seconds` | Histogram | Token counting latency |
| `zai_proxy_token_rate` | Histogram | Token processing rate (tokens/second) |
### Rate Limiting Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `zai_proxy_rate_limit_requests_per_second` | Gauge | Current rate limit |
| `zai_proxy_rate_limit_wait_seconds` | Histogram | Rate limiter wait time |
| `zai_proxy_rate_limit_adjustments_total` | Counter | Rate limit adjustments (increase/decrease) |
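The README does not specify how the limiter adapts, so the following is only one plausible shape for it: an additive-increase, multiplicative-decrease (AIMD) controller clamped between `RATE_LIMIT_MIN` and `RATE_LIMIT_MAX`. The `AdaptiveLimit` type, its methods, the step size, and the halving factor are assumptions for illustration.

```go
package main

import "fmt"

// AdaptiveLimit tracks a requests-per-second limit kept within [Min, Max].
// The type is hypothetical; the proxy's real limiter may differ.
type AdaptiveLimit struct {
	Current, Min, Max float64
}

// clamp restricts v to the range [lo, hi].
func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// OnSuccess raises the limit additively by step requests/second.
func (a *AdaptiveLimit) OnSuccess(step float64) {
	a.Current = clamp(a.Current+step, a.Min, a.Max)
}

// OnThrottled halves the limit, for example after an HTTP 429 from upstream.
func (a *AdaptiveLimit) OnThrottled() {
	a.Current = clamp(a.Current/2, a.Min, a.Max)
}

func main() {
	// Values mirror the documented defaults for
	// RATE_LIMIT_INITIAL, RATE_LIMIT_MIN, and RATE_LIMIT_MAX.
	limit := &AdaptiveLimit{Current: 10, Min: 1, Max: 50}
	limit.OnThrottled()
	fmt.Println("after throttle:", limit.Current)
	limit.OnSuccess(1)
	fmt.Println("after success:", limit.Current)
}
```

In a design like this, each call would correspond to an increment of `zai_proxy_rate_limit_adjustments_total`, and `Current` would back the `zai_proxy_rate_limit_requests_per_second` gauge.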
## Usage Example
```bash
# Make a request through the proxy
curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ZAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-3-sonnet",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ],
    "max_tokens": 100,
    "stream": true
  }'
# Check token usage in logs
# Output: Token usage: input=5, output=12
# Query metrics
curl http://localhost:8080/metrics | grep zai_proxy_tokens_total
# zai_proxy_tokens_total{direction="input",model="glm-4"} 5
# zai_proxy_tokens_total{direction="output",model="glm-4"} 12
```
## Development
### Running Tests
```bash
# Run all tests
go test -v ./...
# Run token counting tests
go test -v -run TestTikToken
# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```
### Building
```bash
# Build binary
go build -o zai-proxy main.go tokenizer.go
# Build Docker image (use GitHub Actions for devpod environments)
docker build -t zai-proxy:dev .
```
**Note:** Docker builds in devpod environments may fail with overlayfs errors. See [docs/DEVPOD_DOCKER_BUILD_LIMITATION.md](docs/DEVPOD_DOCKER_BUILD_LIMITATION.md) for details and the recommended GitHub Actions build workflow.
### Project Structure
```
zai-proxy/
├── main.go # Proxy server
├── tokenizer.go # Token counting implementation
├── tokenizer_test.go # Token counting tests
├── main_test.go # Integration tests
├── docs/
│ ├── TOKEN_COUNTING.md # Token counting guide (comprehensive)
│ ├── ENVIRONMENT_VARIABLES.md # Environment variable reference
│ ├── TOKENIZER_CONFIGURATION.md # Tokenizer configuration
│ └── ...
├── RESPONSE_TOKEN_COUNTING.md # Implementation notes
├── TOKEN_COUNTING_WORKFLOW.md # Development workflow
├── go.mod # Go dependencies
└── Dockerfile # Container image
```
## Documentation
- **[TOKEN_COUNTING.md](docs/TOKEN_COUNTING.md)** - Comprehensive token counting guide
  - How it works internally (architecture)
  - Response format specification
  - Configuration options
  - Prometheus metrics reference
  - Code examples and usage
  - Known limitations
  - Troubleshooting guide
- **[ENVIRONMENT_VARIABLES.md](docs/ENVIRONMENT_VARIABLES.md)** - Environment variable reference
- **[TOKENIZER_CONFIGURATION.md](docs/TOKENIZER_CONFIGURATION.md)** - Tokenizer configuration
- **[DEVPOD_DOCKER_BUILD_LIMITATION.md](docs/DEVPOD_DOCKER_BUILD_LIMITATION.md)** - Devpod Docker build limitations and GitHub Actions workaround
- **[RESPONSE_TOKEN_COUNTING.md](RESPONSE_TOKEN_COUNTING.md)** - Implementation notes
- **[TOKEN_COUNTING_WORKFLOW.md](TOKEN_COUNTING_WORKFLOW.md)** - Development workflow
## Troubleshooting
### Token counting not working
**Check startup logs:**
```bash
kubectl logs deployment/zai-proxy -n mcp | grep -i token
```
**Expected output:**
```
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
```
**If disabled:**
```
Token counting disabled (TOKEN_COUNTING_ENABLED=false)
```
**Fix:**
```bash
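# Changing the pod template environment triggers a new rollout
kubectl set env deployment/zai-proxy -n mcp TOKEN_COUNTING_ENABLED=true
# Optional: force a restart if the value was already set and the pods did not roll
kubectl rollout restart deployment/zai-proxy -n mcp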
```
### Token counts seem inaccurate
**Check if fallback tokenizer is active:**
```bash
kubectl logs deployment/zai-proxy -n mcp | grep -i fallback
```
**If you see:**
```
Falling back to SimpleTokenCounter
```
**This means tiktoken failed to initialize.** The fallback approximates token counts from word counts (~30% variance).

**Resolution:** Rebuild the image with the tiktoken dependencies included.
### High token counting latency
**Query latency:**
```promql
histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))
```
**Expected:** <1ms at the 99th percentile

**If >5ms:** Increase CPU limits or reduce concurrent requests

**See [docs/TOKEN_COUNTING.md#troubleshooting-guide](docs/TOKEN_COUNTING.md#troubleshooting-guide) for the complete guide.**
## Known Limitations
1. **No usage injection** - Token counts are logged and exported as metrics but not added to response bodies
   - Workaround: Check logs or query Prometheus
   - Future enhancement planned
2. **Hardcoded model label** - The `TOKENIZER_MODEL` env var sets the metrics label for all requests, regardless of the model named in the request
   - Workaround: Use separate proxy instances per model
   - Future: Extract model from request body dynamically
3. **Tiktoken assumptions** - Uses `cl100k_base` encoding for all models
   - Works well for Claude 3 (<3% variance)
   - May have variance for GLM-4 (<10% expected)

**See [docs/TOKEN_COUNTING.md#known-limitations](docs/TOKEN_COUNTING.md#known-limitations) for details.**
## Performance
| Metric | Target | Typical |
|--------|--------|---------|
| Request latency overhead | <5ms | <1ms |
| Token counting latency | <1ms | 0.3-0.8ms |
| Streaming overhead | 0ms | 0ms (zero-copy) |
| Memory per request | <5KB | ~2KB |

**Output token counting happens AFTER streaming completes, so it doesn't delay the streamed response.**
## License
See repository license.
## Contributing
Contributions welcome! Please:
1. Read existing documentation
2. Write tests for new features
3. Update documentation
4. Follow existing code style
## Support
- **Documentation:** Check `docs/` directory
- **Issues:** File in repository
- **Logs:** `kubectl logs -f deployment/zai-proxy -n mcp`
- **Metrics:** `http://zai-proxy.mcp.svc.cluster.local:8080/metrics`