Extracted from ardenone-cluster/containers/zai-proxy and ardenone-cluster/containers/zai-proxy-dashboard. - proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0) - Token counting, rate limiting, Prometheus metrics, canary support - dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0) - Prometheus collector, SQLite storage, SSE live updates - docs/: Operational notes, research, and plan subdirs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Z.AI Proxy

A production-ready HTTP proxy for the Z.AI API with token counting, adaptive rate limiting, and comprehensive observability.

## Features

- ✅ **Token Counting** - Accurate input/output token tracking using tiktoken
- ✅ **Adaptive Rate Limiting** - Automatically adjusts to API limits
- ✅ **Prometheus Metrics** - Full observability with detailed metrics
- ✅ **Streaming Support** - Handles SSE (Server-Sent Events) streaming responses
- ✅ **Graceful Degradation** - Never fails requests due to token counting errors
- ✅ **Production Ready** - Thread-safe, tested, and battle-hardened

## Quick Start

### Run Locally

```bash
# Set required environment variables
export ZAI_API_KEY="your-api-key-here"

# Run the proxy
go run main.go tokenizer.go

# Proxy listens on :8080
# Metrics available at :8080/metrics
```

### Docker Deployment

```bash
# Build image
docker build -t zai-proxy:latest .

# Run container
docker run -p 8080:8080 \
  -e ZAI_API_KEY="your-api-key" \
  -e TOKEN_COUNTING_ENABLED=true \
  zai-proxy:latest
```

### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: zai-proxy
  template:
    metadata:
      labels:
        app: zai-proxy
    spec:
      containers:
        - name: zai-proxy
          image: ghcr.io/ardenone/zai-proxy:latest
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: ZAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: zai-api-key
                  key: api-key
            - name: TOKEN_COUNTING_ENABLED
              value: "true"
            - name: TOKENIZER_MODEL
              value: "glm-4"
            - name: MAX_WORKERS
              value: "50"
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              cpu: 2000m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  selector:
    app: zai-proxy
  ports:
    - port: 8080
      targetPort: 8080
```

## Configuration

### Environment Variables

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `ZAI_API_KEY` | String | **Required** | Z.AI API key for upstream authentication |
| `TOKEN_COUNTING_ENABLED` | Boolean | `true` | Enable/disable token counting |
| `TOKENIZER_MODEL` | String | `glm-4` | Model name for Prometheus metrics labels |
| `MAX_WORKERS` | Integer | `10` | Maximum concurrent requests |
| `RATE_LIMIT_INITIAL` | Float | `10.0` | Initial rate limit (requests/second) |
| `RATE_LIMIT_MIN` | Float | `1.0` | Minimum rate limit (requests/second) |
| `RATE_LIMIT_MAX` | Float | `50.0` | Maximum rate limit (requests/second) |
| `MAX_RETRIES` | Integer | `3` | Maximum retry attempts for failed requests |

**See [docs/ENVIRONMENT_VARIABLES.md](docs/ENVIRONMENT_VARIABLES.md) for complete reference.**

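As a rough illustration of how the defaults in the table above could be applied, here is a minimal Go sketch. The `Config` struct, the `LoadConfig` function, and the choice to fall back to the default when a value cannot be parsed are assumptions made for this example; the actual loading code in `main.go` may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Config mirrors the documented environment variables.
// Hypothetical type: the real proxy may organise its settings differently.
type Config struct {
	APIKey               string
	TokenCountingEnabled bool
	TokenizerModel       string
	MaxWorkers           int
	RateLimitInitial     float64
	RateLimitMin         float64
	RateLimitMax         float64
	MaxRetries           int
}

// lookup has the same shape as os.Getenv, so tests can supply a map instead.
type lookup func(key string) string

func strOr(get lookup, key, def string) string {
	if v := get(key); v != "" {
		return v
	}
	return def
}

func boolOr(get lookup, key string, def bool) bool {
	if v, err := strconv.ParseBool(get(key)); err == nil {
		return v
	}
	return def // unset or unparsable: keep the documented default (assumption)
}

func intOr(get lookup, key string, def int) int {
	if v, err := strconv.Atoi(get(key)); err == nil {
		return v
	}
	return def
}

func floatOr(get lookup, key string, def float64) float64 {
	if v, err := strconv.ParseFloat(get(key), 64); err == nil {
		return v
	}
	return def
}

// LoadConfig applies the defaults from the table and rejects a missing API key.
func LoadConfig(get lookup) (Config, error) {
	cfg := Config{
		APIKey:               get("ZAI_API_KEY"),
		TokenCountingEnabled: boolOr(get, "TOKEN_COUNTING_ENABLED", true),
		TokenizerModel:       strOr(get, "TOKENIZER_MODEL", "glm-4"),
		MaxWorkers:           intOr(get, "MAX_WORKERS", 10),
		RateLimitInitial:     floatOr(get, "RATE_LIMIT_INITIAL", 10.0),
		RateLimitMin:         floatOr(get, "RATE_LIMIT_MIN", 1.0),
		RateLimitMax:         floatOr(get, "RATE_LIMIT_MAX", 50.0),
		MaxRetries:           intOr(get, "MAX_RETRIES", 3),
	}
	if cfg.APIKey == "" {
		return Config{}, errors.New("ZAI_API_KEY is required")
	}
	return cfg, nil
}

func main() {
	// Demo environment; a real entry point would pass os.Getenv instead.
	env := map[string]string{"ZAI_API_KEY": "example", "MAX_WORKERS": "50"}
	cfg, err := LoadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("workers:", cfg.MaxWorkers, "model:", cfg.TokenizerModel)
}
```

Taking the lookup function as a parameter keeps the loader testable without mutating the process environment.
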
## Token Counting

The proxy automatically counts input and output tokens for all requests using tiktoken's `cl100k_base` encoding, which approximates Claude 3 tokenization (see Known Limitations for the expected variance).

### How It Works

```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │ Request
       ↓
┌─────────────────────────────────────┐
│ Proxy: Count Input Tokens           │
│  • Parse request messages           │
│  • Tokenize using tiktoken          │
│  • Metric: zai_proxy_tokens_total   │
└──────┬──────────────────────────────┘
       │
       ↓
┌─────────────┐
│  Z.AI API   │
└──────┬──────┘
       │ Response (streaming)
       ↓
┌─────────────────────────────────────┐
│ Proxy: Stream + Capture             │
│  • Stream to client (zero-copy)     │
│  • Capture content in background    │
│  • Count output tokens after stream │
│  • Metric: zai_proxy_tokens_total   │
└──────┬──────────────────────────────┘
       │
       ↓
┌─────────────┐
│   Client    │
└─────────────┘
```

### Quick Configuration

```bash
# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true
export TOKENIZER_MODEL=glm-4

# Disable token counting
export TOKEN_COUNTING_ENABLED=false
```

### Monitoring Token Usage

**View logs:**
```bash
kubectl logs -f deployment/zai-proxy -n mcp | grep "Token usage"
# Output: Token usage: input=123, output=456
```

**Query Prometheus:**
```promql
# Total tokens per minute (summed across directions and models)
sum(rate(zai_proxy_tokens_total[5m])) * 60

# Output-to-input token ratio
# (each side is aggregated so the differing direction labels still match)
sum(rate(zai_proxy_tokens_total{direction="output"}[5m])) /
sum(rate(zai_proxy_tokens_total{direction="input"}[5m]))

# Token counting latency (should be <1ms)
histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))
```

**See [docs/TOKEN_COUNTING.md](docs/TOKEN_COUNTING.md) for comprehensive guide.**

## Prometheus Metrics

The proxy exports metrics at `:8080/metrics`:

### Request Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `zai_proxy_requests_total` | Counter | Total requests by method, path, status |
| `zai_proxy_request_duration_seconds` | Histogram | Request duration |
| `zai_proxy_concurrent_requests` | Gauge | Active concurrent requests |
| `zai_proxy_upstream_errors_total` | Counter | Upstream errors by type |

### Token Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `zai_proxy_tokens_total` | Counter | Total tokens by direction (input/output) and model |
| `zai_proxy_token_count_duration_seconds` | Histogram | Token counting latency |
| `zai_proxy_token_rate` | Histogram | Token processing rate (tokens/second) |

### Rate Limiting Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `zai_proxy_rate_limit_requests_per_second` | Gauge | Current rate limit |
| `zai_proxy_rate_limit_wait_seconds` | Histogram | Rate limiter wait time |
| `zai_proxy_rate_limit_adjustments_total` | Counter | Rate limit adjustments (increase/decrease) |

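This README does not describe how the rate limit is adjusted. One common policy that would produce increase/decrease adjustments like those counted above is additive increase, multiplicative decrease (AIMD), sketched here in Go purely as an assumption. The `limiter` type, the step of 1 request/second, and the halving factor are illustrative; only the bounds correspond to `RATE_LIMIT_INITIAL`, `RATE_LIMIT_MIN`, and `RATE_LIMIT_MAX`.

```go
package main

import "fmt"

// limiter holds a requests-per-second rate kept within [min, max].
// Hypothetical type; the proxy's real limiter is not shown in this README.
type limiter struct {
	rate, min, max float64
}

// clamp restricts v to the closed interval [lo, hi].
func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// onSuccess raises the rate by a fixed step (additive increase).
func (l *limiter) onSuccess() {
	l.rate = clamp(l.rate+1, l.min, l.max)
}

// onThrottled halves the rate, e.g. after an upstream HTTP 429
// (multiplicative decrease).
func (l *limiter) onThrottled() {
	l.rate = clamp(l.rate/2, l.min, l.max)
}

func main() {
	// Defaults taken from the environment variable table.
	l := &limiter{rate: 10, min: 1, max: 50}

	l.onThrottled()
	fmt.Println("after throttle:", l.rate)

	for i := 0; i < 3; i++ {
		l.onSuccess()
	}
	fmt.Println("after three successes:", l.rate)
}
```

Under this policy the limit backs off quickly when the upstream pushes back and recovers gradually, and it never leaves the configured bounds.
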
## Usage Example

```bash
# Make a request through the proxy
curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ZAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-3-sonnet",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ],
    "max_tokens": 100,
    "stream": true
  }'

# Check token usage in logs
# Output: Token usage: input=5, output=12

# Query metrics
curl http://localhost:8080/metrics | grep zai_proxy_tokens_total
# zai_proxy_tokens_total{direction="input",model="glm-4"} 5
# zai_proxy_tokens_total{direction="output",model="glm-4"} 12
```

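For callers written in Go, the same request as the `curl` example can be built with `net/http`. This is a minimal sketch: the `message` and `messagesRequest` types and the `newMessagesRequest` helper are names invented for this example, and only the path, headers, and body fields come from the `curl` command above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// message and messagesRequest model only the fields used in the curl example.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type messagesRequest struct {
	Model     string    `json:"model"`
	Messages  []message `json:"messages"`
	MaxTokens int       `json:"max_tokens"`
	Stream    bool      `json:"stream"`
}

// newMessagesRequest builds a POST to <baseURL>/v1/messages with the same
// headers as the curl example. It does not send the request.
func newMessagesRequest(baseURL, apiKey string, body messagesRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/messages", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-api-key", apiKey)
	req.Header.Set("anthropic-version", "2023-06-01")
	return req, nil
}

func main() {
	req, err := newMessagesRequest("http://localhost:8080", "your-api-key", messagesRequest{
		Model:     "claude-3-sonnet",
		Messages:  []message{{Role: "user", Content: "Hello, Claude!"}},
		MaxTokens: 100,
		Stream:    true,
	})
	if err != nil {
		panic(err)
	}
	// Pass req to http.DefaultClient.Do to send it through a running proxy.
	fmt.Println(req.Method, req.URL)
}
```
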
## Development

### Running Tests

```bash
# Run all tests
go test -v ./...

# Run token counting tests
go test -v -run TestTikToken

# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```

### Building

```bash
# Build binary
go build -o zai-proxy main.go tokenizer.go

# Build Docker image (use GitHub Actions for devpod environments)
docker build -t zai-proxy:dev .
```

**Note:** Docker builds in devpod environments may fail with overlayfs errors. See [docs/DEVPOD_DOCKER_BUILD_LIMITATION.md](docs/DEVPOD_DOCKER_BUILD_LIMITATION.md) for details and the recommended GitHub Actions build workflow.

### Project Structure

```
zai-proxy/
├── main.go                         # Proxy server
├── tokenizer.go                    # Token counting implementation
├── tokenizer_test.go               # Token counting tests
├── main_test.go                    # Integration tests
├── docs/
│   ├── TOKEN_COUNTING.md           # Token counting guide (comprehensive)
│   ├── ENVIRONMENT_VARIABLES.md    # Environment variable reference
│   ├── TOKENIZER_CONFIGURATION.md  # Tokenizer configuration
│   └── ...
├── RESPONSE_TOKEN_COUNTING.md      # Implementation notes
├── TOKEN_COUNTING_WORKFLOW.md      # Development workflow
├── go.mod                          # Go dependencies
└── Dockerfile                      # Container image
```

## Documentation

- **[TOKEN_COUNTING.md](docs/TOKEN_COUNTING.md)** - Comprehensive token counting guide
  - How it works internally (architecture)
  - Response format specification
  - Configuration options
  - Prometheus metrics reference
  - Code examples and usage
  - Known limitations
  - Troubleshooting guide
- **[ENVIRONMENT_VARIABLES.md](docs/ENVIRONMENT_VARIABLES.md)** - Environment variable reference
- **[TOKENIZER_CONFIGURATION.md](docs/TOKENIZER_CONFIGURATION.md)** - Tokenizer configuration
- **[DEVPOD_DOCKER_BUILD_LIMITATION.md](docs/DEVPOD_DOCKER_BUILD_LIMITATION.md)** - Devpod Docker build limitations and GitHub Actions workaround
- **[RESPONSE_TOKEN_COUNTING.md](RESPONSE_TOKEN_COUNTING.md)** - Implementation notes
- **[TOKEN_COUNTING_WORKFLOW.md](TOKEN_COUNTING_WORKFLOW.md)** - Development workflow

## Troubleshooting

### Token counting not working

**Check startup logs:**
```bash
kubectl logs deployment/zai-proxy -n mcp | grep -i token
```

**Expected output:**
```
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
```

**If disabled:**
```
Token counting disabled (TOKEN_COUNTING_ENABLED=false)
```

**Fix:**
```bash
kubectl set env deployment/zai-proxy -n mcp TOKEN_COUNTING_ENABLED=true
kubectl rollout restart deployment/zai-proxy -n mcp
```

### Token counts seem inaccurate

**Check if fallback tokenizer is active:**
```bash
kubectl logs deployment/zai-proxy -n mcp | grep -i fallback
```

**If you see:**
```
Falling back to SimpleTokenCounter
```

**This means tiktoken failed to initialize.** The fallback uses a word-count approximation (~30% variance).

**Resolution:** Rebuild with the tiktoken dependencies.

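To give a sense of why a word-count fallback is only approximate, here is a minimal Go sketch of such an estimator. It is not the proxy's `SimpleTokenCounter`: the function name `approxTokens` and the ratio of 1.3 tokens per word are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// approxTokens estimates a token count from whitespace-separated words.
// The 1.3 tokens-per-word ratio is an assumed rule of thumb for English
// text, not a value taken from the proxy's SimpleTokenCounter.
func approxTokens(text string) int {
	words := len(strings.Fields(text))
	// Integer form of ceil(words * 1.3).
	return (words*13 + 9) / 10
}

func main() {
	samples := []string{
		"",
		"Hello, Claude!",
		"The quick brown fox jumps over the lazy dog",
	}
	for _, s := range samples {
		fmt.Printf("%q -> ~%d tokens\n", s, approxTokens(s))
	}
}
```

An estimator like this ignores punctuation, code, and non-English text, all of which tokenize very differently from plain English words, so its error grows on those inputs.
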
### High token counting latency

**Query latency:**
```promql
histogram_quantile(0.99, rate(zai_proxy_token_count_duration_seconds_bucket[5m]))
```

**Expected:** <1ms for 99th percentile

**If >5ms:** Increase CPU limits or reduce concurrent requests

**See [docs/TOKEN_COUNTING.md#troubleshooting-guide](docs/TOKEN_COUNTING.md#troubleshooting-guide) for complete guide.**

## Known Limitations

1. **No usage injection** - Token counts are logged and exported as metrics but not added to response bodies
   - Workaround: Check logs or query Prometheus
   - Future enhancement planned

2. **Hardcoded model label** - The `TOKENIZER_MODEL` env var sets the model label for all requests, regardless of the model named in each request
   - Workaround: Use separate proxy instances per model
   - Future: Extract model from request body dynamically

3. **Tiktoken assumptions** - Uses `cl100k_base` encoding for all models
   - Works well for Claude 3 (<3% variance)
   - May have variance for GLM-4 (<10% expected)

**See [docs/TOKEN_COUNTING.md#known-limitations](docs/TOKEN_COUNTING.md#known-limitations) for details.**

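The planned fix for the hardcoded model label, reading the model from each request body, could look roughly like this Go sketch. The `modelLabel` helper is hypothetical and not part of the current code; it assumes the request body is JSON with a top-level `model` string, as in the usage example, and falls back to the configured `TOKENIZER_MODEL` value otherwise.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelLabel returns the "model" field of a JSON request body, or fallback
// when the body is not valid JSON or names no model.
func modelLabel(body []byte, fallback string) string {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil || req.Model == "" {
		return fallback
	}
	return req.Model
}

func main() {
	body := []byte(`{"model": "claude-3-sonnet", "max_tokens": 100}`)
	fmt.Println(modelLabel(body, "glm-4"))
	fmt.Println(modelLabel([]byte("not json"), "glm-4"))
}
```

Since the value comes from client input, a real implementation would probably want to restrict it to a known set of model names before using it as a Prometheus label, to keep label cardinality bounded.
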
## Performance

| Metric | Target | Typical |
|--------|--------|---------|
| Request latency overhead | <5ms | <1ms |
| Token counting latency | <1ms | 0.3-0.8ms |
| Streaming overhead | 0ms | 0ms (zero-copy) |
| Memory per request | <5KB | ~2KB |

**Output token counting happens after streaming completes, so it doesn't affect end-user latency.**

## License

See repository license.

## Contributing

Contributions welcome! Please:
1. Read existing documentation
2. Write tests for new features
3. Update documentation
4. Follow existing code style

## Support

- **Documentation:** Check `docs/` directory
- **Issues:** File in repository
- **Logs:** `kubectl logs -f deployment/zai-proxy -n mcp`
- **Metrics:** `http://zai-proxy.mcp.svc.cluster.local:8080/metrics`