Environment Variables
This document describes all environment variables supported by the zai-proxy service.
Tokenizer Configuration
TOKEN_COUNTING_ENABLED
Type: Boolean
Default: true
Description: Enable or disable token counting for input and output tokens.
When enabled, the proxy will:
- Count input tokens from request messages using tiktoken cl100k_base encoding
- Count output tokens from response content (both streaming and non-streaming)
- Emit Prometheus metrics: zai_proxy_tokens_total and zai_proxy_token_count_duration_seconds
- Log token usage for each request
When disabled, the proxy will skip all token counting operations, reducing CPU overhead.
Valid values:
- true, 1, or empty (default): Token counting enabled
- false, 0: Token counting disabled
Example:
# Enable token counting (default)
TOKEN_COUNTING_ENABLED=true
# Disable token counting
TOKEN_COUNTING_ENABLED=false
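The valid values listed above could be interpreted roughly as follows. This is a minimal sketch, not the proxy's actual parsing code: the function name tokenCountingEnabled is hypothetical, and it assumes that matching ignores case and surrounding whitespace and that unrecognized values fall back to the default (enabled).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenCountingEnabled interprets a raw TOKEN_COUNTING_ENABLED value.
// "true", "1", and the empty string enable counting; "false" and "0" disable it.
// Assumption: any other value falls back to the default (enabled), and
// matching ignores case and surrounding whitespace.
func tokenCountingEnabled(raw string) bool {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "false", "0":
		return false
	default:
		return true
	}
}

func main() {
	enabled := tokenCountingEnabled(os.Getenv("TOKEN_COUNTING_ENABLED"))
	fmt.Println("token counting enabled:", enabled)
}
```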
TOKENIZER_MODEL
Type: String
Default: glm-4
Description: Model name used for Prometheus metrics labels.
This value is used as the model label in the zai_proxy_tokens_total metric. It does not affect the tokenization algorithm (which always uses tiktoken cl100k_base encoding), but allows distinguishing token counts by model in metrics.
Example:
# Default
TOKENIZER_MODEL=glm-4
# For different model tracking
TOKENIZER_MODEL=claude-3-opus
TOKENIZER_MODEL=gpt-4
Prometheus metric example:
zai_proxy_tokens_total{direction="input",model="glm-4"} 1234
zai_proxy_tokens_total{direction="output",model="glm-4"} 5678
Worker Configuration
MAX_WORKERS
Type: Integer
Default: 10
Description: Maximum number of concurrent requests allowed.
Once this many requests are in flight, additional requests receive a 503 Service Unavailable response until a slot frees up.
Example:
MAX_WORKERS=50
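One common way to enforce such a cap in Go is a buffered channel used as a semaphore. The sketch below illustrates that technique and is not taken from the proxy source: workerLimiter, wrap, and maxWorkers are hypothetical names, and falling back to 10 on an invalid or non-positive value is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
)

// workerLimiter caps in-flight requests using a buffered channel as a semaphore.
type workerLimiter struct {
	slots chan struct{}
}

func newWorkerLimiter(max int) *workerLimiter {
	return &workerLimiter{slots: make(chan struct{}, max)}
}

// tryAcquire claims a slot without blocking; false means the limit is reached.
func (l *workerLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously claimed with tryAcquire.
func (l *workerLimiter) release() { <-l.slots }

// wrap rejects requests with 503 Service Unavailable when no slot is free.
func (l *workerLimiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			http.Error(w, "too many concurrent requests", http.StatusServiceUnavailable)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}

// maxWorkers parses a MAX_WORKERS value.
// Assumption: invalid or non-positive input falls back to the default of 10.
func maxWorkers(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return 10
	}
	return n
}

func main() {
	l := newWorkerLimiter(maxWorkers(os.Getenv("MAX_WORKERS")))
	_ = l.wrap(http.NotFoundHandler()) // would be passed to an HTTP server
	fmt.Println("max workers:", cap(l.slots))
}
```

The non-blocking select is what turns "limit reached" into an immediate rejection rather than a queued request.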
Rate Limiting Configuration
RATE_LIMIT_INITIAL
Type: Float
Default: 10.0
Description: Initial rate limit in requests per second.
The proxy uses adaptive rate limiting that automatically adjusts based on 429 responses from the upstream API.
Example:
RATE_LIMIT_INITIAL=20.0
RATE_LIMIT_MIN
Type: Float
Default: 1.0
Description: Minimum rate limit in requests per second.
The adaptive rate limiter will never decrease below this value, even when receiving 429 responses.
Example:
RATE_LIMIT_MIN=0.5
RATE_LIMIT_MAX
Type: Float
Default: 50.0
Description: Maximum rate limit in requests per second.
The adaptive rate limiter will never increase above this value, even during successful operation.
Example:
RATE_LIMIT_MAX=100.0
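The three bounds above might be loaded along these lines. This is a sketch under stated assumptions: floatOr, rateLimits, and loadRateLimits are hypothetical names, invalid values are assumed to fall back to the documented defaults, and clamping the initial rate into [min, max] is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// floatOr parses raw as a float64 and returns def when raw is empty or invalid.
func floatOr(raw string, def float64) float64 {
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return def
	}
	return v
}

// rateLimits holds the adaptive limiter bounds in requests per second.
type rateLimits struct {
	Initial, Min, Max float64
}

// loadRateLimits reads the RATE_LIMIT_* variables through getenv,
// applying the documented defaults (10.0, 1.0, 50.0).
func loadRateLimits(getenv func(string) string) rateLimits {
	rl := rateLimits{
		Initial: floatOr(getenv("RATE_LIMIT_INITIAL"), 10.0),
		Min:     floatOr(getenv("RATE_LIMIT_MIN"), 1.0),
		Max:     floatOr(getenv("RATE_LIMIT_MAX"), 50.0),
	}
	// Assumption: an initial rate outside [Min, Max] is clamped into range.
	if rl.Initial < rl.Min {
		rl.Initial = rl.Min
	}
	if rl.Initial > rl.Max {
		rl.Initial = rl.Max
	}
	return rl
}

func main() {
	rl := loadRateLimits(os.Getenv)
	fmt.Printf("Adaptive rate limiting: initial=%.1f, min=%.1f, max=%.1f req/s\n",
		rl.Initial, rl.Min, rl.Max)
}
```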
RATE_LIMIT_ADDITIVE_INCREASE
Type: Float
Default: 0.5
Description: Additive increase step in requests per second (AIMD algorithm).
The rate limiter uses AIMD (Additive Increase, Multiplicative Decrease):
- On success (< 1% 429s): rate increases by this fixed amount
- On 429s (> 5%): rate decreases multiplicatively (5-40%)
This produces stable convergence instead of oscillation. For example, with a ceiling of 20 req/s and additive step of 0.5, the rate will converge near 19.5 instead of bouncing between 19 and 20.
Example:
RATE_LIMIT_ADDITIVE_INCREASE=0.5
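The AIMD behavior described above can be sketched as a single adjustment step. This is an illustration, not the proxy's implementation: the aimdLimiter type is hypothetical, the size of the multiplicative cut is assumed to equal the 429 ratio capped at 40% (the source only gives the 5-40% range), and the rate is assumed to hold steady when the 429 ratio falls between 1% and 5%.

```go
package main

import "fmt"

// aimdLimiter holds the current rate and its bounds, in requests per second.
type aimdLimiter struct {
	Rate, Min, Max, Step float64
}

// adjust applies one AIMD step for the observed fraction of 429 responses
// (0.0 to 1.0) and returns the new rate.
func (a *aimdLimiter) adjust(ratio429 float64) float64 {
	switch {
	case ratio429 > 0.05:
		// Multiplicative decrease. Assumption: the cut equals the 429
		// ratio, capped at 40%, which yields a 5-40% reduction.
		cut := ratio429
		if cut > 0.40 {
			cut = 0.40
		}
		a.Rate *= 1 - cut
	case ratio429 < 0.01:
		// Additive increase by the configured fixed step.
		a.Rate += a.Step
	}
	// Assumption: between 1% and 5% the rate is left unchanged.
	if a.Rate < a.Min {
		a.Rate = a.Min
	}
	if a.Rate > a.Max {
		a.Rate = a.Max
	}
	return a.Rate
}

func main() {
	l := &aimdLimiter{Rate: 10, Min: 1, Max: 50, Step: 0.5}
	for _, ratio := range []float64{0, 0, 0.10, 0.03, 0} {
		fmt.Printf("429 ratio %.2f -> %.2f req/s\n", ratio, l.adjust(ratio))
	}
}
```

Because increases are small and fixed while decreases scale with the current rate, the limiter approaches the upstream ceiling in small steps and backs off sharply only when 429s appear.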
Retry Configuration
MAX_RETRIES
Type: Integer
Default: 3
Description: Maximum number of retry attempts for failed requests.
The proxy will retry requests on:
- Network errors
- HTTP 429 (Too Many Requests) responses
Exponential backoff is used: 1s, 2s, 4s, etc.
Example:
MAX_RETRIES=5
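The backoff schedule above (1s, 2s, 4s, ...) can be sketched as a doubling delay. The names backoffDelay and retry are hypothetical, and two simplifications are assumed: MAX_RETRIES is taken to count retries after the initial attempt, and this sketch retries on any error, whereas the proxy retries only on network errors and HTTP 429 responses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based):
// 1s, 2s, 4s, and so on.
func backoffDelay(attempt int) time.Duration {
	return time.Second << uint(attempt)
}

// retry calls op until it succeeds or maxRetries retries have been used.
// Assumption: maxRetries counts retries after the first attempt.
// sleep is injected so callers can avoid real waiting in tests.
func retry(maxRetries int, sleep func(time.Duration), op func() error) error {
	var err error
	for attempt := 0; ; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt >= maxRetries {
			return err
		}
		sleep(backoffDelay(attempt))
	}
}

func main() {
	calls := 0
	err := retry(3,
		func(d time.Duration) { fmt.Println("would wait", d) },
		func() error {
			calls++
			if calls < 3 {
				return errors.New("simulated 429")
			}
			return nil
		})
	fmt.Println("calls:", calls, "err:", err)
}
```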
Required Configuration
ZAI_API_KEY
Type: String
Required: Yes
Description: API key for authenticating with the Z.AI upstream API.
The proxy will fail to start if this variable is not set.
Example:
ZAI_API_KEY=your-api-key-here
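A fail-fast check of this kind might look like the sketch below. The helper requireEnv is hypothetical, and treating an empty value the same as an unset one is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// requireEnv returns the value of key, or an error when it is missing.
// Assumption: an empty value is treated the same as an unset variable.
func requireEnv(getenv func(string) string, key string) (string, error) {
	v := getenv(key)
	if v == "" {
		return "", fmt.Errorf("%s is required but not set", key)
	}
	return v, nil
}

func main() {
	key, err := requireEnv(os.Getenv, "ZAI_API_KEY")
	if err != nil {
		// A real service would typically exit with a non-zero status here.
		fmt.Println("startup aborted:", err)
		return
	}
	fmt.Println("API key loaded,", len(key), "characters")
}
```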
Complete Example Configuration
# Required
ZAI_API_KEY=sk-xxx...
# Tokenizer settings
TOKEN_COUNTING_ENABLED=true
TOKENIZER_MODEL=glm-4
# Worker settings
MAX_WORKERS=20
# Rate limiting
RATE_LIMIT_INITIAL=15.0
RATE_LIMIT_MIN=1.0
RATE_LIMIT_MAX=50.0
# Retry settings
MAX_RETRIES=3
Startup Logs
When the proxy starts, it logs the current configuration:
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
Max workers set to: 20
Adaptive rate limiting: initial=15.0, min=1.0, max=50.0 req/s
Z.AI proxy listening on :8080
Metrics available at :8080/metrics
If token counting is disabled:
Token counting disabled (TOKEN_COUNTING_ENABLED=false)
Max workers set to: 20
Adaptive rate limiting: initial=15.0, min=1.0, max=50.0 req/s
Z.AI proxy listening on :8080
Metrics available at :8080/metrics
Kubernetes ConfigMap Example
apiVersion: v1
kind: ConfigMap
metadata:
  name: zai-proxy-config
  namespace: mcp
data:
  TOKEN_COUNTING_ENABLED: "true"
  TOKENIZER_MODEL: "glm-4"
  MAX_WORKERS: "20"
  RATE_LIMIT_INITIAL: "15.0"
  RATE_LIMIT_MIN: "1.0"
  RATE_LIMIT_MAX: "50.0"
  MAX_RETRIES: "3"
Kubernetes Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  template:
    spec:
      containers:
      - name: zai-proxy
        image: ghcr.io/ardenone/zai-proxy:latest
        env:
        - name: ZAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: zai-api-key
              key: api-key
        envFrom:
        - configMapRef:
            name: zai-proxy-config