zai-proxy/docs/notes/ENVIRONMENT_VARIABLES.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
2026-05-16 15:53:52 -04:00


# Environment Variables
This document describes all environment variables supported by the zai-proxy service.
## Tokenizer Configuration
### `TOKEN_COUNTING_ENABLED`
**Type:** Boolean
**Default:** `true`
**Description:** Enable or disable token counting for input and output tokens.
When enabled, the proxy will:
- Count input tokens from request messages using tiktoken cl100k_base encoding
- Count output tokens from response content (both streaming and non-streaming)
- Emit Prometheus metrics: `zai_proxy_tokens_total` and `zai_proxy_token_count_duration_seconds`
- Log token usage for each request

When disabled, the proxy skips all token counting operations, reducing CPU overhead.
**Valid values:**
- `true`, `1`, or empty (default) - Token counting enabled
- `false`, `0` - Token counting disabled
**Example:**
```bash
# Enable token counting (default)
TOKEN_COUNTING_ENABLED=true
# Disable token counting
TOKEN_COUNTING_ENABLED=false
```
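
The accepted values listed above could be parsed along these lines. This is a minimal Go sketch, not the proxy's actual code: the function name `parseTokenCounting`, the case-insensitive matching, and the choice to reject unrecognized values are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTokenCounting maps a raw TOKEN_COUNTING_ENABLED value to a boolean.
// Empty, "true", and "1" enable counting; "false" and "0" disable it.
// Assumption: matching is case-insensitive and any other value is an error.
func parseTokenCounting(raw string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "", "true", "1":
		return true, nil
	case "false", "0":
		return false, nil
	default:
		return false, fmt.Errorf("invalid TOKEN_COUNTING_ENABLED value %q", raw)
	}
}

func main() {
	for _, raw := range []string{"", "true", "0", "maybe"} {
		enabled, err := parseTokenCounting(raw)
		fmt.Printf("%q -> enabled=%v err=%v\n", raw, enabled, err)
	}
}
```

Returning an error for unknown values makes typos visible at startup; a real service might instead fall back to the default.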
### `TOKENIZER_MODEL`
**Type:** String
**Default:** `glm-4`
**Description:** Model name used for Prometheus metrics labels.
This value is attached as the `model` label on the `zai_proxy_tokens_total` metric. It does not change the tokenization algorithm, which always uses the tiktoken cl100k_base encoding; it only lets you distinguish token counts by model in metrics.
**Example:**
```bash
# Default
TOKENIZER_MODEL=glm-4
# For different model tracking
TOKENIZER_MODEL=claude-3-opus
TOKENIZER_MODEL=gpt-4
```
**Prometheus metric example:**
```
zai_proxy_tokens_total{direction="input",model="glm-4"} 1234
zai_proxy_tokens_total{direction="output",model="glm-4"} 5678
```
## Worker Configuration
### `MAX_WORKERS`
**Type:** Integer
**Default:** `10`
**Description:** Maximum number of concurrent requests allowed.
Once this many requests are in flight, additional requests receive a `503 Service Unavailable` response until a slot frees up.
**Example:**
```bash
MAX_WORKERS=50
```
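
One common way to enforce a concurrency cap like this in Go is a buffered channel used as a counting semaphore. The sketch below is illustrative only: `limitConcurrency` and `demo` are hypothetical names, and the proxy may implement the limit differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// limitConcurrency wraps next so that at most maxWorkers requests run at once.
// Requests arriving while all slots are taken get 503 Service Unavailable.
func limitConcurrency(maxWorkers int, next http.Handler) http.Handler {
	slots := make(chan struct{}, maxWorkers) // buffered channel as a semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // acquired a slot
			defer func() { <-slots }() // release it when the request finishes
			next.ServeHTTP(w, r)
		default: // no free slot: reject immediately
			http.Error(w, "too many concurrent requests", http.StatusServiceUnavailable)
		}
	})
}

// demo holds one request open with a limit of 1, then sends a second request.
// It returns the status codes of the first and second requests.
func demo() (first, second int) {
	started := make(chan struct{})
	release := make(chan struct{})
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(started) // signal that the only slot is now occupied
		<-release      // stay in flight until told to finish
		w.WriteHeader(http.StatusOK)
	})
	h := limitConcurrency(1, slow)

	done := make(chan int)
	go func() {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		done <- rec.Code
	}()

	<-started
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	second = rec.Code

	close(release)
	first = <-done
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first request:", first, "second request:", second)
}
```

Rejecting immediately, rather than queueing, keeps latency bounded and lets clients decide whether to retry.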
## Rate Limiting Configuration
### `RATE_LIMIT_INITIAL`
**Type:** Float
**Default:** `10.0`
**Description:** Initial rate limit in requests per second.
The proxy uses adaptive rate limiting that automatically adjusts based on 429 responses from the upstream API.
**Example:**
```bash
RATE_LIMIT_INITIAL=20.0
```
### `RATE_LIMIT_MIN`
**Type:** Float
**Default:** `1.0`
**Description:** Minimum rate limit in requests per second.
The adaptive rate limiter will never decrease below this value, even when receiving 429 responses.
**Example:**
```bash
RATE_LIMIT_MIN=0.5
```
### `RATE_LIMIT_MAX`
**Type:** Float
**Default:** `50.0`
**Description:** Maximum rate limit in requests per second.
The adaptive rate limiter will never increase above this value, even during successful operation.
**Example:**
```bash
RATE_LIMIT_MAX=100.0
```
### `RATE_LIMIT_ADDITIVE_INCREASE`
**Type:** Float
**Default:** `0.5`
**Description:** Additive increase step in requests per second (AIMD algorithm).
The rate limiter uses AIMD (Additive Increase, Multiplicative Decrease):
- When fewer than 1% of responses are 429s, the rate increases by this fixed amount.
- When more than 5% of responses are 429s, the rate decreases multiplicatively (by 5-40%).

This converges steadily instead of oscillating. For example, with a ceiling of 20 req/s and an additive step of 0.5, the rate settles near 19.5 instead of bouncing between 19 and 20.
**Example:**
```bash
RATE_LIMIT_ADDITIVE_INCREASE=0.5
```
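
The AIMD rule described above can be sketched as a single update step. The source does not say how the 5-40% decrease is chosen or what happens when the 429 share falls between 1% and 5%, so this sketch assumes the cut equals the observed 429 fraction clamped to that range and leaves the rate unchanged in the middle band. All type and function names are hypothetical.

```go
package main

import "fmt"

// aimd holds the current rate and its bounds, all in requests per second.
type aimd struct {
	rate, floor, ceiling, step float64
}

// clamp restricts v to the range [lo, hi].
func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// update adjusts the rate given the fraction of 429 responses in a window.
func (a *aimd) update(frac429 float64) {
	switch {
	case frac429 < 0.01:
		// Additive increase: grow by a fixed step.
		a.rate += a.step
	case frac429 > 0.05:
		// Multiplicative decrease. Assumption: the cut tracks the 429
		// fraction, clamped to the documented 5-40% range.
		a.rate *= 1 - clamp(frac429, 0.05, 0.40)
	}
	// Never leave the [RATE_LIMIT_MIN, RATE_LIMIT_MAX] band.
	a.rate = clamp(a.rate, a.floor, a.ceiling)
}

func main() {
	a := aimd{rate: 10.0, floor: 1.0, ceiling: 50.0, step: 0.5}
	for _, frac := range []float64{0.0, 0.0, 0.5, 0.03, 0.0} {
		a.update(frac)
		fmt.Printf("429 share %.2f -> rate %.2f req/s\n", frac, a.rate)
	}
}
```

Small additive steps combined with proportionally larger cuts are what let the rate back off quickly under pressure and probe upward gently afterwards.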
## Retry Configuration
### `MAX_RETRIES`
**Type:** Integer
**Default:** `3`
**Description:** Maximum number of retry attempts for failed requests.
The proxy will retry requests on:
- Network errors
- HTTP 429 (Too Many Requests) responses

Retries use exponential backoff: the delay doubles after each attempt (1s, 2s, 4s, and so on).
**Example:**
```bash
MAX_RETRIES=5
```
## Required Configuration
### `ZAI_API_KEY`
**Type:** String
**Required:** Yes
**Description:** API key for authenticating with the Z.AI upstream API.
The proxy will fail to start if this variable is not set.
**Example:**
```bash
ZAI_API_KEY=your-api-key-here
```
## Complete Example Configuration
```bash
# Required
ZAI_API_KEY=sk-xxx...
# Tokenizer settings
TOKEN_COUNTING_ENABLED=true
TOKENIZER_MODEL=glm-4
# Worker settings
MAX_WORKERS=20
# Rate limiting
RATE_LIMIT_INITIAL=15.0
RATE_LIMIT_MIN=1.0
RATE_LIMIT_MAX=50.0
RATE_LIMIT_ADDITIVE_INCREASE=0.5
# Retry settings
MAX_RETRIES=3
```
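
A loader for the rate limiting settings might apply the documented defaults as shown below. This is a sketch under stated assumptions: the `rateLimits` type, the `floatOr` and `loadRateLimits` helpers, the use of a map in place of the process environment, and the `min <= initial <= max` sanity check are all illustrative and not confirmed behavior of the proxy.

```go
package main

import (
	"fmt"
	"strconv"
)

// rateLimits groups the four rate limiting settings (requests per second).
type rateLimits struct {
	Initial, Min, Max, Step float64
}

// floatOr returns env[name] parsed as a float, or def when unset or empty.
func floatOr(env map[string]string, name string, def float64) (float64, error) {
	raw, ok := env[name]
	if !ok || raw == "" {
		return def, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", name, err)
	}
	return v, nil
}

// loadRateLimits reads the settings, falling back to the documented defaults.
func loadRateLimits(env map[string]string) (rateLimits, error) {
	var rl rateLimits
	fields := []struct {
		name string
		def  float64
		dst  *float64
	}{
		{"RATE_LIMIT_INITIAL", 10.0, &rl.Initial},
		{"RATE_LIMIT_MIN", 1.0, &rl.Min},
		{"RATE_LIMIT_MAX", 50.0, &rl.Max},
		{"RATE_LIMIT_ADDITIVE_INCREASE", 0.5, &rl.Step},
	}
	for _, f := range fields {
		v, err := floatOr(env, f.name, f.def)
		if err != nil {
			return rateLimits{}, err
		}
		*f.dst = v
	}
	// Assumption: reject configurations where the bounds are inconsistent.
	if rl.Min > rl.Initial || rl.Initial > rl.Max {
		return rateLimits{}, fmt.Errorf(
			"want RATE_LIMIT_MIN <= RATE_LIMIT_INITIAL <= RATE_LIMIT_MAX, got %v, %v, %v",
			rl.Min, rl.Initial, rl.Max)
	}
	return rl, nil
}

func main() {
	env := map[string]string{
		"RATE_LIMIT_INITIAL": "15.0",
		"RATE_LIMIT_MIN":     "1.0",
		"RATE_LIMIT_MAX":     "50.0",
	}
	rl, err := loadRateLimits(env)
	fmt.Printf("%+v err=%v\n", rl, err)
}
```

Passing the environment as a map keeps the loader easy to test; a real service would build that map from `os.Environ` or call `os.LookupEnv` directly.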
## Startup Logs
When the proxy starts, it logs the current configuration:
```
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
Max workers set to: 20
Adaptive rate limiting: initial=15.0, min=1.0, max=50.0 req/s
Z.AI proxy listening on :8080
Metrics available at :8080/metrics
```
If token counting is disabled:
```
Token counting disabled (TOKEN_COUNTING_ENABLED=false)
Max workers set to: 20
Adaptive rate limiting: initial=15.0, min=1.0, max=50.0 req/s
Z.AI proxy listening on :8080
Metrics available at :8080/metrics
```
## Kubernetes ConfigMap Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: zai-proxy-config
  namespace: mcp
data:
  TOKEN_COUNTING_ENABLED: "true"
  TOKENIZER_MODEL: "glm-4"
  MAX_WORKERS: "20"
  RATE_LIMIT_INITIAL: "15.0"
  RATE_LIMIT_MIN: "1.0"
  RATE_LIMIT_MAX: "50.0"
  MAX_RETRIES: "3"
```
## Kubernetes Deployment Example
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
  namespace: mcp
spec:
  template:
    spec:
      containers:
        - name: zai-proxy
          image: ghcr.io/ardenone/zai-proxy:latest
          env:
            - name: ZAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: zai-api-key
                  key: api-key
          envFrom:
            - configMapRef:
                name: zai-proxy-config
```