jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00


Environment Variables

This document describes all environment variables supported by the zai-proxy service.

Tokenizer Configuration

`TOKEN_COUNTING_ENABLED`

Type: Boolean
Default: true
Description: Enable or disable token counting for input and output tokens.

When enabled, the proxy will:

Count input tokens from request messages using tiktoken cl100k_base encoding
Count output tokens from response content (both streaming and non-streaming)
Emit Prometheus metrics: zai_proxy_tokens_total and zai_proxy_token_count_duration_seconds
Log token usage for each request

When disabled, the proxy will skip all token counting operations, reducing CPU overhead.

Valid values:

true, 1, or empty (default) - Token counting enabled
false, 0 - Token counting disabled

Example:

# Enable token counting (default)
TOKEN_COUNTING_ENABLED=true

# Disable token counting
TOKEN_COUNTING_ENABLED=false
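The valid values listed above could be interpreted roughly as follows. This is a minimal sketch, not the proxy's actual code: the function name `tokenCountingEnabled` is hypothetical, and it assumes that case and surrounding whitespace are ignored and that unrecognized values fall back to the default (enabled), neither of which this document confirms.

```go
package main

import "strings"

// tokenCountingEnabled interprets a raw TOKEN_COUNTING_ENABLED value.
// "false" and "0" disable counting; "true", "1", and an empty value enable it.
// Assumption: any other value is treated like the default (enabled).
func tokenCountingEnabled(raw string) bool {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "false", "0":
		return false
	default:
		return true
	}
}
```

A caller would typically pass `os.Getenv("TOKEN_COUNTING_ENABLED")`, which returns an empty string when the variable is unset.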

`TOKENIZER_MODEL`

Type: String
Default: glm-4
Description: Model name used for Prometheus metrics labels.

This value is used as the model label in the zai_proxy_tokens_total metric. It does not affect the tokenization algorithm (which always uses tiktoken cl100k_base encoding), but allows distinguishing token counts by model in metrics.

Example:

# Default
TOKENIZER_MODEL=glm-4

# For different model tracking
TOKENIZER_MODEL=claude-3-opus
TOKENIZER_MODEL=gpt-4

Prometheus metric example:

zai_proxy_tokens_total{direction="input",model="glm-4"} 1234
zai_proxy_tokens_total{direction="output",model="glm-4"} 5678
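To illustrate how the model label ends up on the metric, here is a standard-library-only sketch that keeps per-direction totals and renders them in the text form shown above. The type `tokenCounter` and its methods are hypothetical; a real service would normally register a counter with a Prometheus client library instead, and this sketch is not safe for concurrent use.

```go
package main

import "fmt"

// tokenCounter accumulates token totals per direction under one model label.
type tokenCounter struct {
	model  string
	counts map[string]uint64
}

// newTokenCounter uses the documented default "glm-4" when model is empty.
func newTokenCounter(model string) *tokenCounter {
	if model == "" {
		model = "glm-4"
	}
	return &tokenCounter{model: model, counts: map[string]uint64{}}
}

// add records n tokens for a direction such as "input" or "output".
func (c *tokenCounter) add(direction string, n uint64) {
	c.counts[direction] += n
}

// line renders one sample in the same shape as the metric example above.
func (c *tokenCounter) line(direction string) string {
	return fmt.Sprintf("zai_proxy_tokens_total{direction=%q,model=%q} %d",
		direction, c.model, c.counts[direction])
}
```

The tokenizer itself is untouched by the label: changing `model` only changes which time series the counts are attributed to.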

Worker Configuration

`MAX_WORKERS`

Type: Integer
Default: 10
Description: Maximum number of concurrent requests allowed.

When the number of concurrent requests exceeds this limit, new requests will receive a 503 Service Unavailable response.

Example:

MAX_WORKERS=50
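One common way to enforce a concurrency cap like this in Go is a buffered channel used as a semaphore. The sketch below assumes that approach; the names `workerLimiter`, `tryAcquire`, `release`, and `wrap` are hypothetical, and the proxy may implement the limit differently.

```go
package main

import "net/http"

// workerLimiter caps in-flight requests with a buffered channel.
type workerLimiter struct {
	slots chan struct{}
}

// newWorkerLimiter creates a limiter with maxWorkers slots (MAX_WORKERS).
func newWorkerLimiter(maxWorkers int) *workerLimiter {
	return &workerLimiter{slots: make(chan struct{}, maxWorkers)}
}

// tryAcquire takes a slot without blocking and reports whether it succeeded.
func (l *workerLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously taken by tryAcquire.
func (l *workerLimiter) release() { <-l.slots }

// wrap rejects requests with 503 when all slots are busy instead of queueing.
func (l *workerLimiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			http.Error(w, "too many concurrent requests", http.StatusServiceUnavailable)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}
```

Rejecting immediately rather than queueing keeps latency bounded under overload, at the cost of pushing retry responsibility to the client.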

Rate Limiting Configuration

`RATE_LIMIT_INITIAL`

Type: Float
Default: 10.0
Description: Initial rate limit in requests per second.

The proxy uses adaptive rate limiting that automatically adjusts based on 429 responses from the upstream API.

Example:

RATE_LIMIT_INITIAL=20.0

`RATE_LIMIT_MIN`

Type: Float
Default: 1.0
Description: Minimum rate limit in requests per second.

The adaptive rate limiter will never decrease below this value, even when receiving 429 responses.

Example:

RATE_LIMIT_MIN=0.5

`RATE_LIMIT_MAX`

Type: Float
Default: 50.0
Description: Maximum rate limit in requests per second.

The adaptive rate limiter will never increase above this value, even during successful operation.

Example:

RATE_LIMIT_MAX=100.0

`RATE_LIMIT_ADDITIVE_INCREASE`

Type: Float
Default: 0.5
Description: Additive increase step in requests per second (AIMD algorithm).

The rate limiter uses AIMD (Additive Increase, Multiplicative Decrease):

On success (< 1% 429s): rate increases by this fixed amount
On 429s (> 5%): rate decreases multiplicatively (5-40%)

This is intended to produce stable convergence instead of oscillation. For example, with an effective ceiling of 20 req/s and an additive step of 0.5, the rate should settle near 19.5 instead of bouncing between 19 and 20.

Example:

RATE_LIMIT_ADDITIVE_INCREASE=0.5
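The AIMD rule described above can be sketched as a single update step. This is an illustration under stated assumptions, not the proxy's implementation: the names `aimdConfig` and `nextRate` are hypothetical, the rate is assumed to hold steady when the 429 ratio is between 1% and 5%, and the multiplicative cut is assumed to equal the 429 ratio capped at 40% (the document only gives the 5-40% range, not how the cut is chosen within it).

```go
package main

// aimdConfig mirrors the RATE_LIMIT_* variables described above (req/s).
type aimdConfig struct {
	Min              float64 // RATE_LIMIT_MIN
	Max              float64 // RATE_LIMIT_MAX
	AdditiveIncrease float64 // RATE_LIMIT_ADDITIVE_INCREASE
}

// nextRate returns the rate for the next window, given the current rate and
// the fraction of responses in the last window that were HTTP 429.
func nextRate(cfg aimdConfig, current, ratio429 float64) float64 {
	next := current
	switch {
	case ratio429 < 0.01:
		// Success: additive increase by a fixed step.
		next = current + cfg.AdditiveIncrease
	case ratio429 > 0.05:
		// Throttled: multiplicative decrease.
		// Assumption: the cut equals the 429 ratio, capped at 40%.
		cut := ratio429
		if cut > 0.40 {
			cut = 0.40
		}
		next = current * (1 - cut)
	}
	// Never leave the [Min, Max] band.
	if next < cfg.Min {
		next = cfg.Min
	}
	if next > cfg.Max {
		next = cfg.Max
	}
	return next
}
```

With the documented defaults (`Min` 1.0, `Max` 50.0, step 0.5), a clean window at 10 req/s moves the rate to 10.5, while a window with half the responses throttled cuts it to about 6.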

Retry Configuration

`MAX_RETRIES`

Type: Integer
Default: 3
Description: Maximum number of retry attempts for failed requests.

The proxy will retry requests on:

Network errors
HTTP 429 (Too Many Requests) responses

Exponential backoff is used: 1s, 2s, 4s, etc.

Example:

MAX_RETRIES=5

Required Configuration

`ZAI_API_KEY`

Type: String
Required: Yes
Description: API key for authenticating with the Z.AI upstream API.

The proxy will fail to start if this variable is not set.

Example:

ZAI_API_KEY=your-api-key-here
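A fail-fast check of this kind is usually a few lines at startup. The sketch below is an assumption about how it could be written: `requireAPIKey` is a hypothetical name, the lookup parameter has the same shape as `os.LookupEnv`, and treating an empty or whitespace-only value as missing is a choice this document does not confirm.

```go
package main

import (
	"errors"
	"strings"
)

// requireAPIKey returns the ZAI_API_KEY value or an error if it is missing.
// Assumption: an empty or whitespace-only value counts as not set.
func requireAPIKey(lookup func(string) (string, bool)) (string, error) {
	key, ok := lookup("ZAI_API_KEY")
	if !ok || strings.TrimSpace(key) == "" {
		return "", errors.New("ZAI_API_KEY is required but not set")
	}
	return key, nil
}
```

A `main` function would call `requireAPIKey(os.LookupEnv)` and exit with a non-zero status on error, before opening any listener.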

Complete Example Configuration

# Required
ZAI_API_KEY=sk-xxx...

# Tokenizer settings
TOKEN_COUNTING_ENABLED=true
TOKENIZER_MODEL=glm-4

# Worker settings
MAX_WORKERS=20

# Rate limiting
RATE_LIMIT_INITIAL=15.0
RATE_LIMIT_MIN=1.0
RATE_LIMIT_MAX=50.0
RATE_LIMIT_ADDITIVE_INCREASE=0.5

# Retry settings
MAX_RETRIES=3
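For readers who want to see how such a file maps onto typed settings, here is a sketch of a loader that applies the documented defaults and overrides them from the environment. The `config` struct and `loadConfig` function are hypothetical, and returning an error on an unparsable value is an assumption; the proxy might instead fall back to the default.

```go
package main

import (
	"fmt"
	"strconv"
)

// config holds the numeric settings documented above.
type config struct {
	MaxWorkers           int
	MaxRetries           int
	RateInitial          float64
	RateMin              float64
	RateMax              float64
	RateAdditiveIncrease float64
}

// loadConfig starts from the documented defaults and overrides each field
// whose environment variable is set to a non-empty value.
func loadConfig(getenv func(string) string) (config, error) {
	cfg := config{
		MaxWorkers: 10, MaxRetries: 3,
		RateInitial: 10.0, RateMin: 1.0, RateMax: 50.0, RateAdditiveIncrease: 0.5,
	}
	ints := map[string]*int{
		"MAX_WORKERS": &cfg.MaxWorkers,
		"MAX_RETRIES": &cfg.MaxRetries,
	}
	floats := map[string]*float64{
		"RATE_LIMIT_INITIAL":           &cfg.RateInitial,
		"RATE_LIMIT_MIN":               &cfg.RateMin,
		"RATE_LIMIT_MAX":               &cfg.RateMax,
		"RATE_LIMIT_ADDITIVE_INCREASE": &cfg.RateAdditiveIncrease,
	}
	for name, dst := range ints {
		raw := getenv(name)
		if raw == "" {
			continue // keep the default
		}
		v, err := strconv.Atoi(raw)
		if err != nil {
			return config{}, fmt.Errorf("%s=%q is not an integer: %w", name, raw, err)
		}
		*dst = v
	}
	for name, dst := range floats {
		raw := getenv(name)
		if raw == "" {
			continue // keep the default
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return config{}, fmt.Errorf("%s=%q is not a number: %w", name, raw, err)
		}
		*dst = v
	}
	return cfg, nil
}
```

Passing `os.Getenv` as `getenv` reads the real environment; taking it as a parameter keeps the loader easy to test with a plain map.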

Startup Logs

When the proxy starts, it logs the current configuration:

Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
Max workers set to: 20
Adaptive rate limiting: initial=15.0, min=1.0, max=50.0 req/s
Z.AI proxy listening on :8080
Metrics available at :8080/metrics

If token counting is disabled:

Token counting disabled (TOKEN_COUNTING_ENABLED=false)
Max workers set to: 20
Adaptive rate limiting: initial=15.0, min=1.0, max=50.0 req/s
Z.AI proxy listening on :8080
Metrics available at :8080/metrics

Kubernetes ConfigMap Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: zai-proxy-config
  namespace: mcp
data:
  TOKEN_COUNTING_ENABLED: "true"
  TOKENIZER_MODEL: "glm-4"
  MAX_WORKERS: "20"
  RATE_LIMIT_INITIAL: "15.0"
  RATE_LIMIT_MIN: "1.0"
  RATE_LIMIT_MAX: "50.0"
  RATE_LIMIT_ADDITIVE_INCREASE: "0.5"
  MAX_RETRIES: "3"

Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
  namespace: mcp
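spec:
  # apps/v1 requires a selector that matches the pod template labels;
  # the label app: zai-proxy is an assumed value
  selector:
    matchLabels:
      app: zai-proxy
  template:
    metadata:
      labels:
        app: zai-proxy
    spec:
      containers: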
      - name: zai-proxy
        image: ghcr.io/ardenone/zai-proxy:latest
        env:
        - name: ZAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: zai-api-key
              key: api-key
        envFrom:
        - configMapRef:
            name: zai-proxy-config
