jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00

Tokenizer Configuration

This document describes the tokenizer configuration options for the Z.AI proxy.

Environment Variables

`TOKEN_COUNTING_ENABLED`

Default: true

Controls whether token counting is enabled or disabled.

Values:

- true, 1, or unset: token counting is enabled (default)
- false or 0: token counting is disabled

Example:

# Disable token counting
export TOKEN_COUNTING_ENABLED=false

# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true

Behavior:

- When enabled, the proxy initializes the tiktoken tokenizer and counts tokens for all requests and responses.
- When disabled, no tokenizer is initialized and no token metrics are collected.
- Disabling token counting can reduce CPU and memory usage when token metrics are not needed.
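
The value rules above can be sketched in Go roughly as follows. This is a minimal illustration, not the proxy's actual code: the function name `tokenCountingEnabled` is hypothetical, and the handling of whitespace, letter case, and unrecognized values (treated as "enabled") is an assumption that the text above does not specify.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenCountingEnabled interprets a raw TOKEN_COUNTING_ENABLED value.
// Documented behavior: "true", "1", or unset enables counting;
// "false" or "0" disables it.
// Assumption: whitespace and case are ignored, and any unrecognized
// value keeps the default (enabled).
func tokenCountingEnabled(raw string) bool {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "false", "0":
		return false
	default:
		return true
	}
}

func main() {
	// os.Getenv returns "" for an unset variable, which maps to the default.
	enabled := tokenCountingEnabled(os.Getenv("TOKEN_COUNTING_ENABLED"))
	fmt.Println("token counting enabled:", enabled)
}
```

Passing the raw string in, rather than reading the environment inside the function, keeps the parsing rule easy to check in isolation.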

`TOKENIZER_MODEL`

Default: glm-4

Specifies the model name to use as a label in Prometheus token metrics.

Values: Any string (e.g., glm-4, claude-3, gpt-4)

Example:

# Set model name for metrics
export TOKENIZER_MODEL=glm-4.7

# Use different model name
export TOKENIZER_MODEL=claude-3-sonnet

Behavior:

- This setting only labels Prometheus metrics; it does not affect the tokenization algorithm.
- The proxy always uses tiktoken's cl100k_base encoding regardless of this setting.
- Metrics are tagged with the configured model name, for example zai_proxy_tokens_total{direction="input",model="glm-4"}.
- Because the label is a single value taken from the environment, it is most useful for telling apart token usage across proxy deployments that each serve a different model.
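
To illustrate how the setting ends up in a metric series, here is a small Go sketch. The helper names `modelLabel` and `seriesFor` are hypothetical and the code is not taken from the proxy; it only assumes the documented default (glm-4) and the label layout shown above.

```go
package main

import (
	"fmt"
	"os"
)

// defaultTokenizerModel mirrors the documented default for TOKENIZER_MODEL.
const defaultTokenizerModel = "glm-4"

// modelLabel returns the value used for the "model" label: the raw
// environment value, or the default when the variable is unset or empty.
func modelLabel(raw string) string {
	if raw == "" {
		return defaultTokenizerModel
	}
	return raw
}

// seriesFor renders the series identifier for one direction and model.
// %q is sufficient for simple label values like the ones in this document.
func seriesFor(direction, model string) string {
	return fmt.Sprintf("zai_proxy_tokens_total{direction=%q,model=%q}", direction, model)
}

func main() {
	model := modelLabel(os.Getenv("TOKENIZER_MODEL"))
	fmt.Println(seriesFor("input", model))
	fmt.Println(seriesFor("output", model))
}
```

Note that nothing in this sketch touches tokenization: the model name only changes the label, which matches the behavior described above.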

Startup Log Messages

The proxy logs its tokenizer configuration at startup:

Token counting enabled (tiktoken):

Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)

Token counting enabled (fallback mode):

Warning: Failed to initialize TikToken counter: <error>
Falling back to SimpleTokenCounter
Token counting enabled (fallback mode, model: glm-4)

Token counting disabled:

Token counting disabled (TOKEN_COUNTING_ENABLED=false)
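
The three startup paths can be sketched as a single selection function. This is an assumption-laden illustration: the `TokenCounter` interface, `simpleCounter`, `newCounter`, and `failingInit` are hypothetical names, and the real tiktoken initialization is replaced by an injected function so the sketch stays self-contained. Only the log strings come from the messages listed above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TokenCounter is a hypothetical interface; the proxy's real types may differ.
type TokenCounter interface {
	Count(text string) int
}

// simpleCounter stands in for SimpleTokenCounter (word-based approximation).
type simpleCounter struct{}

func (simpleCounter) Count(text string) int { return len(strings.Fields(text)) }

// failingInit simulates a tiktoken initialization failure.
func failingInit() (TokenCounter, error) {
	return nil, errors.New("encoding data unavailable")
}

// newCounter picks a counter and returns the startup log lines for that path.
// A nil counter means token counting is disabled.
func newCounter(enabled bool, model string, initTikToken func() (TokenCounter, error)) (TokenCounter, []string) {
	if !enabled {
		return nil, []string{"Token counting disabled (TOKEN_COUNTING_ENABLED=false)"}
	}
	c, err := initTikToken()
	if err != nil {
		return simpleCounter{}, []string{
			fmt.Sprintf("Warning: Failed to initialize TikToken counter: %v", err),
			"Falling back to SimpleTokenCounter",
			fmt.Sprintf("Token counting enabled (fallback mode, model: %s)", model),
		}
	}
	return c, []string{
		fmt.Sprintf("Token counting enabled (tiktoken cl100k_base encoding, model: %s)", model),
	}
}

func main() {
	_, logs := newCounter(true, "glm-4", failingInit)
	for _, line := range logs {
		fmt.Println(line)
	}
}
```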

Prometheus Metrics

When token counting is enabled, the following metrics are exposed:

`zai_proxy_tokens_total`

Type: Counter

Labels:

- direction: input or output
- model: value of the TOKENIZER_MODEL environment variable

Description: Total number of tokens processed by direction and model.

Example:

# HELP zai_proxy_tokens_total Total number of tokens processed
# TYPE zai_proxy_tokens_total counter
zai_proxy_tokens_total{direction="input",model="glm-4"} 15234
zai_proxy_tokens_total{direction="output",model="glm-4"} 8921
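
For quick checks without a Prometheus server, the exposition text above can be summed per direction with a few lines of Go. This is a deliberately naive sketch: `tokenTotals` and `sampleMetrics` are hypothetical names, and the parser assumes label values contain no commas, braces, or escaped quotes, which holds for the sample shown here but not for arbitrary metrics.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleMetrics repeats the example exposition output from this document.
const sampleMetrics = `# HELP zai_proxy_tokens_total Total number of tokens processed
# TYPE zai_proxy_tokens_total counter
zai_proxy_tokens_total{direction="input",model="glm-4"} 15234
zai_proxy_tokens_total{direction="output",model="glm-4"} 8921
`

// tokenTotals sums zai_proxy_tokens_total samples by their direction label.
// Lines that do not match the expected shape are skipped.
func tokenTotals(exposition string) map[string]float64 {
	const prefix = "zai_proxy_tokens_total{"
	totals := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		end := strings.Index(line, "}")
		if end < 0 {
			continue
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		for _, pair := range strings.Split(line[len(prefix):end], ",") {
			name, val, ok := strings.Cut(pair, "=")
			if ok && strings.TrimSpace(name) == "direction" {
				totals[strings.Trim(val, `"`)] += value
			}
		}
	}
	return totals
}

func main() {
	totals := tokenTotals(sampleMetrics)
	fmt.Println("input:", totals["input"], "output:", totals["output"])
}
```

For anything beyond ad hoc inspection, a proper exposition-format parser is the safer choice.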

`zai_proxy_token_count_duration_seconds`

Type: Histogram

Description: Duration of token counting operations in seconds.

Example:

# HELP zai_proxy_token_count_duration_seconds Duration of token counting operations
# TYPE zai_proxy_token_count_duration_seconds histogram
zai_proxy_token_count_duration_seconds_bucket{le="0.0001"} 142
zai_proxy_token_count_duration_seconds_bucket{le="0.0005"} 289
zai_proxy_token_count_duration_seconds_bucket{le="0.001"} 456
...

Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
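spec:
  selector:
    matchLabels:
      app: zai-proxy   # illustrative label; must match the template labels
  template:
    metadata:
      labels:
        app: zai-proxy
    spec: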
      containers:
      - name: zai-proxy
        image: zai-proxy:latest
        env:
        - name: ZAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: zai-api-key
              key: api-key
        - name: TOKEN_COUNTING_ENABLED
          value: "true"
        - name: TOKENIZER_MODEL
          value: "glm-4"
        - name: MAX_WORKERS
          value: "50"
        - name: RATE_LIMIT_INITIAL
          value: "10"
        - name: RATE_LIMIT_MIN
          value: "1"
        - name: RATE_LIMIT_MAX
          value: "50"

Implementation Details

- Tokenizer: Uses tiktoken-go with the cl100k_base encoding. This is the encoding used by OpenAI's GPT-4 and GPT-3.5 Turbo models, not the native GLM or Claude tokenizer, so counts for those models are approximate.
- Fallback: If tiktoken initialization fails, the proxy falls back to a simple word-based approximation.
- Thread-safe: Token counting is mutex-protected for concurrent access.
- Performance: Token counting adds minimal latency (roughly 0.1-1 ms per request).
- Streaming: Both streaming (SSE) and non-streaming responses are supported.
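
As a rough picture of the fallback and thread-safety points, the following Go sketch shows a word-based counter guarded by a mutex. The type `wordCounter` and its methods are hypothetical; the proxy's SimpleTokenCounter may split text differently or keep different state. Whitespace splitting is an assumption about what "word-based" means here.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// wordCounter approximates tokens as whitespace-separated words and keeps
// a running total that is safe to update from multiple goroutines.
type wordCounter struct {
	mu    sync.Mutex
	total int
}

// Count returns the approximate token count for text and adds it to the total.
func (c *wordCounter) Count(text string) int {
	n := len(strings.Fields(text))
	c.mu.Lock()
	c.total += n
	c.mu.Unlock()
	return n
}

// Total returns the number of tokens counted so far.
func (c *wordCounter) Total() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total
}

func main() {
	var c wordCounter
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Count("count these four words")
		}()
	}
	wg.Wait()
	fmt.Println(c.Total()) // 8 goroutines x 4 words = 32
}
```

The splitting itself happens outside the lock, so only the shared total is serialized between concurrent requests.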
