zai-proxy/docs/notes/TOKENIZER_CONFIGURATION.md

Tokenizer Configuration

This document describes the tokenizer configuration options for the Z.AI proxy.

Environment Variables

TOKEN_COUNTING_ENABLED

Default: true

Controls whether token counting is enabled or disabled.

Values:

  • true or 1 or unset: Token counting is enabled (default)
  • false or 0: Token counting is disabled

Example:

# Disable token counting
export TOKEN_COUNTING_ENABLED=false

# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true

Behavior:

  • When enabled, the proxy initializes the tiktoken tokenizer and counts tokens for all requests and responses
  • When disabled, no tokenizer is initialized and no token metrics are collected
  • Disabling can reduce CPU usage and memory footprint if token metrics are not needed
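The value rules above can be sketched in Go. This is a minimal illustration, not the proxy's actual parsing code: the function name `tokenCountingEnabled` is hypothetical, and treating unrecognized values as enabled, ignoring case, and trimming whitespace are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenCountingEnabled interprets a TOKEN_COUNTING_ENABLED value.
// set reports whether the variable was present in the environment.
// Assumption: only "false" and "0" disable counting; any other value,
// including unrecognized ones, leaves counting enabled.
func tokenCountingEnabled(value string, set bool) bool {
	if !set {
		return true // unset means enabled (the documented default)
	}
	switch strings.ToLower(strings.TrimSpace(value)) {
	case "false", "0":
		return false
	default:
		return true
	}
}

func main() {
	value, set := os.LookupEnv("TOKEN_COUNTING_ENABLED")
	fmt.Println("token counting enabled:", tokenCountingEnabled(value, set))
}
```

Using `os.LookupEnv` rather than `os.Getenv` keeps "unset" distinguishable from "set to an empty string", which matters if the two cases ever need different handling.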

TOKENIZER_MODEL

Default: glm-4

Specifies the model name to use as a label in Prometheus token metrics.

Values: Any string (e.g., glm-4, claude-3, gpt-4)

Example:

# Set model name for metrics
export TOKENIZER_MODEL=glm-4.7

# Use different model name
export TOKENIZER_MODEL=claude-3-sonnet

Behavior:

  • This is purely for Prometheus metrics labeling and does not affect the tokenization algorithm
  • The proxy always uses tiktoken's cl100k_base encoding regardless of this setting
  • Metrics will be tagged with the specified model name: zai_proxy_tokens_total{direction="input",model="glm-4"}
  • Useful for telling apart token usage from proxy deployments that serve different models; the label is read once from the environment, so it does not change per request
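To show how the setting ends up in a metric sample, here is a small Go sketch. It assumes an empty value is treated like an unset one, and the helper names `tokenizerModel` and `sampleLine` are hypothetical; the real proxy presumably uses a Prometheus client library rather than formatting lines by hand.

```go
package main

import (
	"fmt"
	"os"
)

// defaultTokenizerModel is the documented default label value.
const defaultTokenizerModel = "glm-4"

// tokenizerModel returns the label value for a raw TOKENIZER_MODEL setting.
// Assumption: an empty string falls back to the default.
func tokenizerModel(raw string) string {
	if raw == "" {
		return defaultTokenizerModel
	}
	return raw
}

// sampleLine renders one zai_proxy_tokens_total sample in the Prometheus
// text format. %q is sufficient quoting for simple label values like these.
func sampleLine(direction, model string, value uint64) string {
	return fmt.Sprintf("zai_proxy_tokens_total{direction=%q,model=%q} %d", direction, model, value)
}

func main() {
	model := tokenizerModel(os.Getenv("TOKENIZER_MODEL"))
	fmt.Println(sampleLine("input", model, 15234))
}
```

The label only names the series. Nothing in this sketch, or in the behavior described above, changes how text is tokenized.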

Startup Log Messages

The proxy logs its tokenizer configuration at startup:

Token counting enabled (tiktoken):

Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)

Token counting enabled (fallback mode):

Warning: Failed to initialize TikToken counter: <error>
Falling back to SimpleTokenCounter
Token counting enabled (fallback mode, model: glm-4)

Token counting disabled:

Token counting disabled (TOKEN_COUNTING_ENABLED=false)
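The enabled and fallback messages above suggest a try-then-fall-back initialization. The following Go sketch shows one way that selection could look. The `TokenCounter` interface, `selectCounter`, and the stub constructors are hypothetical names introduced for illustration, and a whitespace word count stands in for both the real tiktoken counter and SimpleTokenCounter.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TokenCounter is a hypothetical interface shared by both counters.
type TokenCounter interface {
	Count(text string) int
}

// wordCounter stands in for the SimpleTokenCounter fallback.
type wordCounter struct{}

func (wordCounter) Count(text string) int { return len(strings.Fields(text)) }

// failingPrimary simulates a tiktoken initialization failure.
func failingPrimary() (TokenCounter, error) {
	return nil, errors.New("encoding data unavailable")
}

// stubPrimary simulates a successful initialization (not real tiktoken).
func stubPrimary() (TokenCounter, error) { return wordCounter{}, nil }

// selectCounter tries the primary constructor and falls back on error.
// It returns the chosen counter and the startup lines it would log.
func selectCounter(newPrimary func() (TokenCounter, error), model string) (TokenCounter, []string) {
	counter, err := newPrimary()
	if err != nil {
		return wordCounter{}, []string{
			fmt.Sprintf("Warning: Failed to initialize TikToken counter: %v", err),
			"Falling back to SimpleTokenCounter",
			fmt.Sprintf("Token counting enabled (fallback mode, model: %s)", model),
		}
	}
	return counter, []string{
		fmt.Sprintf("Token counting enabled (tiktoken cl100k_base encoding, model: %s)", model),
	}
}

func main() {
	_, logs := selectCounter(failingPrimary, "glm-4")
	for _, line := range logs {
		fmt.Println(line)
	}
}
```

Returning the log lines instead of printing them inside `selectCounter` is a choice made here to keep the sketch easy to check; the proxy itself presumably logs directly.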

Prometheus Metrics

When token counting is enabled, the following metrics are exposed:

zai_proxy_tokens_total

Type: Counter

Labels:

  • direction: input or output
  • model: Value from TOKENIZER_MODEL environment variable

Description: Total number of tokens processed by direction and model.

Example:

# HELP zai_proxy_tokens_total Total number of tokens processed
# TYPE zai_proxy_tokens_total counter
zai_proxy_tokens_total{direction="input",model="glm-4"} 15234
zai_proxy_tokens_total{direction="output",model="glm-4"} 8921

zai_proxy_token_count_duration_seconds

Type: Histogram

Description: Duration of token counting operations in seconds.

Example:

# HELP zai_proxy_token_count_duration_seconds Duration of token counting operations
# TYPE zai_proxy_token_count_duration_seconds histogram
zai_proxy_token_count_duration_seconds_bucket{le="0.0001"} 142
zai_proxy_token_count_duration_seconds_bucket{le="0.0005"} 289
zai_proxy_token_count_duration_seconds_bucket{le="0.001"} 456
...
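When reading the bucket lines above, keep in mind that Prometheus histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound, so the values never decrease from one bucket to the next. The Go sketch below illustrates that rule using the three bounds from the example. The function name and the sample durations are made up for illustration.

```go
package main

import "fmt"

// cumulativeBuckets counts, for each upper bound, how many observations
// are less than or equal to it. One observation can land in several buckets.
func cumulativeBuckets(bounds, observations []float64) []int {
	counts := make([]int, len(bounds))
	for _, obs := range observations {
		for i, le := range bounds {
			if obs <= le {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	bounds := []float64{0.0001, 0.0005, 0.001}
	// Hypothetical token-counting durations in seconds.
	durations := []float64{0.00005, 0.0003, 0.0004, 0.0008, 0.002}
	for i, count := range cumulativeBuckets(bounds, durations) {
		fmt.Printf("le=%g count=%d\n", bounds[i], count)
	}
}
```

The 0.002 s observation exceeds every listed bound, so it would appear only in the `+Inf` bucket that a real histogram always includes.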

Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
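spec:
  # apps/v1 requires a selector that matches the pod template labels;
  # the "app: zai-proxy" label is illustrative
  selector:
    matchLabels:
      app: zai-proxy
  template:
    metadata:
      labels:
        app: zai-proxy
    spec: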
      containers:
      - name: zai-proxy
        image: zai-proxy:latest
        env:
        - name: ZAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: zai-api-key
              key: api-key
        - name: TOKEN_COUNTING_ENABLED
          value: "true"
        - name: TOKENIZER_MODEL
          value: "glm-4"
        - name: MAX_WORKERS
          value: "50"
        - name: RATE_LIMIT_INITIAL
          value: "10"
        - name: RATE_LIMIT_MIN
          value: "1"
        - name: RATE_LIMIT_MAX
          value: "50"

Implementation Details

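  • Tokenizer: Uses tiktoken-go with cl100k_base encoding (the OpenAI encoding used by GPT-4 and GPT-3.5 Turbo; counts for GLM or Claude models, which use their own tokenizers, are approximations)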
  • Fallback: If tiktoken initialization fails, falls back to simple word-based approximation
  • Thread-safe: Token counting is mutex-protected for concurrent access
  • Performance: Token counting adds minimal latency (~0.1-1ms per request)
  • Streaming: Supports both streaming (SSE) and non-streaming responses
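The fallback and thread-safety points above can be sketched together in Go. This is an assumption-laden illustration: the type `approxCounter` and its methods are hypothetical, and splitting on whitespace is only one plausible reading of "simple word-based approximation".

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// approxCounter is a hypothetical stand-in for the fallback counter.
// The mutex guards the running total against concurrent updates.
type approxCounter struct {
	mu    sync.Mutex
	total int
}

// Count approximates tokens as whitespace-separated words and adds the
// result to the running total.
func (c *approxCounter) Count(text string) int {
	n := len(strings.Fields(text))
	c.mu.Lock()
	c.total += n
	c.mu.Unlock()
	return n
}

// Total returns the number of tokens counted so far.
func (c *approxCounter) Total() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total
}

// countConcurrently counts each text in its own goroutine, the way
// concurrent request handlers would, then returns the combined total.
func countConcurrently(c *approxCounter, texts []string) int {
	var wg sync.WaitGroup
	for _, text := range texts {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			c.Count(s)
		}(text)
	}
	wg.Wait()
	return c.Total()
}

func main() {
	counter := &approxCounter{}
	fmt.Println(countConcurrently(counter, []string{"hello world", "count these three"}))
}
```

A word count generally undercounts relative to a subword tokenizer, so totals gathered in fallback mode are not directly comparable with tiktoken-based totals.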

See Also