Tokenizer Configuration
This document describes the tokenizer configuration options for the Z.AI proxy.
Environment Variables
TOKEN_COUNTING_ENABLED
Default: true
Controls whether token counting is enabled or disabled.
Values:
- true or 1 or unset: Token counting is enabled (default)
- false or 0: Token counting is disabled
Example:
# Disable token counting
export TOKEN_COUNTING_ENABLED=false
# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true
Behavior:
- When enabled, the proxy will initialize the tiktoken tokenizer and count tokens for all requests and responses
- When disabled, no tokenizer is initialized and no token metrics are collected
- Disabling can reduce CPU usage and memory footprint if token metrics are not needed
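The value rules above can be sketched in Go as follows. This is an illustration, not the proxy's actual parsing code: the function name tokenCountingEnabled is hypothetical, and the handling of unrecognized values (treated as enabled) and of letter case is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenCountingEnabled interprets a raw TOKEN_COUNTING_ENABLED value.
// Documented behavior: "true", "1", or unset enable counting; "false" or "0" disable it.
// Assumption: any other value falls back to the default (enabled) and matching
// ignores case and surrounding whitespace. The real proxy may behave differently.
func tokenCountingEnabled(raw string) bool {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "false", "0":
		return false
	default:
		return true
	}
}

func main() {
	// os.Getenv returns "" for an unset variable, which maps to the default.
	enabled := tokenCountingEnabled(os.Getenv("TOKEN_COUNTING_ENABLED"))
	fmt.Println("token counting enabled:", enabled)
}
```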
TOKENIZER_MODEL
Default: glm-4
Specifies the model name to use as a label in Prometheus token metrics.
Values: Any string (e.g., glm-4, claude-3, gpt-4)
Example:
# Set model name for metrics
export TOKENIZER_MODEL=glm-4.7
# Use different model name
export TOKENIZER_MODEL=claude-3-sonnet
Behavior:
- This is purely for Prometheus metrics labeling and does not affect the tokenization algorithm
- The proxy always uses tiktoken's cl100k_base encoding regardless of this setting
- Metrics will be tagged with the specified model name: zai_proxy_tokens_total{direction="input",model="glm-4"}
- Useful for tracking token usage per model when the proxy handles multiple models
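To make the labeling behavior concrete, here is a minimal Go sketch of how the configured name could end up in the series name. The helpers modelLabel and seriesName are hypothetical and use only the standard library. The proxy itself presumably registers the metric through a Prometheus client rather than formatting strings.

```go
package main

import (
	"fmt"
	"os"
)

// defaultModel is the documented default for TOKENIZER_MODEL.
const defaultModel = "glm-4"

// modelLabel returns the label value: the configured name, or the default when empty.
func modelLabel(raw string) string {
	if raw == "" {
		return defaultModel
	}
	return raw
}

// seriesName renders a series the way it appears in the Prometheus text format.
// Only the label value changes with TOKENIZER_MODEL; the tokenization does not.
func seriesName(direction, model string) string {
	return fmt.Sprintf("zai_proxy_tokens_total{direction=%q,model=%q}", direction, model)
}

func main() {
	model := modelLabel(os.Getenv("TOKENIZER_MODEL"))
	fmt.Println(seriesName("input", model))
	fmt.Println(seriesName("output", model))
}
```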
Startup Log Messages
The proxy logs its tokenizer configuration at startup:
Token counting enabled (tiktoken):
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
Token counting enabled (fallback mode):
Warning: Failed to initialize TikToken counter: <error>
Falling back to SimpleTokenCounter
Token counting enabled (fallback mode, model: glm-4)
Token counting disabled:
Token counting disabled (TOKEN_COUNTING_ENABLED=false)
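The three startup cases suggest selection logic along the following lines. This is a sketch under stated assumptions: the TokenCounter interface, the wordCounter stand-in, the selectCounter function, and the injected constructor are all hypothetical names, and only the log strings are taken from the messages above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TokenCounter is a hypothetical interface shared by both counters.
type TokenCounter interface {
	Count(text string) int
}

// wordCounter stands in for the fallback SimpleTokenCounter (assumed to count words).
type wordCounter struct{}

func (wordCounter) Count(text string) int { return len(strings.Fields(text)) }

// failingInit simulates a tiktoken initialization failure.
func failingInit() (TokenCounter, error) {
	return nil, errors.New("encoding unavailable")
}

// selectCounter mirrors the three startup cases and returns the chosen counter
// together with the log lines to emit. newTikToken is a hypothetical constructor.
func selectCounter(enabled bool, model string, newTikToken func() (TokenCounter, error)) (TokenCounter, []string) {
	if !enabled {
		return nil, []string{"Token counting disabled (TOKEN_COUNTING_ENABLED=false)"}
	}
	c, err := newTikToken()
	if err != nil {
		return wordCounter{}, []string{
			fmt.Sprintf("Warning: Failed to initialize TikToken counter: %v", err),
			"Falling back to SimpleTokenCounter",
			fmt.Sprintf("Token counting enabled (fallback mode, model: %s)", model),
		}
	}
	return c, []string{fmt.Sprintf("Token counting enabled (tiktoken cl100k_base encoding, model: %s)", model)}
}

func main() {
	// Exercise the fallback path with the simulated failure.
	counter, logs := selectCounter(true, "glm-4", failingInit)
	for _, line := range logs {
		fmt.Println(line)
	}
	fmt.Println("approximate tokens:", counter.Count("hello from the fallback counter"))
}
```

Passing the constructor in as a function keeps the fallback path easy to exercise in tests without a real tokenizer.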
Prometheus Metrics
When token counting is enabled, the following metrics are exposed:
zai_proxy_tokens_total
Type: Counter
Labels:
- direction: input or output
- model: Value from the TOKENIZER_MODEL environment variable
Description: Total number of tokens processed by direction and model.
Example:
# HELP zai_proxy_tokens_total Total number of tokens processed
# TYPE zai_proxy_tokens_total counter
zai_proxy_tokens_total{direction="input",model="glm-4"} 15234
zai_proxy_tokens_total{direction="output",model="glm-4"} 8921
zai_proxy_token_count_duration_seconds
Type: Histogram
Description: Duration of token counting operations in seconds.
Example:
# HELP zai_proxy_token_count_duration_seconds Duration of token counting operations
# TYPE zai_proxy_token_count_duration_seconds histogram
zai_proxy_token_count_duration_seconds_bucket{le="0.0001"} 142
zai_proxy_token_count_duration_seconds_bucket{le="0.0005"} 289
zai_proxy_token_count_duration_seconds_bucket{le="0.001"} 456
...
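Prometheus histogram buckets are cumulative: each le bucket counts every observation less than or equal to its bound, which is why the values in the example only grow. The Go sketch below shows that bookkeeping with the standard library only. The function cumulativeBuckets and the sample durations are made up for illustration and are not taken from the proxy.

```go
package main

import "fmt"

// cumulativeBuckets returns, for each upper bound, how many observations are
// less than or equal to that bound. This mirrors the semantics of the le label.
func cumulativeBuckets(bounds, observations []float64) []int {
	counts := make([]int, len(bounds))
	for _, obs := range observations {
		for i, b := range bounds {
			if obs <= b {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	// Bucket bounds from the example above; durations are invented timings in seconds.
	bounds := []float64{0.0001, 0.0005, 0.001}
	durations := []float64{0.00005, 0.0003, 0.0004, 0.0008}
	for i, c := range cumulativeBuckets(bounds, durations) {
		fmt.Printf("zai_proxy_token_count_duration_seconds_bucket{le=\"%g\"} %d\n", bounds[i], c)
	}
}
```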
Kubernetes Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
spec:
  template:
    spec:
      containers:
        - name: zai-proxy
          image: zai-proxy:latest
          env:
            - name: ZAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: zai-api-key
                  key: api-key
            - name: TOKEN_COUNTING_ENABLED
              value: "true"
            - name: TOKENIZER_MODEL
              value: "glm-4"
            - name: MAX_WORKERS
              value: "50"
            - name: RATE_LIMIT_INITIAL
              value: "10"
            - name: RATE_LIMIT_MIN
              value: "1"
            - name: RATE_LIMIT_MAX
              value: "50"
Implementation Details
- Tokenizer: Uses tiktoken-go with cl100k_base encoding (the OpenAI encoding used by GPT-4 and GPT-3.5; counts for GLM or Claude models are approximations)
- Fallback: If tiktoken initialization fails, falls back to simple word-based approximation
- Thread-safe: Token counting is mutex-protected for concurrent access
- Performance: Token counting adds minimal latency (~0.1-1ms per request)
- Streaming: Supports both streaming (SSE) and non-streaming responses
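The fallback and thread-safety points can be illustrated with a small Go sketch. The type below reuses the SimpleTokenCounter name from the startup log, but its fields, its methods, and the whitespace-splitting heuristic are assumptions. The proxy's real approximation and locking may differ.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// SimpleTokenCounter approximates token counts by splitting on whitespace.
// Assumption: the proxy's real fallback heuristic may differ.
type SimpleTokenCounter struct {
	mu    sync.Mutex
	total int // running total across all calls
}

// Count returns the approximate token count for text and adds it to the total.
func (c *SimpleTokenCounter) Count(text string) int {
	n := len(strings.Fields(text))
	c.mu.Lock()
	c.total += n
	c.mu.Unlock()
	return n
}

// Total returns the running total under the same lock.
func (c *SimpleTokenCounter) Total() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total
}

func main() {
	counter := &SimpleTokenCounter{}
	var wg sync.WaitGroup
	// Simulate concurrent requests sharing one counter.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter.Count("one two three")
		}()
	}
	wg.Wait()
	fmt.Println("total approximate tokens:", counter.Total())
}
```

The lock is held only while updating the shared total, so the string splitting itself runs in parallel across requests.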
See Also
- RESPONSE_TOKEN_COUNTING.md - Token counting workflow
- TOKEN_COUNTING_WORKFLOW.md - Detailed token counting architecture