zai-proxy/docs/notes/TOKENIZER_CONFIGURATION.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00


# Tokenizer Configuration
This document describes the tokenizer configuration options for the Z.AI proxy.
## Environment Variables
### `TOKEN_COUNTING_ENABLED`
**Default:** `true`
Controls whether the proxy counts tokens.
**Values:**
- `true`, `1`, or unset: Token counting is enabled (default)
- `false` or `0`: Token counting is disabled
**Example:**
```bash
# Disable token counting
export TOKEN_COUNTING_ENABLED=false
# Enable token counting (default)
export TOKEN_COUNTING_ENABLED=true
```
**Behavior:**
- When enabled, the proxy will initialize the tiktoken tokenizer and count tokens for all requests and responses
- When disabled, no tokenizer is initialized and no token metrics are collected
- Disabling can reduce CPU usage and memory footprint if token metrics are not needed
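The value handling described above can be sketched in Go as follows. This is a minimal illustration, not the proxy's actual code: the function name `tokenCountingEnabled` is hypothetical, and the sketch assumes case-sensitive matching and that any value other than `false` or `0` leaves counting enabled, which this document does not specify.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenCountingEnabled interprets TOKEN_COUNTING_ENABLED.
// lookup has the same shape as os.LookupEnv so it can be replaced in tests.
// Assumption: unrecognized values fall through to the enabled default.
func tokenCountingEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup("TOKEN_COUNTING_ENABLED")
	if !ok {
		return true // unset: enabled by default
	}
	switch strings.TrimSpace(v) {
	case "false", "0":
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println("token counting enabled:", tokenCountingEnabled(os.LookupEnv))
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the parsing logic testable without touching the process environment.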
### `TOKENIZER_MODEL`
**Default:** `glm-4`
Specifies the model name to use as a label in Prometheus token metrics.
**Values:** Any string (e.g., `glm-4`, `claude-3`, `gpt-4`)
**Example:**
```bash
# Set model name for metrics
export TOKENIZER_MODEL=glm-4.7
# Use different model name
export TOKENIZER_MODEL=claude-3-sonnet
```
**Behavior:**
- This is purely for Prometheus metrics labeling and does not affect the tokenization algorithm
- The proxy always uses tiktoken's `cl100k_base` encoding regardless of this setting
- Metrics will be tagged with the specified model name: `zai_proxy_tokens_total{direction="input",model="glm-4"}`
- Useful for tracking token usage per model when the proxy handles multiple models
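As a rough sketch of how the label value could be resolved and where it ends up, the following Go snippet reads `TOKENIZER_MODEL` with the documented `glm-4` default and renders the resulting series name. The helper names `tokenizerModel` and `seriesName` are hypothetical, and treating an empty value the same as unset is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// defaultTokenizerModel mirrors the documented default.
const defaultTokenizerModel = "glm-4"

// tokenizerModel returns the metrics label value.
// Assumption: an empty value is treated like an unset variable.
func tokenizerModel(lookup func(string) (string, bool)) string {
	if v, ok := lookup("TOKENIZER_MODEL"); ok && v != "" {
		return v
	}
	return defaultTokenizerModel
}

// seriesName renders the series as it appears in the metrics output.
// The label only tags the series; it does not select a tokenizer.
func seriesName(direction, model string) string {
	return fmt.Sprintf("zai_proxy_tokens_total{direction=%q,model=%q}", direction, model)
}

func main() {
	fmt.Println(seriesName("input", tokenizerModel(os.LookupEnv)))
}
```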
## Startup Log Messages
The proxy logs its tokenizer configuration at startup:
**Token counting enabled (tiktoken):**
```
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
```
**Token counting enabled (fallback mode):**
```
Warning: Failed to initialize TikToken counter: <error>
Falling back to SimpleTokenCounter
Token counting enabled (fallback mode, model: glm-4)
```
**Token counting disabled:**
```
Token counting disabled (TOKEN_COUNTING_ENABLED=false)
```
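The three startup cases can be summarized with a small Go sketch. The function `startupLines` and the example error are hypothetical and exist only to show which condition produces which message; the message texts are copied from the examples above.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample stands in for whatever error tiktoken initialization might return.
var errExample = errors.New("example initialization error")

// startupLines returns the log lines for the given configuration.
// tiktokenErr is nil when the tiktoken counter initialized successfully.
func startupLines(enabled bool, tiktokenErr error, model string) []string {
	if !enabled {
		return []string{"Token counting disabled (TOKEN_COUNTING_ENABLED=false)"}
	}
	if tiktokenErr != nil {
		return []string{
			fmt.Sprintf("Warning: Failed to initialize TikToken counter: %v", tiktokenErr),
			"Falling back to SimpleTokenCounter",
			fmt.Sprintf("Token counting enabled (fallback mode, model: %s)", model),
		}
	}
	return []string{
		fmt.Sprintf("Token counting enabled (tiktoken cl100k_base encoding, model: %s)", model),
	}
}

func main() {
	for _, line := range startupLines(true, errExample, "glm-4") {
		fmt.Println(line)
	}
}
```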
## Prometheus Metrics
When token counting is enabled, the following metrics are exposed:
### `zai_proxy_tokens_total`
**Type:** Counter
**Labels:**
- `direction`: `input` or `output`
- `model`: Value from `TOKENIZER_MODEL` environment variable
**Description:** Total number of tokens processed by direction and model.
**Example:**
```
# HELP zai_proxy_tokens_total Total number of tokens processed
# TYPE zai_proxy_tokens_total counter
zai_proxy_tokens_total{direction="input",model="glm-4"} 15234
zai_proxy_tokens_total{direction="output",model="glm-4"} 8921
```
### `zai_proxy_token_count_duration_seconds`
**Type:** Histogram
**Description:** Duration of token counting operations in seconds.
**Example:**
```
# HELP zai_proxy_token_count_duration_seconds Duration of token counting operations
# TYPE zai_proxy_token_count_duration_seconds histogram
zai_proxy_token_count_duration_seconds_bucket{le="0.0001"} 142
zai_proxy_token_count_duration_seconds_bucket{le="0.0005"} 289
zai_proxy_token_count_duration_seconds_bucket{le="0.001"} 456
...
```
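Prometheus histogram buckets are cumulative: each `le` bucket counts every observation at or below its bound. To see how many operations fell into each interval, subtract the previous bucket. The helper below is an illustrative sketch (the name `perBucket` is hypothetical) applied to the sample values above.

```go
package main

import "fmt"

// perBucket converts cumulative bucket counts into per-interval counts.
// The input must be ordered by ascending "le" bound, so it is non-decreasing.
func perBucket(cumulative []uint64) []uint64 {
	out := make([]uint64, len(cumulative))
	var prev uint64
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// Cumulative counts for le="0.0001", le="0.0005", le="0.001".
	fmt.Println(perBucket([]uint64{142, 289, 456})) // prints [142 147 167]
}
```

In this sample, 142 operations finished within 0.1 ms, another 147 took between 0.1 ms and 0.5 ms, and 167 took between 0.5 ms and 1 ms.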
## Kubernetes Deployment Example
```yaml
# Excerpt: spec.selector and the pod template labels are omitted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zai-proxy
spec:
  template:
    spec:
      containers:
        - name: zai-proxy
          image: zai-proxy:latest
          env:
            - name: ZAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: zai-api-key
                  key: api-key
            - name: TOKEN_COUNTING_ENABLED
              value: "true"
            - name: TOKENIZER_MODEL
              value: "glm-4"
            - name: MAX_WORKERS
              value: "50"
            - name: RATE_LIMIT_INITIAL
              value: "10"
            - name: RATE_LIMIT_MIN
              value: "1"
            - name: RATE_LIMIT_MAX
              value: "50"
```
## Implementation Details
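- **Tokenizer:** Uses tiktoken-go with the `cl100k_base` encoding. This is the OpenAI encoding used by GPT-4 and GPT-3.5 Turbo, so counts for GLM or Claude models are approximations rather than exact values
- **Fallback:** If tiktoken initialization fails, falls back to a simple word-based approximation (`SimpleTokenCounter`)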
- **Thread-safe:** Token counting is mutex-protected for concurrent access
- **Performance:** Token counting adds roughly 0.1 to 1 ms of latency per request
- **Streaming:** Supports both streaming (SSE) and non-streaming responses
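To make the fallback and thread-safety points concrete, here is a minimal sketch of what a mutex-protected, word-based counter could look like. Only the name `SimpleTokenCounter` comes from the startup log; the fields, methods, and the whitespace-splitting rule are assumptions and may differ from the proxy's real implementation.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// SimpleTokenCounter is a hypothetical word-based fallback counter.
// It keeps a running total that is safe to update from many goroutines.
type SimpleTokenCounter struct {
	mu    sync.Mutex
	total int
}

// Count approximates the token count of text as the number of
// whitespace-separated words and adds it to the running total.
func (c *SimpleTokenCounter) Count(text string) int {
	n := len(strings.Fields(text))
	c.mu.Lock()
	c.total += n
	c.mu.Unlock()
	return n
}

// Total returns the running total under the same lock.
func (c *SimpleTokenCounter) Total() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total
}

func main() {
	var c SimpleTokenCounter
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Count("count these four words")
		}()
	}
	wg.Wait()
	fmt.Println(c.Total()) // prints 16
}
```

A word count typically undercounts compared with a subword tokenizer, so metrics collected in fallback mode are best read as a lower-fidelity estimate.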
## See Also
- [RESPONSE_TOKEN_COUNTING.md](../RESPONSE_TOKEN_COUNTING.md) - Token counting workflow
- [TOKEN_COUNTING_WORKFLOW.md](../TOKEN_COUNTING_WORKFLOW.md) - Detailed token counting architecture