GLM-4 Tokenizer Libraries for Go - Research Findings

Research Date: 2026-02-08
Bead ID: bd-dv2
Purpose: Identify suitable Go libraries for GLM-4 tokenization for z.ai proxy integration

Executive Summary

GLM-4 uses a custom tokenizer distributed in HuggingFace Tokenizers format, with a vocabulary of ~155,000 tokens. The vocabulary merges byte-level BPE tokens for Chinese and other multilingual text with tiktoken's cl100k_base vocabulary.

Recommended Solution: Use github.com/daulet/tokenizers and load the pretrained tokenizer from the HuggingFace Hub (zai-org/GLM-4.7-Flash)

GLM-4 Tokenizer Specifications

Model Information

  • Model Repository: zai-org/GLM-4.7-Flash on HuggingFace
  • Tokenizer Class: PreTrainedTokenizer (HuggingFace format)
  • Vocabulary Size: ~155,000 tokens (154,820 base + 36 special tokens = 154,856 total)
  • Max Context Length: 128,000 tokens
  • Encoding Method: Byte-level BPE (Byte Pair Encoding)
  • Special Tokens: 36 special tokens including <|endoftext|>, [MASK], <|system|>, <|user|>, <|assistant|>, etc.
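
The figures above translate into constants a proxy can check requests against. The following is a minimal sketch, assuming the 36 special tokens occupy the contiguous ID range directly after the base vocabulary (consistent with the EOS ID of 154820 listed in the appendix, but not verified against the tokenizer file). The names `isSpecialID` and `fitsContext` are hypothetical helpers, not part of any library.

```go
package main

import "fmt"

const (
	baseVocabSize    = 154820 // regular BPE tokens, assumed IDs 0..154819
	specialTokens    = 36     // special tokens, assumed to follow the base vocabulary
	totalVocabSize   = baseVocabSize + specialTokens
	maxContextTokens = 128000
)

// isSpecialID reports whether id falls in the assumed special-token range.
func isSpecialID(id uint32) bool {
	return id >= baseVocabSize && id < totalVocabSize
}

// fitsContext reports whether a prompt plus the requested completion budget
// stays within the model's context window.
func fitsContext(promptTokens, maxCompletionTokens int) bool {
	return promptTokens >= 0 && maxCompletionTokens >= 0 &&
		promptTokens+maxCompletionTokens <= maxContextTokens
}

func main() {
	fmt.Println("total vocabulary:", totalVocabSize)
	fmt.Println("154820 is special:", isSpecialID(154820))
	fmt.Println("127000 + 2000 fits:", fitsContext(127000, 2000))
}
```

A check like `fitsContext` lets the proxy reject an oversized request before forwarding it upstream.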

Tokenizer Configuration

{
  "tokenizer_class": "PreTrainedTokenizer",
  "model_max_length": 128000,
  "clean_up_tokenization_spaces": false,
  "do_lower_case": false,
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "padding_side": "left"
}
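
If the proxy needs these settings at runtime, the configuration can be decoded with the standard library. This is a sketch only: the struct and `parseTokenizerConfig` are hypothetical names, and it covers just the fields shown above. Published config files may carry additional fields (which `encoding/json` ignores) or use other value shapes that this struct would not accept.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TokenizerConfig mirrors the fields shown in the configuration above.
type TokenizerConfig struct {
	TokenizerClass string `json:"tokenizer_class"`
	ModelMaxLength int    `json:"model_max_length"`
	CleanUpSpaces  bool   `json:"clean_up_tokenization_spaces"`
	DoLowerCase    bool   `json:"do_lower_case"`
	EOSToken       string `json:"eos_token"`
	PadToken       string `json:"pad_token"`
	PaddingSide    string `json:"padding_side"`
}

// parseTokenizerConfig decodes the JSON and rejects a non-positive context length.
func parseTokenizerConfig(data []byte) (TokenizerConfig, error) {
	var cfg TokenizerConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return TokenizerConfig{}, fmt.Errorf("parse tokenizer config: %w", err)
	}
	if cfg.ModelMaxLength <= 0 {
		return TokenizerConfig{}, fmt.Errorf("model_max_length must be positive, got %d", cfg.ModelMaxLength)
	}
	return cfg, nil
}

// sampleConfig repeats the configuration from this document.
const sampleConfig = `{
  "tokenizer_class": "PreTrainedTokenizer",
  "model_max_length": 128000,
  "clean_up_tokenization_spaces": false,
  "do_lower_case": false,
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "padding_side": "left"
}`

func main() {
	cfg, err := parseTokenizerConfig([]byte(sampleConfig))
	if err != nil {
		panic(err)
	}
	fmt.Printf("max length %d, EOS %q, padding %s\n", cfg.ModelMaxLength, cfg.EOSToken, cfg.PaddingSide)
}
```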

Evaluated Libraries

1. daulet/tokenizers RECOMMENDED

Repository: https://github.com/daulet/tokenizers
Status: Active, well-maintained
License: Apache 2.0

Pros

  • Go bindings for HuggingFace Tokenizers (Rust-based)
  • Supports FromPretrained() to load models from HuggingFace Hub
  • Can load GLM-4.7 tokenizer: tokenizers.FromPretrained("zai-org/GLM-4.7-Flash")
  • High performance (uses native Rust implementation)
  • Supports encoding with options (attention mask, type IDs, offsets, etc.)
  • Prebuilt binaries for linux-amd64, darwin-arm64, linux-arm64
  • Actively maintained with recent releases

Cons

  • ⚠️ Requires CGO (needs libtokenizers.a static library)
  • ⚠️ More complex build process compared to pure Go libraries
  • ⚠️ About 3x slower than tiktoken on OpenAI models (tiktoken cannot tokenize GLM-4, so this remains the only viable option)

Installation

go get github.com/daulet/tokenizers

Usage Example

package main

import (
    "fmt"
    "github.com/daulet/tokenizers"
)

func main() {
    // Load GLM-4.7 tokenizer from HuggingFace
    tk, err := tokenizers.FromPretrained("zai-org/GLM-4.7-Flash")
    if err != nil {
        panic(err)
    }
    defer tk.Close()

    // Get vocabulary size
    fmt.Println("Vocab size:", tk.VocabSize())
    // Output: Vocab size: 154856

    // Encode text
    ids, tokens := tk.Encode("你好世界Hello, world!", false)
    fmt.Println("Token IDs:", ids)
    fmt.Println("Tokens:", tokens)

    // Encode with special tokens
    idsWithSpecial, tokensWithSpecial := tk.Encode("你好,世界!", true)
    fmt.Println("With special tokens:", idsWithSpecial, tokensWithSpecial)

    // Decode tokens
    text := tk.Decode(ids, true)
    fmt.Println("Decoded:", text)

    // Advanced encoding with options
    encOpts := []tokenizers.EncodeOption{
        tokenizers.WithReturnTypeIDs(),
        tokenizers.WithReturnAttentionMask(),
        tokenizers.WithReturnTokens(),
        tokenizers.WithReturnOffsets(),
    }
    encoding := tk.EncodeWithOptions("Sample text", false, encOpts...)
    fmt.Println("IDs:", encoding.IDs)
    fmt.Println("Attention Mask:", encoding.AttentionMask)
    fmt.Println("Tokens:", encoding.Tokens)
    fmt.Println("Offsets:", encoding.Offsets)
}

Performance Benchmarks

BenchmarkEncodeNTimes-10        133966      10456 ns/op    256 B/op    12 allocs/op
BenchmarkDecodeNTimes-10        817164       1489 ns/op     64 B/op     2 allocs/op
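
The ns/op figures are easier to weigh against proxy traffic when expressed as throughput. The sketch below does that conversion for a single core; `opsPerSecond` is a hypothetical helper. The source does not state what input the benchmark encodes, so the results are order-of-magnitude guidance rather than a capacity estimate.

```go
package main

import "fmt"

// opsPerSecond converts a `go test -bench` ns/op figure into operations
// per second for a single goroutine.
func opsPerSecond(nsPerOp float64) float64 {
	return 1e9 / nsPerOp
}

func main() {
	// ns/op values copied from the benchmark output above.
	fmt.Printf("encode: %.1f µs per call, %.0f calls/s\n", 10456/1000.0, opsPerSecond(10456))
	fmt.Printf("decode: %.1f µs per call, %.0f calls/s\n", 1489/1000.0, opsPerSecond(1489))
}
```

At roughly 10.5 µs per encode, tokenization of a benchmark-sized input adds little to the latency of an upstream LLM call.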

Build Requirements

# Option 1: Use prebuilt binaries (recommended)
# Download from https://github.com/daulet/tokenizers/releases
# Extract libtokenizers.a to project directory

# Option 2: Build from source (requires Rust toolchain)
make build  # Builds libtokenizers.a

# Set CGO flags
export CGO_LDFLAGS="-L./path/to/libtokenizers/directory"

2. gomlx/tokenizers DEPRECATED

Repository: https://github.com/gomlx/tokenizers
Status: DEPRECATED - Moved to github.com/gomlx/go-huggingface

Notes

  • Marked as "UNDER CONSTRUCTION" and "NOT FUNCTIONAL YET"
  • Deprecated in favor of integrated solution in go-huggingface
  • Not suitable for production use

3. tiktoken-go / go-tiktoken NOT COMPATIBLE

Repositories:

Status: Active, but NOT compatible with GLM-4

Why Not Compatible

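  • tiktoken ships only OpenAI encodings (cl100k_base, o200k_base, and older ones), which serve GPT-3.5, GPT-4, and related models
  • GLM-4's vocabulary merges cl100k_base with its own byte-level BPE tokens, so neither the vocabulary nor the resulting token IDs match any tiktoken encoding
  • Counting GLM-4 text with a tiktoken encoding therefore produces incorrect token counts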

Supported Models and Encodings (OpenAI only)

  • Models: gpt-3.5-turbo, gpt-4, gpt-4-turbo
  • Encodings: cl100k_base, p50k_base, r50k_base, o200k_base

4. ChatGLM API Wrappers (Not Tokenizer Libraries)

Repositories:

Status: API wrappers only, no tokenizer functionality

Notes

  • These are REST API clients for ChatGLM service
  • Do NOT provide local tokenization capabilities
  • Not suitable for token counting in proxy middleware

Implementation Recommendation

package tokenizer

import (
    "fmt"
    "sync"
    "github.com/daulet/tokenizers"
)

// GLM4Tokenizer wraps the HuggingFace tokenizer for GLM-4 models
type GLM4Tokenizer struct {
    tk   *tokenizers.Tokenizer
    mu   sync.RWMutex
    name string
}

// NewGLM4Tokenizer creates a tokenizer for GLM-4.7 models
func NewGLM4Tokenizer() (*GLM4Tokenizer, error) {
    tk, err := tokenizers.FromPretrained("zai-org/GLM-4.7-Flash")
    if err != nil {
        return nil, fmt.Errorf("failed to load GLM-4.7 tokenizer: %w", err)
    }

    return &GLM4Tokenizer{
        tk:   tk,
        name: "GLM-4.7-Flash",
    }, nil
}

// Close releases native resources
func (t *GLM4Tokenizer) Close() {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.tk != nil {
        t.tk.Close()
        t.tk = nil
    }
}

// CountTokens returns the number of tokens in the text
func (t *GLM4Tokenizer) CountTokens(text string) int {
    t.mu.RLock()
    defer t.mu.RUnlock()

    ids, _ := t.tk.Encode(text, false)
    return len(ids)
}

// CountTokensWithSpecial includes special tokens in the count
func (t *GLM4Tokenizer) CountTokensWithSpecial(text string) int {
    t.mu.RLock()
    defer t.mu.RUnlock()

    ids, _ := t.tk.Encode(text, true)
    return len(ids)
}

// Encode returns token IDs and token strings
func (t *GLM4Tokenizer) Encode(text string, addSpecialTokens bool) ([]uint32, []string) {
    t.mu.RLock()
    defer t.mu.RUnlock()

    return t.tk.Encode(text, addSpecialTokens)
}

// Decode converts token IDs back to text
func (t *GLM4Tokenizer) Decode(tokenIDs []uint32) string {
    t.mu.RLock()
    defer t.mu.RUnlock()

    return t.tk.Decode(tokenIDs, true)
}

// VocabSize returns the tokenizer vocabulary size
// (daulet/tokenizers reports it as uint32)
func (t *GLM4Tokenizer) VocabSize() uint32 {
    t.mu.RLock()
    defer t.mu.RUnlock()

    return t.tk.VocabSize()
}

Integration with zai-proxy

package proxy

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// ChatRequest represents an OpenAI-compatible chat request
type ChatRequest struct {
    Model    string         `json:"model"`
    Messages []ChatMessage  `json:"messages"`
    Stream   bool           `json:"stream"`
}

type ChatMessage struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

// countRequestTokens estimates the prompt token count of a chat request.
// NewGLM4Tokenizer is the constructor from the tokenizer package above;
// import and qualify it according to the module layout.
func (p *Proxy) countRequestTokens(req *ChatRequest) (int, error) {
    // Loading the tokenizer on every request is wasteful; production code
    // should build it once at startup and share it across requests.
    tk, err := NewGLM4Tokenizer()
    if err != nil {
        return 0, err
    }
    defer tk.Close()

    totalTokens := 0

    // Count tokens for each message
    for _, msg := range req.Messages {
        // Format: <|role|>\nContent\n
        formatted := fmt.Sprintf("<|%s|>\n%s\n", msg.Role, msg.Content)
        totalTokens += tk.CountTokens(formatted)
    }

    // Rough allowance for the <|assistant|> response priming; this is an
    // estimate, not an exact count.
    totalTokens += 3

    return totalTokens, nil
}
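
The code comment above recommends caching the tokenizer globally. One way to do that is a lazily initialized shared instance guarded by `sync.Once`, sketched below. `TokenCounter`, `lazyCounter`, and `runeCounter` are hypothetical names. `runeCounter` only stands in for the real tokenizer so the sketch runs without CGO, and its counts are not GLM-4 token counts.

```go
package main

import (
	"fmt"
	"sync"
)

// TokenCounter is a hypothetical interface; a wrapper exposing
// CountTokens(string) int, such as GLM4Tokenizer, would satisfy it.
type TokenCounter interface {
	CountTokens(text string) int
}

// lazyCounter loads a TokenCounter at most once and shares it across requests.
type lazyCounter struct {
	load func() (TokenCounter, error)
	once sync.Once
	tc   TokenCounter
	err  error
}

// get returns the shared counter, running load on the first call only.
func (l *lazyCounter) get() (TokenCounter, error) {
	l.once.Do(func() { l.tc, l.err = l.load() })
	return l.tc, l.err
}

// runeCounter is a placeholder that counts runes, used only for this sketch.
type runeCounter struct{}

func (runeCounter) CountTokens(text string) int { return len([]rune(text)) }

func main() {
	loads := 0
	lc := &lazyCounter{load: func() (TokenCounter, error) {
		loads++ // safe: sync.Once runs this function exactly once
		return runeCounter{}, nil
	}}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := lc.get(); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("loader ran", loads, "time(s)")
}
```

`sync.Once` also caches a failed load, so an error from the first call is returned forever. A proxy that should retry after a transient download failure would need a mutex-based variant instead.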

Testing & Validation

Test Plan

  1. Accuracy Test: Compare token counts with official GLM-4 API
  2. Performance Test: Measure encoding/decoding latency
  3. Edge Cases: Test with multilingual text, special characters, empty strings
  4. Memory Test: Check for memory leaks in long-running processes
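
The accuracy test in step 1 can be reduced to comparing local counts with the prompt token counts the upstream API reports for the same requests. The sketch below shows that comparison with a tolerance. The `sample` type and `mismatches` function are hypothetical, and the numbers in `main` are made-up placeholders, not measurements.

```go
package main

import "fmt"

// sample pairs a locally computed token count with the count reported by
// the upstream API for the same input.
type sample struct {
	name     string
	local    int
	reported int
}

// mismatches returns the samples whose local count deviates from the
// reported count by more than tolerance tokens.
func mismatches(samples []sample, tolerance int) []sample {
	var out []sample
	for _, s := range samples {
		diff := s.local - s.reported
		if diff < 0 {
			diff = -diff
		}
		if diff > tolerance {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Placeholder values for illustration only; real runs would fill these
	// from the tokenizer and from API usage fields.
	samples := []sample{
		{"english", 12, 12},
		{"chinese", 9, 10},
		{"empty", 0, 0},
		{"emoji", 20, 27},
	}
	for _, s := range mismatches(samples, 2) {
		fmt.Printf("%s: local=%d reported=%d\n", s.name, s.local, s.reported)
	}
}
```

A small nonzero tolerance is a design choice: the API may add chat-template tokens that the local count only approximates.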

Validation Methodology

# Download test corpus
curl -o test_corpus.txt https://huggingface.co/datasets/wikitext/resolve/main/wikitext-2-raw-v1/test.txt

# Run token counting tests
go test -v ./tokenizer -run TestGLM4TokenCount

# Performance benchmarks
go test -bench=. ./tokenizer -benchmem -benchtime=10s

Dependencies & Build Setup

go.mod Dependencies

require (
    github.com/daulet/tokenizers v1.0.0 // Check latest version
)

Docker Build Configuration

FROM golang:1.22-alpine AS builder

# Install build dependencies
RUN apk add --no-cache gcc musl-dev

# Download prebuilt libtokenizers.a
ADD https://github.com/daulet/tokenizers/releases/download/v1.0.0/libtokenizers-linux-amd64.a /lib/libtokenizers.a

# Set CGO flags
ENV CGO_ENABLED=1
ENV CGO_LDFLAGS="-L/lib"

# Build application
COPY . /app
WORKDIR /app
RUN go build -o zai-proxy .

FROM alpine:latest
COPY --from=builder /app/zai-proxy /usr/local/bin/
CMD ["/usr/local/bin/zai-proxy"]

Alternatives Considered & Rejected

| Library | Reason for Rejection |
|---|---|
| tiktoken-go | Only supports OpenAI models, incompatible with GLM-4 |
| go-tiktoken | Same as above, fork of tiktoken-go |
| gomlx/tokenizers | Deprecated, not functional |
| ChatGLM API wrappers | Only API clients, no local tokenization |
| Manual BPE implementation | Too complex, error-prone, slower than Rust bindings |

Next Steps

  1. ✅ Complete research and document findings
  2. ⏭️ Implement GLM4Tokenizer wrapper (bd-dv2 follow-up)
  3. ⏭️ Write unit tests for token counting accuracy
  4. ⏭️ Benchmark performance vs. Python tokenizer
  5. ⏭️ Integrate into zai-proxy middleware
  6. ⏭️ Deploy and validate with production traffic

Appendix: GLM-4 Special Tokens

<|endoftext|>           # EOS token (ID: 154820)
[MASK]                  # Mask token
[gMASK]                 # Global mask
[sMASK]                 # Sentence mask
<sop>                   # Start of passage
<eop>                   # End of passage
<|system|>              # System message role
<|user|>                # User message role
<|assistant|>           # Assistant message role
<|observation|>         # Observation/tool result
<|begin_of_image|>      # Image boundary
<|end_of_image|>
<|begin_of_video|>      # Video boundary
<|end_of_video|>
<|begin_of_audio|>      # Audio boundary
<|end_of_audio|>
<tool_call>             # Function calling
</tool_call>
<tool_response>
</tool_response>
<think>                 # Chain-of-thought
</think>
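
Only four role tokens appear in this list, so formatting an arbitrary request role as `<|role|>` can yield a string that is not a special token and gets split into ordinary tokens. The sketch below maps OpenAI-style roles to the listed tokens explicitly and rejects unknown roles. Mapping the `tool` role to `<|observation|>` is an assumption, not something the source confirms, and `formatMessage` is a hypothetical helper.

```go
package main

import "fmt"

// roleTokens maps OpenAI-style roles to role tokens from the appendix.
// The "tool" entry is an assumption made for this sketch.
var roleTokens = map[string]string{
	"system":    "<|system|>",
	"user":      "<|user|>",
	"assistant": "<|assistant|>",
	"tool":      "<|observation|>",
}

// formatMessage renders one message as "<|role|>\ncontent\n" and returns an
// error for roles that have no corresponding token.
func formatMessage(role, content string) (string, error) {
	tok, ok := roleTokens[role]
	if !ok {
		return "", fmt.Errorf("no GLM-4 role token for role %q", role)
	}
	return tok + "\n" + content + "\n", nil
}

func main() {
	for _, role := range []string{"user", "tool", "function"} {
		out, err := formatMessage(role, "hello")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q\n", out)
	}
}
```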

Research Completed: 2026-02-08
Confidence Level: High - daulet/tokenizers is production-ready for GLM-4 tokenization