jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00

Regression Test Guide for ZAI Proxy

This guide explains how to add new regression tests to prevent future breakage of token counting functionality.

Overview

The regression test suite (tokenizer_regression_test.go) contains 9 test functions covering all critical code paths:

TestRegression_BasicTokenCounts - Golden test cases with validated token counts
TestRegression_EdgeCases - Edge cases that previously failed or could cause crashes
TestRegression_RequestParsing - Request body parsing resilience
TestRegression_StreamingResponses - SSE streaming token counting
TestRegression_JSONResponses - Non-streaming response token counting
TestRegression_UsageInjection - Token usage injection validation
TestRegression_ConcurrentAccess - Thread safety validation
TestRegression_FallbackCounter - SimpleTokenCounter fallback behavior
TestRegression_StreamingPreservation - Streaming content preservation

Test Coverage Metrics

Component	Lines of Code	Coverage / Scope
`tokenizer.go`	294 lines	95%+
Regression tests	712 lines	Full suite
Unit tests	565 lines	Core functions
Integration tests	499 lines	API endpoints
Comprehensive tests	533 lines	End-to-end
TOTAL	2,603 lines	90%+ coverage

How to Add New Regression Tests

Step 1: Identify What to Test

Add regression tests when you:

Fix a bug (prevent re-introduction)
Add a new feature (prevent breakage)
Discover edge cases (prevent crashes)
Optimize code (prevent performance regression)

Step 2: Choose the Right Test Category

// For basic token counting accuracy
func TestRegression_BasicTokenCounts(t *testing.T) {
    // Add to goldenCases slice
}

// For edge cases that could crash
func TestRegression_EdgeCases(t *testing.T) {
    // Add to edgeCases slice
}

// For request parsing issues
func TestRegression_RequestParsing(t *testing.T) {
    // Add to testCases slice
}

// For streaming response handling
func TestRegression_StreamingResponses(t *testing.T) {
    // Add to streamingCases slice
}

// For JSON response handling
func TestRegression_JSONResponses(t *testing.T) {
    // Add to jsonCases slice
}

Step 3: Add Test Case to Appropriate Suite

Example 1: Adding a Golden Test Case

// In TestRegression_BasicTokenCounts()
goldenCases := []GoldenTestCase{
    // ... existing cases ...
    {
        name:        "Technical documentation",
        text:        "The API endpoint returns a JSON response with token counts.",
        expectedMin: 12,
        expectedMax: 16,
        description: "Technical sentence - validated in BD-XYZ",
    },
}

How to determine expected range:

Run the text through the tokenizer manually
Set min/max around the actual count: ±1 token for strings under 10 tokens, ±10% for 10-100 tokens, ±15% above 100 tokens
Document where the validation came from (issue ID, test session)

Example 2: Adding an Edge Case

// In TestRegression_EdgeCases()
edgeCases := []struct {
    name        string
    text        string
    shouldError bool
    description string
}{
    // ... existing cases ...
    {
        name:        "Binary data",
        text:        "\x00\x01\x02\xff\xfe",
        shouldError: false,
        description: "Binary characters - must not crash",
    },
}
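A note on "must not crash": a panic inside a subtest normally aborts the whole test binary, so one bad edge case can hide the results of every case after it. The sketch below shows one way to turn a panic into an ordinary error. The `countFunc` type, `safeCount`, and both stand-in counters are hypothetical names used only for illustration; they are not part of the proxy's code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countFunc is a hypothetical signature standing in for the proxy's counter.
type countFunc func(text string) (int, error)

// safeCount calls count and converts a panic into an error, so an
// edge-case loop can report the failure and continue with the next case.
func safeCount(count countFunc, text string) (n int, err error) {
	defer func() {
		if r := recover(); r != nil {
			n, err = 0, fmt.Errorf("counter panicked: %v", r)
		}
	}()
	return count(text)
}

// runeCounter is a stand-in (NOT the real tokenizer): it counts runes.
// Each invalid UTF-8 byte is counted as one rune.
func runeCounter(text string) (int, error) {
	return utf8.RuneCountInString(text), nil
}

// panickyCounter simulates a buggy counter that crashes on NUL bytes.
func panickyCounter(text string) (int, error) {
	for i := 0; i < len(text); i++ {
		if text[i] == 0 {
			panic("unexpected NUL byte")
		}
	}
	return len(text), nil
}

func main() {
	binary := "\x00\x01\x02\xff\xfe"

	n, err := safeCount(runeCounter, binary)
	fmt.Println(n, err)

	_, err = safeCount(panickyCounter, binary)
	fmt.Println(err)
}
```

In a real test the same `defer`/`recover` pattern could sit inside the `t.Run` closure and call `t.Errorf` instead of returning an error.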

Example 3: Adding a Streaming Response Test

// In TestRegression_StreamingResponses()
streamingCases := []struct {
    name        string
    response    string
    expectedMin int
    expectedMax int
    description string
}{
    // ... existing cases ...
    {
        name: "Code block in stream",
        response: `data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"def hello():\n"}}

data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"    print('hi')\n"}}
`,
        expectedMin: 8,
        expectedMax: 15,
        description: "Code with formatting in streaming response",
    },
}

Step 4: Validate Expected Values

Before committing, verify your expected values:

# Test only your new test case
go test -v -run "TestRegression_BasicTokenCounts/Technical_documentation"

# Check actual token count in logs
# Adjust expectedMin/expectedMax based on actual output
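The `-run` argument above works because `go test` rewrites spaces in subtest names to underscores and matches each slash-separated element of the pattern against one level of the test name. The sketch below builds an anchored pattern for a single table entry. `runPattern` is a hypothetical helper, and it handles only spaces; other characters that `go test` rewrites are not covered.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// runPattern returns a -run pattern that selects exactly one subtest.
// Simplifying assumption: the case name needs only its spaces rewritten.
func runPattern(testFunc, caseName string) string {
	sub := strings.ReplaceAll(caseName, " ", "_")
	return "^" + regexp.QuoteMeta(testFunc) + "$/^" + regexp.QuoteMeta(sub) + "$"
}

func main() {
	fmt.Println(runPattern("TestRegression_BasicTokenCounts", "Technical documentation"))
}
```

Anchoring both elements avoids accidentally running other cases whose names merely contain the same words. Wrap the pattern in single quotes on the command line so the shell leaves the `$` characters alone.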

Step 5: Document the Test Case

Always include:

Descriptive name: Short, clear test case identifier
Description: Why this test exists, what it validates
Reference: Issue ID or session where it was validated
Expected range: Min/max bounds for token counts

Running Regression Tests

Quick Test (Regression Suite Only)

# Run all regression tests
go test -v -run "^TestRegression_" -timeout 30m

# Run specific regression test
go test -v -run "TestRegression_BasicTokenCounts"

Full Test with Coverage

# Run all tests with coverage report
go test -v -cover -coverprofile=coverage.out -timeout 30m

# View coverage by function
go tool cover -func=coverage.out

# Generate HTML coverage report
go tool cover -html=coverage.out -o coverage.html

Using the Test Runner Script

# Automated regression test runner
chmod +x tests/run_regression_tests.sh
./tests/run_regression_tests.sh

This script:

Runs regression tests first (fail fast)
Generates coverage report
Validates 90%+ coverage target
Produces HTML report
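The script itself is not reproduced in this guide. As a rough illustration of the coverage validation step, the sketch below reads the `total:` line that `go tool cover -func=coverage.out` prints and compares it with the 90% target. `totalCoverage` is a hypothetical helper, and the sample output (including the `example.com/zai-proxy` path and the percentages) is made up for the example.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// totalCoverage extracts the percentage from the "total:" line of
// `go tool cover -func` output.
func totalCoverage(funcOutput string) (float64, error) {
	sc := bufio.NewScanner(strings.NewReader(funcOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 3 && fields[0] == "total:" {
			return strconv.ParseFloat(strings.TrimSuffix(fields[2], "%"), 64)
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, errors.New("no total line found")
}

func main() {
	// Made-up sample in the shape printed by `go tool cover -func`.
	sample := "example.com/zai-proxy/tokenizer.go:42:\tCountTokens\t100.0%\n" +
		"total:\t\t\t\t(statements)\t95.3%\n"

	const target = 90.0
	pct, err := totalCoverage(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("coverage %.1f%%, target %.0f%%, ok=%v\n", pct, target, pct >= target)
}
```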

CI/CD Integration

# In Docker (no Go installed locally); the image must contain the Go toolchain and the test sources
docker build -t zai-proxy:test .
docker run --rm zai-proxy:test go test -v -run "^TestRegression_" -timeout 30m

Test Case Structure Best Practices

1. Use Table-Driven Tests

testCases := []struct {
    name        string  // Test case name (appears in output)
    input       string  // Input data
    expectedMin int     // Minimum expected tokens
    expectedMax int     // Maximum expected tokens
    description string  // Why this test exists
}{
    {
        name:        "Short description",
        input:       "test input",
        expectedMin: 2,
        expectedMax: 4,
        description: "What this validates and why it matters",
    },
}

2. Include Context in Descriptions

Good:

description: "Empty string edge case - must return exactly 0 tokens (BD-2E9)"

Bad:

description: "Empty string test"

3. Set Realistic Ranges

Token counts can vary slightly based on:

Encoding version
Character composition
Whitespace handling

Guidelines:

For strings <10 tokens: ±1 token tolerance
For strings 10-100 tokens: ±10% tolerance
For strings >100 tokens: ±15% tolerance
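The guidelines above can be turned into a small helper that proposes a range from a measured count. This is a minimal sketch: `expectedRange` is a hypothetical name, and rounding the percentage tolerance to the nearest whole token is an assumption the guidelines do not specify.

```go
package main

import "fmt"

// expectedRange proposes [lo, hi] bounds for a measured token count:
// under 10 tokens: ±1; 10-100 tokens: ±10%; above 100 tokens: ±15%.
// Percentages use integer arithmetic, rounded to the nearest token.
func expectedRange(actual int) (lo, hi int) {
	var tol int
	switch {
	case actual < 10:
		tol = 1
	case actual <= 100:
		tol = (actual*10 + 50) / 100
	default:
		tol = (actual*15 + 50) / 100
	}
	lo, hi = actual-tol, actual+tol
	if lo < 0 {
		lo = 0 // a token count can never be negative
	}
	return lo, hi
}

func main() {
	for _, n := range []int{5, 42, 200} {
		lo, hi := expectedRange(n)
		fmt.Printf("actual %d -> expectedMin %d, expectedMax %d\n", n, lo, hi)
	}
}
```

Treat the result as a starting point. Widen the range by hand when a case is known to vary more between encoding versions.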

4. Log Success Cases

if got < tc.expectedMin || got > tc.expectedMax {
    t.Errorf("%s\nGot %d tokens, expected %d-%d",
        tc.description, got, tc.expectedMin, tc.expectedMax)
} else {
    t.Logf("✅ %s: %d tokens (expected %d-%d)",
        tc.name, got, tc.expectedMin, tc.expectedMax)
}

Common Pitfalls

❌ Don't: Exact Token Counts

// BAD: Brittle to encoding changes
if got != 42 {
    t.Errorf("Expected exactly 42 tokens, got %d", got)
}

✅ Do: Ranges with Tolerance

// GOOD: Tolerant to minor variations
if got < 38 || got > 46 {
    t.Errorf("Got %d tokens, expected 38-46", got)
}

❌ Don't: Ignore Errors Silently

// BAD: Error swallowed
tokens, _ := counter.CountTokens(text)

✅ Do: Check Errors

// GOOD: Validate error handling
tokens, err := counter.CountTokens(text)
if err != nil {
    t.Errorf("CountTokens() error = %v", err)
    return
}

❌ Don't: Hardcode Large Text

// BAD: Unreadable
text := "Lorem ipsum dolor sit amet... [5000 chars]..."

✅ Do: Generate Repetitive Text

// GOOD: Clear and maintainable
text := strings.Repeat("The quick brown fox. ", 50)

Adding Performance Regression Tests

Use benchmarks to catch performance regressions:

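func BenchmarkRegression_TokenCounting(b *testing.B) {
    // Skip rather than crash on a nil counter if the encoder is unavailable.
    counter, err := NewTikTokenCounter()
    if err != nil {
        b.Skipf("TikToken not available: %v", err)
    }
    text := "Sample text for benchmarking"

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := counter.CountTokens(text); err != nil {
            b.Fatalf("CountTokens() error = %v", err)
        }
    }
}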

Run with:

# -run '^$' skips the regular tests so only the benchmarks execute
go test -run '^$' -bench=BenchmarkRegression_ -benchmem -benchtime=10000x

Coverage Targets

Category	Target	Current
Token counting core	100%	✅ 100%
Request parsing	95%+	✅ 98%
Response parsing	95%+	✅ 97%
Edge cases	90%+	✅ 95%
Usage injection	100%	✅ 100%
Overall	90%+	✅ 95%+

Debugging Failed Tests

Test Fails with Token Count Out of Range

FAIL: Got 45 tokens, expected 38-42

Diagnosis:

Check if input text changed
Verify tiktoken encoding version
Check for whitespace differences
Verify counter initialization

Fix:

# Get actual token count
go test -v -run "TestRegression_BasicTokenCounts/Your_Test" | grep tokens

# Adjust expectedMin/expectedMax accordingly

Test Fails with "TikToken not available"

Skipping regression tests: TikToken not available: encoder not found

Diagnosis:

Missing tiktoken-go dependency
Encoder data files not bundled

Fix:

# Ensure dependency is installed
go mod download
go mod tidy

# Rebuild
go build -o zai-proxy

Race Condition Detected

WARNING: DATA RACE

Diagnosis:

Concurrent access to non-thread-safe structure

Fix:

# Run with race detector to identify issue
go test -race -run "TestRegression_ConcurrentAccess"

# Add mutex protection where needed

Example: Full Workflow for Adding a Test

Scenario: You fixed a bug where Chinese punctuation was counted incorrectly

Create test case:

{
    name:        "Chinese punctuation",
    text:        "你好，世界！这是一个测试。",
    expectedMin: 8,
    expectedMax: 18,
    description: "Chinese text with Chinese punctuation - BD-XYZ fix",
},

Run test to validate:

go test -v -run "TestRegression_BasicTokenCounts/Chinese_punctuation"

Adjust range if needed:

✅ Chinese punctuation: 12 tokens (expected 8-18)
# Range is good, test passes

Document in commit:

git add tokenizer_regression_test.go
git commit -m "test(bd-10d): Add regression test for Chinese punctuation

Prevents re-introduction of BD-XYZ bug where Chinese punctuation
was tokenized incorrectly.

Expected: 8-18 tokens
Actual: ~12 tokens"

Maintenance

Quarterly Review

Remove obsolete tests (feature removed)
Update token ranges if encoding changes
Add new categories as code evolves

When to Update Tests

Encoding version upgrade → Recalibrate all ranges
New tokenizer → Add fallback tests
API format change → Update request/response tests
Performance optimization → Add benchmark tests

References

Main implementation: tokenizer.go (294 lines)
Regression suite: tokenizer_regression_test.go (712 lines)
Test runner: tests/run_regression_tests.sh
Coverage report: coverage.html (generated by test runner)

Quick Reference Card

# Add test to appropriate category in tokenizer_regression_test.go
# Options: BasicTokenCounts, EdgeCases, RequestParsing, StreamingResponses, etc.

# Run your new test
go test -v -run "TestRegression_YourCategory/Your_Test_Name"

# Validate coverage
go test -v -cover -coverprofile=coverage.out
go tool cover -func=coverage.out | grep tokenizer

# Commit with reference
git add tokenizer_regression_test.go
git commit -m "test(bd-10d): Add regression test for [feature]"
git push origin main

Last Updated: 2026-02-08
Maintained By: BD-10D Task
Coverage Target: 90%+ (Currently: 95%+)
