zai-proxy/proxy/evaluation
jedarden dee82a76a3 chore: update module paths and add evaluation package
- proxy/go.mod: github.com/ardenone/zai-proxy → git.ardenone.com/jedarden/zai-proxy
- dashboard/go.mod: github.com/ardenone/ardenone-cluster/containers/zai-proxy-dashboard → git.ardenone.com/jedarden/zai-proxy/dashboard
- Update all Go import paths in proxy/ and dashboard/ to match new module paths
- Add proxy/evaluation/ package (was missing from initial commit)
- Add docs/plan/plan.md with architecture, security model, telemetry design, and migration checklist

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:03:50 -04:00
..
zai_eval chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
.env.example chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
.gitignore chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
evaluator.go chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
evaluator_test.go chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
EXAMPLE_USAGE.md chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
pyproject.toml chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
README.md chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
report.go chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
requirements.txt chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
run_evaluation.py chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00
test_cases.go chore: update module paths and add evaluation package 2026-05-16 16:03:50 -04:00

Z.AI Proxy Evaluation Framework

Tool to compare token counts from z.ai proxy with real Anthropic API responses.

Purpose

The z.ai proxy counts tokens using tiktoken's cl100k_base encoding. This framework validates that the proxy's token counts match the official Anthropic API usage metadata.

Installation

cd /home/coder/ardenone-cluster/containers/zai-proxy/evaluation

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Or install as package
pip install -e .

Configuration

Set up environment variables:

cp .env.example .env
# Edit .env with your API keys

Required variables:

  • ZAI_API_KEY - Your z.ai API key
  • ZAI_PROXY_URL - Proxy URL (default: http://localhost:8080)
  • ANTHROPIC_API_KEY - Your Anthropic API key

Usage

List available test cases

zai-eval list-tests

Run all tests

zai-eval run

Run a specific test

zai-eval run short_simple

Run with output reports

zai-eval run --output ./results --json --markdown

Quick test with custom prompt

zai-eval quick "What is the capital of France?"

Validate endpoints

zai-eval validate

Test Cases

The framework includes 14 diverse test cases:

  1. short_simple - Short simple text
  2. medium_conversation - Medium length conversation
  3. long_context - Long context with detailed information
  4. code_snippet - Request involving code
  5. multi_turn_conversation - Multiple turns of conversation
  6. structured_data - Request with structured data format
  7. mathematical_content - Content with mathematical expressions
  8. multilingual_text - Text with multiple languages
  9. list_heavy_content - Content with many list items
  10. json_only_response - Request expecting JSON response
  11. creative_writing - Creative writing prompt
  12. technical_explanation - Technical concept explanation
  13. empty_system_message - Request with system message
  14. special_characters - Text with many special characters

Metrics

The framework calculates:

  • Accuracy metrics: Percentage of exact matches for input/output/total tokens
  • Mean Absolute Error (MAE): Average token count difference
  • Mean Percentage Error (MPE): Average percentage difference
  • Systematic bias: Consistent over/under-counting patterns
  • Latency comparison: Proxy vs Anthropic API response times

Output

Console Output

Rich-formatted console output with color-coded results:

  • ✓ Green: Exact match
  • ~ Yellow: Close (<5% difference)
  • ✗ Red: Mismatch

JSON Report

{
  "summary": {
    "total_requests": 14,
    "input_token_accuracy": 85.71,
    "output_token_accuracy": 92.86,
    "overall_accuracy": 78.57
  },
  "advanced_metrics": {...},
  "bias_analysis": {...},
  "results": [...]
}

Markdown Report

Human-readable report with tables and summaries.

Architecture

┌─────────────┐
│   CLI       │
└──────┬──────┘
       │
       ↓
┌─────────────────────────────────────┐
│      DualClient                    │
│  ┌────────────┐  ┌──────────────┐ │
│  │ Proxy      │  │ Anthropic    │ │
│  │ Client     │  │ Client       │ │
│  └────────────┘  └──────────────┘ │
└─────────────────────────────────────┘
       │
       ↓
┌─────────────────────────────────────┐
│     EvaluationResult               │
│  • Compare token counts            │
│  • Calculate metrics               │
│  • Detect biases                   │
└─────────────────────────────────────┘
       │
       ↓
┌─────────────────────────────────────┐
│   EvaluationReport                 │
│  • Summary statistics              │
│  • Accuracy metrics                │
│  • Bias analysis                   │
└─────────────────────────────────────┘

Development

Project structure

evaluation/
├── zai_eval/
│   ├── __init__.py
│   ├── cli.py              # CLI interface
│   ├── client.py           # HTTP clients
│   ├── models.py           # Data models
│   ├── test_cases.py       # Test case definitions
│   ├── metrics.py          # Metrics calculation
│   └── report.py           # Report generation
├── requirements.txt
├── pyproject.toml
├── .env.example
└── README.md

Adding new test cases

Edit zai_eval/test_cases.py:

TEST_CASES.append(
    EvaluationRequest(
        name="my_test",
        description="My test description",
        model="claude-3-sonnet-20240229",
        max_tokens=100,
        messages=[...],
    )
)

License

Same as parent project.