History

jedarden dee82a76a3 chore: update module paths and add evaluation package - proxy/go.mod: github.com/ardenone/zai-proxy → git.ardenone.com/jedarden/zai-proxy - dashboard/go.mod: github.com/ardenone/ardenone-cluster/containers/zai-proxy-dashboard → git.ardenone.com/jedarden/zai-proxy/dashboard - Update all Go import paths in proxy/ and dashboard/ to match new module paths - Add proxy/evaluation/ package (was missing from initial commit) - Add docs/plan/plan.md with architecture, security model, telemetry design, and migration checklist Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>		2026-05-16 16:03:50 -04:00
..
zai_eval	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
.env.example	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
.gitignore	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
evaluator.go	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
evaluator_test.go	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
EXAMPLE_USAGE.md	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
pyproject.toml	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
README.md	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
report.go	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
requirements.txt	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
run_evaluation.py	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00
test_cases.go	chore: update module paths and add evaluation package	2026-05-16 16:03:50 -04:00

README.md

Z.AI Proxy Evaluation Framework

Tool to compare token counts from z.ai proxy with real Anthropic API responses.

Purpose

The z.ai proxy counts tokens using tiktoken's cl100k_base encoding. This framework validates that the proxy's token counts match the official Anthropic API usage metadata.

Installation

cd /home/coder/ardenone-cluster/containers/zai-proxy/evaluation

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Or install as package
pip install -e .

Configuration

Set up environment variables:

cp .env.example .env
# Edit .env with your API keys

Required variables:

ZAI_API_KEY - Your z.ai API key
ZAI_PROXY_URL - Proxy URL (default: http://localhost:8080)
ANTHROPIC_API_KEY - Your Anthropic API key

Usage

List available test cases

zai-eval list-tests

Run all tests

zai-eval run

Run a specific test

zai-eval run short_simple

Run with output reports

zai-eval run --output ./results --json --markdown

Quick test with custom prompt

zai-eval quick "What is the capital of France?"

Validate endpoints

zai-eval validate

Test Cases

The framework includes 14 diverse test cases:

short_simple - Short simple text
medium_conversation - Medium length conversation
long_context - Long context with detailed information
code_snippet - Request involving code
multi_turn_conversation - Multiple turns of conversation
structured_data - Request with structured data format
mathematical_content - Content with mathematical expressions
multilingual_text - Text with multiple languages
list_heavy_content - Content with many list items
json_only_response - Request expecting JSON response
creative_writing - Creative writing prompt
technical_explanation - Technical concept explanation
empty_system_message - Request with system message
special_characters - Text with many special characters

Metrics

The framework calculates:

Accuracy metrics: Percentage of exact matches for input/output/total tokens
Mean Absolute Error (MAE): Average token count difference
Mean Percentage Error (MPE): Average percentage difference
Systematic bias: Consistent over/under-counting patterns
Latency comparison: Proxy vs Anthropic API response times

Output

Console Output

Rich-formatted console output with color-coded results:

✓ Green: Exact match
~ Yellow: Close (<5% difference)
✗ Red: Mismatch

JSON Report

{
  "summary": {
    "total_requests": 14,
    "input_token_accuracy": 85.71,
    "output_token_accuracy": 92.86,
    "overall_accuracy": 78.57
  },
  "advanced_metrics": {...},
  "bias_analysis": {...},
  "results": [...]
}

Markdown Report

Human-readable report with tables and summaries.

Architecture

┌─────────────┐
│   CLI       │
└──────┬──────┘
       │
       ↓
┌─────────────────────────────────────┐
│      DualClient                    │
│  ┌────────────┐  ┌──────────────┐ │
│  │ Proxy      │  │ Anthropic    │ │
│  │ Client     │  │ Client       │ │
│  └────────────┘  └──────────────┘ │
└─────────────────────────────────────┘
       │
       ↓
┌─────────────────────────────────────┐
│     EvaluationResult               │
│  • Compare token counts            │
│  • Calculate metrics               │
│  • Detect biases                   │
└─────────────────────────────────────┘
       │
       ↓
┌─────────────────────────────────────┐
│   EvaluationReport                 │
│  • Summary statistics              │
│  • Accuracy metrics                │
│  • Bias analysis                   │
└─────────────────────────────────────┘

Development

Project structure

evaluation/
├── zai_eval/
│   ├── __init__.py
│   ├── cli.py              # CLI interface
│   ├── client.py           # HTTP clients
│   ├── models.py           # Data models
│   ├── test_cases.py       # Test case definitions
│   ├── metrics.py          # Metrics calculation
│   └── report.py           # Report generation
├── requirements.txt
├── pyproject.toml
├── .env.example
└── README.md

Adding new test cases

Edit zai_eval/test_cases.py:

TEST_CASES.append(
    EvaluationRequest(
        name="my_test",
        description="My test description",
        model="claude-3-sonnet-20240229",
        max_tokens=100,
        messages=[...],
    )
)

License

Same as parent project.