- proxy/go.mod: github.com/ardenone/zai-proxy → git.ardenone.com/jedarden/zai-proxy - dashboard/go.mod: github.com/ardenone/ardenone-cluster/containers/zai-proxy-dashboard → git.ardenone.com/jedarden/zai-proxy/dashboard - Update all Go import paths in proxy/ and dashboard/ to match new module paths - Add proxy/evaluation/ package (was missing from initial commit) - Add docs/plan/plan.md with architecture, security model, telemetry design, and migration checklist Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| zai_eval | ||
| .env.example | ||
| .gitignore | ||
| evaluator.go | ||
| evaluator_test.go | ||
| EXAMPLE_USAGE.md | ||
| pyproject.toml | ||
| README.md | ||
| report.go | ||
| requirements.txt | ||
| run_evaluation.py | ||
| test_cases.go | ||
Z.AI Proxy Evaluation Framework
Tool to compare token counts from z.ai proxy with real Anthropic API responses.
Purpose
The z.ai proxy counts tokens using tiktoken's cl100k_base encoding. This framework validates that the proxy's token counts match the official Anthropic API usage metadata.
Installation
cd /home/coder/ardenone-cluster/containers/zai-proxy/evaluation
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Or install as package
pip install -e .
Configuration
Set up environment variables:
cp .env.example .env
# Edit .env with your API keys
Required variables:
ZAI_API_KEY- Your z.ai API keyZAI_PROXY_URL- Proxy URL (default: http://localhost:8080)ANTHROPIC_API_KEY- Your Anthropic API key
Usage
List available test cases
zai-eval list-tests
Run all tests
zai-eval run
Run a specific test
zai-eval run short_simple
Run with output reports
zai-eval run --output ./results --json --markdown
Quick test with custom prompt
zai-eval quick "What is the capital of France?"
Validate endpoints
zai-eval validate
Test Cases
The framework includes 14 diverse test cases:
- short_simple - Short simple text
- medium_conversation - Medium length conversation
- long_context - Long context with detailed information
- code_snippet - Request involving code
- multi_turn_conversation - Multiple turns of conversation
- structured_data - Request with structured data format
- mathematical_content - Content with mathematical expressions
- multilingual_text - Text with multiple languages
- list_heavy_content - Content with many list items
- json_only_response - Request expecting JSON response
- creative_writing - Creative writing prompt
- technical_explanation - Technical concept explanation
- empty_system_message - Request with system message
- special_characters - Text with many special characters
Metrics
The framework calculates:
- Accuracy metrics: Percentage of exact matches for input/output/total tokens
- Mean Absolute Error (MAE): Average token count difference
- Mean Percentage Error (MPE): Average percentage difference
- Systematic bias: Consistent over/under-counting patterns
- Latency comparison: Proxy vs Anthropic API response times
Output
Console Output
Rich-formatted console output with color-coded results:
- ✓ Green: Exact match
- ~ Yellow: Close (<5% difference)
- ✗ Red: Mismatch
JSON Report
{
"summary": {
"total_requests": 14,
"input_token_accuracy": 85.71,
"output_token_accuracy": 92.86,
"overall_accuracy": 78.57
},
"advanced_metrics": {...},
"bias_analysis": {...},
"results": [...]
}
Markdown Report
Human-readable report with tables and summaries.
Architecture
┌─────────────┐
│ CLI │
└──────┬──────┘
│
↓
┌─────────────────────────────────────┐
│ DualClient │
│ ┌────────────┐ ┌──────────────┐ │
│ │ Proxy │ │ Anthropic │ │
│ │ Client │ │ Client │ │
│ └────────────┘ └──────────────┘ │
└─────────────────────────────────────┘
│
↓
┌─────────────────────────────────────┐
│ EvaluationResult │
│ • Compare token counts │
│ • Calculate metrics │
│ • Detect biases │
└─────────────────────────────────────┘
│
↓
┌─────────────────────────────────────┐
│ EvaluationReport │
│ • Summary statistics │
│ • Accuracy metrics │
│ • Bias analysis │
└─────────────────────────────────────┘
Development
Project structure
evaluation/
├── zai_eval/
│ ├── __init__.py
│ ├── cli.py # CLI interface
│ ├── client.py # HTTP clients
│ ├── models.py # Data models
│ ├── test_cases.py # Test case definitions
│ ├── metrics.py # Metrics calculation
│ └── report.py # Report generation
├── requirements.txt
├── pyproject.toml
├── .env.example
└── README.md
Adding new test cases
Edit zai_eval/test_cases.py:
TEST_CASES.append(
EvaluationRequest(
name="my_test",
description="My test description",
model="claude-3-sonnet-20240229",
max_tokens=100,
messages=[...],
)
)
License
Same as parent project.