zai-proxy/proxy/evaluation/README.md
jedarden dee82a76a3 chore: update module paths and add evaluation package
- proxy/go.mod: github.com/ardenone/zai-proxy → git.ardenone.com/jedarden/zai-proxy
- dashboard/go.mod: github.com/ardenone/ardenone-cluster/containers/zai-proxy-dashboard → git.ardenone.com/jedarden/zai-proxy/dashboard
- Update all Go import paths in proxy/ and dashboard/ to match new module paths
- Add proxy/evaluation/ package (was missing from initial commit)
- Add docs/plan/plan.md with architecture, security model, telemetry design, and migration checklist

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:03:50 -04:00

206 lines
5.4 KiB
Markdown

# Z.AI Proxy Evaluation Framework
Tool to compare token counts from z.ai proxy with real Anthropic API responses.
## Purpose
The z.ai proxy counts tokens using tiktoken's `cl100k_base` encoding. This framework validates that the proxy's token counts match the official Anthropic API usage metadata.
## Installation
```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy/evaluation
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Or install as package
pip install -e .
```
## Configuration
Set up environment variables:
```bash
cp .env.example .env
# Edit .env with your API keys
```
Required variables:
- `ZAI_API_KEY` - Your z.ai API key
- `ZAI_PROXY_URL` - Proxy URL (default: http://localhost:8080)
- `ANTHROPIC_API_KEY` - Your Anthropic API key
## Usage
### List available test cases
```bash
zai-eval list-tests
```
### Run all tests
```bash
zai-eval run
```
### Run a specific test
```bash
zai-eval run short_simple
```
### Run with output reports
```bash
zai-eval run --output ./results --json --markdown
```
### Quick test with custom prompt
```bash
zai-eval quick "What is the capital of France?"
```
### Validate endpoints
```bash
zai-eval validate
```
## Test Cases
The framework includes 14 diverse test cases:
1. **short_simple** - Short simple text
2. **medium_conversation** - Medium length conversation
3. **long_context** - Long context with detailed information
4. **code_snippet** - Request involving code
5. **multi_turn_conversation** - Multiple turns of conversation
6. **structured_data** - Request with structured data format
7. **mathematical_content** - Content with mathematical expressions
8. **multilingual_text** - Text with multiple languages
9. **list_heavy_content** - Content with many list items
10. **json_only_response** - Request expecting JSON response
11. **creative_writing** - Creative writing prompt
12. **technical_explanation** - Technical concept explanation
13. **empty_system_message** - Request with system message
14. **special_characters** - Text with many special characters
## Metrics
The framework calculates:
- **Accuracy metrics**: Percentage of exact matches for input/output/total tokens
- **Mean Absolute Error (MAE)**: Average token count difference
- **Mean Percentage Error (MPE)**: Average percentage difference
- **Systematic bias**: Consistent over/under-counting patterns
- **Latency comparison**: Proxy vs Anthropic API response times
## Output
### Console Output
Rich-formatted console output with color-coded results:
- ✓ Green: Exact match
- ~ Yellow: Close (<5% difference)
- Red: Mismatch
### JSON Report
```json
{
"summary": {
"total_requests": 14,
"input_token_accuracy": 85.71,
"output_token_accuracy": 92.86,
"overall_accuracy": 78.57
},
"advanced_metrics": {...},
"bias_analysis": {...},
"results": [...]
}
```
### Markdown Report
Human-readable report with tables and summaries.
## Architecture
```
┌─────────────┐
│ CLI │
└──────┬──────┘
┌─────────────────────────────────────┐
│ DualClient │
│ ┌────────────┐ ┌──────────────┐ │
│ │ Proxy │ │ Anthropic │ │
│ │ Client │ │ Client │ │
│ └────────────┘ └──────────────┘ │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ EvaluationResult │
│ • Compare token counts │
│ • Calculate metrics │
│ • Detect biases │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ EvaluationReport │
│ • Summary statistics │
│ • Accuracy metrics │
│ • Bias analysis │
└─────────────────────────────────────┘
```
## Development
### Project structure
```
evaluation/
├── zai_eval/
│ ├── __init__.py
│ ├── cli.py # CLI interface
│ ├── client.py # HTTP clients
│ ├── models.py # Data models
│ ├── test_cases.py # Test case definitions
│ ├── metrics.py # Metrics calculation
│ └── report.py # Report generation
├── requirements.txt
├── pyproject.toml
├── .env.example
└── README.md
```
### Adding new test cases
Edit `zai_eval/test_cases.py`:
```python
TEST_CASES.append(
EvaluationRequest(
name="my_test",
description="My test description",
model="claude-3-sonnet-20240229",
max_tokens=100,
messages=[...],
)
)
```
## License
Same as parent project.