# Z.AI Proxy Evaluation Framework - Example Usage

This document provides examples and usage patterns for the evaluation framework.

## Quick Start

### 1. Set up environment

```bash
cd /home/coder/ardenone-cluster/containers/zai-proxy/evaluation

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Set up environment variables
export ZAI_API_KEY="your-zai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export ZAI_PROXY_URL="http://zai-proxy.devpod.svc.cluster.local:8080"
```
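
A missing variable tends to surface later as a 401 or a refused connection, so it can help to check the environment before running anything. The following is a minimal sketch, not part of the framework: `missing_env` and `REQUIRED_VARS` are hypothetical names, and only the three variable names come from the setup step.

```python
import os

# Variable names taken from the export lines in the setup step.
REQUIRED_VARS = ("ZAI_API_KEY", "ANTHROPIC_API_KEY", "ZAI_PROXY_URL")


def missing_env(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Illustration with an explicit mapping instead of the real environment.
example_env = {
    "ZAI_API_KEY": "your-zai-api-key",
    "ANTHROPIC_API_KEY": "",
    "ZAI_PROXY_URL": "http://localhost:8080",
}
print(missing_env(environ=example_env))
```

Calling `missing_env()` with no arguments checks the real process environment; an empty list means all three variables are set.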

### 2. Run all tests

```bash
zai-eval run
```

### 3. Run specific test

```bash
zai-eval run short_simple
```

### 4. Run with output reports

```bash
zai-eval run --output ./results --json --markdown
```

## Test Results Interpretation

### Console Output

```
╭──────────────────────────────────────────╮
│       Z.AI PROXY EVALUATION REPORT       │
╰──────────────────────────────────────────╯

Summary
────────────────────────────────────
Total Requests: 14
Successful: 14
Failed: 0

Token Count Accuracy
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric                ┃ Accuracy (%)  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Input Token Accuracy  │ 85.71%        │
│ Output Token Accuracy │ 92.86%        │
│ Overall Accuracy      │ 78.57%        │
└───────────────────────┴───────────────┘

Systematic Bias Analysis
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value         ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Input Bias         │ +2.3 tokens   │
│ Output Bias        │ +1.1 tokens   │
└────────────────────┴───────────────┘
```

### Interpreting Bias

- **Positive bias (+)**: Proxy overcounts (more tokens than Anthropic)
- **Negative bias (-)**: Proxy undercounts (fewer tokens than Anthropic)
- **Near zero**: Accurate counting
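
The sign convention above can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's implementation: bias is taken as the mean of `proxy - anthropic`, the function names are hypothetical, the `tolerance` cutoff for "near zero" is arbitrary, and the sample counts are made up.

```python
from statistics import mean


def mean_bias(pairs):
    """Mean signed difference (proxy - anthropic) over (proxy, anthropic) pairs."""
    return mean(proxy - anthropic for proxy, anthropic in pairs)


def describe_bias(bias, tolerance=0.5):
    """Map a signed bias to a label; `tolerance` is an illustrative cutoff."""
    if bias > tolerance:
        return "overcounts"
    if bias < -tolerance:
        return "undercounts"
    return "near zero"


# Made-up counts for illustration: (proxy_tokens, anthropic_tokens)
samples = [(12, 10), (25, 22), (8, 8), (101, 97)]
bias = mean_bias(samples)
print(f"{bias:+.2f} tokens -> proxy {describe_bias(bias)}")
```

With these sample pairs the differences are 2, 3, 0 and 4, so the mean is positive and the proxy is labeled as overcounting.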

## Test Cases Reference

| Test Name | Description | Expected Behavior |
|-----------|-------------|-------------------|
| short_simple | Short simple text | Should match exactly |
| medium_conversation | Medium conversation | Should match exactly |
| long_context | Long detailed text | May have small variance |
| code_snippet | Code content | Special characters may affect count |
| multilingual_text | Multiple languages | Different tokenization per language |
| special_characters | Many symbols | May differ due to encoding |

## Common Issues

### Issue: Proxy returns no token counts

**Symptom**: `input_tokens=None` in results

**Solution**: Check that the proxy is running with token counting enabled:
```bash
kubectl logs deployment/zai-proxy -n devpod | grep "Token counting"
```

Expected output:
```
Token counting enabled (tiktoken cl100k_base encoding, model: glm-4)
```

### Issue: Anthropic API returns 401

**Symptom**: `anthropic_response.error` contains "401"

**Solution**: Verify that `ANTHROPIC_API_KEY` is set correctly:
```bash
# Print only the first 10 characters so the full key is not exposed
echo "$ANTHROPIC_API_KEY" | cut -c1-10
```

### Issue: Connection refused

**Symptom**: `Connection refused` when connecting to the proxy

**Solution**: Verify that `ZAI_PROXY_URL` matches where you are running from:
```bash
# From within cluster
export ZAI_PROXY_URL="http://zai-proxy.devpod.svc.cluster.local:8080"

# From local machine
export ZAI_PROXY_URL="http://localhost:8080"
```
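
To tell a wrong URL apart from a proxy that is down, a plain TCP check against the configured host and port is often enough. The sketch below uses only the Python standard library; `host_and_port` and `proxy_reachable` are hypothetical helper names, and the check confirms only that something accepts connections, not that it is the proxy.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(url):
    """Extract (host, port) from a URL, defaulting the port from the scheme."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def proxy_reachable(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


print(host_and_port("http://zai-proxy.devpod.svc.cluster.local:8080"))
```

Running `proxy_reachable(os.environ["ZAI_PROXY_URL"])` (after `import os`) from the same shell that runs `zai-eval` tests the exact URL the evaluation will use.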

## Advanced Usage

### Custom test case

Create a Python script:

```python
import os

from zai_eval.client import DualClient
from zai_eval.models import EvaluationResult

# Build a client configured for both the proxy and the Anthropic API
client = DualClient(
    proxy_url=os.getenv("ZAI_PROXY_URL"),
    proxy_api_key=os.getenv("ZAI_API_KEY"),
    anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
)

# Send the same request and collect one response from each side
proxy_resp, anthropic_resp = client.evaluate_request(
    model="claude-3-sonnet-20240229",
    messages=[{"role": "user", "content": "Your custom prompt"}],
    max_tokens=100,
)

# Pair the two responses and compute the token-count differences
result = EvaluationResult(
    request_name="custom_test",
    proxy_response=proxy_resp,
    anthropic_response=anthropic_resp,
)
result.calculate_metrics()

print(f"Input tokens: Proxy={proxy_resp.input_tokens}, Anthropic={anthropic_resp.input_tokens}")
print(f"Difference: {result.input_diff} ({result.input_pct_diff:.1f}%)")
```

### Batch testing

```bash
# Run specific tests only
zai-eval run short_simple medium_conversation long_context

# With verbose output
zai-eval run --verbose

# Save to custom location
zai-eval run --output ~/evaluation-results --json --markdown
```

## Metrics Reference

### Accuracy Metrics

- **Input Token Accuracy**: Percentage of requests whose input token count matches exactly
- **Output Token Accuracy**: Percentage of requests whose output token count matches exactly
- **Overall Accuracy**: Percentage where both input AND output match
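
The three accuracy figures can be sketched as exact-match rates over a list of results. This is an illustration only: the function name and the `"input"`/`"output"` dictionary keys are hypothetical, and the counts are made up. It also shows why the overall figure can be lower than either of the other two.

```python
def exact_match_accuracy(results):
    """Return (input %, output %, overall %) of exact token-count matches.

    Each result is a dict with hypothetical keys "input" and "output",
    each holding a (proxy_count, anthropic_count) pair.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    input_hits = sum(r["input"][0] == r["input"][1] for r in results)
    output_hits = sum(r["output"][0] == r["output"][1] for r in results)
    both_hits = sum(
        r["input"][0] == r["input"][1] and r["output"][0] == r["output"][1]
        for r in results
    )
    return tuple(100.0 * hits / total for hits in (input_hits, output_hits, both_hits))


# Made-up counts: two requests match on both sides, two match on one side only.
results = [
    {"input": (10, 10), "output": (5, 5)},
    {"input": (20, 21), "output": (7, 7)},
    {"input": (30, 30), "output": (9, 8)},
    {"input": (40, 40), "output": (3, 3)},
]
print(exact_match_accuracy(results))
```

Here input and output each match in three of four requests, but only two requests match on both, so the overall rate is the lowest of the three.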

### Error Metrics

- **MAE (Mean Absolute Error)**: Average absolute token difference between proxy and Anthropic counts
- **MPE (Mean Percentage Error)**: Average percentage difference
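
A short sketch of both error metrics follows. It assumes MPE is the signed percentage error relative to the Anthropic count, which is the usual definition but is not confirmed by this document; the function names and sample counts are illustrative.

```python
def mean_absolute_error(pairs):
    """Average of |proxy - anthropic| over (proxy, anthropic) count pairs."""
    return sum(abs(proxy - anthropic) for proxy, anthropic in pairs) / len(pairs)


def mean_percentage_error(pairs):
    """Average signed error as a percentage of the Anthropic count.

    Assumes every Anthropic count is non-zero.
    """
    return sum(100.0 * (proxy - anthropic) / anthropic for proxy, anthropic in pairs) / len(pairs)


# Made-up counts: one overcount, one undercount, one exact match.
pairs = [(110, 100), (45, 50), (200, 200)]
print(mean_absolute_error(pairs), mean_percentage_error(pairs))
```

In this example the +10% and -10% errors cancel in the signed MPE while MAE stays at 5 tokens, which is why the two metrics are worth reading together.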

### Latency Metrics

- **Proxy Latency**: Time for proxy request (ms)
- **Anthropic Latency**: Time for Anthropic request (ms)
- **Overhead**: Additional latency from proxy
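
As a rough sketch, latency can be measured by timing each call and overhead taken as the difference between the two measurements. That definition of overhead is an assumption here, since the framework's exact calculation is not shown, and `timed_ms` and `overhead_ms` are hypothetical helpers.

```python
import time


def timed_ms(func, *args, **kwargs):
    """Call func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


def overhead_ms(proxy_latency_ms, anthropic_latency_ms):
    """Overhead, taken here as proxy latency minus Anthropic latency."""
    return proxy_latency_ms - anthropic_latency_ms


# Stand-in workload; a real measurement would time the two API requests.
value, elapsed = timed_ms(sum, [1, 2, 3])
print(value, f"{elapsed:.3f} ms")
print(overhead_ms(850.0, 800.0))
```

A negative overhead is possible under this definition, for example when the upstream behind the proxy simply answers faster than Anthropic does.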

### Bias Analysis

- **Input Bias Mean**: Average over/under-count for input tokens
- **Output Bias Mean**: Average over/under-count for output tokens
- **Consistently High/Low**: Number of tests with consistent bias direction
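
One way to read the direction counts is as a tally of per-test signed differences. The sketch below assumes each difference is `proxy - anthropic`; the function name, the returned keys and the sample values are illustrative rather than the framework's own.

```python
def bias_direction_counts(diffs):
    """Count signed differences (proxy - anthropic) by direction."""
    high = sum(1 for d in diffs if d > 0)
    low = sum(1 for d in diffs if d < 0)
    return {"high": high, "low": low, "exact": len(diffs) - high - low}


# Made-up per-test differences for illustration.
diffs = [2, 3, 0, 1, -1]
print(bias_direction_counts(diffs))
```

A tally dominated by one direction suggests a systematic offset, while a balanced tally with a similar mean points to noise rather than bias.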

## Integration with CI/CD

```yaml
# .github/workflows/evaluation.yml
name: Token Accuracy Evaluation

on:
  schedule:
    - cron: '0 0 * * *'  # Daily
  workflow_dispatch:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          cd evaluation
          pip install -r requirements.txt
          pip install -e .
      - name: Run evaluation
        env:
          ZAI_API_KEY: ${{ secrets.ZAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          ZAI_PROXY_URL: ${{ secrets.ZAI_PROXY_URL }}
        run: |
          cd evaluation
          zai-eval run --output results --json --markdown
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-results
          path: evaluation/results/
```
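
The workflow above stores the reports but does not say whether a drop in accuracy fails the job. If a gate is wanted, a threshold check along these lines could run as an extra step. Everything here is an assumption: the metric keys (`input_accuracy`, `overall_accuracy`) are hypothetical and would need to be mapped from whatever the JSON report actually contains, and the thresholds are arbitrary.

```python
def failed_checks(summary, thresholds):
    """Return a message for every metric in `thresholds` below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = summary.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from summary")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f}% is below {minimum:.2f}%")
    return failures


# Values copied from the sample console report; key names are hypothetical.
summary = {"input_accuracy": 85.71, "output_accuracy": 92.86, "overall_accuracy": 78.57}
thresholds = {"input_accuracy": 80.0, "overall_accuracy": 80.0}
for message in failed_checks(summary, thresholds):
    print(message)
```

A CI step could exit non-zero whenever `failed_checks` returns a non-empty list, which turns the nightly run into a regression alarm instead of a passive archive.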