zai-proxy/docs/notes/zai-rate-limit-testing.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00

354 lines
8.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Z.AI Rate Limit Testing & Validation
## Claimed Rate Limit: 2400 requests / 5 hours
**Effective rate**: 2400 / (5 × 60 × 60) = 2400 / 18000 = **0.133 requests/second**
This is quite restrictive! Let's validate and optimize.
## Testing Strategy
### Phase 1: Validate Actual Limits (Conservative)
Start with **very conservative** settings to avoid account suspension:
```yaml
env:
- name: RATE_LIMIT_INITIAL
value: "0.1" # Start at 0.1 req/s (360/hour, well below limit)
- name: RATE_LIMIT_MIN
value: "0.05" # Minimum 0.05 req/s (180/hour)
- name: RATE_LIMIT_MAX
value: "0.2" # Max 0.2 req/s (720/hour)
- name: MAX_RETRIES
value: "5" # More retries to handle bursts
```
**Expected behavior:**
- Should run smoothly with no 429s
- If we get 429s at 0.1 req/s, the limit is real
- If no 429s, we can gradually increase
### Phase 2: Find True Limit (Incremental Testing)
Run a **load test** with incrementally increasing rates:
```bash
# Create a test script
cat > /tmp/test-zai-limits.sh << 'EOF'
#!/bin/bash
PROXY_URL="http://zai-proxy.devpod.svc.cluster.local:8080"
TEST_DURATION=300 # 5 minutes per test
for rate in 0.1 0.15 0.2 0.3 0.5 1.0; do
echo "========================================"
echo "Testing rate: $rate req/s"
echo "========================================"
# Use Apache Bench (ab) or similar
total_requests=$(echo "$rate * $TEST_DURATION" | bc)
concurrency=$(echo "$rate * 2" | bc | cut -d. -f1)
concurrency=${concurrency:-1}
echo "Total requests: $total_requests"
echo "Concurrency: $concurrency"
# Run test
ab -n $total_requests -c $concurrency \
-H "Content-Type: application/json" \
-p /tmp/test-payload.json \
"$PROXY_URL/v1/messages" | tee /tmp/test-rate-$rate.log
# Check for 429s
count_429=$(grep -c "429" /tmp/test-rate-$rate.log || echo 0)
echo "429 errors: $count_429"
if [ $count_429 -gt 0 ]; then
echo "❌ Hit rate limit at $rate req/s"
echo "True limit is between previous rate and $rate"
break
else
echo "✅ No rate limit at $rate req/s"
fi
# Wait between tests
echo "Waiting 60s before next test..."
sleep 60
done
EOF
chmod +x /tmp/test-zai-limits.sh
```
### Phase 3: Monitor with Prometheus
While testing, monitor these queries:
```promql
# Current rate limit setting
zai_proxy_rate_limit_requests_per_second
# 429 error rate
sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))
# Success rate
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
# Rate adjustments
rate(zai_proxy_rate_limit_adjustments_total[5m])
# Retry attempts due to 429
rate(zai_proxy_retry_attempts_total{reason="429"}[5m])
```
## Recommended Configuration
Based on **2400 requests / 5 hours = 0.133 req/s**:
### Conservative (Safe Production)
Stay **well below** the limit with headroom:
```yaml
env:
- name: RATE_LIMIT_INITIAL
value: "0.08" # 60% of limit (288/hour, 1728/5h)
- name: RATE_LIMIT_MIN
value: "0.05" # 38% of limit (180/hour)
- name: RATE_LIMIT_MAX
value: "0.12" # 90% of limit (432/hour, 2160/5h)
- name: MAX_RETRIES
value: "5"
```
**Rationale:**
- **Initial 0.08**: Start at 60% to allow burst traffic
- **Min 0.05**: Safety floor in case of aggressive backoff
- **Max 0.12**: Cap at 90% to avoid hitting limit during bursts
- **Retries 5**: More attempts to handle transient issues
### Aggressive (Maximum Utilization)
Push closer to the limit (risky, may hit 429s):
```yaml
env:
- name: RATE_LIMIT_INITIAL
value: "0.12" # 90% of limit
- name: RATE_LIMIT_MIN
value: "0.08" # 60% of limit
- name: RATE_LIMIT_MAX
value: "0.13" # 98% of limit (2340/5h)
- name: MAX_RETRIES
value: "7" # More retries needed
```
**Rationale:**
- **Initial 0.12**: Start near limit
- **Min 0.08**: Still have room to decrease
- **Max 0.13**: Push to 98% of claimed limit
- **Retries 7**: Handle bursts with more retries
## Burst Handling
The token bucket allows **bursts up to 2x the rate**:
```
Rate: 0.1 req/s
Burst: 0.2 tokens (allows 2 requests instantly)
```
This means you can handle:
- **Single spike**: 2 requests instantly
- **Sustained spike**: 0.2 req/s for burst duration
- **Then throttled**: Back to 0.1 req/s
Example burst scenario:
```
Time 0s: Send 2 requests instantly (use burst tokens)
Time 0s: Bucket now empty, wait for refill
Time 10s: Bucket has 1 token (0.1 × 10s)
Time 10s: Send 1 request
Time 20s: Bucket has 1 token again
```
## Scaling with Multiple Replicas
If you have **N replicas**, each gets the **full rate limit**:
```yaml
# With 3 replicas:
spec:
replicas: 3
env:
- name: RATE_LIMIT_MAX
value: "0.04" # 0.04 × 3 = 0.12 total
```
**But this assumes:**
- Load balancer distributes evenly
- No single replica hits limit
- Total cluster rate = N × per-replica rate
**Reality:**
- Uneven distribution can cause one replica to hit limit
- Better to set **per-replica rate = total_limit / (N × 1.5)** for safety
Example with 3 replicas:
```
Total limit: 0.133 req/s
Per-replica: 0.133 / (3 × 1.5) = 0.03 req/s
```
## Monitoring Dashboard
Create a Grafana panel to track rate limit efficiency:
```promql
# Requests per second (actual)
sum(rate(zai_proxy_requests_total[5m]))
# Vs. configured limit
zai_proxy_rate_limit_requests_per_second
# Efficiency ratio (want ~0.9)
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
zai_proxy_rate_limit_requests_per_second
```
**Target efficiency: 85-95%**
- Below 85%: You're under-utilizing (can increase rate)
- Above 95%: Risk of hitting limits (decrease rate)
## Testing Scenarios
### Scenario 1: Validate 2400/5h Limit
```bash
# Send exactly 2400 requests over 5 hours
# Rate: 0.133 req/s
# Set rate limit to exactly the claimed limit
kubectl set env deployment/zai-proxy -n devpod \
RATE_LIMIT_INITIAL=0.133 \
RATE_LIMIT_MIN=0.133 \
RATE_LIMIT_MAX=0.133
# Monitor for 5 hours
# If we get 429s, the limit is real
# If no 429s, we can go higher
```
### Scenario 2: Find Burst Capacity
```bash
# Send requests as fast as possible for 10 seconds
# See how many succeed before 429
for i in {1..100}; do
curl -X POST "http://zai-proxy:8080/v1/messages" \
-H "Content-Type: application/json" \
-d '{"model":"glm-4.7","messages":[{"role":"user","content":"test"}]}' &
done
wait
# Check logs for 429s
kubectl logs -n devpod deployment/zai-proxy | grep -c "429"
```
### Scenario 3: Sustained Load Test
```bash
# Use wrk or ab for sustained load
wrk -t4 -c10 -d300s --rate 0.15 \
-s /tmp/post-script.lua \
http://zai-proxy:8080/v1/messages
# Monitor:
# - Does rate limiter adapt?
# - Are 429s absorbed by retries?
# - Does success rate stay high?
```
## Expected Results
### If limit is 2400/5h (0.133 req/s):
| Setting | Expected Behavior |
|---------|-------------------|
| Rate = 0.1 | ✅ No 429s, smooth operation |
| Rate = 0.13 | ⚠️ Occasional 429s, absorbed by retries |
| Rate = 0.15 | ❌ Frequent 429s, some reach client |
| Rate = 0.2 | ❌ Constant 429s, rate decreases to min |
### If limit is actually higher:
| True Limit | Safe Max Rate | Notes |
|------------|---------------|-------|
| 5000/5h | 0.25 req/s | Common tier |
| 10000/5h | 0.5 req/s | Pro tier |
| 50000/5h | 2.5 req/s | Enterprise |
## Automation: Self-Tuning Script
Create a CronJob to automatically tune rate limits:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: zai-rate-tuner
namespace: devpod
spec:
schedule: "0 */6 * * *" # Every 6 hours
jobTemplate:
spec:
template:
spec:
containers:
- name: tuner
image: appropriate/curl
command:
- /bin/sh
- -c
- |
# Query Prometheus for 429 rate
PROM_URL="http://prometheus:9090"
RATE_429=$(curl -s "$PROM_URL/api/v1/query?query=sum(rate(zai_proxy_requests_total{status_code=\"429\"}[6h]))" | jq -r '.data.result[0].value[1]')
# If 429 rate > 0.01, decrease limit by 20%
if (( $(echo "$RATE_429 > 0.01" | bc -l) )); then
echo "High 429 rate detected: $RATE_429"
echo "Decreasing rate limit..."
# Update deployment env vars
fi
```
## Summary & Recommendation
**For production with 2400/5h limit:**
1. **Start conservative**: 0.08 req/s (60% of limit)
2. **Monitor for 24-48h**: Check for any 429s
3. **Gradually increase**: +10% every 24h if no errors
4. **Stop when**: 429 rate exceeds 1% OR hit 0.12 req/s
5. **Set final config**: Last known-good rate as MAX
**Example final config:**
```yaml
env:
- name: RATE_LIMIT_INITIAL
value: "0.08"
- name: RATE_LIMIT_MIN
value: "0.05"
- name: RATE_LIMIT_MAX
value: "0.11" # Found via testing
- name: MAX_RETRIES
value: "5"
```
This ensures **maximum utilization** while **minimizing client-visible errors**! 🎯