Z.AI Rate Limit Testing & Validation
Claimed Rate Limit: 2400 requests / 5 hours
Effective rate: 2400 / (5 × 60 × 60) = 2400 / 18000 = 0.133 requests/second
This is quite restrictive: roughly one request every 7.5 seconds (480/hour). Let's validate and optimize.
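To redo this arithmetic for a different quota or window, a small helper is enough. This is a minimal sketch; `quota_to_rate` is a hypothetical name and is not part of the proxy.

```shell
# quota_to_rate REQUESTS HOURS
# Prints the sustained rate implied by a quota of REQUESTS per HOURS.
quota_to_rate() {
  awk -v n="$1" -v h="$2" 'BEGIN {
    rps = n / (h * 3600)
    printf "%.3f req/s, %d req/hour, one request every %.1f s\n", rps, n / h, 1 / rps
  }'
}

quota_to_rate 2400 5
# prints: 0.133 req/s, 480 req/hour, one request every 7.5 s
```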
Testing Strategy
Phase 1: Validate Actual Limits (Conservative)
Start with very conservative settings to avoid account suspension:
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.1" # Start at 0.1 req/s (360/hour, 75% of the claimed 480/hour)
  - name: RATE_LIMIT_MIN
    value: "0.05" # Minimum 0.05 req/s (180/hour)
  - name: RATE_LIMIT_MAX
    value: "0.2" # Max 0.2 req/s (720/hour, above the claimed 480/hour)
  - name: MAX_RETRIES
    value: "5" # More retries to handle bursts
Expected behavior:
- Should run smoothly with no 429s
- If we get 429s at 0.1 req/s, the limit is real and at least as strict as claimed
- If no 429s, we can gradually increase
Phase 2: Find True Limit (Incremental Testing)
Run a load test with incrementally increasing rates:
# Create a test script
cat > /tmp/test-zai-limits.sh << 'EOF'
#!/bin/bash
PROXY_URL="http://zai-proxy.devpod.svc.cluster.local:8080"
TEST_DURATION=300 # 5 minutes per test
for rate in 0.1 0.15 0.2 0.3 0.5 1.0; do
  echo "========================================"
  echo "Testing rate: $rate req/s"
  echo "========================================"
  # Use Apache Bench (ab) or similar. ab does not pace requests: it sends as
  # fast as the concurrency allows, so the proxy's limiter sets the real rate.
  # ab needs integer arguments, so round the bc results.
  total_requests=$(printf '%.0f' "$(echo "$rate * $TEST_DURATION" | bc -l)")
  concurrency=$(printf '%.0f' "$(echo "$rate * 2" | bc -l)")
  [ "$concurrency" -ge 1 ] || concurrency=1
  echo "Total requests: $total_requests"
  echo "Concurrency: $concurrency"
  # Run test (-T sets the Content-Type of the POST body)
  ab -n "$total_requests" -c "$concurrency" \
    -T "application/json" \
    -p /tmp/test-payload.json \
    "$PROXY_URL/v1/messages" | tee "/tmp/test-rate-$rate.log"
  # ab does not list individual status codes; it reports "Non-2xx responses: N"
  # (the line is absent when every response was 2xx)
  count_429=$(awk '/^Non-2xx responses:/ {print $3}' "/tmp/test-rate-$rate.log")
  count_429=${count_429:-0}
  echo "Non-2xx responses (429s included): $count_429"
  if [ "$count_429" -gt 0 ]; then
    echo "❌ Hit rate limit at $rate req/s"
    echo "True limit is between previous rate and $rate"
    break
  else
    echo "✅ No rate limit at $rate req/s"
  fi
  # Wait between tests
  echo "Waiting 60s before next test..."
  sleep 60
done
EOF
chmod +x /tmp/test-zai-limits.sh
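The script posts `/tmp/test-payload.json`, which is never created in these notes. A minimal sketch for creating it follows, reusing the request body from the burst-capacity curl example. `write_test_payload` is a hypothetical helper, and the model name is whatever your account exposes.

```shell
# write_test_payload PATH
# Writes the JSON body that `ab -p` will POST.
write_test_payload() {
  cat > "$1" << 'JSON'
{"model":"glm-4.7","messages":[{"role":"user","content":"test"}]}
JSON
}

write_test_payload /tmp/test-payload.json
```

Run this once before starting `/tmp/test-zai-limits.sh`.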
Phase 3: Monitor with Prometheus
While testing, monitor these queries:
# Current rate limit setting
zai_proxy_rate_limit_requests_per_second
# 429 error rate
sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))
# Success rate
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
# Rate adjustments
rate(zai_proxy_rate_limit_adjustments_total[5m])
# Retry attempts due to 429
rate(zai_proxy_retry_attempts_total{reason="429"}[5m])
Recommended Configuration
Based on 2400 requests / 5 hours = 0.133 req/s:
Conservative (Safe Production)
Stay well below the limit with headroom:
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.08" # 60% of limit (288/hour, 1440/5h)
  - name: RATE_LIMIT_MIN
    value: "0.05" # 38% of limit (180/hour)
  - name: RATE_LIMIT_MAX
    value: "0.12" # 90% of limit (432/hour, 2160/5h)
  - name: MAX_RETRIES
    value: "5"
Rationale:
- Initial 0.08: Start at 60% to allow burst traffic
- Min 0.05: Safety floor in case of aggressive backoff
- Max 0.12: Cap at 90% to avoid hitting limit during bursts
- Retries 5: More attempts to handle transient issues
Aggressive (Maximum Utilization)
Push closer to the limit (risky, may hit 429s):
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.12" # 90% of limit
  - name: RATE_LIMIT_MIN
    value: "0.08" # 60% of limit
  - name: RATE_LIMIT_MAX
    value: "0.13" # 98% of limit (2340/5h)
  - name: MAX_RETRIES
    value: "7" # More retries needed
Rationale:
- Initial 0.12: Start near limit
- Min 0.08: Still have room to decrease
- Max 0.13: Push to 98% of claimed limit
- Retries 7: Handle bursts with more retries
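The percentages and per-window totals in the config comments can be recomputed rather than trusted. The sketch below assumes the claimed 2400 requests / 5 hours as the default quota; `describe_rate` is a hypothetical helper name.

```shell
# describe_rate RATE_RPS [QUOTA] [WINDOW_HOURS]
# Prints what a configured rate means relative to the quota (default 2400 / 5h).
describe_rate() {
  awk -v r="$1" -v q="${2:-2400}" -v h="${3:-5}" 'BEGIN {
    limit = q / (h * 3600)
    printf "%s req/s = %.1f%% of limit, %.0f/hour, %.0f per %sh\n", r, 100 * r / limit, r * 3600, r * h * 3600, h
  }'
}

for rate in 0.05 0.08 0.12 0.13; do
  describe_rate "$rate"
done
```

For example, `describe_rate 0.08` reports 60.0% of the limit, 288/hour and 1440 per 5h.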
Burst Handling
The token bucket allows short bursts above the sustained rate. With a burst capacity of 2 tokens:
Rate: 0.1 req/s
Burst: 2 tokens (allows 2 requests instantly)
This means you can handle:
- Single spike: 2 requests instantly
- Sustained spike: faster than 0.1 req/s only until the burst tokens are spent
- Then throttled: Back to 0.1 req/s
Example burst scenario:
Time 0s: Send 2 requests instantly (use burst tokens)
Time 0s: Bucket now empty, wait for refill
Time 10s: Bucket has 1 token (0.1 × 10s)
Time 10s: Send 1 request
Time 20s: Bucket has 1 token again
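The scenario can be replayed with a tiny token-bucket simulation. This is a sketch of the general technique, not the proxy's actual limiter: it assumes the bucket starts full with a capacity of 2 tokens (as in the scenario) and rejects rather than queues requests that find no token. `simulate_bucket` is a hypothetical name.

```shell
# simulate_bucket RATE BURST T1 T2 ...
# Replays request arrival times (seconds, ascending) against a token bucket
# that starts full, refills at RATE tokens/s and holds at most BURST tokens.
simulate_bucket() {
  rate=$1; burst=$2; shift 2
  printf '%s\n' "$@" | awk -v rate="$rate" -v burst="$burst" '
    BEGIN { tokens = burst; last = 0 }
    {
      tokens += ($1 - last) * rate          # refill for the elapsed time
      if (tokens > burst) tokens = burst    # never exceed bucket capacity
      last = $1
      # small epsilon guards against floating-point refill error
      if (tokens >= 1 - 1e-9) { tokens -= 1; verdict = "allowed" }
      else                    { verdict = "throttled" }
      printf "t=%ss %s\n", $1, verdict
    }'
}

# Three requests at t=0, then one each at 10s, 15s and 20s
simulate_bucket 0.1 2 0 0 0 10 15 20
```

With these inputs the third request at t=0 and the one at t=15 are throttled, and the rest are allowed.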
Scaling with Multiple Replicas
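If you have N replicas, each replica enforces its configured rate independently, so divide the total limit across them:
# With 3 replicas (abbreviated; env belongs under the container spec):
spec:
  replicas: 3
  env:
    - name: RATE_LIMIT_MAX
      value: "0.04" # 0.04 × 3 = 0.12 total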
But this assumes:
- Load balancer distributes evenly
- No single replica hits limit
- Total cluster rate = N × per-replica rate
Reality:
- Uneven distribution can cause one replica to hit limit
- Better to set per-replica rate = total_limit / (N × 1.5) for safety
Example with 3 replicas:
Total limit: 0.133 req/s
Per-replica: 0.133 / (3 × 1.5) = 0.03 req/s
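The per-replica formula is easy to script when the replica count changes. A minimal sketch, with `per_replica_rate` as a hypothetical helper and the 1.5 safety factor from above as the default:

```shell
# per_replica_rate TOTAL_RPS REPLICAS [SAFETY_FACTOR]
# Prints total / (replicas * safety), rounded to three decimals.
per_replica_rate() {
  awk -v total="$1" -v n="$2" -v safety="${3:-1.5}" \
    'BEGIN { printf "%.3f\n", total / (n * safety) }'
}

per_replica_rate 0.133 3      # with the 1.5 safety factor
per_replica_rate 0.12 3 1     # even split, no safety margin
```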
Monitoring Dashboard
Create a Grafana panel to track rate limit efficiency:
# Requests per second (actual)
sum(rate(zai_proxy_requests_total[5m]))
# Vs. configured limit
zai_proxy_rate_limit_requests_per_second
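# Efficiency ratio (want ~0.9)
# Both sides are aggregated so their label sets match; without the sum() on
# the right, the division returns no data.
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
sum(zai_proxy_rate_limit_requests_per_second)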
Target efficiency: 85-95%
- Below 85%: You're under-utilizing (can increase rate)
- Above 95%: Risk of hitting limits (decrease rate)
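The same thresholds can be applied outside Grafana, for example in a quick check script. A minimal sketch, assuming you already have the two numbers (successful req/s and configured limit) from Prometheus; `classify_efficiency` is a hypothetical name.

```shell
# classify_efficiency SUCCESS_RPS LIMIT_RPS
# Applies the 85-95% target band to the ratio of the two rates.
classify_efficiency() {
  awk -v ok="$1" -v limit="$2" 'BEGIN {
    pct = 100 * ok / limit
    if (pct < 85)      verdict = "under-utilized"
    else if (pct > 95) verdict = "too close to the limit"
    else               verdict = "on target"
    printf "%.0f%% %s\n", pct, verdict
  }'
}

classify_efficiency 0.09 0.1
```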
Testing Scenarios
Scenario 1: Validate 2400/5h Limit
# Send exactly 2400 requests over 5 hours
# Rate: 0.133 req/s
# Set rate limit to exactly the claimed limit
kubectl set env deployment/zai-proxy -n devpod \
RATE_LIMIT_INITIAL=0.133 \
RATE_LIMIT_MIN=0.133 \
RATE_LIMIT_MAX=0.133
# Monitor for 5 hours
# If we get 429s, the limit is real
# If no 429s, we can go higher
Scenario 2: Find Burst Capacity
# Fire 100 concurrent requests and see how many succeed before 429s appear
# (-s -o /dev/null -w prints only the HTTP status code of each response)
for i in {1..100}; do
  curl -s -o /dev/null -w '%{http_code}\n' -X POST "http://zai-proxy:8080/v1/messages" \
    -H "Content-Type: application/json" \
    -d '{"model":"glm-4.7","messages":[{"role":"user","content":"test"}]}' &
done
wait
# Check logs for 429s
kubectl logs -n devpod deployment/zai-proxy | grep -c "429"
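Counting status codes on the client side complements the log grep, since the proxy may absorb some upstream 429s through retries. A minimal sketch: `tally_status` is a hypothetical helper that expects one HTTP status code per line, such as the output of `curl -s -o /dev/null -w '%{http_code}\n'`.

```shell
# tally_status: reads one HTTP status code per line on stdin and prints
# "<code> <count>" pairs sorted by code.
tally_status() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Example with canned codes instead of live requests
printf '%s\n' 200 200 429 200 429 | tally_status
```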
Scenario 3: Sustained Load Test
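# Use wrk2 or similar for sustained load at a fixed rate
# (--rate is a wrk2 option; stock wrk and ab do not pace requests, and
# your wrk2 build may only accept whole-number rates)
wrk -t4 -c10 -d300s --rate 0.15 \
  -s /tmp/post-script.lua \
  http://zai-proxy:8080/v1/messages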
# Monitor:
# - Does rate limiter adapt?
# - Are 429s absorbed by retries?
# - Does success rate stay high?
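For sub-1 req/s rates, a plain paced loop is often simpler than a load-testing tool. The sketch below is one way to do it: `send_paced` is a hypothetical helper that runs any command a fixed number of times with a `1/RATE` second pause in between. It assumes a `sleep` that accepts fractional seconds (GNU and BSD `sleep` do).

```shell
# send_paced RATE COUNT CMD [ARGS...]
# Runs CMD COUNT times, sleeping 1/RATE seconds between runs.
send_paced() {
  rate=$1; count=$2; shift 2
  interval=$(awk -v r="$rate" 'BEGIN { printf "%.3f", 1 / r }')
  i=1
  while [ "$i" -le "$count" ]; do
    "$@"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}

# Example: 0.15 req/s for 300 s is 45 requests (URL and payload path are the
# ones used elsewhere in these notes):
# send_paced 0.15 45 curl -s -o /dev/null -w '%{http_code}\n' \
#   -X POST "http://zai-proxy:8080/v1/messages" \
#   -H "Content-Type: application/json" -d @/tmp/test-payload.json
```

Because requests run one at a time, the achieved rate is slightly below the target when responses are slow.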
Expected Results
If limit is 2400/5h (0.133 req/s):
| Setting | Expected Behavior |
|---|---|
| Rate = 0.1 | ✅ No 429s, smooth operation |
| Rate = 0.13 | ⚠️ Occasional 429s, absorbed by retries |
| Rate = 0.15 | ❌ Frequent 429s, some reach client |
| Rate = 0.2 | ❌ Constant 429s, rate decreases to min |
If limit is actually higher:
| True Limit | Safe Max Rate | Notes |
|---|---|---|
| 5000/5h | 0.25 req/s | Common tier |
| 10000/5h | 0.5 req/s | Pro tier |
| 50000/5h | 2.5 req/s | Enterprise |
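The safe max rates in the table are consistent with taking about 90% of the raw quota rate. That is an inference from the numbers, not something the table states. Under that assumption, a hypothetical `safe_max_rate` helper reproduces the column:

```shell
# safe_max_rate QUOTA [WINDOW_HOURS] [FRACTION]
# Prints FRACTION of the raw rate implied by QUOTA per WINDOW_HOURS.
safe_max_rate() {
  awk -v q="$1" -v h="${2:-5}" -v frac="${3:-0.9}" \
    'BEGIN { printf "%.2f\n", frac * q / (h * 3600) }'
}

for quota in 2400 5000 10000 50000; do
  echo "$quota/5h -> $(safe_max_rate "$quota") req/s"
done
```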
Automation: Self-Tuning Script
Create a CronJob to automatically tune rate limits:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zai-rate-tuner
  namespace: devpod
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure # Job pods must set OnFailure or Never
          containers:
            - name: tuner
              image: appropriate/curl # image must also provide jq and bc
              command:
                - /bin/sh
                - -c
                - |
                  # Query Prometheus for 429 rate (URL-encode the query)
                  PROM_URL="http://prometheus:9090"
                  QUERY='sum(rate(zai_proxy_requests_total{status_code="429"}[6h]))'
                  RATE_429=$(curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1] // "0"')
                  # If 429 rate > 0.01 req/s, decrease limit by 20%
                  if [ "$(echo "$RATE_429 > 0.01" | bc -l)" -eq 1 ]; then
                    echo "High 429 rate detected: $RATE_429"
                    echo "Decreasing rate limit..."
                    # Update deployment env vars
                  fi
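The "Update deployment env vars" step is left open above. One possible shape for it is sketched here: compute a 20% decrease clamped at the configured minimum, then pass the result to `kubectl set env`. `decreased_rate` is a hypothetical helper, the current and minimum values are examples, and the command is only printed so the sketch is safe to run. A tuner pod would also need RBAC permission to patch the deployment.

```shell
# decreased_rate CURRENT_RPS MIN_RPS [FACTOR]
# Multiplies the current rate by FACTOR (default 0.8) but never goes below MIN_RPS.
decreased_rate() {
  awk -v cur="$1" -v min="$2" -v f="${3:-0.8}" 'BEGIN {
    lowered = cur * f
    if (lowered < min) lowered = min
    printf "%.3f\n", lowered
  }'
}

NEW_MAX=$(decreased_rate 0.12 0.05)
# Dry run: print the command instead of executing it
echo "kubectl set env deployment/zai-proxy -n devpod RATE_LIMIT_MAX=$NEW_MAX"
```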
Summary & Recommendation
For production with 2400/5h limit:
- Start conservative: 0.08 req/s (60% of limit)
- Monitor for 24-48h: Check for any 429s
- Gradually increase: +10% every 24h if no errors
- Stop when: 429 rate exceeds 1% OR hit 0.12 req/s
- Set final config: Last known-good rate as MAX
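The ramp-up steps can be laid out ahead of time. The sketch below assumes one +10% step per day and stops at the cap; `ramp_schedule` is a hypothetical helper, and in practice each step only happens if the previous period showed no 429s.

```shell
# ramp_schedule START_RPS CAP_RPS [STEP_PERCENT]
# Prints the rate for each day, growing by STEP_PERCENT until CAP_RPS is reached.
ramp_schedule() {
  awk -v rate="$1" -v cap="$2" -v step="${3:-10}" 'BEGIN {
    day = 0
    while (rate < cap) {
      printf "day %d: %.3f req/s\n", day, rate
      rate *= 1 + step / 100
      day++
    }
    printf "day %d: %.3f req/s (cap)\n", day, cap
  }'
}

ramp_schedule 0.08 0.12
```

Starting at 0.08 req/s, the 0.12 req/s cap is reached on day 5.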
Example final config:
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.08"
  - name: RATE_LIMIT_MIN
    value: "0.05"
  - name: RATE_LIMIT_MAX
    value: "0.11" # Found via testing
  - name: MAX_RETRIES
    value: "5"
This keeps utilization high while minimizing client-visible errors! 🎯