zai-proxy/docs/notes/zai-rate-limit-testing.md

# Z.AI Rate Limit Testing & Validation

## Claimed Rate Limit: 2400 requests / 5 hours

**Effective rate**: 2400 / (5 × 60 × 60) = 2400 / 18000 = **0.133 requests/second**

This is quite restrictive! Let's validate and optimize.

## Testing Strategy

### Phase 1: Validate Actual Limits (Conservative)

Start with **very conservative** settings to avoid account suspension:

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.1"    # Start at 0.1 req/s (360/hour, well below limit)
  - name: RATE_LIMIT_MIN
    value: "0.05"   # Minimum 0.05 req/s (180/hour)
  - name: RATE_LIMIT_MAX
    value: "0.2"    # Max 0.2 req/s (720/hour)
  - name: MAX_RETRIES
    value: "5"      # More retries to handle bursts
```

**Expected behavior:**
- Should run smoothly with no 429s
- If we get 429s at 0.1 req/s, the limit is real
- If no 429s, we can gradually increase

### Phase 2: Find True Limit (Incremental Testing)

Run a **load test** with incrementally increasing rates:

```bash
# Create a test script
cat > /tmp/test-zai-limits.sh << 'EOF'
#!/bin/bash

PROXY_URL="http://zai-proxy.devpod.svc.cluster.local:8080"
TEST_DURATION=300  # 5 minutes per test

for rate in 0.1 0.15 0.2 0.3 0.5 1.0; do
  echo "========================================"
  echo "Testing rate: $rate req/s"
  echo "========================================"

  # Use Apache Bench (ab) or similar
  total_requests=$(echo "$rate * $TEST_DURATION" | bc)
  concurrency=$(echo "$rate * 2" | bc | cut -d. -f1)
  concurrency=${concurrency:-1}

  echo "Total requests: $total_requests"
  echo "Concurrency: $concurrency"

  # Run test
  ab -n $total_requests -c $concurrency \
    -H "Content-Type: application/json" \
    -p /tmp/test-payload.json \
    "$PROXY_URL/v1/messages" | tee /tmp/test-rate-$rate.log

  # Check for 429s
  count_429=$(grep -c "429" /tmp/test-rate-$rate.log || echo 0)

  echo "429 errors: $count_429"

  if [ $count_429 -gt 0 ]; then
    echo "❌ Hit rate limit at $rate req/s"
    echo "True limit is between previous rate and $rate"
    break
  else
    echo "✅ No rate limit at $rate req/s"
  fi

  # Wait between tests
  echo "Waiting 60s before next test..."
  sleep 60
done
EOF

chmod +x /tmp/test-zai-limits.sh
```

### Phase 3: Monitor with Prometheus

While testing, monitor these queries:

```promql
# Current rate limit setting
zai_proxy_rate_limit_requests_per_second

# 429 error rate
sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))

# Success rate
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))

# Rate adjustments
rate(zai_proxy_rate_limit_adjustments_total[5m])

# Retry attempts due to 429
rate(zai_proxy_retry_attempts_total{reason="429"}[5m])
```

## Recommended Configuration

Based on **2400 requests / 5 hours = 0.133 req/s**:

### Conservative (Safe Production)

Stay **well below** the limit with headroom:

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.08"   # 60% of limit (288/hour, 1728/5h)
  - name: RATE_LIMIT_MIN
    value: "0.05"   # 38% of limit (180/hour)
  - name: RATE_LIMIT_MAX
    value: "0.12"   # 90% of limit (432/hour, 2160/5h)
  - name: MAX_RETRIES
    value: "5"
```

**Rationale:**
- **Initial 0.08**: Start at 60% to allow burst traffic
- **Min 0.05**: Safety floor in case of aggressive backoff
- **Max 0.12**: Cap at 90% to avoid hitting limit during bursts
- **Retries 5**: More attempts to handle transient issues

### Aggressive (Maximum Utilization)

Push closer to the limit (risky, may hit 429s):

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.12"   # 90% of limit
  - name: RATE_LIMIT_MIN
    value: "0.08"   # 60% of limit
  - name: RATE_LIMIT_MAX
    value: "0.13"   # 98% of limit (2340/5h)
  - name: MAX_RETRIES
    value: "7"      # More retries needed
```

**Rationale:**
- **Initial 0.12**: Start near limit
- **Min 0.08**: Still have room to decrease
- **Max 0.13**: Push to 98% of claimed limit
- **Retries 7**: Handle bursts with more retries

## Burst Handling

The token bucket allows **bursts up to 2x the rate**:

```
Rate: 0.1 req/s
Burst: 0.2 tokens (allows 2 requests instantly)
```

This means you can handle:
- **Single spike**: 2 requests instantly
- **Sustained spike**: 0.2 req/s for burst duration
- **Then throttled**: Back to 0.1 req/s

Example burst scenario:
```
Time 0s:   Send 2 requests instantly (use burst tokens)
Time 0s:   Bucket now empty, wait for refill
Time 10s:  Bucket has 1 token (0.1 × 10s)
Time 10s:  Send 1 request
Time 20s:  Bucket has 1 token again
```

## Scaling with Multiple Replicas

If you have **N replicas**, each gets the **full rate limit**:

```yaml
# With 3 replicas:
spec:
  replicas: 3

env:
  - name: RATE_LIMIT_MAX
    value: "0.04"   # 0.04 × 3 = 0.12 total
```

**But this assumes:**
- Load balancer distributes evenly
- No single replica hits limit
- Total cluster rate = N × per-replica rate

**Reality:**
- Uneven distribution can cause one replica to hit limit
- Better to set **per-replica rate = total_limit / (N × 1.5)** for safety

Example with 3 replicas:
```
Total limit: 0.133 req/s
Per-replica: 0.133 / (3 × 1.5) = 0.03 req/s
```

## Monitoring Dashboard

Create a Grafana panel to track rate limit efficiency:

```promql
# Requests per second (actual)
sum(rate(zai_proxy_requests_total[5m]))

# Vs. configured limit
zai_proxy_rate_limit_requests_per_second

# Efficiency ratio (want ~0.9)
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
zai_proxy_rate_limit_requests_per_second
```

**Target efficiency: 85-95%**
- Below 85%: You're under-utilizing (can increase rate)
- Above 95%: Risk of hitting limits (decrease rate)

## Testing Scenarios

### Scenario 1: Validate 2400/5h Limit

```bash
# Send exactly 2400 requests over 5 hours
# Rate: 0.133 req/s

# Set rate limit to exactly the claimed limit
kubectl set env deployment/zai-proxy -n devpod \
  RATE_LIMIT_INITIAL=0.133 \
  RATE_LIMIT_MIN=0.133 \
  RATE_LIMIT_MAX=0.133

# Monitor for 5 hours
# If we get 429s, the limit is real
# If no 429s, we can go higher
```

### Scenario 2: Find Burst Capacity

```bash
# Send requests as fast as possible for 10 seconds
# See how many succeed before 429

for i in {1..100}; do
  curl -X POST "http://zai-proxy:8080/v1/messages" \
    -H "Content-Type: application/json" \
    -d '{"model":"glm-4.7","messages":[{"role":"user","content":"test"}]}' &
done
wait

# Check logs for 429s
kubectl logs -n devpod deployment/zai-proxy | grep -c "429"
```

### Scenario 3: Sustained Load Test

```bash
# Use wrk or ab for sustained load
wrk -t4 -c10 -d300s --rate 0.15 \
  -s /tmp/post-script.lua \
  http://zai-proxy:8080/v1/messages

# Monitor:
# - Does rate limiter adapt?
# - Are 429s absorbed by retries?
# - Does success rate stay high?
```

## Expected Results

### If limit is 2400/5h (0.133 req/s):

| Setting | Expected Behavior |
|---------|-------------------|
| Rate = 0.1 | ✅ No 429s, smooth operation |
| Rate = 0.13 | ⚠️ Occasional 429s, absorbed by retries |
| Rate = 0.15 | ❌ Frequent 429s, some reach client |
| Rate = 0.2 | ❌ Constant 429s, rate decreases to min |

### If limit is actually higher:

| True Limit | Safe Max Rate | Notes |
|------------|---------------|-------|
| 5000/5h | 0.25 req/s | Common tier |
| 10000/5h | 0.5 req/s | Pro tier |
| 50000/5h | 2.5 req/s | Enterprise |

## Automation: Self-Tuning Script

Create a CronJob to automatically tune rate limits:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zai-rate-tuner
  namespace: devpod
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: tuner
            image: appropriate/curl
            command:
            - /bin/sh
            - -c
            - |
              # Query Prometheus for 429 rate
              PROM_URL="http://prometheus:9090"
              RATE_429=$(curl -s "$PROM_URL/api/v1/query?query=sum(rate(zai_proxy_requests_total{status_code=\"429\"}[6h]))" | jq -r '.data.result[0].value[1]')

              # If 429 rate > 0.01, decrease limit by 20%
              if (( $(echo "$RATE_429 > 0.01" | bc -l) )); then
                echo "High 429 rate detected: $RATE_429"
                echo "Decreasing rate limit..."
                # Update deployment env vars
              fi
```

## Summary & Recommendation

**For production with 2400/5h limit:**

1. **Start conservative**: 0.08 req/s (60% of limit)
2. **Monitor for 24-48h**: Check for any 429s
3. **Gradually increase**: +10% every 24h if no errors
4. **Stop when**: 429 rate exceeds 1% OR hit 0.12 req/s
5. **Set final config**: Last known-good rate as MAX

**Example final config:**
```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.08"
  - name: RATE_LIMIT_MIN
    value: "0.05"
  - name: RATE_LIMIT_MAX
    value: "0.11"   # Found via testing
  - name: MAX_RETRIES
    value: "5"
```

This ensures **maximum utilization** while **minimizing client-visible errors**! 🎯