zai-proxy/docs/notes/zai-proxy-rate-limiting.md
jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo
Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:53:52 -04:00

# Z.AI Proxy Adaptive Rate Limiting
## Overview
zai-proxy v1.2.0 and later include **adaptive rate limiting**, which automatically adjusts the request rate to minimize 429 (rate limit) errors while maximizing throughput.
## How It Works
### Adaptive Algorithm
The proxy uses a **token bucket rate limiter** with automatic adjustment based on upstream 429 responses:
```
┌─────────────────────────────────────────────────────────────┐
│                    Adaptive Rate Limiter                    │
│                                                             │
│  ┌──────────────┐      ┌──────────────┐                     │
│  │ Token Bucket │ ───> │ Rate: X/s    │                     │
│  │ Limiter      │      │ Burst: 2X    │                     │
│  └──────────────┘      └──────────────┘                     │
│         │                     │                             │
│         │                     ▼                             │
│         │              ┌──────────────┐                     │
│         │              │ Every 30s:   │                     │
│         │              │ Analyze 429s │                     │
│         │              └──────┬───────┘                     │
│         │                     │                             │
│         │      ┌──────────────┴──────────────┐              │
│         │      │                             │              │
│         │      ▼                             ▼              │
│         │  >5% 429s?                     <1% 429s?          │
│         │ Decrease 50%                  Increase 10%        │
│         │      │                             │              │
│         └──────┴─────────────────────────────┘              │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
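As a rough illustration of the token bucket in the diagram, the following is a minimal sketch and not the proxy's actual limiter. The `TokenBucket` type and its methods are hypothetical names, time is passed in as an offset so the example is deterministic, and the type is not safe for concurrent use. The only detail taken from the diagram is that the burst size is twice the rate.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket is a hypothetical minimal limiter: tokens refill at `rate` per
// second up to `burst`, and each request consumes one token.
type TokenBucket struct {
	rate   float64       // tokens added per second
	burst  float64       // maximum tokens held (2 * rate, as in the diagram)
	tokens float64       // tokens currently available
	last   time.Duration // offset of the most recent refill
}

// NewTokenBucket starts with a full bucket.
func NewTokenBucket(rate float64) *TokenBucket {
	return &TokenBucket{rate: rate, burst: 2 * rate, tokens: 2 * rate}
}

// Allow refills the bucket for the time elapsed since the last refill, then
// reports whether a token was available at offset `at`.
func (b *TokenBucket) Allow(at time.Duration) bool {
	if at > b.last {
		b.tokens += (at - b.last).Seconds() * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = at
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// SetRate is what an adaptive controller would call after each evaluation.
func (b *TokenBucket) SetRate(rate float64) {
	b.rate, b.burst = rate, 2*rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
}

func main() {
	b := NewTokenBucket(2) // 2 req/s, burst of 4
	offsets := []time.Duration{0, 0, 0, 0, 0, 250 * time.Millisecond, time.Second}
	for _, at := range offsets {
		fmt.Printf("t=%v allowed=%v\n", at, b.Allow(at))
	}
}
```

A production limiter would block until a token is available rather than return `false`, and would guard its state with a mutex.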
### Rate Adjustment Logic
**Every 30 seconds**, the proxy evaluates recent requests:
1. **Calculate 429 rate**: `count(429) / count(total)`
2. **If 429 rate > 5%**:
   - **Decrease rate by 50%** (aggressive backoff)
   - Example: 20 req/s → 10 req/s
3. **If 429 rate < 1%**:
   - **Increase rate by 10%** (gradual ramp-up)
   - Example: 10 req/s → 11 req/s
4. **Rate bounds**:
   - Never go below `RATE_LIMIT_MIN` (default: 1 req/s)
   - Never exceed `RATE_LIMIT_MAX` (default: 50 req/s)
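The rules above can be sketched as a single pure function. This is an illustration only: `nextRate` and its parameters are hypothetical names. It makes two assumptions that the rules do not state: a 429 rate between 1% and 5% leaves the rate unchanged, and so does a window with no traffic.

```go
package main

import "fmt"

// Thresholds and factors taken from the rules above.
const (
	highThreshold  = 0.05 // above this 429 ratio, back off
	lowThreshold   = 0.01 // below this 429 ratio, ramp up
	decreaseFactor = 0.5  // decrease by 50%
	increaseFactor = 1.1  // increase by 10%
)

// nextRate returns the rate for the next 30s window, given the current rate
// and the request counts observed in the window that just ended.
func nextRate(current float64, total, throttled int, minRate, maxRate float64) float64 {
	if total == 0 {
		return current // assumption: no traffic means no evidence to act on
	}
	ratio := float64(throttled) / float64(total)
	next := current
	switch {
	case ratio > highThreshold:
		next = current * decreaseFactor
	case ratio < lowThreshold:
		next = current * increaseFactor
	}
	// Clamp to the configured bounds (RATE_LIMIT_MIN / RATE_LIMIT_MAX).
	if next < minRate {
		next = minRate
	}
	if next > maxRate {
		next = maxRate
	}
	return next
}

func main() {
	fmt.Println(nextRate(20, 100, 6, 1, 50)) // 6% of requests throttled
	fmt.Println(nextRate(10, 100, 0, 1, 50)) // no requests throttled
	fmt.Println(nextRate(10, 100, 3, 1, 50)) // 3%: between the thresholds
}
```

Because the decrease is multiplicative and much larger than the increase, one bad window undoes roughly seven good ones, which is what makes the backoff "aggressive" and the ramp-up "gradual".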
### Retry Logic
When a 429 error occurs:
1. **Respect the `Retry-After` header** if present
2. **Otherwise back off exponentially**: 1s, 2s, 4s, 8s, ...
3. **Retry up to `MAX_RETRIES` times** (default: 3)
4. **Only then** return the 429 to the client
As a result, the proxy absorbs most transient rate limits before clients see them.
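The delay calculation might look like the sketch below. `retryDelay` is a hypothetical helper, not the proxy's code. It handles only the delta-seconds form of `Retry-After` (the header may also carry an HTTP date, which this sketch ignores), and the cap on the exponent is an added assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// retryDelay returns how long to wait before retry number `attempt`
// (0-based). A Retry-After value in whole seconds takes precedence;
// otherwise the delay doubles from one second: 1s, 2s, 4s, 8s, ...
func retryDelay(attempt int, retryAfter string) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	// Assumption: clamp the exponent so the delay never exceeds 64s.
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 6 {
		attempt = 6
	}
	return time.Second << uint(attempt)
}

func main() {
	const maxRetries = 3 // MAX_RETRIES default
	for attempt := 0; attempt < maxRetries; attempt++ {
		fmt.Printf("retry %d: wait %v\n", attempt+1, retryDelay(attempt, ""))
	}
	fmt.Println("with Retry-After: 7 ->", retryDelay(0, "7"))
}
```

With the default of three retries and no `Retry-After` header, a request that keeps hitting 429 waits 1s + 2s + 4s before the error reaches the client.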
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `RATE_LIMIT_INITIAL` | `10` | Starting rate in requests/second |
| `RATE_LIMIT_MIN` | `1` | Minimum rate (never go below) |
| `RATE_LIMIT_MAX` | `50` | Maximum rate (never exceed) |
| `MAX_RETRIES` | `3` | Number of retry attempts on 429/errors |
| `MAX_WORKERS` | `20` | Max concurrent requests (separate from rate) |
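For illustration, the variables in the table could be read as in the sketch below. The `Config` struct, `LoadConfig`, and the validation rule are hypothetical. The defaults come from the table, and rates are parsed as floats on the assumption that fractional values such as `0.8` are accepted.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config mirrors the environment variables in the table above.
type Config struct {
	RateInitial, RateMin, RateMax float64
	MaxRetries, MaxWorkers        int
}

// LoadConfig starts from the documented defaults and overrides them with any
// values found through lookup (os.LookupEnv in production).
func LoadConfig(lookup func(string) (string, bool)) (Config, error) {
	cfg := Config{RateInitial: 10, RateMin: 1, RateMax: 50, MaxRetries: 3, MaxWorkers: 20}
	floats := map[string]*float64{
		"RATE_LIMIT_INITIAL": &cfg.RateInitial,
		"RATE_LIMIT_MIN":     &cfg.RateMin,
		"RATE_LIMIT_MAX":     &cfg.RateMax,
	}
	for key, dst := range floats {
		if v, ok := lookup(key); ok {
			f, err := strconv.ParseFloat(v, 64)
			if err != nil {
				return cfg, fmt.Errorf("%s: %w", key, err)
			}
			*dst = f
		}
	}
	ints := map[string]*int{"MAX_RETRIES": &cfg.MaxRetries, "MAX_WORKERS": &cfg.MaxWorkers}
	for key, dst := range ints {
		if v, ok := lookup(key); ok {
			n, err := strconv.Atoi(v)
			if err != nil {
				return cfg, fmt.Errorf("%s: %w", key, err)
			}
			*dst = n
		}
	}
	// Assumed sanity check: 0 < min <= initial <= max.
	if cfg.RateMin <= 0 || cfg.RateMin > cfg.RateInitial || cfg.RateInitial > cfg.RateMax {
		return cfg, fmt.Errorf("want 0 < min <= initial <= max, got %v, %v, %v",
			cfg.RateMin, cfg.RateInitial, cfg.RateMax)
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the loader testable without touching the process environment.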
### Example: Conservative Rate Limiting
```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "5"   # Start slow
  - name: RATE_LIMIT_MIN
    value: "1"
  - name: RATE_LIMIT_MAX
    value: "20"  # Cap at 20 req/s
  - name: MAX_RETRIES
    value: "5"   # More retries
```
### Example: Aggressive Rate Limiting
```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "30"   # Start fast
  - name: RATE_LIMIT_MIN
    value: "5"
  - name: RATE_LIMIT_MAX
    value: "100"  # Cap at 100 req/s
  - name: MAX_RETRIES
    value: "2"    # Fewer retries
```
## New Metrics
### Rate Limiting Metrics
1. **`zai_proxy_rate_limit_requests_per_second`** (Gauge)
   - Current rate limit setting
   - Tracks how the rate adjusts over time
2. **`zai_proxy_rate_limit_wait_seconds`** (Histogram)
   - Time requests spend waiting for rate limiter
   - Buckets: 1ms to 10s
3. **`zai_proxy_rate_limit_adjustments_total`** (Counter)
   - Count of rate adjustments by direction
   - Labels: `direction=increase` or `direction=decrease`
4. **`zai_proxy_rate_limit_rejections_total`** (Counter)
   - Requests rejected due to rate limiting
   - Should be rare if `MAX_WORKERS` is set correctly
5. **`zai_proxy_retry_attempts_total`** (Counter)
   - Retry attempts by reason
   - Labels: `reason=429` or `reason=network_error`
### Useful Queries
**Current rate limit:**
```promql
zai_proxy_rate_limit_requests_per_second
```
**429 error rate:**
```promql
sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
```
**Success rate (2xx):**
```promql
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
```
**Rate adjustments over time:**
```promql
rate(zai_proxy_rate_limit_adjustments_total[5m])
```
**p90 wait time for rate limiter:**
```promql
histogram_quantile(0.90,
sum(rate(zai_proxy_rate_limit_wait_seconds_bucket[5m])) by (le)
)
```
**Retry rate:**
```promql
sum(rate(zai_proxy_retry_attempts_total[5m])) by (reason)
```
## Monitoring & Tuning
### Grafana Dashboard
The updated Grafana dashboard includes:
- **Rate limit gauge** - Current req/s limit
- **429 error rate panel** - Track rate limit errors
- **Rate adjustment history** - See when/how rate changes
- **Retry attempts** - Monitor retry frequency
### Tuning Strategy
#### 1. Monitor for 24-48 hours
Watch these metrics:
- `zai_proxy_rate_limit_requests_per_second` - Does it stabilize?
- `zai_proxy_requests_total{status_code="429"}` - Are 429s still occurring?
- `zai_proxy_retry_attempts_total{reason="429"}` - How many retries?
#### 2. Adjust based on patterns
**If you see:**
- **Frequent rate decreases** → Lower `RATE_LIMIT_INITIAL` or `RATE_LIMIT_MAX`
- **Rate stuck at minimum** → Even the minimum rate is drawing 429s; your configured rates are too aggressive for the upstream limit
- **429s still reaching clients** → Increase `MAX_RETRIES`
- **Long wait times** → Rate limit is too low for your traffic
#### 3. Optimize for your subscription
**Example: Z.AI subscription allows 60 req/min = 1 req/s**
```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.8"  # Start below limit
  - name: RATE_LIMIT_MIN
    value: "0.5"  # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "1.0"  # Exactly at limit
  - name: MAX_RETRIES
    value: "5"    # More retries
```
**Example: Z.AI subscription allows 1000 req/min = 16.7 req/s**
```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "15"  # Start below limit
  - name: RATE_LIMIT_MIN
    value: "8"   # Roughly 50% of limit
  - name: RATE_LIMIT_MAX
    value: "16"  # Slightly below limit for safety
  - name: MAX_RETRIES
    value: "3"
```
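The arithmetic behind these examples is a per-minute quota divided by 60, then scaled. The sketch below uses a hypothetical helper, `fromPerMinute`, with assumed fractions (start at 80%, floor at 50%, cap at 100%). These match the 60 req/min example exactly; the 1000 req/min example above rounds by hand and picks a higher starting point, so treat the output as a starting suggestion only.

```go
package main

import "fmt"

// fromPerMinute converts a per-minute quota into per-second settings.
// Assumed fractions: initial = 80%, minimum = 50%, maximum = 100% of the limit.
func fromPerMinute(reqPerMin float64) (initial, minRate, maxRate float64) {
	perSec := reqPerMin / 60
	return 0.8 * perSec, 0.5 * perSec, perSec
}

func main() {
	for _, quota := range []float64{60, 1000} {
		initial, minRate, maxRate := fromPerMinute(quota)
		fmt.Printf("%.0f req/min: RATE_LIMIT_INITIAL=%.2f RATE_LIMIT_MIN=%.2f RATE_LIMIT_MAX=%.2f\n",
			quota, initial, minRate, maxRate)
	}
}
```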
## Architecture
### Request Flow with Rate Limiting
```
  Client Request
         │
         ▼
┌─────────────────┐
│ Worker Capacity │ ← MAX_WORKERS check
│ Check           │   (20 concurrent)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Rate Limiter    │ ← Token bucket wait
│ (Adaptive)      │   (X req/s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Upstream        │
│ Request         │
└────────┬────────┘
         │
         ├─ 200 OK ──────────────────> Success
         │                             Record success
         │                             (maybe increase rate)
         │
         ├─ 429 Rate Limit ──┐
         │                   │
         │                   ▼
         │            ┌──────────────┐
         │            │ Wait + Retry │ ← MAX_RETRIES
         │            └──────┬───────┘   (3 attempts)
         │                   │
         │                   ├─ Success ──> Return to client
         │                   │              Record 429
         │                   │              (decrease rate)
         │                   │
         │                   └─ Still 429 ─> Return 429 to client
         │
         └─ Network Error ──┐
                            │
                            ▼
                     ┌──────────────┐
                     │ Retry Logic  │ ← Exponential backoff
                     └──────────────┘   1s, 2s, 4s, ...
```
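The flow above can be approximated in a few lines. This is a sketch under stated assumptions, not the proxy's handler: `WorkerPool`, `handle`, and `statusBusy` are hypothetical names. It assumes a full pool rejects the request immediately, that the rejection uses status 503, and that `MAX_RETRIES` counts retries after the first attempt. The backoff wait is omitted.

```go
package main

import "fmt"

// statusBusy is an assumed status for requests rejected at the capacity check.
const statusBusy = 503

// WorkerPool is a counting semaphore built on a buffered channel, with one
// slot per allowed in-flight request (MAX_WORKERS).
type WorkerPool struct{ slots chan struct{} }

func NewWorkerPool(maxWorkers int) *WorkerPool {
	return &WorkerPool{slots: make(chan struct{}, maxWorkers)}
}

// TryAcquire claims a slot without blocking; false means the pool is full.
func (p *WorkerPool) TryAcquire() bool {
	select {
	case p.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot; call it once per successful TryAcquire.
func (p *WorkerPool) Release() { <-p.slots }

// handle walks one request through the flow: capacity check, rate limiter
// wait, upstream call, and retries on 429. It returns the final status code.
func handle(pool *WorkerPool, limiterWait func(), maxRetries int, upstream func() int) int {
	if !pool.TryAcquire() {
		return statusBusy
	}
	defer pool.Release()

	limiterWait()
	status := upstream()
	for retry := 0; status == 429 && retry < maxRetries; retry++ {
		// A real proxy would sleep here (Retry-After or backoff).
		limiterWait()
		status = upstream()
	}
	return status
}

func main() {
	pool := NewWorkerPool(20)
	responses := []int{429, 429, 200} // two throttled attempts, then success
	i := 0
	upstream := func() int {
		s := responses[i]
		i++
		return s
	}
	fmt.Println(handle(pool, func() {}, 3, upstream))
}
```

Keeping the concurrency gate separate from the rate limiter matches the table's note that `MAX_WORKERS` is independent of the rate: one bounds requests in flight, the other bounds requests started per second.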
## Benefits
### 1. **Automatic Optimization**
- Little manual tuning required once the bounds are set
- Adapts to changing API limits
- Finds optimal rate automatically
### 2. **Resilience**
- Handles transient 429s via retries
- Backs off aggressively when needed
- Gradually increases when safe
### 3. **Cost Efficiency**
- Maximizes subscription utilization
- Reduces wasted requests
- Minimizes client-visible errors
### 4. **Observability**
- Rich metrics for monitoring
- Clear visibility into rate adjustments
- Easy to debug rate limit issues
## Troubleshooting
### Problem: Rate keeps decreasing
**Symptoms:**
- `zai_proxy_rate_limit_requests_per_second` trending down
- Frequent `decrease` adjustments
**Solutions:**
1. Lower `RATE_LIMIT_INITIAL` - Starting too high
2. Check upstream API health - May be globally rate-limited
3. Verify subscription limits - May have exceeded quota
### Problem: Still seeing 429 errors
**Symptoms:**
- `zai_proxy_requests_total{status_code="429"}` keeps increasing, meaning 429s reach clients even after retries
**Solutions:**
1. Increase `MAX_RETRIES` - More retry attempts
2. Lower `RATE_LIMIT_MAX` - Too aggressive ceiling
3. Check burst traffic - May need multiple proxy replicas
### Problem: High latency
**Symptoms:**
- `zai_proxy_rate_limit_wait_seconds` p90 > 1s
**Solutions:**
1. Increase `RATE_LIMIT_MAX` - Currently bottlenecked
2. Add more replicas - Distribute load
3. Check if rate is stuck at minimum - Adjust `RATE_LIMIT_MIN`
### Problem: Rate not increasing
**Symptoms:**
- `zai_proxy_rate_limit_requests_per_second` stuck below max
- No `increase` adjustments
**Solutions:**
1. Wait longer - Increases are gradual (10% every 30s)
2. Check 429 rate - Must be <1% to increase
3. Verify traffic volume - Need sustained load to trigger increases
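To put "gradual" in numbers, the sketch below counts how many 30-second windows a 10% ramp needs to climb between two rates. `intervalsToReach` is a hypothetical helper, and it assumes that every window qualifies for an increase (429 rate below 1%, sustained traffic).

```go
package main

import (
	"fmt"
	"time"
)

// intervalsToReach counts the 10% increases needed to climb from `from` to at
// least `to`. It returns -1 when `from` is not positive, since the ramp would
// never get anywhere.
func intervalsToReach(from, to float64) int {
	if from <= 0 {
		return -1
	}
	n := 0
	for rate := from; rate < to; rate *= 1.1 {
		n++
	}
	return n
}

func main() {
	const window = 30 * time.Second // evaluation interval
	for _, start := range []float64{10, 1} {
		n := intervalsToReach(start, 50)
		fmt.Printf("%v -> 50 req/s: %d windows (%v)\n", start, n, time.Duration(n)*window)
	}
}
```

From the default initial rate of 10 req/s, reaching the default maximum of 50 req/s takes 17 clean windows, or 8.5 minutes. After a backoff to the 1 req/s minimum it takes 42 windows, or 21 minutes.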
## Advanced: Per-Client Rate Limiting
For future enhancement, you could add per-client rate limiting:
```go
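// Example: Rate limit per API key
import "sync"

type ClientRateLimiter struct {
	limiters map[string]*AdaptiveRateLimiter // must be initialized with make()
	mu       sync.RWMutex
}

// GetLimiter returns the limiter for clientID, creating it on first use.
func (crl *ClientRateLimiter) GetLimiter(clientID string) *AdaptiveRateLimiter {
	// Fast path: a read lock suffices when the limiter already exists.
	crl.mu.RLock()
	limiter, exists := crl.limiters[clientID]
	crl.mu.RUnlock()
	if exists {
		return limiter
	}

	crl.mu.Lock()
	defer crl.mu.Unlock()
	// Re-check under the write lock: another goroutine may have created the
	// limiter between RUnlock and Lock.
	if limiter, exists := crl.limiters[clientID]; exists {
		return limiter
	}
	limiter = NewAdaptiveRateLimiter(10, 1, 50)
	crl.limiters[clientID] = limiter
	return limiter
}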
```
This would allow different rate limits per client/tenant.
## References
- [Token Bucket Algorithm](https://en.wikipedia.org/wiki/Token_bucket)
- [HTTP 429 Status Code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429)
- [Retry-After Header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After)
- [Exponential Backoff](https://en.wikipedia.org/wiki/Exponential_backoff)