
# Z.AI Proxy Adaptive Rate Limiting

## Overview

zai-proxy v1.2.0 and later includes **adaptive rate limiting**, which automatically adjusts the request rate to minimize 429 (rate limit) errors while maximizing throughput.

## How It Works

### Adaptive Algorithm

The proxy uses a **token bucket rate limiter** with automatic adjustment based on upstream 429 responses:

```
┌─────────────────────────────────────────────────────────────┐
│                  Adaptive Rate Limiter                       │
│                                                               │
│  ┌──────────────┐      ┌──────────────┐                     │
│  │ Token Bucket │ ───> │  Rate: X/s   │                     │
│  │   Limiter    │      │ Burst: 2X    │                     │
│  └──────────────┘      └──────────────┘                     │
│         │                      │                             │
│         │                      ▼                             │
│         │              ┌──────────────┐                     │
│         │              │ Every 30s:   │                     │
│         │              │ Analyze 429s │                     │
│         │              └──────┬───────┘                     │
│         │                     │                             │
│         │      ┌──────────────┴──────────────┐              │
│         │      │                             │              │
│         │      ▼                             ▼              │
│         │  >5% 429s?                    <1% 429s?          │
│         │  Decrease 50%                 Increase 10%        │
│         │      │                             │              │
│         └──────┴─────────────────────────────┘              │
│                                                               │
└─────────────────────────────────────────────────────────────┘
```
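
As a rough illustration of the token bucket box in the diagram, the sketch below refills tokens at `rate` per second up to a burst of `2 * rate` and spends one token per request. The `tokenBucket` type and its methods are hypothetical names chosen for this example, not the proxy's actual implementation; the caller is assumed to pass in the time elapsed since the last refill.

```go
package main

import "time"

// tokenBucket is a hypothetical, minimal token bucket (not the proxy's type).
type tokenBucket struct {
    rate   float64 // tokens added per second
    burst  float64 // maximum tokens the bucket can hold
    tokens float64 // tokens currently available
}

// newTokenBucket starts with a full bucket and burst = 2 * rate,
// mirroring the "Rate: X/s, Burst: 2X" box in the diagram.
func newTokenBucket(rate float64) *tokenBucket {
    return &tokenBucket{rate: rate, burst: 2 * rate, tokens: 2 * rate}
}

// refill adds the tokens earned over `elapsed`, capped at the burst size.
func (b *tokenBucket) refill(elapsed time.Duration) {
    b.tokens += elapsed.Seconds() * b.rate
    if b.tokens > b.burst {
        b.tokens = b.burst
    }
}

// take spends one token if available and reports whether the request may proceed.
func (b *tokenBucket) take() bool {
    if b.tokens >= 1 {
        b.tokens--
        return true
    }
    return false
}
```

With `rate = 10`, a full bucket admits 20 back-to-back requests, then roughly 10 per second after that. A production limiter would more likely block until a token is available than return `false`.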

### Rate Adjustment Logic

**Every 30 seconds**, the proxy evaluates recent requests:

1. **Calculate 429 rate**: `count(429) / count(total)`

2. **If 429 rate > 5%**:
   - **Decrease rate by 50%** (aggressive backoff)
   - Example: 20 req/s → 10 req/s

3. **If 429 rate < 1%**:
   - **Increase rate by 10%** (gradual ramp-up)
   - Example: 10 req/s → 11 req/s

4. **Rate bounds**:
   - Never go below `RATE_LIMIT_MIN` (default: 1 req/s)
   - Never exceed `RATE_LIMIT_MAX` (default: 50 req/s)
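
The adjustment rules above can be sketched as a single pure function. `nextRate` is a hypothetical helper written for this note. The rules do not say what happens when the 429 rate falls between 1% and 5%, so this sketch assumes the rate is left unchanged in that band.

```go
package main

// nextRate applies one 30-second evaluation step.
// ratio429 is count(429) / count(total) for the window, in the range 0..1.
func nextRate(current, ratio429, minRate, maxRate float64) float64 {
    next := current
    switch {
    case ratio429 > 0.05:
        next = current * 0.5 // aggressive backoff
    case ratio429 < 0.01:
        next = current * 1.1 // gradual ramp-up
    }
    // Assumption: between 1% and 5% the rate stays where it is.

    // Clamp to RATE_LIMIT_MIN / RATE_LIMIT_MAX.
    if next < minRate {
        next = minRate
    }
    if next > maxRate {
        next = maxRate
    }
    return next
}
```

For example, `nextRate(20, 0.06, 1, 50)` halves the rate to 10, while `nextRate(48, 0, 1, 50)` is capped at 50 rather than rising to 52.8.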

### Retry Logic

When a 429 error occurs:

1. **Respect Retry-After header** if present
2. **Exponential backoff**: 1s, 2s, 4s, 8s, ...
3. **Retry up to MAX_RETRIES times** (default: 3)
4. **Only then** return 429 to client

This means most transient rate limits are absorbed by the proxy.
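
A minimal sketch of how the wait before each retry might be chosen. `retryDelay` is a hypothetical helper; it assumes that a `Retry-After` value takes precedence over the exponential schedule and only handles the delta-seconds form of the header (the HTTP-date form falls through to backoff).

```go
package main

import (
    "strconv"
    "time"
)

// retryDelay returns the wait before retry number `attempt` (0-based).
// retryAfter is the raw Retry-After header value, or "" if absent.
func retryDelay(attempt int, retryAfter string) time.Duration {
    // Assumption: a parseable Retry-After (whole seconds) wins over backoff.
    if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
        return time.Duration(secs) * time.Second
    }
    // Exponential backoff: 1s, 2s, 4s, 8s, ...
    return time.Second << uint(attempt)
}
```

With the default `MAX_RETRIES` of 3 and no `Retry-After` header, the waits would be 1s, 2s, and 4s, so a request can spend about 7 seconds in backoff before the 429 is returned to the client.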

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `RATE_LIMIT_INITIAL` | `10` | Starting rate in requests/second |
| `RATE_LIMIT_MIN` | `1` | Minimum rate (never go below) |
| `RATE_LIMIT_MAX` | `50` | Maximum rate (never exceed) |
| `MAX_RETRIES` | `3` | Number of retry attempts on 429/errors |
| `MAX_WORKERS` | `20` | Max concurrent requests (separate from rate) |
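
One way these variables could be read at startup, falling back to the documented defaults when a value is unset or unparseable. The `rateConfig` type and helper names are hypothetical. Rates are parsed as floats on the assumption that fractional values such as `0.8` are accepted.

```go
package main

import (
    "os"
    "strconv"
)

// rateConfig is a hypothetical container for the settings in the table above.
type rateConfig struct {
    Initial, Min, Max float64 // requests/second
    MaxRetries        int
    MaxWorkers        int
}

// floatOr parses raw as a float, returning def if it is empty or invalid.
func floatOr(raw string, def float64) float64 {
    if v, err := strconv.ParseFloat(raw, 64); err == nil {
        return v
    }
    return def
}

// intOr parses raw as an integer, returning def if it is empty or invalid.
func intOr(raw string, def int) int {
    if v, err := strconv.Atoi(raw); err == nil {
        return v
    }
    return def
}

// loadRateConfig reads settings through lookup so tests can supply a fake environment.
func loadRateConfig(lookup func(string) string) rateConfig {
    return rateConfig{
        Initial:    floatOr(lookup("RATE_LIMIT_INITIAL"), 10),
        Min:        floatOr(lookup("RATE_LIMIT_MIN"), 1),
        Max:        floatOr(lookup("RATE_LIMIT_MAX"), 50),
        MaxRetries: intOr(lookup("MAX_RETRIES"), 3),
        MaxWorkers: intOr(lookup("MAX_WORKERS"), 20),
    }
}

// loadRateConfigFromEnv reads the real process environment.
func loadRateConfigFromEnv() rateConfig {
    return loadRateConfig(os.Getenv)
}
```

Silently falling back to a default is a design choice made for brevity here; a real service may prefer to fail fast on a malformed value.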

### Example: Conservative Rate Limiting

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "5"    # Start slow
  - name: RATE_LIMIT_MIN
    value: "1"
  - name: RATE_LIMIT_MAX
    value: "20"   # Cap at 20 req/s
  - name: MAX_RETRIES
    value: "5"    # More retries
```

### Example: Aggressive Rate Limiting

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "30"   # Start fast
  - name: RATE_LIMIT_MIN
    value: "5"
  - name: RATE_LIMIT_MAX
    value: "100"  # Cap at 100 req/s
  - name: MAX_RETRIES
    value: "2"    # Fewer retries
```

## New Metrics

### Rate Limiting Metrics

1. **`zai_proxy_rate_limit_requests_per_second`** (Gauge)
   - Current rate limit setting
   - Tracks how the rate adjusts over time

2. **`zai_proxy_rate_limit_wait_seconds`** (Histogram)
   - Time requests spend waiting for rate limiter
   - Buckets: 1ms to 10s

3. **`zai_proxy_rate_limit_adjustments_total`** (Counter)
   - Count of rate adjustments by direction
   - Labels: `direction=increase` or `direction=decrease`

4. **`zai_proxy_rate_limit_rejections_total`** (Counter)
   - Requests rejected due to rate limiting
   - Should be rare if MAX_WORKERS is set correctly

5. **`zai_proxy_retry_attempts_total`** (Counter)
   - Retry attempts by reason
   - Labels: `reason=429` or `reason=network_error`

### Useful Queries

**Current rate limit:**
```promql
zai_proxy_rate_limit_requests_per_second
```

**429 error rate:**
```promql
sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
```

**Success rate (2xx):**
```promql
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
```

**Rate adjustments over time:**
```promql
rate(zai_proxy_rate_limit_adjustments_total[5m])
```

**p90 wait time for rate limiter:**
```promql
histogram_quantile(0.90,
  sum(rate(zai_proxy_rate_limit_wait_seconds_bucket[5m])) by (le)
)
```

**Retry rate:**
```promql
sum(rate(zai_proxy_retry_attempts_total[5m])) by (reason)
```

## Monitoring & Tuning

### Grafana Dashboard

The updated Grafana dashboard includes:
- **Rate limit gauge** - Current req/s limit
- **429 error rate panel** - Track rate limit errors
- **Rate adjustment history** - See when/how rate changes
- **Retry attempts** - Monitor retry frequency

### Tuning Strategy

#### 1. Monitor for 24-48 hours

Watch these metrics:
- `zai_proxy_rate_limit_requests_per_second` - Does it stabilize?
- `zai_proxy_requests_total{status_code="429"}` - Are 429s still occurring?
- `zai_proxy_retry_attempts_total{reason="429"}` - How many retries?

#### 2. Adjust based on patterns

**If you see:**
- **Frequent rate decreases** → Lower `RATE_LIMIT_INITIAL` or `RATE_LIMIT_MAX`
- **Rate stuck at minimum** → 429s persist even at `RATE_LIMIT_MIN`; lower `RATE_LIMIT_MIN` or check your upstream quota
- **429s still reaching clients** → Increase `MAX_RETRIES`
- **Long wait times** → Rate limit is too low for your traffic

#### 3. Optimize for your subscription

**Example: Z.AI subscription allows 60 req/min = 1 req/s**

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.8"   # Start below limit
  - name: RATE_LIMIT_MIN
    value: "0.5"   # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "1.0"   # Exactly at limit
  - name: MAX_RETRIES
    value: "5"     # More retries
```

**Example: Z.AI subscription allows 1000 req/min = 16.7 req/s**

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "15"    # Start below limit
  - name: RATE_LIMIT_MIN
    value: "8"     # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "16"    # Slightly below limit for safety
  - name: MAX_RETRIES
    value: "3"
```
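
The arithmetic behind these examples can be sketched as a small helper. `suggestLimits` is hypothetical, and its 80% and 50% factors are an assumption taken from the 60 req/min example rather than a rule the proxy enforces.

```go
package main

// suggestedLimits holds candidate RATE_LIMIT_* values in requests/second.
type suggestedLimits struct {
    Initial, Min, Max float64
}

// suggestLimits converts a per-minute quota into per-second settings.
func suggestLimits(perMinute float64) suggestedLimits {
    perSecond := perMinute / 60 // e.g. 60 req/min -> 1 req/s
    return suggestedLimits{
        Initial: 0.8 * perSecond, // assumption: start at 80% of the limit
        Min:     0.5 * perSecond, // 50% of the limit
        Max:     perSecond,       // exactly at the limit
    }
}
```

For 1000 req/min this yields roughly 13.3, 8.3, and 16.7 req/s. The second example above rounds the ceiling down to 16 for safety and starts closer to it, at 15.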

## Architecture

### Request Flow with Rate Limiting

```
Client Request
    │
    ▼
┌─────────────────┐
│ Worker Capacity │ ← MAX_WORKERS check
│ Check           │   (20 concurrent)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Rate Limiter    │ ← Token bucket wait
│ (Adaptive)      │   (X req/s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Upstream        │
│ Request         │
└────────┬────────┘
         │
         ├─ 200 OK ──────────────────> Success
         │                              Record success
         │                              (maybe increase rate)
         │
         ├─ 429 Rate Limit ──┐
         │                    │
         │                    ▼
         │              ┌──────────────┐
         │              │ Wait + Retry │ ← MAX_RETRIES
         │              └──────┬───────┘   (3 attempts)
         │                     │
         │                     ├─ Success ──> Return to client
         │                     │               Record 429
         │                     │               (decrease rate)
         │                     │
         │                     └─ Still 429 ─> Return 429 to client
         │
         └─ Network Error ──┐
                            │
                            ▼
                      ┌──────────────┐
                      │ Retry Logic  │ ← Exponential backoff
                      └──────────────┘   1s, 2s, 4s, ...
```
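
The 429 path of the flow above might look roughly like the following. All names here (`workerPool`, `handle`, and the injected `waitForToken`, `send`, and `sleep` functions) are hypothetical stand-ins; the network-error branch and metric recording are omitted, and how the proxy responds when it is at worker capacity is not specified in this note, so the sketch only reports that the request was not accepted.

```go
package main

// workerPool is a hypothetical stand-in for the MAX_WORKERS check,
// implemented as a counting semaphore on a buffered channel.
type workerPool struct{ slots chan struct{} }

func newWorkerPool(maxWorkers int) *workerPool {
    return &workerPool{slots: make(chan struct{}, maxWorkers)}
}

// tryAcquire claims a slot without blocking; false means at capacity.
func (p *workerPool) tryAcquire() bool {
    select {
    case p.slots <- struct{}{}:
        return true
    default:
        return false
    }
}

// release frees a slot claimed by tryAcquire.
func (p *workerPool) release() { <-p.slots }

// handle walks one request through the stages in the diagram.
// send performs the upstream call and returns its HTTP status code.
// accepted is false when the worker capacity check rejects the request.
func handle(pool *workerPool, waitForToken func(), send func() int,
    maxRetries int, sleep func(attempt int)) (status int, accepted bool) {
    if !pool.tryAcquire() {
        return 0, false
    }
    defer pool.release()

    waitForToken() // token bucket wait
    status = send()
    // Retry on 429 up to maxRetries times before giving up.
    for attempt := 0; status == 429 && attempt < maxRetries; attempt++ {
        sleep(attempt)
        status = send()
    }
    return status, true
}
```

Passing the limiter wait, upstream call, and sleep in as functions keeps the sketch independent of any particular HTTP client and makes the retry count easy to test.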

## Benefits

### 1. **Automatic Optimization**
- Little manual tuning required once the bounds are set
- Adapts to changing API limits
- Finds optimal rate automatically

### 2. **Resilience**
- Handles transient 429s via retries
- Backs off aggressively when needed
- Gradually increases when safe

### 3. **Cost Efficiency**
- Maximizes subscription utilization
- Reduces wasted requests
- Minimizes client-visible errors

### 4. **Observability**
- Rich metrics for monitoring
- Clear visibility into rate adjustments
- Easy to debug rate limit issues

## Troubleshooting

### Problem: Rate keeps decreasing

**Symptoms:**
- `zai_proxy_rate_limit_requests_per_second` trending down
- Frequent `decrease` adjustments

**Solutions:**
1. Lower `RATE_LIMIT_INITIAL` - Starting too high
2. Check upstream API health - May be globally rate-limited
3. Verify subscription limits - May have exceeded quota

### Problem: Still seeing 429 errors

**Symptoms:**
- `zai_proxy_requests_total{status_code="429"}` keeps increasing even after retries

**Solutions:**
1. Increase `MAX_RETRIES` - More retry attempts
2. Lower `RATE_LIMIT_MAX` - Too aggressive ceiling
3. Check burst traffic - May need multiple proxy replicas

### Problem: High latency

**Symptoms:**
- `zai_proxy_rate_limit_wait_seconds` p90 > 1s

**Solutions:**
1. Increase `RATE_LIMIT_MAX` - Currently bottlenecked
2. Add more replicas - Distribute load
3. Check if rate stuck at minimum - Adjust RATE_LIMIT_MIN

### Problem: Rate not increasing

**Symptoms:**
- `zai_proxy_rate_limit_requests_per_second` stuck below max
- No `increase` adjustments

**Solutions:**
1. Wait longer - Increases are gradual (10% every 30s)
2. Check 429 rate - Must be <1% to increase
3. Verify traffic volume - Need sustained load to trigger increases
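
To get a feel for how long "gradual" is, the sketch below counts how many 10% steps are needed to climb from one rate to another. `rampTime` is a hypothetical helper; it assumes every 30-second window qualifies for an increase and ignores the `RATE_LIMIT_MAX` clamp.

```go
package main

import "time"

// rampTime estimates how many +10% adjustments, and how much wall-clock time,
// it takes to raise the rate from `from` to at least `to`.
func rampTime(from, to float64) (steps int, total time.Duration) {
    if from <= 0 {
        return 0, 0 // a non-positive rate never grows by multiplication
    }
    for rate := from; rate < to; rate *= 1.1 {
        steps++
    }
    return steps, time.Duration(steps) * 30 * time.Second
}
```

Recovering from 10 req/s to 50 req/s takes 17 increases, or about 8.5 minutes of sustained traffic with a 429 rate below 1%. A single 50% decrease takes 8 increases (4 minutes) to undo.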

## Advanced: Per-Client Rate Limiting

For future enhancement, you could add per-client rate limiting:

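```go
// Example: one adaptive limiter per API key.
// AdaptiveRateLimiter and NewAdaptiveRateLimiter are assumed to be the
// proxy's existing limiter type and constructor.
import "sync"

type ClientRateLimiter struct {
    mu       sync.RWMutex
    limiters map[string]*AdaptiveRateLimiter
}

// NewClientRateLimiter initializes the map so GetLimiter never writes to a nil map.
func NewClientRateLimiter() *ClientRateLimiter {
    return &ClientRateLimiter{limiters: make(map[string]*AdaptiveRateLimiter)}
}

func (crl *ClientRateLimiter) GetLimiter(clientID string) *AdaptiveRateLimiter {
    // Fast path: most calls find an existing limiter under the read lock.
    crl.mu.RLock()
    limiter, exists := crl.limiters[clientID]
    crl.mu.RUnlock()

    if exists {
        return limiter
    }

    crl.mu.Lock()
    defer crl.mu.Unlock()

    // Re-check under the write lock: another goroutine may have created the
    // limiter between RUnlock and Lock, and overwriting it would discard its state.
    if limiter, exists := crl.limiters[clientID]; exists {
        return limiter
    }

    // Arguments are presumably initial, min, and max req/s (the defaults above).
    limiter = NewAdaptiveRateLimiter(10, 1, 50)
    crl.limiters[clientID] = limiter
    return limiter
}
```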

This would allow different rate limits per client/tenant.

## References

- [Token Bucket Algorithm](https://en.wikipedia.org/wiki/Token_bucket)
- [HTTP 429 Status Code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429)
- [Retry-After Header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After)
- [Exponential Backoff](https://en.wikipedia.org/wiki/Exponential_backoff)