# Z.AI Proxy Adaptive Rate Limiting

## Overview

The zai-proxy v1.2.0+ includes **adaptive rate limiting** that automatically adjusts request rates to minimize 429 (rate limit) errors while maximizing throughput.

## How It Works

### Adaptive Algorithm

The proxy uses a **token bucket rate limiter** with automatic adjustment based on upstream 429 responses:

```
┌─────────────────────────────────────────────────────────────┐
│                    Adaptive Rate Limiter                    │
│                                                             │
│  ┌──────────────┐      ┌──────────────┐                     │
│  │ Token Bucket │ ───> │ Rate: X/s    │                     │
│  │   Limiter    │      │ Burst: 2X    │                     │
│  └──────────────┘      └──────────────┘                     │
│         │                     │                             │
│         │                     ▼                             │
│         │              ┌──────────────┐                     │
│         │              │ Every 30s:   │                     │
│         │              │ Analyze 429s │                     │
│         │              └──────┬───────┘                     │
│         │                     │                             │
│         │      ┌──────────────┴──────────────┐              │
│         │      │                             │              │
│         │      ▼                             ▼              │
│         │  >5% 429s?                     <1% 429s?          │
│         │ Decrease 50%                  Increase 10%        │
│         │      │                             │              │
│         └──────┴─────────────────────────────┘              │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Rate Adjustment Logic

**Every 30 seconds**, the proxy evaluates recent requests:

1. **Calculate 429 rate**: `count(429) / count(total)`

2. **If 429 rate > 5%**:
   - **Decrease rate by 50%** (aggressive backoff)
   - Example: 20 req/s → 10 req/s

3. **If 429 rate < 1%**:
   - **Increase rate by 10%** (gradual ramp-up)
   - Example: 10 req/s → 11 req/s

4. **Rate bounds**:
   - Never go below `RATE_LIMIT_MIN` (default: 1 req/s)
   - Never exceed `RATE_LIMIT_MAX` (default: 50 req/s)

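The four steps above can be expressed as a single function. The sketch below is illustrative only: `nextRate` and its parameters are hypothetical names, and two behaviours are assumptions rather than documented facts, namely that a window with no traffic leaves the rate unchanged and that a 429 rate between 1% and 5% does too.

```go
package main

import "fmt"

// nextRate applies one 30-second adjustment step (hypothetical helper).
// total and rateLimited are the request counts observed in the window.
func nextRate(current float64, total, rateLimited int, minRate, maxRate float64) float64 {
	if total == 0 {
		return current // assumption: no traffic means no adjustment
	}
	ratio := float64(rateLimited) / float64(total)

	next := current
	switch {
	case ratio > 0.05:
		next = current * 0.5 // aggressive backoff
	case ratio < 0.01:
		next = current * 1.1 // gradual ramp-up
	}

	// Clamp to RATE_LIMIT_MIN / RATE_LIMIT_MAX.
	if next < minRate {
		next = minRate
	}
	if next > maxRate {
		next = maxRate
	}
	return next
}

func main() {
	fmt.Printf("%.1f\n", nextRate(20, 100, 10, 1, 50)) // 10% of requests were 429s
	fmt.Printf("%.1f\n", nextRate(10, 100, 0, 1, 50))  // no 429s
	fmt.Printf("%.1f\n", nextRate(48, 100, 0, 1, 50))  // increase capped at the maximum
}
```

The asymmetry (halve on trouble, grow by 10% when healthy) is the usual multiplicative-decrease pattern: one bad window undoes roughly seven good ones, which keeps the limiter biased toward staying under the upstream limit.
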
### Retry Logic

When a 429 error occurs:

1. **Respect Retry-After header** if present
2. **Exponential backoff**: 1s, 2s, 4s, 8s, ...
3. **Retry up to `MAX_RETRIES` times** (default: 3)
4. **Only then** return 429 to client

This means most transient rate limits are absorbed by the proxy.

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `RATE_LIMIT_INITIAL` | `10` | Starting rate in requests/second |
| `RATE_LIMIT_MIN` | `1` | Minimum rate (never go below) |
| `RATE_LIMIT_MAX` | `50` | Maximum rate (never exceed) |
| `MAX_RETRIES` | `3` | Number of retry attempts on 429/errors |
| `MAX_WORKERS` | `20` | Max concurrent requests (separate from rate) |

### Example: Conservative Rate Limiting

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "5"    # Start slow
  - name: RATE_LIMIT_MIN
    value: "1"
  - name: RATE_LIMIT_MAX
    value: "20"   # Cap at 20 req/s
  - name: MAX_RETRIES
    value: "5"    # More retries
```

### Example: Aggressive Rate Limiting

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "30"   # Start fast
  - name: RATE_LIMIT_MIN
    value: "5"
  - name: RATE_LIMIT_MAX
    value: "100"  # Cap at 100 req/s
  - name: MAX_RETRIES
    value: "2"    # Fewer retries
```

## New Metrics

### Rate Limiting Metrics

1. **`zai_proxy_rate_limit_requests_per_second`** (Gauge)
   - Current rate limit setting
   - Tracks how the rate adjusts over time

2. **`zai_proxy_rate_limit_wait_seconds`** (Histogram)
   - Time requests spend waiting for rate limiter
   - Buckets: 1ms to 10s

3. **`zai_proxy_rate_limit_adjustments_total`** (Counter)
   - Count of rate adjustments by direction
   - Labels: `direction=increase` or `direction=decrease`

4. **`zai_proxy_rate_limit_rejections_total`** (Counter)
   - Requests rejected due to rate limiting
   - Should be rare if `MAX_WORKERS` is set correctly

5. **`zai_proxy_retry_attempts_total`** (Counter)
   - Retry attempts by reason
   - Labels: `reason=429` or `reason=network_error`

### Useful Queries

**Current rate limit:**
```promql
zai_proxy_rate_limit_requests_per_second
```

**429 error rate:**
```promql
sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
```

**Success rate (2xx):**
```promql
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
```

**Rate adjustments over time:**
```promql
rate(zai_proxy_rate_limit_adjustments_total[5m])
```

**p90 wait time for rate limiter:**
```promql
histogram_quantile(0.90,
  sum(rate(zai_proxy_rate_limit_wait_seconds_bucket[5m])) by (le)
)
```

**Retry rate:**
```promql
sum(rate(zai_proxy_retry_attempts_total[5m])) by (reason)
```

## Monitoring & Tuning

### Grafana Dashboard

The updated Grafana dashboard includes:
- **Rate limit gauge** - Current req/s limit
- **429 error rate panel** - Track rate limit errors
- **Rate adjustment history** - See when/how rate changes
- **Retry attempts** - Monitor retry frequency

### Tuning Strategy

#### 1. Monitor for 24-48 hours

Watch these metrics:
- `zai_proxy_rate_limit_requests_per_second` - Does it stabilize?
- `zai_proxy_requests_total{status_code="429"}` - Are 429s still occurring?
- `zai_proxy_retry_attempts_total{reason="429"}` - How many retries?

#### 2. Adjust based on patterns

**If you see:**
- **Frequent rate decreases** → Lower `RATE_LIMIT_INITIAL` or `RATE_LIMIT_MAX`
- **Rate stuck at minimum** → Your initial rate is too aggressive
- **429s still reaching clients** → Increase `MAX_RETRIES`
- **Long wait times** → Rate limit is too low for your traffic

#### 3. Optimize for your subscription

**Example: Z.AI subscription allows 60 req/min = 1 req/s**

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.8"  # Start below limit
  - name: RATE_LIMIT_MIN
    value: "0.5"  # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "1.0"  # Exactly at limit
  - name: MAX_RETRIES
    value: "5"    # More retries
```

**Example: Z.AI subscription allows 1000 req/min = 16.7 req/s**

```yaml
env:
  - name: RATE_LIMIT_INITIAL
    value: "15"   # Start below limit
  - name: RATE_LIMIT_MIN
    value: "8"    # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "16"   # Slightly below limit for safety
  - name: MAX_RETRIES
    value: "3"
```

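The arithmetic behind these two examples can be captured in a small helper. This is a rule-of-thumb sketch, not part of the proxy: `suggestBounds` is a hypothetical function, and the 80% starting point is an assumption loosely based on the first example (the second example starts closer to 90% and rounds the maximum down to 16 for safety).

```go
package main

import "fmt"

// suggestBounds converts a per-minute quota into req/s settings:
// start at 80% of the limit, floor at 50%, cap at the limit itself.
func suggestBounds(reqPerMin float64) (initial, minRate, maxRate float64) {
	limit := reqPerMin / 60 // requests per second
	return 0.8 * limit, 0.5 * limit, limit
}

func main() {
	for _, quota := range []float64{60, 1000} {
		initial, minRate, maxRate := suggestBounds(quota)
		fmt.Printf("%4.0f req/min -> INITIAL=%.1f MIN=%.1f MAX=%.1f\n",
			quota, initial, minRate, maxRate)
	}
}
```
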
## Architecture

### Request Flow with Rate Limiting

```
Client Request
         │
         ▼
┌─────────────────┐
│ Worker Capacity │ ← MAX_WORKERS check
│      Check      │   (20 concurrent)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Rate Limiter   │ ← Token bucket wait
│   (Adaptive)    │   (X req/s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Upstream     │
│    Request      │
└────────┬────────┘
         │
         ├─ 200 OK ──────────────────> Success
         │                             Record success
         │                             (maybe increase rate)
         │
         ├─ 429 Rate Limit ──┐
         │                   │
         │                   ▼
         │            ┌──────────────┐
         │            │ Wait + Retry │ ← MAX_RETRIES
         │            └──────┬───────┘   (3 attempts)
         │                   │
         │                   ├─ Success ──> Return to client
         │                   │              Record 429
         │                   │              (decrease rate)
         │                   │
         │                   └─ Still 429 ─> Return 429 to client
         │
         └─ Network Error ──┐
                            │
                            ▼
                     ┌──────────────┐
                     │ Retry Logic  │ ← Exponential backoff
                     └──────────────┘   1s, 2s, 4s, ...
```

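The flow in the diagram can be approximated in a few lines of Go. Everything here is a hypothetical sketch: the `Proxy` type, its fields, and the error values are invented for illustration, the rate limiter and backoff are injected as plain functions, and network errors are left out. Three behaviours are assumptions rather than documented facts: a full worker pool rejects immediately instead of queueing, each retry waits for a fresh rate-limiter token, and `MAX_RETRIES` counts retries after the initial attempt.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errBusy        = errors.New("no worker capacity")
	errRateLimited = errors.New("upstream still returning 429")
)

// Proxy is a hypothetical stand-in for the request pipeline.
type Proxy struct {
	workers    chan struct{}     // buffered to MAX_WORKERS; used as a semaphore
	maxRetries int               // MAX_RETRIES
	waitToken  func()            // blocks until the rate limiter grants a token
	backoff    func(attempt int) // sleeps before the next retry
}

// Do runs one client request through the pipeline. The upstream
// function returns the HTTP status code of a single attempt.
func (p *Proxy) Do(upstream func() int) (int, error) {
	// Worker capacity check: non-blocking acquire of a semaphore slot.
	select {
	case p.workers <- struct{}{}:
		defer func() { <-p.workers }()
	default:
		return 0, errBusy
	}

	for attempt := 0; ; attempt++ {
		p.waitToken()
		status := upstream()
		if status != 429 {
			return status, nil
		}
		if attempt == p.maxRetries {
			return status, errRateLimited // only now does the client see the 429
		}
		p.backoff(attempt)
	}
}

func main() {
	calls := 0
	p := &Proxy{
		workers:    make(chan struct{}, 20),
		maxRetries: 3,
		waitToken:  func() {},    // no-op limiter for the demo
		backoff:    func(int) {}, // no sleeping in the demo
	}
	status, err := p.Do(func() int {
		calls++
		if calls < 3 {
			return 429 // the first two attempts are rate limited
		}
		return 200
	})
	fmt.Println(status, err, calls)
}
```

In the demo the client receives a 200 even though the upstream returned two 429s, which is the "absorbed by the proxy" behaviour described in the retry section.
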
## Benefits

### 1. **Automatic Optimization**
- Little manual tuning required once the bounds are set
- Adapts to changing API limits
- Converges on a sustainable rate automatically

### 2. **Resilience**
- Handles transient 429s via retries
- Backs off aggressively when needed
- Gradually increases when safe

### 3. **Cost Efficiency**
- Maximizes subscription utilization
- Reduces wasted requests
- Minimizes client-visible errors

### 4. **Observability**
- Rich metrics for monitoring
- Clear visibility into rate adjustments
- Easy to debug rate limit issues

## Troubleshooting

### Problem: Rate keeps decreasing

**Symptoms:**
- `zai_proxy_rate_limit_requests_per_second` trending down
- Frequent `decrease` adjustments

**Solutions:**
1. Lower `RATE_LIMIT_INITIAL` - Starting too high
2. Check upstream API health - May be globally rate-limited
3. Verify subscription limits - May have exceeded quota

### Problem: Still seeing 429 errors

**Symptoms:**
- `zai_proxy_requests_total{status_code="429"}` > 0 after retries

**Solutions:**
1. Increase `MAX_RETRIES` - More retry attempts
2. Lower `RATE_LIMIT_MAX` - Too aggressive ceiling
3. Check burst traffic - May need multiple proxy replicas

### Problem: High latency

**Symptoms:**
- `zai_proxy_rate_limit_wait_seconds` p90 > 1s

**Solutions:**
1. Increase `RATE_LIMIT_MAX` - Currently bottlenecked
2. Add more replicas - Distribute load
3. Check if rate stuck at minimum - Adjust `RATE_LIMIT_MIN`

### Problem: Rate not increasing

**Symptoms:**
- `zai_proxy_rate_limit_requests_per_second` stuck below max
- No `increase` adjustments

**Solutions:**
1. Wait longer - Increases are gradual (10% every 30s)
2. Check 429 rate - Must be <1% to increase
3. Verify traffic volume - Need sustained load to trigger increases

## Advanced: Per-Client Rate Limiting

For future enhancement, you could add per-client rate limiting:

```go
// Example: Rate limit per API key
type ClientRateLimiter struct {
	limiters map[string]*AdaptiveRateLimiter // must be initialized with make() before use
	mu       sync.RWMutex
}

func (crl *ClientRateLimiter) GetLimiter(clientID string) *AdaptiveRateLimiter {
	// Fast path: shared read lock for clients that already have a limiter.
	crl.mu.RLock()
	limiter, exists := crl.limiters[clientID]
	crl.mu.RUnlock()

	if exists {
		return limiter
	}

	crl.mu.Lock()
	defer crl.mu.Unlock()

	// Re-check under the write lock: another goroutine may have created
	// the limiter between RUnlock and Lock.
	if limiter, exists := crl.limiters[clientID]; exists {
		return limiter
	}

	// 10, 1, 50 match the default initial, minimum, and maximum rates.
	limiter = NewAdaptiveRateLimiter(10, 1, 50)
	crl.limiters[clientID] = limiter
	return limiter
}
```

This would allow different rate limits per client/tenant.

## References

- [Token Bucket Algorithm](https://en.wikipedia.org/wiki/Token_bucket)
- [HTTP 429 Status Code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429)
- [Retry-After Header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After)
- [Exponential Backoff](https://en.wikipedia.org/wiki/Exponential_backoff)