Z.AI Proxy Adaptive Rate Limiting
Overview
zai-proxy v1.2.0 and later include adaptive rate limiting, which automatically adjusts the request rate to minimize 429 (rate limit) errors while maximizing throughput.
How It Works
Adaptive Algorithm
The proxy uses a token bucket rate limiter with automatic adjustment based on upstream 429 responses:
┌─────────────────────────────────────────────────────────────┐
│                    Adaptive Rate Limiter                    │
│                                                             │
│  ┌──────────────┐      ┌──────────────┐                     │
│  │ Token Bucket │ ───> │  Rate: X/s   │                     │
│  │   Limiter    │      │  Burst: 2X   │                     │
│  └──────────────┘      └──────────────┘                     │
│         │                     │                             │
│         │                     ▼                             │
│         │              ┌──────────────┐                     │
│         │              │  Every 30s:  │                     │
│         │              │ Analyze 429s │                     │
│         │              └──────┬───────┘                     │
│         │                     │                             │
│         │      ┌──────────────┴──────────────┐              │
│         │      │                             │              │
│         │      ▼                             ▼              │
│         │  >5% 429s?                     <1% 429s?          │
│         │ Decrease 50%                  Increase 10%        │
│         │      │                             │              │
│         └──────┴─────────────────────────────┘              │
│                                                             │
└─────────────────────────────────────────────────────────────┘
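To make the "Rate: X/s, Burst: 2X" box concrete, here is a minimal token bucket sketch in Go. It is not the proxy's actual limiter: the `tokenBucket` type and the `allowAfter` method are hypothetical names, and the only facts taken from the diagram are the refill rate X and the burst capacity 2X.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a hypothetical, minimal token bucket used for illustration.
type tokenBucket struct {
	rate   float64 // tokens added per second (X in the diagram)
	burst  float64 // bucket capacity (2X in the diagram)
	tokens float64 // tokens currently available
}

// newTokenBucket starts with a full bucket sized at twice the rate.
func newTokenBucket(rate float64) *tokenBucket {
	return &tokenBucket{rate: rate, burst: 2 * rate, tokens: 2 * rate}
}

// allowAfter refills the bucket for the time elapsed since the previous
// call, caps it at the burst size, then spends one token if one is available.
func (b *tokenBucket) allowAfter(elapsed time.Duration) bool {
	b.tokens += elapsed.Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := newTokenBucket(2) // 2 req/s, burst of 4
	for i := 0; i < 5; i++ {
		fmt.Println(b.allowAfter(0)) // true four times, then false
	}
	fmt.Println(b.allowAfter(500 * time.Millisecond)) // true: one token refilled
}
```

The adaptive part of the limiter only has to change `rate` (and with it `burst`) between evaluations; the bucket itself stays the same.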
Rate Adjustment Logic
Every 30 seconds, the proxy evaluates recent requests:
- Calculate 429 rate: count(429) / count(total)
- If 429 rate > 5%:
  - Decrease rate by 50% (aggressive backoff)
  - Example: 20 req/s → 10 req/s
- If 429 rate < 1%:
  - Increase rate by 10% (gradual ramp-up)
  - Example: 10 req/s → 11 req/s
- Rate bounds:
  - Never go below RATE_LIMIT_MIN (default: 1 req/s)
  - Never exceed RATE_LIMIT_MAX (default: 50 req/s)
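The evaluation step above can be sketched as a single pure function. The thresholds (5% and 1%), the factors (0.5 and 1.1), and the clamping to the configured bounds come from the list; the function name `nextRate`, its parameters, and the choice to leave the rate unchanged when there was no traffic are assumptions made for this sketch.

```go
package main

import "fmt"

// nextRate applies one 30-second evaluation step to the current rate.
// total and rateLimited are the request counts seen in the last window.
func nextRate(current, minRate, maxRate float64, total, rateLimited int) float64 {
	if total == 0 {
		return current // assumption: no traffic means no adjustment
	}
	ratio := float64(rateLimited) / float64(total)
	next := current
	switch {
	case ratio > 0.05:
		next = current * 0.5 // aggressive backoff
	case ratio < 0.01:
		next = current * 1.1 // gradual ramp-up
	}
	// Clamp to RATE_LIMIT_MIN / RATE_LIMIT_MAX.
	if next < minRate {
		next = minRate
	}
	if next > maxRate {
		next = maxRate
	}
	return next
}

func main() {
	fmt.Printf("%.1f\n", nextRate(20, 1, 50, 100, 10)) // 10% 429s: rate is halved
	fmt.Printf("%.1f\n", nextRate(10, 1, 50, 100, 0))  // 0% 429s: rate grows by 10%
	fmt.Printf("%.1f\n", nextRate(10, 1, 50, 100, 3))  // 3% 429s: rate is unchanged
}
```

Note that a 429 rate between 1% and 5% matches neither branch, so in this sketch the rate holds steady in that band.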
Retry Logic
When a 429 error occurs:
- Respect Retry-After header if present
- Exponential backoff: 1s, 2s, 4s, 8s, ...
- Retry up to MAX_RETRIES times (default: 3)
- Only then return 429 to client
This means most transient rate limits are absorbed by the proxy.
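A minimal sketch of how the wait before each retry might be chosen, assuming the Retry-After header carries a whole number of seconds. The helper name `retryDelay` is hypothetical, and the HTTP-date form of Retry-After is not handled here; it simply falls through to the exponential schedule.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// retryDelay returns how long to wait before retry number attempt (0-based).
// A Retry-After value given in whole seconds takes precedence; otherwise the
// delay doubles from one second: 1s, 2s, 4s, 8s, ...
func retryDelay(attempt int, retryAfter string) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	return time.Second << uint(attempt)
}

func main() {
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println(retryDelay(attempt, "")) // 1s, 2s, 4s
	}
	fmt.Println(retryDelay(0, "7")) // 7s: the header wins over backoff
}
```

With the default MAX_RETRIES of 3 and no Retry-After header, a request that keeps hitting 429 would wait 1s + 2s + 4s = 7s in total before the 429 is returned to the client.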
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| RATE_LIMIT_INITIAL | 10 | Starting rate in requests/second |
| RATE_LIMIT_MIN | 1 | Minimum rate (never go below) |
| RATE_LIMIT_MAX | 50 | Maximum rate (never exceed) |
| MAX_RETRIES | 3 | Number of retry attempts on 429/errors |
| MAX_WORKERS | 20 | Max concurrent requests (separate from rate) |
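For illustration, these variables and defaults could be read as follows. This is a sketch, not the proxy's own loader: the `config` struct, `loadConfig`, and the helper names are hypothetical, rates are parsed as floats because the examples in this document use fractional values such as "0.8", and falling back to the default on an unparseable value is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// config mirrors the table above; the struct itself is hypothetical.
type config struct {
	RateInitial, RateMin, RateMax float64
	MaxRetries, MaxWorkers        int
}

// lookupFunc has the same signature as os.LookupEnv so values can be injected.
type lookupFunc func(key string) (string, bool)

// floatOr returns the parsed value of key, or def if unset or invalid.
func floatOr(lookup lookupFunc, key string, def float64) float64 {
	if raw, ok := lookup(key); ok {
		if v, err := strconv.ParseFloat(raw, 64); err == nil {
			return v
		}
	}
	return def
}

// intOr returns the parsed value of key, or def if unset or invalid.
func intOr(lookup lookupFunc, key string, def int) int {
	if raw, ok := lookup(key); ok {
		if v, err := strconv.Atoi(raw); err == nil {
			return v
		}
	}
	return def
}

// loadConfig applies the documented defaults to anything not set.
func loadConfig(lookup lookupFunc) config {
	return config{
		RateInitial: floatOr(lookup, "RATE_LIMIT_INITIAL", 10),
		RateMin:     floatOr(lookup, "RATE_LIMIT_MIN", 1),
		RateMax:     floatOr(lookup, "RATE_LIMIT_MAX", 50),
		MaxRetries:  intOr(lookup, "MAX_RETRIES", 3),
		MaxWorkers:  intOr(lookup, "MAX_WORKERS", 20),
	}
}

func main() {
	// Prints the effective settings for the current environment.
	fmt.Printf("%+v\n", loadConfig(os.LookupEnv))
}
```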
Example: Conservative Rate Limiting
env:
  - name: RATE_LIMIT_INITIAL
    value: "5"    # Start slow
  - name: RATE_LIMIT_MIN
    value: "1"
  - name: RATE_LIMIT_MAX
    value: "20"   # Cap at 20 req/s
  - name: MAX_RETRIES
    value: "5"    # More retries
Example: Aggressive Rate Limiting
env:
  - name: RATE_LIMIT_INITIAL
    value: "30"   # Start fast
  - name: RATE_LIMIT_MIN
    value: "5"
  - name: RATE_LIMIT_MAX
    value: "100"  # Cap at 100 req/s
  - name: MAX_RETRIES
    value: "2"    # Fewer retries
New Metrics
Rate Limiting Metrics
- zai_proxy_rate_limit_requests_per_second (Gauge)
  - Current rate limit setting
  - Tracks how the rate adjusts over time
- zai_proxy_rate_limit_wait_seconds (Histogram)
  - Time requests spend waiting for rate limiter
  - Buckets: 1ms to 10s
- zai_proxy_rate_limit_adjustments_total (Counter)
  - Count of rate adjustments by direction
  - Labels: direction=increase or direction=decrease
- zai_proxy_rate_limit_rejections_total (Counter)
  - Requests rejected due to rate limiting
  - Should be rare if MAX_WORKERS is set correctly
- zai_proxy_retry_attempts_total (Counter)
  - Retry attempts by reason
  - Labels: reason=429 or reason=network_error
Useful Queries
Current rate limit:
zai_proxy_rate_limit_requests_per_second
429 error rate:
sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
Success rate (2xx):
sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))
Rate adjustments over time:
rate(zai_proxy_rate_limit_adjustments_total[5m])
p90 wait time for rate limiter:
histogram_quantile(0.90,
sum(rate(zai_proxy_rate_limit_wait_seconds_bucket[5m])) by (le)
)
Retry rate:
sum(rate(zai_proxy_retry_attempts_total[5m])) by (reason)
Monitoring & Tuning
Grafana Dashboard
The updated Grafana dashboard includes:
- Rate limit gauge - Current req/s limit
- 429 error rate panel - Track rate limit errors
- Rate adjustment history - See when/how rate changes
- Retry attempts - Monitor retry frequency
Tuning Strategy
1. Monitor for 24-48 hours
Watch these metrics:
- zai_proxy_rate_limit_requests_per_second - Does it stabilize?
- zai_proxy_requests_total{status_code="429"} - Are 429s still occurring?
- zai_proxy_retry_attempts_total{reason="429"} - How many retries?
2. Adjust based on patterns
If you see:
- Frequent rate decreases → Lower RATE_LIMIT_INITIAL or RATE_LIMIT_MAX
- Rate stuck at minimum → Your initial rate is too aggressive
- 429s still reaching clients → Increase MAX_RETRIES
- Long wait times → Rate limit is too low for your traffic
3. Optimize for your subscription
Example: Z.AI subscription allows 60 req/min = 1 req/s
env:
  - name: RATE_LIMIT_INITIAL
    value: "0.8"  # Start below limit
  - name: RATE_LIMIT_MIN
    value: "0.5"  # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "1.0"  # Exactly at limit
  - name: MAX_RETRIES
    value: "5"    # More retries
Example: Z.AI subscription allows 1000 req/min = 16.7 req/s
env:
  - name: RATE_LIMIT_INITIAL
    value: "15"   # Start below limit
  - name: RATE_LIMIT_MIN
    value: "8"    # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "16"   # Slightly below limit for safety
  - name: MAX_RETRIES
    value: "3"
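The arithmetic behind these two examples can be sketched as a small helper that converts a per-minute quota into starting values. The function name `settingsFor` and the exact ratios (start at 80% of the limit, floor at 50%, ceiling at 100%) are assumptions modeled on the first example; the second example above rounds to friendlier numbers and starts closer to the limit.

```go
package main

import "fmt"

// settingsFor turns a subscription quota in requests/minute into suggested
// RATE_LIMIT_INITIAL, RATE_LIMIT_MIN, and RATE_LIMIT_MAX values in req/s.
// The 0.8 and 0.5 factors are illustrative, not prescribed by the proxy.
func settingsFor(reqPerMinute float64) (initial, minRate, maxRate float64) {
	limit := reqPerMinute / 60 // convert to requests/second
	return 0.8 * limit, 0.5 * limit, limit
}

func main() {
	for _, quota := range []float64{60, 1000} {
		initial, minRate, maxRate := settingsFor(quota)
		fmt.Printf("%4.0f req/min: initial=%.1f min=%.1f max=%.1f\n",
			quota, initial, minRate, maxRate)
	}
}
```

For 60 req/min this gives 0.8, 0.5, and 1.0 req/s, matching the first example. For 1000 req/min it gives roughly 13.3, 8.3, and 16.7 req/s, which you may want to round down as the second example does.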
Architecture
Request Flow with Rate Limiting
  Client Request
         │
         ▼
┌─────────────────┐
│ Worker Capacity │ ← MAX_WORKERS check
│      Check      │   (20 concurrent)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Rate Limiter   │ ← Token bucket wait
│   (Adaptive)    │   (X req/s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Upstream     │
│     Request     │
└────────┬────────┘
         │
         ├─ 200 OK ──────────────────> Success
         │                             Record success
         │                             (maybe increase rate)
         │
         ├─ 429 Rate Limit ──┐
         │                   │
         │                   ▼
         │            ┌──────────────┐
         │            │ Wait + Retry │ ← MAX_RETRIES
         │            └──────┬───────┘   (3 attempts)
         │                   │
         │                   ├─ Success ──> Return to client
         │                   │              Record 429
         │                   │              (decrease rate)
         │                   │
         │                   └─ Still 429 ─> Return 429 to client
         │
         └─ Network Error ──┐
                            │
                            ▼
                     ┌──────────────┐
                     │ Retry Logic  │ ← Exponential backoff
                     └──────────────┘   1s, 2s, 4s, ...
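The 429 path of this flow can be sketched in Go as below. Everything here is illustrative rather than the proxy's code: `handle` and its parameters are hypothetical, the 503 returned when no worker slot is free is an assumption (the document does not say which status is used), and so is passing every retry through the rate limiter again. Network-error retries and the bookkeeping that feeds the rate adjustment are left out.

```go
package main

import (
	"fmt"
	"time"
)

// handle walks one request through the stages in the diagram:
// worker capacity check, rate limiter wait, upstream call, retry on 429.
func handle(workers chan struct{}, waitForToken func(), upstream func() int,
	maxRetries int, backoff func(attempt int)) int {
	// Worker capacity check: take a slot or reject immediately.
	select {
	case workers <- struct{}{}:
		defer func() { <-workers }()
	default:
		return 503 // assumed status for "no free worker"
	}
	for attempt := 0; ; attempt++ {
		waitForToken() // token bucket wait
		status := upstream()
		// Return anything that is not a 429, or the 429 once retries run out.
		if status != 429 || attempt == maxRetries {
			return status
		}
		backoff(attempt) // wait before the next retry
	}
}

func main() {
	workers := make(chan struct{}, 20) // MAX_WORKERS
	calls := 0
	upstream := func() int { // fake upstream: two 429s, then success
		calls++
		if calls < 3 {
			return 429
		}
		return 200
	}
	backoff := func(attempt int) {
		fmt.Println("would wait", time.Second<<uint(attempt))
	}
	fmt.Println(handle(workers, func() {}, upstream, 3, backoff)) // 200
}
```

The buffered channel acts as a counting semaphore: its capacity is the worker limit, and the deferred receive frees the slot however the request ends.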
Benefits
1. Automatic Optimization
- No manual tuning required
- Adapts to changing API limits
- Finds optimal rate automatically
2. Resilience
- Handles transient 429s via retries
- Backs off aggressively when needed
- Gradually increases when safe
3. Cost Efficiency
- Maximizes subscription utilization
- Reduces wasted requests
- Minimizes client-visible errors
4. Observability
- Rich metrics for monitoring
- Clear visibility into rate adjustments
- Easy to debug rate limit issues
Troubleshooting
Problem: Rate keeps decreasing
Symptoms:
- zai_proxy_rate_limit_requests_per_second trending down
- Frequent decrease adjustments
Solutions:
- Lower RATE_LIMIT_INITIAL - Starting too high
- Check upstream API health - May be globally rate-limited
- Verify subscription limits - May have exceeded quota
Problem: Still seeing 429 errors
Symptoms:
- zai_proxy_requests_total{status_code="429"} > 0 after retries
Solutions:
- Increase MAX_RETRIES - More retry attempts
- Lower RATE_LIMIT_MAX - Too aggressive ceiling
- Check burst traffic - May need multiple proxy replicas
Problem: High latency
Symptoms:
- zai_proxy_rate_limit_wait_seconds p90 > 1s
Solutions:
- Increase RATE_LIMIT_MAX - Currently bottlenecked
- Add more replicas - Distribute load
- Check if rate stuck at minimum - Adjust RATE_LIMIT_MIN
Problem: Rate not increasing
Symptoms:
- zai_proxy_rate_limit_requests_per_second stuck below max
- No increase adjustments
Solutions:
- Wait longer - Increases are gradual (10% every 30s)
- Check 429 rate - Must be <1% to increase
- Verify traffic volume - Need sustained load to trigger increases
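To judge how long "wait longer" means, the ramp-up time can be estimated by counting how many 10% steps separate two rates. The helper below is a sketch with a hypothetical name, and it assumes every 30-second evaluation sees a 429 rate below 1%.

```go
package main

import "fmt"

// intervalsToReach counts how many consecutive 10% increases are needed to
// grow from one rate to at least another. It returns -1 for a non-positive
// starting rate, which would never grow.
func intervalsToReach(from, to float64) int {
	if from <= 0 {
		return -1
	}
	n := 0
	for rate := from; rate < to; rate *= 1.1 {
		n++
	}
	return n
}

func main() {
	n := intervalsToReach(10, 50) // default initial rate to default max
	fmt.Printf("%d intervals, about %d s\n", n, n*30) // 17 intervals, about 510 s
}
```

Recovering from a single halving (for example 10 req/s back to 20 req/s) takes 8 intervals, or about 4 minutes, so a rate that has not moved after a few minutes is not necessarily stuck.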
Advanced: Per-Client Rate Limiting
For future enhancement, you could add per-client rate limiting:
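// Example: rate limit per API key.
// AdaptiveRateLimiter and NewAdaptiveRateLimiter are assumed to be the
// proxy's existing limiter type and constructor; they are not defined here.
import "sync"

type ClientRateLimiter struct {
	mu       sync.RWMutex
	limiters map[string]*AdaptiveRateLimiter // must be initialized with make before use
}

func (crl *ClientRateLimiter) GetLimiter(clientID string) *AdaptiveRateLimiter {
	// Fast path: most lookups only need the read lock.
	crl.mu.RLock()
	limiter, exists := crl.limiters[clientID]
	crl.mu.RUnlock()
	if exists {
		return limiter
	}

	crl.mu.Lock()
	defer crl.mu.Unlock()
	// Re-check under the write lock: another goroutine may have created
	// the limiter between RUnlock and Lock.
	if limiter, exists := crl.limiters[clientID]; exists {
		return limiter
	}
	limiter = NewAdaptiveRateLimiter(10, 1, 50) // values mirror the default initial/min/max rates
	crl.limiters[clientID] = limiter
	return limiter
}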
This would allow different rate limits per client/tenant.