jedarden e7c24a0c08 feat: initial zai-proxy ecosystem repo

Extracted from ardenone-cluster/containers/zai-proxy and
ardenone-cluster/containers/zai-proxy-dashboard.

- proxy/: OpenAI-compatible ZAI reverse proxy (Go, v1.10.0)
  - Token counting, rate limiting, Prometheus metrics, canary support
- dashboard/: Metrics dashboard backend + React frontend (Go, v1.0.0)
  - Prometheus collector, SQLite storage, SSE live updates
- docs/: Operational notes, research, and plan subdirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:53:52 -04:00


Z.AI Proxy Adaptive Rate Limiting

Overview

zai-proxy v1.2.0 and later includes adaptive rate limiting, which automatically adjusts the request rate to minimize 429 (rate limit) errors while maximizing throughput.

How It Works

Adaptive Algorithm

The proxy uses a token bucket rate limiter with automatic adjustment based on upstream 429 responses:

┌─────────────────────────────────────────────────────────────┐
│                  Adaptive Rate Limiter                       │
│                                                               │
│  ┌──────────────┐      ┌──────────────┐                     │
│  │ Token Bucket │ ───> │  Rate: X/s   │                     │
│  │   Limiter    │      │ Burst: 2X    │                     │
│  └──────────────┘      └──────────────┘                     │
│         │                      │                             │
│         │                      ▼                             │
│         │              ┌──────────────┐                     │
│         │              │ Every 30s:   │                     │
│         │              │ Analyze 429s │                     │
│         │              └──────┬───────┘                     │
│         │                     │                             │
│         │      ┌──────────────┴──────────────┐              │
│         │      │                             │              │
│         │      ▼                             ▼              │
│         │  >5% 429s?                    <1% 429s?          │
│         │  Decrease 50%                 Increase 10%        │
│         │      │                             │              │
│         └──────┴─────────────────────────────┘              │
│                                                               │
└─────────────────────────────────────────────────────────────┘

Rate Adjustment Logic

Every 30 seconds, the proxy evaluates recent requests:

1. Calculate 429 rate: count(429) / count(total)
2. If 429 rate > 5%:
   - Decrease rate by 50% (aggressive backoff)
   - Example: 20 req/s → 10 req/s
3. If 429 rate < 1%:
   - Increase rate by 10% (gradual ramp-up)
   - Example: 10 req/s → 11 req/s
4. Rate bounds:
   - Never go below RATE_LIMIT_MIN (default: 1 req/s)
   - Never exceed RATE_LIMIT_MAX (default: 50 req/s)
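One evaluation step of the logic above could look like the sketch below. The function name `nextRate` and its signature are hypothetical. The sketch also makes two assumptions that the text does not state outright: a 429 rate between 1% and 5% leaves the rate unchanged, and a window with no requests triggers no adjustment.

```go
package main

import "fmt"

// nextRate is a hypothetical helper that applies one 30-second evaluation step.
// total and rateLimited are the request and 429 counts seen in the window.
func nextRate(current float64, total, rateLimited int, minRate, maxRate float64) float64 {
	if total == 0 {
		return current // assumption: no traffic means no adjustment
	}
	ratio := float64(rateLimited) / float64(total)

	next := current
	switch {
	case ratio > 0.05:
		next = current * 0.5 // aggressive backoff
	case ratio < 0.01:
		next = current * 1.1 // gradual ramp-up
	} // assumption: between 1% and 5% the rate is left as is

	// Clamp to the RATE_LIMIT_MIN / RATE_LIMIT_MAX bounds.
	if next < minRate {
		next = minRate
	}
	if next > maxRate {
		next = maxRate
	}
	return next
}

func main() {
	fmt.Printf("%.1f\n", nextRate(20, 100, 6, 1, 50)) // 6% 429s: halved
	fmt.Printf("%.1f\n", nextRate(10, 100, 0, 1, 50)) // 0% 429s: +10%
	fmt.Printf("%.1f\n", nextRate(10, 100, 3, 1, 50)) // 3% 429s: unchanged
	fmt.Printf("%.1f\n", nextRate(48, 100, 0, 1, 50)) // capped at the maximum
}
```

Because a decrease halves the rate while an increase adds only 10%, recovering from a single backoff takes roughly eight clean windows (1.1^8 ≈ 2.14), or about four minutes at 30 seconds per window.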

Retry Logic

When a 429 error occurs:

1. Respect the Retry-After header if present
2. Otherwise use exponential backoff: 1s, 2s, 4s, 8s, ...
3. Retry up to MAX_RETRIES times (default: 3)
4. Only then return the 429 to the client

This means most transient rate limits are absorbed by the proxy.

Configuration

Environment Variables

Variable	Default	Description
`RATE_LIMIT_INITIAL`	`10`	Starting rate in requests/second
`RATE_LIMIT_MIN`	`1`	Minimum rate (never go below)
`RATE_LIMIT_MAX`	`50`	Maximum rate (never exceed)
`MAX_RETRIES`	`3`	Number of retry attempts on 429s or network errors
`MAX_WORKERS`	`20`	Max concurrent requests (separate from rate)

Example: Conservative Rate Limiting

env:
  - name: RATE_LIMIT_INITIAL
    value: "5"    # Start slow
  - name: RATE_LIMIT_MIN
    value: "1"
  - name: RATE_LIMIT_MAX
    value: "20"   # Cap at 20 req/s
  - name: MAX_RETRIES
    value: "5"    # More retries

Example: Aggressive Rate Limiting

env:
  - name: RATE_LIMIT_INITIAL
    value: "30"   # Start fast
  - name: RATE_LIMIT_MIN
    value: "5"
  - name: RATE_LIMIT_MAX
    value: "100"  # Cap at 100 req/s
  - name: MAX_RETRIES
    value: "2"    # Fewer retries

New Metrics

Rate Limiting Metrics

zai_proxy_rate_limit_requests_per_second (Gauge)
- Current rate limit setting
- Tracks how the rate adjusts over time
zai_proxy_rate_limit_wait_seconds (Histogram)
- Time requests spend waiting for rate limiter
- Buckets: 1ms to 10s
zai_proxy_rate_limit_adjustments_total (Counter)
- Count of rate adjustments by direction
- Labels: direction=increase or direction=decrease
zai_proxy_rate_limit_rejections_total (Counter)
- Requests rejected due to rate limiting
- Should be rare if MAX_WORKERS is set correctly
zai_proxy_retry_attempts_total (Counter)
- Retry attempts by reason
- Labels: reason=429 or reason=network_error

Useful Queries

Current rate limit:

zai_proxy_rate_limit_requests_per_second

429 error rate:

sum(rate(zai_proxy_requests_total{status_code="429"}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))

Success rate (2xx):

sum(rate(zai_proxy_requests_total{status_code=~"2.."}[5m]))
/
sum(rate(zai_proxy_requests_total[5m]))

Rate adjustments over time:

rate(zai_proxy_rate_limit_adjustments_total[5m])

p90 wait time for rate limiter:

histogram_quantile(0.90,
  sum(rate(zai_proxy_rate_limit_wait_seconds_bucket[5m])) by (le)
)

Retry rate:

sum(rate(zai_proxy_retry_attempts_total[5m])) by (reason)

Monitoring & Tuning

Grafana Dashboard

The updated Grafana dashboard includes:

Rate limit gauge - Current req/s limit
429 error rate panel - Track rate limit errors
Rate adjustment history - See when/how rate changes
Retry attempts - Monitor retry frequency

Tuning Strategy

1. Monitor for 24-48 hours

Watch these metrics:

zai_proxy_rate_limit_requests_per_second - Does it stabilize?
zai_proxy_requests_total{status_code="429"} - Are 429s still occurring?
zai_proxy_retry_attempts_total{reason="429"} - How many retries?

2. Adjust based on patterns

If you see:

Frequent rate decreases → Lower RATE_LIMIT_INITIAL or RATE_LIMIT_MAX
Rate stuck at minimum → Upstream is throttling even at RATE_LIMIT_MIN; lower RATE_LIMIT_MIN or verify your subscription quota
429s still reaching clients → Increase MAX_RETRIES
Long wait times → Rate limit is too low for your traffic

3. Optimize for your subscription

Example: Z.AI subscription allows 60 req/min = 1 req/s

env:
  - name: RATE_LIMIT_INITIAL
    value: "0.8"   # Start below limit
  - name: RATE_LIMIT_MIN
    value: "0.5"   # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "1.0"   # Exactly at limit
  - name: MAX_RETRIES
    value: "5"     # More retries

Example: Z.AI subscription allows 1000 req/min = 16.7 req/s

env:
  - name: RATE_LIMIT_INITIAL
    value: "15"    # Start below limit
  - name: RATE_LIMIT_MIN
    value: "8"     # 50% of limit
  - name: RATE_LIMIT_MAX
    value: "16"    # Slightly below limit for safety
  - name: MAX_RETRIES
    value: "3"

Architecture

Request Flow with Rate Limiting

Client Request
    │
    ▼
┌─────────────────┐
│ Worker Capacity │ ← MAX_WORKERS check
│ Check           │   (20 concurrent)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Rate Limiter    │ ← Token bucket wait
│ (Adaptive)      │   (X req/s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Upstream        │
│ Request         │
└────────┬────────┘
         │
         ├─ 200 OK ──────────────────> Success
         │                              Record success
         │                              (maybe increase rate)
         │
         ├─ 429 Rate Limit ──┐
         │                    │
         │                    ▼
         │              ┌──────────────┐
         │              │ Wait + Retry │ ← MAX_RETRIES
         │              └──────┬───────┘   (3 attempts)
         │                     │
         │                     ├─ Success ──> Return to client
         │                     │               Record 429
         │                     │               (decrease rate)
         │                     │
         │                     └─ Still 429 ─> Return 429 to client
         │
         └─ Network Error ──┐
                            │
                            ▼
                      ┌──────────────┐
                      │ Retry Logic  │ ← Exponential backoff
                      └──────────────┘   1s, 2s, 4s, ...
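The worker capacity check at the top of the flow is separate from the rate limiter: it bounds concurrency, not throughput. One common way to build such a check in Go is a buffered channel used as a counting semaphore, sketched below. The `workerPool` type is hypothetical, and rejecting immediately when all slots are taken (instead of queueing) is an assumption suggested by the zai_proxy_rate_limit_rejections_total metric.

```go
package main

import "fmt"

// workerPool is a hypothetical counting semaphore built on a buffered channel.
type workerPool struct {
	slots chan struct{}
}

// newWorkerPool allows at most maxWorkers concurrent holders (MAX_WORKERS).
func newWorkerPool(maxWorkers int) *workerPool {
	return &workerPool{slots: make(chan struct{}, maxWorkers)}
}

// tryAcquire claims a slot without blocking; false means the pool is at capacity.
func (p *workerPool) tryAcquire() bool {
	select {
	case p.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously claimed with tryAcquire.
func (p *workerPool) release() {
	<-p.slots
}

func main() {
	pool := newWorkerPool(2)
	fmt.Println(pool.tryAcquire(), pool.tryAcquire(), pool.tryAcquire())
	pool.release()
	fmt.Println(pool.tryAcquire())
}
```

In a handler, a failed `tryAcquire` would map to a rejection, while a successful one would be paired with a deferred `release` before the request moves on to the rate limiter wait.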

Benefits

1. Automatic Optimization

Little manual tuning required once the bounds are set
Adapts to changing API limits
Finds optimal rate automatically

2. Resilience

Handles transient 429s via retries
Backs off aggressively when needed
Gradually increases when safe

3. Cost Efficiency

Maximizes subscription utilization
Reduces wasted requests
Minimizes client-visible errors

4. Observability

Rich metrics for monitoring
Clear visibility into rate adjustments
Easy to debug rate limit issues

Troubleshooting

Problem: Rate keeps decreasing

Symptoms:

zai_proxy_rate_limit_requests_per_second trending down
Frequent decrease adjustments

Solutions:

Lower RATE_LIMIT_INITIAL - Starting too high
Check upstream API health - May be globally rate-limited
Verify subscription limits - May have exceeded quota

Problem: Still seeing 429 errors

Symptoms:

zai_proxy_requests_total{status_code="429"} > 0 after retries

Solutions:

Increase MAX_RETRIES - More retry attempts
Lower RATE_LIMIT_MAX - Too aggressive ceiling
Check burst traffic - May need multiple proxy replicas

Problem: High latency

Symptoms:

zai_proxy_rate_limit_wait_seconds p90 > 1s

Solutions:

Increase RATE_LIMIT_MAX - Currently bottlenecked
Add more replicas - Distribute load
Check if rate stuck at minimum - Adjust RATE_LIMIT_MIN

Problem: Rate not increasing

Symptoms:

zai_proxy_rate_limit_requests_per_second stuck below max
No increase adjustments

Solutions:

Wait longer - Increases are gradual (10% every 30s)
Check 429 rate - Must be <1% to increase
Verify traffic volume - Need sustained load to trigger increases

Advanced: Per-Client Rate Limiting

For future enhancement, you could add per-client rate limiting:

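// Example: Rate limit per API key.
// AdaptiveRateLimiter and NewAdaptiveRateLimiter are assumed to be the
// proxy's existing limiter type and constructor.
import "sync"

type ClientRateLimiter struct {
    mu       sync.RWMutex
    limiters map[string]*AdaptiveRateLimiter
}

func (crl *ClientRateLimiter) GetLimiter(clientID string) *AdaptiveRateLimiter {
    // Fast path: most calls find an existing limiter under the read lock.
    crl.mu.RLock()
    limiter, exists := crl.limiters[clientID]
    crl.mu.RUnlock()

    if exists {
        return limiter
    }

    crl.mu.Lock()
    defer crl.mu.Unlock()

    // Re-check under the write lock: another goroutine may have created
    // the limiter between RUnlock and Lock.
    if limiter, exists = crl.limiters[clientID]; exists {
        return limiter
    }

    // Writing to a nil map panics, so initialize it lazily.
    if crl.limiters == nil {
        crl.limiters = make(map[string]*AdaptiveRateLimiter)
    }

    // 10, 1, 50 match the documented initial, minimum, and maximum defaults.
    limiter = NewAdaptiveRateLimiter(10, 1, 50)
    crl.limiters[clientID] = limiter
    return limiter
}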

This would allow different rate limits per client/tenant.
