Canary Troubleshooting Guide
This guide covers common failure scenarios when testing canary deployments and how to diagnose and resolve them.
Quick Diagnosis Checklist
When canary issues occur, run these commands first:
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig
# 1. Check canary pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=test
# 2. Check recent canary events
kubectl get events -n mcp --field-selector involvedObject.name=zai-proxy-test --sort-by='.lastTimestamp'
# 3. Stream canary logs
kubectl logs -f -n mcp deployment/zai-proxy-test
# 4. Check canary metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | grep -E "zai_proxy_(requests|errors|tokens)"
# 5. Test canary health endpoint
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/health
Failure Scenarios
Scenario 1: Canary Pods CrashLoopBackOff
Symptoms:
- Pods show CrashLoopBackOff status
- Pod restarts repeatedly
- No logs or only startup logs visible
Diagnosis:
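# Check pod phase, restart count, and waiting reason
# (CrashLoopBackOff is reported as the container's waiting reason, not as the pod phase)
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
-o 'custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,PHASE:.status.phase,REASON:.status.containerStatuses[0].state.waiting.reason'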
# Describe pod to see events
kubectl describe pod -n mcp -l app=zai-proxy,variant=test
# Check previous container logs (if crash occurred)
kubectl logs -n mcp deployment/zai-proxy-test --previous
# Check termination reason
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
-o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Missing environment variable | Add missing env var to deployment manifest |
| Invalid API key | Update zai-api-key secret with valid key |
| Port conflict | Verify containerPort matches application port |
| Out of memory (OOMKilled) | Increase memory limits in deployment |
| Application panic/error | Fix code issue and rebuild image |
Example Fix - Missing Environment Variable:
# Check what's missing
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 10 "Environment"
# Edit deployment to add missing env var
kubectl edit deployment/zai-proxy-test -n mcp
# Add the missing variable to env: section
# Then rollout restart
kubectl rollout restart deployment/zai-proxy-test -n mcp
Example Fix - OOMKilled:
# Check if OOMKilled
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
-o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'
# If output is "OOMKilled", increase memory limit
# Edit deployment
kubectl edit deployment/zai-proxy-test -n mcp
# Change resources:
# limits:
# memory: "128Mi" # Increased from 64Mi
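For a repeatable change, the memory limit can also be set without opening an editor. The following is a minimal sketch using kubectl set resources; the function name and the KUBECTL override variable are illustrative assumptions, not part of the existing tooling.

```shell
# KUBECTL is a hypothetical override so the command can be previewed
# (KUBECTL=echo) before it touches the cluster.
KUBECTL="${KUBECTL:-kubectl}"

set_canary_memory_limit() {
    # $1: new memory limit, e.g. 128Mi
    limit="$1"
    if [ -z "$limit" ]; then
        echo "usage: set_canary_memory_limit <limit>" >&2
        return 1
    fi
    # Applies to every container in the pod template unless -c is given.
    $KUBECTL set resources deployment/zai-proxy-test -n mcp \
        --limits="memory=$limit"
}

# Preview: KUBECTL=echo set_canary_memory_limit 128Mi
# Apply:   set_canary_memory_limit 128Mi
```

Because this changes the pod template, the deployment rolls out new pods on its own; a separate rollout restart should not be needed.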
Scenario 2: Canary Pods Not Ready
Symptoms:
- Pods show Running but 0/1 Ready
- Readiness probe failing
- Pod accepts no traffic
Diagnosis:
# Check readiness probe status
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
-o 'custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[0].ready'
# Describe pod to see probe details
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 5 "Readiness"
# Check if health endpoint responds
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/health
# Check if application is listening on port
kubectl exec -n mcp deployment/zai-proxy-test -- \
netstat -tlnp | grep 8080
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Health endpoint not responding | Fix /health endpoint in code |
| Port mismatch | Fix containerPort or application port |
| Slow startup (not ready before probe) | Increase initialDelaySeconds |
| Dependency not available | Fix external service connectivity |
Example Fix - Slow Startup:
# Edit deployment
kubectl edit deployment/zai-proxy-test -n mcp
# Adjust readiness probe:
# readinessProbe:
# httpGet:
# path: /health
# port: 8080
# initialDelaySeconds: 10 # Increased from 3
# periodSeconds: 5
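The same probe adjustment can be scripted with a JSON patch. This is a sketch under two assumptions: the proxy is the first container in the pod template (index 0), and a readinessProbe is already defined on it. The function name and the KUBECTL override are hypothetical.

```shell
# KUBECTL is a hypothetical override; set KUBECTL=echo to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

set_canary_readiness_delay() {
    # $1: initialDelaySeconds as a non-negative integer
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: set_canary_readiness_delay <seconds>" >&2
            return 1
            ;;
    esac
    # JSON Patch "add" sets the field whether or not it already has a value.
    $KUBECTL patch deployment/zai-proxy-test -n mcp --type=json \
        -p "[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds\",\"value\":$1}]"
}

# Preview: KUBECTL=echo set_canary_readiness_delay 10
```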
Example Fix - Port Mismatch:
# Check what port app is listening on
kubectl exec -n mcp deployment/zai-proxy-test -- \
netstat -tlnp
# If listening on 8081 but probe checking 8080:
kubectl edit deployment/zai-proxy-test -n mcp
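# Change containerPort and the probe port to match the actual port
# (the probe keeps failing if only containerPort is changed):
# ports:
# - containerPort: 8081 # Corrected from 8080
# readinessProbe:
# httpGet:
# port: 8081 # Must target the port the app listens on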
Scenario 3: High Error Rate (>5%)
Symptoms:
- 5xx errors increasing
- Workers report failures
- Prometheus alert firing
Diagnosis:
# Check error rate from metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | \
grep -E "zai_proxy_requests_total.*status=\"5" | sort -t'"' -k5
# Stream logs for errors
kubectl logs -f -n mcp deployment/zai-proxy-test | grep -i error
# Check for rate limiting (429 errors)
kubectl logs -n mcp deployment/zai-proxy-test --tail=100 | grep "429\|rate limit"
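# Check z.ai API connectivity
# (run through sh -c so $ZAI_API_KEY expands inside the pod, not in the local shell)
kubectl exec -n mcp deployment/zai-proxy-test -- sh -c \
'curl -s -w "\n%{http_code}\n" -H "Authorization: Bearer $ZAI_API_KEY" \
-H "Content-Type: application/json" \
https://api.z.ai/v1/chat/completions \
-d "{\"model\":\"glm-4.7\",\"messages\":[],\"max_tokens\":1}"'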
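To turn the raw counters into a figure that can be compared with the 5% threshold, the metrics output can be piped through a small helper. This is a sketch, assuming zai_proxy_requests_total carries a status label holding the HTTP status code and that samples have no trailing timestamp; the function name is hypothetical.

```shell
error_rate_pct() {
    # Reads Prometheus text-format metrics on stdin and prints the share of
    # zai_proxy_requests_total whose status label starts with 5.
    awk '
        /^zai_proxy_requests_total/ {
            total += $NF                       # sample value is the last field
            if ($0 ~ /status="5/) errors += $NF
        }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", errors * 100 / total
        }
    '
}

# Usage:
# kubectl exec -n mcp deployment/zai-proxy-test -- \
#     curl -s http://localhost:8080/metrics | error_rate_pct
```

The counters are cumulative since the pod started, so this is a lifetime error rate rather than a rate over a recent window.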
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Rate limiting from z.ai | Reduce RATE_LIMIT_INITIAL/MAX values |
| Invalid API key | Update zai-api-key secret |
| Upstream API changes | Check z.ai API status/docs |
| Token counting errors | Fix tokenizer logic |
| Request timeout errors | Increase client timeout settings |
Example Fix - Rate Limiting:
# Check current rate limit settings
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | grep rate_limit
# Edit deployment to reduce rate limits
kubectl edit deployment/zai-proxy-test -n mcp
# Adjust rate limit env vars:
# env:
# - name: RATE_LIMIT_INITIAL
# value: "1" # Reduced from 2
# - name: RATE_LIMIT_MAX
# value: "3" # Reduced from 5
# Rollout restart to apply
kubectl rollout restart deployment/zai-proxy-test -n mcp
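As a non-interactive alternative to editing the manifest, both variables can be set in one step with kubectl set env. A minimal sketch follows; the function name and the KUBECTL override are hypothetical, and the values are the ones from the example above.

```shell
# KUBECTL is a hypothetical override; set KUBECTL=echo to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

set_canary_rate_limits() {
    # $1: RATE_LIMIT_INITIAL, $2: RATE_LIMIT_MAX
    if [ "$#" -ne 2 ]; then
        echo "usage: set_canary_rate_limits <initial> <max>" >&2
        return 1
    fi
    $KUBECTL set env deployment/zai-proxy-test -n mcp \
        "RATE_LIMIT_INITIAL=$1" "RATE_LIMIT_MAX=$2"
}

# Preview: KUBECTL=echo set_canary_rate_limits 1 3
```

Changing environment variables modifies the pod template, so new pods are rolled out automatically.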
Example Fix - Invalid API Key:
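# Test if API key works
# (run through sh -c so $ZAI_API_KEY expands inside the pod; prints only the HTTP status)
kubectl exec -n mcp deployment/zai-proxy-test -- sh -c \
'curl -s -o /dev/null -w "%{http_code}\n" https://api.z.ai/v1/models -H "Authorization: Bearer $ZAI_API_KEY"'
# If this prints 401, update the secret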
kubectl create secret generic zai-api-key -n mcp \
--from-literal=api-key='NEW_API_KEY_HERE' \
--dry-run=client -o yaml | kubectl apply -f -
# Rollout restart to pick up new key
kubectl rollout restart deployment/zai-proxy-test -n mcp
Scenario 4: Token Counting Broken
Symptoms:
- No token metrics in /metrics
- Logs show "Token counting failed"
- Token values are zero or incorrect
Diagnosis:
# Check for token metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | grep zai_proxy_tokens
# Check logs for token errors
kubectl logs -n mcp deployment/zai-proxy-test --tail=100 | grep -i "token"
# Check if DEPLOYMENT_VARIANT is set
kubectl exec -n mcp deployment/zai-proxy-test -- \
printenv DEPLOYMENT_VARIANT
# Verify tokenizer is working
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | grep token_count_duration
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| DEPLOYMENT_VARIANT not set | Add DEPLOYMENT_VARIANT=canary to env |
| Tokenizer initialization failed | Fix tokenizer code and rebuild |
| Tokenizer config invalid | Update tokenizer configuration |
| Memory pressure (tokenizer OOM) | Increase memory limits |
Example Fix - Missing DEPLOYMENT_VARIANT:
# Check if env var is set
kubectl exec -n mcp deployment/zai-proxy-test -- printenv DEPLOYMENT_VARIANT
# If not set, edit deployment
kubectl edit deployment/zai-proxy-test -n mcp
# Add to env section:
# - name: DEPLOYMENT_VARIANT
# value: "canary"
Scenario 5: Latency Degradation
Symptoms:
- P95 latency >2x baseline
- Slow response times
- Prometheus latency alert firing
Diagnosis:
# Check request duration metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | \
grep zai_proxy_request_duration_seconds | grep -v "0\." | sort -t'"' -k5
# Check token counting duration
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | \
grep token_count_duration_seconds | sort -t'"' -k5
# Check resource usage
kubectl top pods -n mcp -l app=zai-proxy,variant=test
# Check CPU limits (usage close to a low limit suggests throttling)
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 3 "Limits"
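The "P95 latency >2x baseline" symptom can be checked mechanically once both values are known. This is a sketch with a hypothetical function name; it takes the two P95 values in seconds and uses awk for the floating-point comparison.

```shell
latency_degraded() {
    # $1: canary P95 in seconds, $2: baseline P95 in seconds
    # Exit status 0 when the canary exceeds twice the baseline, 1 otherwise.
    awk -v c="$1" -v b="$2" 'BEGIN { exit !(c + 0 > 2 * (b + 0)) }'
}

# Usage:
# if latency_degraded "$CANARY_P95" "$PROD_P95"; then
#     echo "canary P95 is more than 2x baseline"
# fi
```

The two inputs can be taken from the quantile="0.95" lines of the request duration metric, in the same way the comparison script later in this guide extracts them.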
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| CPU throttling | Increase CPU limits |
| Token counting slow | Optimize tokenizer or increase resources |
| Upstream API slow | Check z.ai API status |
| Network latency | Check cluster network issues |
| Memory pressure (GC) | Increase memory limits |
Example Fix - CPU Throttling:
# Check current CPU limits
kubectl get deployment/zai-proxy-test -n mcp \
-o jsonpath='{.spec.template.spec.containers[0].resources.limits.cpu}'
# If low (e.g., 100m), increase it
kubectl edit deployment/zai-proxy-test -n mcp
# Change:
# resources:
# limits:
# cpu: "200m" # Increased from 100m
Scenario 6: Image Pull Errors
Symptoms:
- Pods stuck in ImagePullBackOff or ErrImagePull
- Pod events show "Failed to pull image"
Diagnosis:
# Check pod events for pull errors
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 5 "Events:"
# Check what image is being pulled
kubectl get deployment/zai-proxy-test -n mcp \
-o jsonpath='{.spec.template.spec.containers[0].image}' && echo
# Verify image exists (from devpod)
curl -s https://hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/ | \
jq '.results[].name' | grep VERSION
# Check if image pull secret exists
kubectl get secret docker-hub-registry -n mcp
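The tag listing is paginated, so a grep over the first page may miss older tags. The sketch below checks one specific tag instead, assuming Docker Hub answers 200 for an existing tag and 404 otherwise on its per-tag endpoint and that the repository is public. The function name and the CURL override are hypothetical.

```shell
# CURL is a hypothetical override, useful for testing without network access.
CURL="${CURL:-curl}"

image_tag_exists() {
    # $1: tag to look up in ronaldraygun/zai-proxy
    code=$($CURL -s -o /dev/null -w '%{http_code}' \
        "https://hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/$1/")
    [ "$code" = "200" ]
}

# Usage:
# image_tag_exists VERSION || echo "tag not found on Docker Hub"
```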
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Image tag doesn't exist | Build and push image with correct tag |
| Wrong registry/username | Fix image name in deployment |
| Expired registry credentials | Update docker-hub-registry secret |
| Registry rate limiting | Wait or use authenticated pulls |
| Network issue | Check cluster egress to Docker Hub |
Example Fix - Image Tag Doesn't Exist:
# List available tags
curl -s https://hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/ | \
jq '.results[].name'
# If tag doesn't exist, build it
cd /home/coder/ardenone-cluster/containers/zai-proxy
docker build -t ronaldraygun/zai-proxy:VERSION .
docker push ronaldraygun/zai-proxy:VERSION
# Or update deployment to use existing tag
kubectl set image deployment/zai-proxy-test \
proxy=ronaldraygun/zai-proxy:EXISTING_TAG -n mcp
Example Fix - Update Credentials:
# Create new docker-registry secret
kubectl create secret docker-registry docker-hub-registry -n mcp \
--docker-server=https://index.docker.io/v1/ \
--docker-username=USERNAME \
--docker-password=PASSWORD \
--docker-email=EMAIL \
--dry-run=client -o yaml | kubectl apply -f -
# Rollout restart to use new secret
kubectl rollout restart deployment/zai-proxy-test -n mcp
Scenario 7: Workers Can't Connect
Symptoms:
- Workers report connection refused
- No requests in canary logs
- Workers fall back to production
Diagnosis:
# Check canary service exists
kubectl get svc zai-proxy-test -n mcp
# Check service endpoints
kubectl get endpoints zai-proxy-test -n mcp
# Test from devpod
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health
# Check service selector matches pods
kubectl get svc zai-proxy-test -n mcp \
-o jsonpath='{.spec.selector}' && echo
# Check pod labels
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
-o jsonpath='{.items[0].metadata.labels}' && echo
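The endpoints check above can be reduced to a pass/fail test. This is a minimal sketch with a hypothetical function name and KUBECTL override; it treats an empty address list as "no ready pods behind the service", which covers both a selector mismatch and pods that are failing readiness.

```shell
# KUBECTL is a hypothetical override, useful for testing without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

canary_has_endpoints() {
    # Ready pod IPs appear under subsets[].addresses; pods that are not
    # ready are listed separately and are ignored here.
    ips=$($KUBECTL get endpoints zai-proxy-test -n mcp \
        -o jsonpath='{.subsets[*].addresses[*].ip}')
    if [ -n "$ips" ]; then
        echo "ready endpoints: $ips"
    else
        echo "no ready endpoints behind zai-proxy-test" >&2
        return 1
    fi
}

# Usage:
# canary_has_endpoints || echo "check the service selector and pod readiness"
```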
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Service doesn't exist | Create service |
| Selector mismatch | Fix service selector or pod labels |
| No ready pods | Fix pod issues first |
| Network policy blocking | Update network policies |
| DNS not resolving | Check CoreDNS |
Example Fix - Selector Mismatch:
# Check service selector
kubectl get svc zai-proxy-test -n mcp -o jsonpath='{.spec.selector}'
# Check pod labels
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
-o jsonpath='{.items[0].metadata.labels}'
# If mismatch, edit service
kubectl edit svc zai-proxy-test -n mcp
# Fix selector to match pod labels:
# selector:
# app: zai-proxy
# variant: test # Ensure this matches pod labels
Scenario 8: Prometheus Alerts Firing
Symptoms:
- Alerts visible in Grafana/Alertmanager
- Alertmanager notifications sent
- Metrics show threshold violations
Diagnosis:
# Check current alert values
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | \
grep -E "(zai_proxy_requests_total|zai_proxy_errors)"
# Compare with production
kubectl exec -n mcp deployment/zai-proxy -- \
curl -s http://localhost:8080/metrics | \
grep -E "(zai_proxy_requests_total|zai_proxy_errors)"
# Check alert rules
kubectl get prometheusrule -n mcp zai-proxy-canary -o yaml
# Check alert status in Prometheus
# (Access Prometheus UI and check alerts)
Common Alerts & Fixes:
| Alert | Meaning | Fix |
|---|---|---|
| ZaiProxyCanaryDeploymentDown | Canary pods not ready | Fix pod health issues |
| ZaiProxyCanaryErrorRateHigherThanProduction | Canary error rate > production | Fix canary errors or rollback |
| ZaiProxyCanaryHighErrorRate | Error rate >5% | Fix upstream issues |
| ZaiProxyCanaryLatencyDegraded | P95 latency >2x baseline | Optimize or increase resources |
Example Fix - Error Rate Higher Than Production:
# Check canary error rate
kubectl exec -n mcp deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics | grep error
# Check production error rate
kubectl exec -n mcp deployment/zai-proxy -- \
curl -s http://localhost:8080/metrics | grep error
# If canary significantly worse, investigate logs
kubectl logs -n mcp deployment/zai-proxy-test --tail=500 | grep -i error
# If the problem is specific to the canary, roll back by scaling it to zero
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
Diagnostic Scripts
Full Health Check Script
#!/bin/bash
# canary-health-check.sh - Comprehensive canary health check
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig
NAMESPACE="mcp"
DEPLOYMENT="zai-proxy-test"
echo "=== Canary Health Check ==="
echo
# 1. Pod Status
echo "1. Pod Status:"
kubectl get pods -n $NAMESPACE -l app=zai-proxy,variant=test
echo
# 2. Recent Events
echo "2. Recent Events (last 5):"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$DEPLOYMENT \
--sort-by='.lastTimestamp' | tail -5
echo
# 3. Error Rate
# Counters are cumulative since pod start, so this is a lifetime rate.
# The pattern status="5 (no closing quote) matches every 5xx status code.
echo "3. Error Rate (cumulative since pod start):"
ERROR_RATE=$(kubectl exec -n $NAMESPACE deployment/$DEPLOYMENT -- \
curl -s http://localhost:8080/metrics | \
grep '^zai_proxy_requests_total.*status="5' | awk '{sum+=$NF} END {print sum+0}')
TOTAL_RATE=$(kubectl exec -n $NAMESPACE deployment/$DEPLOYMENT -- \
curl -s http://localhost:8080/metrics | \
grep '^zai_proxy_requests_total' | awk '{sum+=$NF} END {print sum+0}')
if [ "$TOTAL_RATE" -gt 0 ]; then
echo "Error rate: $(echo "scale=2; $ERROR_RATE * 100 / $TOTAL_RATE" | bc)%"
else
echo "No requests recorded"
fi
echo
# 4. Token Counting
echo "4. Token Counting:"
TOKEN_METRICS=$(kubectl exec -n $NAMESPACE deployment/$DEPLOYMENT -- \
curl -s http://localhost:8080/metrics | grep zai_proxy_tokens_total)
if [ -n "$TOKEN_METRICS" ]; then
echo "$TOKEN_METRICS"
else
echo "WARNING: No token metrics found"
fi
echo
# 5. Recent Errors in Logs
echo "5. Recent Errors (last 5):"
kubectl logs -n $NAMESPACE deployment/$DEPLOYMENT --tail=100 | grep -i error | tail -5
echo
# 6. Resource Usage
echo "6. Resource Usage:"
kubectl top pods -n $NAMESPACE -l app=zai-proxy,variant=test
echo
echo "=== Health Check Complete ==="
Compare Canary vs Production
#!/bin/bash
# compare-variants.sh - Compare canary vs production metrics
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig
NAMESPACE="mcp"
echo "=== Canary vs Production Comparison ==="
echo
# Get metrics from both
CANARY_METRICS=$(kubectl exec -n $NAMESPACE deployment/zai-proxy-test -- \
curl -s http://localhost:8080/metrics)
PROD_METRICS=$(kubectl exec -n $NAMESPACE deployment/zai-proxy -- \
curl -s http://localhost:8080/metrics)
# Compare error rates
echo "Error Rates:"
CANARY_ERR=$(echo "$CANARY_METRICS" | grep '^zai_proxy_requests_total.*status="5' | awk '{sum+=$NF} END {print sum+0}')
PROD_ERR=$(echo "$PROD_METRICS" | grep '^zai_proxy_requests_total.*status="5' | awk '{sum+=$NF} END {print sum+0}')
echo " Canary: $CANARY_ERR errors"
echo " Production: $PROD_ERR errors"
echo
# Compare P95 latency
echo "P95 Latency:"
CANARY_P95=$(echo "$CANARY_METRICS" | grep 'quantile="0.95"' | grep request_duration | awk '{print $2}')
PROD_P95=$(echo "$PROD_METRICS" | grep 'quantile="0.95"' | grep request_duration | awk '{print $2}')
echo " Canary: ${CANARY_P95}s"
echo " Production: ${PROD_P95}s"
echo
# Compare token counting
echo "Token Counting:"
CANARY_TOK=$(echo "$CANARY_METRICS" | grep 'zai_proxy_tokens_total' | awk '{sum+=$2} END {print sum+0}')
PROD_TOK=$(echo "$PROD_METRICS" | grep 'zai_proxy_tokens_total' | awk '{sum+=$2} END {print sum+0}')
echo " Canary: $CANARY_TOK tokens"
echo " Production: $PROD_TOK tokens"
echo
echo "=== Comparison Complete ==="
Quick Reference Card
| Status | Check | Fix |
|---|---|---|
| CrashLoop | kubectl logs --previous | Fix app / rebuild |
| NotReady | kubectl describe pod | Fix probes / port |
| ImagePullBack | kubectl get secret | Push image / fix auth |
| 5xx Errors | kubectl logs \| grep error | Fix rate limit / key |
| High Latency | kubectl top pods | Increase CPU/mem |
| No Metrics | kubectl exec -- curl /m | Fix DEPLOYMENT_VARIANT |
| No Conn | kubectl get endpoints | Fix service selector |

Emergency rollback:
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
Related Documentation
- CANARY_ROLLBACK_PROCEDURE.md - Rollback procedure
- CANARY_PROMOTION_PROCEDURE.md - Promotion procedure
- CANARY_PROMOTION_CHECKLIST.md - Promotion checklist
- TROUBLESHOOTING.md - General troubleshooting
Getting Help
If issues persist after troubleshooting:
- Check Grafana dashboard: grafana-dashboard-zai-proxy-canary-comparison
- Review Prometheus alerts in Alertmanager
- Check recent deployment changes: kubectl rollout history deployment/zai-proxy-test -n mcp
- Contact on-call for zai-proxy service
- Create issue in repository with diagnostic output