zai-proxy/docs/notes/CANARY_TROUBLESHOOTING_GUIDE.md

# Canary Troubleshooting Guide

This guide covers common failure scenarios when testing canary deployments and how to diagnose and resolve them.

## Quick Diagnosis Checklist

When canary issues occur, run these commands first:

```bash
export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig

# 1. Check canary pod status
kubectl get pods -n mcp -l app=zai-proxy,variant=test

# 2. Check recent canary events
kubectl get events -n mcp --field-selector involvedObject.name=zai-proxy-test --sort-by='.lastTimestamp'

# 3. Stream canary logs
kubectl logs -f -n mcp deployment/zai-proxy-test

# 4. Check canary metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | grep -E "zai_proxy_(requests|errors|tokens)"

# 5. Test canary health endpoint
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/health
```

## Failure Scenarios

### Scenario 1: Canary Pods CrashLoopBackOff

**Symptoms:**
- Pods show `CrashLoopBackOff` status
- Pod restarts repeatedly
- No logs or only startup logs visible

**Diagnosis:**
```bash
# Check pod status and restart count
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,STATUS:.status.phase

# Describe pod to see events
kubectl describe pod -n mcp -l app=zai-proxy,variant=test

# Check previous container logs (if crash occurred)
kubectl logs -n mcp deployment/zai-proxy-test --previous

# Check termination reason
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'
```

**Common Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| Missing environment variable | Add missing env var to deployment manifest |
| Invalid API key | Update `zai-api-key` secret with valid key |
| Port conflict | Verify containerPort matches application port |
| Out of memory (OOMKilled) | Increase memory limits in deployment |
| Application panic/error | Fix code issue and rebuild image |

**Example Fix - Missing Environment Variable:**
```bash
# Check what's missing
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 10 "Environment"

# Edit deployment to add missing env var
kubectl edit deployment/zai-proxy-test -n mcp

# Add the missing variable to env: section
# Then rollout restart
kubectl rollout restart deployment/zai-proxy-test -n mcp
```

**Example Fix - OOMKilled:**
```bash
# Check if OOMKilled
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'
# If output is "OOMKilled", increase memory limit

# Edit deployment
kubectl edit deployment/zai-proxy-test -n mcp

# Change resources:
#   limits:
#     memory: "128Mi"  # Increased from 64Mi
```

---

### Scenario 2: Canary Pods Not Ready

**Symptoms:**
- Pods show `Running` but `0/1` Ready
- Readiness probe failing
- Pod accepts no traffic

**Diagnosis:**
```bash
# Check readiness probe status
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
  -o custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[0].ready

# Describe pod to see probe details
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 5 "Readiness"

# Check if health endpoint responds
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/health

# Check if application is listening on port
kubectl exec -n mcp deployment/zai-proxy-test -- \
  netstat -tlnp | grep 8080
```

**Common Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| Health endpoint not responding | Fix /health endpoint in code |
| Port mismatch | Fix containerPort or application port |
| Slow startup (not ready before probe) | Increase initialDelaySeconds |
| Dependency not available | Fix external service connectivity |

**Example Fix - Slow Startup:**
```bash
# Edit deployment
kubectl edit deployment/zai-proxy-test -n mcp

# Adjust readiness probe:
#   readinessProbe:
#     httpGet:
#       path: /health
#       port: 8080
#     initialDelaySeconds: 10  # Increased from 3
#     periodSeconds: 5
```

**Example Fix - Port Mismatch:**
```bash
# Check what port app is listening on
kubectl exec -n mcp deployment/zai-proxy-test -- \
  netstat -tlnp

# If listening on 8081 but probe checking 8080:
kubectl edit deployment/zai-proxy-test -n mcp

# Change containerPort to match actual port:
#   ports:
#   - containerPort: 8081  # Corrected from 8080
```

---

### Scenario 3: High Error Rate (>5%)

**Symptoms:**
- 5xx errors increasing
- Workers report failures
- Prometheus alert firing

**Diagnosis:**
```bash
# Check error rate from metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | \
  grep -E "zai_proxy_requests_total.*status=\"5" | sort -t'"' -k5

# Stream logs for errors
kubectl logs -f -n mcp deployment/zai-proxy-test | grep -i error

# Check for rate limiting (429 errors)
kubectl logs -n mcp deployment/zai-proxy-test --tail=100 | grep "429\|rate limit"

# Check z.ai API connectivity
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s -w "\n%{http_code}\n" -H "Authorization: Bearer $ZAI_API_KEY" \
  https://api.z.ai/v1/chat/completions -d '{"model":"glm-4.7","messages":[],"max_tokens":1}'
```

**Common Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| Rate limiting from z.ai | Reduce RATE_LIMIT_INITIAL/MAX values |
| Invalid API key | Update zai-api-key secret |
| Upstream API changes | Check z.ai API status/docs |
| Token counting errors | Fix tokenizer logic |
| Request timeout errors | Increase client timeout settings |

**Example Fix - Rate Limiting:**
```bash
# Check current rate limit settings
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | grep rate_limit

# Edit deployment to reduce rate limits
kubectl edit deployment/zai-proxy-test -n mcp

# Adjust rate limit env vars:
#   env:
#   - name: RATE_LIMIT_INITIAL
#     value: "1"  # Reduced from 2
#   - name: RATE_LIMIT_MAX
#     value: "3"  # Reduced from 5

# Rollout restart to apply
kubectl rollout restart deployment/zai-proxy-test -n mcp
```

**Example Fix - Invalid API Key:**
```bash
# Test if API key works
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s https://api.z.ai/v1/models -H "Authorization: Bearer $ZAI_API_KEY"

# If returns 401, update the secret
kubectl create secret generic zai-api-key -n mcp \
  --from-literal=api-key='NEW_API_KEY_HERE' \
  --dry-run=client -o yaml | kubectl apply -f -

# Rollout restart to pick up new key
kubectl rollout restart deployment/zai-proxy-test -n mcp
```

---

### Scenario 4: Token Counting Broken

**Symptoms:**
- No token metrics in /metrics
- Logs show "Token counting failed"
- Token values are zero or incorrect

**Diagnosis:**
```bash
# Check for token metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | grep zai_proxy_tokens

# Check logs for token errors
kubectl logs -n mcp deployment/zai-proxy-test --tail=100 | grep -i "token"

# Check if DEPLOYMENT_VARIANT is set
kubectl exec -n mcp deployment/zai-proxy-test -- \
  printenv DEPLOYMENT_VARIANT

# Verify tokenizer is working
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | grep token_count_duration
```

**Common Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| DEPLOYMENT_VARIANT not set | Add DEPLOYMENT_VARIANT=canary to env |
| Tokenizer initialization failed | Fix tokenizer code and rebuild |
| Tokenizer config invalid | Update tokenizer configuration |
| Memory pressure (tokenizer OOM) | Increase memory limits |

**Example Fix - Missing DEPLOYMENT_VARIANT:**
```bash
# Check if env var is set
kubectl exec -n mcp deployment/zai-proxy-test -- printenv DEPLOYMENT_VARIANT

# If not set, edit deployment
kubectl edit deployment/zai-proxy-test -n mcp

# Add to env section:
#   - name: DEPLOYMENT_VARIANT
#     value: "canary"
```

---

### Scenario 5: Latency Degradation

**Symptoms:**
- P95 latency >2x baseline
- Slow response times
- Prometheus latency alert firing

**Diagnosis:**
```bash
# Check request duration metrics
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | \
  grep zai_proxy_request_duration_seconds | grep -v "0\." | sort -t'"' -k5

# Check token counting duration
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | \
  grep token_count_duration_seconds | sort -t'"' -k5

# Check resource usage
kubectl top pods -n mcp -l app=zai-proxy,variant=test

# Check for high CPU throttling
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 3 "Limits"
```

**Common Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| CPU throttling | Increase CPU limits |
| Token counting slow | Optimize tokenizer or increase resources |
| Upstream API slow | Check z.ai API status |
| Network latency | Check cluster network issues |
| Memory pressure (GC) | Increase memory limits |

**Example Fix - CPU Throttling:**
```bash
# Check current CPU limits
kubectl get deployment/zai-proxy-test -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.cpu}'

# If low (e.g., 100m), increase it
kubectl edit deployment/zai-proxy-test -n mcp

# Change:
#   resources:
#     limits:
#       cpu: "200m"  # Increased from 100m
```

---

### Scenario 6: Image Pull Errors

**Symptoms:**
- Pods stuck in `ImagePullBackOff` or `ErrImagePull`
- Pod events show "Failed to pull image"

**Diagnosis:**
```bash
# Check pod events for pull errors
kubectl describe pod -n mcp -l app=zai-proxy,variant=test | grep -A 5 "Events:"

# Check what image is being pulled
kubectl get deployment/zai-proxy-test -n mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}' && echo

# Verify image exists (from devpod)
curl -s https://hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/ | \
  jq '.results[].name' | grep VERSION

# Check if image pull secret exists
kubectl get secret docker-hub-registry -n mcp
```

**Common Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| Image tag doesn't exist | Build and push image with correct tag |
| Wrong registry/username | Fix image name in deployment |
| Expired registry credentials | Update docker-hub-registry secret |
| Registry rate limiting | Wait or use authenticated pulls |
| Network issue | Check cluster egress to Docker Hub |

**Example Fix - Image Tag Doesn't Exist:**
```bash
# List available tags
curl -s https://hub.docker.com/v2/repositories/ronaldraygun/zai-proxy/tags/ | \
  jq '.results[].name'

# If tag doesn't exist, build it
cd /home/coder/ardenone-cluster/containers/zai-proxy
docker build -t ronaldraygun/zai-proxy:VERSION .
docker push ronaldraygun/zai-proxy:VERSION

# Or update deployment to use existing tag
kubectl set image deployment/zai-proxy-test \
  proxy=ronaldraygun/zai-proxy:EXISTING_TAG -n mcp
```

**Example Fix - Update Credentials:**
```bash
# Create new docker-registry secret
kubectl create secret docker-registry docker-hub-registry -n mcp \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=USERNAME \
  --docker-password=PASSWORD \
  --docker-email=EMAIL \
  --dry-run=client -o yaml | kubectl apply -f -

# Rollout restart to use new secret
kubectl rollout restart deployment/zai-proxy-test -n mcp
```

---

### Scenario 7: Workers Can't Connect

**Symptoms:**
- Workers report connection refused
- No requests in canary logs
- Workers fall back to production

**Diagnosis:**
```bash
# Check canary service exists
kubectl get svc zai-proxy-test -n mcp

# Check service endpoints
kubectl get endpoints zai-proxy-test -n mcp

# Test from devpod
curl -s http://zai-proxy-test.mcp.svc.cluster.local:8080/health

# Check service selector matches pods
kubectl get svc zai-proxy-test -n mcp \
  -o jsonpath='{.spec.selector}' && echo

# Check pod labels
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
  -o jsonpath='{.items[0].metadata.labels}' && echo
```

**Common Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| Service doesn't exist | Create service |
| Selector mismatch | Fix service selector or pod labels |
| No ready pods | Fix pod issues first |
| Network policy blocking | Update network policies |
| DNS not resolving | Check CoreDNS |

**Example Fix - Selector Mismatch:**
```bash
# Check service selector
kubectl get svc zai-proxy-test -n mcp -o jsonpath='{.spec.selector}'

# Check pod labels
kubectl get pods -n mcp -l app=zai-proxy,variant=test \
  -o jsonpath='{.items[0].metadata.labels}'

# If mismatch, edit service
kubectl edit svc zai-proxy-test -n mcp

# Fix selector to match pod labels:
#   selector:
#     app: zai-proxy
#     variant: test  # Ensure this matches pod labels
```

---

### Scenario 8: Prometheus Alerts Firing

**Symptoms:**
- Alerts visible in Grafana/AlertManager
- Alertmanager notifications sent
- Metrics show threshold violations

**Diagnosis:**
```bash
# Check current alert values
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | \
  grep -E "(zai_proxy_requests_total|zai_proxy_errors)"

# Compare with production
kubectl exec -n mcp deployment/zai-proxy -- \
  curl -s http://localhost:8080/metrics | \
  grep -E "(zai_proxy_requests_total|zai_proxy_errors)"

# Check alert rules
kubectl get prometheusrule -n mcp zai-proxy-canary -o yaml

# Check alert status in Prometheus
# (Access Prometheus UI and check alerts)
```

**Common Alerts & Fixes:**

| Alert | Meaning | Fix |
|-------|---------|-----|
| ZaiProxyCanaryDeploymentDown | Canary pods not ready | Fix pod health issues |
| ZaiProxyCanaryErrorRateHigherThanProduction | Canary error rate > production | Fix canary errors or rollback |
| ZaiProxyCanaryHighErrorRate | Error rate >5% | Fix upstream issues |
| ZaiProxyCanaryLatencyDegraded | P95 latency >2x baseline | Optimize or increase resources |

**Example Fix - Error Rate Higher Than Production:**
```bash
# Check canary error rate
kubectl exec -n mcp deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics | grep error

# Check production error rate
kubectl exec -n mcp deployment/zai-proxy -- \
  curl -s http://localhost:8080/metrics | grep error

# If canary significantly worse, investigate logs
kubectl logs -n mcp deployment/zai-proxy-test --tail=500 | grep -i error

# If canary issue, rollback to production
kubectl scale deployment/zai-proxy-test -n mcp --replicas=0
```

---

## Diagnostic Scripts

### Full Health Check Script

```bash
#!/bin/bash
# canary-health-check.sh - Comprehensive canary health check

export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig
NAMESPACE="mcp"
DEPLOYMENT="zai-proxy-test"

echo "=== Canary Health Check ==="
echo

# 1. Pod Status
echo "1. Pod Status:"
kubectl get pods -n $NAMESPACE -l app=zai-proxy,variant=test
echo

# 2. Recent Events
echo "2. Recent Events (last 5):"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$DEPLOYMENT \
  --sort-by='.lastTimestamp' | tail -5
echo

# 3. Error Rate
echo "3. Error Rate (last 5 min):"
ERROR_RATE=$(kubectl exec -n $NAMESPACE deployment/$DEPLOYMENT -- \
  curl -s http://localhost:8080/metrics | \
  grep 'zai_proxy_requests_total.*status="5"' | awk '{sum+=$2} END {print sum+0}')
TOTAL_RATE=$(kubectl exec -n $NAMESPACE deployment/$DEPLOYMENT -- \
  curl -s http://localhost:8080/metrics | \
  grep 'zai_proxy_requests_total' | awk '{sum+=$2} END {print sum+0}')
if [ "$TOTAL_RATE" -gt 0 ]; then
  echo "Error rate: $(echo "scale=2; $ERROR_RATE * 100 / $TOTAL_RATE" | bc)%"
else
  echo "No requests recorded"
fi
echo

# 4. Token Counting
echo "4. Token Counting:"
TOKEN_METRICS=$(kubectl exec -n $NAMESPACE deployment/$DEPLOYMENT -- \
  curl -s http://localhost:8080/metrics | grep zai_proxy_tokens_total)
if [ -n "$TOKEN_METRICS" ]; then
  echo "$TOKEN_METRICS"
else
  echo "WARNING: No token metrics found"
fi
echo

# 5. Recent Errors in Logs
echo "5. Recent Errors (last 10):"
kubectl logs -n $NAMESPACE deployment/$DEPLOYMENT --tail=100 | grep -i error | tail -5
echo

# 6. Resource Usage
echo "6. Resource Usage:"
kubectl top pods -n $NAMESPACE -l app=zai-proxy,variant=test
echo

echo "=== Health Check Complete ==="
```

### Compare Canary vs Production

```bash
#!/bin/bash
# compare-variants.sh - Compare canary vs production metrics

export KUBECONFIG=/home/coder/.kube/apexalgo-iad.kubeconfig
NAMESPACE="mcp"

echo "=== Canary vs Production Comparison ==="
echo

# Get metrics from both
CANARY_METRICS=$(kubectl exec -n $NAMESPACE deployment/zai-proxy-test -- \
  curl -s http://localhost:8080/metrics)
PROD_METRICS=$(kubectl exec -n $NAMESPACE deployment/zai-proxy -- \
  curl -s http://localhost:8080/metrics)

# Compare error rates
echo "Error Rates:"
CANARY_ERR=$(echo "$CANARY_METRICS" | grep 'zai_proxy_requests_total.*status="5"' | awk '{sum+=$2} END {print sum+0}')
PROD_ERR=$(echo "$PROD_METRICS" | grep 'zai_proxy_requests_total.*status="5"' | awk '{sum+=$2} END {print sum+0}')
echo "  Canary:   $CANARY_ERR errors"
echo "  Production: $PROD_ERR errors"
echo

# Compare P95 latency
echo "P95 Latency:"
CANARY_P95=$(echo "$CANARY_METRICS" | grep 'quantile="0.95"' | grep request_duration | awk '{print $2}')
PROD_P95=$(echo "$PROD_METRICS" | grep 'quantile="0.95"' | grep request_duration | awk '{print $2}')
echo "  Canary:   ${CANARY_P95}s"
echo "  Production: ${PROD_P95}s"
echo

# Compare token counting
echo "Token Counting:"
CANARY_TOK=$(echo "$CANARY_METRICS" | grep 'zai_proxy_tokens_total' | awk '{sum+=$2} END {print sum+0}')
PROD_TOK=$(echo "$PROD_METRICS" | grep 'zai_proxy_tokens_total' | awk '{sum+=$2} END {print sum+0}')
echo "  Canary:   $CANARY_TOK tokens"
echo "  Production: $PROD_TOK tokens"
echo

echo "=== Comparison Complete ==="
```

## Quick Reference Card

```
┌─────────────────────────────────────────────────────────────────┐
│                    CANARY TROUBLESHOOTING                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  STATUS          CHECK                    FIX                    │
│  ─────────────────────────────────────────────────────────────  │
│  CrashLoop      kubectl logs --previous   Fix app / rebuild     │
│  NotReady       kubectl describe pod      Fix probes / port     │
│  ImagePullBack  kubectl get secret        Push image / fix auth │
│  5xx Errors     kubectl logs | grep error  Fix rate limit / key │
│  High Latency   kubectl top pods          Increase CPU/mem      │
│  No Metrics     kubectl exec -- curl /m   Fix DEPLOYMENT_VARIANT│
│  No Conn        kubectl get endpoints     Fix service selector  │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│  EMERGENCY ROLLBACK:                                            │
│  kubectl scale deployment/zai-proxy-test -n mcp --replicas=0    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

## Related Documentation

- [CANARY_ROLLBACK_PROCEDURE.md](CANARY_ROLLBACK_PROCEDURE.md) - Rollback procedure
- [CANARY_PROMOTION_PROCEDURE.md](CANARY_PROMOTION_PROCEDURE.md) - Promotion procedure
- [CANARY_PROMOTION_CHECKLIST.md](CANARY_PROMOTION_CHECKLIST.md) - Promotion checklist
- [TROUBLESHOOTING.md](TROUBLESHOOTING.md) - General troubleshooting

## Getting Help

If issues persist after troubleshooting:

1. Check Grafana dashboard: `grafana-dashboard-zai-proxy-canary-comparison`
2. Review Prometheus alerts in AlertManager
3. Check recent deployment changes: `kubectl rollout history deployment/zai-proxy-test -n mcp`
4. Contact on-call for zai-proxy service
5. Create issue in repository with diagnostic output