# Diagnostic Playbook

This playbook provides a systematic approach to diagnosing issues in a Miroir cluster. Run these steps in order when investigating any problem.

## Prerequisites

Set up your environment:

```bash
export MIROIR_URL="https://miroir.example.com"
export MIROIR_KEY="your-admin-key"
export NAMESPACE="search"  # adjust if needed
```

## Step 1: Check Cluster Health

### 1.1 Verify all pods are running

```bash
kubectl get pods -n $NAMESPACE
```

**Expected output**: All pods in `Running` state, Ready 1/1 or 2/2.

**Common issues**:
- Pods in `Pending` → resource constraints, scheduler issues
- Pods in `CrashLoopBackOff` → config errors, OOM kills
- Pods with `Ready: 0/1` → startup probe failing, dependency unavailable

### 1.2 Check recent pod restarts

```bash
# Sum restarts across all containers in each pod (pods may run 2/2)
kubectl get pods -n $NAMESPACE -o json | jq -r '.items[] | "\(.metadata.name): \([.status.containerStatuses[]?.restartCount] | add // 0) restarts"'
```

**Action**: Investigate pods with > 3 restarts in the last hour. `restartCount` is cumulative over the pod's lifetime, so compare it against the pod's age or a previous reading.

### 1.3 Check resource usage

```bash
kubectl top pods -n $NAMESPACE
kubectl top nodes
```

**Action**: If CPU/memory limits are hit, consider scaling up or adjusting limits.

## Step 2: Check Miroir Topology

### 2.1 Get topology overview

```bash
curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '.'
```

**Expected output**:

```json
{
  "shards": 128,
  "replication_factor": 2,
  "nodes": [
    {"node_id": "node-0", "status": "active", "shards": [...]},
    {"node_id": "node-1", "status": "active", "shards": [...]},
    {"node_id": "node-2", "status": "active", "shards": [...]}
  ]
}
```

**Common issues**:
- `status: "degraded"` → node is unreachable or unhealthy
- `status: "draining"` → node migration in progress
- `shards: []` → node has no assigned shards (newly added)

### 2.2 Check for degraded shards

```bash
# Expected replica count is shards × replication_factor (128 × 2 = 256 above)
curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq -r '
  .nodes as $nodes |
  .shards as $total |
  .replication_factor as $rf |
  ($nodes | map(.shards | length) | add) as $assigned |
  "Assigned: \($assigned)/\($total * $rf) (shards × RF)",
  "Degraded nodes: \([$nodes[] | select(.status != "active")] | length)"
'
```

**Action**: Any degraded nodes need investigation (see Step 4).

### 2.3 Verify node agreement on topology

```bash
for i in 0 1 2; do
  echo "=== node-$i ==="
  kubectl exec -n $NAMESPACE miroir-$i -- \
    curl -s localhost:7700/_miroir/topology | jq '.shards, .replication_factor'
done
```

**Expected**: All nodes report the same shard count and RF.

**Action**: If nodes disagree, restart the coordinator pod to force reconciliation.

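
The agreement check in 2.3 can be scripted rather than compared by eye. The following is a minimal sketch, not part of Miroir: `all_agree` is a hypothetical helper that takes one summary string per node (for example the compact output of `jq -c '[.shards, .replication_factor]'`) and succeeds only when every string is identical.

```bash
# Hypothetical helper: returns 0 when all arguments are identical, 1 otherwise.
# Calling it with no arguments counts as disagreement (nothing was collected).
all_agree() {
  [ "$#" -gt 0 ] || return 1
  local first="$1" summary
  shift
  for summary in "$@"; do
    [ "$summary" = "$first" ] || return 1
  done
  return 0
}

# Example with literal summaries; in practice each argument would come from
# one node's topology response.
if all_agree "[128,2]" "[128,2]" "[128,2]"; then
  echo "topology consistent"
else
  echo "topology mismatch"
fi
```

A non-zero exit status makes the check easy to wire into a script or alert before deciding whether to restart the coordinator.
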
## Step 3: Check Metrics

### 3.1 Get metrics summary

```bash
curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" | jq '
  {
    degraded_shards: .degraded_shards // 0,
    task_queue_depth: .task_queue_depth // 0,
    search_latency_p99: .search_latency_p99_ms // 0,
    write_latency_p99: .write_latency_p99_ms // 0,
    cdc_lag_seconds: .cdc_lag_seconds // 0
  }
'
```

**Key thresholds**:
- `degraded_shards > 0` → investigate node health
- `task_queue_depth > 1000` → task processing bottleneck
- `search_latency_p99 > 1000` (ms) → slow queries, need optimization
- `cdc_lag_seconds > 300` → CDC falling behind

### 3.2 Check Prometheus metrics (if available)

```bash
# Via Prometheus API
curl -s "http://prometheus:9090/api/v1/query?query=miroir_degraded_shards" | jq '.data.result[0].value[1]'

# Via pod metrics endpoint
kubectl exec -n $NAMESPACE miroir-0 -- curl -s localhost:9091/metrics | grep miroir_
```

## Step 4: Check Logs for Errors

### 4.1 Get recent errors from all pods

```bash
for pod in $(kubectl get pods -n $NAMESPACE -l app=miroir -o name); do
  echo "=== $pod ==="
  # -R with fromjson? skips non-JSON log lines instead of aborting on them
  kubectl logs -n $NAMESPACE "$pod" --tail=100 | jq -Rrc 'fromjson? | select(.level=="ERROR")' || true
  echo ""
done
```

**Common error patterns**:
- `connection refused` → peer pod down or network issue
- `timeout` → slow query, overloaded node
- `hash mismatch` → potential data corruption (run anti-entropy)
- `lease expired` → leader election contention

### 4.2 Check coordinator logs for topology changes

```bash
kubectl logs -n $NAMESPACE -l app=miroir,role=coordinator --tail=200 | \
  jq -Rrc 'fromjson? | select((.message // "") | test("topology|node|shard"))'
```

### 4.3 Check for crash loop patterns

```bash
# --previous reads the logs of the last terminated container
kubectl logs -n $NAMESPACE miroir-0 --previous --tail=100 | \
  jq -Rrc 'fromjson? | select(.level=="ERROR" or .level=="FATAL")' || true
```

## Step 5: Check Task Status

### 5.1 List stuck or long-running tasks

```bash
curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" | \
  jq -r '.tasks[] | "\(.miroir_id) (\(.task_type // "unknown")): \(.created_at)"'
```

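
To narrow that list to tasks past an age threshold, a small filter can do the date arithmetic. This is a sketch under assumptions: `stale_tasks` is a hypothetical helper, `created_at` is assumed to be an RFC 3339 timestamp such as `2024-01-01T00:00:00Z`, and timestamp parsing relies on GNU `date -d` (BSD/macOS `date` would need a different invocation).

```bash
# Hypothetical helper: reads "task_id<TAB>created_at" lines on stdin and prints
# the IDs of tasks created more than max_age seconds before now.
# Usage: stale_tasks NOW_EPOCH [MAX_AGE_SECONDS]   (default max age: 3600)
stale_tasks() {
  local now="$1" max_age="${2:-3600}" tab id ts created
  tab=$(printf '\t')
  while IFS="$tab" read -r id ts; do
    # Skip blank or incomplete lines
    if [ -z "$id" ] || [ -z "$ts" ]; then continue; fi
    # Assumes GNU date; lines with unparseable timestamps are skipped
    created=$(date -u -d "$ts" +%s 2>/dev/null) || continue
    if [ $((now - created)) -gt "$max_age" ]; then
      printf '%s\n' "$id"
    fi
  done
}

# Possible use with the endpoint from 5.1 (field names as used there):
#   curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" | \
#     jq -r '.tasks[] | [.miroir_id, .created_at] | @tsv' | \
#     stale_tasks "$(date +%s)" 3600
```

Passing the current time as an argument keeps the function deterministic, which makes it easy to check against known timestamps.
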
**Action**: Investigate tasks running > 1 hour.

### 5.2 Get detailed task status

```bash
miroir-ctl task status --task-id <task-id> --verbose
```

Replace `<task-id>` with the ID of the task under investigation, for example one listed in Step 5.1.

### 5.3 Check task registry health

```bash
# SQLite mode
kubectl exec -n $NAMESPACE miroir-0 -- \
  sqlite3 /data/miroir.db "SELECT status, COUNT(*) FROM tasks GROUP BY status;"

# Redis mode
kubectl exec -n $NAMESPACE redis-0 -- \
  redis-cli --scan --pattern "miroir:tasks:*" | wc -l
```

## Step 6: Check Anti-Entropy Status

### 6.1 Last AE run time

```bash
curl -s "$MIROIR_URL/_miroir/anti-entropy/status?key=$MIROIR_KEY" | \
  jq '{last_run: .last_run_at, next_run: .next_run_at, divergences_found: .divergences_found}'
```

**Action**: If `last_run_at` is more than 24 hours ago, AE may be stuck.

### 6.2 Check for divergence

```bash
curl -s "$MIROIR_URL/_miroir/anti-entropy/divergence?key=$MIROIR_KEY" | \
  jq '.divergent_shards | length'
```

**Action**: Any divergent shards should trigger an AE run.

## Step 7: Check External Dependencies

### 7.1 Check Redis connectivity

```bash
kubectl exec -n $NAMESPACE miroir-0 -- \
  redis-cli -h redis-headless ping
```

**Expected**: `PONG`

### 7.2 Check Meilisearch backend connectivity

```bash
for i in 0 1 2; do
  echo "=== miroir-$i ==="
  kubectl exec -n $NAMESPACE miroir-$i -- \
    curl -s http://localhost:7700/health | jq '.status'
done
```

**Expected**: `"available"`

### 7.3 Check network policies

```bash
kubectl get networkpolicy -n $NAMESPACE
kubectl describe networkpolicy miroir-allow-peer -n $NAMESPACE
```

## Step 8: Run Self-Diagnostics

### 8.1 Miroir self-check endpoint

```bash
curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" | jq '.'
```

**Expected output**:

```json
{
  "status": "healthy",
  "checks": {
    "topology": "ok",
    "task_store": "ok",
    "coordinator_leader": "ok",
    "peers_connected": "ok"
  }
}
```

### 8.2 Run canary tests

```bash
# List configured canaries
curl -s "$MIROIR_URL/_miroir/canaries?key=$MIROIR_KEY" | \
  jq -r '.canaries[] | .id'

# Trigger a canary run
curl -X POST "$MIROIR_URL/_miroir/canaries/search-health/run?key=$MIROIR_KEY"
```

## Decision Tree

Based on findings, follow this tree:

```
Are any pods not running?
├─ Yes → Check pod logs (Step 4), describe pod for events
└─ No → Continue

Are any nodes degraded?
├─ Yes → Check node logs, verify network, restart if needed
└─ No → Continue

Is task queue depth > 1000?
├─ Yes → Check for stuck tasks (Step 5), scale workers if needed
└─ No → Continue

Is search latency high?
├─ Yes → Check query patterns, consider query optimization
└─ No → Continue

Any errors in logs?
├─ Yes → Investigate specific error pattern
└─ No → Issue may be external, check dependencies (Step 7)
```

## Escalation Checklist

Before escalating, gather:

1. **Topology output** (Step 2.1)
2. **Recent errors** (Step 4.1)
3. **Stuck tasks** (Step 5.1)
4. **Metrics snapshot** (Step 3.1)
5. **Pod status** (Step 1.1)

Attach these to your GitHub issue or support ticket.

## Prevention: Regular Health Checks

Set up a cron job or monitoring alert to run this daily:

```bash
#!/bin/bash
# daily-health-check.sh

# Quick health check
HEALTH=$(curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY")
# Quote the variable so the JSON reaches jq without word splitting
STATUS=$(echo "$HEALTH" | jq -r '.status')

if [ "$STATUS" != "healthy" ]; then
  echo "UNHEALTHY: $HEALTH"
  # Send alert
fi
```

## Related Documentation

- [Common Issues Guide](../troubleshooting.md)
- [Node Drain Runbook](../runbooks/node-drain.md)
- [Migration Runbook](../migration_runbook.md)
- [Metrics Reference](../operations/metrics.md)