Implements P11.5 acceptance criteria:

- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:

1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:

- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5
315 lines
8.1 KiB
Markdown
# Diagnostic Playbook

This playbook provides a systematic approach to diagnosing issues in a Miroir cluster. Run these steps in order when investigating any problem.

## Prerequisites

Set up your environment:
```bash
export MIROIR_URL="https://miroir.example.com"
export MIROIR_KEY="your-admin-key"
export NAMESPACE="search"  # adjust if needed
```

## Step 1: Check Cluster Health

### 1.1 Verify all pods are running
```bash
kubectl get pods -n "$NAMESPACE"
```

**Expected output**: All pods in `Running` state, Ready 1/1 or 2/2.

**Common issues**:
- Pods in `Pending` → resource constraints, scheduler issues
- Pods in `CrashLoopBackOff` → config errors, OOM kills
- Pods with `Ready: 0/1` → readiness or startup probe failing, dependency unavailable

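
On a busy namespace it can help to filter the pod list down to the rows that deviate from the expected output. The following is a minimal sketch, not part of Miroir: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (`NAME READY STATUS RESTARTS AGE`).

```bash
# Print pods that are not Running or not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")   # READY column, e.g. "1/2"
    if ($3 != "Running" || ready[1] != ready[2])
      printf "%s\t%s\t%s\n", $1, $2, $3
  }'
}

# Usage (assumes $NAMESPACE is set as in Prerequisites):
#   kubectl get pods -n "$NAMESPACE" --no-headers | unhealthy_pods
```

Pods from finished Jobs (status `Completed`) would also be listed by this filter, so treat the output as a list of candidates rather than a verdict.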
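### 1.2 Check recent pod restarts
```bash
# One line per container, so sidecars in 2/2 pods are included
kubectl get pods -n "$NAMESPACE" -o json | jq -r \
  '.items[] | .metadata.name as $pod | .status.containerStatuses[]? | "\($pod)/\(.name): \(.restartCount) restarts"'
```

**Action**: Investigate pods with more than 3 restarts in the last hour. `restartCount` is cumulative over the pod's lifetime, so check `lastState.terminated.finishedAt` in the pod status (or `kubectl describe pod`) to see when the most recent restart happened.
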
### 1.3 Check resource usage
```bash
# kubectl top requires metrics-server in the cluster
kubectl top pods -n "$NAMESPACE"
kubectl top nodes
```

**Action**: If pods or nodes are at or near their CPU/memory limits, consider scaling up or adjusting limits.

## Step 2: Check Miroir Topology

### 2.1 Get topology overview
```bash
curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '.'
```

**Expected output**:
```json
{
  "shards": 128,
  "replication_factor": 2,
  "nodes": [
    {"node_id": "node-0", "status": "active", "shards": [...]},
    {"node_id": "node-1", "status": "active", "shards": [...]},
    {"node_id": "node-2", "status": "active", "shards": [...]}
  ]
}
```

**Common issues**:
- `status: "degraded"` → node is unreachable or unhealthy
- `status: "draining"` → node migration in progress
- `shards: []` → node has no assigned shards (newly added)

### 2.2 Check for degraded shards
```bash
curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq -r '
  .nodes as $nodes |
  .shards as $total |
  .replication_factor as $rf |
  ($nodes | map(.shards | length) | add) as $assigned |
  # Expected replica count is shards × RF, taken from the response
  "Assigned replicas: \($assigned)/\($total * $rf) (shards × RF)",
  "Degraded nodes: \([$nodes[] | select(.status != "active")] | length)"
'
```

**Action**: Any degraded nodes need investigation (see Step 4).

### 2.3 Verify node agreement on topology
```bash
for i in 0 1 2; do
  echo "=== node-$i ==="
  kubectl exec -n "$NAMESPACE" "miroir-$i" -- \
    curl -s localhost:7700/_miroir/topology | jq '.shards, .replication_factor'
done
```

**Expected**: All nodes report the same shard count and RF.

**Action**: If nodes disagree, restart the coordinator pod to force reconciliation.

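
Comparing the per-node output by eye gets error-prone as the cluster grows. A small sketch of an automated comparison follows; `topology_agrees` is a hypothetical helper name, and it assumes each node's summary has been reduced to a single line (for example with `jq -c`).

```bash
# Succeeds (exit 0) only when stdin holds at least one non-empty line
# and all non-empty lines are identical, i.e. every node reported the
# same topology summary.
topology_agrees() {
  local distinct
  distinct=$(grep -v '^[[:space:]]*$' | sort -u | wc -l)
  [ "$distinct" -eq 1 ]
}

# Usage (pod names as in the loop above):
#   for i in 0 1 2; do
#     kubectl exec -n "$NAMESPACE" "miroir-$i" -- \
#       curl -s localhost:7700/_miroir/topology | jq -c '[.shards, .replication_factor]'
#   done | topology_agrees && echo "nodes agree" || echo "nodes DISAGREE"
```

Empty input is treated as a failure on purpose: if no node answered, there is nothing to agree on.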
## Step 3: Check Metrics

### 3.1 Get metrics summary
```bash
# Values are parenthesized so the `//` defaults parse on all jq versions
curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" | jq '
  {
    degraded_shards: (.degraded_shards // 0),
    task_queue_depth: (.task_queue_depth // 0),
    search_latency_p99: (.search_latency_p99_ms // 0),
    write_latency_p99: (.write_latency_p99_ms // 0),
    cdc_lag_seconds: (.cdc_lag_seconds // 0)
  }
'
```

**Key thresholds**:
- `degraded_shards > 0` → investigate node health
- `task_queue_depth > 1000` → task processing bottleneck
- `search_latency_p99 > 1000` (ms) → slow queries, need optimization
- `cdc_lag_seconds > 300` → CDC falling behind

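
The thresholds above can be checked mechanically once the four values have been extracted. This is a sketch under stated assumptions: `check_thresholds` and `exceeds` are hypothetical helper names, and the limits are simply the ones listed in this step.

```bash
# exceeds VALUE LIMIT: numeric comparison via awk, so floats work too
exceeds() { awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'; }

# Print one warning per breached threshold; return non-zero on any breach.
# Arguments: degraded_shards task_queue_depth search_latency_p99_ms cdc_lag_seconds
check_thresholds() {
  local breaches=0
  if exceeds "$1" 0; then
    echo "WARN degraded_shards=$1: investigate node health"; breaches=$((breaches + 1))
  fi
  if exceeds "$2" 1000; then
    echo "WARN task_queue_depth=$2: task processing bottleneck"; breaches=$((breaches + 1))
  fi
  if exceeds "$3" 1000; then
    echo "WARN search_latency_p99_ms=$3: slow queries"; breaches=$((breaches + 1))
  fi
  if exceeds "$4" 300; then
    echo "WARN cdc_lag_seconds=$4: CDC falling behind"; breaches=$((breaches + 1))
  fi
  [ "$breaches" -eq 0 ]
}

# Usage: check_thresholds 0 1500 200 12   # warns about the task queue only
```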
### 3.2 Check Prometheus metrics (if available)
```bash
# Via Prometheus API
curl -s "http://prometheus:9090/api/v1/query?query=miroir_degraded_shards" | jq '.data.result[0].value[1]'

# Via pod metrics endpoint
kubectl exec -n "$NAMESPACE" miroir-0 -- curl -s localhost:9091/metrics | grep miroir_
```

## Step 4: Check Logs for Errors

### 4.1 Get recent errors from all pods
```bash
for pod in $(kubectl get pods -n "$NAMESPACE" -l app=miroir -o name); do
  echo "=== $pod ==="
  # -R plus fromjson? skips log lines that are not valid JSON instead of aborting
  kubectl logs -n "$NAMESPACE" "$pod" --tail=100 | jq -Rrc 'fromjson? | select(.level? == "ERROR")' || true
  echo ""
done
```

**Common error patterns**:
- `connection refused` → peer pod down or network issue
- `timeout` → slow query, overloaded node
- `hash mismatch` → potential data corruption (run anti-entropy)
- `lease expired` → leader election contention

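
To see at a glance which of the patterns above dominates, the log lines can be tallied per pattern. The helper below is a sketch only; `summarize_error_patterns` is a hypothetical name, and matching is a plain case-insensitive substring search, so it may over-count (for example, `timeout` also matches configuration messages that mention timeouts).

```bash
# Count how many lines on stdin mention each known error pattern.
summarize_error_patterns() {
  local logs pattern count
  logs=$(cat)   # buffer stdin so it can be scanned once per pattern
  for pattern in "connection refused" "timeout" "hash mismatch" "lease expired"; do
    # grep -c prints 0 and exits 1 when nothing matches, hence `|| true`
    count=$(printf '%s\n' "$logs" | grep -c -i -F -- "$pattern" || true)
    printf '%-20s %s\n' "$pattern" "$count"
  done
}

# Usage:
#   kubectl logs -n "$NAMESPACE" -l app=miroir --tail=500 | summarize_error_patterns
```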
### 4.2 Check coordinator logs for topology changes
```bash
# Non-JSON lines and entries without a message field are skipped
kubectl logs -n "$NAMESPACE" -l app=miroir,role=coordinator --tail=200 | \
  jq -Rrc 'fromjson? | select((.message? // "") | test("topology|node|shard"))'
```

### 4.3 Check for crash loop patterns
```bash
# --previous reads the logs of the last terminated container instance
kubectl logs -n "$NAMESPACE" miroir-0 --previous --tail=100 | \
  jq -Rrc 'fromjson? | select(.level? == "ERROR" or .level? == "FATAL")' || true
```

## Step 5: Check Task Status

### 5.1 List stuck or long-running tasks
```bash
curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" | \
  jq -r '.tasks[] | "\(.miroir_id) (\(.task_type // "unknown")): \(.created_at)"'
```

**Action**: Investigate tasks running > 1 hour.

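
Picking out the tasks that exceed the one-hour mark can be scripted. The sketch below rests on assumptions that are not confirmed by this playbook: `stale_tasks` is a hypothetical helper, `created_at` is taken to be an ISO 8601 timestamp, and GNU `date` (with `-d`) is assumed to be available.

```bash
# Print tasks older than a cutoff. Reads "ID<TAB>CREATED_AT" lines on stdin.
# Arguments: max age in seconds (default 3600), current epoch (default: now).
stale_tasks() {
  local max_age="${1:-3600}" now="${2:-$(date -u +%s)}"
  local id created started age tab
  tab=$(printf '\t')
  while IFS="$tab" read -r id created; do
    # GNU date parses the timestamp; rows that fail to parse are skipped
    started=$(date -u -d "$created" +%s 2>/dev/null) || continue
    age=$((now - started))
    if [ "$age" -gt "$max_age" ]; then
      printf '%s\t%s\t%dm\n' "$id" "$created" $((age / 60))
    fi
  done
}

# Usage:
#   curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" |
#     jq -r '.tasks[] | [.miroir_id, .created_at] | @tsv' | stale_tasks 3600
```

The optional second argument exists so the cutoff can be evaluated against a fixed point in time, which makes the helper easy to test.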
### 5.2 Get detailed task status
```bash
miroir-ctl task status --task-id <miroir_task_id> --verbose
```

### 5.3 Check task registry health
```bash
# SQLite mode
kubectl exec -n "$NAMESPACE" miroir-0 -- \
  sqlite3 /data/miroir.db "SELECT status, COUNT(*) FROM tasks GROUP BY status;"

# Redis mode
kubectl exec -n "$NAMESPACE" redis-0 -- \
  redis-cli --scan --pattern "miroir:tasks:*" | wc -l
```

## Step 6: Check Anti-Entropy Status

### 6.1 Last anti-entropy (AE) run time
```bash
curl -s "$MIROIR_URL/_miroir/anti-entropy/status?key=$MIROIR_KEY" | \
  jq '{last_run: .last_run_at, next_run: .next_run_at, divergences_found: .divergences_found}'
```

**Action**: If `last_run_at` is more than 24 hours ago, AE may be stuck.

### 6.2 Check for divergence
```bash
curl -s "$MIROIR_URL/_miroir/anti-entropy/divergence?key=$MIROIR_KEY" | \
  jq '.divergent_shards | length'
```

**Action**: Any divergent shards should trigger an AE run.

## Step 7: Check External Dependencies

### 7.1 Check Redis connectivity
```bash
kubectl exec -n "$NAMESPACE" miroir-0 -- \
  redis-cli -h redis-headless ping
```

**Expected**: `PONG`

### 7.2 Check Meilisearch backend connectivity
```bash
for i in 0 1 2; do
  echo "=== miroir-$i ==="
  kubectl exec -n "$NAMESPACE" "miroir-$i" -- \
    curl -s http://localhost:7700/health | jq '.status'
done
```

**Expected**: `"available"`

### 7.3 Check network policies
```bash
kubectl get networkpolicy -n "$NAMESPACE"
kubectl describe networkpolicy miroir-allow-peer -n "$NAMESPACE"
```

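
The dependency checks in this step can be wrapped so that each one prints a single pass/fail line. This is an illustrative sketch: `run_check` and the `failed` counter are hypothetical names, and a check passes purely on the wrapped command's exit status, so it does not verify response bodies such as `PONG` or `"available"`.

```bash
# Run one named check, print OK/FAIL, and count failures in $failed.
failed=0
run_check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK    $name"
  else
    echo "FAIL  $name"
    failed=$((failed + 1))
  fi
}

# Usage (commands taken from Steps 7.1 to 7.3; adjust names to your cluster):
#   run_check "redis ping"      kubectl exec -n "$NAMESPACE" miroir-0 -- redis-cli -h redis-headless ping
#   run_check "backend health"  kubectl exec -n "$NAMESPACE" miroir-0 -- curl -sf http://localhost:7700/health
#   run_check "network policy"  kubectl get networkpolicy miroir-allow-peer -n "$NAMESPACE"
#   [ "$failed" -eq 0 ] || echo "$failed dependency check(s) failed"
```

Call `run_check` directly rather than inside `$(...)`, otherwise the failure counter is updated in a subshell and lost.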
## Step 8: Run Self-Diagnostics

### 8.1 Miroir self-check endpoint
```bash
curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" | jq '.'
```

**Expected output**:
```json
{
  "status": "healthy",
  "checks": {
    "topology": "ok",
    "task_store": "ok",
    "coordinator_leader": "ok",
    "peers_connected": "ok"
  }
}
```

### 8.2 Run canary tests
```bash
# List configured canaries
curl -s "$MIROIR_URL/_miroir/canaries?key=$MIROIR_KEY" | \
  jq -r '.canaries[] | .id'

# Trigger a canary run (replace search-health with an ID from the list above)
curl -X POST "$MIROIR_URL/_miroir/canaries/search-health/run?key=$MIROIR_KEY"
```

## Decision Tree

Based on findings, follow this tree:

```
Are any pods not running?
├─ Yes → Check pod logs (Step 4), describe pod for events
└─ No → Continue

Are any nodes degraded?
├─ Yes → Check node logs, verify network, restart if needed
└─ No → Continue

Is task queue depth > 1000?
├─ Yes → Check for stuck tasks (Step 5), scale workers if needed
└─ No → Continue

Is search latency high?
├─ Yes → Check query patterns, consider query optimization
└─ No → Continue

Any errors in logs?
├─ Yes → Investigate specific error pattern
└─ No → Issue may be external, check dependencies (Step 7)
```

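
The tree is a first-match chain, which maps directly onto an `if`/`elif` ladder. The sketch below is illustrative only: `next_action` is a hypothetical function, inputs must be integers, and "high" search latency is assumed to mean a p99 above 1000 ms, borrowing the threshold from Step 3.1.

```bash
# Walk the decision tree and print the first recommended action.
# Arguments (integers gathered in the earlier steps):
#   $1 pods not running   $2 degraded nodes   $3 task queue depth
#   $4 search p99 in ms   $5 error lines found in logs
next_action() {
  local pods_down=$1 degraded=$2 queue=$3 p99=$4 errors=$5
  if [ "$pods_down" -gt 0 ]; then
    echo "Check pod logs (Step 4), describe pod for events"
  elif [ "$degraded" -gt 0 ]; then
    echo "Check node logs, verify network, restart if needed"
  elif [ "$queue" -gt 1000 ]; then
    echo "Check for stuck tasks (Step 5), scale workers if needed"
  elif [ "$p99" -gt 1000 ]; then   # assumed threshold, see Step 3.1
    echo "Check query patterns, consider query optimization"
  elif [ "$errors" -gt 0 ]; then
    echo "Investigate specific error pattern"
  else
    echo "Issue may be external, check dependencies (Step 7)"
  fi
}

# Usage: next_action 0 0 1500 50 0
```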
## Escalation Checklist

Before escalating, gather:

1. **Topology output** (Step 2.1)
2. **Recent errors** (Step 4.1)
3. **Stuck tasks** (Step 5.1)
4. **Metrics snapshot** (Step 3.1)
5. **Pod status** (Step 1.1)

Attach these to your GitHub issue or support ticket.

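
Gathering the five items by hand is easy to get wrong under pressure, so they can be captured in one pass. The following is a sketch, not an official tool: `collect_diagnostics` and the output file names are hypothetical, and it saves the raw pod logs rather than the filtered errors from Step 4.1. It assumes `MIROIR_URL`, `MIROIR_KEY`, and `NAMESPACE` are set as in Prerequisites.

```bash
# Collect the escalation artifacts into one directory and print its path.
# Argument: output directory (default: timestamped name in the current dir).
collect_diagnostics() {
  local out="${1:-miroir-diag-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out" || return 1
  curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" > "$out/topology.json"
  kubectl logs -n "$NAMESPACE" -l app=miroir --tail=100 --prefix > "$out/recent-logs.txt" 2>&1
  curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" > "$out/processing-tasks.json"
  curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" > "$out/metrics.json"
  kubectl get pods -n "$NAMESPACE" -o wide > "$out/pods.txt" 2>&1
  echo "$out"
}

# Usage: dir=$(collect_diagnostics) && tar czf "$dir.tar.gz" "$dir"
```

Review the collected files for secrets or customer data before attaching them to a public issue.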
## Prevention: Regular Health Checks

Set up a cron job or monitoring alert to run this script daily:

```bash
#!/bin/bash
# daily-health-check.sh
# Requires MIROIR_URL and MIROIR_KEY in the environment

# Quick health check (-f makes curl fail on HTTP errors, leaving HEALTH empty)
HEALTH=$(curl -sf "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY")
STATUS=$(echo "$HEALTH" | jq -r '.status')

if [ "$STATUS" != "healthy" ]; then
  echo "UNHEALTHY: $HEALTH"
  # Send alert
  exit 1
fi
```

## Related Documentation

- [Common Issues Guide](../troubleshooting.md)
- [Node Drain Runbook](../runbooks/node-drain.md)
- [Migration Runbook](../migration_runbook.md)
- [Metrics Reference](../operations/metrics.md)