jedarden c5238b1bcd docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5)

Implements P11.5 acceptance criteria:
- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:
1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:
- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5

2026-05-24 14:02:13 -04:00


Diagnostic Playbook

This playbook provides a systematic approach to diagnosing issues in a Miroir cluster. Run these steps in order when investigating any problem.

Prerequisites

Set up your environment:

export MIROIR_URL="https://miroir.example.com"
export MIROIR_KEY="your-admin-key"
export NAMESPACE="search"  # adjust if needed
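Every later command depends on these three variables, so it can help to fail early when one is missing. The following is a minimal sketch; require_env is a hypothetical helper name, not part of Miroir or miroir-ctl, and it assumes the variables were exported as shown above.

```shell
# Hypothetical helper: report any required variable that is unset or empty.
# Uses printenv, so it only sees exported variables.
require_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: require_env MIROIR_URL MIROIR_KEY NAMESPACE || exit 1
```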

Step 1: Check Cluster Health

1.1 Verify all pods are running

kubectl get pods -n $NAMESPACE

Expected output: All pods in Running state, Ready 1/1 or 2/2.

Common issues:

Pods in Pending → resource constraints, scheduler issues
Pods in CrashLoopBackOff → config errors, OOM kills
Pods with Ready: 0/1 → startup probe failing, dependency unavailable
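On a busy namespace it may be easier to filter the pod list down to the problem cases listed above. This sketch assumes the default kubectl column order (NAME READY STATUS RESTARTS AGE); flag_unhealthy_pods is a hypothetical name. Note that it also flags pods in Completed state, which may be expected for jobs.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints pods that
# are not Running, or whose ready count differs from the container count.
flag_unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2])
      printf "%s\t%s\t%s\n", $1, $2, $3
  }'
}

# Usage: kubectl get pods -n "$NAMESPACE" --no-headers | flag_unhealthy_pods
```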

1.2 Check recent pod restarts

kubectl get pods -n $NAMESPACE -o json | jq -r '.items[] | "\(.metadata.name): \([.status.containerStatuses[]?.restartCount] | add // 0) restarts"'

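Action: Investigate pods with more than 3 restarts. restartCount is cumulative since the pod was created, so use kubectl describe pod to check when the last restart happened and confirm the restarts are recent (within the last hour).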

1.3 Check resource usage

kubectl top pods -n $NAMESPACE
kubectl top nodes

Action: If CPU/memory limits are hit, consider scaling up or adjusting limits.

Step 2: Check Miroir Topology

2.1 Get topology overview

curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '.'

Expected output:

{
  "shards": 128,
  "replication_factor": 2,
  "nodes": [
    {"node_id": "node-0", "status": "active", "shards": [...]},
    {"node_id": "node-1", "status": "active", "shards": [...]},
    {"node_id": "node-2", "status": "active", "shards": [...]}
  ]
}

Common issues:

status: "degraded" → node is unreachable or unhealthy
status: "draining" → node migration in progress
shards: [] → node has no assigned shards (newly added)

2.2 Check for degraded shards

curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '
  .nodes as $nodes |
  .shards as $total |
  .replication_factor as $rf |
  ($nodes | map(.shards | length) | add) as $assigned |
  "Assigned: \($assigned)/\($total * $rf) (shards × RF)",
  "Degraded nodes: \([.nodes[] | select(.status != "active")] | length)"
'

Action: Any degraded nodes need investigation (see Step 4).
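The expected number of shard assignments depends on the shard count and the replication factor, not on how many nodes are in the cluster. For the sample topology in 2.1 (128 shards, RF 2) that is 256 assignments spread across the three nodes. A small sketch of the arithmetic, with hypothetical helper names:

```shell
# expected_assignments SHARDS RF -> total shard replicas the cluster should hold
expected_assignments() {
  echo $(( $1 * $2 ))
}

# missing_assignments SHARDS RF ASSIGNED -> replicas not currently placed on any node
missing_assignments() {
  echo $(( $1 * $2 - $3 ))
}

# Usage: missing_assignments 128 2 250   (prints the shortfall against 256)
```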

2.3 Verify node agreement on topology

for i in 0 1 2; do
  echo "=== node-$i ==="
  kubectl exec -n $NAMESPACE miroir-$i -- \
    curl -s localhost:7700/_miroir/topology | jq '.shards, .replication_factor'
done

Expected: All nodes report the same shard count and RF.

Action: If nodes disagree, restart the coordinator pod to force reconciliation.
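Comparing the per-node output by eye gets error-prone as the cluster grows. One way to automate the comparison is to emit one summary line per node and check that all lines are identical. This is a sketch; topology_agrees is a hypothetical helper, and the usage comment assumes the same pod names and port as the loop above.

```shell
# Reads one summary line per node on stdin, e.g. "128 2" (shards, RF).
# Succeeds only if at least one line was read and every line is identical.
topology_agrees() {
  distinct=$(sort -u | wc -l | tr -d ' ')
  [ "$distinct" -eq 1 ]
}

# Usage:
# for i in 0 1 2; do
#   kubectl exec -n "$NAMESPACE" "miroir-$i" -- \
#     curl -s localhost:7700/_miroir/topology | jq -r '"\(.shards) \(.replication_factor)"'
# done | topology_agrees || echo "nodes disagree on topology"
```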

Step 3: Check Metrics

3.1 Get metrics summary

curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" | jq '
{
  degraded_shards: .degraded_shards // 0,
  task_queue_depth: .task_queue_depth // 0,
  search_latency_p99: .search_latency_p99_ms // 0,
  write_latency_p99: .write_latency_p99_ms // 0,
  cdc_lag_seconds: .cdc_lag_seconds // 0
}
'

Key thresholds:

degraded_shards > 0 → investigate node health
task_queue_depth > 1000 → task processing bottleneck
search_latency_p99 > 1000 ms → slow queries, need optimization
cdc_lag_seconds > 300 → CDC falling behind
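The thresholds above can be turned into a scripted check, for example in a cron job or CI smoke test. The sketch below takes the four values as arguments rather than calling the metrics endpoint itself; check_thresholds is a hypothetical name, and awk is used so that fractional values such as a CDC lag of 12.5 compare correctly.

```shell
# check_thresholds DEGRADED_SHARDS QUEUE_DEPTH SEARCH_P99_MS CDC_LAG_SECONDS
# Prints one line per breached threshold; returns non-zero if any is breached.
check_thresholds() {
  awk -v d="$1" -v q="$2" -v s="$3" -v c="$4" 'BEGIN {
    bad = 0
    if (d + 0 > 0)    { print "degraded_shards: investigate node health"; bad = 1 }
    if (q + 0 > 1000) { print "task_queue_depth: task processing bottleneck"; bad = 1 }
    if (s + 0 > 1000) { print "search_latency_p99: slow queries"; bad = 1 }
    if (c + 0 > 300)  { print "cdc_lag_seconds: CDC falling behind"; bad = 1 }
    exit bad
  }'
}

# Usage: check_thresholds 0 42 180 12.5 || echo "one or more thresholds breached"
```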

3.2 Check Prometheus metrics (if available)

# Via Prometheus API
curl -s "http://prometheus:9090/api/v1/query?query=miroir_degraded_shards" | jq '.data.result[0].value[1]'

# Via pod metrics endpoint
kubectl exec -n $NAMESPACE miroir-0 -- curl -s localhost:9091/metrics | grep miroir_

Step 4: Check Logs for Errors

4.1 Get recent errors from all pods

for pod in $(kubectl get pods -n $NAMESPACE -l app=miroir -o name); do
  echo "=== $pod ==="
  kubectl logs -n $NAMESPACE $pod --tail=100 | jq -rc 'select(.level=="ERROR")' || true
  echo ""
done

Common error patterns:

connection refused → peer pod down or network issue
timeout → slow query, overloaded node
hash mismatch → potential data corruption (run anti-entropy)
lease expired → leader election contention
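When triaging many log lines, the pattern table above can be applied mechanically. This sketch maps a message to the likely cause; classify_error is a hypothetical helper, matching is case-sensitive, and the exact wording of real Miroir log messages is an assumption.

```shell
# classify_error MESSAGE -> likely cause, based on the patterns listed above.
# More specific patterns are checked before the generic "timeout".
classify_error() {
  case "$1" in
    *"connection refused"*) echo "peer pod down or network issue" ;;
    *"hash mismatch"*)      echo "potential data corruption (run anti-entropy)" ;;
    *"lease expired"*)      echo "leader election contention" ;;
    *timeout*)              echo "slow query or overloaded node" ;;
    *)                      echo "unclassified" ;;
  esac
}

# Usage: classify_error "dial tcp 10.0.0.4:7700: connection refused"
```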

4.2 Check coordinator logs for topology changes

kubectl logs -n $NAMESPACE -l app=miroir,role=coordinator --tail=200 | \
  jq -rc 'select(.message | test("topology|node|shard"))'

4.3 Check for crash loop patterns

kubectl logs -n $NAMESPACE miroir-0 --previous --tail=100 | \
  jq -rc 'select(.level=="ERROR" or .level=="FATAL")' || true

Step 5: Check Task Status

5.1 List stuck or long-running tasks

curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" | \
  jq -r '.tasks[] | "\(.miroir_id) (\(.task_type // "unknown")): \(.created_at)"'

Action: Investigate tasks running > 1 hour.
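To apply the one-hour rule without reading timestamps by hand, the created_at value can be converted to an age in seconds. This sketch assumes created_at is an RFC 3339 timestamp (for example 2026-05-24T14:02:13Z) and relies on GNU date; BSD and macOS date need a different invocation. Both function names are hypothetical.

```shell
# age_seconds TIMESTAMP [NOW_EPOCH] -> seconds elapsed since TIMESTAMP.
# NOW_EPOCH defaults to the current time; passing it makes results repeatable.
age_seconds() {
  then_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  echo $(( now_epoch - then_epoch ))
}

# is_stale TIMESTAMP MAX_AGE_SECONDS [NOW_EPOCH]
# Succeeds if TIMESTAMP is more than MAX_AGE_SECONDS in the past.
is_stale() {
  age=$(age_seconds "$1" "${3:-}") || return 2
  [ "$age" -gt "$2" ]
}

# Usage: is_stale "$created_at" 3600 && echo "task has been running for over an hour"
```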

5.2 Get detailed task status

miroir-ctl task status --task-id <miroir_task_id> --verbose

5.3 Check task registry health

# SQLite mode
kubectl exec -n $NAMESPACE miroir-0 -- \
  sqlite3 /data/miroir.db "SELECT status, COUNT(*) FROM tasks GROUP BY status;"

# Redis mode
kubectl exec -n $NAMESPACE redis-0 -- \
  redis-cli --scan --pattern "miroir:tasks:*" | wc -l

Step 6: Check Anti-Entropy Status

6.1 Last AE run time

curl -s "$MIROIR_URL/_miroir/anti-entropy/status?key=$MIROIR_KEY" | \
  jq '{last_run: .last_run_at, next_run: .next_run_at, divergences_found: .divergences_found}'

Action: If last_run_at is more than 24 hours ago, AE may be stuck.

6.2 Check for divergence

curl -s "$MIROIR_URL/_miroir/anti-entropy/divergence?key=$MIROIR_KEY" | \
  jq '.divergent_shards | length'

Action: Any divergent shards should trigger an AE run.

Step 7: Check External Dependencies

7.1 Check Redis connectivity

kubectl exec -n $NAMESPACE miroir-0 -- \
  redis-cli -h redis-headless ping

Expected: PONG

7.2 Check Meilisearch backend connectivity

for i in 0 1 2; do
  echo "=== miroir-$i ==="
  kubectl exec -n $NAMESPACE miroir-$i -- \
    curl -s http://localhost:7700/health | jq '.status'
done

Expected: "available"

7.3 Check network policies

kubectl get networkpolicy -n $NAMESPACE
kubectl describe networkpolicy miroir-allow-peer -n $NAMESPACE

Step 8: Run Self-Diagnostics

8.1 Miroir self-check endpoint

curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" | jq '.'