miroir/docs/troubleshooting/diagnostics.md
jedarden c5238b1bcd docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5)
Implements P11.5 acceptance criteria:
- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:
1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:
- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5
2026-05-24 14:02:13 -04:00

8.1 KiB
Raw Permalink Blame History

Diagnostic Playbook

This playbook provides a systematic approach to diagnosing issues in a Miroir cluster. Run these steps in order when investigating any problem.

Prerequisites

Set up your environment:

export MIROIR_URL="https://miroir.example.com"
export MIROIR_KEY="your-admin-key"
export NAMESPACE="search"  # adjust if needed

Step 1: Check Cluster Health

1.1 Verify all pods are running

kubectl get pods -n $NAMESPACE

Expected output: All pods in Running state, Ready 1/1 or 2/2.

Common issues:

  • Pods in Pending → resource constraints, scheduler issues
  • Pods in CrashLoopBackOff → config errors, OOM kills
  • Pods with Ready: 0/1 → startup probe failing, dependency unavailable

1.2 Check recent pod restarts

kubectl get pods -n $NAMESPACE -o json | jq -r '.items[] | "\(.metadata.name): \(.status.containerStatuses[0].restartCount) restarts"'

Action: Investigate pods with > 3 restarts in the last hour.

1.3 Check resource usage

kubectl top pods -n $NAMESPACE
kubectl top nodes

Action: If CPU/memory limits are hit, consider scaling up or adjusting limits.

Step 2: Check Miroir Topology

2.1 Get topology overview

curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '.'

Expected output:

{
  "shards": 128,
  "replication_factor": 2,
  "nodes": [
    {"node_id": "node-0", "status": "active", "shards": [...]},
    {"node_id": "node-1", "status": "active", "shards": [...]},
    {"node_id": "node-2", "status": "active", "shards": [...]}
  ]
}

Common issues:

  • status: "degraded" → node is unreachable or unhealthy
  • status: "draining" → node migration in progress
  • shards: [] → node has no assigned shards (newly added)

2.2 Check for degraded shards

curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '
  .nodes as $nodes |
  .shards as $total |
  ($nodes | map(.shards | length) | add) as $assigned |
  "Assigned: \($assigned)/\($total*3) (RF × \($nodes | length))",
  "Degraded nodes: \([.nodes[] | select(.status != "active")] | length)"
'

Action: Any degraded nodes need investigation (see Step 4).

2.3 Verify node agreement on topology

for i in 0 1 2; do
  echo "=== node-$i ==="
  kubectl exec -n $NAMESPACE miroir-$i -- \
    curl -s localhost:7700/_miroir/topology | jq '.shards, .replication_factor'
done

Expected: All nodes report the same shard count and RF.

Action: If nodes disagree, restart coordinator pod to force reconciliation.

Step 3: Check Metrics

3.1 Get metrics summary

curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" | jq '
{
  degraded_shards: .degraded_shards // 0,
  task_queue_depth: .task_queue_depth // 0,
  search_latency_p99: .search_latency_p99_ms // 0,
  write_latency_p99: .write_latency_p99_ms // 0,
  cdc_lag_seconds: .cdc_lag_seconds // 0
}
'

Key thresholds:

  • degraded_shards > 0 → investigate node health
  • task_queue_depth > 1000 → task processing bottleneck
  • search_latency_p99 > 1000 → slow queries, need optimization
  • cdc_lag_seconds > 300 → CDC falling behind

3.2 Check Prometheus metrics (if available)

# Via Prometheus API
curl -s "http://prometheus:9090/api/v1/query?query=miroir_degraded_shards" | jq '.data.result[0].value[1]'

# Via pod metrics endpoint
kubectl exec -n $NAMESPACE miroir-0 -- curl -s localhost:9091/metrics | grep miroir_

Step 4: Check Logs for Errors

4.1 Get recent errors from all pods

for pod in $(kubectl get pods -n $NAMESPACE -l app=miroir -o name); do
  echo "=== $pod ==="
  kubectl logs -n $NAMESPACE $pod --tail=100 | jq -rc 'select(.level=="ERROR")' || true
  echo ""
done

Common error patterns:

  • connection refused → peer pod down or network issue
  • timeout → slow query, overloaded node
  • hash mismatch → potential data corruption (run anti-entropy)
  • lease expired → leader election contention

4.2 Check coordinator logs for topology changes

kubectl logs -n $NAMESPACE -l app=miroir,role=coordinator --tail=200 | \
  jq -rc 'select(.message | test("topology|node|shard"))'

4.3 Check for crash loop patterns

kubectl logs -n $NAMESPACE miroir-0 --previous --tail=100 | \
  jq -rc 'select(.level=="ERROR" or .level=="FATAL")' || true

Step 5: Check Task Status

5.1 List stuck or long-running tasks

curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" | \
  jq -r '.tasks[] | "\(.miroir_id) (\(.task_type // "unknown")): \(.created_at)"'

Action: Investigate tasks running > 1 hour.

5.2 Get detailed task status

miroir-ctl task status --task-id <miroir_task_id> --verbose

5.3 Check task registry health

# SQLite mode
kubectl exec -n $NAMESPACE miroir-0 -- \
  sqlite3 /data/miroir.db "SELECT status, COUNT(*) FROM tasks GROUP BY status;"

# Redis mode
kubectl exec -n $NAMESPACE redis-0 -- \
  redis-cli --scan --pattern "miroir:tasks:*" | wc -l

Step 6: Check Anti-Entropy Status

6.1 Last AE run time

curl -s "$MIROIR_URL/_miroir/anti-entropy/status?key=$MIROIR_KEY" | \
  jq '{last_run: .last_run_at, next_run: .next_run_at, divergences_found: .divergences_found}'

Action: If last_run_at is > 24 hours ago, AE may be stuck.

6.2 Check for divergence

curl -s "$MIROIR_URL/_miroir/anti-entropy/divergence?key=$MIROIR_KEY" | \
  jq '.divergent_shards | length'

Action: Any divergent shards should trigger an AE run.

Step 7: Check External Dependencies

7.1 Check Redis connectivity

kubectl exec -n $NAMESPACE miroir-0 -- \
  redis-cli -h redis-headless ping

Expected: PONG

7.2 Check Meilisearch backend connectivity

for i in 0 1 2; do
  echo "=== miroir-$i ==="
  kubectl exec -n $NAMESPACE miroir-$i -- \
    curl -s http://localhost:7700/health | jq '.status'
done

Expected: "available"

7.3 Check network policies

kubectl get networkpolicy -n $NAMESPACE
kubectl describe networkpolicy miroir-allow-peer -n $NAMESPACE

Step 8: Run Self-Diagnostics

8.1 Miroir self-check endpoint

curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" | jq '.'

Expected output:

{
  "status": "healthy",
  "checks": {
    "topology": "ok",
    "task_store": "ok",
    "coordinator_leader": "ok",
    "peers_connected": "ok"
  }
}

8.2 Run canary tests

# List configured canaries
curl -s "$MIROIR_URL/_miroir/canaries?key=$MIROIR_KEY" | \
  jq -r '.canaries[] | .id'

# Trigger a canary run
curl -X POST "$MIROIR_URL/_miroir/canaries/search-health/run?key=$MIROIR_KEY"

Decision Tree

Based on findings, follow this tree:

Are any pods not running?
├─ Yes → Check pod logs (Step 4), describe pod for events
└─ No → Continue

Are any nodes degraded?
├─ Yes → Check node logs, verify network, restart if needed
└─ No → Continue

Is task queue depth > 1000?
├─ Yes → Check for stuck tasks (Step 5), scale workers if needed
└─ No → Continue

Is search latency high?
├─ Yes → Check query patterns, consider query optimization
└─ No → Continue

Any errors in logs?
├─ Yes → Investigate specific error pattern
└─ No → Issue may be external, check dependencies (Step 7)

Escalation Checklist

Before escalating, gather:

  1. Topology output (Step 2.1)
  2. Recent errors (Step 4.1)
  3. Stuck tasks (Step 5.1)
  4. Metrics snapshot (Step 3.1)
  5. Pod status (Step 1.1)

Attach these to your GitHub issue or support ticket.

Prevention: Regular Health Checks

Set up a cron job or monitoring alert to run this daily:

#!/bin/bash
# daily-health-check.sh

# Quick health check
HEALTH=$(curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY")
STATUS=$(echo $HEALTH | jq -r '.status')

if [ "$STATUS" != "healthy" ]; then
  echo "UNHEALTHY: $HEALTH"
  # Send alert
fi