docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5)

Implements P11.5 acceptance criteria:
- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:
1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:
- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5
2026-05-24 14:02:13 -04:00 · 2026-05-24 14:02:13 -04:00 · c5238b1bcd
commit c5238b1bcd
parent b7f3b816ba
5 changed files with 1030 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -87,6 +87,7 @@ See [`docs/versioning-policy.md`](docs/versioning-policy.md) for the full versio
 - [Helm Chart](charts/miroir/) — Production deployment on Kubernetes
 - [Deployment Guides](docs/onboarding/) — Production setup, sizing, and operational considerations
 - [Migration Runbook](docs/migration_runbook.md) — Paths from single-node Meilisearch to Miroir
+- [Troubleshooting Guide](docs/troubleshooting.md) — Common issues and diagnostic playbook

 ## Quick Start

--- a/docs/migration_runbook.md
+++ b/docs/migration_runbook.md
@@ -365,6 +365,8 @@ miroir-ctl anti-entropy run --index products --shards 0-63

 ### Troubleshooting Guide

+> **For comprehensive troubleshooting**: See the [Troubleshooting Guide](troubleshooting.md) for common issues and the [Diagnostic Playbook](troubleshooting/diagnostics.md) for systematic diagnosis.
+
 #### High Write Latency

 **Symptoms**: Write latency increased by > 2× during dual-write
--- a/docs/migrations/from-meilisearch-dump.md
+++ b/docs/migrations/from-meilisearch-dump.md
@@ -0,0 +1,308 @@
+# Migrating from Meilisearch: Dump and Reload
+
+**Use this option if:** Your existing Meilisearch index is **under 10 GB** and you can tolerate brief downtime during the export/import.
+
+**Migration time:** 1-2 hours for 10 GB (network and disk dependent)
+
+---
+
+## Overview
+
+1. Export a dump from your existing Meilisearch instance
+2. Deploy Miroir
+3. Import the dump via Miroir's streaming router (default) — documents are routed to their owning shards during import
+4. Fall back to broadcast mode only if Miroir cannot reconstruct your dump variant
+
+---
+
+## Preconditions
+
+- [ ] Existing Meilisearch instance is accessible and healthy
+- [ ] Target Miroir cluster is deployed with sufficient capacity (existing corpus size + 20% buffer)
+- [ ] Dump version is compatible with Miroir's Meilisearch version (check `GET /version` on both)
+- [ ] Network connectivity between old instance and Miroir cluster
+- [ ] Admin API key for Miroir
+
+**Capacity check:**
+
+```bash
+# Check existing database size (reported in bytes by the stats route)
+curl -s https://old-meili.example.com/stats \
+  -H "Authorization: Bearer <master-key>" | jq '.databaseSize'
+
+# Estimate required storage (corpus + 20% buffer)
+# If old corpus is 8 GB, provision at least 10 GB per Miroir node
+```
+
+---
+
+## Step-by-Step
+
+### Step 1: Export dump from existing Meilisearch
+
+```bash
+# Trigger dump creation
+curl -X POST https://old-meili.example.com/dumps \
+  -H "Authorization: Bearer <master-key>"
+
+# Response includes the task to poll, e.g. {"taskUid":42,"status":"enqueued","type":"dumpCreation",...}
+
+# Poll for completion
+curl https://old-meili.example.com/tasks/42 \
+  -H "Authorization: Bearer <master-key>"
+
+# When status is "succeeded", details.dumpUid names the dump file (<dumpUid>.dump)
+# Meilisearch writes it to its dump directory (--dump-dir, default dumps/) on the server
+# and has no HTTP download route, so copy it off the host (user and path are placeholders):
+scp <user>@old-meili.example.com:<dump-dir>/<dumpUid>.dump \
+  meilisearch-export.dump
+```
+
+**Expected time:** ~5-10 minutes per GB
+
+---
+
+### Step 2: Deploy Miroir
+
+If Miroir is not yet deployed:
+
+```bash
+# Add Helm repo
+helm repo add miroir https://jedarden.github.io/miroir
+helm repo update
+
+# Create namespace and secrets
+kubectl create namespace search
+kubectl -n search create secret generic miroir-secrets \
+  --from-literal=masterKey="<strong-key>" \
+  --from-literal=nodeMasterKey="<node-key>" \
+  --from-literal=adminApiKey="<admin-key>"
+kubectl -n search create secret generic meilisearch-secrets \
+  --from-literal=masterKey="<node-key>"
+
+# Install (adjust replica count based on corpus size)
+helm install search miroir/miroir \
+  --namespace search \
+  --values my-values.yaml \
+  --set meilisearch.replicas=3 \
+  --wait
+```
+
+**Verify deployment:**
+
+```bash
+kubectl get pods -n search
+# All pods should be Running
+
+curl https://search.example.com/health
+# {"status":"available"}
+```
+
+---
+
+### Step 3: Import dump via Miroir (streaming mode)
+
+**Streaming mode (default, recommended):** Documents are routed to their owning shards during import. No cross-cluster broadcast, no post-import rebalance.
+
+```bash
+# Import the dump
+curl -X POST https://search.example.com/_miroir/dumps/import \
+  -H "Authorization: Bearer <admin-key>" \
+  -F "dump=@meilisearch-export.dump" \
+  -F "indexUid=myindex"
+
+# Response: {"miroir_task_id":"mtask-00123"}
+
+# Monitor progress
+curl https://search.example.com/_miroir/dumps/import/mtask-00123/status \
+  -H "Authorization: Bearer <admin-key>"
+
+# Or use miroir-ctl
+miroir-ctl task status mtask-00123
+```
+
+**Progression:**
+
+| Phase | Description |
+|-------|-------------|
+| `Parsing` | Reading dump metadata and settings |
+| `SettingsBroadcast` | Applying index settings via two-phase broadcast |
+| `StreamingDocuments` | Routing documents to owning shards |
+| `Complete` | Import finished successfully |
+
+**Expected time:** ~1-2 hours for 10 GB (depends on network and cluster size)
+
+---
+
+### Step 4: Verification
+
+```bash
+# Verify document counts match
+curl https://old-meili.example.com/indexes/myindex/stats \
+  -H "Authorization: Bearer <master-key>" | jq '.numberOfDocuments'
+
+curl https://search.example.com/indexes/myindex/stats \
+  -H "Authorization: Bearer <miroir-key>" | jq '.numberOfDocuments'
+
+# Sample query comparison
+curl -X POST https://old-meili.example.com/indexes/myindex/search \
+  -H "Authorization: Bearer <search-key>" \
+  -H "Content-Type: application/json" \
+  -d '{"q": "test", "limit": 10}'
+
+curl -X POST https://search.example.com/indexes/myindex/search \
+  -H "Authorization: Bearer <miroir-key>" \
+  -H "Content-Type: application/json" \
+  -d '{"q": "test", "limit": 10}'
+
+# Results should match (ordering may differ slightly due to distributed merge)
+```
+
+---
+
+### Step 5: Update application configuration
+
+Update your application to point to Miroir:
+
+```python
+# Before
+client = meilisearch.Client('https://old-meili.example.com', 'key')
+
+# After
+client = meilisearch.Client('https://search.example.com', 'miroir-key')
+```
+
+```typescript
+// Before
+const client = new MeiliSearch({ host: 'https://old-meili.example.com', apiKey: 'key' })
+
+// After
+const client = new MeiliSearch({ host: 'https://search.example.com', apiKey: 'miroir-key' })
+```
+
+```go
+// Before
+client := meilisearch.NewClient(meilisearch.ClientConfig{
+  Host: "https://old-meili.example.com",
+  APIKey: "key",
+})
+
+// After
+client := meilisearch.NewClient(meilisearch.ClientConfig{
+  Host: "https://search.example.com",
+  APIKey: "miroir-key",
+})
+```
+
+---
+
+## Fallback: Broadcast Mode
+
+If Miroir cannot fully reconstruct your dump variant (e.g., custom dump format from a Meilisearch fork), fall back to broadcast mode:
+
+**Warning:** Broadcast mode imports the dump to **every node**, transiently placing 100% of the corpus on each node. This requires manual rebalancing afterward.
+
+```bash
+# Set broadcast mode via Helm values
+helm upgrade search miroir/miroir \
+  --namespace search \
+  --values my-values.yaml \
+  --set miroir.dump_import.mode=broadcast
+
+# Or modify ConfigMap directly
+kubectl edit configmap miroir-config -n search
+# Set: miroir.dump_import.mode: broadcast
+
+# Restart proxy pods
+kubectl rollout restart deployment miroir-proxy -n search
+
+# Import (now using broadcast mode)
+curl -X POST https://search.example.com/_miroir/dumps/import \
+  -H "Authorization: Bearer <admin-key>" \
+  -F "dump=@meilisearch-export.dump" \
+  -F "indexUid=myindex"
+
+# After import completes, rebalance to delete non-owning copies
+miroir-ctl rebalance start --index myindex
+miroir-ctl rebalance status --watch
+```
+
+---
+
+## Rollback
+
+If verification fails or you need to roll back:
+
+```bash
+# Point application back to old instance
+# (revert SDK configuration changes)
+
+# Delete imported index from Miroir
+curl -X DELETE https://search.example.com/indexes/myindex \
+  -H "Authorization: Bearer <admin-key>"
+```
+
+---
+
+## Troubleshooting
+
+### Import stuck at `SettingsBroadcast`
+
+**Cause:** Two-phase settings broadcast waiting for all nodes to acknowledge.
+
+**Solution:**
+
+```bash
+# Check node health
+miroir-ctl status
+
+# Verify all nodes are healthy
+kubectl get pods -n search -l app=meilisearch
+
+# If a node is degraded, fix it first
+kubectl describe pod <pod-name> -n search
+```
+
+### Import fails with "incompatible dump format"
+
+**Cause:** Dump format from Meilisearch version not supported by Miroir's nodes.
+
+**Solution:** Check Meilisearch versions match:
+
+```bash
+# Old instance
+curl https://old-meili.example.com/version
+
+# Miroir nodes
+kubectl exec -n search <pod-name> -- curl http://localhost:7700/version
+```
+
+If versions differ significantly, either:
+1. Upgrade old instance to match Miroir's version before exporting dump
+2. Use **re-index** migration instead (see `from-meilisearch-reindex.md`)
+
+### Document counts don't match after import
+
+**Cause:** Streaming router may have failed to route some documents.
+
+**Solution:**
+
+```bash
+# Check import task for errors
+miroir-ctl task status mtask-00123
+
+# Re-run import if errors found
+# (Idempotent — duplicate documents are ignored)
+
+# Or run anti-entropy to detect and repair divergences
+miroir-ctl anti-entropy run --index myindex
+```
+
+---
+
+## See Also
+
+- [Plan §13.9 — Streaming routed dump import](../plan/plan.md#139-streaming-routed-dump-import)
+- [Re-index migration](from-meilisearch-reindex.md) — for large corpora
+- [Live cutover migration](from-meilisearch-live-cutover.md) — for zero-downtime
+- [Troubleshooting Guide](../troubleshooting.md) — common issues and solutions
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -0,0 +1,404 @@
+# Common Issues & Troubleshooting
+
+This guide covers the most common issues encountered when running Miroir in production, along with their symptoms, causes, and fixes.
+
+## Quick Diagnostics
+
+Before diving into specific issues, run the [diagnostic playbook](troubleshooting/diagnostics.md) to gather baseline information about your cluster's health.
+
+## Common Issues
+
+### Error: "primary key required"
+
+#### Symptom
+Client sees an HTTP 400 response:
+```json
+{
+  "code": "miroir_primary_key_required",
+  "message": "Miroir requires an explicit primary key at index creation"
+}
+```
+
+#### Cause
+The index was created without a `primaryKey` field. Miroir cannot route documents without knowing the primary key in advance.
+
+#### Fix
+```bash
+curl -X POST https://miroir/indexes \
+  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
+  -d '{
+    "uid": "myindex",
+    "primaryKey": "id"
+  }'
+```
+
+#### Why this differs from Meilisearch
+Meilisearch can infer the primary key from the first document batch. Miroir cannot — it needs to hash the PK *before* any node sees it to determine which shard owns the document. Explicit `primaryKey` at index creation is required.
+
+---
+
+### Search returns fewer results than expected
+
+#### Symptom
+Search queries return fewer results than known document count, especially after node failures or during migrations.
+
+#### Cause
+A replica holding a shard is degraded or unreachable. Miroir's cross-reference mechanism skips degraded replicas to avoid returning incomplete or stale results, which can reduce result counts when RF > 1.
+
+#### Fix
+1. Check topology for degraded nodes:
+   ```bash
+   curl -s https://miroir/_miroir/topology | jq '.nodes[] | select(.status != "active")'
+   ```
+
+2. Check for degraded shards:
+   ```bash
+   curl -s https://miroir/_miroir/metrics | jq '.degraded_shards'
+   ```
+
+3. If a node is degraded, check its logs:
+   ```bash
+   kubectl logs miroir-0 --tail=100 | jq 'select(.level=="ERROR")'
+   ```
+
+4. Restart the degraded pod if it's stuck:
+   ```bash
+   kubectl delete pod miroir-0
+   ```
+
+#### Prevention
+- Set up canaries to proactively detect search degradation
+- Monitor the `miroir_degraded_shards` metric
+- Ensure proper resource limits to prevent OOM kills
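+
+For the resource-limit point, a container `resources` block along these lines is a common starting shape. The numbers are placeholders rather than sizing guidance from this guide, and where the block belongs in your Helm values depends on the chart.
+
+```yaml
+# Illustrative values only; size from observed usage (kubectl top pods)
+resources:
+  requests:
+    cpu: "500m"
+    memory: 2Gi
+  limits:
+    memory: 4Gi   # exceeding this triggers an OOM kill, so leave headroom
+```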
+
+---
+
+### Task polling stuck at "processing"
+
+#### Symptom
+`miroir-ctl task status` shows a task stuck in "processing" state indefinitely, even though the operation appears complete.
+
+#### Cause
+The task coordinator lost track of per-node task status. This can happen when:
+- A node crashes during task execution
+- Network partition prevents status updates
+- Task registry checkpoint is delayed
+
+#### Fix
+1. Check per-node task status:
+   ```bash
+   miroir-ctl task status --task-id <miroir_task_id> --verbose
+   ```
+
+2. Identify which node(s) have incomplete status:
+   ```bash
+   kubectl logs miroir-0 --tail=100 | grep "<miroir_task_id>"
+   kubectl logs miroir-1 --tail=100 | grep "<miroir_task_id>"
+   ```
+
+3. If all nodes have completed but the task is stuck, force-complete the task:
+   ```bash
+   miroir-ctl task complete --task-id <miroir_task_id>
+   ```
+
+4. If a node crashed and cannot recover, mark its tasks as failed:
+   ```bash
+   miroir-ctl task fail --task-id <miroir_task_id> --node <node_id> --reason "node crashed"
+   ```
+
+#### Prevention
+- Enable task registry checkpointing (default: every 100 tasks)
+- Monitor task queue depth via `miroir_task_queue_depth` metric
+- Set task timeouts appropriate to your workload
+
+---
+
+### Node drain blocked: "insufficient replicas"
+
+#### Symptom
+```bash
+$ miroir-ctl node drain node-1
+Error: Cannot drain node-1: removing it would drop replication factor below minimum
+```
+
+#### Cause
+Draining a node would leave some shards with fewer replicas than the minimum RF. This is a safety check to prevent data loss.
+
+#### Fix
+1. Check current RF configuration:
+   ```bash
+   curl -s https://miroir/_miroir/topology | jq '.replication_factor'
+   ```
+
+2. Add a new node first:
+   ```bash
+   kubectl scale statefulset miroir --replicas=4
+   # Wait for node-3 to be ready
+   kubectl wait --for=condition=ready pod/miroir-3
+   ```
+
+3. Then retry the drain:
+   ```bash
+   miroir-ctl node drain node-1
+   ```
+
+#### Alternative: Force drain (dangerous)
+If you must drain without sufficient replicas, use `--force`:
+```bash
+miroir-ctl node drain node-1 --force
+```
+This will reduce RF for affected shards during migration. Only use this if:
+- You can tolerate reduced redundancy temporarily
+- Anti-entropy is enabled to repair divergence later
+
+---
+
+### Migration stuck after coordinator crash
+
+#### Symptom
+A shard migration (reshard, rebalance, node drain) was in progress when the coordinator pod crashed. After restart, the migration is stuck and cannot complete or rollback.
+
+#### Cause
+The coordinator stores migration state in the task store. If it crashes during state transitions, the migration may be left in an inconsistent state.
+
+#### Fix
+1. Check migration status:
+   ```bash
+   miroir-ctl reshard status --operation-id <operation_id>
+   ```
+
+2. If stuck in "in_progress" with no activity, recover the migration:
+   ```bash
+   miroir-ctl reshard recover --operation-id <operation_id>
+   ```
+
+3. If recovery fails, you may need to force-complete:
+   ```bash
+   # This skips remaining delta pass and anti-entropy
+   miroir-ctl reshard complete --operation-id <operation_id> --force
+   ```
+
+4. Run anti-entropy manually to repair any divergence:
+   ```bash
+   miroir-ctl anti-entropy run --index <affected_index>
+   ```
+
+#### Prevention
+- Enable task store persistence (Redis mode for HA)
+- Set coordinator leader election timeout appropriately
+- Monitor coordinator pod health via liveness probes
+
+---
+
+### High memory usage on Redis
+
+#### Symptom
+Redis memory usage grows continuously, potentially triggering OOM kills.
+
+#### Cause
+The most common causes are:
+1. Idempotency cache entries not expiring
+2. Task registry not pruning terminal tasks
+3. Session entries not being cleaned up
+
+#### Fix
+1. Check Redis memory breakdown:
+   ```bash
+   redis-cli INFO memory | grep used_memory_human
+   redis-cli --bigkeys --pattern "miroir:*"
+   ```
+
+2. Check largest key categories:
+   ```bash
+   redis-cli --scan --pattern "miroir:tasks:*" | wc -l  # task count
+   redis-cli --scan --pattern "miroir:idemp:*" | wc -l   # idempotency entries
+   redis-cli --scan --pattern "miroir:session:*" | wc -l # sessions
+   ```
+
+3. Manually trigger cleanup if pruner is stuck:
+   ```bash
+   # Prune old terminal tasks
+   miroir-ctl task prune --older-than 24h
+
+   # Clear idempotency entries (this removes every entry under the prefix, not only expired ones)
+   redis-cli --scan --pattern "miroir:idemp:*" | xargs -r redis-cli DEL
+   ```
+
+4. Adjust pruner intervals if needed:
+   ```toml
+   # config.toml
+   [task_store.prune]
+   interval_seconds = 300  # run every 5 minutes
+   task_retention_days = 7
+   ```
+
+#### Prevention
+- Monitor Redis memory usage via `redis_used_memory` metric
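+- Set `maxmemory` on Redis and pick the eviction policy deliberately: `allkeys-lru` can evict any key under memory pressure, including state you expect to persist, while `volatile-lru` only evicts keys that have a TTL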
+- Ensure pruner is running (check logs for "Pruning terminal tasks" messages)
+
+---
+
+### Index creation fails with "hash routing error"
+
+#### Symptom
+```bash
+$ curl -X POST https://miroir/indexes -d '{"uid": "test", "primaryKey": "id"}'
+HTTP 500 {"code": "hash_routing_error", "message": "unable to determine shard assignment"}
+```
+
+#### Cause
+This typically happens when:
+1. The topology view is inconsistent across nodes
+2. The shard count is 0 or not configured
+3. The primary key field is missing from schema validation
+
+#### Fix
+1. Check topology consistency:
+   ```bash
+   curl -s https://miroir/_miroir/topology | jq '.shards, .replication_factor, (.nodes | length)'
+   ```
+
+2. Verify all nodes agree on shard count:
+   ```bash
+   for pod in miroir-0 miroir-1 miroir-2; do
+     echo "$pod:"
+     kubectl exec $pod -- curl -s localhost:7700/_miroir/topology | jq '.shards'
+   done
+   ```
+
+3. If nodes disagree, restart the coordinator to force topology reconciliation:
+   ```bash
+   kubectl delete pod -l app=miroir,role=coordinator
+   ```
+
+#### Prevention
+- Use leader election to ensure single coordinator writer
+- Monitor topology change log for conflicts
+
+---
+
+### Alias flip returns "wrong kind"
+
+#### Symptom
+```bash
+$ miroir-ctl alias flip prod-logs logs-v2
+Error: Alias 'prod-logs' is a multi-target alias, cannot flip
+```
+
+#### Cause
+You're trying to flip an alias that was created as a "multi" alias (for cross-index search) rather than a "single" alias (for atomic index swap).
+
+#### Fix
+1. Check the alias type:
+   ```bash
+   miroir-ctl alias get prod-logs
+   ```
+
+2. If you need a swappable pointer, delete and recreate as a single alias:
+   ```bash
+   miroir-ctl alias delete prod-logs
+   miroir-ctl alias create prod-logs --kind single --current-uid logs-v1
+   ```
+
+3. For cross-index search, use a separate multi alias:
+   ```bash
+   miroir-ctl alias create search-all --kind multi --target-uids logs-v1,metrics-v1
+   ```
+
+#### Prevention
+- Use descriptive alias names to distinguish single vs multi
+- Document alias conventions in your runbooks
+
+---
+
+### Search timeout during shard migration
+
+#### Symptom
+Search queries timeout or return 503 errors during active shard migrations, especially for large indexes.
+
+#### Cause
+During migration, some queries may be routed to nodes that are still warming up migrated shards, or to nodes under heavy load from migration work.
+
+#### Fix
+1. Check if migration is active:
+   ```bash
+   miroir-ctl reshard list --status in_progress
+   ```
+
+2. Temporarily increase query timeout:
+   ```bash
+   curl -X POST https://miroir/indexes/myindex/search \
+     -H "Query-Timeout: 30" -H "Content-Type: application/json" \
+     -d '{"q": "test"}'
+   ```
+
+3. If timeouts persist, pause the migration:
+   ```bash
+   miroir-ctl reshard pause --operation-id <operation_id>
+   ```
+
+4. Resume during off-peak hours:
+   ```bash
+   miroir-ctl reshard resume --operation-id <operation_id>
+   ```
+
+#### Prevention
+- Schedule large migrations during low-traffic periods
+- Use `--throttle` flag on reshard to limit CPU usage
+- Monitor search latency during migrations
+
+---
+
+### CDC cursor out of sync
+
+#### Symptom
+CDC events arrive with stale or duplicate sequence numbers, or events are missing entirely.
+
+#### Cause
+The CDC cursor stored in Redis is out of sync with the actual event stream. This can happen if:
+- The sink was down during a period of high write activity
+- A cursor update failed silently
+- The sink was reset without clearing the cursor
+
+#### Fix
+1. Check current cursor position:
+   ```bash
+   miroir-ctl cdc cursor --sink-name elasticsearch --index-uid myindex
+   ```
+
+2. Compare to Meilisearch event stream:
+   ```bash
+   # On a Meilisearch node
+   curl -s http://localhost:7700/indexes/myindex/cdc/events | jq '.events | length'
+   ```
+
+3. If cursor is behind, reset it to force re-sync from a checkpoint:
+   ```bash
+   miroir-ctl cdc reset-cursor --sink-name elasticsearch --index-uid myindex --confirm
+   ```
+
+4. For large gaps, consider a full re-index:
+   ```bash
+   miroir-ctl dump export --index-uid myindex --output /data/myindex.dump
+   miroir-ctl dump import --sink-name elasticsearch --input /data/myindex.dump
+   ```
+
+#### Prevention
+- Monitor CDC lag via `miroir_cdc_lag_seconds` metric
+- Set up alerts for cursor stall detection
+- Use idempotent sinks to handle duplicate events gracefully
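+
+A Prometheus alerting rule is one way to cover the first two points. This is a sketch: the metric name comes from the bullet above, while the 300-second threshold, the 10-minute hold, and the group, alert, and label names are assumptions to adjust for your sink.
+
+```yaml
+groups:
+  - name: miroir-cdc                # hypothetical group name
+    rules:
+      - alert: MiroirCdcLagHigh
+        # Fires only if lag stays above 5 minutes for 10 minutes,
+        # which also catches a cursor that has stopped advancing.
+        expr: miroir_cdc_lag_seconds > 300
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "CDC sink is more than 5 minutes behind"
+```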
+
+---
+
+## Getting Help
+
+If you don't see your issue listed here:
+
+1. Run the [diagnostic playbook](troubleshooting/diagnostics.md) and gather the output
+2. Search [existing GitHub issues](https://github.com/jedarden/miroir/issues)
+3. Open a new issue with:
+   - Miroir version
+   - Diagnostic output
+   - Relevant logs (sanitized)
+   - Steps to reproduce
--- a/docs/troubleshooting/diagnostics.md
+++ b/docs/troubleshooting/diagnostics.md
@@ -0,0 +1,315 @@
+# Diagnostic Playbook
+
+This playbook provides a systematic approach to diagnosing issues in a Miroir cluster. Run these steps in order when investigating any problem.
+
+## Prerequisites
+
+Set up your environment:
+```bash
+export MIROIR_URL="https://miroir.example.com"
+export MIROIR_KEY="your-admin-key"
+export NAMESPACE="search"  # adjust if needed
+```
+
+## Step 1: Check Cluster Health
+
+### 1.1 Verify all pods are running
+```bash
+kubectl get pods -n $NAMESPACE
+```
+
+**Expected output**: All pods in `Running` state, Ready 1/1 or 2/2.
+
+**Common issues**:
+- Pods in `Pending` → resource constraints, scheduler issues
+- Pods in `CrashLoopBackOff` → config errors, OOM kills
+- Pods with `Ready: 0/1` → startup probe failing, dependency unavailable
+
+### 1.2 Check recent pod restarts
+```bash
+kubectl get pods -n $NAMESPACE -o json | jq -r '.items[] | "\(.metadata.name): \(.status.containerStatuses[0].restartCount) restarts"'
+```
+
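+**Action**: The count shown is cumulative since the pod was created, so re-run the command and investigate pods whose count is still climbing (more than 3 restarts within an hour).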
+
+### 1.3 Check resource usage
+```bash
+kubectl top pods -n $NAMESPACE
+kubectl top nodes
+```
+
+**Action**: If CPU/memory limits are hit, consider scaling up or adjusting limits.
+
+## Step 2: Check Miroir Topology
+
+### 2.1 Get topology overview
+```bash
+curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '.'
+```
+
+**Expected output**:
+```json
+{
+  "shards": 128,
+  "replication_factor": 2,
+  "nodes": [
+    {"node_id": "node-0", "status": "active", "shards": [...]},
+    {"node_id": "node-1", "status": "active", "shards": [...]},
+    {"node_id": "node-2", "status": "active", "shards": [...]}
+  ]
+}
+```
+
+**Common issues**:
+- `status: "degraded"` → node is unreachable or unhealthy
+- `status: "draining"` → node migration in progress
+- `shards: []` → node has no assigned shards (newly added)
+
+### 2.2 Check for degraded shards
+```bash
+curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '
+  .nodes as $nodes |
+  .shards as $total | .replication_factor as $rf |
+  ($nodes | map(.shards | length) | add) as $assigned |
+  "Assigned: \($assigned)/\($total * $rf) (shards × RF)",
+  "Degraded nodes: \([.nodes[] | select(.status != "active")] | length)"
+'
+```
+
+**Action**: Any degraded nodes need investigation (see Step 4).
+
+### 2.3 Verify node agreement on topology
+```bash
+for i in 0 1 2; do
+  echo "=== node-$i ==="
+  kubectl exec -n $NAMESPACE miroir-$i -- \
+    curl -s localhost:7700/_miroir/topology | jq '.shards, .replication_factor'
+done
+```
+
+**Expected**: All nodes report the same shard count and RF.
+
+**Action**: If nodes disagree, restart coordinator pod to force reconciliation.
+
+## Step 3: Check Metrics
+
+### 3.1 Get metrics summary
+```bash
+curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" | jq '
+{
+  degraded_shards: .degraded_shards // 0,
+  task_queue_depth: .task_queue_depth // 0,
+  search_latency_p99: .search_latency_p99_ms // 0,
+  write_latency_p99: .write_latency_p99_ms // 0,
+  cdc_lag_seconds: .cdc_lag_seconds // 0
+}
+'
+```
+
+**Key thresholds**:
+- `degraded_shards > 0` → investigate node health
+- `task_queue_depth > 1000` → task processing bottleneck
+- `search_latency_p99 > 1000` (milliseconds) → slow queries, need optimization
+- `cdc_lag_seconds > 300` → CDC falling behind
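+
+If Prometheus scrapes Miroir, the same thresholds can be turned into alerting rules. The rule file below is a sketch: it assumes the Prometheus metric names `miroir_degraded_shards` and `miroir_task_queue_depth` used in the Common Issues guide, and the group name, alert names, `for` durations, and severity labels are illustrative.
+
+```yaml
+groups:
+  - name: miroir-health             # hypothetical group name
+    rules:
+      - alert: MiroirDegradedShards
+        expr: miroir_degraded_shards > 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Miroir reports degraded shards"
+      - alert: MiroirTaskQueueBacklog
+        expr: miroir_task_queue_depth > 1000
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Miroir task queue depth is above 1000"
+```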
+
+### 3.2 Check Prometheus metrics (if available)
+```bash
+# Via Prometheus API
+curl -s "http://prometheus:9090/api/v1/query?query=miroir_degraded_shards" | jq '.data.result[0].value[1]'
+
+# Via pod metrics endpoint
+kubectl exec -n $NAMESPACE miroir-0 -- curl -s localhost:9091/metrics | grep miroir_
+```
+
+## Step 4: Check Logs for Errors
+
+### 4.1 Get recent errors from all pods
+```bash
+for pod in $(kubectl get pods -n $NAMESPACE -l app=miroir -o name); do
+  echo "=== $pod ==="
+  kubectl logs -n $NAMESPACE $pod --tail=100 | jq -rc 'select(.level=="ERROR")' || true
+  echo ""
+done
+```
+
+**Common error patterns**:
+- `connection refused` → peer pod down or network issue
+- `timeout` → slow query, overloaded node
+- `hash mismatch` → potential data corruption (run anti-entropy)
+- `lease expired` → leader election contention
+
+### 4.2 Check coordinator logs for topology changes
+```bash
+kubectl logs -n $NAMESPACE -l app=miroir,role=coordinator --tail=200 | \
+  jq -rc 'select(.message | test("topology|node|shard"))'
+```
+
+### 4.3 Check for crash loop patterns
+```bash
+kubectl logs -n $NAMESPACE miroir-0 --previous --tail=100 | \
+  jq -rc 'select(.level=="ERROR" or .level=="FATAL")' || true
+```
+
+## Step 5: Check Task Status
+
+### 5.1 List stuck or long-running tasks
+```bash
+curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" | \
+  jq -r '.tasks[] | "\(.miroir_id) (\(.task_type // "unknown")): \(.created_at)"'
+```
+
+**Action**: Investigate tasks running > 1 hour.
+
+### 5.2 Get detailed task status
+```bash
+miroir-ctl task status --task-id <miroir_task_id> --verbose
+```
+
+### 5.3 Check task registry health
+```bash
+# SQLite mode
+kubectl exec -n $NAMESPACE miroir-0 -- \
+  sqlite3 /data/miroir.db "SELECT status, COUNT(*) FROM tasks GROUP BY status;"
+
+# Redis mode
+kubectl exec -n $NAMESPACE redis-0 -- \
+  redis-cli --scan --pattern "miroir:tasks:*" | wc -l
+```
+
+## Step 6: Check Anti-Entropy Status
+
+### 6.1 Last AE run time
+```bash
+curl -s "$MIROIR_URL/_miroir/anti-entropy/status?key=$MIROIR_KEY" | \
+  jq '{last_run: .last_run_at, next_run: .next_run_at, divergences_found: .divergences_found}'
+```
+
+**Action**: If `last_run_at` is > 24 hours ago, AE may be stuck.
+
+### 6.2 Check for divergence
+```bash
+curl -s "$MIROIR_URL/_miroir/anti-entropy/divergence?key=$MIROIR_KEY" | \
+  jq '.divergent_shards | length'
+```
+
+**Action**: Any divergent shards should trigger an AE run.
+
+## Step 7: Check External Dependencies
+
+### 7.1 Check Redis connectivity
+```bash
+kubectl exec -n $NAMESPACE miroir-0 -- \
+  redis-cli -h redis-headless ping
+```
+
+**Expected**: `PONG`
+
+### 7.2 Check Meilisearch backend connectivity
+```bash
+for i in 0 1 2; do
+  echo "=== miroir-$i ==="
+  kubectl exec -n $NAMESPACE miroir-$i -- \
+    curl -s http://localhost:7700/health | jq '.status'
+done
+```
+
+**Expected**: `"available"`
+
+### 7.3 Check network policies
+```bash
+kubectl get networkpolicy -n $NAMESPACE
+kubectl describe networkpolicy miroir-allow-peer -n $NAMESPACE
+```
+
+## Step 8: Run Self-Diagnostics
+
+### 8.1 Miroir self-check endpoint
+```bash
+curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" | jq '.'
+```
+
+**Expected output**:
+```json
+{
+  "status": "healthy",
+  "checks": {
+    "topology": "ok",
+    "task_store": "ok",
+    "coordinator_leader": "ok",
+    "peers_connected": "ok"
+  }
+}
+```
+
+### 8.2 Run canary tests
+```bash
+# List configured canaries
+curl -s "$MIROIR_URL/_miroir/canaries?key=$MIROIR_KEY" | \
+  jq -r '.canaries[] | .id'
+
+# Trigger a canary run
+curl -X POST "$MIROIR_URL/_miroir/canaries/search-health/run?key=$MIROIR_KEY"
+```
+
+## Decision Tree
+
+Based on findings, follow this tree:
+
+```
+Are any pods not running?
+├─ Yes → Check pod logs (Step 4), describe pod for events
+└─ No → Continue
+
+Are any nodes degraded?
+├─ Yes → Check node logs, verify network, restart if needed
+└─ No → Continue
+
+Is task queue depth > 1000?
+├─ Yes → Check for stuck tasks (Step 5), scale workers if needed
+└─ No → Continue
+
+Is search latency high?
+├─ Yes → Check query patterns, consider query optimization
+└─ No → Continue
+
+Any errors in logs?
+├─ Yes → Investigate specific error pattern
+└─ No → Issue may be external, check dependencies (Step 7)
+```
+
+## Escalation Checklist
+
+Before escalating, gather:
+
+1. **Topology output** (Step 2.1)
+2. **Recent errors** (Step 4.1)
+3. **Stuck tasks** (Step 5.1)
+4. **Metrics snapshot** (Step 3.1)
+5. **Pod status** (Step 1.1)
+
+Attach these to your GitHub issue or support ticket.
+
+## Prevention: Regular Health Checks
+
+Set up a cron job or monitoring alert to run this daily:
+
+```bash
+#!/bin/bash
+# daily-health-check.sh
+
+# Quick health check
+HEALTH=$(curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY")
+STATUS=$(echo "$HEALTH" | jq -r '.status')
+
+if [ "$STATUS" != "healthy" ]; then
+  echo "UNHEALTHY: $HEALTH"
+  # Send alert
+fi
+```
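+
+One way to schedule the check on Kubernetes is a CronJob. The manifest below is a minimal sketch and not part of Miroir: the job name, the `curlimages/curl` image, and the `miroir-health-check` Secret with its `adminKey` entry are all assumptions. It matches the `"status": "healthy"` field with `grep` because that image does not ship `jq`.
+
+```yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: miroir-daily-health-check   # hypothetical name
+  namespace: search
+spec:
+  schedule: "0 6 * * *"             # once a day at 06:00
+  jobTemplate:
+    spec:
+      backoffLimit: 0
+      template:
+        spec:
+          restartPolicy: Never
+          containers:
+            - name: health-check
+              image: curlimages/curl:latest   # assumption: any image with curl and grep works
+              env:
+                - name: MIROIR_URL
+                  value: "https://miroir.example.com"
+                - name: MIROIR_KEY
+                  valueFrom:
+                    secretKeyRef:
+                      name: miroir-health-check   # hypothetical Secret
+                      key: adminKey
+              command: ["/bin/sh", "-c"]
+              args:
+                - |
+                  # Exit non-zero unless the response reports "healthy",
+                  # so the Job is marked Failed and can be alerted on.
+                  curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" \
+                    | grep -Eq '"status": *"healthy"'
+```
+
+A failed Job then shows up in `kubectl get jobs -n search`, which existing Kubernetes alerting can pick up.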
+
+## Related Documentation
+
+- [Common Issues Guide](../troubleshooting.md)
+- [Node Drain Runbook](../runbooks/node-drain.md)
+- [Migration Runbook](../migration_runbook.md)
+- [Metrics Reference](../operations/metrics.md)