From c5238b1bcd3996dadb6e17b8df847ea24ff8b22f Mon Sep 17 00:00:00 2001 From: jedarden Date: Sun, 24 May 2026 14:02:13 -0400 Subject: [PATCH] docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements P11.5 acceptance criteria: - Created docs/troubleshooting.md with 10 common issues - Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook - Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks) - Added 7 additional issues from Phase 9 chaos testing and operations - Cross-linked from README, migration runbook, and dump import guide Documented issues: 1. "primary key required" - Miroir vs Meilisearch difference 2. Search returns fewer results - degraded node handling 3. Task polling stuck - per-node task status recovery 4. Node drain blocked - RF constraints 5. Migration stuck after coordinator crash - recovery procedures 6. High memory usage on Redis - cleanup procedures 7. Index creation fails - topology inconsistency 8. Alias flip conflicts - single vs multi alias types 9. Search timeout during migration - throttling options 10. 
CDC cursor out of sync - recovery and re-index Diagnostic playbook covers: - Cluster health checks (pods, nodes, resources) - Topology verification and node agreement - Metrics analysis (degraded shards, task queue, latency) - Log analysis for error patterns - Task status inspection - Anti-entropy status - External dependency checks - Self-diagnostics and canary tests Closes: miroir-uyx.5 --- README.md | 1 + docs/migration_runbook.md | 2 + docs/migrations/from-meilisearch-dump.md | 308 +++++++++++++++++ docs/troubleshooting.md | 404 +++++++++++++++++++++++ docs/troubleshooting/diagnostics.md | 315 ++++++++++++++++++ 5 files changed, 1030 insertions(+) create mode 100644 docs/migrations/from-meilisearch-dump.md create mode 100644 docs/troubleshooting.md create mode 100644 docs/troubleshooting/diagnostics.md diff --git a/README.md b/README.md index 51aab2b..f423fc1 100644 --- a/README.md +++ b/README.md @@ -87,6 +87,7 @@ See [`docs/versioning-policy.md`](docs/versioning-policy.md) for the full versio - [Helm Chart](charts/miroir/) — Production deployment on Kubernetes - [Deployment Guides](docs/onboarding/) — Production setup, sizing, and operational considerations - [Migration Runbook](docs/migration_runbook.md) — Paths from single-node Meilisearch to Miroir +- [Troubleshooting Guide](docs/troubleshooting.md) — Common issues and diagnostic playbook ## Quick Start diff --git a/docs/migration_runbook.md b/docs/migration_runbook.md index e49aa05..344b616 100644 --- a/docs/migration_runbook.md +++ b/docs/migration_runbook.md @@ -365,6 +365,8 @@ miroir-ctl anti-entropy run --index products --shards 0-63 ### Troubleshooting Guide +> **For comprehensive troubleshooting**: See the [Troubleshooting Guide](troubleshooting.md) for common issues and the [Diagnostic Playbook](troubleshooting/diagnostics.md) for systematic diagnosis. 
+ #### High Write Latency **Symptoms**: Write latency increased by > 2× during dual-write diff --git a/docs/migrations/from-meilisearch-dump.md b/docs/migrations/from-meilisearch-dump.md new file mode 100644 index 0000000..86d9844 --- /dev/null +++ b/docs/migrations/from-meilisearch-dump.md @@ -0,0 +1,308 @@ +# Migrating from Meilisearch: Dump and Reload + +**Use this option if:** Your existing Meilisearch index is **under 10 GB** and you can tolerate brief downtime during the export/import. + +**Migration time:** 1-2 hours for 10 GB (network and disk dependent) + +--- + +## Overview + +1. Export a dump from your existing Meilisearch instance +2. Deploy Miroir +3. Import the dump via Miroir's streaming router (default) — documents are routed to their owning shards during import +4. Fall back to broadcast mode only if Miroir cannot reconstruct your dump variant + +--- + +## Preconditions + +- [ ] Existing Meilisearch instance is accessible and healthy +- [ ] Target Miroir cluster is deployed with sufficient capacity (existing corpus size + 20% buffer) +- [ ] Dump version is compatible with Miroir's Meilisearch version (check `GET /version` on both) +- [ ] Network connectivity between old instance and Miroir cluster +- [ ] Admin API key for Miroir + +**Capacity check:** + +```bash +# Check existing index size +curl https://old-meili.example.com/indexes \ + -H "Authorization: Bearer " + +# Estimate required storage (corpus + 20% buffer) +# If old corpus is 8 GB, provision at least 10 GB per Miroir node +``` + +--- + +## Step-by-Step + +### Step 1: Export dump from existing Meilisearch + +```bash +# Trigger dump creation +curl -X POST https://old-meili.example.com/dumps \ + -H "Authorization: Bearer " + +# Response: {"uid":"20240524-123456","status":"enqueued","taskUid":42} + +# Poll for completion +curl https://old-meili.example.com/tasks/42 \ + -H "Authorization: Bearer " + +# When status is "succeeded", note the dump file path +# Download the dump +curl 
https://old-meili.example.com/dumps/20240524-123456/download \ + -H "Authorization: Bearer " \ + --output meilisearch-export.dump +``` + +**Expected time:** ~5-10 minutes per GB + +--- + +### Step 2: Deploy Miroir + +If Miroir is not yet deployed: + +```bash +# Add Helm repo +helm repo add miroir https://jedarden.github.io/miroir +helm repo update + +# Create namespace and secrets +kubectl create namespace search +kubectl -n search create secret generic miroir-secrets \ + --from-literal=masterKey="" \ + --from-literal=nodeMasterKey="" \ + --from-literal=adminApiKey="" +kubectl -n search create secret generic meilisearch-secrets \ + --from-literal=masterKey="" + +# Install (adjust replica count based on corpus size) +helm install search miroir/miroir \ + --namespace search \ + --values my-values.yaml \ + --set meilisearch.replicas=3 \ + --wait +``` + +**Verify deployment:** + +```bash +kubectl get pods -n search +# All pods should be Running + +curl https://search.example.com/health +# {"status":"available"} +``` + +--- + +### Step 3: Import dump via Miroir (streaming mode) + +**Streaming mode (default, recommended):** Documents are routed to their owning shards during import. No cross-cluster broadcast, no post-import rebalance. 
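As a rough illustration of what "routed to their owning shards" means, the sketch below hashes a primary key and takes the result modulo the shard count. The CRC32 from `cksum`, the `shard_for_key` helper, and the 128-shard count are assumptions for illustration only; Miroir's actual hash function and shard layout may differ.

```shell
# Illustrative only: maps a primary key to a shard number.
# Assumption: CRC32 (via POSIX cksum) stands in for whatever hash Miroir really uses.
shard_for_key() {
  local key="$1" shard_count="$2" crc
  # cksum prints "<crc> <byte-count>" for stdin; keep only the CRC field.
  crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  echo $(( crc % shard_count ))
}

# The same key always maps to the same shard, which is why a routed
# import does not need a rebalance afterward.
shard_for_key "product-42" 128
shard_for_key "product-42" 128
```

Both calls print the same shard number; a different key may or may not land on a different shard.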
+ +```bash +# Import the dump +curl -X POST https://search.example.com/_miroir/dumps/import \ + -H "Authorization: Bearer " \ + -F "dump=@meilisearch-export.dump" \ + -F "indexUid=myindex" + +# Response: {"miroir_task_id":"mtask-00123"} + +# Monitor progress +curl https://search.example.com/_miroir/dumps/import/mtask-00123/status \ + -H "Authorization: Bearer " + +# Or use miroir-ctl +miroir-ctl task status mtask-00123 +``` + +**Progression:** + +| Phase | Description | +|-------|-------------| +| `Parsing` | Reading dump metadata and settings | +| `SettingsBroadcast` | Applying index settings via two-phase broadcast | +| `StreamingDocuments` | Routing documents to owning shards | +| `Complete` | Import finished successfully | + +**Expected time:** ~1-2 hours for 10 GB (depends on network and cluster size) + +--- + +### Step 4: Verification + +```bash +# Verify document counts match +curl https://old-meili.example.com/indexes/myindex/stats \ + -H "Authorization: Bearer " | jq '.numberOfDocuments' + +curl https://search.example.com/indexes/myindex/stats \ + -H "Authorization: Bearer " | jq '.numberOfDocuments' + +# Sample query comparison +curl -X POST https://old-meili.example.com/indexes/myindex/search \ + -H "Authorization: Bearer " \ + -H "Content-Type: application/json" \ + -d '{"q": "test", "limit": 10}' + +curl -X POST https://search.example.com/indexes/myindex/search \ + -H "Authorization: Bearer " \ + -H "Content-Type: application/json" \ + -d '{"q": "test", "limit": 10}' + +# Results should match (ordering may differ slightly due to distributed merge) +``` + +--- + +### Step 5: Update application configuration + +Update your application to point to Miroir: + +```python +# Before +client = meilisearch.Client('https://old-meili.example.com', 'key') + +# After +client = meilisearch.Client('https://search.example.com', 'miroir-key') +``` + +```typescript +// Before +const client = new MeiliSearch({ host: 'https://old-meili.example.com', apiKey: 'key' }) + +// 
After +const client = new MeiliSearch({ host: 'https://search.example.com', apiKey: 'miroir-key' }) +``` + +```go +// Before +client := meilisearch.NewClient(meilisearch.ClientConfig{ + Host: "https://old-meili.example.com", + APIKey: "key", +}) + +// After +client := meilisearch.NewClient(meilisearch.ClientConfig{ + Host: "https://search.example.com", + APIKey: "miroir-key", +}) +``` + +--- + +## Fallback: Broadcast Mode + +If Miroir cannot fully reconstruct your dump variant (e.g., custom dump format from a Meilisearch fork), fall back to broadcast mode: + +**Warning:** Broadcast mode imports the dump to **every node**, transiently placing 100% of the corpus on each node. This requires manual rebalancing afterward. + +```bash +# Set broadcast mode via Helm values +helm upgrade search miroir/miroir \ + --namespace search \ + --values my-values.yaml \ + --set miroir.dump_import.mode=broadcast + +# Or modify ConfigMap directly +kubectl edit configmap miroir-config -n search +# Set: miroir.dump_import.mode: broadcast + +# Restart proxy pods +kubectl rollout restart deployment miroir-proxy -n search + +# Import (now using broadcast mode) +curl -X POST https://search.example.com/_miroir/dumps/import \ + -H "Authorization: Bearer " \ + -F "dump=@meilisearch-export.dump" \ + -F "indexUid=myindex" + +# After import completes, rebalance to delete non-owning copies +miroir-ctl rebalance start --index myindex +miroir-ctl rebalance status --watch +``` + +--- + +## Rollback + +If verification fails or you need to roll back: + +```bash +# Point application back to old instance +# (revert SDK configuration changes) + +# Delete imported index from Miroir +curl -X DELETE https://search.example.com/indexes/myindex \ + -H "Authorization: Bearer " +``` + +--- + +## Troubleshooting + +### Import stuck at `SettingsBroadcast` + +**Cause:** Two-phase settings broadcast waiting for all nodes to acknowledge. 
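To tell a slow broadcast from a stuck one, it can help to poll the import phase with a deadline rather than indefinitely. The helper below is a generic sketch: `wait_for_phase_change` and its arguments are hypothetical, and the command that prints the current phase is supplied by the caller (for example, a wrapper around the status request shown in Step 3).

```shell
# Polls a caller-supplied command until its output differs from the stuck phase,
# or until max_attempts polls have been made.
# Usage: wait_for_phase_change <stuck-phase> <max-attempts> <sleep-seconds> <command...>
wait_for_phase_change() {
  local stuck="$1" max="$2" delay="$3" phase i=1
  shift 3
  while [ "$i" -le "$max" ]; do
    phase=$("$@")                      # run the caller's phase command
    if [ "$phase" != "$stuck" ]; then
      echo "$phase"                    # phase moved on: report it and stop
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still in $stuck after $max polls" >&2
  return 1
}

# Demo with a stand-in command that always reports the same phase.
wait_for_phase_change SettingsBroadcast 2 0 echo SettingsBroadcast \
  || echo "import looks stuck: check node health"
```

In real use, the stand-in `echo` would be replaced by a command that prints the import's current phase, with a longer delay between polls.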
+ +**Solution:** + +```bash +# Check node health +miroir-ctl status + +# Verify all nodes are healthy +kubectl get pods -n search -l app=meilisearch + +# If a node is degraded, fix it first +kubectl describe pod -n search +``` + +### Import fails with "incompatible dump format" + +**Cause:** Dump format from Meilisearch version not supported by Miroir's nodes. + +**Solution:** Check Meilisearch versions match: + +```bash +# Old instance +curl https://old-meili.example.com/version + +# Miroir nodes +kubectl exec -n search -- curl http://localhost:7700/version +``` + +If versions differ significantly, either: +1. Upgrade old instance to match Miroir's version before exporting dump +2. Use **re-index** migration instead (see `from-meilisearch-reindex.md`) + +### Document counts don't match after import + +**Cause:** Streaming router may have failed to route some documents. + +**Solution:** + +```bash +# Check import task for errors +miroir-ctl task status mtask-00123 + +# Re-run import if errors found +# (Idempotent — duplicate documents are ignored) + +# Or run anti-entropy to detect and repair divergences +miroir-ctl anti-entropy run --index myindex +``` + +--- + +## See Also + +- [Plan §13.9 — Streaming routed dump import](../plan/plan.md#139-streaming-routed-dump-import) +- [Re-index migration](from-meilisearch-reindex.md) — for large corpora +- [Live cutover migration](from-meilisearch-live-cutover.md) — for zero-downtime +- [Troubleshooting Guide](../troubleshooting.md) — common issues and solutions diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..8349d81 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,404 @@ +# Common Issues & Troubleshooting + +This guide covers the most common issues encountered when running Miroir in production, along with their symptoms, causes, and fixes. 
+ +## Quick Diagnostics + +Before diving into specific issues, run the [diagnostic playbook](troubleshooting/diagnostics.md) to gather baseline information about your cluster's health. + +## Common Issues + +### Error: "primary key required" + +#### Symptom +Client sees an HTTP 400 response: +```json +{ + "code": "miroir_primary_key_required", + "message": "Miroir requires an explicit primary key at index creation" +} +``` + +#### Cause +The index was created without a `primaryKey` field. Miroir cannot route documents without knowing the primary key in advance. + +#### Fix +```bash +curl -X POST https://miroir/indexes \ + -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \ + -d '{ + "uid": "myindex", + "primaryKey": "id" + }' +``` + +#### Why this differs from Meilisearch +Meilisearch can infer the primary key from the first document batch. Miroir cannot — it needs to hash the PK *before* any node sees it to determine which shard owns the document. Explicit `primaryKey` at index creation is required. + +--- + +### Search returns fewer results than expected + +#### Symptom +Search queries return fewer results than the known document count, especially after node failures or during migrations. + +#### Cause +A replica holding a shard is degraded or unreachable. Miroir's cross-reference mechanism skips degraded replicas to avoid returning incomplete or stale results, which can reduce result counts when RF > 1. + +#### Fix +1. Check topology for degraded nodes: + ```bash + curl -s https://miroir/_miroir/topology | jq '.nodes[] | select(.status != "active")' + ``` + +2. Check for degraded shards: + ```bash + curl -s https://miroir/_miroir/metrics | jq '.degraded_shards' + ``` + +3. If a node is degraded, check its logs: + ```bash + kubectl logs miroir-0 --tail=100 | jq 'select(.level=="ERROR")' + ``` + +4. 
Restart the degraded pod if it's stuck: + ```bash + kubectl delete pod miroir-0 + ``` + +#### Prevention +- Set up canaries to proactively detect search degradation +- Monitor the `miroir_degraded_shards` metric +- Ensure proper resource limits to prevent OOM kills + +--- + +### Task polling stuck at "processing" + +#### Symptom +`miroir-ctl task status` shows a task stuck in "processing" state indefinitely, even though the operation appears complete. + +#### Cause +The task coordinator lost track of per-node task status. This can happen when: +- A node crashes during task execution +- Network partition prevents status updates +- Task registry checkpoint is delayed + +#### Fix +1. Check per-node task status: + ```bash + miroir-ctl task status --task-id --verbose + ``` + +2. Identify which node(s) have incomplete status: + ```bash + kubectl logs miroir-0 --tail=100 | grep "" + kubectl logs miroir-1 --tail=100 | grep "" + ``` + +3. If all nodes have completed but the task is stuck, force-complete the task: + ```bash + miroir-ctl task complete --task-id + ``` + +4. If a node crashed and cannot recover, mark its tasks as failed: + ```bash + miroir-ctl task fail --task-id --node --reason "node crashed" + ``` + +#### Prevention +- Enable task registry checkpointing (default: every 100 tasks) +- Monitor task queue depth via `miroir_task_queue_depth` metric +- Set task timeouts appropriate to your workload + +--- + +### Node drain blocked: "insufficient replicas" + +#### Symptom +```bash +$ miroir-ctl node drain node-1 +Error: Cannot drain node-1: removing it would drop replication factor below minimum +``` + +#### Cause +Draining a node would leave some shards with fewer replicas than the minimum RF. This is a safety check to prevent data loss. + +#### Fix +1. Check current RF configuration: + ```bash + curl -s https://miroir/_miroir/topology | jq '.replication_factor' + ``` + +2. 
Add a new node first: + ```bash + kubectl scale statefulset miroir --replicas=4 + # Wait for node-3 to be ready + kubectl wait --for=condition=ready pod/miroir-3 + ``` + +3. Then retry the drain: + ```bash + miroir-ctl node drain node-1 + ``` + +#### Alternative: Force drain (dangerous) +If you must drain without sufficient replicas, use `--force`: +```bash +miroir-ctl node drain node-1 --force +``` +This will reduce RF for affected shards during migration. Only use this if: +- You can tolerate reduced redundancy temporarily +- Anti-entropy is enabled to repair divergence later + +--- + +### Migration stuck after coordinator crash + +#### Symptom +A shard migration (reshard, rebalance, node drain) was in progress when the coordinator pod crashed. After restart, the migration is stuck and cannot complete or rollback. + +#### Cause +The coordinator stores migration state in the task store. If it crashes during state transitions, the migration may be left in an inconsistent state. + +#### Fix +1. Check migration status: + ```bash + miroir-ctl reshard status --operation-id + ``` + +2. If stuck in "in_progress" with no activity, recover the migration: + ```bash + miroir-ctl reshard recover --operation-id + ``` + +3. If recovery fails, you may need to force-complete: + ```bash + # This skips remaining delta pass and anti-entropy + miroir-ctl reshard complete --operation-id --force + ``` + +4. Run anti-entropy manually to repair any divergence: + ```bash + miroir-ctl anti-entropy run --index-uid + ``` + +#### Prevention +- Enable task store persistence (Redis mode for HA) +- Set coordinator leader election timeout appropriately +- Monitor coordinator pod health via liveness probes + +--- + +### High memory usage on Redis + +#### Symptom +Redis memory usage grows continuously, potentially triggering OOM kills. + +#### Cause +The most common causes are: +1. Idempotency cache entries not expiring +2. Task registry not pruning terminal tasks +3. 
Session entries not being cleaned up + +#### Fix +1. Check Redis memory breakdown: + ```bash + redis-cli INFO memory | grep used_memory_human + redis-cli --bigkeys --pattern "miroir:*" + ``` + +2. Check largest key categories: + ```bash + redis-cli --scan --pattern "miroir:tasks:*" | wc -l # task count + redis-cli --scan --pattern "miroir:idemp:*" | wc -l # idempotency entries + redis-cli --scan --pattern "miroir:session:*" | wc -l # sessions + ``` + +3. Manually trigger cleanup if the pruner is stuck: + ```bash + # Prune old terminal tasks + miroir-ctl task prune --older-than 24h + + # Delete ALL idempotency entries, expired or not (in-flight retries lose deduplication) + redis-cli --scan --pattern "miroir:idemp:*" | xargs redis-cli DEL + ``` + +4. Adjust pruner intervals if needed: + ```toml + # config.toml + [task_store.prune] + interval_seconds = 300 # run every 5 minutes + task_retention_days = 7 + ``` + +#### Prevention +- Monitor Redis memory usage via the `redis_used_memory` metric +- Set `maxmemory` and `maxmemory-policy allkeys-lru` on Redis +- Ensure the pruner is running (check logs for "Pruning terminal tasks" messages) + +--- + +### Index creation fails with "hash routing error" + +#### Symptom +```bash +$ curl -X POST https://miroir/indexes -d '{"uid": "test", "primaryKey": "id"}' +HTTP 500 {"code": "hash_routing_error", "message": "unable to determine shard assignment"} +``` + +#### Cause +This typically happens when: +1. The topology view is inconsistent across nodes +2. The shard count is 0 or not configured +3. The primary key field is missing from schema validation + +#### Fix +1. Check topology consistency: + ```bash + curl -s https://miroir/_miroir/topology | jq '.shards, .replication_factor, (.nodes | length)' + ``` + +2. Verify all nodes agree on shard count: + ```bash + for pod in miroir-0 miroir-1 miroir-2; do + echo "$pod:" + kubectl exec $pod -- curl -s localhost:7700/_miroir/topology | jq '.shards' + done + ``` + +3. 
If nodes disagree, restart the coordinator to force topology reconciliation: + ```bash + kubectl delete pod -l app=miroir,role=coordinator + ``` + +#### Prevention +- Use leader election to ensure single coordinator writer +- Monitor topology change log for conflicts + +--- + +### Alias flip returns "wrong kind" + +#### Symptom +```bash +$ miroir-ctl alias flip prod-logs logs-v2 +Error: Alias 'prod-logs' is a multi-target alias, cannot flip +``` + +#### Cause +You're trying to flip an alias that was created as a "multi" alias (for cross-index search) rather than a "single" alias (for atomic index swap). + +#### Fix +1. Check the alias type: + ```bash + miroir-ctl alias get prod-logs + ``` + +2. If you need a swappable pointer, delete and recreate as a single alias: + ```bash + miroir-ctl alias delete prod-logs + miroir-ctl alias create prod-logs --kind single --current-uid logs-v1 + ``` + +3. For cross-index search, use a separate multi alias: + ```bash + miroir-ctl alias create search-all --kind multi --target-uids logs-v1,metrics-v1 + ``` + +#### Prevention +- Use descriptive alias names to distinguish single vs multi +- Document alias conventions in your runbooks + +--- + +### Search timeout during shard migration + +#### Symptom +Search queries timeout or return 503 errors during active shard migrations, especially for large indexes. + +#### Cause +During migration, some queries may be routed to nodes that are still warming up migrated shards, or to nodes under heavy load from migration work. + +#### Fix +1. Check if migration is active: + ```bash + miroir-ctl reshard list --status in_progress + ``` + +2. Temporarily increase query timeout: + ```bash + curl -X POST https://miroir/indexes/myindex/search \ + -H "Query-Timeout: 30" \ + -d '{"q": "test"}' + ``` + +3. If timeouts persist, pause the migration: + ```bash + miroir-ctl reshard pause --operation-id + ``` + +4. 
Resume during off-peak hours: + ```bash + miroir-ctl reshard resume --operation-id + ``` + +#### Prevention +- Schedule large migrations during low-traffic periods +- Use the `--throttle` flag on reshard to limit CPU usage +- Monitor search latency during migrations + +--- + +### CDC cursor out of sync + +#### Symptom +CDC events arrive with stale or duplicate sequence numbers, or events are missing entirely. + +#### Cause +The CDC cursor stored in Redis is out of sync with the actual event stream. This can happen if: +- The sink was down during a period of high write activity +- A cursor update failed silently +- The sink was reset without clearing the cursor + +#### Fix +1. Check current cursor position: + ```bash + miroir-ctl cdc cursor --sink-name elasticsearch --index-uid myindex + ``` + +2. Compare to Meilisearch event stream: + ```bash + # On a Meilisearch node + curl -s http://localhost:7700/indexes/myindex/cdc/events | jq '.events | length' + ``` + +3. If the cursor is behind, reset it to force re-sync from a checkpoint: + ```bash + miroir-ctl cdc reset-cursor --sink-name elasticsearch --index-uid myindex --confirm + ``` + +4. For large gaps, consider a full re-index: + ```bash + miroir-ctl dump export --index-uid myindex --output /data/myindex.dump + miroir-ctl dump import --sink-name elasticsearch --input /data/myindex.dump + ``` + +#### Prevention +- Monitor CDC lag via the `miroir_cdc_lag_seconds` metric +- Set up alerts for cursor stall detection +- Use idempotent sinks to handle duplicate events gracefully + +--- + +## Getting Help + +If you don't see your issue listed here: + +1. Run the [diagnostic playbook](troubleshooting/diagnostics.md) and gather the output +2. Search [existing GitHub issues](https://github.com/jedarden/miroir/issues) +3. 
Open a new issue with: + - Miroir version + - Diagnostic output + - Relevant logs (sanitized) + - Steps to reproduce diff --git a/docs/troubleshooting/diagnostics.md b/docs/troubleshooting/diagnostics.md new file mode 100644 index 0000000..b7601c1 --- /dev/null +++ b/docs/troubleshooting/diagnostics.md @@ -0,0 +1,315 @@ +# Diagnostic Playbook + +This playbook provides a systematic approach to diagnosing issues in a Miroir cluster. Run these steps in order when investigating any problem. + +## Prerequisites + +Set up your environment: +```bash +export MIROIR_URL="https://miroir.example.com" +export MIROIR_KEY="your-admin-key" +export NAMESPACE="search" # adjust if needed +``` + +## Step 1: Check Cluster Health + +### 1.1 Verify all pods are running +```bash +kubectl get pods -n $NAMESPACE +``` + +**Expected output**: All pods in `Running` state, Ready 1/1 or 2/2. + +**Common issues**: +- Pods in `Pending` → resource constraints, scheduler issues +- Pods in `CrashLoopBackOff` → config errors, OOM kills +- Pods with `Ready: 0/1` → startup probe failing, dependency unavailable + +### 1.2 Check recent pod restarts +```bash +kubectl get pods -n $NAMESPACE -o json | jq -r '.items[] | "\(.metadata.name): \(.status.containerStatuses[0].restartCount) restarts"' +``` + +**Action**: Investigate pods with > 3 restarts in the last hour. + +### 1.3 Check resource usage +```bash +kubectl top pods -n $NAMESPACE +kubectl top nodes +``` + +**Action**: If CPU/memory limits are hit, consider scaling up or adjusting limits. + +## Step 2: Check Miroir Topology + +### 2.1 Get topology overview +```bash +curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '.' 
+``` + +**Expected output**: +```json +{ + "shards": 128, + "replication_factor": 2, + "nodes": [ + {"node_id": "node-0", "status": "active", "shards": [...]}, + {"node_id": "node-1", "status": "active", "shards": [...]}, + {"node_id": "node-2", "status": "active", "shards": [...]} + ] +} +``` + +**Common issues**: +- `status: "degraded"` → node is unreachable or unhealthy +- `status: "draining"` → node migration in progress +- `shards: []` → node has no assigned shards (newly added) + +### 2.2 Check for degraded shards +```bash +curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq ' + .nodes as $nodes | + .shards as $total | + ($nodes | map(.shards | length) | add) as $assigned | + "Assigned: \($assigned)/\($total * .replication_factor) (shards × RF, across \($nodes | length) nodes)", + "Degraded nodes: \([.nodes[] | select(.status != "active")] | length)" +' +``` + +**Action**: Any degraded nodes need investigation (see Step 4). + +### 2.3 Verify node agreement on topology +```bash +for i in 0 1 2; do + echo "=== node-$i ===" + kubectl exec -n $NAMESPACE miroir-$i -- \ + curl -s localhost:7700/_miroir/topology | jq '.shards, .replication_factor' +done +``` + +**Expected**: All nodes report the same shard count and RF. + +**Action**: If nodes disagree, restart the coordinator pod to force reconciliation. 
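The comparison in step 2.3 can be automated instead of read by eye. The sketch below is a minimal helper, assuming each node's `.shards` and `.replication_factor` values have already been collected into `<node> <shards> <rf>` lines; the `topology_agrees` function name and the input format are hypothetical, not part of `miroir-ctl`.

```shell
# Returns 0 when every input line reports the same "<shards> <rf>" pair, 1 otherwise.
# Input (stdin): one line per node, formatted as "<node> <shards> <rf>".
topology_agrees() {
  local node shards rf first=""
  while read -r node shards rf; do
    [ -z "$node" ] && continue            # skip blank lines
    if [ -z "$first" ]; then
      first="$shards $rf"                 # first node sets the reference values
    elif [ "$shards $rf" != "$first" ]; then
      echo "MISMATCH: $node reports '$shards $rf', expected '$first'" >&2
      return 1
    fi
  done
  return 0
}

# Example with hand-written values; node-2 disagrees on shard count.
printf '%s\n' "node-0 128 2" "node-1 128 2" "node-2 64 2" | topology_agrees \
  || echo "nodes disagree: restart the coordinator"
```

A non-zero exit status makes the check easy to wire into a script or alert without parsing the output.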
+ +## Step 3: Check Metrics + +### 3.1 Get metrics summary +```bash +curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" | jq ' +{ + degraded_shards: .degraded_shards // 0, + task_queue_depth: .task_queue_depth // 0, + search_latency_p99: .search_latency_p99_ms // 0, + write_latency_p99: .write_latency_p99_ms // 0, + cdc_lag_seconds: .cdc_lag_seconds // 0 +} +' +``` + +**Key thresholds**: +- `degraded_shards > 0` → investigate node health +- `task_queue_depth > 1000` → task processing bottleneck +- `search_latency_p99 > 1000` → slow queries, need optimization +- `cdc_lag_seconds > 300` → CDC falling behind + +### 3.2 Check Prometheus metrics (if available) +```bash +# Via Prometheus API +curl -s "http://prometheus:9090/api/v1/query?query=miroir_degraded_shards" | jq '.data.result[0].value[1]' + +# Via pod metrics endpoint +kubectl exec -n $NAMESPACE miroir-0 -- curl -s localhost:9091/metrics | grep miroir_ +``` + +## Step 4: Check Logs for Errors + +### 4.1 Get recent errors from all pods +```bash +for pod in $(kubectl get pods -n $NAMESPACE -l app=miroir -o name); do + echo "=== $pod ===" + kubectl logs -n $NAMESPACE $pod --tail=100 | jq -rc 'select(.level=="ERROR")' || true + echo "" +done +``` + +**Common error patterns**: +- `connection refused` → peer pod down or network issue +- `timeout` → slow query, overloaded node +- `hash mismatch` → potential data corruption (run anti-entropy) +- `lease expired` → leader election contention + +### 4.2 Check coordinator logs for topology changes +```bash +kubectl logs -n $NAMESPACE -l app=miroir,role=coordinator --tail=200 | \ + jq -rc 'select(.message | test("topology|node|shard"))' +``` + +### 4.3 Check for crash loop patterns +```bash +kubectl logs -n $NAMESPACE miroir-0 --previous --tail=100 | \ + jq -rc 'select(.level=="ERROR" or .level=="FATAL")' || true +``` + +## Step 5: Check Task Status + +### 5.1 List stuck or long-running tasks +```bash +curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" 
| \ + jq -r '.tasks[] | "\(.miroir_id) (\(.task_type // "unknown")): \(.created_at)"' +``` + +**Action**: Investigate tasks running > 1 hour. + +### 5.2 Get detailed task status +```bash +miroir-ctl task status --task-id --verbose +``` + +### 5.3 Check task registry health +```bash +# SQLite mode +kubectl exec -n $NAMESPACE miroir-0 -- \ + sqlite3 /data/miroir.db "SELECT status, COUNT(*) FROM tasks GROUP BY status;" + +# Redis mode +kubectl exec -n $NAMESPACE redis-0 -- \ + redis-cli --scan --pattern "miroir:tasks:*" | wc -l +``` + +## Step 6: Check Anti-Entropy Status + +### 6.1 Last AE run time +```bash +curl -s "$MIROIR_URL/_miroir/anti-entropy/status?key=$MIROIR_KEY" | \ + jq '{last_run: .last_run_at, next_run: .next_run_at, divergences_found: .divergences_found}' +``` + +**Action**: If `last_run_at` is > 24 hours ago, AE may be stuck. + +### 6.2 Check for divergence +```bash +curl -s "$MIROIR_URL/_miroir/anti-entropy/divergence?key=$MIROIR_KEY" | \ + jq '.divergent_shards | length' +``` + +**Action**: Any divergent shards should trigger an AE run. + +## Step 7: Check External Dependencies + +### 7.1 Check Redis connectivity +```bash +kubectl exec -n $NAMESPACE miroir-0 -- \ + redis-cli -h redis-headless ping +``` + +**Expected**: `PONG` + +### 7.2 Check Meilisearch backend connectivity +```bash +for i in 0 1 2; do + echo "=== miroir-$i ===" + kubectl exec -n $NAMESPACE miroir-$i -- \ + curl -s http://localhost:7700/health | jq '.status' +done +``` + +**Expected**: `"available"` + +### 7.3 Check network policies +```bash +kubectl get networkpolicy -n $NAMESPACE +kubectl describe networkpolicy miroir-allow-peer -n $NAMESPACE +``` + +## Step 8: Run Self-Diagnostics + +### 8.1 Miroir self-check endpoint +```bash +curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" | jq '.' 
+``` + +**Expected output**: +```json +{ + "status": "healthy", + "checks": { + "topology": "ok", + "task_store": "ok", + "coordinator_leader": "ok", + "peers_connected": "ok" + } +} +``` + +### 8.2 Run canary tests +```bash +# List configured canaries +curl -s "$MIROIR_URL/_miroir/canaries?key=$MIROIR_KEY" | \ + jq -r '.canaries[] | .id' + +# Trigger a canary run +curl -X POST "$MIROIR_URL/_miroir/canaries/search-health/run?key=$MIROIR_KEY" +``` + +## Decision Tree + +Based on findings, follow this tree: + +``` +Are any pods not running? +├─ Yes → Check pod logs (Step 4), describe pod for events +└─ No → Continue + +Are any nodes degraded? +├─ Yes → Check node logs, verify network, restart if needed +└─ No → Continue + +Is task queue depth > 1000? +├─ Yes → Check for stuck tasks (Step 5), scale workers if needed +└─ No → Continue + +Is search latency high? +├─ Yes → Check query patterns, consider query optimization +└─ No → Continue + +Any errors in logs? +├─ Yes → Investigate specific error pattern +└─ No → Issue may be external, check dependencies (Step 7) +``` + +## Escalation Checklist + +Before escalating, gather: + +1. **Topology output** (Step 2.1) +2. **Recent errors** (Step 4.1) +3. **Stuck tasks** (Step 5.1) +4. **Metrics snapshot** (Step 3.1) +5. **Pod status** (Step 1.1) + +Attach these to your GitHub issue or support ticket. + +## Prevention: Regular Health Checks + +Set up a cron job or monitoring alert to run this daily: + +```bash +#!/bin/bash +# daily-health-check.sh + +# Quick health check (expects MIROIR_URL and MIROIR_KEY in the environment) +HEALTH=$(curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY") +STATUS=$(echo "$HEALTH" | jq -r '.status') + +if [ "$STATUS" != "healthy" ]; then + echo "UNHEALTHY: $HEALTH" + # Send alert +fi +``` + +## Related Documentation + +- [Common Issues Guide](../troubleshooting.md) +- [Node Drain Runbook](../runbooks/node-drain.md) +- [Migration Runbook](../migration_runbook.md) +- [Metrics Reference](../operations/metrics.md)