miroir/docs/troubleshooting.md
jedarden c5238b1bcd docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5)
Implements P11.5 acceptance criteria:
- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:
1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:
- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5
2026-05-24 14:02:13 -04:00

Common Issues & Troubleshooting

This guide covers the most common issues encountered when running Miroir in production, along with their symptoms, causes, and fixes.

Quick Diagnostics

Before diving into specific issues, run the diagnostic playbook (docs/troubleshooting/diagnostics.md) to gather baseline information about your cluster's health.

Common Issues

Error: "primary key required"

Symptom

Client sees:

HTTP 400 {
  "code": "miroir_primary_key_required",
  "message": "Miroir requires an explicit primary key at index creation"
}

Cause

The index creation request did not include a primaryKey field. Miroir cannot route documents without knowing the primary key in advance.

Fix

curl -X POST https://miroir/indexes \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "uid": "myindex",
    "primaryKey": "id"
  }'

Why this differs from Meilisearch

Meilisearch can infer the primary key from the first document batch. Miroir cannot: it must hash the primary key before any node sees the document in order to determine which shard owns it. An explicit primaryKey at index creation is therefore required.
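
To illustrate why the key must be known up front, here is a minimal sketch of hash-based shard routing. It is not Miroir's actual routing function: the shard_for_key helper, the use of cksum (CRC-32), and the modulo placement are all assumptions chosen for illustration.

```shell
# Hypothetical sketch: map a primary-key value to a shard index.
# Assumes CRC-32 via POSIX cksum and simple modulo placement; Miroir's
# real hash function and placement scheme may differ.
shard_for_key() (
  key=$1
  shard_count=$2
  # cksum prints "<crc> <byte count>"; keep only the CRC.
  crc=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
  echo $(( crc % shard_count ))
)

# The same key always maps to the same shard, so a router can pick the
# owning shard without consulting any node.
shard_for_key "user-42" 8
```

Without a declared primary key there is no value to hash, which is why the request is rejected before any node is contacted.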


Search returns fewer results than expected

Symptom

Search queries return fewer results than the known document count, especially after node failures or during migrations.

Cause

A replica holding a shard is degraded or unreachable. Miroir's cross-reference mechanism skips degraded replicas to avoid returning incomplete or stale results, which can reduce result counts when RF > 1.

Fix

  1. Check topology for degraded nodes:

    curl -s https://miroir/_miroir/topology | jq '.nodes[] | select(.status != "active")'
    
  2. Check for degraded shards:

    curl -s https://miroir/_miroir/metrics | jq '.degraded_shards'
    
  3. If a node is degraded, check its logs:

    kubectl logs miroir-0 --tail=100 | jq 'select(.level=="ERROR")'
    
  4. Restart the degraded pod if it's stuck:

    kubectl delete pod miroir-0
    

Prevention

  • Set up canaries to proactively detect search degradation
  • Monitor the miroir_degraded_shards metric
  • Ensure proper resource limits to prevent OOM kills

Task polling stuck at "processing"

Symptom

miroir-ctl task status shows a task stuck in "processing" state indefinitely, even though the operation appears complete.

Cause

The task coordinator lost track of per-node task status. This can happen when:

  • A node crashes during task execution
  • Network partition prevents status updates
  • Task registry checkpoint is delayed

Fix

  1. Check per-node task status:

    miroir-ctl task status --task-id <miroir_task_id> --verbose
    
  2. Identify which node(s) have incomplete status:

    kubectl logs miroir-0 --tail=100 | grep "<miroir_task_id>"
    kubectl logs miroir-1 --tail=100 | grep "<miroir_task_id>"
    
  3. If all nodes have completed but the task is stuck, force-complete the task:

    miroir-ctl task complete --task-id <miroir_task_id>
    
  4. If a node crashed and cannot recover, mark its tasks as failed:

    miroir-ctl task fail --task-id <miroir_task_id> --node <node_id> --reason "node crashed"
    

Prevention

  • Enable task registry checkpointing (default: every 100 tasks)
  • Monitor task queue depth via miroir_task_queue_depth metric
  • Set task timeouts appropriate to your workload
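
On the client side, a bounded polling loop avoids waiting forever on a task that never leaves "processing". The sketch below is a generic helper, not part of miroir-ctl: the poll_until name and its arguments are hypothetical, and the check command is whatever you use to decide that a task has finished.

```shell
# Hypothetical helper: re-run a check command until it succeeds or the
# attempt budget is exhausted. Returns 0 on success, 1 on timeout.
poll_until() (
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      exit 0
    fi
    # Sleep only when another attempt remains.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$(( attempt + 1 ))
  done
  exit 1
)

# Example with a stand-in check that succeeds immediately.
poll_until 3 0 true && echo "done"
```

In practice the check would be a small function of your own that inspects the output of miroir-ctl task status. When poll_until returns 1, move on to the per-node inspection in the Fix steps rather than continuing to wait.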

Node drain blocked: "insufficient replicas"

Symptom

$ miroir-ctl node drain node-1
Error: Cannot drain node-1: removing it would drop replication factor below minimum

Cause

Draining a node would leave some shards with fewer replicas than the minimum RF. This is a safety check to prevent data loss.
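
The check can be pictured as a per-shard count of the replicas that would remain once the node is gone. The sketch below is a simplified model, not Miroir's implementation: the drain_blockers helper and its input format are hypothetical, and it assumes no replicas are re-homed to other nodes before the drain.

```shell
# Hypothetical sketch of the drain safety check.
# Input on stdin: one line per shard, "<shard_id> <comma-separated node list>".
# Prints the shards that would fall below min_rf if the node were drained.
drain_blockers() (
  node=$1
  min_rf=$2
  awk -v node="$node" -v min_rf="$min_rf" '{
    n = split($2, replicas, ",")
    remaining = 0
    for (i = 1; i <= n; i++) if (replicas[i] != node) remaining++
    if (remaining < min_rf) print $1
  }'
)

# Shard 0 would keep only one replica without node-1, so it blocks the drain.
printf '%s\n' "0 node-0,node-1" "1 node-0,node-2" "2 node-0,node-1,node-2" |
  drain_blockers node-1 2
# prints: 0
```

Any shard printed by such a check needs another replica elsewhere first, which is why the fix starts by adding a node.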

Fix

  1. Check current RF configuration:

    curl -s https://miroir/_miroir/topology | jq '.replication_factor'
    
  2. Add a new node first:

    kubectl scale statefulset miroir --replicas=4
    # Wait for miroir-3 to be ready (kubectl wait defaults to a 30s timeout)
    kubectl wait --for=condition=ready pod/miroir-3 --timeout=300s
    
  3. Then retry the drain:

    miroir-ctl node drain node-1
    

Alternative: Force drain (dangerous)

If you must drain without sufficient replicas, use --force:

miroir-ctl node drain node-1 --force

This will reduce RF for affected shards during migration. Only use this if:

  • You can tolerate reduced redundancy temporarily
  • Anti-entropy is enabled to repair divergence later

Migration stuck after coordinator crash

Symptom

A shard migration (reshard, rebalance, node drain) was in progress when the coordinator pod crashed. After restart, the migration is stuck and can neither complete nor roll back.

Cause

The coordinator stores migration state in the task store. If it crashes during state transitions, the migration may be left in an inconsistent state.

Fix

  1. Check migration status:

    miroir-ctl reshard status --operation-id <operation_id>
    
  2. If stuck in "in_progress" with no activity, recover the migration:

    miroir-ctl reshard recover --operation-id <operation_id>
    
  3. If recovery fails, you may need to force-complete:

    # This skips remaining delta pass and anti-entropy
    miroir-ctl reshard complete --operation-id <operation_id> --force
    
  4. Run anti-entropy manually to repair any divergence:

    miroir-ctl anti-entropy run --index-uid <affected_index>
    

Prevention

  • Enable task store persistence (Redis mode for HA)
  • Set coordinator leader election timeout appropriately
  • Monitor coordinator pod health via liveness probes

High memory usage on Redis

Symptom

Redis memory usage grows continuously, potentially triggering OOM kills.

Cause

The most common causes are:

  1. Idempotency cache entries not expiring
  2. Task registry not pruning terminal tasks
  3. Session entries not being cleaned up

Fix

  1. Check Redis memory breakdown:

    redis-cli INFO memory | grep used_memory_human
    redis-cli --bigkeys --pattern "miroir:*"
    
  2. Check largest key categories:

    redis-cli --scan --pattern "miroir:tasks:*" | wc -l  # task count
    redis-cli --scan --pattern "miroir:idemp:*" | wc -l   # idempotency entries
    redis-cli --scan --pattern "miroir:session:*" | wc -l # sessions
    
  3. Manually trigger cleanup if pruner is stuck:

    # Prune old terminal tasks
    miroir-ctl task prune --older-than 24h
    
    # Delete ALL idempotency entries, including unexpired ones
    # (requests retried afterwards will no longer be deduplicated)
    redis-cli --scan --pattern "miroir:idemp:*" | xargs -r -n 100 redis-cli DEL
    
  4. Adjust pruner intervals if needed:

    # config.toml
    [task_store.prune]
    interval_seconds = 300  # run every 5 minutes
    task_retention_days = 7
    

Prevention

  • Monitor Redis memory usage via redis_used_memory metric
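  • Set maxmemory and maxmemory-policy allkeys-lru on Redis; note that allkeys-lru can evict any key, including task and CDC cursor state, so consider volatile-lru if only keys with a TTL should be evicted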
  • Ensure pruner is running (check logs for "Pruning terminal tasks" messages)
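
The three per-category counts in the Fix steps each scan the keyspace separately. As a sketch, a single scan can be grouped by category instead. The count_key_categories helper is hypothetical, and it assumes the category is the second colon-separated field of the key, as in the miroir:tasks:*, miroir:idemp:* and miroir:session:* patterns above.

```shell
# Hypothetical helper: count keys per category from key names on stdin.
# The category is taken to be the second colon-separated field,
# e.g. "tasks" in "miroir:tasks:abc". Keys outside "miroir:" are ignored.
count_key_categories() (
  awk -F ':' '$1 == "miroir" { counts[$2]++ }
    END { for (c in counts) print c, counts[c] }' | sort
)

# Example with literal key names standing in for a real scan.
printf '%s\n' "miroir:tasks:1" "miroir:tasks:2" "miroir:idemp:a" "other:x" |
  count_key_categories
# prints:
# idemp 1
# tasks 2
```

Against a live instance the input would come from redis-cli --scan --pattern "miroir:*" piped into the helper. A category whose count keeps growing between runs points at the cleanup path that is not keeping up.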

Index creation fails with "hash routing error"

Symptom

$ curl -X POST https://miroir/indexes -d '{"uid": "test", "primaryKey": "id"}'
HTTP 500 {"code": "hash_routing_error", "message": "unable to determine shard assignment"}

Cause

This typically happens when:

  1. The topology view is inconsistent across nodes
  2. The shard count is 0 or not configured
  3. The primary key field is missing from schema validation

Fix

  1. Check topology consistency:

    curl -s https://miroir/_miroir/topology | jq '.shards, .replication_factor, (.nodes | length)'
    
  2. Verify all nodes agree on shard count:

    for pod in miroir-0 miroir-1 miroir-2; do
      echo "$pod:"
      kubectl exec "$pod" -- curl -s localhost:7700/_miroir/topology | jq '.shards'
    done
    
  3. If nodes disagree, restart the coordinator to force topology reconciliation:

    kubectl delete pod -l app=miroir,role=coordinator
    

Prevention

  • Use leader election to ensure single coordinator writer
  • Monitor topology change log for conflicts
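
Comparing per-node output by eye gets error-prone as the cluster grows. A small helper can turn the comparison into an exit code. This is a sketch under assumptions: the topology_agrees name and the "<pod> <shard_count>" input format are hypothetical, not something Miroir provides.

```shell
# Hypothetical helper: read "<pod> <shard_count>" lines on stdin and
# exit 0 only if every pod reports the same shard count.
topology_agrees() {
  awk 'NF >= 2 { seen[$2] = 1 }
       END { n = 0; for (v in seen) n++; exit ((n == 1) ? 0 : 1) }'
}

# Example with literal values standing in for live per-pod output.
if printf '%s\n' "miroir-0 8" "miroir-1 8" "miroir-2 16" | topology_agrees; then
  echo "topology consistent"
else
  echo "nodes disagree on shard count"
fi
# prints: nodes disagree on shard count
```

To feed it real data, have the loop in step 2 print the pod name and its shard count on one line per pod and pipe the loop's output into the helper.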

Alias flip returns "wrong kind"

Symptom

$ miroir-ctl alias flip prod-logs logs-v2
Error: Alias 'prod-logs' is a multi-target alias, cannot flip

Cause

You're trying to flip an alias that was created as a "multi" alias (for cross-index search) rather than a "single" alias (for atomic index swap).

Fix

  1. Check the alias type:

    miroir-ctl alias get prod-logs
    
  2. If you need a swappable pointer, delete and recreate as a single alias:

    miroir-ctl alias delete prod-logs
    miroir-ctl alias create prod-logs --kind single --current-uid logs-v1
    
  3. For cross-index search, use a separate multi alias:

    miroir-ctl alias create search-all --kind multi --target-uids logs-v1,metrics-v1
    

Prevention

  • Use descriptive alias names to distinguish single vs multi
  • Document alias conventions in your runbooks

Search timeout during shard migration

Symptom

Search queries time out or return 503 errors during active shard migrations, especially for large indexes.

Cause

During migration, some queries may be routed to nodes that are still warming up migrated shards, or to nodes under heavy load from migration work.

Fix

  1. Check if migration is active:

    miroir-ctl reshard list --status in_progress
    
  2. Temporarily increase query timeout:

    curl -X POST https://miroir/indexes/myindex/search \
      -H "Content-Type: application/json" \
      -H "Query-Timeout: 30" \
      -d '{"q": "test"}'
    
  3. If timeouts persist, pause the migration:

    miroir-ctl reshard pause --operation-id <operation_id>
    
  4. Resume during off-peak hours:

    miroir-ctl reshard resume --operation-id <operation_id>
    

Prevention

  • Schedule large migrations during low-traffic periods
  • Use --throttle flag on reshard to limit CPU usage
  • Monitor search latency during migrations

CDC cursor out of sync

Symptom

CDC events arrive with stale or duplicate sequence numbers, or events are missing entirely.

Cause

The CDC cursor stored in Redis is out of sync with the actual event stream. This can happen if:

  • The sink was down during a period of high write activity
  • A cursor update failed silently
  • The sink was reset without clearing the cursor

Fix

  1. Check current cursor position:

    miroir-ctl cdc cursor --sink-name elasticsearch --index-uid myindex
    
  2. Compare to the Meilisearch event stream:

    # On a Meilisearch node
    curl -s http://localhost:7700/indexes/myindex/cdc/events | jq '.events | length'
    
  3. If the cursor is behind, reset it to force a re-sync from a checkpoint:

    miroir-ctl cdc reset-cursor --sink-name elasticsearch --index-uid myindex --confirm
    
  4. For large gaps, consider a full re-index:

    miroir-ctl dump export --index-uid myindex --output /data/myindex.dump
    miroir-ctl dump import --sink-name elasticsearch --input /data/myindex.dump
    

Prevention

  • Monitor CDC lag via miroir_cdc_lag_seconds metric
  • Set up alerts for cursor stall detection
  • Use idempotent sinks to handle duplicate events gracefully
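
The choice between resetting the cursor and running a full re-index comes down to how far the cursor has drifted. The sketch below encodes that decision under stated assumptions: sequence numbers are monotonically increasing integers, the re-index threshold is an operator choice, and the cdc_recovery_action helper is hypothetical rather than a miroir-ctl feature.

```shell
# Hypothetical helper: choose a recovery action from the stored cursor,
# the latest sequence number in the event stream, and a gap threshold.
cdc_recovery_action() (
  cursor=$1
  stream_head=$2
  reindex_threshold=$3
  gap=$(( stream_head - cursor ))
  if [ "$gap" -eq 0 ]; then
    echo "in-sync"
  elif [ "$gap" -lt 0 ] || [ "$gap" -lt "$reindex_threshold" ]; then
    # Small lag, or a cursor ahead of the stream (e.g. after a sink reset).
    echo "reset-cursor"
  else
    echo "full-reindex"
  fi
)

cdc_recovery_action 950 1000 500    # prints: reset-cursor
cdc_recovery_action 0 100000 500    # prints: full-reindex
```

The threshold depends on how quickly the sink can replay events compared with how long a dump export and import takes.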

Getting Help

If you don't see your issue listed here:

  1. Run the diagnostic playbook and gather the output
  2. Search existing GitHub issues
  3. Open a new issue with:
    • Miroir version
    • Diagnostic output
    • Relevant logs (sanitized)
    • Steps to reproduce