Common Issues & Troubleshooting
This guide covers the most common issues encountered when running Miroir in production, along with their symptoms, causes, and fixes.
Quick Diagnostics
Before diving into specific issues, run the diagnostic playbook to gather baseline information about your cluster's health.
Common Issues
Error: "primary key required"
Symptom
Client sees:
HTTP 400 {
"code": "miroir_primary_key_required",
"message": "Miroir requires an explicit primary key at index creation"
}
Cause
The index creation request omitted the primaryKey field. Miroir cannot route documents without knowing the primary key in advance.
Fix
curl -X POST https://miroir/indexes \
-H "Authorization: Bearer $KEY" \
-d '{
"uid": "myindex",
"primaryKey": "id"
}'
Why this differs from Meilisearch
Meilisearch can infer the primary key from the first document batch. Miroir cannot: it must hash the primary key before any node sees the document in order to determine which shard owns it, so an explicit primaryKey is required at index creation.
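To make the routing constraint concrete, here is a minimal sketch of hash-based shard assignment. The hash function (the CRC from POSIX cksum), the modulo mapping, and the shard_for name are illustrative assumptions; this guide does not state which hash Miroir actually uses.

```shell
# Hypothetical sketch: map a primary-key value to a shard index.
# Assumption: the CRC from POSIX `cksum` stands in for Miroir's real hash.
shard_for() {
  pk=$1
  shard_count=$2
  # cksum prints "<crc> <byte-count>" for stdin; keep only the CRC.
  crc=$(printf '%s' "$pk" | cksum | cut -d' ' -f1)
  echo $((crc % shard_count))
}
```

Calling shard_for doc-42 8 prints the same shard index every time, which is the property a router needs. Without knowing which field holds the primary key, there is nothing to hash.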
Search returns fewer results than expected
Symptom
Search queries return fewer results than the known document count, especially after node failures or during migrations.
Cause
A replica holding a shard is degraded or unreachable. Miroir's cross-reference mechanism skips degraded replicas to avoid returning incomplete or stale results, which can reduce result counts when RF > 1.
Fix
- Check topology for degraded nodes:
  curl -s https://miroir/_miroir/topology | jq '.nodes[] | select(.status != "active")'
- Check for degraded shards:
  curl -s https://miroir/_miroir/metrics | jq '.degraded_shards'
- If a node is degraded, check its logs:
  kubectl logs miroir-0 --tail=100 | jq 'select(.level=="ERROR")'
- Restart the degraded pod if it's stuck:
  kubectl delete pod miroir-0
Prevention
- Set up canaries to proactively detect search degradation
- Monitor the miroir_degraded_shards metric
- Ensure proper resource limits to prevent OOM kills
Task polling stuck at "processing"
Symptom
miroir-ctl task status shows a task stuck in "processing" state indefinitely, even though the operation appears complete.
Cause
The task coordinator lost track of per-node task status. This can happen when:
- A node crashes during task execution
- Network partition prevents status updates
- Task registry checkpoint is delayed
Fix
- Check per-node task status:
  miroir-ctl task status --task-id <miroir_task_id> --verbose
- Identify which node(s) have incomplete status:
  kubectl logs miroir-0 --tail=100 | grep "<miroir_task_id>"
  kubectl logs miroir-1 --tail=100 | grep "<miroir_task_id>"
- If all nodes have completed but the task is stuck, force-complete the task:
  miroir-ctl task complete --task-id <miroir_task_id>
- If a node crashed and cannot recover, mark its tasks as failed:
  miroir-ctl task fail --task-id <miroir_task_id> --node <node_id> --reason "node crashed"
Prevention
- Enable task registry checkpointing (default: every 100 tasks)
- Monitor task queue depth via the miroir_task_queue_depth metric
- Set task timeouts appropriate to your workload
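When scripting around task polling, it helps to bound the wait rather than loop forever. The following is a minimal sketch of a polling helper with an attempt budget. The name wait_until and its arguments are hypothetical; the check command is whatever your tooling uses to test for completion.

```shell
# Hypothetical helper: run a check command repeatedly until it succeeds
# or the attempt budget is exhausted.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  max_attempts=$1
  sleep_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, wait_until 30 10 task_is_done "$TASK_ID", where task_is_done is a function you write around miroir-ctl task status, gives up after roughly five minutes so that an operator can step in with the manual steps from the Fix list.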
Node drain blocked: "insufficient replicas"
Symptom
$ miroir-ctl node drain node-1
Error: Cannot drain node-1: removing it would drop replication factor below minimum
Cause
Draining a node would leave some shards with fewer replicas than the minimum RF. This is a safety check to prevent data loss.
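The safety check amounts to simple arithmetic. The sketch below is a simplified, cluster-wide approximation: the real check is applied per shard, and the function name can_drain is hypothetical. It assumes every shard needs its replicas on distinct nodes, so at least as many nodes as the RF must remain after the drain.

```shell
# Hypothetical, simplified model of the drain safety check.
# Assumption: each shard needs `rf` replicas on distinct nodes,
# so at least `rf` nodes must remain once one node is removed.
can_drain() {
  node_count=$1
  rf=$2
  remaining=$((node_count - 1))
  [ "$remaining" -ge "$rf" ]
}
```

For instance, with three nodes and an RF of 3, can_drain 3 3 fails, while can_drain 4 3 succeeds. That is why adding a node before draining unblocks the operation.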
Fix
- Check current RF configuration:
  curl -s https://miroir/_miroir/topology | jq '.replication_factor'
- Add a new node first:
  kubectl scale statefulset miroir --replicas=4
  # Wait for node-3 to be ready
  kubectl wait --for=condition=ready pod/miroir-3
- Then retry the drain:
  miroir-ctl node drain node-1
Alternative: Force drain (dangerous)
If you must drain without sufficient replicas, use --force:
miroir-ctl node drain node-1 --force
This will reduce RF for affected shards during migration. Only use this if:
- You can tolerate reduced redundancy temporarily
- Anti-entropy is enabled to repair divergence later
Migration stuck after coordinator crash
Symptom
A shard migration (reshard, rebalance, node drain) was in progress when the coordinator pod crashed. After restart, the migration is stuck and can neither complete nor roll back.
Cause
The coordinator stores migration state in the task store. If it crashes during state transitions, the migration may be left in an inconsistent state.
Fix
- Check migration status:
  miroir-ctl reshard status --operation-id <operation_id>
- If stuck in "in_progress" with no activity, recover the migration:
  miroir-ctl reshard recover --operation-id <operation_id>
- If recovery fails, you may need to force-complete:
  # This skips remaining delta pass and anti-entropy
  miroir-ctl reshard complete --operation-id <operation_id> --force
- Run anti-entropy manually to repair any divergence:
  miroir-ctl anti-entropy run --index-uid <affected_index>
Prevention
- Enable task store persistence (Redis mode for HA)
- Set coordinator leader election timeout appropriately
- Monitor coordinator pod health via liveness probes
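The Fix steps in this section can be chained in a small script. This is a sketch only: the function name recover_migration and the MIROIR_CTL variable are hypothetical, the subcommands are the ones shown in this guide, and it assumes each miroir-ctl command exits non-zero on failure.

```shell
# Path to the CLI; overridable so the flow can be dry-run with a stub.
MIROIR_CTL=${MIROIR_CTL:-miroir-ctl}

# Try a normal recovery first; fall back to force-complete plus anti-entropy.
recover_migration() {
  operation_id=$1
  index_uid=$2
  if "$MIROIR_CTL" reshard recover --operation-id "$operation_id"; then
    return 0
  fi
  # Force-complete skips the remaining delta pass and anti-entropy,
  # so repair divergence explicitly afterwards.
  "$MIROIR_CTL" reshard complete --operation-id "$operation_id" --force || return 1
  "$MIROIR_CTL" anti-entropy run --index-uid "$index_uid"
}
```

Keeping the force path behind a failed recover means the destructive option runs only when the safer one has already been tried.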
High memory usage on Redis
Symptom
Redis memory usage grows continuously, potentially triggering OOM kills.
Cause
The most common causes are:
- Idempotency cache entries not expiring
- Task registry not pruning terminal tasks
- Session entries not being cleaned up
Fix
- Check Redis memory breakdown:
  redis-cli INFO memory | grep used_memory_human
  redis-cli --bigkeys --pattern "miroir:*"
- Check largest key categories:
  redis-cli --scan --pattern "miroir:tasks:*" | wc -l    # task count
  redis-cli --scan --pattern "miroir:idemp:*" | wc -l    # idempotency entries
  redis-cli --scan --pattern "miroir:session:*" | wc -l  # sessions
- Manually trigger cleanup if pruner is stuck:
  # Prune old terminal tasks
  miroir-ctl task prune --older-than 24h
  # Delete idempotency entries (caution: this removes every matching key, not only expired ones)
  redis-cli --scan --pattern "miroir:idemp:*" | xargs redis-cli DEL
- Adjust pruner intervals if needed:
  # config.toml
  [task_store.prune]
  interval_seconds = 300  # run every 5 minutes
  task_retention_days = 7
Prevention
- Monitor Redis memory usage via the redis_used_memory metric
- Set maxmemory and maxmemory-policy allkeys-lru on Redis
- Ensure pruner is running (check logs for "Pruning terminal tasks" messages)
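To see which key family dominates without running one scan per pattern, the key names from a single scan can be grouped by prefix. A minimal sketch follows; the count_by_category name is hypothetical, and it assumes keys follow the miroir:<category>:<id> layout implied by the patterns used in this section.

```shell
# Read key names on stdin and print "<category> <count>" per key family,
# where the category is the second colon-separated field (e.g. "tasks").
count_by_category() {
  awk -F: '{ counts[$2]++ } END { for (c in counts) print c, counts[c] }' | sort
}

# Example usage against a live Redis (not run here):
#   redis-cli --scan --pattern "miroir:*" | count_by_category
```

A category whose count keeps growing between runs is the first candidate for a stuck pruner or a missing TTL.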
Index creation fails with "hash routing error"
Symptom
$ curl -X POST https://miroir/indexes -d '{"uid": "test", "primaryKey": "id"}'
HTTP 500 {"code": "hash_routing_error", "message": "unable to determine shard assignment"}
Cause
This typically happens when:
- The topology view is inconsistent across nodes
- The shard count is 0 or not configured
- The primary key field is missing from schema validation
Fix
- Check topology consistency:
  curl -s https://miroir/_miroir/topology | jq '.shards, .replication_factor, (.nodes | length)'
- Verify all nodes agree on shard count:
  for pod in miroir-0 miroir-1 miroir-2; do
    echo "$pod:"
    kubectl exec "$pod" -- curl -s localhost:7700/_miroir/topology | jq '.shards'
  done
- If nodes disagree, restart the coordinator to force topology reconciliation:
  kubectl delete pod -l app=miroir,role=coordinator
Prevention
- Use leader election to ensure single coordinator writer
- Monitor topology change log for conflicts
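Comparing per-node output by eye becomes error-prone beyond a few pods. Below is a minimal sketch of an agreement check; the helper name all_agree is hypothetical, and it simply compares the values passed to it.

```shell
# Succeeds only if every argument equals the first one.
all_agree() {
  [ "$#" -gt 0 ] || return 1
  first=$1
  for value in "$@"; do
    [ "$value" = "$first" ] || return 1
  done
  return 0
}
```

Feed it the shard counts reported by each pod, for example all_agree 4 4 4. A non-zero exit status means at least one node holds a divergent view of the topology.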
Alias flip returns "wrong kind"
Symptom
$ miroir-ctl alias flip prod-logs logs-v2
Error: Alias 'prod-logs' is a multi-target alias, cannot flip
Cause
You're trying to flip an alias that was created as a "multi" alias (for cross-index search) rather than a "single" alias (for atomic index swap).
Fix
- Check the alias type:
  miroir-ctl alias get prod-logs
- If you need a swappable pointer, delete and recreate as a single alias:
  miroir-ctl alias delete prod-logs
  miroir-ctl alias create prod-logs --kind single --current-uid logs-v1
- For cross-index search, use a separate multi alias:
  miroir-ctl alias create search-all --kind multi --target-uids logs-v1,metrics-v1
Prevention
- Use descriptive alias names to distinguish single vs multi
- Document alias conventions in your runbooks
Search timeout during shard migration
Symptom
Search queries time out or return 503 errors during active shard migrations, especially for large indexes.
Cause
During migration, some queries may be routed to nodes that are still warming up migrated shards, or to nodes under heavy load from migration work.
Fix
- Check if migration is active:
  miroir-ctl reshard list --status in_progress
- Temporarily increase query timeout:
  curl -X POST https://miroir/indexes/myindex/search \
    -H "Query-Timeout: 30" \
    -d '{"q": "test"}'
- If timeouts persist, pause the migration:
  miroir-ctl reshard pause --operation-id <operation_id>
- Resume during off-peak hours:
  miroir-ctl reshard resume --operation-id <operation_id>
Prevention
- Schedule large migrations during low-traffic periods
- Use the --throttle flag on reshard to limit CPU usage
- Monitor search latency during migrations
CDC cursor out of sync
Symptom
CDC events arrive with stale or duplicate sequence numbers, or events are missing entirely.
Cause
The CDC cursor stored in Redis is out of sync with the actual event stream. This can happen if:
- The sink was down during a period of high write activity
- A cursor update failed silently
- The sink was reset without clearing the cursor
Fix
- Check current cursor position:
  miroir-ctl cdc cursor --sink-name elasticsearch --index-uid myindex
- Compare to Meilisearch event stream:
  # On a Meilisearch node
  curl -s http://localhost:7700/indexes/myindex/cdc/events | jq '.events | length'
- If cursor is behind, reset it to force re-sync from a checkpoint:
  miroir-ctl cdc reset-cursor --sink-name elasticsearch --index-uid myindex --confirm
- For large gaps, consider a full re-index:
  miroir-ctl dump export --index-uid myindex --output /data/myindex.dump
  miroir-ctl dump import --sink-name elasticsearch --input /data/myindex.dump
Prevention
- Monitor CDC lag via the miroir_cdc_lag_seconds metric
- Set up alerts for cursor stall detection
- Use idempotent sinks to handle duplicate events gracefully
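An idempotent sink can be approximated by tracking the highest sequence number already applied and ignoring anything at or below it. The sketch below assumes each event arrives as one line whose first whitespace-separated field is an increasing integer sequence number. That line format and the function name are assumptions for illustration, not Miroir's CDC wire format.

```shell
# Drop stale or duplicate events: pass through only lines whose sequence
# number (field 1) is greater than the highest one seen so far.
# $1 is the last sequence number already applied (default 0).
drop_replayed_events() {
  awk -v last="${1:-0}" '($1 + 0) > (last + 0) { print; last = $1 }'
}
```

Piping a replayed stream through drop_replayed_events 1 discards event 1 and any later repeats, so a cursor reset that re-sends old events does not apply them twice. Missing events are not recovered by this filter; gaps still call for the cursor reset or re-index steps.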
Getting Help
If you don't see your issue listed here:
- Run the diagnostic playbook and gather the output
- Search existing GitHub issues
- Open a new issue with:
  - Miroir version
  - Diagnostic output
  - Relevant logs (sanitized)
  - Steps to reproduce