jedarden c5238b1bcd docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5)

Implements P11.5 acceptance criteria:
- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:
1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:
- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5

2026-05-24 14:02:13 -04:00


Common Issues & Troubleshooting

This guide covers the most common issues encountered when running Miroir in production, along with their symptoms, causes, and fixes.

Quick Diagnostics

Before diving into specific issues, run the diagnostic playbook (docs/troubleshooting/diagnostics.md) to gather baseline information about your cluster's health.

Common Issues

Error: "primary key required"

Symptom

Client sees:

HTTP 400 {
  "code": "miroir_primary_key_required",
  "message": "Miroir requires an explicit primary key at index creation"
}

Cause

The index creation request omitted the primaryKey field. Miroir cannot route documents without knowing the primary key in advance.

Fix

curl -X POST https://miroir/indexes \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "uid": "myindex",
    "primaryKey": "id"
  }'

Why this differs from Meilisearch

Meilisearch can infer the primary key from the first document batch. Miroir cannot: it must hash the primary key before any node sees the document, because the hash determines which shard owns it. An explicit primaryKey at index creation is therefore required.
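To make the routing constraint concrete, here is a minimal sketch of hash-based shard selection. It assumes CRC-32 (via POSIX `cksum`) taken modulo the shard count. This guide does not say which hash Miroir actually uses, and `shard_for_key` is a hypothetical helper, not part of Miroir.

```shell
# shard_for_key <primary_key_value> <shard_count>
# Prints the shard index in [0, shard_count) that would own the key.
# Assumption: CRC-32 from `cksum` stands in for Miroir's real hash function.
shard_for_key() {
  key=$1
  shards=$2
  # cksum prints "<crc> <byte_count>" for stdin; keep only the CRC.
  crc=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
  echo $(( crc % shards ))
}
```

Calling `shard_for_key user-42 8` prints an index between 0 and 7, and the same key always maps to the same index. A router can only do this if it knows which field holds the key before the first document arrives.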

Search returns fewer results than expected

Symptom

Search queries return fewer results than the known document count, especially after node failures or during migrations.

Cause

A replica holding a shard is degraded or unreachable. Miroir's cross-reference mechanism skips degraded replicas to avoid returning incomplete or stale results, which can reduce result counts when RF > 1.

Fix

Check topology for degraded nodes:

curl -s https://miroir/_miroir/topology | jq '.nodes[] | select(.status != "active")'

Check for degraded shards:

curl -s https://miroir/_miroir/metrics | jq '.degraded_shards'

If a node is degraded, check its logs:

kubectl logs miroir-0 --tail=100 | jq -R 'fromjson? | select(.level? == "ERROR")'

Restart the degraded pod if it's stuck:
```
kubectl delete pod miroir-0
```

Prevention

- Set up canaries to proactively detect search degradation
- Monitor the miroir_degraded_shards metric
- Ensure proper resource limits to prevent OOM kills
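The degraded-shard check from the Fix steps can be turned into a small probe for a cron job or alert hook. The sketch below only classifies a value that has already been fetched. `degraded_status` is a hypothetical helper, and its exit codes are a convention chosen here, not something Miroir defines.

```shell
# degraded_status <value>
# Prints "ok", "degraded", or "unknown" and returns 0, 1, or 2 respectively.
degraded_status() {
  # Anything that is not a non-negative integer (empty, "null", an error
  # message) is reported as unknown rather than silently treated as healthy.
  case "$1" in
    ''|*[!0-9]*) echo "unknown"; return 2 ;;
  esac
  if [ "$1" -eq 0 ]; then
    echo "ok"
    return 0
  fi
  echo "degraded"
  return 1
}
```

A possible invocation, reusing the metrics call shown earlier: `degraded_status "$(curl -s https://miroir/_miroir/metrics | jq '.degraded_shards')"`.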

Task polling stuck at "processing"

Symptom

miroir-ctl task status shows a task stuck in "processing" state indefinitely, even though the operation appears complete.

Cause

The task coordinator lost track of per-node task status. This can happen when:

- A node crashes during task execution
- Network partition prevents status updates
- Task registry checkpoint is delayed

Fix

Check per-node task status:

miroir-ctl task status --task-id <miroir_task_id> --verbose

Identify which node(s) have incomplete status:

kubectl logs miroir-0 --tail=100 | grep "<miroir_task_id>"
kubectl logs miroir-1 --tail=100 | grep "<miroir_task_id>"

If all nodes have completed but the task is stuck, force-complete the task:
```
miroir-ctl task complete --task-id <miroir_task_id>
```

If a node crashed and cannot recover, mark its tasks as failed:

miroir-ctl task fail --task-id <miroir_task_id> --node <node_id> --reason "node crashed"

Prevention

- Enable task registry checkpointing (default: every 100 tasks)
- Monitor task queue depth via miroir_task_queue_depth metric
- Set task timeouts appropriate to your workload
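On the client side, a bounded polling loop keeps automation from waiting forever on a task that is stuck in "processing". The sketch below is generic: `poll_until` is a hypothetical helper that re-runs any command until it prints an expected value or the attempt budget runs out.

```shell
# poll_until <expected_output> <max_attempts> <sleep_seconds> <command...>
# Returns 0 as soon as the command prints exactly <expected_output>,
# or 1 once <max_attempts> polls have been made without a match.
poll_until() {
  expected=$1
  attempts=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}
```

The output format of `miroir-ctl task status` is not shown in this guide, so in practice you would wrap it in a function of your own that prints only the bare state, for example a hypothetical `task_state <miroir_task_id>`, and call `poll_until succeeded 60 5 task_state <miroir_task_id>`. The state name `succeeded` is also an assumption. When the loop returns 1, move on to the per-node checks in the Fix steps.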

Node drain blocked: "insufficient replicas"

Symptom

$ miroir-ctl node drain node-1
Error: Cannot drain node-1: removing it would drop replication factor below minimum

Cause

Draining a node would leave some shards with fewer replicas than the minimum RF. This is a safety check to prevent data loss.
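The arithmetic behind this safety check can be sketched as follows. It assumes that each replica of a shard lives on a distinct node, so the nodes remaining after the drain must number at least RF. This is a necessary condition only. The real check is per shard, and `can_drain` is a hypothetical helper.

```shell
# can_drain <active_node_count> <replication_factor>
# Succeeds when the nodes left after removing one can still host RF
# distinct replicas of every shard.
can_drain() {
  [ $(( $1 - 1 )) -ge "$2" ]
}
```

With three nodes and RF 3, `can_drain 3 3` fails, which matches the error above. With four nodes it succeeds, which is why the fix adds a node before retrying.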

Fix

Check current RF configuration:

curl -s https://miroir/_miroir/topology | jq '.replication_factor'

Add a new node first:

kubectl scale statefulset miroir --replicas=4
# Wait for node-3 to be ready (kubectl wait gives up after 30s by default)
kubectl wait --for=condition=ready pod/miroir-3 --timeout=300s

Then retry the drain:
```
miroir-ctl node drain node-1
```

Alternative: Force drain (dangerous)

If you must drain without sufficient replicas, use --force:

miroir-ctl node drain node-1 --force

This will reduce RF for affected shards during migration. Only use this if:

- You can tolerate reduced redundancy temporarily
- Anti-entropy is enabled to repair divergence later

Migration stuck after coordinator crash

Symptom

A shard migration (reshard, rebalance, node drain) was in progress when the coordinator pod crashed. After restart, the migration is stuck and can neither complete nor roll back.

Cause

The coordinator stores migration state in the task store. If it crashes during state transitions, the migration may be left in an inconsistent state.

Fix

Check migration status:

miroir-ctl reshard status --operation-id <operation_id>

If stuck in "in_progress" with no activity, recover the migration:
```
miroir-ctl reshard recover --operation-id <operation_id>
```

If recovery fails, you may need to force-complete:

# This skips remaining delta pass and anti-entropy
miroir-ctl reshard complete --operation-id <operation_id> --force

Run anti-entropy manually to repair any divergence:

miroir-ctl anti-entropy run --index-uid <affected_index>
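The recovery steps above can be chained into one script. The sketch below uses only the `miroir-ctl` subcommands shown in this section. The `MIROIR_CTL` override and the `recover_migration` name are hypothetical. Running anti-entropy only after a forced completion is one reading of the steps, chosen because the forced path is the one that skips it.

```shell
# recover_migration <operation_id> <index_uid>
# Tries a normal recovery first; falls back to force-complete followed by
# a manual anti-entropy run. Prints which path was taken.
recover_migration() {
  op=$1
  index=$2
  ctl=${MIROIR_CTL:-miroir-ctl}   # override for dry runs or testing

  # CLI output goes to stderr so stdout carries only the outcome.
  if "$ctl" reshard recover --operation-id "$op" >&2; then
    echo "recovered"
    return 0
  fi
  # Forced completion skips the remaining delta pass and anti-entropy,
  # so repair any divergence explicitly afterwards.
  "$ctl" reshard complete --operation-id "$op" --force >&2 || return 1
  "$ctl" anti-entropy run --index-uid "$index" >&2 || return 1
  echo "force-completed"
}
```

Because the forced path can leave replicas divergent, it is worth checking the migration status by hand before running this unattended.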

Prevention

- Enable task store persistence (Redis mode for HA)
- Set coordinator leader election timeout appropriately
- Monitor coordinator pod health via liveness probes

High memory usage on Redis

Symptom

Redis memory usage grows continuously, potentially triggering OOM kills.

Cause

The most common causes are:

- Idempotency cache entries not expiring
- Task registry not pruning terminal tasks
- Session entries not being cleaned up

Fix

Check Redis memory breakdown:

redis-cli INFO memory | grep used_memory_human
redis-cli --bigkeys --pattern "miroir:*"

Check largest key categories:

redis-cli --scan --pattern "miroir:tasks:*" | wc -l  # task count
redis-cli --scan --pattern "miroir:idemp:*" | wc -l   # idempotency entries
redis-cli --scan --pattern "miroir:session:*" | wc -l # sessions
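The three scans above can be folded into a single pass that groups keys by their second name segment. The sketch below assumes keys follow the `miroir:<category>:<id>` layout implied by the patterns in this section. `count_by_category` is a hypothetical helper that only processes key names read from stdin.

```shell
# count_by_category
# Reads key names from stdin and prints "<count> <prefix>" per category,
# largest first.
count_by_category() {
  awk -F: '{ counts[$1 ":" $2]++ }
           END { for (k in counts) print counts[k], k }' | sort -rn
}
```

A possible invocation is `redis-cli --scan --pattern "miroir:*" | count_by_category`, which walks the keyspace once instead of three times.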

Manually trigger cleanup if pruner is stuck:

# Prune old terminal tasks
miroir-ctl task prune --older-than 24h

# Delete ALL idempotency entries, not only expired ones; retried requests
# lose deduplication until the cache repopulates
redis-cli --scan --pattern "miroir:idemp:*" | xargs redis-cli DEL

Adjust pruner intervals if needed:

# config.toml
[task_store.prune]
interval_seconds = 300  # run every 5 minutes
task_retention_days = 7

Prevention

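- Monitor Redis memory usage via redis_used_memory metric
- Set maxmemory and maxmemory-policy allkeys-lru on Redis; note that allkeys-lru may evict any key, including task and CDC cursor state, so consider volatile-lru (which evicts only keys with a TTL) when Redis also holds durable state
- Ensure pruner is running (check logs for "Pruning terminal tasks" messages)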

Index creation fails with "hash routing error"

Symptom

$ curl -X POST https://miroir/indexes -d '{"uid": "test", "primaryKey": "id"}'
HTTP 500 {"code": "hash_routing_error", "message": "unable to determine shard assignment"}

Cause

This typically happens when:

- The topology view is inconsistent across nodes
- The shard count is 0 or not configured
- The primary key field is missing from schema validation

Fix

Check topology consistency:

curl -s https://miroir/_miroir/topology | jq '{shards, replication_factor, node_count: (.nodes | length)}'

Verify all nodes agree on shard count:

for pod in miroir-0 miroir-1 miroir-2; do
  echo "$pod:"
  kubectl exec "$pod" -- curl -s localhost:7700/_miroir/topology | jq '.shards'
done
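Comparing the per-pod values by eye gets error-prone as the cluster grows. The sketch below checks agreement mechanically. `all_agree` is a hypothetical helper that reads one value per line from stdin and succeeds only when every line is identical.

```shell
# all_agree
# Succeeds when stdin contains at least one line and all lines are equal.
all_agree() {
  distinct=$(sort -u | wc -l | tr -d ' ')
  [ "$distinct" -eq 1 ]
}
```

To use it with the loop above, drop the `echo "$pod:"` line so that only the shard counts reach the pipe, then append `| all_agree || echo "topology mismatch"` after `done`.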

If nodes disagree, restart the coordinator to force topology reconciliation:
```
kubectl delete pod -l app=miroir,role=coordinator
```

Prevention

- Use leader election to ensure single coordinator writer
- Monitor topology change log for conflicts

Alias flip returns "wrong kind"

Symptom

$ miroir-ctl alias flip prod-logs logs-v2
Error: Alias 'prod-logs' is a multi-target alias, cannot flip

Cause

You're trying to flip an alias that was created as a "multi" alias (for cross-index search) rather than a "single" alias (for atomic index swap).

Fix

Check the alias type:
```
miroir-ctl alias get prod-logs
```

If you need a swappable pointer, delete and recreate as a single alias:

miroir-ctl alias delete prod-logs
miroir-ctl alias create prod-logs --kind single --current-uid logs-v1

For cross-index search, use a separate multi alias:

miroir-ctl alias create search-all --kind multi --target-uids logs-v1,metrics-v1

Prevention

- Use descriptive alias names to distinguish single vs multi
- Document alias conventions in your runbooks

Search timeout during shard migration

Symptom

Search queries time out or return 503 errors during active shard migrations, especially for large indexes.

Cause

During migration, some queries may be routed to nodes that are still warming up migrated shards, or to nodes under heavy load from migration work.

Fix

Check if migration is active:

miroir-ctl reshard list --status in_progress

Temporarily increase query timeout:

curl -X POST https://miroir/indexes/myindex/search \
  -H "Content-Type: application/json" \
  -H "Query-Timeout: 30" \
  -d '{"q": "test"}'

If timeouts persist, pause the migration:

miroir-ctl reshard pause --operation-id <operation_id>

Resume during off-peak hours:

miroir-ctl reshard resume --operation-id <operation_id>

Prevention

- Schedule large migrations during low-traffic periods
- Use --throttle flag on reshard to limit CPU usage
- Monitor search latency during migrations

CDC cursor out of sync

Symptom

CDC events arrive with stale or duplicate sequence numbers, or events are missing entirely.

Cause

The CDC cursor stored in Redis is out of sync with the actual event stream. This can happen if:

- The sink was down during a period of high write activity
- A cursor update failed silently
- The sink was reset without clearing the cursor

Fix

Check current cursor position:

miroir-ctl cdc cursor --sink-name elasticsearch --index-uid myindex

Compare to the Meilisearch event stream:

# On a Meilisearch node
curl -s http://localhost:7700/indexes/myindex/cdc/events | jq '.events | length'

If the cursor is behind, reset it to force a re-sync from a checkpoint:

miroir-ctl cdc reset-cursor --sink-name elasticsearch --index-uid myindex --confirm

For large gaps, consider a full re-index:

miroir-ctl dump export --index-uid myindex --output /data/myindex.dump
miroir-ctl dump import --sink-name elasticsearch --input /data/myindex.dump
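The choice between resetting the cursor and running a full re-index depends on the size of the gap. The sketch below encodes that decision. It assumes cursor and stream positions are monotonically increasing integers. The threshold is a value you pick for your workload, and `cdc_action` is a hypothetical helper, not a Miroir command.

```shell
# cdc_action <cursor_position> <stream_head> <reindex_threshold>
# Prints "in-sync", "reset-cursor", or "reindex".
cdc_action() {
  lag=$(( $2 - $1 ))
  if [ "$lag" -eq 0 ]; then
    echo "in-sync"
  elif [ "$lag" -lt 0 ] || [ "$lag" -le "$3" ]; then
    # A cursor ahead of the stream (negative lag) is invalid, and a small
    # gap is cheap to replay: both are handled by resetting the cursor.
    echo "reset-cursor"
  else
    echo "reindex"
  fi
}
```

For instance, with a threshold of 1000 events, a cursor at 90 against a head of 100 yields `reset-cursor`, while a cursor at 0 against a head of 5000 yields `reindex`.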

Prevention

- Monitor CDC lag via miroir_cdc_lag_seconds metric
- Set up alerts for cursor stall detection
- Use idempotent sinks to handle duplicate events gracefully

Getting Help

If you don't see your issue listed here:

1. Run the diagnostic playbook and gather the output
2. Search existing GitHub issues
3. Open a new issue with:
- Miroir version
- Diagnostic output
- Relevant logs (sanitized)
- Steps to reproduce
