docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5)
Implements P11.5 acceptance criteria:
- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:
1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:
- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5
parent
b7f3b816ba
commit
c5238b1bcd
5 changed files with 1030 additions and 0 deletions
|
|
@ -87,6 +87,7 @@ See [`docs/versioning-policy.md`](docs/versioning-policy.md) for the full versio
|
|||
- [Helm Chart](charts/miroir/) — Production deployment on Kubernetes
|
||||
- [Deployment Guides](docs/onboarding/) — Production setup, sizing, and operational considerations
|
||||
- [Migration Runbook](docs/migration_runbook.md) — Paths from single-node Meilisearch to Miroir
|
||||
- [Troubleshooting Guide](docs/troubleshooting.md) — Common issues and diagnostic playbook
|
||||
|
||||
## Quick Start
|
||||
@ -365,6 +365,8 @@ miroir-ctl anti-entropy run --index products --shards 0-63
|
|||
|
||||
### Troubleshooting Guide
|
||||
|
||||
> **For comprehensive troubleshooting**: See the [Troubleshooting Guide](../troubleshooting.md) for common issues and the [Diagnostic Playbook](../troubleshooting/diagnostics.md) for systematic diagnosis.
|
||||
|
||||
#### High Write Latency
|
||||
|
||||
**Symptoms**: Write latency increased by > 2× during dual-write
|
||||
308
docs/migrations/from-meilisearch-dump.md
Normal file
|
|
@ -0,0 +1,308 @@
|
|||
# Migrating from Meilisearch: Dump and Reload
|
||||
|
||||
**Use this option if:** Your existing Meilisearch index is **under 10 GB** and you can tolerate brief downtime during the export/import.
|
||||
|
||||
**Migration time:** 1-2 hours for 10 GB (network and disk dependent)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
1. Export a dump from your existing Meilisearch instance
|
||||
2. Deploy Miroir
|
||||
3. Import the dump via Miroir's streaming router (default) — documents are routed to their owning shards during import
|
||||
4. Fall back to broadcast mode only if Miroir cannot reconstruct your dump variant
|
||||
|
||||
---
|
||||
|
||||
## Preconditions
|
||||
|
||||
- [ ] Existing Meilisearch instance is accessible and healthy
|
||||
- [ ] Target Miroir cluster is deployed with sufficient capacity (existing corpus size + 20% buffer)
|
||||
- [ ] Dump version is compatible with Miroir's Meilisearch version (check `GET /version` on both)
|
||||
- [ ] Network connectivity between old instance and Miroir cluster
|
||||
- [ ] Admin API key for Miroir
|
||||
|
||||
**Capacity check:**
|
||||
|
||||
```bash
|
||||
# Check existing index size
|
||||
curl https://old-meili.example.com/indexes \
|
||||
-H "Authorization: Bearer <master-key>"
|
||||
|
||||
# Estimate required storage (corpus + 20% buffer)
|
||||
# If old corpus is 8 GB, provision at least 10 GB per Miroir node
|
||||
```
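The buffer arithmetic in the comment above can be checked with a few lines of Python. This is only a sketch: `required_storage_gb` is a hypothetical helper, and rounding up to a whole gigabyte is an assumption made to match the 8 GB to 10 GB example.

```python
import math


def required_storage_gb(corpus_gb: float, buffer_pct: int = 20) -> int:
    """Return the corpus size plus a percentage buffer, rounded up to whole GB."""
    if corpus_gb < 0:
        raise ValueError("corpus_gb must be non-negative")
    # Work in percent to keep the arithmetic exact for whole-number inputs.
    return math.ceil(corpus_gb * (100 + buffer_pct) / 100)


print(required_storage_gb(8))  # prints 10
```

Under the same rule, a 10 GB corpus would call for 12 GB.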
|
||||
|
||||
---
|
||||
|
||||
## Step-by-Step
|
||||
|
||||
### Step 1: Export dump from existing Meilisearch
|
||||
|
||||
```bash
|
||||
# Trigger dump creation
|
||||
curl -X POST https://old-meili.example.com/dumps \
|
||||
-H "Authorization: Bearer <master-key>"
|
||||
|
||||
# Response: {"uid":"20240524-123456","status":"enqueued","taskUid":42}
|
||||
|
||||
# Poll for completion
|
||||
curl https://old-meili.example.com/tasks/42 \
|
||||
-H "Authorization: Bearer <master-key>"
|
||||
|
||||
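# When status is "succeeded", read the dump identifier from the task's details.dumpUid field

# Retrieve the dump file. Meilisearch writes dumps to its dump directory on the server
# (--dump-dir, default: dumps/) and does not serve them over HTTP, so copy the file
# from the host. The host and path below are placeholders:
scp old-meili.example.com:/path/to/dumps/20240524-123456.dump meilisearch-export.dump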
|
||||
```
|
||||
|
||||
**Expected time:** ~5-10 minutes per GB
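Polling can be scripted rather than repeated by hand. The sketch below is one possible loop, not a Meilisearch or Miroir API: `wait_for_task` and its `fetch_status` callback are hypothetical names, and the callback is assumed to wrap a `GET /tasks/<taskUid>` request and return the task's `status` string. The terminal statuses are the ones Meilisearch uses for tasks.

```python
import time
from typing import Callable

# Statuses after which a Meilisearch task no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_task(fetch_status: Callable[[], str], interval_s: float = 5.0,
                  timeout_s: float = 3600.0, sleep=time.sleep) -> str:
    """Call fetch_status until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"task still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in normal use the default `time.sleep` applies.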
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Deploy Miroir
|
||||
|
||||
If Miroir is not yet deployed:
|
||||
|
||||
```bash
|
||||
# Add Helm repo
|
||||
helm repo add miroir https://jedarden.github.io/miroir
|
||||
helm repo update
|
||||
|
||||
# Create namespace and secrets
|
||||
kubectl create namespace search
|
||||
kubectl -n search create secret generic miroir-secrets \
|
||||
--from-literal=masterKey="<strong-key>" \
|
||||
--from-literal=nodeMasterKey="<node-key>" \
|
||||
--from-literal=adminApiKey="<admin-key>"
|
||||
kubectl -n search create secret generic meilisearch-secrets \
|
||||
--from-literal=masterKey="<node-key>"
|
||||
|
||||
# Install (adjust replica count based on corpus size)
|
||||
helm install search miroir/miroir \
|
||||
--namespace search \
|
||||
--values my-values.yaml \
|
||||
--set meilisearch.replicas=3 \
|
||||
--wait
|
||||
```
|
||||
|
||||
**Verify deployment:**
|
||||
|
||||
```bash
|
||||
kubectl get pods -n search
|
||||
# All pods should be Running
|
||||
|
||||
curl https://search.example.com/health
|
||||
# {"status":"available"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Import dump via Miroir (streaming mode)
|
||||
|
||||
**Streaming mode (default, recommended):** Documents are routed to their owning shards during import. No cross-cluster broadcast, no post-import rebalance.
|
||||
|
||||
```bash
|
||||
# Import the dump
|
||||
curl -X POST https://search.example.com/_miroir/dumps/import \
|
||||
-H "Authorization: Bearer <admin-key>" \
|
||||
-F "dump=@meilisearch-export.dump" \
|
||||
-F "indexUid=myindex"
|
||||
|
||||
# Response: {"miroir_task_id":"mtask-00123"}
|
||||
|
||||
# Monitor progress
|
||||
curl https://search.example.com/_miroir/dumps/import/mtask-00123/status \
|
||||
-H "Authorization: Bearer <admin-key>"
|
||||
|
||||
# Or use miroir-ctl
|
||||
miroir-ctl task status mtask-00123
|
||||
```
|
||||
|
||||
**Progression:**
|
||||
|
||||
| Phase | Description |
|
||||
|-------|-------------|
|
||||
| `Parsing` | Reading dump metadata and settings |
|
||||
| `SettingsBroadcast` | Applying index settings via two-phase broadcast |
|
||||
| `StreamingDocuments` | Routing documents to owning shards |
|
||||
| `Complete` | Import finished successfully |
|
||||
|
||||
**Expected time:** ~1-2 hours for 10 GB (depends on network and cluster size)
|
||||
|
||||
---
|
||||
|
||||
### Step 4: Verification
|
||||
|
||||
```bash
|
||||
# Verify document counts match
|
||||
curl https://old-meili.example.com/indexes/myindex/stats \
|
||||
-H "Authorization: Bearer <master-key>" | jq '.numberOfDocuments'
|
||||
|
||||
curl https://search.example.com/indexes/myindex/stats \
|
||||
-H "Authorization: Bearer <miroir-key>" | jq '.numberOfDocuments'
|
||||
|
||||
# Sample query comparison
|
||||
curl -X POST https://old-meili.example.com/indexes/myindex/search \
|
||||
-H "Authorization: Bearer <search-key>" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"q": "test", "limit": 10}'
|
||||
|
||||
curl -X POST https://search.example.com/indexes/myindex/search \
|
||||
-H "Authorization: Bearer <miroir-key>" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"q": "test", "limit": 10}'
|
||||
|
||||
# Results should match (ordering may differ slightly due to distributed merge)
|
||||
```
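Because ordering may differ, comparing hits by primary key is more informative than a textual diff of the two responses. A minimal sketch, assuming each hit carries the index's primary key under `id` and that the two `hits` arrays have already been parsed from the JSON responses; `compare_hits` is a hypothetical helper.

```python
def compare_hits(old_hits: list, new_hits: list, primary_key: str = "id") -> dict:
    """Compare two hit lists by primary key, reporting set and order differences."""
    old_order = [hit[primary_key] for hit in old_hits]
    new_order = [hit[primary_key] for hit in new_hits]
    old_ids, new_ids = set(old_order), set(new_order)
    return {
        "only_in_old": sorted(old_ids - new_ids),
        "only_in_new": sorted(new_ids - old_ids),
        "same_set": old_ids == new_ids,
        "same_order": old_order == new_order,
    }
```

With `limit: 10`, a reordering near the cutoff can legitimately move a document out of one top 10 and into the other, so a small difference in `only_in_old` or `only_in_new` is not by itself proof of data loss.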
|
||||
|
||||
---
|
||||
|
||||
### Step 5: Update application configuration
|
||||
|
||||
Update your application to point to Miroir:
|
||||
|
||||
```python
|
||||
# Before
|
||||
client = meilisearch.Client('https://old-meili.example.com', 'key')
|
||||
|
||||
# After
|
||||
client = meilisearch.Client('https://search.example.com', 'miroir-key')
|
||||
```
|
||||
|
||||
```typescript
|
||||
// Before
|
||||
const client = new MeiliSearch({ host: 'https://old-meili.example.com', apiKey: 'key' })
|
||||
|
||||
// After
|
||||
const client = new MeiliSearch({ host: 'https://search.example.com', apiKey: 'miroir-key' })
|
||||
```
|
||||
|
||||
```go
|
||||
// Before
|
||||
client := meilisearch.NewClient(meilisearch.ClientConfig{
|
||||
Host: "https://old-meili.example.com",
|
||||
APIKey: "key",
|
||||
})
|
||||
|
||||
// After
|
||||
client := meilisearch.NewClient(meilisearch.ClientConfig{
|
||||
Host: "https://search.example.com",
|
||||
APIKey: "miroir-key",
|
||||
})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Fallback: Broadcast Mode
|
||||
|
||||
If Miroir cannot fully reconstruct your dump variant (e.g., custom dump format from a Meilisearch fork), fall back to broadcast mode:
|
||||
|
||||
**Warning:** Broadcast mode imports the dump to **every node**, transiently placing 100% of the corpus on each node. This requires manual rebalancing afterward.
|
||||
|
||||
```bash
|
||||
# Set broadcast mode via Helm values
|
||||
helm upgrade search miroir/miroir \
|
||||
--namespace search \
|
||||
--values my-values.yaml \
|
||||
--set miroir.dump_import.mode=broadcast
|
||||
|
||||
# Or modify ConfigMap directly
|
||||
kubectl edit configmap miroir-config -n search
|
||||
# Set: miroir.dump_import.mode: broadcast
|
||||
|
||||
# Restart proxy pods
|
||||
kubectl rollout restart deployment miroir-proxy -n search
|
||||
|
||||
# Import (now using broadcast mode)
|
||||
curl -X POST https://search.example.com/_miroir/dumps/import \
|
||||
-H "Authorization: Bearer <admin-key>" \
|
||||
-F "dump=@meilisearch-export.dump" \
|
||||
-F "indexUid=myindex"
|
||||
|
||||
# After import completes, rebalance to delete non-owning copies
|
||||
miroir-ctl rebalance start --index myindex
|
||||
miroir-ctl rebalance status --watch
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rollback
|
||||
|
||||
If verification fails or you need to roll back:
|
||||
|
||||
```bash
|
||||
# Point application back to old instance
|
||||
# (revert SDK configuration changes)
|
||||
|
||||
# Delete imported index from Miroir
|
||||
curl -X DELETE https://search.example.com/indexes/myindex \
|
||||
-H "Authorization: Bearer <admin-key>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Import stuck at `SettingsBroadcast`
|
||||
|
||||
**Cause:** Two-phase settings broadcast waiting for all nodes to acknowledge.
|
||||
|
||||
**Solution:**
|
||||
|
||||
```bash
|
||||
# Check node health
|
||||
miroir-ctl status
|
||||
|
||||
# Verify all nodes are healthy
|
||||
kubectl get pods -n search -l app=meilisearch
|
||||
|
||||
# If a node is degraded, fix it first
|
||||
kubectl describe pod <pod-name> -n search
|
||||
```
|
||||
|
||||
### Import fails with "incompatible dump format"
|
||||
|
||||
**Cause:** The dump was created by a Meilisearch version whose dump format Miroir's nodes do not support.
|
||||
|
||||
**Solution:** Check Meilisearch versions match:
|
||||
|
||||
```bash
|
||||
# Old instance
|
||||
curl https://old-meili.example.com/version
|
||||
|
||||
# Miroir nodes
|
||||
kubectl exec -n search <pod-name> -- curl http://localhost:7700/version
|
||||
```
|
||||
|
||||
If versions differ significantly, either:
|
||||
1. Upgrade old instance to match Miroir's version before exporting dump
|
||||
2. Use **re-index** migration instead (see `from-meilisearch-reindex.md`)
|
||||
|
||||
### Document counts don't match after import
|
||||
|
||||
**Cause:** Streaming router may have failed to route some documents.
|
||||
|
||||
**Solution:**
|
||||
|
||||
```bash
|
||||
# Check import task for errors
|
||||
miroir-ctl task status mtask-00123
|
||||
|
||||
# Re-run import if errors found
|
||||
# (Idempotent — duplicate documents are ignored)
|
||||
|
||||
# Or run anti-entropy to detect and repair divergences
|
||||
miroir-ctl anti-entropy run --index myindex
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- [Plan §13.9 — Streaming routed dump import](../plan/plan.md#139-streaming-routed-dump-import)
|
||||
- [Re-index migration](from-meilisearch-reindex.md) — for large corpora
|
||||
- [Live cutover migration](from-meilisearch-live-cutover.md) — for zero-downtime
|
||||
- [Troubleshooting Guide](../troubleshooting.md) — common issues and solutions
|
||||
404
docs/troubleshooting.md
Normal file
|
|
@ -0,0 +1,404 @@
|
|||
# Common Issues & Troubleshooting
|
||||
|
||||
This guide covers the most common issues encountered when running Miroir in production, along with their symptoms, causes, and fixes.
|
||||
|
||||
## Quick Diagnostics
|
||||
|
||||
Before diving into specific issues, run the [diagnostic playbook](troubleshooting/diagnostics.md) to gather baseline information about your cluster's health.
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Error: "primary key required"
|
||||
|
||||
#### Symptom
|
||||
Client sees:
|
||||
```json
|
||||
HTTP 400 {
|
||||
"code": "miroir_primary_key_required",
|
||||
"message": "Miroir requires an explicit primary key at index creation"
|
||||
}
|
||||
```
|
||||
|
||||
#### Cause
|
||||
The index was created without a `primaryKey` field. Miroir cannot route documents without knowing the primary key in advance.
|
||||
|
||||
#### Fix
|
||||
```bash
|
||||
curl -X POST https://miroir/indexes \
|
||||
-H "Authorization: Bearer $KEY" \
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"uid": "myindex",
|
||||
"primaryKey": "id"
|
||||
}'
|
||||
```
|
||||
|
||||
#### Why this differs from Meilisearch
|
||||
Meilisearch can infer the primary key from the first document batch. Miroir cannot — it needs to hash the PK *before* any node sees it to determine which shard owns the document. Explicit `primaryKey` at index creation is required.
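The routing constraint can be sketched as follows. This is a minimal illustration, not Miroir's actual implementation: the hash function (SHA-256 here), the modulo mapping, and the names `shard_for` and `route` are assumptions chosen only to show why the primary key field must be known before a document reaches any node.

```python
import hashlib


def shard_for(primary_key_value: str, shard_count: int) -> int:
    """Map a primary key value to a shard index (illustrative scheme only)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(primary_key_value.encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % shard_count


def route(document: dict, primary_key: str, shard_count: int) -> int:
    """Pick the owning shard for a document; fails if the key field is absent."""
    if primary_key not in document:
        raise KeyError(f"document is missing primary key field {primary_key!r}")
    return shard_for(str(document[primary_key]), shard_count)
```

Without a declared `primaryKey`, `route` has no field to read, which is the situation the error message reports.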
|
||||
|
||||
---
|
||||
|
||||
### Search returns fewer results than expected
|
||||
|
||||
#### Symptom
|
||||
Search queries return fewer results than the known document count, especially after node failures or during migrations.
|
||||
|
||||
#### Cause
|
||||
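A replica holding a shard is degraded or unreachable. Miroir's cross-reference mechanism skips degraded replicas to avoid returning incomplete or stale results. With RF > 1 the remaining replicas normally cover the shard, so result counts drop mainly when a shard has no healthy replica left (for example RF = 1, or every replica of the shard is degraded).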
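One way to narrow this down is to list the shards that no `active` node currently holds, on the assumption that hits stored only on skipped replicas are the ones missing from results. The sketch below assumes the topology response has the shape shown in the diagnostic playbook (`shards`, `nodes[].status`, `nodes[].shards`) and that each node's `shards` field is a list of integer shard ids; `unserved_shards` is a hypothetical helper, not part of Miroir.

```python
def unserved_shards(topology: dict) -> list:
    """Return shard ids that no node with status "active" currently holds."""
    served = set()
    for node in topology["nodes"]:
        if node["status"] == "active":
            served.update(node["shards"])
    return [shard for shard in range(topology["shards"]) if shard not in served]
```

An empty result suggests every shard still has at least one active replica, in which case the cause is more likely elsewhere.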
|
||||
|
||||
#### Fix
|
||||
1. Check topology for degraded nodes:
|
||||
```bash
|
||||
curl -s https://miroir/_miroir/topology | jq '.nodes[] | select(.status != "active")'
|
||||
```
|
||||
|
||||
2. Check for degraded shards:
|
||||
```bash
|
||||
curl -s https://miroir/_miroir/metrics | jq '.degraded_shards'
|
||||
```
|
||||
|
||||
3. If a node is degraded, check its logs:
|
||||
```bash
|
||||
kubectl logs miroir-0 --tail=100 | jq 'select(.level=="ERROR")'
|
||||
```
|
||||
|
||||
4. Restart the degraded pod if it's stuck:
|
||||
```bash
|
||||
kubectl delete pod miroir-0
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Set up canaries to proactively detect search degradation
|
||||
- Monitor the `miroir_degraded_shards` metric
|
||||
- Ensure proper resource limits to prevent OOM kills
|
||||
|
||||
---
|
||||
|
||||
### Task polling stuck at "processing"
|
||||
|
||||
#### Symptom
|
||||
`miroir-ctl task status` shows a task stuck in "processing" state indefinitely, even though the operation appears complete.
|
||||
|
||||
#### Cause
|
||||
The task coordinator lost track of per-node task status. This can happen when:
|
||||
- A node crashes during task execution
|
||||
- Network partition prevents status updates
|
||||
- Task registry checkpoint is delayed
|
||||
|
||||
#### Fix
|
||||
1. Check per-node task status:
|
||||
```bash
|
||||
miroir-ctl task status --task-id <miroir_task_id> --verbose
|
||||
```
|
||||
|
||||
2. Identify which node(s) have incomplete status:
|
||||
```bash
|
||||
kubectl logs miroir-0 --tail=100 | grep "<miroir_task_id>"
|
||||
kubectl logs miroir-1 --tail=100 | grep "<miroir_task_id>"
|
||||
```
|
||||
|
||||
3. If all nodes have completed but the task is stuck, force-complete the task:
|
||||
```bash
|
||||
miroir-ctl task complete --task-id <miroir_task_id>
|
||||
```
|
||||
|
||||
4. If a node crashed and cannot recover, mark its tasks as failed:
|
||||
```bash
|
||||
miroir-ctl task fail --task-id <miroir_task_id> --node <node_id> --reason "node crashed"
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Enable task registry checkpointing (default: every 100 tasks)
|
||||
- Monitor task queue depth via `miroir_task_queue_depth` metric
|
||||
- Set task timeouts appropriate to your workload
|
||||
|
||||
---
|
||||
|
||||
### Node drain blocked: "insufficient replicas"
|
||||
|
||||
#### Symptom
|
||||
```bash
|
||||
$ miroir-ctl node drain node-1
|
||||
Error: Cannot drain node-1: removing it would drop replication factor below minimum
|
||||
```
|
||||
|
||||
#### Cause
|
||||
Draining a node would leave some shards with fewer replicas than the minimum RF. This is a safety check to prevent data loss.
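The safety check can be pictured as counting replicas per shard with the node removed. The following is an illustrative sketch rather than Miroir's actual logic: it assumes the topology shape from the diagnostic playbook, integer shard ids in each node's `shards` list, and a hypothetical `min_rf` parameter standing in for the configured minimum.

```python
def shards_blocking_drain(topology: dict, node_id: str, min_rf: int) -> list:
    """Return shards that would fall below min_rf replicas if node_id were removed."""
    replica_count = {}
    for node in topology["nodes"]:
        if node["node_id"] == node_id:
            continue  # count replicas as if this node were already gone
        for shard in node["shards"]:
            replica_count[shard] = replica_count.get(shard, 0) + 1
    return [s for s in range(topology["shards"]) if replica_count.get(s, 0) < min_rf]
```

A non-empty result corresponds to the situation in which the drain is refused; adding a node gives those shards somewhere to place another replica first.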
|
||||
|
||||
#### Fix
|
||||
1. Check current RF configuration:
|
||||
```bash
|
||||
curl -s https://miroir/_miroir/topology | jq '.replication_factor'
|
||||
```
|
||||
|
||||
2. Add a new node first:
|
||||
```bash
|
||||
kubectl scale statefulset miroir --replicas=4
|
||||
# Wait for node-3 to be ready
|
||||
kubectl wait --for=condition=ready pod/miroir-3
|
||||
```
|
||||
|
||||
3. Then retry the drain:
|
||||
```bash
|
||||
miroir-ctl node drain node-1
|
||||
```
|
||||
|
||||
#### Alternative: Force drain (dangerous)
|
||||
If you must drain without sufficient replicas, use `--force`:
|
||||
```bash
|
||||
miroir-ctl node drain node-1 --force
|
||||
```
|
||||
This will reduce RF for affected shards during migration. Only use this if:
|
||||
- You can tolerate reduced redundancy temporarily
|
||||
- Anti-entropy is enabled to repair divergence later
|
||||
|
||||
---
|
||||
|
||||
### Migration stuck after coordinator crash
|
||||
|
||||
#### Symptom
|
||||
A shard migration (reshard, rebalance, node drain) was in progress when the coordinator pod crashed. After restart, the migration is stuck and can neither complete nor roll back.
|
||||
|
||||
#### Cause
|
||||
The coordinator stores migration state in the task store. If it crashes during state transitions, the migration may be left in an inconsistent state.
|
||||
|
||||
#### Fix
|
||||
1. Check migration status:
|
||||
```bash
|
||||
miroir-ctl reshard status --operation-id <operation_id>
|
||||
```
|
||||
|
||||
2. If stuck in "in_progress" with no activity, recover the migration:
|
||||
```bash
|
||||
miroir-ctl reshard recover --operation-id <operation_id>
|
||||
```
|
||||
|
||||
3. If recovery fails, you may need to force-complete:
|
||||
```bash
|
||||
# This skips remaining delta pass and anti-entropy
|
||||
miroir-ctl reshard complete --operation-id <operation_id> --force
|
||||
```
|
||||
|
||||
4. Run anti-entropy manually to repair any divergence:
|
||||
```bash
|
||||
miroir-ctl anti-entropy run --index <affected_index>
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Enable task store persistence (Redis mode for HA)
|
||||
- Set coordinator leader election timeout appropriately
|
||||
- Monitor coordinator pod health via liveness probes
|
||||
|
||||
---
|
||||
|
||||
### High memory usage on Redis
|
||||
|
||||
#### Symptom
|
||||
Redis memory usage grows continuously, potentially triggering OOM kills.
|
||||
|
||||
#### Cause
|
||||
The most common causes are:
|
||||
1. Idempotency cache entries not expiring
|
||||
2. Task registry not pruning terminal tasks
|
||||
3. Session entries not being cleaned up
|
||||
|
||||
#### Fix
|
||||
1. Check Redis memory breakdown:
|
||||
```bash
|
||||
redis-cli INFO memory | grep used_memory_human
|
||||
redis-cli --bigkeys --pattern "miroir:*"
|
||||
```
|
||||
|
||||
2. Check largest key categories:
|
||||
```bash
|
||||
redis-cli --scan --pattern "miroir:tasks:*" | wc -l # task count
|
||||
redis-cli --scan --pattern "miroir:idemp:*" | wc -l # idempotency entries
|
||||
redis-cli --scan --pattern "miroir:session:*" | wc -l # sessions
|
||||
```
|
||||
|
||||
3. Manually trigger cleanup if pruner is stuck:
|
||||
```bash
|
||||
# Prune old terminal tasks
|
||||
miroir-ctl task prune --older-than 24h
|
||||
|
||||
# Clear idempotency entries (this deletes ALL matching keys, not only expired ones)
|
||||
redis-cli --scan --pattern "miroir:idemp:*" | xargs redis-cli DEL
|
||||
```
|
||||
|
||||
4. Adjust pruner intervals if needed:
|
||||
```toml
|
||||
# config.toml
|
||||
[task_store.prune]
|
||||
interval_seconds = 300 # run every 5 minutes
|
||||
task_retention_days = 7
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Monitor Redis memory usage via `redis_used_memory` metric
|
||||
- Set `maxmemory` and `maxmemory-policy allkeys-lru` on Redis
|
||||
- Ensure pruner is running (check logs for "Pruning terminal tasks" messages)
|
||||
|
||||
---
|
||||
|
||||
### Index creation fails with "hash routing error"
|
||||
|
||||
#### Symptom
|
||||
```bash
|
||||
$ curl -X POST https://miroir/indexes -d '{"uid": "test", "primaryKey": "id"}'
|
||||
HTTP 500 {"code": "hash_routing_error", "message": "unable to determine shard assignment"}
|
||||
```
|
||||
|
||||
#### Cause
|
||||
This typically happens when:
|
||||
1. The topology view is inconsistent across nodes
|
||||
2. The shard count is 0 or not configured
|
||||
3. The primary key field is missing from schema validation
|
||||
|
||||
#### Fix
|
||||
1. Check topology consistency:
|
||||
```bash
|
||||
curl -s https://miroir/_miroir/topology | jq '.shards, .replication_factor, (.nodes | length)'
|
||||
```
|
||||
|
||||
2. Verify all nodes agree on shard count:
|
||||
```bash
|
||||
for pod in miroir-0 miroir-1 miroir-2; do
|
||||
echo "$pod:"
|
||||
kubectl exec $pod -- curl -s localhost:7700/_miroir/topology | jq '.shards'
|
||||
done
|
||||
```
|
||||
|
||||
3. If nodes disagree, restart the coordinator to force topology reconciliation:
|
||||
```bash
|
||||
kubectl delete pod -l app=miroir,role=coordinator
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Use leader election to ensure single coordinator writer
|
||||
- Monitor topology change log for conflicts
|
||||
|
||||
---
|
||||
|
||||
### Alias flip returns "wrong kind"
|
||||
|
||||
#### Symptom
|
||||
```bash
|
||||
$ miroir-ctl alias flip prod-logs logs-v2
|
||||
Error: Alias 'prod-logs' is a multi-target alias, cannot flip
|
||||
```
|
||||
|
||||
#### Cause
|
||||
You're trying to flip an alias that was created as a "multi" alias (for cross-index search) rather than a "single" alias (for atomic index swap).
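The distinction between the two alias kinds can be modeled in a few lines. This is a hypothetical data model for illustration only; the class, field names, and error text are assumptions and do not describe Miroir's internals.

```python
from dataclasses import dataclass


@dataclass
class Alias:
    name: str
    kind: str       # "single" or "multi"
    targets: list   # one index uid for "single", several for "multi"

    def flip(self, new_uid: str) -> None:
        """Repoint a single alias; a multi alias has no one target to swap."""
        if self.kind != "single":
            raise ValueError(f"alias {self.name!r} is a {self.kind} alias, cannot flip")
        self.targets = [new_uid]
```

A flip replaces the one target of a single alias, which is why the operation has no obvious meaning for an alias that fans out to several indexes.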
|
||||
|
||||
#### Fix
|
||||
1. Check the alias type:
|
||||
```bash
|
||||
miroir-ctl alias get prod-logs
|
||||
```
|
||||
|
||||
2. If you need a swappable pointer, delete and recreate as a single alias:
|
||||
```bash
|
||||
miroir-ctl alias delete prod-logs
|
||||
miroir-ctl alias create prod-logs --kind single --current-uid logs-v1
|
||||
```
|
||||
|
||||
3. For cross-index search, use a separate multi alias:
|
||||
```bash
|
||||
miroir-ctl alias create search-all --kind multi --target-uids logs-v1,metrics-v1
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Use descriptive alias names to distinguish single vs multi
|
||||
- Document alias conventions in your runbooks
|
||||
|
||||
---
|
||||
|
||||
### Search timeout during shard migration
|
||||
|
||||
#### Symptom
|
||||
Search queries time out or return 503 errors during active shard migrations, especially for large indexes.
|
||||
|
||||
#### Cause
|
||||
During migration, some queries may be routed to nodes that are still warming up migrated shards, or to nodes under heavy load from migration work.
|
||||
|
||||
#### Fix
|
||||
1. Check if migration is active:
|
||||
```bash
|
||||
miroir-ctl reshard list --status in_progress
|
||||
```
|
||||
|
||||
2. Temporarily increase query timeout:
|
||||
```bash
|
||||
curl -X POST https://miroir/indexes/myindex/search \
|
||||
-H "Content-Type: application/json" \
-H "Query-Timeout: 30" \
|
||||
-d '{"q": "test"}'
|
||||
```
|
||||
|
||||
3. If timeouts persist, pause the migration:
|
||||
```bash
|
||||
miroir-ctl reshard pause --operation-id <operation_id>
|
||||
```
|
||||
|
||||
4. Resume during off-peak hours:
|
||||
```bash
|
||||
miroir-ctl reshard resume --operation-id <operation_id>
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Schedule large migrations during low-traffic periods
|
||||
- Use `--throttle` flag on reshard to limit CPU usage
|
||||
- Monitor search latency during migrations
|
||||
|
||||
---
|
||||
|
||||
### CDC cursor out of sync
|
||||
|
||||
#### Symptom
|
||||
CDC events arrive with stale or duplicate sequence numbers, or events are missing entirely.
|
||||
|
||||
#### Cause
|
||||
The CDC cursor stored in Redis is out of sync with the actual event stream. This can happen if:
|
||||
- The sink was down during a period of high write activity
|
||||
- A cursor update failed silently
|
||||
- The sink was reset without clearing the cursor
|
||||
|
||||
#### Fix
|
||||
1. Check current cursor position:
|
||||
```bash
|
||||
miroir-ctl cdc cursor --sink-name elasticsearch --index-uid myindex
|
||||
```
|
||||
|
||||
2. Compare to Meilisearch event stream:
|
||||
```bash
|
||||
# On a Meilisearch node
|
||||
curl -s http://localhost:7700/indexes/myindex/cdc/events | jq '.events | length'
|
||||
```
|
||||
|
||||
3. If cursor is behind, reset it to force re-sync from a checkpoint:
|
||||
```bash
|
||||
miroir-ctl cdc reset-cursor --sink-name elasticsearch --index-uid myindex --confirm
|
||||
```
|
||||
|
||||
4. For large gaps, consider a full re-index:
|
||||
```bash
|
||||
miroir-ctl dump export --index-uid myindex --output /data/myindex.dump
|
||||
miroir-ctl dump import --sink-name elasticsearch --input /data/myindex.dump
|
||||
```
|
||||
|
||||
#### Prevention
|
||||
- Monitor CDC lag via `miroir_cdc_lag_seconds` metric
|
||||
- Set up alerts for cursor stall detection
|
||||
- Use idempotent sinks to handle duplicate events gracefully
|
||||
|
||||
---
|
||||
|
||||
## Getting Help
|
||||
|
||||
If you don't see your issue listed here:
|
||||
|
||||
1. Run the [diagnostic playbook](troubleshooting/diagnostics.md) and gather the output
|
||||
2. Search [existing GitHub issues](https://github.com/jedarden/miroir/issues)
|
||||
3. Open a new issue with:
|
||||
- Miroir version
|
||||
- Diagnostic output
|
||||
- Relevant logs (sanitized)
|
||||
- Steps to reproduce
|
||||
315
docs/troubleshooting/diagnostics.md
Normal file
|
|
@ -0,0 +1,315 @@
|
|||
# Diagnostic Playbook
|
||||
|
||||
This playbook provides a systematic approach to diagnosing issues in a Miroir cluster. Run these steps in order when investigating any problem.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Set up your environment:
|
||||
```bash
|
||||
export MIROIR_URL="https://miroir.example.com"
|
||||
export MIROIR_KEY="your-admin-key"
|
||||
export NAMESPACE="search" # adjust if needed
|
||||
```
|
||||
|
||||
## Step 1: Check Cluster Health
|
||||
|
||||
### 1.1 Verify all pods are running
|
||||
```bash
|
||||
kubectl get pods -n $NAMESPACE
|
||||
```
|
||||
|
||||
**Expected output**: All pods in `Running` state, Ready 1/1 or 2/2.
|
||||
|
||||
**Common issues**:
|
||||
- Pods in `Pending` → resource constraints, scheduler issues
|
||||
- Pods in `CrashLoopBackOff` → config errors, OOM kills
|
||||
- Pods with `Ready: 0/1` → startup probe failing, dependency unavailable
|
||||
|
||||
### 1.2 Check recent pod restarts
|
||||
```bash
|
||||
kubectl get pods -n $NAMESPACE -o json | jq -r '.items[] | "\(.metadata.name): \([.status.containerStatuses[]?.restartCount] | add // 0) restarts"'
|
||||
```
|
||||
|
||||
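**Action**: Investigate pods with more than 3 restarts. `restartCount` is cumulative over the pod's lifetime, so confirm that the restarts are recent (for example with `kubectl describe pod`) before treating them as an active problem.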
|
||||
|
||||
### 1.3 Check resource usage
|
||||
```bash
|
||||
kubectl top pods -n $NAMESPACE
|
||||
kubectl top nodes
|
||||
```
|
||||
|
||||
**Action**: If CPU/memory limits are hit, consider scaling up or adjusting limits.
|
||||
|
||||
## Step 2: Check Miroir Topology
|
||||
|
||||
### 2.1 Get topology overview
|
||||
```bash
|
||||
curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '.'
|
||||
```
|
||||
|
||||
**Expected output**:
|
||||
```json
|
||||
{
|
||||
"shards": 128,
|
||||
"replication_factor": 2,
|
||||
"nodes": [
|
||||
{"node_id": "node-0", "status": "active", "shards": [...]},
|
||||
{"node_id": "node-1", "status": "active", "shards": [...]},
|
||||
{"node_id": "node-2", "status": "active", "shards": [...]}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Common issues**:
|
||||
- `status: "degraded"` → node is unreachable or unhealthy
|
||||
- `status: "draining"` → node migration in progress
|
||||
- `shards: []` → node has no assigned shards (newly added)
|
||||
|
||||
### 2.2 Check for degraded shards
|
||||
```bash
|
||||
curl -s "$MIROIR_URL/_miroir/topology?key=$MIROIR_KEY" | jq '
  .nodes as $nodes |
  .shards as $total |
  .replication_factor as $rf |
  ($nodes | map(.shards | length) | add) as $assigned |
  "Assigned: \($assigned)/\($total * $rf) (shards × RF)",
  "Degraded nodes: \([$nodes[] | select(.status != "active")] | length)"
'
|
||||
```
|
||||
|
||||
**Action**: Any degraded nodes need investigation (see Step 4).
|
||||
|
||||
### 2.3 Verify node agreement on topology
|
||||
```bash
|
||||
for i in 0 1 2; do
|
||||
echo "=== node-$i ==="
|
||||
kubectl exec -n $NAMESPACE miroir-$i -- \
|
||||
curl -s localhost:7700/_miroir/topology | jq '.shards, .replication_factor'
|
||||
done
|
||||
```
|
||||
|
||||
**Expected**: All nodes report the same shard count and RF.
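When the per-node output is collected programmatically, disagreement can be detected by comparing each view against the most common one. A sketch under stated assumptions: `views` maps a pod name to its parsed topology response, and `disagreeing_nodes` is a hypothetical helper.

```python
from collections import Counter


def disagreeing_nodes(views: dict) -> list:
    """Return pod names whose (shards, replication_factor) differs from the majority view."""
    summaries = {
        name: (view["shards"], view["replication_factor"])
        for name, view in views.items()
    }
    if not summaries:
        return []
    majority, _count = Counter(summaries.values()).most_common(1)[0]
    return sorted(name for name, summary in summaries.items() if summary != majority)
```

The majority view is not necessarily the correct one; the result only shows which pods to look at first.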
|
||||
|
||||
**Action**: If nodes disagree, restart coordinator pod to force reconciliation.
|
||||
|
||||
## Step 3: Check Metrics
|
||||
|
||||
### 3.1 Get metrics summary
|
||||
```bash
|
||||
curl -s "$MIROIR_URL/_miroir/metrics?key=$MIROIR_KEY" | jq '
|
||||
{
|
||||
degraded_shards: .degraded_shards // 0,
|
||||
task_queue_depth: .task_queue_depth // 0,
|
||||
search_latency_p99: .search_latency_p99_ms // 0,
|
||||
write_latency_p99: .write_latency_p99_ms // 0,
|
||||
cdc_lag_seconds: .cdc_lag_seconds // 0
|
||||
}
|
||||
'
|
||||
```
|
||||
|
||||
**Key thresholds**:
|
||||
- `degraded_shards > 0` → investigate node health
|
||||
- `task_queue_depth > 1000` → task processing bottleneck
|
||||
- `search_latency_p99 > 1000` → slow queries, need optimization
|
||||
- `cdc_lag_seconds > 300` → CDC falling behind
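The thresholds above can be applied mechanically to the metrics JSON. A minimal sketch, assuming the field names used in the `jq` summary in 3.1 and treating missing fields as 0 in the same way; `breached` is a hypothetical helper and the limits are simply the values listed above.

```python
# Limits copied from the "Key thresholds" list; a metric is flagged when it exceeds its limit.
THRESHOLDS = {
    "degraded_shards": 0,
    "task_queue_depth": 1000,
    "search_latency_p99_ms": 1000,
    "cdc_lag_seconds": 300,
}


def breached(metrics: dict) -> list:
    """Return names of metrics whose value exceeds its threshold; missing metrics count as 0."""
    return [name for name, limit in THRESHOLDS.items() if metrics.get(name, 0) > limit]
```

The comparison is strict, so a value exactly at the limit (for example a p99 of 1000 ms) is not flagged.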
|
||||
|
||||
### 3.2 Check Prometheus metrics (if available)
|
||||
```bash
|
||||
# Via Prometheus API
|
||||
curl -s "http://prometheus:9090/api/v1/query?query=miroir_degraded_shards" | jq '.data.result[0].value[1]'
|
||||
|
||||
# Via pod metrics endpoint
|
||||
kubectl exec -n $NAMESPACE miroir-0 -- curl -s localhost:9091/metrics | grep miroir_
|
||||
```
|
||||
|
||||
## Step 4: Check Logs for Errors
|
||||
|
||||
### 4.1 Get recent errors from all pods
|
||||
```bash
|
||||
for pod in $(kubectl get pods -n $NAMESPACE -l app=miroir -o name); do
|
||||
echo "=== $pod ==="
|
||||
kubectl logs -n $NAMESPACE $pod --tail=100 | jq -Rrc 'fromjson? | objects | select(.level=="ERROR")' || true
|
||||
echo ""
|
||||
done
|
||||
```
|
||||
|
||||
**Common error patterns**:
|
||||
- `connection refused` → peer pod down or network issue
|
||||
- `timeout` → slow query, overloaded node
|
||||
- `hash mismatch` → potential data corruption (run anti-entropy)
|
||||
- `lease expired` → leader election contention
|
||||
|
||||
### 4.2 Check coordinator logs for topology changes

```bash
# fromjson? skips non-JSON lines; strings skips entries without a string message
kubectl logs -n "$NAMESPACE" -l app=miroir,role=coordinator --tail=200 | \
  jq -Rrc 'fromjson? | select(.message? | strings | test("topology|node|shard"))'
```

### 4.3 Check for crash loop patterns

```bash
# --previous reads the logs of the last terminated container instance
kubectl logs -n "$NAMESPACE" miroir-0 --previous --tail=100 | \
  jq -Rrc 'fromjson? | select(.level? == "ERROR" or .level? == "FATAL")' || true
```

## Step 5: Check Task Status

### 5.1 List stuck or long-running tasks

```bash
curl -s "$MIROIR_URL/_miroir/tasks?key=$MIROIR_KEY&status=processing" | \
  jq -r '.tasks[] | "\(.miroir_id) (\(.task_type // "unknown")): \(.created_at)"'
```

**Action**: Investigate any task that has been running for more than 1 hour.

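
The one-hour rule can be checked per task from its `created_at` value. This sketch assumes GNU `date` (for the `-d` option) and ISO 8601 timestamps; `is_task_stale` is a hypothetical helper, and the optional third argument exists only so the current time can be fixed for testing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if a task is older than the limit.
# Assumes GNU date and an ISO 8601 timestamp such as 2024-01-01T00:00:00Z.
# Usage: is_task_stale <created_at> [max_age_seconds] [now_epoch]
is_task_stale() {
  local created_at=$1
  local max_age=${2:-3600}            # default limit: 1 hour
  local now=${3:-$(date -u +%s)}      # default: current time
  local created
  # Convert the timestamp to epoch seconds; return 2 if it cannot be parsed.
  created=$(date -u -d "$created_at" +%s) || return 2
  [ $((now - created)) -gt "$max_age" ]
}
```

For example, `is_task_stale "$created_at" && echo "stale"` flags a task whose timestamp came from the listing in Step 5.1.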
### 5.2 Get detailed task status

```bash
miroir-ctl task status --task-id <miroir_task_id> --verbose
```

### 5.3 Check task registry health

```bash
# SQLite mode
kubectl exec -n $NAMESPACE miroir-0 -- \
  sqlite3 /data/miroir.db "SELECT status, COUNT(*) FROM tasks GROUP BY status;"

# Redis mode
kubectl exec -n $NAMESPACE redis-0 -- \
  redis-cli --scan --pattern "miroir:tasks:*" | wc -l
```

## Step 6: Check Anti-Entropy Status

### 6.1 Last AE run time

```bash
curl -s "$MIROIR_URL/_miroir/anti-entropy/status?key=$MIROIR_KEY" | \
  jq '{last_run: .last_run_at, next_run: .next_run_at, divergences_found: .divergences_found}'
```

**Action**: If `last_run_at` is more than 24 hours ago, AE may be stuck.

### 6.2 Check for divergence

```bash
curl -s "$MIROIR_URL/_miroir/anti-entropy/divergence?key=$MIROIR_KEY" | \
  jq '.divergent_shards | length'
```

**Action**: Any divergent shards should trigger an AE run.

## Step 7: Check External Dependencies

### 7.1 Check Redis connectivity

```bash
kubectl exec -n $NAMESPACE miroir-0 -- \
  redis-cli -h redis-headless ping
```

**Expected**: `PONG`

### 7.2 Check Meilisearch backend connectivity

```bash
for i in 0 1 2; do
  echo "=== miroir-$i ==="
  kubectl exec -n $NAMESPACE miroir-$i -- \
    curl -s http://localhost:7700/health | jq '.status'
done
```

**Expected**: `"available"`

### 7.3 Check network policies

```bash
kubectl get networkpolicy -n $NAMESPACE
kubectl describe networkpolicy miroir-allow-peer -n $NAMESPACE
```

## Step 8: Run Self-Diagnostics

### 8.1 Miroir self-check endpoint

```bash
curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY" | jq '.'
```

**Expected output**:
```json
{
  "status": "healthy",
  "checks": {
    "topology": "ok",
    "task_store": "ok",
    "coordinator_leader": "ok",
    "peers_connected": "ok"
  }
}
```

### 8.2 Run canary tests

```bash
# List configured canaries
curl -s "$MIROIR_URL/_miroir/canaries?key=$MIROIR_KEY" | \
  jq -r '.canaries[] | .id'

# Trigger a canary run
curl -X POST "$MIROIR_URL/_miroir/canaries/search-health/run?key=$MIROIR_KEY"
```

## Decision Tree

Based on findings, follow this tree:

```
Are any pods not running?
├─ Yes → Check pod logs (Step 4), describe pod for events
└─ No → Continue

Are any nodes degraded?
├─ Yes → Check node logs, verify network, restart if needed
└─ No → Continue

Is task queue depth > 1000?
├─ Yes → Check for stuck tasks (Step 5), scale workers if needed
└─ No → Continue

Is search latency high?
├─ Yes → Check query patterns, consider query optimization
└─ No → Continue

Any errors in logs?
├─ Yes → Investigate specific error pattern
└─ No → Issue may be external, check dependencies (Step 7)
```

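
The tree above can also be expressed as a first-match chain. This is an illustrative sketch only: `next_step` is a hypothetical helper, its five integer inputs are assumed to have been gathered in Steps 1 to 4, and "high" search latency is taken to mean the 1000 ms p99 threshold from Step 3.1.

```bash
#!/usr/bin/env bash
# Hypothetical helper: walks the decision tree and prints the first matching action.
# Usage: next_step <pods_not_running> <degraded_nodes> <task_queue_depth> <search_p99_ms> <error_count>
next_step() {
  local pods_down=$1 degraded=$2 queue=$3 p99=$4 errors=$5
  if [ "$pods_down" -gt 0 ]; then
    echo "Check pod logs (Step 4) and describe the pod for events"
  elif [ "$degraded" -gt 0 ]; then
    echo "Check node logs, verify network, restart if needed"
  elif [ "$queue" -gt 1000 ]; then
    echo "Check for stuck tasks (Step 5); scale workers if needed"
  elif [ "$p99" -gt 1000 ]; then   # assumed meaning of "high" latency
    echo "Check query patterns; consider query optimization"
  elif [ "$errors" -gt 0 ]; then
    echo "Investigate the specific error pattern"
  else
    echo "Issue may be external; check dependencies (Step 7)"
  fi
}
```

Because the branches are evaluated in order, the function reports only the first problem found, which mirrors reading the tree from top to bottom.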
## Escalation Checklist

Before escalating, gather:

1. **Topology output** (Step 2.1)
2. **Recent errors** (Step 4.1)
3. **Stuck tasks** (Step 5.1)
4. **Metrics snapshot** (Step 3.1)
5. **Pod status** (Step 1.1)

Attach these to your GitHub issue or support ticket.

## Prevention: Regular Health Checks

Set up a cron job or monitoring check that runs this script daily:

```bash
#!/bin/bash
# daily-health-check.sh

# Quick health check
HEALTH=$(curl -s "$MIROIR_URL/_miroir/health?key=$MIROIR_KEY")
# Quote the variable so the JSON reaches jq unmodified
STATUS=$(echo "$HEALTH" | jq -r '.status')

if [ "$STATUS" != "healthy" ]; then
  echo "UNHEALTHY: $HEALTH"
  # Send alert
fi
```

## Related Documentation

- [Common Issues Guide](../troubleshooting.md)
- [Node Drain Runbook](../runbooks/node-drain.md)
- [Migration Runbook](../migration_runbook.md)
- [Metrics Reference](../operations/metrics.md)