miroir/docs/migration_runbook.md
jedarden c5238b1bcd docs(troubleshooting): add common issues guide and diagnostic playbook (P11.5)
Implements P11.5 acceptance criteria:
- Created docs/troubleshooting.md with 10 common issues
- Created docs/troubleshooting/diagnostics.md with systematic diagnostic playbook
- Documented 3 required plan §11 issues (primary key required, degraded search results, stuck tasks)
- Added 7 additional issues from Phase 9 chaos testing and operations
- Cross-linked from README, migration runbook, and dump import guide

Documented issues:
1. "primary key required" - Miroir vs Meilisearch difference
2. Search returns fewer results - degraded node handling
3. Task polling stuck - per-node task status recovery
4. Node drain blocked - RF constraints
5. Migration stuck after coordinator crash - recovery procedures
6. High memory usage on Redis - cleanup procedures
7. Index creation fails - topology inconsistency
8. Alias flip conflicts - single vs multi alias types
9. Search timeout during migration - throttling options
10. CDC cursor out of sync - recovery and re-index

Diagnostic playbook covers:
- Cluster health checks (pods, nodes, resources)
- Topology verification and node agreement
- Metrics analysis (degraded shards, task queue, latency)
- Log analysis for error patterns
- Task status inspection
- Anti-entropy status
- External dependency checks
- Self-diagnostics and canary tests

Closes: miroir-uyx.5
2026-05-24 14:02:13 -04:00

460 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Shard Migration Runbook
This runbook provides operational guidance for safely performing shard migrations in the miroir system.
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Pre-Migration Checklist](#pre-migration-checklist)
3. [Migration Procedure](#migration-procedure)
4. [Anti-Entropy Configurations](#anti-entropy-configurations)
5. [Rollback Procedures](#rollback-procedures)
6. [Monitoring and Troubleshooting](#monitoring-and-troubleshooting)
---
## Prerequisites
### System Requirements
- **Cluster Health**: All nodes must be healthy before starting migration
- **Capacity**: New node must have sufficient capacity for migrated shards
- **Network**: Stable network between all nodes
- **Anti-Entropy**: Recommended to be enabled (see [Configurations](#anti-entropy-configurations))
### Configuration
```toml
# Migration configuration
[migration]
drain_timeout = "30s" # Maximum time to wait for in-flight writes
skip_delta_pass = false # Always false for safety
anti_entropy_enabled = true # Recommended: true
# Anti-entropy configuration
[anti_entropy]
enabled = true # Recommended: true
schedule_cron = "0 */6 * * *" # Every 6 hours
shards_per_pass = 0 # 0 = all shards
max_read_concurrency = 2
fingerprint_batch_size = 1000
auto_repair = true
```
---
## Pre-Migration Checklist
### 1. Cluster Health Check
```bash
# Check cluster health
miroir-ctl cluster health
# Verify all nodes are healthy
miroir-ctl nodes list
# Check current shard distribution
miroir-ctl shards list
```
**Expected Result**: All nodes show `Healthy` status, no failed shards.
### 2. Capacity Planning
```bash
# Estimate storage requirements
miroir-ctl reshard simulate \
--index products \
--new-shards 256 \
--docs-avg-size 10kb
```
**Expected Result**: New node has sufficient capacity for migrated shards + 20% buffer.
### 3. Backup Verification
```bash
# Verify backups are current
miroir-ctl backup status
# Check last backup time
miroir-ctl backup list | tail -1
```
**Expected Result**: Last backup completed within RPO window.
### 4. Anti-Entropy Status
```bash
# Check anti-entropy status
miroir-ctl anti-entropy status
# Verify last run
miroir-ctl anti-entropy history | tail -5
```
**Expected Result**: Anti-entropy enabled, last run completed successfully.
### 5. Schedule Window Check
```bash
# Verify current time is within allowed window
miroir-ctl reshard check-window \
--schedule-window off-peak
```
**Expected Result**: Current time is within allowed window (or use `--force` if emergency).
---
## Migration Procedure
### Step 1: Initiate Migration
```bash
miroir-ctl reshard start \
--index products \
--new-shards 256 \
--throttle 10000 \
--schedule-window off-peak
```
**Expected Output**:
```
Migration started: ID=42
Phase: ComputingAssignments
Affected shards: 64 (old nodes: old-0, old-1)
```
### Step 2: Monitor Dual-Write Phase
```bash
# Watch migration progress
miroir-ctl reshard watch --index products
# Check in-flight writes
miroir-ctl reshard stats --index products
```
**Expected Behavior**:
- Dual-write active to both old and new nodes
- Background migration copying documents
- Storage amplification = 2.0× (expected)
- Write latency increased by ~10-20%
**Warning Signs**:
- Write latency > 2× baseline
- High failure rate on new node
- Background migration stuck
### Step 3: Initiate Cutover
```bash
# When background migration completes, initiate cutover
miroir-ctl reshard cutover --index products
```
**Expected Behavior**:
1. **CutoverBegin**: Background migration complete
2. **CutoverDraining**: Waiting for in-flight writes (≤ 30s)
3. **CutoverDeltaPass**: Re-reading source shards for stragglers
4. **CutoverActivate**: New node active, routing switched
5. **CutoverCleanup**: Old shard data deleted
### Step 4: Verify Migration
```bash
# Verify migration completed
miroir-ctl reshard status --index products
# Check for data loss
miroir-ctl anti-entropy verify --index products --shards 0-63
# Verify routing
miroir-ctl routing test --index products --samples 1000
```
**Expected Result**:
- Status: `Complete`
- Data loss: 0 documents
- Routing: 100% to new node for migrated shards
### Step 5: Post-Migration Cleanup
```bash
# Trigger anti-entropy pass to verify
miroir-ctl anti-entropy run --index products --shards 0-63
# Monitor cluster health
miroir-ctl cluster health
# Verify storage reclaimed
miroir-ctl nodes stats --node old-0
```
**Expected Result**:
- Anti-entropy finds 0 divergences
- Cluster healthy
- Old node storage decreased by migrated shard size
---
## Anti-Entropy Configurations
### Configuration A: Anti-Entropy Enabled (Recommended)
**Safety**: 0-loss with defense-in-depth
**Performance**: Minor overhead (6-hourly reconciliation)
```toml
[migration]
drain_timeout = "30s"
skip_delta_pass = false # Delta pass provides primary safety
anti_entropy_enabled = true # AE provides defense-in-depth
[anti_entropy]
enabled = true
schedule_cron = "0 */6 * * *"
auto_repair = true
```
**Migration Flow**:
1. Dual-write + background migration
2. Stop dual-write, drain in-flight writes
3. Delta pass catches stragglers → 0 loss
4. Anti-entropy scheduled to catch any bugs in delta pass
5. New node active, routing switched
**Recovery**: If delta pass has bugs, anti-entropy will repair within 6 hours.
### Configuration B: Anti-Entropy Disabled (Not Recommended)
**Safety**: 0-loss IF delta pass works correctly
**Performance**: No background reconciliation overhead
```toml
[migration]
drain_timeout = "30s"
skip_delta_pass = false # Delta pass is ONLY safety mechanism
anti_entropy_enabled = false # No defense-in-depth
```
**Migration Flow**:
1. Dual-write + background migration
2. Stop dual-write, drain in-flight writes
3. Delta pass catches stragglers → 0 loss (IF no bugs)
4. New node active, routing switched
5. NO background reconciliation
**Warning**: Any bugs in delta pass logic will cause permanent data loss.
**Recommendation**: Only use this configuration if:
- You have comprehensive test coverage
- You can tolerate potential data loss
- You run chaos tests before every deployment
### Configuration C: Skip Delta Pass (Only with AE Enabled)
**Safety**: 0-loss after anti-entropy runs (up to 6 hours)
**Performance**: Faster cutover, but immediate data loss until AE runs
```toml
[migration]
drain_timeout = "30s"
skip_delta_pass = true # Skip delta pass
anti_entropy_enabled = true # AE is ONLY safety mechanism
[anti_entropy]
enabled = true
schedule_cron = "0 */6 * * *" # Or more frequent
auto_repair = true
```
**Migration Flow**:
1. Dual-write + background migration
2. Stop dual-write, drain in-flight writes
3. NO delta pass → stragglers lost
4. New node active, routing switched
5. Anti-entropy repairs within 6 hours
**Warning**: Documents will be lost for up to 6 hours (until AE runs).
**Recommendation**: Only use this configuration if:
- You can tolerate temporary data loss
- You need faster cutover
- You increase AE frequency to hourly or less
---
## Rollback Procedures
### Scenario 1: Migration Failed During Dual-Write
**Symptoms**: High failure rate on new node, migration stuck
**Action**: Abort and retry
```bash
# Abort migration
miroir-ctl reshard abort --index products
# Verify old node still serving
miroir-ctl routing test --index products
# Retry after fixing issue
miroir-ctl reshard start --index products ...
```
**Data Loss**: 0 (old node still serving)
### Scenario 2: Migration Failed During Cutover
**Symptoms**: Drain timeout, delta pass failed
**Action**: Manual intervention required
```bash
# Check migration state
miroir-ctl reshard status --index products
# If drain timeout, check for stuck writes
miroir-ctl writes list --stuck
# Mark stuck writes as failed
miroir-ctl writes fail --doc-id <id> --reason "timeout"
# Retry cutover
miroir-ctl reshard cutover --index products
```
**Data Loss**: 0 (delta pass will catch stragglers)
### Scenario 3: Migration Failed After Activation
**Symptoms**: New node not serving, routing issues
**Action**: Emergency rollback
```bash
# Stop new node
miroir-ctl nodes drain --node new-3
# Revert routing to old node
miroir-ctl routing revert --index products --shards 0-63
# Verify data integrity
miroir-ctl anti-entropy run --index products --shards 0-63
```
**Data Loss**: Potential (if delta pass missed stragglers)
---
## Monitoring and Troubleshooting
### Key Metrics
| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| Write latency | < 2× baseline | 2-5× baseline | > 5× baseline |
| In-flight writes | < 1000 | 1000-10000 | > 10000 |
| Drain time | < 10s | 10-30s | > 30s |
| Delta pass docs | < 100 | 100-1000 | > 1000 |
| AE divergences | 0 | 1-10 | > 10 |
### Troubleshooting Guide
> **For comprehensive troubleshooting**: See the [Troubleshooting Guide](../troubleshooting.md) for common issues and the [Diagnostic Playbook](diagnostics.md) for systematic diagnosis.
#### High Write Latency
**Symptoms**: Write latency increased by > 2× during dual-write
**Diagnosis**:
```bash
# Check write paths
miroir-ctl tracing list --operation write
# Check node health
miroir-ctl nodes health --detailed
```
**Solutions**:
- Reduce throttle rate
- Check network latency
- Verify new node capacity
#### Drain Timeout
**Symptoms**: Migration stuck at CutoverDraining phase
**Diagnosis**:
```bash
# Check stuck writes
miroir-ctl writes list --stuck
# Check drain timeout
miroir-ctl reshard config --show drain_timeout
```
**Solutions**:
- Mark stuck writes as failed
- Increase drain timeout
- Check for network issues
#### High Delta Pass Count
**Symptoms**: Delta pass copying > 1000 documents
**Diagnosis**:
```bash
# Check delta pass details
miroir-ctl reshard status --index products --show delta
# Check dual-write failure rate
miroir-ctl reshard stats --index products --show failures
```
**Solutions**:
- Investigate dual-write failures
- Check new node health
- Verify network stability
#### Anti-Entropy Divergences
**Symptoms**: Anti-entropy finding divergences after migration
**Diagnosis**:
```bash
# Check AE details
miroir-ctl anti-entropy history --index products --detailed
# Check specific shards
miroir-ctl anti-entropy verify --index products --shards 0-63
```
**Solutions**:
- Run AE with auto-repair
- Investigate delta pass logic
- Review migration logs
---
## Emergency Contacts
| Role | Contact |
|------|---------|
| On-call Engineer | on-call@miroir.io |
| Database Lead | db-lead@miroir.io |
| Infrastructure Lead | infra-lead@miroir.io |
---
## Related Documentation
- [Chaos Testing Report](chaos_testing_report.md)
- [Migration Implementation](../crates/miroir-core/src/migration.rs)
- [Anti-Entropy Reconciler](../crates/miroir-core/src/anti_entropy.rs)
- [Phase 4 Cutover Design](../../plan/phase4_cutover.md)
- [Phase 5 Anti-Entropy Design](../../plan/phase5_anti_entropy.md)