# Shard Migration Runbook

This runbook provides operational guidance for safely performing shard migrations in the miroir system.

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Pre-Migration Checklist](#pre-migration-checklist)
3. [Migration Procedure](#migration-procedure)
4. [Anti-Entropy Configurations](#anti-entropy-configurations)
5. [Rollback Procedures](#rollback-procedures)
6. [Monitoring and Troubleshooting](#monitoring-and-troubleshooting)

---

## Prerequisites

### System Requirements

- **Cluster Health**: All nodes must be healthy before starting migration
- **Capacity**: New node must have sufficient capacity for migrated shards
- **Network**: Stable network between all nodes
- **Anti-Entropy**: Recommended to be enabled (see [Configurations](#anti-entropy-configurations))

### Configuration

```toml
# Migration configuration
[migration]
drain_timeout = "30s" # Maximum time to wait for in-flight writes
skip_delta_pass = false # Always false for safety
anti_entropy_enabled = true # Recommended: true

# Anti-entropy configuration
[anti_entropy]
enabled = true # Recommended: true
schedule_cron = "0 */6 * * *" # Every 6 hours
shards_per_pass = 0 # 0 = all shards
max_read_concurrency = 2
fingerprint_batch_size = 1000
auto_repair = true
```

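The safety rule implied by the configuration above (the delta pass may only be skipped when anti-entropy is on) can be checked before a migration starts. The following is a minimal sketch, not part of miroir: `check_migration_config` is a hypothetical helper, it takes a plain dict shaped like the parsed TOML (for example, what Python's `tomllib` would return), and it assumes the two anti-entropy switches are meant to agree.

```python
def check_migration_config(cfg: dict) -> list:
    """Return a list of problems found in a parsed config (empty if none)."""
    migration = cfg.get("migration", {})
    anti_entropy = cfg.get("anti_entropy", {})

    skip_delta = migration.get("skip_delta_pass", False)
    ae_for_migration = migration.get("anti_entropy_enabled", False)
    ae_enabled = anti_entropy.get("enabled", False)

    problems = []
    # Skipping the delta pass leaves anti-entropy as the only safety net.
    if skip_delta and not (ae_for_migration and ae_enabled):
        problems.append("skip_delta_pass = true requires anti-entropy to be enabled")
    # Assumption: both anti-entropy switches should carry the same value.
    if ae_for_migration != ae_enabled:
        problems.append("migration.anti_entropy_enabled and anti_entropy.enabled disagree")
    return problems


recommended = {
    "migration": {"drain_timeout": "30s", "skip_delta_pass": False, "anti_entropy_enabled": True},
    "anti_entropy": {"enabled": True, "schedule_cron": "0 */6 * * *"},
}
unsafe = {"migration": {"skip_delta_pass": True, "anti_entropy_enabled": False}}

print(check_migration_config(recommended))
print(check_migration_config(unsafe))
```

A check like this could run in CI or as a pre-flight step; whether the coordinator enforces the same rule itself should be confirmed against the implementation.
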
---

## Pre-Migration Checklist

### 1. Cluster Health Check

```bash
# Check cluster health
miroir-ctl cluster health

# Verify all nodes are healthy
miroir-ctl nodes list

# Check current shard distribution
miroir-ctl shards list
```

**Expected Result**: All nodes show `Healthy` status, no failed shards.

### 2. Capacity Planning

```bash
# Estimate storage requirements
miroir-ctl reshard simulate \
  --index products \
  --new-shards 256 \
  --docs-avg-size 10kb
```

**Expected Result**: New node has sufficient capacity for migrated shards + 20% buffer.

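The "migrated shards + 20% buffer" rule is simple arithmetic, sketched below. The function names and the example sizes are illustrative assumptions; the real figures would come from the `reshard simulate` output.

```python
GIB = 2**30


def required_bytes(migrated_bytes: int, buffer_percent: int = 20) -> int:
    """Space needed on the new node: migrated data plus a percentage buffer."""
    total = migrated_bytes * (100 + buffer_percent)
    return (total + 99) // 100  # integer division, rounded up


def has_capacity(free_bytes: int, migrated_bytes: int, buffer_percent: int = 20) -> bool:
    """True if the new node's free space covers the data plus the buffer."""
    return free_bytes >= required_bytes(migrated_bytes, buffer_percent)


# Example: migrating 500 GiB needs 600 GiB free on the new node.
print(required_bytes(500 * GIB) // GIB)
print(has_capacity(550 * GIB, 500 * GIB))
```
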
### 3. Backup Verification

```bash
# Verify backups are current
miroir-ctl backup status

# Check last backup time
miroir-ctl backup list | tail -1
```

**Expected Result**: Last backup completed within RPO window.

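"Within RPO window" means the newest backup is no older than the recovery point objective. A minimal sketch of that comparison follows; the 4-hour RPO and the timestamps are made-up example values, and `backup_within_rpo` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def backup_within_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the most recent backup is no older than the RPO."""
    return now - last_backup <= rpo


now = datetime(2024, 1, 1, 12, 0)
rpo = timedelta(hours=4)  # illustrative value, not taken from the runbook
fresh = backup_within_rpo(datetime(2024, 1, 1, 9, 0), now, rpo)
stale = backup_within_rpo(datetime(2024, 1, 1, 6, 0), now, rpo)
print(fresh, stale)
```
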
### 4. Anti-Entropy Status

```bash
# Check anti-entropy status
miroir-ctl anti-entropy status

# Verify last run
miroir-ctl anti-entropy history | tail -5
```

**Expected Result**: Anti-entropy enabled, last run completed successfully.

### 5. Schedule Window Check

```bash
# Verify current time is within allowed window
miroir-ctl reshard check-window \
  --schedule-window off-peak
```

**Expected Result**: Current time is within allowed window (or use `--force` if emergency).

---

## Migration Procedure

### Step 1: Initiate Migration

```bash
miroir-ctl reshard start \
  --index products \
  --new-shards 256 \
  --throttle 10000 \
  --schedule-window off-peak
```

**Expected Output**:
```
Migration started: ID=42
Phase: ComputingAssignments
Affected shards: 64 (old nodes: old-0, old-1)
```

### Step 2: Monitor Dual-Write Phase

```bash
# Watch migration progress
miroir-ctl reshard watch --index products

# Check in-flight writes
miroir-ctl reshard stats --index products
```

**Expected Behavior**:
- Dual-write active to both old and new nodes
- Background migration copying documents
- Storage amplification = 2.0× (expected)
- Write latency increased by ~10-20%

**Warning Signs**:
- Write latency > 2× baseline
- High failure rate on new node
- Background migration stuck

### Step 3: Initiate Cutover

```bash
# When background migration completes, initiate cutover
miroir-ctl reshard cutover --index products
```

**Expected Behavior**:
1. **CutoverBegin**: Background migration complete
2. **CutoverDraining**: Waiting for in-flight writes (≤ 30s)
3. **CutoverDeltaPass**: Re-reading source shards for stragglers
4. **CutoverActivate**: New node active, routing switched
5. **CutoverCleanup**: Old shard data deleted

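The five cutover phases form a strictly ordered sequence, which is handy when checking what `reshard watch` reports. The sketch below models that order; the enum and helper names are hypothetical and only the phase names come from the list above. It is not the state machine in `migration.rs`.

```python
from enum import Enum


class CutoverPhase(Enum):
    BEGIN = "CutoverBegin"
    DRAINING = "CutoverDraining"
    DELTA_PASS = "CutoverDeltaPass"
    ACTIVATE = "CutoverActivate"
    CLEANUP = "CutoverCleanup"


def cutover_sequence(skip_delta_pass: bool = False) -> list:
    """Phase names in execution order, optionally without the delta pass."""
    return [
        phase.value
        for phase in CutoverPhase  # Enum iterates in definition order
        if not (skip_delta_pass and phase is CutoverPhase.DELTA_PASS)
    ]


def is_in_order(observed: list) -> bool:
    """True if the observed phase names never move backwards.

    Raises ValueError for a name that is not a cutover phase.
    """
    order = [phase.value for phase in CutoverPhase]
    positions = [order.index(name) for name in observed]
    return positions == sorted(positions)


print(cutover_sequence())
print(is_in_order(["CutoverActivate", "CutoverDraining"]))
```
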
### Step 4: Verify Migration

```bash
# Verify migration completed
miroir-ctl reshard status --index products

# Check for data loss
miroir-ctl anti-entropy verify --index products --shards 0-63

# Verify routing
miroir-ctl routing test --index products --samples 1000
```

**Expected Result**:
- Status: `Complete`
- Data loss: 0 documents
- Routing: 100% to new node for migrated shards

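Two of the checks above reduce to small calculations: `--shards 0-63` should cover the 64 affected shards, and every routing sample should land on the new node. The sketch below assumes the shard range is inclusive on both ends and that routing samples can be reduced to a list of node names; both helpers and the node name `new-3` used in the example are illustrative.

```python
def parse_shard_range(spec: str) -> range:
    """Parse an inclusive 'first-last' shard range such as '0-63'."""
    first, last = (int(part) for part in spec.split("-"))
    if first > last:
        raise ValueError(f"empty shard range: {spec}")
    return range(first, last + 1)


def fraction_routed_to(targets: list, node: str) -> float:
    """Fraction of sampled requests that were routed to `node`."""
    return sum(1 for target in targets if target == node) / len(targets)


print(len(parse_shard_range("0-63")))
samples = ["new-3"] * 999 + ["old-0"]
print(fraction_routed_to(samples, "new-3"))
```

Anything below 1.0 for migrated shards means some requests still reach an old node and the routing table needs a closer look.
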
### Step 5: Post-Migration Cleanup

```bash
# Trigger anti-entropy pass to verify
miroir-ctl anti-entropy run --index products --shards 0-63

# Monitor cluster health
miroir-ctl cluster health

# Verify storage reclaimed
miroir-ctl nodes stats --node old-0
```

**Expected Result**:
- Anti-entropy finds 0 divergences
- Cluster healthy
- Old node storage decreased by migrated shard size

---

## Anti-Entropy Configurations

### Configuration A: Anti-Entropy Enabled (Recommended)

**Safety**: 0-loss with defense-in-depth
**Performance**: Minor overhead (6-hourly reconciliation)

```toml
[migration]
drain_timeout = "30s"
skip_delta_pass = false # Delta pass provides primary safety
anti_entropy_enabled = true # AE provides defense-in-depth

[anti_entropy]
enabled = true
schedule_cron = "0 */6 * * *"
auto_repair = true
```

**Migration Flow**:
1. Dual-write + background migration
2. Stop dual-write, drain in-flight writes
3. Delta pass catches stragglers → 0 loss
4. Anti-entropy scheduled to catch any bugs in delta pass
5. New node active, routing switched

**Recovery**: If delta pass has bugs, anti-entropy will repair within 6 hours.

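To see why the delta pass closes the loss window, it helps to model the flow with two in-memory stores. This is a toy sketch under stated assumptions: dicts stand in for the old and new nodes, a failed dual-write simply skips the new node, and documents are compared by full body (the real reconciler presumably compares fingerprints, as `fingerprint_batch_size` suggests). None of these function names exist in miroir.

```python
def dual_write(old: dict, new: dict, doc_id: str, body: str, new_node_up: bool = True) -> None:
    """Always write to the old node; write to the new node only if reachable."""
    old[doc_id] = body
    if new_node_up:
        new[doc_id] = body


def delta_pass(old: dict, new: dict) -> int:
    """Re-read the source and copy every missing or stale document."""
    copied = 0
    for doc_id, body in old.items():
        if new.get(doc_id) != body:
            new[doc_id] = body
            copied += 1
    return copied


old, new = {}, {}
dual_write(old, new, "a", "v1")
dual_write(old, new, "b", "v1", new_node_up=False)  # straggler: missing on new
dual_write(old, new, "a", "v2", new_node_up=False)  # straggler: stale on new

copied = delta_pass(old, new)
print(copied, new == old)
```

In this model a second `delta_pass` call copies nothing, which mirrors the expectation that a later anti-entropy run finds 0 divergences when the delta pass worked.
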
### Configuration B: Anti-Entropy Disabled (Not Recommended)

**Safety**: 0-loss IF delta pass works correctly
**Performance**: No background reconciliation overhead

```toml
[migration]
drain_timeout = "30s"
skip_delta_pass = false # Delta pass is ONLY safety mechanism
anti_entropy_enabled = false # No defense-in-depth
```

**Migration Flow**:
1. Dual-write + background migration
2. Stop dual-write, drain in-flight writes
3. Delta pass catches stragglers → 0 loss (IF no bugs)
4. New node active, routing switched
5. NO background reconciliation

**Warning**: Any bugs in delta pass logic will cause permanent data loss.

**Recommendation**: Only use this configuration if:
- You have comprehensive test coverage
- You can tolerate potential data loss
- You run chaos tests before every deployment

### Configuration C: Skip Delta Pass (Only with AE Enabled)

**Safety**: 0-loss only after anti-entropy runs (up to 6 hours after cutover)
**Performance**: Faster cutover, but straggler documents are missing from the new node until AE runs

```toml
[migration]
drain_timeout = "30s"
skip_delta_pass = true # Skip delta pass
anti_entropy_enabled = true # AE is ONLY safety mechanism

[anti_entropy]
enabled = true
schedule_cron = "0 */6 * * *" # Or more frequent
auto_repair = true
```

**Migration Flow**:
1. Dual-write + background migration
2. Stop dual-write, drain in-flight writes
3. NO delta pass → stragglers missing on new node
4. New node active, routing switched
5. Anti-entropy repairs within 6 hours

**Warning**: Straggler documents will be missing for up to 6 hours (until the next AE run repairs them).

**Recommendation**: Only use this configuration if:
- You can tolerate temporary data loss
- You need faster cutover
- You increase AE frequency to hourly or more often

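The size of the temporary loss window depends on where the cutover falls relative to the anti-entropy schedule. The sketch below computes the next firing time for a `0 */N * * *` style schedule, using naive datetimes and ignoring time zones and DST. `next_ae_run` is a hypothetical helper, and it only gives the time until the run starts; the repair itself takes additional time that is not modeled here.

```python
from datetime import datetime, timedelta


def next_ae_run(after: datetime, every_hours: int = 6) -> datetime:
    """Next firing of a '0 */N * * *' schedule strictly after `after`."""
    day_start = after.replace(hour=0, minute=0, second=0, microsecond=0)
    # Cron step values restart at hour 0 each day: 0, N, 2N, ... below 24.
    for hour in range(0, 24, every_hours):
        candidate = day_start + timedelta(hours=hour)
        if candidate > after:
            return candidate
    return day_start + timedelta(days=1)


cutover = datetime(2024, 1, 1, 6, 0, 1)  # just missed the 06:00 run
print(next_ae_run(cutover) - cutover)                  # close to the 6-hour worst case
print(next_ae_run(cutover, every_hours=1) - cutover)   # hourly schedule shrinks the window
```
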
---

## Rollback Procedures

### Scenario 1: Migration Failed During Dual-Write

**Symptoms**: High failure rate on new node, migration stuck

**Action**: Abort and retry

```bash
# Abort migration
miroir-ctl reshard abort --index products

# Verify old node still serving
miroir-ctl routing test --index products

# Retry after fixing issue
miroir-ctl reshard start --index products ...
```

**Data Loss**: 0 (old node still serving)

### Scenario 2: Migration Failed During Cutover

**Symptoms**: Drain timeout, delta pass failed

**Action**: Manual intervention required

```bash
# Check migration state
miroir-ctl reshard status --index products

# If drain timeout, check for stuck writes
miroir-ctl writes list --stuck

# Mark stuck writes as failed
miroir-ctl writes fail --doc-id <id> --reason "timeout"

# Retry cutover
miroir-ctl reshard cutover --index products
```

**Data Loss**: 0 (delta pass will catch stragglers)

### Scenario 3: Migration Failed After Activation

**Symptoms**: New node not serving, routing issues

**Action**: Emergency rollback

```bash
# Drain the new node
miroir-ctl nodes drain --node new-3

# Revert routing to old node
miroir-ctl routing revert --index products --shards 0-63

# Verify data integrity
miroir-ctl anti-entropy run --index products --shards 0-63
```

**Data Loss**: Potential (if delta pass missed stragglers)

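Which scenario applies depends on the phase in which the migration failed. The lookup below is one way to encode that decision. It rests on assumptions: `DualWrite` is a placeholder phase name (the runbook does not name the dual-write phase), and a failure in `CutoverActivate` or `CutoverCleanup` is treated as "after activation". Confirm the real phase names with `miroir-ctl reshard status` before relying on anything like this.

```python
SCENARIOS = {
    1: ("Abort and retry", "0 (old node still serving)"),
    2: ("Manual intervention required", "0 (delta pass will catch stragglers)"),
    3: ("Emergency rollback", "Potential (if delta pass missed stragglers)"),
}

DURING_CUTOVER = {"CutoverBegin", "CutoverDraining", "CutoverDeltaPass"}
AFTER_ACTIVATION = {"CutoverActivate", "CutoverCleanup"}  # assumption, see above


def rollback_scenario(phase: str) -> int:
    """Map the phase in which the migration failed to a rollback scenario."""
    if phase == "DualWrite":  # hypothetical phase name
        return 1
    if phase in DURING_CUTOVER:
        return 2
    if phase in AFTER_ACTIVATION:
        return 3
    raise ValueError(f"no rollback scenario documented for phase {phase!r}")


action, data_loss = SCENARIOS[rollback_scenario("CutoverDraining")]
print(action, "|", data_loss)
```
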
---

## Monitoring and Troubleshooting

### Key Metrics

| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| Write latency | < 2× baseline | 2-5× baseline | > 5× baseline |
| In-flight writes | < 1000 | 1000-10000 | > 10000 |
| Drain time | < 10s | 10-30s | > 30s |
| Delta pass docs | < 100 | 100-1000 | > 1000 |
| AE divergences | 0 | 1-10 | > 10 |

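The thresholds in the table translate directly into an alerting rule. The sketch below is illustrative only: the metric keys are invented names, write latency is expressed as a ratio to baseline, and drain time is in seconds. Boundary values (for example exactly 30s) are treated as Warning, matching the inclusive ranges in the table.

```python
# metric key -> (first Warning value, last Warning value)
THRESHOLDS = {
    "write_latency_ratio": (2, 5),
    "in_flight_writes": (1000, 10000),
    "drain_time_s": (10, 30),
    "delta_pass_docs": (100, 1000),
    "ae_divergences": (1, 10),
}


def classify(metric: str, value: float) -> str:
    """Return 'Healthy', 'Warning' or 'Critical' for a metric reading."""
    warn_from, warn_to = THRESHOLDS[metric]
    if value < warn_from:
        return "Healthy"
    if value <= warn_to:
        return "Warning"
    return "Critical"


print(classify("write_latency_ratio", 1.15))  # dual-write overhead of ~15%
print(classify("drain_time_s", 30))
print(classify("ae_divergences", 11))
```
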
### Troubleshooting Guide

#### High Write Latency

**Symptoms**: Write latency increased by > 2× during dual-write

**Diagnosis**:
```bash
# Check write paths
miroir-ctl tracing list --operation write

# Check node health
miroir-ctl nodes health --detailed
```

**Solutions**:
- Reduce throttle rate
- Check network latency
- Verify new node capacity

#### Drain Timeout

**Symptoms**: Migration stuck at CutoverDraining phase

**Diagnosis**:
```bash
# Check stuck writes
miroir-ctl writes list --stuck

# Check drain timeout
miroir-ctl reshard config --show drain_timeout
```

**Solutions**:
- Mark stuck writes as failed
- Increase drain timeout
- Check for network issues

#### High Delta Pass Count

**Symptoms**: Delta pass copying > 1000 documents

**Diagnosis**:
```bash
# Check delta pass details
miroir-ctl reshard status --index products --show delta

# Check dual-write failure rate
miroir-ctl reshard stats --index products --show failures
```

**Solutions**:
- Investigate dual-write failures
- Check new node health
- Verify network stability

#### Anti-Entropy Divergences

**Symptoms**: Anti-entropy finding divergences after migration

**Diagnosis**:
```bash
# Check AE details
miroir-ctl anti-entropy history --index products --detailed

# Check specific shards
miroir-ctl anti-entropy verify --index products --shards 0-63
```

**Solutions**:
- Run AE with auto-repair
- Investigate delta pass logic
- Review migration logs

---

## Emergency Contacts

| Role | Contact |
|------|---------|
| On-call Engineer | on-call@miroir.io |
| Database Lead | db-lead@miroir.io |
| Infrastructure Lead | infra-lead@miroir.io |

---

## Related Documentation

- [Chaos Testing Report](chaos_testing_report.md)
- [Migration Implementation](../crates/miroir-core/src/migration.rs)
- [Anti-Entropy Reconciler](../crates/miroir-core/src/anti_entropy.rs)
- [Phase 4 Cutover Design](../../plan/phase4_cutover.md)
- [Phase 5 Anti-Entropy Design](../../plan/phase5_anti_entropy.md)