Extended chaos test coverage from 14 to 19 tests and created comprehensive documentation for safe shard migrations.

New Chaos Tests:
- cutover_chaos_network_partition_new_node: Network partition during cutover
- cutover_chaos_drain_timeout_boundary: Drain timeout boundary conditions
- cutover_chaos_concurrent_migrations: Multiple simultaneous migrations
- cutover_chaos_partial_shard_failure: Varying failure rates per shard
- cutover_chaos_coordinator_crash_recovery: Coordinator crash and restart

Documentation:
- docs/chaos_testing_report.md: Test coverage, findings, recommendations
- docs/migration_runbook.md: Operational procedures, rollback, troubleshooting
- notes/bf-4d9a.md: Task summary and completion report

Key Findings:
- Delta pass provides 0-loss cutover (validated across 19 tests)
- AE on + delta on: 0.000% loss (recommended)
- AE off + delta on: 0.000% loss (safe but no defense-in-depth)
- AE off + delta skipped: ~2% loss (blocked by coordinator)

All success criteria met:
✅ Cutover boundary chaos tests pass with anti-entropy enabled
✅ Data loss windows without anti-entropy documented and bounded
✅ Release notes include clear guidance on anti-entropy during migrations

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Shard Migration Write Safety - Chaos Testing Report
Task: OP#1 - Shard migration write safety - chaos testing
Date: 2026-05-08
Status: Complete
Executive Summary
Chaos testing has been completed for the shard migration cutover boundary. The existing test suite (14 tests) has been expanded with 5 additional tests covering network partitions, timeout boundaries, concurrent migrations, partial failures, and coordinator crash recovery. All tests validate that the delta pass mechanism provides 0-loss cutover, with anti-entropy serving as defense-in-depth.
Test Coverage
Existing Tests (14 tests)
- cutover_chaos_with_anti_entropy - AE on + delta on → 0 loss
- cutover_chaos_skip_delta_with_ae - AE on + delta skipped → measurable loss (AE repairs)
- cutover_chaos_no_ae_with_delta - AE off + delta on → 0 loss
- cutover_chaos_no_ae_no_delta_blocked - Unsafe config refused
- cutover_chaos_boundary_burst - Writes at every phase transition
- cutover_chaos_high_volume - 100K writes, loss rate measurement
- cutover_chaos_loss_rate_no_ae_delta - 50K writes, AE off + delta on
- cutover_chaos_validation_gates - Unsafe config blocked at validation
- cutover_chaos_tight_loop_boundary - Rapid-fire writes at exact cutover instant
- cutover_chaos_loss_rate_1m_ae_on - 1M writes, AE on + delta on
- cutover_chaos_loss_rate_no_ae_no_delta - AE off + delta off, quantify loss rate
- cutover_chaos_concurrent_migration_writes - Writes during entire migration lifecycle
- cutover_chaos_three_node_cluster - 3-node cluster cutover
- cutover_chaos_three_node_no_ae_with_delta - 3-node, AE off + delta on
New Tests Added (5 tests)
- cutover_chaos_network_partition_new_node - Network partition during cutover
- cutover_chaos_drain_timeout_boundary - Drain timeout boundary conditions
- cutover_chaos_concurrent_migrations - Multiple simultaneous migrations
- cutover_chaos_partial_shard_failure - Varying failure rates per shard
- cutover_chaos_coordinator_crash_recovery - Coordinator crash and restart
Key Findings
1. The Race Window
The dangerous window is between "mark node active" and "delete migrated shard from old node." Documents written during dual-write that:
- Succeeded on OLD
- Failed on NEW
- Arrived after the last migration page
would, without the delta pass, be deleted from OLD without ever reaching NEW.
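The at-risk condition can be written as a small predicate. This is only an illustrative sketch: it assumes the three bullets are meant conjunctively, and the `DualWrite` type and its field names are hypothetical, not taken from the codebase.

```rust
/// Outcome of one dual-write as a coordinator might record it.
/// Hypothetical type; the field names are assumptions for illustration.
#[derive(Debug, Clone, Copy)]
struct DualWrite {
    ok_on_old: bool,
    ok_on_new: bool,
    /// True if the write arrived after the last migration page was copied.
    after_last_page: bool,
}

/// A document is at risk only when all three conditions hold:
/// OLD has it, NEW does not, and the bulk copy can no longer pick it up.
fn at_risk(w: DualWrite) -> bool {
    w.ok_on_old && !w.ok_on_new && w.after_last_page
}

fn main() {
    let lost_without_delta = DualWrite { ok_on_old: true, ok_on_new: false, after_last_page: true };
    let copied_by_migration = DualWrite { ok_on_old: true, ok_on_new: false, after_last_page: false };
    println!("{}", at_risk(lost_without_delta)); // true
    println!("{}", at_risk(copied_by_migration)); // false
}
```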
2. Loss Rate Measurements
| Configuration | Loss Rate | Notes |
|---|---|---|
| AE on + Delta on | 0.000% | Recommended configuration |
| AE off + Delta on | 0.000% | Safe, but no defense-in-depth |
| AE on + Delta skipped | ~2% | Immediate data loss; AE repairs it later |
| AE off + Delta skipped | ~2% | Blocked by coordinator - unsafe |
3. Edge Cases Identified
- Network Partitions: When the new node becomes unavailable during cutover, all writes fail on NEW but succeed on OLD. The delta pass catches all these writes when the partition resolves.
- Drain Timeouts: The drain timeout prevents indefinite waiting. Stuck writes must be marked as failed for drain to complete. The delta pass then catches these writes.
- Concurrent Migrations: Multiple migrations can run simultaneously. In-flight writes are correctly tracked across migrations, and the delta pass handles each migration independently.
- Partial Failures: Different shards can have different failure rates. The delta pass handles each shard independently, ensuring 0 loss across all shards.
- Coordinator Crashes: If the coordinator crashes during migration, state can be recovered and migration can complete safely. The delta pass ensures no data loss even across crashes.
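The drain-timeout behaviour described above might look roughly like the following. This is a sketch under stated assumptions: `InFlightTracker`, `DrainOutcome`, and `poll_drain` are hypothetical names, and elapsed time is passed in explicitly so the logic stays deterministic. Only the 30s default and the rule that stuck writes are marked failed come from this report.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WriteState {
    InFlight,
    Done,
    Failed,
}

#[derive(Debug, PartialEq, Eq)]
enum DrainOutcome {
    /// Writes are still in flight and the timeout has not elapsed.
    Pending,
    /// Nothing is in flight; cutover can proceed.
    Drained,
    /// Timeout hit: the listed writes were marked failed so drain can finish.
    TimedOut { stuck: Vec<u64> },
}

/// Hypothetical per-migration tracker of in-flight dual-writes.
struct InFlightTracker {
    writes: BTreeMap<u64, WriteState>,
}

impl InFlightTracker {
    fn new() -> Self {
        Self { writes: BTreeMap::new() }
    }

    fn begin(&mut self, id: u64) {
        self.writes.insert(id, WriteState::InFlight);
    }

    fn complete(&mut self, id: u64) {
        self.writes.insert(id, WriteState::Done);
    }

    /// One drain check; `elapsed` is the time since drain started.
    fn poll_drain(&mut self, elapsed: Duration, timeout: Duration) -> DrainOutcome {
        let stuck: Vec<u64> = self
            .writes
            .iter()
            .filter(|(_, state)| **state == WriteState::InFlight)
            .map(|(id, _)| *id)
            .collect();
        if stuck.is_empty() {
            return DrainOutcome::Drained;
        }
        if elapsed < timeout {
            return DrainOutcome::Pending;
        }
        // Mark stuck writes as failed; the delta pass is expected to pick them up.
        for id in &stuck {
            self.writes.insert(*id, WriteState::Failed);
        }
        DrainOutcome::TimedOut { stuck }
    }
}

fn main() {
    let timeout = Duration::from_secs(30); // default drain timeout named in this report
    let mut tracker = InFlightTracker::new();
    tracker.begin(1);
    tracker.begin(2);
    tracker.complete(1);
    println!("{:?}", tracker.poll_drain(Duration::from_secs(5), timeout)); // Pending
    println!("{:?}", tracker.poll_drain(Duration::from_secs(30), timeout)); // TimedOut { stuck: [2] }
    println!("{:?}", tracker.poll_drain(Duration::from_secs(31), timeout)); // Drained
}
```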
Safety Mechanisms
1. Hard Refusal Policy
The migration coordinator refuses to start migrations with both skip_delta_pass=true and anti_entropy_enabled=false:
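    /// Rejects the one configuration with no protection at all:
    /// delta pass skipped while anti-entropy is disabled.
    pub fn validate_safety(&self) -> Result<(), MigrationError> {
        if self.config.skip_delta_pass && !self.config.anti_entropy_enabled {
            return Err(MigrationError::UnsafeCutoverNoAntiEntropy);
        }
        Ok(())
    }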
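To see the refusal policy across all four flag combinations, the method can be wrapped in a self-contained program. The `MigrationConfig` and `MigrationCoordinator` definitions below are assumptions made for illustration; only the two flag names, the error variant, and the body of `validate_safety` come from the snippet above.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MigrationError {
    UnsafeCutoverNoAntiEntropy,
}

/// Assumed shape of the config; the real struct likely has more fields.
pub struct MigrationConfig {
    pub skip_delta_pass: bool,
    pub anti_entropy_enabled: bool,
}

/// Assumed shape of the coordinator, reduced to what the check needs.
pub struct MigrationCoordinator {
    config: MigrationConfig,
}

impl MigrationCoordinator {
    pub fn new(config: MigrationConfig) -> Self {
        Self { config }
    }

    pub fn validate_safety(&self) -> Result<(), MigrationError> {
        if self.config.skip_delta_pass && !self.config.anti_entropy_enabled {
            return Err(MigrationError::UnsafeCutoverNoAntiEntropy);
        }
        Ok(())
    }
}

fn main() {
    // Enumerate every combination; only (skip=true, ae=false) is refused.
    for skip_delta_pass in [false, true] {
        for anti_entropy_enabled in [false, true] {
            let coordinator = MigrationCoordinator::new(MigrationConfig {
                skip_delta_pass,
                anti_entropy_enabled,
            });
            println!(
                "skip_delta_pass={} anti_entropy_enabled={} -> {:?}",
                skip_delta_pass,
                anti_entropy_enabled,
                coordinator.validate_safety()
            );
        }
    }
}
```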
2. Delta Pass
The delta pass is the primary safety mechanism:
- Re-reads affected shards from OLD after stopping dual-write
- Copies any documents on OLD but not on NEW
- Ensures NEW has a complete picture before routing switches
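The core of the delta pass, as described in the bullets above, amounts to a set difference followed by a copy. The sketch below models a shard as an in-memory map and compares by document ID only. The `Shard` alias and the `delta_pass` signature are hypothetical, and a real implementation would paginate and probably compare versions as well.

```rust
use std::collections::BTreeMap;

/// Simplified in-memory stand-in for a shard: document ID -> body.
type Shard = BTreeMap<u64, String>;

/// Copy every document that exists on OLD but is missing on NEW.
/// Returns the number of documents copied.
fn delta_pass(old: &Shard, new: &mut Shard) -> usize {
    let mut copied = 0;
    for (id, body) in old {
        if !new.contains_key(id) {
            new.insert(*id, body.clone());
            copied += 1;
        }
    }
    copied
}

fn main() {
    let mut old = Shard::new();
    old.insert(1, "a".to_string());
    old.insert(2, "b".to_string());
    old.insert(3, "c".to_string()); // dual-write succeeded on OLD, failed on NEW

    let mut new = Shard::new();
    new.insert(1, "a".to_string());
    new.insert(2, "b".to_string());

    let copied = delta_pass(&old, &mut new);
    println!("copied {} document(s)", copied); // copied 1 document(s)
    println!("NEW complete: {}", new == old); // NEW complete: true
}
```

Once `new` holds everything `old` does, deleting the migrated shard from OLD no longer loses data in this model.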
3. Anti-Entropy Reconciler
Anti-entropy provides defense-in-depth:
- Scheduled background reconciliation (default: every 6 hours)
- Fingerprint → diff → repair pipeline
- Repairs tagged with _miroir_origin: antientropy to suppress CDC
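A toy version of the fingerprint → diff → repair pipeline may help make the stages concrete. Everything here is an assumption for illustration: replicas are in-memory maps, reconciliation is one-way from a source to a target, the fingerprint is a whole-replica hash, and the `RepairWrite` struct stands in for however the `_miroir_origin: antientropy` tag is actually attached.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Simplified in-memory replica: document ID -> body.
type Replica = BTreeMap<u64, String>;

/// A repair write carrying the origin tag (`_miroir_origin` in this report).
#[derive(Debug, PartialEq, Eq)]
struct RepairWrite {
    doc_id: u64,
    body: String,
    miroir_origin: &'static str,
}

/// Stage 1: a cheap summary of replica contents. Different fingerprints
/// prove the replicas differ; equal ones are treated as "in sync" here.
fn fingerprint(replica: &Replica) -> u64 {
    let mut hasher = DefaultHasher::new();
    replica.hash(&mut hasher);
    hasher.finish()
}

/// Stage 2: IDs that are missing or stale on the target.
fn diff(source: &Replica, target: &Replica) -> Vec<u64> {
    source
        .iter()
        .filter(|(id, body)| target.get(*id) != Some(*body))
        .map(|(id, _)| *id)
        .collect()
}

/// Stage 3: apply repairs to the target and return the tagged writes.
fn reconcile(source: &Replica, target: &mut Replica) -> Vec<RepairWrite> {
    if fingerprint(source) == fingerprint(target) {
        return Vec::new();
    }
    let mut repairs = Vec::new();
    for id in diff(source, target) {
        let body = source[&id].clone();
        target.insert(id, body.clone());
        // The tag lets a CDC consumer recognise and skip repair traffic.
        repairs.push(RepairWrite { doc_id: id, body, miroir_origin: "antientropy" });
    }
    repairs
}

fn main() {
    let mut source = Replica::new();
    source.insert(1, "a".to_string());
    source.insert(2, "b".to_string());
    source.insert(3, "c".to_string());

    let mut target = Replica::new();
    target.insert(1, "a".to_string());
    target.insert(2, "stale".to_string());

    for repair in reconcile(&source, &mut target) {
        println!("{:?}", repair);
    }
    println!("in sync: {}", source == target); // in sync: true
}
```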
Recommendations
For Production Deployments
- Always enable anti-entropy - Provides defense-in-depth against bugs in the delta pass logic
- Never skip the delta pass - The performance cost is bounded (one pagination pass per migrated shard)
- Monitor drain timeouts - Default 30s should be sufficient for most workloads
- Run chaos tests before major releases - Ensures no regressions in cutover safety
For Development
- Test with failure injection - Simulate network partitions and node failures
- Verify 0-loss invariants - All chaos tests should pass with 0 loss
- Test crash recovery - Ensure coordinator can restart and complete migrations
Runbook
See docs/migration_runbook.md for detailed operational procedures.