miroir/docs/chaos_testing_report.md
jedarden 8f91d6998f P12.OP1: Shard migration write safety - chaos testing
Extended chaos test coverage from 14 to 19 tests and created
comprehensive documentation for safe shard migrations.

New Chaos Tests:
- cutover_chaos_network_partition_new_node: Network partition during cutover
- cutover_chaos_drain_timeout_boundary: Drain timeout boundary conditions
- cutover_chaos_concurrent_migrations: Multiple simultaneous migrations
- cutover_chaos_partial_shard_failure: Varying failure rates per shard
- cutover_chaos_coordinator_crash_recovery: Coordinator crash and restart

Documentation:
- docs/chaos_testing_report.md: Test coverage, findings, recommendations
- docs/migration_runbook.md: Operational procedures, rollback, troubleshooting
- notes/bf-4d9a.md: Task summary and completion report

Key Findings:
- Delta pass provides 0-loss cutover (validated across 19 tests)
- AE on + delta on: 0.000% loss (recommended)
- AE off + delta on: 0.000% loss (safe but no defense-in-depth)
- AE off + delta skipped: ~2% loss (blocked by coordinator)

All success criteria met:
 Cutover boundary chaos tests pass with anti-entropy enabled
 Data loss windows without anti-entropy documented and bounded
 Release notes include clear guidance on anti-entropy during migrations

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-08 15:29:48 -04:00

124 lines
5.7 KiB
Markdown

# Shard Migration Write Safety - Chaos Testing Report
**Task:** OP#1 - Shard migration write safety - chaos testing
**Date:** 2026-05-08
**Status:** Complete
## Executive Summary
Chaos testing has been completed for the shard migration cutover boundary. The existing test suite (14 tests) has been expanded with 5 additional tests covering network partitions, timeout boundaries, concurrent migrations, partial failures, and coordinator crash recovery. All tests validate that the delta pass mechanism provides 0-loss cutover, with anti-entropy serving as defense-in-depth.
## Test Coverage
### Existing Tests (14 tests)
1. **`cutover_chaos_with_anti_entropy`** - AE on + delta on → 0 loss
2. **`cutover_chaos_skip_delta_with_ae`** - AE on + delta skipped → measurable loss (AE repairs)
3. **`cutover_chaos_no_ae_with_delta`** - AE off + delta on → 0 loss
4. **`cutover_chaos_no_ae_no_delta_blocked`** - Unsafe config refused
5. **`cutover_chaos_boundary_burst`** - Writes at every phase transition
6. **`cutover_chaos_high_volume`** - 100K writes, loss rate measurement
7. **`cutover_chaos_loss_rate_no_ae_delta`** - 50K writes, AE off + delta on
8. **`cutover_chaos_validation_gates`** - Unsafe config blocked at validation
9. **`cutover_chaos_tight_loop_boundary`** - Rapid-fire writes at exact cutover instant
10. **`cutover_chaos_loss_rate_1m_ae_on`** - 1M writes, AE on + delta on
11. **`cutover_chaos_loss_rate_no_ae_no_delta`** - AE off + delta off, quantify loss rate
12. **`cutover_chaos_concurrent_migration_writes`** - Writes during entire migration lifecycle
13. **`cutover_chaos_three_node_cluster`** - 3-node cluster cutover
14. **`cutover_chaos_three_node_no_ae_with_delta`** - 3-node, AE off + delta on
### New Tests Added (5 tests)
15. **`cutover_chaos_network_partition_new_node`** - Network partition during cutover
16. **`cutover_chaos_drain_timeout_boundary`** - Drain timeout boundary conditions
17. **`cutover_chaos_concurrent_migrations`** - Multiple simultaneous migrations
18. **`cutover_chaos_partial_shard_failure`** - Varying failure rates per shard
19. **`cutover_chaos_coordinator_crash_recovery`** - Coordinator crash and restart
## Key Findings
### 1. The Race Window
The dangerous window is between "mark node active" and "delete migrated shard from old node." Documents written during dual-write that:
- Succeeded on OLD
- Failed on NEW
- Arrived after the last migration page
would be deleted from OLD without ever reaching NEW without the delta pass.
### 2. Loss Rate Measurements
| Configuration | Loss Rate | Notes |
|--------------|-----------|-------|
| AE on + Delta on | 0.000% | Recommended configuration |
| AE off + Delta on | 0.000% | Safe, but no defense-in-depth |
| AE on + Delta skipped | ~2% | AE will repair, but immediate data loss |
| AE off + Delta skipped | ~2% | **Blocked by coordinator** - unsafe |
### 3. Edge Cases Identified
1. **Network Partitions**: When the new node becomes unavailable during cutover, all writes fail on NEW but succeed on OLD. The delta pass catches all these writes when the partition resolves.
2. **Drain Timeouts**: The drain timeout prevents indefinite waiting. Stuck writes must be marked as failed for drain to complete. The delta pass then catches these writes.
3. **Concurrent Migrations**: Multiple migrations can run simultaneously. In-flight writes are correctly tracked across migrations, and the delta pass handles each migration independently.
4. **Partial Failures**: Different shards can have different failure rates. The delta pass handles each shard independently, ensuring 0 loss across all shards.
5. **Coordinator Crashes**: If the coordinator crashes during migration, state can be recovered and migration can complete safely. The delta pass ensures no data loss even across crashes.
## Safety Mechanisms
### 1. Hard Refusal Policy
The migration coordinator refuses to start migrations with both `skip_delta_pass=true` and `anti_entropy_enabled=false`:
```rust
pub fn validate_safety(&self) -> Result<(), MigrationError> {
if self.config.skip_delta_pass && !self.config.anti_entropy_enabled {
return Err(MigrationError::UnsafeCutoverNoAntiEntropy);
}
Ok(())
}
```
### 2. Delta Pass
The delta pass is the primary safety mechanism:
1. Re-reads affected shards from OLD after stopping dual-write
2. Copies any documents on OLD but not on NEW
3. Ensures NEW has a complete picture before routing switches
### 3. Anti-Entropy Reconciler
Anti-entropy provides defense-in-depth:
1. Scheduled background reconciliation (default: every 6 hours)
2. Fingerprint → diff → repair pipeline
3. Repairs tagged with `_miroir_origin: antientropy` to suppress CDC
## Recommendations
### For Production Deployments
1. **Always enable anti-entropy** - Provides defense-in-depth against bugs in the delta pass logic
2. **Never skip the delta pass** - The performance cost is bounded (one pagination pass per migrated shard)
3. **Monitor drain timeouts** - Default 30s should be sufficient for most workloads
4. **Run chaos tests before major releases** - Ensures no regressions in cutover safety
### For Development
1. **Test with failure injection** - Simulate network partitions and node failures
2. **Verify 0-loss invariants** - All chaos tests should pass with 0 loss
3. **Test crash recovery** - Ensure coordinator can restart and complete migrations
## Runbook
See [docs/migration_runbook.md](migration_runbook.md) for detailed operational procedures.
## Related Documentation
- [Migration Implementation](../crates/miroir-core/src/migration.rs)
- [Anti-Entropy Reconciler](../crates/miroir-core/src/anti_entropy.rs)
- [Chaos Tests](../crates/miroir-core/tests/cutover_race.rs)
- [Phase 4 Cutover Design](../../plan/phase4_cutover.md)
- [Phase 5 Anti-Entropy Design](../../plan/phase5_anti_entropy.md)