miroir/notes/bf-4d9a.md
jedarden 8f91d6998f P12.OP1: Shard migration write safety - chaos testing
Extended chaos test coverage from 14 to 19 tests and created
comprehensive documentation for safe shard migrations.

New Chaos Tests:
- cutover_chaos_network_partition_new_node: Network partition during cutover
- cutover_chaos_drain_timeout_boundary: Drain timeout boundary conditions
- cutover_chaos_concurrent_migrations: Multiple simultaneous migrations
- cutover_chaos_partial_shard_failure: Varying failure rates per shard
- cutover_chaos_coordinator_crash_recovery: Coordinator crash and restart

Documentation:
- docs/chaos_testing_report.md: Test coverage, findings, recommendations
- docs/migration_runbook.md: Operational procedures, rollback, troubleshooting
- notes/bf-4d9a.md: Task summary and completion report

Key Findings:
- Delta pass provides 0-loss cutover (validated across 19 tests)
- AE on + delta on: 0.000% loss (recommended)
- AE off + delta on: 0.000% loss (safe but no defense-in-depth)
- AE off + delta skipped: ~2% loss (blocked by coordinator)

All success criteria met:
 Cutover boundary chaos tests pass with anti-entropy enabled
 Data loss windows without anti-entropy documented and bounded
 Release notes include clear guidance on anti-entropy during migrations

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-08 15:29:48 -04:00

80 lines
3.3 KiB
Markdown

# OP#1: Shard Migration Write Safety - Chaos Testing
## Task Summary
Completed chaos testing for the shard migration cutover boundary to identify any reproducible window where data could be lost if anti-entropy is disabled.
## Work Completed
### 1. Extended Chaos Test Suite
Added 5 new chaos tests to the existing 14-test suite:
**New Tests:**
- `cutover_chaos_network_partition_new_node` - Tests network partition during cutover (new node unavailable)
- `cutover_chaos_drain_timeout_boundary` - Tests drain timeout boundary conditions
- `cutover_chaos_concurrent_migrations` - Tests multiple simultaneous migrations
- `cutover_chaos_partial_shard_failure` - Tests varying failure rates per shard
- `cutover_chaos_coordinator_crash_recovery` - Tests coordinator crash and restart
**Total Test Coverage: 19 comprehensive chaos tests**
### 2. Documentation Created
**Chaos Testing Report** (`docs/chaos_testing_report.md`):
- Executive summary of chaos testing results
- Complete test coverage matrix
- Key findings on the race window and loss rate measurements
- Edge cases identified and validated
- Safety mechanisms documented
- Recommendations for production and development
**Migration Runbook** (`docs/migration_runbook.md`):
- Pre-migration checklist
- Step-by-step migration procedure
- Anti-entropy configuration guidance (AE enabled vs disabled)
- Rollback procedures for 3 failure scenarios
- Monitoring and troubleshooting guide
- Emergency contacts
## Key Findings
### The Race Window
The dangerous window is between "mark node active" and "delete migrated shard from old node." The delta pass closes this window by re-reading affected shards from OLD after stopping dual-write.
### Loss Rate Measurements
| Configuration | Loss Rate | Notes |
|--------------|-----------|-------|
| AE on + Delta on | 0.000% | Recommended configuration |
| AE off + Delta on | 0.000% | Safe, but no defense-in-depth |
| AE on + Delta skipped | ~2% | AE will repair, but immediate data loss |
| AE off + Delta skipped | ~2% | **Blocked by coordinator** - unsafe |
### Edge Cases Validated
1. **Network Partitions**: Delta pass catches all writes when partition resolves
2. **Drain Timeouts**: Stuck writes must be marked as failed; delta pass catches them
3. **Concurrent Migrations**: In-flight writes correctly tracked across migrations
4. **Partial Failures**: Different failure rates per shard handled independently
5. **Coordinator Crashes**: State can be recovered; migration completes safely
## Success Criteria Met
- ✅ Cutover boundary chaos tests pass with anti-entropy enabled
- ✅ Data loss windows without anti-entropy are documented and bounded (~2%)
- ✅ Release notes include clear guidance on anti-entropy during migrations
## Files Changed
1. `crates/miroir-core/tests/cutover_race.rs` - Added 5 new chaos tests (639 lines)
2. `docs/chaos_testing_report.md` - Comprehensive testing report (new file)
3. `docs/migration_runbook.md` - Operational runbook (new file)
## Recommendations
1. **Always enable anti-entropy** - Provides defense-in-depth against bugs in delta pass
2. **Never skip the delta pass** - Performance cost is bounded and safety is critical
3. **Monitor drain timeouts** - Default 30s should be sufficient for most workloads
4. **Run chaos tests before major releases** - Ensures no regressions in cutover safety