Miroir Trade-Offs and Design Decisions
Shard Migration Write Safety (Plan §15 OP#1)
Problem
During node addition, a document written at the exact cutover boundary can be lost if its write succeeds on the OLD node but fails on the NEW node. The dangerous window is between "stop dual-write" and "delete old shard data": once the old shard data is deleted, a document that exists only on OLD is gone.
Solution: Quiesce-Then-Verify Cutover
The migration state machine (migration.rs) uses a multi-phase cutover:
- Stop dual-write — no new writes go to either node for affected shards
- Drain — wait for all in-flight writes to complete on both OLD and NEW
- Delta pass — re-read affected shards from OLD, write any docs missing on NEW
- Activate — routing switches to NEW-only
- Cleanup — delete migrated shard data from OLD
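
The phases above can be sketched as a small state machine plus a delta pass over an in-memory stand-in for a shard. This is a minimal illustration, not the code in migration.rs: the `CutoverPhase` enum, its `next` method, the `delta_pass` function, and the `HashMap<u64, String>` shard representation are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical cutover phases, named after the steps listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CutoverPhase {
    StopDualWrite,
    Drain,
    DeltaPass,
    Activate,
    Cleanup,
    Done,
}

impl CutoverPhase {
    /// Advance to the next phase in order; `Done` is terminal.
    fn next(self) -> CutoverPhase {
        match self {
            CutoverPhase::StopDualWrite => CutoverPhase::Drain,
            CutoverPhase::Drain => CutoverPhase::DeltaPass,
            CutoverPhase::DeltaPass => CutoverPhase::Activate,
            CutoverPhase::Activate => CutoverPhase::Cleanup,
            CutoverPhase::Cleanup | CutoverPhase::Done => CutoverPhase::Done,
        }
    }
}

/// Assumed shape of a delta pass: copy every document that exists on OLD
/// but is missing on NEW, and return how many documents were repaired.
/// Documents already present on NEW are left untouched.
fn delta_pass(old: &HashMap<u64, String>, new: &mut HashMap<u64, String>) -> usize {
    let mut repaired = 0;
    for (id, doc) in old {
        if !new.contains_key(id) {
            new.insert(*id, doc.clone());
            repaired += 1;
        }
    }
    repaired
}
```

Because the delta pass runs only after writes have stopped and drained, OLD cannot change underneath it, which is why a single pass is enough in this sketch. Running it a second time repairs nothing.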
Empirical Results
| Configuration | Writes | Loss Rate | Verdict |
|---|---|---|---|
| AE on + delta pass on | 1M | 0/1M (0.000%) | PASS — production default |
| AE off + delta pass on | 50K | 0/50K (0.000%) | PASS — delta pass is sufficient alone |
| AE on + delta pass skipped | 200 | measurable | Acceptable — AE repairs on next pass |
| AE off + delta pass skipped | 100K | ~2.0% | REFUSED — blocked at config validation |
| Tight-loop boundary (AE+delta) | 1350+ | 0 | PASS — writes at every transition boundary |
| High-volume boundary (AE+delta) | 100K | 0/100K | PASS |
| 3-node cluster (AE+delta) | 2600+ | 0 | PASS — multi-owner cutover |
| 3-node cluster (AE off+delta) | 5000 | 0 | PASS — delta pass alone sufficient |
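
One plausible way to compute a figure like the Loss Rate column is to compare the set of acknowledged writes with the set of documents present after cutover. The sketch below assumes that definition; the `loss_rate` function and the use of `u64` document IDs are hypothetical and are not taken from the chaos tests themselves.

```rust
use std::collections::HashSet;

/// Assumed loss metric: a document counts as lost if its write was
/// acknowledged to the client but it is absent after cutover.
/// Returns (number lost, fraction of acknowledged writes lost).
fn loss_rate(acked: &HashSet<u64>, present_after_cutover: &HashSet<u64>) -> (usize, f64) {
    let lost = acked.difference(present_after_cutover).count();
    let rate = if acked.is_empty() {
        0.0
    } else {
        lost as f64 / acked.len() as f64
    };
    (lost, rate)
}
```

Under this definition, a PASS row corresponds to `lost == 0`, regardless of how many writes were issued.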
Decision: Hard Refusal of Unsafe Configuration
MigrationCoordinator::validate_safety() refuses to start a migration when
both anti-entropy is disabled AND the delta pass is skipped. This is a
hard-coded policy — not a warning — because:
- The measured loss rate without either safety net is ~2% (deterministic, proportional to the write-failure rate during dual-write)
- Anti-entropy runs every 6 hours by default; disabling it removes the reconciliation safety net
- Skipping the delta pass removes the immediate repair mechanism
- With both off, boundary documents have no recovery path at all
The validate_migration_safety() function in anti_entropy.rs provides the
same gate at the cross-module level, ensuring no code path can bypass this
check.
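
The refusal rule can be sketched as follows. Only the names `MigrationCoordinator` and `validate_safety` come from this document; the `MigrationSafetyConfig` struct, its field names, the `SafetyError` type, and the method signature are assumptions made for illustration.

```rust
/// Hypothetical safety-relevant settings; field names are illustrative.
#[derive(Debug, Clone, Copy)]
struct MigrationSafetyConfig {
    anti_entropy_enabled: bool,
    skip_delta_pass: bool,
}

/// Hypothetical error returned when a migration is refused.
#[derive(Debug, PartialEq, Eq)]
enum SafetyError {
    /// Anti-entropy is off and the delta pass is skipped.
    NoRecoveryPath,
}

/// Stand-in for the real coordinator; it holds only the config here.
struct MigrationCoordinator {
    config: MigrationSafetyConfig,
}

impl MigrationCoordinator {
    /// Refuse only the one combination with no recovery path.
    /// Every other combination is allowed to proceed.
    fn validate_safety(&self) -> Result<(), SafetyError> {
        if !self.config.anti_entropy_enabled && self.config.skip_delta_pass {
            return Err(SafetyError::NoRecoveryPath);
        }
        Ok(())
    }
}
```

Returning an error rather than logging makes the policy impossible to ignore: the caller cannot start the migration without handling the `Err` case.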
Anti-Entropy: Required or Optional?
Anti-entropy is optional but recommended. In the tests above, the delta pass alone produced a zero-loss cutover. Anti-entropy exists as a defense-in-depth measure:
- Catches any bugs in the delta pass implementation
- Repairs drift from non-migration causes (network partitions, disk errors)
- Runs on a 6-hour schedule (configurable)
Operators MAY disable anti-entropy if they accept the risk of gradual replica drift. They MAY NOT skip both anti-entropy and the delta pass simultaneously.
Warning When AE Is Disabled During Migration
When anti-entropy is disabled and a migration begins (with delta pass enabled),
the system logs a warning via migration_warning_if_ae_disabled(). This
informs operators that the delta pass is the sole safety mechanism and any
bugs in it could lead to data loss.
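
A minimal sketch of the warning condition is shown below. The function name comes from this document, but its parameters, its `Option<String>` return type, and the message text are assumptions; the real function may log directly instead of returning a message.

```rust
/// Returns a warning message when anti-entropy is disabled but the delta
/// pass is still enabled, and `None` otherwise. The combination of
/// anti-entropy disabled and delta pass skipped also yields `None` here,
/// because that configuration is refused outright before a migration starts.
fn migration_warning_if_ae_disabled(
    anti_entropy_enabled: bool,
    skip_delta_pass: bool,
) -> Option<String> {
    if !anti_entropy_enabled && !skip_delta_pass {
        Some(
            "anti-entropy is disabled: the delta pass is the sole safety \
             mechanism for this migration"
                .to_string(),
        )
    } else {
        None
    }
}
```

A caller could pass the returned message to whatever logger the project uses, for example with `if let Some(msg) = migration_warning_if_ae_disabled(false, false) { eprintln!("WARN: {msg}"); }`.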