14 chaos tests validate shard migration write safety at every cutover boundary.

Key findings:
- AE on + delta pass: 0/1M loss (production default)
- AE off + delta pass: 0/50K loss (delta pass is sufficient alone)
- AE off + delta skipped: ~2% loss → hard refusal at config validation
- 3-node cluster cutover: 0 loss with delta pass

Hard-coded policy: MigrationCoordinator refuses migrations when both anti-entropy is disabled and delta pass is skipped. Warning logged when AE is disabled but delta pass remains active.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Miroir Trade-Offs and Design Decisions

## Shard Migration Write Safety (Plan §15 OP#1)

### Problem

During node addition, documents written at the exact cutover boundary can be
lost if they succeed on the OLD node but fail on the NEW node. The dangerous
window is between "stop dual-write" and "delete old shard data."

### Solution: Quiesce-Then-Verify Cutover

The migration state machine (`migration.rs`) uses a multi-phase cutover:

1. **Stop dual-write**: no new writes go to either node for affected shards
2. **Drain**: wait for all in-flight writes to complete on both OLD and NEW
3. **Delta pass**: re-read affected shards from OLD, write any docs missing on NEW
4. **Activate**: routing switches to NEW-only
5. **Cleanup**: delete migrated shard data from OLD

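The phase ordering above can be sketched as a small state machine. This is an illustration only, not the contents of `migration.rs`: the `CutoverPhase` enum, its variant names, the `skip_delta_pass` flag, and the `accepts_writes` helper are all assumptions. In particular, `accepts_writes` encodes one reading of step 1, namely that writes to affected shards stay paused from stop-dual-write until activation.

```rust
/// Cutover phases in the order listed above. All names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CutoverPhase {
    DualWrite,
    StopDualWrite,
    Drain,
    DeltaPass,
    Activate,
    Cleanup,
    Done,
}

impl CutoverPhase {
    /// Advance to the next phase. `skip_delta_pass` models a configuration
    /// in which the delta pass is skipped (assumed flag name).
    pub fn next(self, skip_delta_pass: bool) -> CutoverPhase {
        use CutoverPhase::*;
        match self {
            DualWrite => StopDualWrite,
            StopDualWrite => Drain,
            Drain if skip_delta_pass => Activate,
            Drain => DeltaPass,
            DeltaPass => Activate,
            Activate => Cleanup,
            Cleanup | Done => Done,
        }
    }

    /// Whether writes to the affected shards are accepted in this phase.
    /// Assumption: writes are paused between stop-dual-write and activation.
    pub fn accepts_writes(self) -> bool {
        !matches!(
            self,
            CutoverPhase::StopDualWrite | CutoverPhase::Drain | CutoverPhase::DeltaPass
        )
    }
}

/// Collect every phase visited from `DualWrite` through `Done`.
pub fn phases(skip_delta_pass: bool) -> Vec<CutoverPhase> {
    let mut visited = vec![CutoverPhase::DualWrite];
    let mut current = CutoverPhase::DualWrite;
    while current != CutoverPhase::Done {
        current = current.next(skip_delta_pass);
        visited.push(current);
    }
    visited
}
```

Calling `phases(false)` walks every phase including `DeltaPass`, while `phases(true)` jumps from `Drain` straight to `Activate`. Keeping the skip decision in the transition function makes the risky path easy to gate in one place.
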
### Empirical Results

| Configuration | Writes | Loss Rate | Verdict |
|---|---|---|---|
| AE on + delta pass on | 1M | 0/1M (0.000%) | **PASS**: production default |
| AE off + delta pass on | 50K | 0/50K (0.000%) | PASS: delta pass is sufficient alone |
| AE on + delta pass skipped | 200 | measurable | Acceptable: AE repairs on next pass |
| AE off + delta pass skipped | 100K | ~2.0% | **REFUSED**: blocked at config validation |
| Tight-loop boundary (AE+delta) | 1350+ | 0 | PASS: writes at every transition boundary |
| High-volume boundary (AE+delta) | 100K | 0/100K | PASS |
| 3-node cluster (AE+delta) | 2600+ | 0 | PASS: multi-owner cutover |
| 3-node cluster (AE off+delta) | 5000 | 0 | PASS: delta pass alone sufficient |

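To make the difference between the "delta pass on" and "delta pass skipped" rows concrete, here is a toy in-memory model of the boundary failure and its repair. It is a sketch, not the chaos-test harness: the `Shard` type, the function names, and the injected failure rate are all assumptions, and the numbers it produces come from that injected rate rather than from the measurements in the table.

```rust
use std::collections::BTreeMap;

/// Toy shard: document id -> body (hypothetical representation).
type Shard = BTreeMap<u64, String>;

/// Dual-write one document. `new_fails` simulates a write that succeeds
/// on OLD but fails on NEW, the case described in the Problem section.
fn dual_write(old: &mut Shard, new: &mut Shard, id: u64, body: &str, new_fails: bool) {
    old.insert(id, body.to_string());
    if !new_fails {
        new.insert(id, body.to_string());
    }
}

/// Delta pass: re-read OLD and copy any document missing on NEW.
/// Returns the number of documents repaired.
fn delta_pass(old: &Shard, new: &mut Shard) -> usize {
    let mut repaired = 0;
    for (id, body) in old {
        if !new.contains_key(id) {
            new.insert(*id, body.clone());
            repaired += 1;
        }
    }
    repaired
}

/// Documents that would be lost if OLD were deleted right now.
fn would_be_lost(old: &Shard, new: &Shard) -> usize {
    old.keys().filter(|id| !new.contains_key(*id)).count()
}

/// Write `writes` documents, failing every `fail_every`-th write on NEW,
/// and report the at-risk count before and after the delta pass.
fn simulate(writes: u64, fail_every: u64) -> (usize, usize) {
    let (mut old, mut new) = (Shard::new(), Shard::new());
    for id in 0..writes {
        dual_write(&mut old, &mut new, id, "doc", id % fail_every == 0);
    }
    let before = would_be_lost(&old, &new);
    delta_pass(&old, &mut new);
    let after = would_be_lost(&old, &new);
    (before, after)
}
```

With an injected failure on every 50th write, `simulate(1000, 50)` leaves 20 documents at risk before the delta pass and none after it. Without the delta pass or a later anti-entropy run, those 20 documents would be deleted with OLD during cleanup.
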
### Decision: Hard Refusal of Unsafe Configuration

`MigrationCoordinator::validate_safety()` refuses to start a migration when
both anti-entropy is disabled AND the delta pass is skipped. This is a
**hard-coded policy**, not a warning, because:

- The measured loss rate without either safety net is ~2% (deterministic,
  proportional to the write-failure rate during dual-write)
- Anti-entropy runs every 6 hours by default; disabling it removes the
  reconciliation safety net
- Skipping the delta pass removes the immediate repair mechanism
- Disabling both leaves **zero recovery path** for boundary documents

The `validate_migration_safety()` function in `anti_entropy.rs` provides the
same gate at the cross-module level, ensuring no code path can bypass this
check.

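A minimal sketch of what such a gate could look like follows. Only the names `MigrationCoordinator` and `validate_safety` come from this document; the `MigrationSafetyConfig` struct, its field names, the `MigrationError` type, and `start_migration` are assumptions made for illustration and may not match the real code.

```rust
use std::fmt;

/// Hypothetical safety-related settings; field names are assumptions.
#[derive(Debug, Clone, Copy)]
pub struct MigrationSafetyConfig {
    pub anti_entropy_enabled: bool,
    pub skip_delta_pass: bool,
}

/// Hypothetical error type for a refused migration.
#[derive(Debug, PartialEq, Eq)]
pub enum MigrationError {
    UnsafeConfiguration,
}

impl fmt::Display for MigrationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MigrationError::UnsafeConfiguration => write!(
                f,
                "refusing migration: anti-entropy is disabled and the delta pass is skipped"
            ),
        }
    }
}

impl std::error::Error for MigrationError {}

pub struct MigrationCoordinator {
    config: MigrationSafetyConfig,
}

impl MigrationCoordinator {
    pub fn new(config: MigrationSafetyConfig) -> Self {
        Self { config }
    }

    /// Refuse only the combination with no recovery path:
    /// anti-entropy disabled AND delta pass skipped.
    pub fn validate_safety(&self) -> Result<(), MigrationError> {
        if !self.config.anti_entropy_enabled && self.config.skip_delta_pass {
            return Err(MigrationError::UnsafeConfiguration);
        }
        Ok(())
    }

    /// Every migration entry point runs the gate first, so the check is an
    /// error the caller must handle rather than a log line.
    pub fn start_migration(&self) -> Result<(), MigrationError> {
        self.validate_safety()?;
        // The cutover phases would run here.
        Ok(())
    }
}
```

Returning a `Result` instead of logging makes the refusal impossible to ignore: a caller that wants to proceed has to change the configuration, not suppress a warning.
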
### Anti-Entropy: Required or Optional?

**Anti-entropy is optional but recommended.** The delta pass alone provides
0-loss cutover. Anti-entropy exists as a defense-in-depth measure:

- Catches any bugs in the delta pass implementation
- Repairs drift from non-migration causes (network partitions, disk errors)
- Runs on a 6-hour schedule (configurable)

Operators MAY disable anti-entropy if they accept the risk of gradual replica
drift. They MAY NOT skip both anti-entropy and the delta pass simultaneously.

### Warning When AE Is Disabled During Migration

When anti-entropy is disabled and a migration begins (with delta pass enabled),
the system logs a warning via `migration_warning_if_ae_disabled()`. This
informs operators that the delta pass is the sole safety mechanism and any
bugs in it could lead to data loss.
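
The warning condition can be sketched as a pure function plus a thin logging wrapper. The function name comes from this document, but its signature, the message text, and the `log_migration_start` helper are assumptions. The sketch writes to stderr with `eprintln!` as a stand-in for whatever logging facility the project actually uses.

```rust
/// Return a warning message when anti-entropy is disabled but the delta
/// pass is still enabled. The signature is assumed, not taken from the code.
pub fn migration_warning_if_ae_disabled(
    anti_entropy_enabled: bool,
    delta_pass_enabled: bool,
) -> Option<&'static str> {
    if !anti_entropy_enabled && delta_pass_enabled {
        Some(
            "anti-entropy is disabled: the delta pass is the sole safety \
             mechanism for this migration",
        )
    } else {
        // AE on: nothing to warn about. AE off with delta pass off is the
        // refused configuration, handled by the safety gate, not here.
        None
    }
}

/// Hypothetical call site: emit the warning, if any, and report whether
/// one was logged.
pub fn log_migration_start(anti_entropy_enabled: bool, delta_pass_enabled: bool) -> bool {
    match migration_warning_if_ae_disabled(anti_entropy_enabled, delta_pass_enabled) {
        Some(msg) => {
            eprintln!("WARN: {}", msg);
            true
        }
        None => false,
    }
}
```

Separating the decision from the logging keeps the condition unit-testable without capturing log output.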