miroir/docs/trade-offs.md
jedarden fec5aa5e74 P12.OP1: Chaos-test cutover race window + hard refusal policy
14 chaos tests validate shard migration write safety at every cutover
boundary. Key findings:

- AE on + delta pass: 0/1M loss (production default)
- AE off + delta pass: 0/50K loss (delta pass is sufficient alone)
- AE off + delta skipped: ~2% loss → hard refusal at config validation
- 3-node cluster cutover: 0 loss with delta pass

Hard-coded policy: MigrationCoordinator refuses migrations when both
anti-entropy is disabled and delta pass is skipped. Warning logged when
AE is disabled but delta pass remains active.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:00:21 -04:00


Miroir Trade-Offs and Design Decisions

Shard Migration Write Safety (Plan §15 OP#1)

Problem

During node addition, documents written at the exact cutover boundary can be lost if they succeed on the OLD node but fail on the NEW node. The dangerous window is between "stop dual-write" and "delete old shard data."

Solution: Quiesce-Then-Verify Cutover

The migration state machine (migration.rs) uses a five-phase cutover:

  1. Stop dual-write — no new writes go to either node for affected shards
  2. Drain — wait for all in-flight writes to complete on both OLD and NEW
  3. Delta pass — re-read affected shards from OLD, write any docs missing on NEW
  4. Activate — routing switches to NEW-only
  5. Cleanup — delete migrated shard data from OLD
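
The phases above can be sketched as a strictly ordered state machine plus a set-difference repair step. This is a minimal illustration, not the code in migration.rs: the names CutoverPhase, Shard, and delta_pass, and the in-memory map standing in for a shard, are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical phase enum mirroring the five steps listed above,
/// bracketed by the initial dual-write state and a terminal state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CutoverPhase {
    DualWrite,
    StopDualWrite,
    Drain,
    DeltaPass,
    Activate,
    Cleanup,
    Done,
}

impl CutoverPhase {
    /// Advance one step. Phases cannot be reordered, so Cleanup
    /// (deleting OLD data) is only reachable after DeltaPass and Activate.
    pub fn next(self) -> CutoverPhase {
        match self {
            CutoverPhase::DualWrite => CutoverPhase::StopDualWrite,
            CutoverPhase::StopDualWrite => CutoverPhase::Drain,
            CutoverPhase::Drain => CutoverPhase::DeltaPass,
            CutoverPhase::DeltaPass => CutoverPhase::Activate,
            CutoverPhase::Activate => CutoverPhase::Cleanup,
            CutoverPhase::Cleanup | CutoverPhase::Done => CutoverPhase::Done,
        }
    }
}

/// Hypothetical in-memory stand-in for a shard: document id -> body.
pub type Shard = BTreeMap<String, String>;

/// Delta pass as described above: re-read OLD and write any document
/// that is missing on NEW. Returns the number of documents repaired.
/// This sketch only checks presence; it does not compare document bodies.
pub fn delta_pass(old: &Shard, new: &mut Shard) -> usize {
    let mut repaired = 0;
    for (id, body) in old {
        if !new.contains_key(id) {
            new.insert(id.clone(), body.clone());
            repaired += 1;
        }
    }
    repaired
}
```

Because writes are quiesced and drained before the delta pass runs, OLD is not changing while it is re-read, which is why a single pass over OLD is enough in this model.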

Empirical Results

| Configuration | Writes | Loss Rate | Verdict |
|---|---|---|---|
| AE on + delta pass on | 1M | 0/1M (0.000%) | PASS: production default |
| AE off + delta pass on | 50K | 0/50K (0.000%) | PASS: delta pass is sufficient alone |
| AE on + delta pass skipped | 200 | measurable | Acceptable: AE repairs on next pass |
| AE off + delta pass skipped | 100K | ~2.0% | REFUSED: blocked at config validation |
| Tight-loop boundary (AE+delta) | 1350+ | 0 | PASS: writes at every transition boundary |
| High-volume boundary (AE+delta) | 100K | 0/100K | PASS |
| 3-node cluster (AE+delta) | 2600+ | 0 | PASS: multi-owner cutover |
| 3-node cluster (AE off+delta) | 5000 | 0 | PASS: delta pass alone sufficient |

Decision: Hard Refusal of Unsafe Configuration

MigrationCoordinator::validate_safety() refuses to start a migration when both anti-entropy is disabled AND the delta pass is skipped. This is a hard-coded policy — not a warning — because:

  • The measured loss rate without either safety net is ~2% (deterministic, proportional to the write-failure rate during dual-write)
  • Anti-entropy runs every 6 hours by default; disabling it removes the reconciliation safety net
  • Skipping the delta pass removes the immediate repair mechanism
  • With both off, there is no recovery path for boundary documents

The validate_migration_safety() function in anti_entropy.rs provides the same gate at the cross-module level, ensuring no code path can bypass this check.
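
The refusal rule can be sketched as a small validation function. The real signatures of MigrationCoordinator::validate_safety() and validate_migration_safety() are not shown in this document, so SafetyConfig, SafetyError, and validate_safety_sketch below are hypothetical names used only to illustrate the policy.

```rust
use std::fmt;

/// Hypothetical settings relevant to the safety gate.
/// Field names are assumptions, not the project's actual config keys.
#[derive(Debug, Clone, Copy)]
pub struct SafetyConfig {
    pub anti_entropy_enabled: bool,
    pub skip_delta_pass: bool,
}

/// Hypothetical error returned when a migration is refused.
#[derive(Debug, PartialEq, Eq)]
pub enum SafetyError {
    NoRecoveryPath,
}

impl fmt::Display for SafetyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SafetyError::NoRecoveryPath => write!(
                f,
                "migration refused: anti-entropy is disabled and the delta pass is skipped"
            ),
        }
    }
}

impl std::error::Error for SafetyError {}

/// Refuse only the one combination with no recovery path:
/// anti-entropy disabled AND delta pass skipped.
pub fn validate_safety_sketch(cfg: &SafetyConfig) -> Result<(), SafetyError> {
    if !cfg.anti_entropy_enabled && cfg.skip_delta_pass {
        return Err(SafetyError::NoRecoveryPath);
    }
    Ok(())
}
```

Returning an error rather than logging makes the refusal impossible to ignore: a caller has to handle the Err before a migration can start.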

Anti-Entropy: Required or Optional?

Anti-entropy is optional but recommended. In the empirical results above, the delta pass alone achieved zero-loss cutover (0/50K, and 0/5000 on a 3-node cluster). Anti-entropy exists as a defense-in-depth measure:

  • Catches any bugs in the delta pass implementation
  • Repairs drift from non-migration causes (network partitions, disk errors)
  • Runs on a 6-hour schedule (configurable)

Operators MAY disable anti-entropy if they accept the risk of gradual replica drift. They MAY NOT disable anti-entropy and skip the delta pass at the same time.
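
The configurable schedule could be modeled as a settings struct with defaults. Only the 6-hour interval and the AE-on production default come from this document; the struct name AntiEntropyConfig and its fields are assumptions for illustration.

```rust
use std::time::Duration;

/// Hypothetical anti-entropy settings; names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AntiEntropyConfig {
    /// Whether the periodic reconciliation pass runs at all.
    pub enabled: bool,
    /// Time between reconciliation passes.
    pub interval: Duration,
}

impl Default for AntiEntropyConfig {
    /// Defaults taken from the text: enabled, running every 6 hours.
    fn default() -> Self {
        AntiEntropyConfig {
            enabled: true,
            interval: Duration::from_secs(6 * 60 * 60),
        }
    }
}
```

An operator who accepts the drift risk would set enabled to false here, leaving the delta pass as the only protection during migrations.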

Warning When AE Is Disabled During Migration

When anti-entropy is disabled and a migration begins (with delta pass enabled), the system logs a warning via migration_warning_if_ae_disabled(). This informs operators that the delta pass is the sole safety mechanism and any bugs in it could lead to data loss.
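
A minimal sketch of the warning condition follows. The signature of migration_warning_if_ae_disabled() is not given in this document, so the function name ae_disabled_warning, its parameters, and the message text are hypothetical; the sketch returns the message and leaves logging to the caller.

```rust
/// Returns a warning message when anti-entropy is disabled but the delta
/// pass is still active, the one permitted configuration with a single
/// safety mechanism. Returns None when no warning applies.
/// (AE off with the delta pass skipped is refused outright elsewhere,
/// so it is not a warning case.)
pub fn ae_disabled_warning(
    anti_entropy_enabled: bool,
    skip_delta_pass: bool,
) -> Option<&'static str> {
    if !anti_entropy_enabled && !skip_delta_pass {
        Some(
            "anti-entropy is disabled: the delta pass is the sole safety \
             mechanism for this migration",
        )
    } else {
        None
    }
}
```

A caller might emit the message with whatever logging facility the project uses, for example `if let Some(msg) = ae_disabled_warning(false, false) { eprintln!("WARN: {msg}"); }`.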