Add comprehensive documentation comparing the two scaling dimensions: - Core distinction: N-change is lightweight (rendezvous hash), S-change is heavy (dual-hash dual-write) - Node scaling moves only ~1/N of documents; resharding affects 100% with 2× transient amplification - Decision matrix for operators to choose the right approach - Capacity planning guidance with S = max_nodes_per_group_ever × 8 formula - References to existing benchmarks and CLI schedule guidance This completes the remaining work for OP#3 by documenting the trade-offs so operators understand when to use resharding vs adding nodes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: bf-jap1
162 lines
7.7 KiB
Markdown
162 lines
7.7 KiB
Markdown
# Miroir Trade-Offs and Design Decisions
|
||
|
||
## Resharding (S-Change) vs Node Scaling (N-Change) (Plan §15 OP#3)
|
||
|
||
### Core Distinction
|
||
|
||
Miroir supports two orthogonal scaling dimensions with very different cost profiles:
|
||
|
||
| Dimension | Parameter | What Changes | Routing Impact |
|
||
|-----------|-----------|--------------|----------------|
|
||
| **Node scaling** | N (node count) | Which nodes host each shard | Rendezvous hash reassigns shards to new/remaining nodes |
|
||
| **Resharding** | S (shard count) | How the hash space is divided | Every document's shard assignment changes: `hash(pk) % S` |
|
||
|
||
**Key insight:** Node scaling is lightweight; resharding is heavy.
|
||
|
||
### Node Scaling (N-Change)
|
||
|
||
**When to use:** Add or remove nodes to adjust capacity, throughput, or fault tolerance.
|
||
|
||
**Cost:** Minimal — only the affected fraction of documents moves.
|
||
|
||
**Mechanism:**
|
||
- Rendezvous hashing assigns shards to nodes based on a stable ordering
|
||
- Adding a node: only `~1/(N_new)` of documents move (those whose top-ranked node changes)
|
||
- Removing a node: only that node's documents migrate to surviving nodes
|
||
- Migration uses dual-write with delta-pass cutover (see migration write safety below)
|
||
|
||
**Example:** Adding a 5th node to a 4-node group moves ~20% of documents (1/5). The other 80% stay untouched.
|
||
|
||
**Constraints:** None — fully elastic, can be done anytime.
|
||
|
||
**When this is insufficient:** When individual shards are too large for a single node, you need more shards (S-change), not more nodes.
|
||
|
||
### Resharding (S-Change)
|
||
|
||
**When to use:** Increase the logical shard count when individual shards are too large.
|
||
|
||
**Cost:** High — transient 2× storage amplification and 2× write amplification during dual-write phase.
|
||
|
||
**Mechanism:** Six-phase shadow-index operation (§13.1):
|
||
1. Shadow create: new index with new S
|
||
2. Dual-hash dual-write: every write routes to both old and new S
|
||
3. Backfill: stream all documents to shadow index
|
||
4. Verify: cross-index PK-set comparison
|
||
5. Alias swap: atomic cutover
|
||
6. Cleanup: retain old index for rollback, then delete
|
||
|
||
**Example:** Resharding from S=64 to S=128 temporarily doubles storage from 200 GB to 400 GB (× RG), and write throughput from 4,000 writes/sec to 8,000 writes/sec (× RF × RG).
|
||
|
||
**Constraints:** Requires significant headroom and off-peak scheduling. See `docs/benchmarks/resharding-load.md` for empirical data.
|
||
|
||
**When to avoid:** If you can instead add nodes to your current shard layout. Prefer N-change over S-change.
|
||
|
||
### Decision Matrix
|
||
|
||
| Symptom | Solution | Why |
|
||
|---------|----------|-----|
|
||
| Cluster CPU/memory saturated | Add nodes (N-change) | Spreads load across more machines |
|
||
| Shard too large for one node | Reshard (S-change) | Need more hash buckets to split large shards |
|
||
| Need more write throughput | Add replica groups (RG-scale) | Groups are independent; more groups = parallel writes |
|
||
| Need more read throughput | Add nodes or groups | Both help; groups add fault tolerance too |
|
||
| Individual shard > node disk | Reshard (S-change) | Only S-change reduces per-shard size |
|
||
|
||
### Capacity Planning Guidance
|
||
|
||
**Choosing S at index creation:**
|
||
```
|
||
S = max_nodes_per_group_ever × 8
|
||
```
|
||
|
||
This formula ensures you never need to reshard as your cluster grows within a group. Each group's rendezvous assignment is scoped to its own node list, so adding groups doesn't consume S headroom.
|
||
|
||
**Example:** A cluster starting at 2 nodes per group that might grow to 60 nodes per group should use S ≥ 480 (or 512 for a round power of two). Node fleet elasticity is unlimited within that S.
|
||
|
||
**Why the "× 8" factor?**
|
||
- Provides enough shard granularity to evenly distribute documents as nodes are added
|
||
- Avoids hot spots where some nodes host many large shards and others host few small ones
|
||
- Each node typically hosts 8-16 shards, which balances per-shard metadata overhead with distribution quality
|
||
|
||
**If you must reshard:**
|
||
1. Schedule during off-peak hours (use `miroir-ctl reshard start --schedule-window off-peak`)
|
||
2. Set backfill throttle conservatively: aim for peak total writes ≤ 3× normal
|
||
3. Ensure 2× storage headroom before starting
|
||
4. Monitor progress via `miroir-ctl reshard status`
|
||
|
||
### Empirical Validation
|
||
|
||
See `docs/benchmarks/resharding-load.md` for comprehensive benchmark results:
|
||
- Storage amplification: exactly 2.0× (all scenarios)
|
||
- Dual-write amplification: exactly 2.0× (all scenarios)
|
||
- Peak write amplification: varies by corpus and throttle (can exceed 500× for low-write corpora with aggressive backfill)
|
||
|
||
**Bottom line:** The "choose S generously" guidance remains the recommended default because online resharding is expensive. Treat §13.1 as a remediation path, not a license to under-provision.
|
||
|
||
---
|
||
|
||
## Shard Migration Write Safety (Plan §15 OP#1)
|
||
|
||
### Problem
|
||
|
||
During node addition, documents written at the exact cutover boundary can be
|
||
lost if they succeed on the OLD node but fail on the NEW node. The dangerous
|
||
window is between "stop dual-write" and "delete old shard data."
|
||
|
||
### Solution: Quiesce-Then-Verify Cutover
|
||
|
||
The migration state machine (`migration.rs`) uses a multi-phase cutover:
|
||
|
||
1. **Stop dual-write** — no new writes go to either node for affected shards
|
||
2. **Drain** — wait for all in-flight writes to complete on both OLD and NEW
|
||
3. **Delta pass** — re-read affected shards from OLD, write any docs missing on NEW
|
||
4. **Activate** — routing switches to NEW-only
|
||
5. **Cleanup** — delete migrated shard data from OLD
|
||
|
||
### Empirical Results
|
||
|
||
| Configuration | Writes | Loss Rate | Verdict |
|
||
|---|---|---|---|
|
||
| AE on + delta pass on | 1M | 0/1M (0.000%) | **PASS** — production default |
|
||
| AE off + delta pass on | 50K | 0/50K (0.000%) | PASS — delta pass is sufficient alone |
|
||
| AE on + delta pass skipped | 200 | measurable | Acceptable — AE repairs on next pass |
|
||
| AE off + delta pass skipped | 100K | ~2.0% | **REFUSED** — blocked at config validation |
|
||
| Tight-loop boundary (AE+delta) | 1350+ | 0 | PASS — writes at every transition boundary |
|
||
| High-volume boundary (AE+delta) | 100K | 0/100K | PASS |
|
||
| 3-node cluster (AE+delta) | 2600+ | 0 | PASS — multi-owner cutover |
|
||
| 3-node cluster (AE off+delta) | 5000 | 0 | PASS — delta pass alone sufficient |
|
||
|
||
### Decision: Hard Refusal of Unsafe Configuration
|
||
|
||
`MigrationCoordinator::validate_safety()` refuses to start a migration when
|
||
both anti-entropy is disabled AND the delta pass is skipped. This is a
|
||
**hard-coded policy** — not a warning — because:
|
||
|
||
- The measured loss rate without either safety net is ~2% (deterministic,
|
||
proportional to the write-failure rate during dual-write)
|
||
- Anti-entropy runs every 6 hours by default; disabling it removes the
|
||
reconciliation safety net
|
||
- Skipping the delta pass removes the immediate repair mechanism
|
||
- Both off together provides **zero recovery path** for boundary documents
|
||
|
||
The `validate_migration_safety()` function in `anti_entropy.rs` provides the
|
||
same gate at the cross-module level, ensuring no code path can bypass this
|
||
check.
|
||
|
||
### Anti-Entropy: Required or Optional?
|
||
|
||
**Anti-entropy is optional but recommended.** The delta pass alone provides
|
||
0-loss cutover. Anti-entropy exists as a defense-in-depth measure:
|
||
|
||
- Catches any bugs in the delta pass implementation
|
||
- Repairs drift from non-migration causes (network partitions, disk errors)
|
||
- Runs on a 6-hour schedule (configurable)
|
||
|
||
Operators MAY disable anti-entropy if they accept the risk of gradual replica
|
||
drift. They MAY NOT skip both anti-entropy and the delta pass simultaneously.
|
||
|
||
### Warning When AE Is Disabled During Migration
|
||
|
||
When anti-entropy is disabled and a migration begins (with delta pass enabled),
|
||
the system logs a warning via `migration_warning_if_ae_disabled()`. This
|
||
informs operators that the delta pass is the sole safety mechanism and any
|
||
bugs in it could lead to data loss.
|