Extended chaos test coverage from 14 to 19 tests and created
comprehensive documentation for safe shard migrations.
New Chaos Tests:
- cutover_chaos_network_partition_new_node: Network partition during cutover
- cutover_chaos_drain_timeout_boundary: Drain timeout boundary conditions
- cutover_chaos_concurrent_migrations: Multiple simultaneous migrations
- cutover_chaos_partial_shard_failure: Varying failure rates per shard
- cutover_chaos_coordinator_crash_recovery: Coordinator crash and restart
Documentation:
- docs/chaos_testing_report.md: Test coverage, findings, recommendations
- docs/migration_runbook.md: Operational procedures, rollback, troubleshooting
- notes/bf-4d9a.md: Task summary and completion report
Key Findings:
- Delta pass provides 0-loss cutover (validated across 19 tests)
- AE on + delta on: 0.000% loss (recommended)
- AE off + delta on: 0.000% loss (safe but no defense-in-depth)
- AE off + delta skipped: ~2% loss (blocked by coordinator)
All success criteria met:
✅ Cutover boundary chaos tests pass with anti-entropy enabled
✅ Data loss windows without anti-entropy documented and bounded
✅ Release notes include clear guidance on anti-entropy during migrations
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add comprehensive documentation comparing the two scaling dimensions:
- Core distinction: N-change is lightweight (rendezvous hash), S-change is heavy (dual-hash dual-write)
- Node scaling moves only ~1/N of documents; resharding affects 100% with 2× transient amplification
- Decision matrix for operators to choose the right approach
- Capacity planning guidance with S = max_nodes_per_group_ever × 8 formula
- References to existing benchmarks and CLI schedule guidance
This completes the remaining work for OP#3 by documenting the trade-offs
so operators understand when to use resharding vs adding nodes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: bf-jap1
- Add aarch64-unknown-linux-musl target to rust-toolchain.toml for cross-compilation
- Document ARM64 build instructions, prerequisites, and architecture-specific considerations
- No architecture-specific code paths exist; all dependencies are architecture-agnostic
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Correct openraft version from 0.9.22 to 0.9.20 (latest stable per GitHub releases).
Update benchmark measurements from fresh re-run (50K ops). Suppress dead_code warnings
in benchmark module (functions only called from #[test]).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Benchmark numbers stable: state machine apply ~1.0x direct HashMap
overhead, both sub-microsecond. Confirms prior measurements.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fresh benchmark confirms state machine apply adds ~1.0-1.1x overhead
vs direct HashMap — both sub-microsecond. Real Raft cost remains
network + fsync (2-5ms vs Redis 0.3-0.8ms). Decision unchanged:
revisit before v2.0, do not ship in v0.x or v1.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add rrqlite/openraft+SQLite reference project, correct raft-rs status
to maintenance mode, note openraft 0.10 edition 2024 requirement, and
add additional production users (Helyim, RobustMQ, rrqlite).
Decision unchanged: do not ship Raft in v0.x or v1.0, revisit before v2.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add resharding load simulation model with real router hash functions
- Benchmark confirms storage amplification is exactly 2.0× and dual-write
amplification is exactly 2.0× across all test matrix scenarios (1KB/10GB,
10KB/100GB, 1MB/1TB), with hash distribution CV < 5% in all cases
- CLI window guard: resharding.allowed_windows config restricts resharding
to named time windows (e.g. "02:00-06:00 UTC"), CLI refuses outside
windows without --force
- Integration tests confirm rejection outside window, --force override,
no-restriction mode, and disabled config handling
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
14 chaos tests validate shard migration write safety at every cutover
boundary. Key findings:
- AE on + delta pass: 0/1M loss (production default)
- AE off + delta pass: 0/50K loss (delta pass is sufficient alone)
- AE off + delta skipped: ~2% loss → hard refusal at config validation
- 3-node cluster cutover: 0 loss with delta pass
Hard-coded policy: MigrationCoordinator refuses migrations when both
anti-entropy is disabled and delta pass is skipped. Warning logged when
AE is disabled but delta pass remains active.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds 14 chaos tests validating zero-data-loss at the migration cutover
boundary under all AE/delta-pass configurations. Two new 3-node cluster
variants exercise multi-owner shard migration with cross-node drain
tracking.
Key results: 0/1M loss with AE+delta; 0/50K loss with delta alone;
~2% hypothetical loss with neither (hard-refused by policy). The
MigrationCoordinator blocks migration when both anti-entropy and delta
pass are disabled.
Also includes: anti-entropy cross-module validation gate, warning log
when AE disabled during migration, empirical results table in
docs/trade-offs.md, and plan §15 OP#1 status update to verified.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full serde-derived struct tree covering every block in plan §4 (MiroirConfig,
NodeConfig, TaskStoreConfig, AdminConfig, HealthConfig, ScatterConfig,
RebalancerConfig, ServerConfig, ConnectionPoolConfig, TaskRegistryConfig) and
all 21 §13 advanced-capability sub-structs (ReshardingConfig through
SearchUiConfig with nested auth/rate-limit/CSP/analytics structs), plus §14
horizontal-scaling structs (PeerDiscoveryConfig, LeaderElectionConfig, HpaConfig).
Includes:
- Layered loading via config crate: built-in defaults → file → env overrides
- Config::validate() with 14 cross-field rules (HA requires redis, scoped_key
timing inversion, node group bounds, tenant affinity range checks, etc.)
- 10 unit tests: round-trip YAML, full plan example, minimal YAML defaults,
and validation rejection cases
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enumerates dump variants that streaming mode can/can't handle.
- Added docs/dump-import/compatibility-matrix.md with comprehensive
compatibility matrix covering Meilisearch versions, dump variants,
and workarounds
- Added docs/dump-import/README.md as entry point
- Updated miroir-ctl dump command to reference matrix with helpful
error messages for unimplemented subcommands (import, export, analyze)
Addresses Open Problem #5: identifies what "can't reconstruct" means
in concrete terms, giving operators clear guidance on when broadcast
fallback is needed and what alternatives exist.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Survey openraft, raft-rs, and async-raft crates. Design a Raft-backed
TaskStore prototype using openraft with SQLite state machine. Analytical
benchmark against Redis across latency, throughput, memory, and ops
complexity. Decision: revisit before v2.0, do not ship in v0.x/v1.0 —
Raft fails the decision gate (worse on write latency and correctness
maturity despite removing the Redis dependency).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- LICENSE: MIT (per plan §12)
- CHANGELOG.md: Keep a Changelog 1.1.0 skeleton with [Unreleased]
and [0.1.0] sections matching the awk extractor from plan §7
- .gitignore: Rust target/, editor junk; Cargo.lock kept in VCS
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>