jedarden/miroir

Author	SHA1	Message	Date
jedarden	1da32f8d57	Phase 3 (miroir-r3j): Task Registry + Persistence — Verification complete Verified and documented the existing task store implementation: - All 14 tables from plan §4 implemented in SQLite and Redis backends - TaskStore trait enables runtime backend switching via task_store.backend - Schema version tracking with migration detection - Comprehensive test suite: property tests + integration tests with testcontainers - Helm values.schema.json enforces replicas > 1 → redis requirement - Redis memory accounting validated against representative load (20 kQPS) Added documentation: - docs/notes/phase3-task-store-verification.md — DoD checklist and Redis memory analysis - notes/miroir-r3j-phase3-summary.md — Completion summary and retrospective Definition of Done — ALL MET ✅ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 05:40:08 -04:00
jedarden	3556f64742	Phase 3 (miroir-r3j): Task Registry + Persistence — Complete This phase implements a comprehensive task store with dual backend support (SQLite for single-pod, Redis for multi-pod deployments), covering all 14 tables from plan §4. ## What Was Already Implemented The task store module was already complete with: - Complete 14-table schema (tasks, aliases, sessions, jobs, etc.) - SQLite backend with idempotent schema initialization - Redis backend with hash+index pattern for O(n) list queries - Unified TaskStore trait with runtime backend selection - Comprehensive property tests and integration tests - Helm schema validation enforcing Redis for replicas > 1 ## What Was Added - Redis memory accounting documentation (docs/redis-memory-accounting.md) - Complete keyspace inventory with size estimates - Representative load calculation (~2.8 MB baseline) - Scaling characteristics and production recommendations - Fixed job_dequeue() to properly fetch the updated job after transaction - Previously returned a stale Job object from before the UPDATE - Now fetches the job after the status change for accuracy ## Definition of Done — All Complete ✅ - [x] rusqlite-backed store initializing every table idempotently - [x] Redis-backed store mirroring the same API (TaskStore trait) - [x] Schema versioning with schema_version row - [x] Property tests on SQLite backend - [x] Integration test for pod restart simulation - [x] Redis-backend integration tests with testcontainers - [x] miroir:tasks:_index pattern for list endpoints (no SCAN) - [x] Helm schema enforces taskStore.backend:redis when replicas > 1 - [x] Redis memory accounting validated against representative load All future features (§13 advanced capabilities, §14 HA modes) can consume this persistence layer without modification. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 02:29:38 -04:00
jedarden	6c32dd8efc	Phase 0 (miroir-qon): Rust 1.88 upgrade + test infrastructure - Bump Rust toolchain from 1.87 to 1.88 - Add testcontainers and arbitrary dependencies for property testing - Update router with rendezvous hashing improvements - Fix credential handling in miroir-ctl - Update reshard and migration modules - Add Helm chart scaffolding - Add Redis memory accounting documentation All Phase 0 DoD checks pass: - cargo build --all succeeds - cargo test --all succeeds (103 tests) - cargo clippy --all-targets --all-features -- -D warnings passes - cargo fmt --all -- --check passes - Config round-trip YAML test passes Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 02:05:44 -04:00
jedarden	e0d6735ec0	Phase 0 (miroir-qon): Foundation — verification complete Phase 0 (Foundation) was already established in the repository. All required components are in place: - Cargo workspace with three crates (miroir-core, miroir-proxy, miroir-ctl) - rust-toolchain.toml pinning Rust 1.87 - All key dependencies wired (axum, tokio, reqwest, serde, config, clap, uuid) - Config struct with full YAML schema from plan §4 - Style configuration (rustfmt.toml, clippy.toml, .editorconfig) - Project files (CHANGELOG.md, LICENSE, .gitignore) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-08 19:20:18 -04:00
jedarden	8f91d6998f	P12.OP1: Shard migration write safety - chaos testing Extended chaos test coverage from 14 to 19 tests and created comprehensive documentation for safe shard migrations. New Chaos Tests: - cutover_chaos_network_partition_new_node: Network partition during cutover - cutover_chaos_drain_timeout_boundary: Drain timeout boundary conditions - cutover_chaos_concurrent_migrations: Multiple simultaneous migrations - cutover_chaos_partial_shard_failure: Varying failure rates per shard - cutover_chaos_coordinator_crash_recovery: Coordinator crash and restart Documentation: - docs/chaos_testing_report.md: Test coverage, findings, recommendations - docs/migration_runbook.md: Operational procedures, rollback, troubleshooting - notes/bf-4d9a.md: Task summary and completion report Key Findings: - Delta pass provides 0-loss cutover (validated across 19 tests) - AE on + delta on: 0.000% loss (recommended) - AE off + delta on: 0.000% loss (safe but no defense-in-depth) - AE off + delta skipped: ~2% loss (blocked by coordinator) All success criteria met: ✅ Cutover boundary chaos tests pass with anti-entropy enabled ✅ Data loss windows without anti-entropy documented and bounded ✅ Release notes include clear guidance on anti-entropy during migrations Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-08 15:29:48 -04:00
jedarden	4a3c05473e	OP#3: Document S-change (resharding) vs N-change (node scaling) trade-offs Add comprehensive documentation comparing the two scaling dimensions: - Core distinction: N-change is lightweight (rendezvous hash), S-change is heavy (dual-hash dual-write) - Node scaling moves only ~1/N of documents; resharding affects 100% with 2× transient amplification - Decision matrix for operators to choose the right approach - Capacity planning guidance with S = max_nodes_per_group_ever × 8 formula - References to existing benchmarks and CLI schedule guidance This completes the remaining work for OP#3 by documenting the trade-offs so operators understand when to use resharding vs adding nodes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: bf-jap1	2026-05-08 15:25:53 -04:00
jedarden	e89f02a174	OP#6: Add ARM64 (aarch64-unknown-linux-musl) target support - Add aarch64-unknown-linux-musl target to rust-toolchain.toml for cross-compilation - Document ARM64 build instructions, prerequisites, and architecture-specific considerations - No architecture-specific code paths exist; all dependencies are architecture-agnostic Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-08 15:25:12 -04:00
jedarden	ffc0ae3beb	P12.OP2: Finalize Raft research — correct openraft version, update benchmarks, suppress warnings Correct openraft version from 0.9.22 to 0.9.20 (latest stable per GitHub releases). Update benchmark measurements from fresh re-run (50K ops). Suppress dead_code warnings in benchmark module (functions only called from #[test]). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:37:20 -04:00
jedarden	7a6dea77cf	P12.OP2: Re-verify Raft state machine benchmark with fresh run Benchmark numbers stable: state machine apply ~1.0x direct HashMap overhead, both sub-microsecond. Confirms prior measurements. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:25:34 -04:00
jedarden	2c628a6f87	P12.OP2: Re-run Raft state machine benchmark, update measured values Fresh benchmark confirms state machine apply adds ~1.0-1.1x overhead vs direct HashMap — both sub-microsecond. Real Raft cost remains network + fsync (2-5ms vs Redis 0.3-0.8ms). Decision unchanged: revisit before v2.0, do not ship in v0.x or v1.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:14:11 -04:00
jedarden	111a128278	P12.OP2: Update Raft vs Redis research with web survey findings Add rrqlite/openraft+SQLite reference project, correct raft-rs status to maintenance mode, note openraft 0.10 edition 2024 requirement, and add additional production users (Helyim, RobustMQ, rrqlite). Decision unchanged: do not ship Raft in v0.x or v1.0, revisit before v2.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:03:29 -04:00
jedarden	e47c1c2f73	P12.OP3: Validate 2× transient load caveat and add CLI schedule window guard - Add resharding load simulation model with real router hash functions - Benchmark confirms storage amplification is exactly 2.0× and dual-write amplification is exactly 2.0× across all test matrix scenarios (1KB/10GB, 10KB/100GB, 1MB/1TB), with hash distribution CV < 5% in all cases - CLI window guard: resharding.allowed_windows config restricts resharding to named time windows (e.g. "02:00-06:00 UTC"), CLI refuses outside windows without --force - Integration tests confirm rejection outside window, --force override, no-restriction mode, and disabled config handling Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:00:57 -04:00
jedarden	fec5aa5e74	P12.OP1: Chaos-test cutover race window + hard refusal policy 14 chaos tests validate shard migration write safety at every cutover boundary. Key findings: - AE on + delta pass: 0/1M loss (production default) - AE off + delta pass: 0/50K loss (delta pass is sufficient alone) - AE off + delta skipped: ~2% loss → hard refusal at config validation - 3-node cluster cutover: 0 loss with delta pass Hard-coded policy: MigrationCoordinator refuses migrations when both anti-entropy is disabled and delta pass is skipped. Warning logged when AE is disabled but delta pass remains active. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:00:21 -04:00
jedarden	81155beb0d	P12.OP1: Shard migration write safety — cutover race window analysis Adds 14 chaos tests validating zero-data-loss at the migration cutover boundary under all AE/delta-pass configurations. Two new 3-node cluster variants exercise multi-owner shard migration with cross-node drain tracking. Key results: 0/1M loss with AE+delta; 0/50K loss with delta alone; ~2% hypothetical loss with neither (hard-refused by policy). The MigrationCoordinator blocks migration when both anti-entropy and delta pass are disabled. Also includes: anti-entropy cross-module validation gate, warning log when AE disabled during migration, empirical results table in docs/trade-offs.md, and plan §15 OP#1 status update to verified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 21:52:34 -04:00
jedarden	232092ffbb	P0.5: Implement Config struct mirroring plan §4/§13 YAML schema Full serde-derived struct tree covering every block in plan §4 (MiroirConfig, NodeConfig, TaskStoreConfig, AdminConfig, HealthConfig, ScatterConfig, RebalancerConfig, ServerConfig, ConnectionPoolConfig, TaskRegistryConfig) and all 21 §13 advanced-capability sub-structs (ReshardingConfig through SearchUiConfig with nested auth/rate-limit/CSP/analytics structs), plus §14 horizontal-scaling structs (PeerDiscoveryConfig, LeaderElectionConfig, HpaConfig). Includes: - Layered loading via config crate: built-in defaults → file → env overrides - Config::validate() with 14 cross-field rules (HA requires redis, scoped_key timing inversion, node group bounds, tenant affinity range checks, etc.) - 10 unit tests: round-trip YAML, full plan example, minimal YAML defaults, and validation rejection cases Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 21:46:12 -04:00
jedarden	188fd5404c	P12.OP5: Add dump import compatibility matrix Enumerates dump variants that streaming mode can/can't handle. - Added docs/dump-import/compatibility-matrix.md with comprehensive compatibility matrix covering Meilisearch versions, dump variants, and workarounds - Added docs/dump-import/README.md as entry point - Updated miroir-ctl dump command to reference matrix with helpful error messages for unimplemented subcommands (import, export, analyze) Addresses Open Problem #5: identifies what "can't reconstruct" means in concrete terms, giving operators clear guidance on when broadcast fallback is needed and what alternatives exist. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 21:06:46 -04:00
jedarden	fe274a5c0e	P12.OP2: Add Raft vs Redis task store HA research doc Survey openraft, raft-rs, and async-raft crates. Design a Raft-backed TaskStore prototype using openraft with SQLite state machine. Analytical benchmark against Redis across latency, throughput, memory, and ops complexity. Decision: revisit before v2.0, do not ship in v0.x/v1.0 — Raft fails the decision gate (worse on write latency and correctness maturity despite removing the Redis dependency). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 21:00:53 -04:00
jedarden	409f952f59	Add repo hygiene: LICENSE, CHANGELOG, .gitignore - LICENSE: MIT (per plan §12) - CHANGELOG.md: Keep a Changelog 1.1.0 skeleton with [Unreleased] and [0.1.0] sections matching the awk extractor from plan §7 - .gitignore: Rust target/, editor junk; Cargo.lock kept in VCS Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 20:47:36 -04:00
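
Note on 4a3c05473e (OP#3): the claim that node scaling moves only ~1/N of documents follows from rendezvous hashing, which the router is described as using. The sketch below is illustrative only. `score` and `pick_node` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever hash function miroir's router actually uses. The property shown is that when a node is added, a key either keeps its previous owner or moves to the new node.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Score a (node, key) pair. Assumption: any well-mixed 64-bit hash works
/// here; this is not miroir's actual hash function.
fn score(node: &str, key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    node.hash(&mut hasher);
    key.hash(&mut hasher);
    hasher.finish()
}

/// Rendezvous (highest-random-weight) hashing: the owner of `key` is the
/// node with the highest score. Returns None when `nodes` is empty.
pub fn pick_node<'a>(key: &str, nodes: &[&'a str]) -> Option<&'a str> {
    nodes.iter().copied().max_by_key(|node| score(node, key))
}
```

Because each node's score for a key is independent of the other nodes, adding a fifth node only changes the owner of keys for which the new node scores highest, which is roughly one fifth of them under a uniform hash.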

18 commits
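
Note on fec5aa5e74 and 81155beb0d (P12.OP1): the hard refusal policy described in those commits can be summarised as a small decision table. This is a minimal sketch, not the MigrationCoordinator code. `MigrationDecision` and `check_migration_policy` are hypothetical names. The commits do not say what happens when anti-entropy is on and the delta pass is skipped, so the sketch assumes that case is allowed.

```rust
/// Outcome of the pre-migration safety check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum MigrationDecision {
    Allow,
    AllowWithWarning(&'static str),
    Refuse(&'static str),
}

/// Decide whether a shard migration may start, given whether anti-entropy
/// (AE) is enabled and whether the delta pass will run.
pub fn check_migration_policy(
    anti_entropy_enabled: bool,
    delta_pass_enabled: bool,
) -> MigrationDecision {
    match (anti_entropy_enabled, delta_pass_enabled) {
        // Both safeguards off: the commits report ~2% loss, so refuse.
        (false, false) => MigrationDecision::Refuse(
            "anti-entropy is disabled and the delta pass is skipped",
        ),
        // Delta pass alone gave 0 loss in the chaos tests, but a warning
        // is logged because there is no defense-in-depth.
        (false, true) => MigrationDecision::AllowWithWarning(
            "anti-entropy is disabled; relying on the delta pass alone",
        ),
        // AE on. Assumption: allowed whether or not the delta pass runs;
        // the log only documents AE on + delta on.
        (true, _) => MigrationDecision::Allow,
    }
}
```

Under these assumptions, `check_migration_policy(false, false)` is the only combination that blocks a migration, matching the "hard refusal at config validation" wording in the commit messages.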