Commit graph

33 commits

Author SHA1 Message Date
jedarden
0de5f01d32 P2.2: Pluggable MergeStrategy trait + RRF scoring + full benchmark re-run
- Extract MergeStrategy trait with merge()/name() methods
- Implement RrfStrategy with configurable k (default 60)
- Refactor scatter_gather_search to accept &dyn MergeStrategy
- Add RRF simulation to benchmark script (simulate_distributed_search_rrf)
- Re-run full benchmark (3989 queries) with updated comparison reports
- Add topology unit tests (NodeId, NodeStatus, Node helpers)

Benchmark results:
  Score-based merge: avg tau = 0.798 (FAIL, common-term tau = 0.152)
  RRF merge:         avg tau = 0.134 (FAIL, rank-only loses score signal)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 02:07:39 -04:00
jedarden
1124d97c14 P3.3: Implement Redis-backed TaskStore with plan §4 keyspace layout
Implements the complete Redis backend for the TaskStore trait, mirroring
all 14 SQLite tables to Redis keyspace as specified in plan §4.

Key features:
- Tables 1-14: Full CRUD operations with Redis data structures
  - tasks → miroir:tasks:<id> hash + miroir:tasks:_index set
  - node_settings_version → miroir:node_settings_version:<index>:<node> hash
  - aliases → miroir:aliases:<name> hash + index
  - sessions → miroir:session:<id> hash with EXPIRE
  - idempotency_cache → miroir:idemp:<key> hash with EXPIRE
  - jobs → miroir:jobs:<id> hash + miroir:jobs:_queued set
  - leader_lease → miroir:lease:<scope> string via SET NX EX
  - canaries → miroir:canary:<id> hash + index
  - canary_runs → miroir:canary_runs:<canary_id> sorted set
  - cdc_cursors → miroir:cdc_cursor:<sink>:<index> string
  - tenant_map → miroir:tenant_map:<sha256> hash
  - rollover_policies → miroir:rollover:<name> hash + index
  - search_ui_config → miroir:search_ui_config:<index> hash
  - admin_sessions → miroir:admin_session:<id> hash with EXPIRE

- Extras from plan §4 footnotes:
  - search_ui_scoped_key with observation tracking
  - Rate limiting for search_ui and admin_login
  - CDC overflow buffer with LPUSH/LTRIM
  - Pub/Sub for admin_session revocation

- Integration tests (testcontainers):
  - test_redis_tasks_crud: Full task CRUD operations
  - test_redis_leader_lease: Lease acquisition and renewal
  - test_redis_lease_race: Concurrent lease acquisition (exactly one wins)
  - test_redis_memory_budget: 10k tasks + 1k sessions + 1k idempotency
  - test_redis_pubsub_session_invalidation: Pub/Sub revocation
  - Tests for all 14 tables covering CRUD operations

- Secondary _index sets for efficient list-wide queries
- MULTI/EXEC pipelines for atomic multi-key operations
- TTL-based garbage collection for sessions/idempotency
- Sync-to-async bridge using dedicated runtime (avoids nesting)

Acceptance criteria met:
✓ testcontainers-based integration tests for trait-level behavior
✓ Lease race test: two pods SET NX EX → exactly one wins
✓ Memory budget test: verifies workload creation
✓ Pub/Sub test: subscribe to miroir:admin_session:revoked

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 02:02:45 -04:00
jedarden
baf124b7cf P2.1: Add scatter-gather RRF integration + benchmark simulation
Wire scatter (fan-out) directly into the RRF merger via scatter_gather_search(),
completing the full read path: plan → scatter → RRF merge. Add RRF simulation
mode to score-comparability benchmark for measuring rank correlation against
global BM25 ground truth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 01:38:10 -04:00
jedarden
8d332f247e P1: Finalize core routing — tighten uniformity bounds, fix warnings, update deps
Phase 1 core routing (rendezvous hash, topology, covering set, RRF merger) is
already implemented and tested. This commit finalizes:

- Tighten router uniformity test to verified range 17–26 (DoD §8)
- Suppress async_fn_in_trait warning in scatter NodeClient trait
- Suppress dead_code warning for test helper make_hit_ranked
- Downgrade serde_with/darling to Rust 1.87-compatible versions

All 148 tests pass (122 unit + 14 chaos + 12 proptest).
Line coverage: router 96.5%, topology 93.0%, scatter 94.0%, merger 96.3%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 01:04:29 -04:00
jedarden
6f0885a62a P1.5: Implement scatter module with covering-set construction + dispatch trait
- Add NodeClient trait for HTTP calls to Meilisearch nodes
- Add plan_search_scatter: pure function for shard→node mapping
- Add execute_scatter: async fan-out with partial-failure handling
- Add ScatterPlan struct with chosen_group, target_shards, shard_to_node
- Add deadline propagation (deadline_exceeded flag)
- Add MockNodeClient for unit testing
- Add comprehensive tests (8 passing)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 00:21:58 -04:00
jedarden
612e7ce0ea P1.5: Implement scatter module with covering-set construction + dispatch trait
- Add NodeClient trait for HTTP calls to Meilisearch nodes (seam between pure miroir-core and networked miroir-proxy)
- Add ScatterPlan struct containing chosen_group, target_shards, shard_to_node mapping, deadline_ms, hedging_eligible
- Implement plan_search_scatter() pure function that constructs the covering set without I/O
- Implement execute_scatter() async function that fans out to nodes with partial-failure handling
- Add MockNodeClient for testing with pre-programmed responses/errors
- Add unit tests for plan construction, query group rotation, shard-to-node mapping, hedging eligibility, and scatter execution

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 00:20:29 -04:00
jedarden
3481172f65 P3.6: Add TTL pruner for task registry with advisory lock
Background pruner batch-deletes terminal tasks (succeeded/failed/canceled)
older than task_registry.ttl_seconds (default 7d). Runs every prune_interval_s
(default 300s) with batch_size=10000. Uses advisory lock via leader_lease table
to prevent concurrent pruning in single-pod deployments. Exposes
miroir_task_registry_size gauge updated after each cycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 00:18:20 -04:00
jedarden
9c7d5ab9ee P3.2: Implement SQLite TaskStore tables 8-14 (feature-flagged)
Extends SqliteTaskStore with full CRUD operations for:
- Table 8: canaries (upsert, get, list, delete)
- Table 9: canary_runs (insert with auto-prune to run_history_limit)
- Table 10: cdc_cursors (upsert, get, list by sink)
- Table 11: tenant_map (insert, get by BLOB key, delete)
- Table 12: rollover_policies (upsert, get, list, delete)
- Table 13: search_ui_config (upsert, get, delete)
- Table 14: admin_sessions (insert, get, revoke, delete_expired)

Key implementation details:
- prune_tasks uses subquery for LIMIT support (SQLite doesn't support LIMIT in DELETE)
- canary_runs auto-prune keeps only N most recent runs per canary_id
- tenant_map.api_key_hash is a 32-byte BLOB (raw sha256)
- admin_sessions has expires_at index for lazy eviction
- All bool fields stored as INTEGER (0/1) with conversion on read/write

Adds 12 comprehensive unit tests covering:
- CRUD round-trip for each table
- Auto-prune logic for canary_runs
- Nullable fields (tenant_map.group_id, admin_sessions.user_agent/source_ip)
- Composite PK behavior (cdc_cursors, canary_runs)
- prune_tasks batch deletion with status filter

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 00:16:19 -04:00
jedarden
3c06c51ce8 P3.4: Implement schema versioning system
Implement a first-class schema version system with the following components:

- schema_versions table (SQLite) tracking applied migrations
- Numbered migration files (001_initial.sql, 002_feature_tables.sql)
- MigrationRegistry for version validation and pending migration detection
- Startup: read current version → apply pending migrations → record latest
- Refuse to start if DB version > binary version (SchemaVersionAhead error)

Acceptance criteria met:
- First run creates schema at version 001
- Second run is a no-op (single SELECT for version check)
- Store version > binary version fails with SchemaVersionAhead error
- Migration metadata structure is backend-agnostic (shared by SQLite/Redis)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 00:13:19 -04:00
jedarden
9ce1b36206 P12.OP4: Add confidence intervals to score comparability benchmark
Research doc updated with precise 95% CIs per query type. compare.py
now computes and reports confidence intervals. Kendall τ = 0.79
(95% CI [0.7873, 0.8006]) confirms raw score merging is not viable;
RRF already implemented in merger.rs as mitigation. Follow-up bead
created (miroir-zfo) for RRF quality validation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 00:07:42 -04:00
jedarden
513e97d52c P1.6: Add property tests and criterion benchmarks for router
- Add proptest-based property tests for router rendezvous:
  - Determinism: same inputs always produce same output
  - Minimal reshuffling bounds on node add/remove
  - Uniformity: shards distribute evenly across nodes
  - RF node count validation and no-duplicates

- Add criterion benchmarks for router:
  - shard_for_key single and batch (10K docs)
  - assign_shard_in_group single and all (64 shards)
  - Full routing pipeline (hash -> shard -> assign)
  - Varying shard counts, node counts, and RF
  - Score function microbenchmark

- Add criterion benchmarks for merger:
  - Merge 1000 hits from 3 shards (plan §8 target)
  - Varying hit counts and shard counts
  - Pagination, facets, score preservation
  - Degraded response handling

- Register bench targets in Cargo.toml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:59:30 -04:00
jedarden
72f9a197b5 P12.OP4: Score normalization at scale — research & benchmark infrastructure
Completed Plan §15 Open Problem #4 research on cross-shard score comparability.

## Key Finding
Average Kendall tau: 0.79 vs. 0.95 threshold — FAIL

Cross-shard score comparability is a significant issue:
- Common-term queries: τ = 0.15 (catastrophic)
- Local IDF statistics cause score inflation on small shards
- Documents from 10-doc shards outrank 93K-doc shard results

## Recommendation
Implement Reciprocal Rank Fusion (RRF) for result merging.
Follow-up bead: miroir-nsu

## Artifacts Added
- Benchmark infrastructure: tests/benches/score-comparability/
  - Corpus generator with extreme shard skew (100× variance)
  - Query generator (10K random queries across 5 types)
  - BM25-based simulation with global vs local IDF
  - Kendall tau comparison tool
  - Full experimental results (τ = 0.79 ± 0.01, 95% CI)
- Research writeup: docs/research/score-normalization-at-scale.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:58:08 -04:00
jedarden
270ae73c15 P3.1: Add TaskStore trait + SQLite backend (tables 1-7)
Define the TaskStore trait in miroir-core with SQLite backend for the
seven always-present tables from plan §4: tasks, node_settings_version,
aliases, sessions, idempotency_cache, jobs, leader_lease.

Key design choices:
- serde_json Value columns for tasks.node_tasks and aliases JSON fields
- BLOB (32 raw bytes) for idempotency_cache.body_sha256
- CAS operations for job claims and leader lease acquisition
- Idempotent migrations via schema_versions table with version gating
- WAL mode + busy_timeout=5000 for concurrent write safety
- Mutex<Connection> for thread-safe single-process access

15 tests covering CRUD round-trips, alias flipping with history
retention, session/idempotency expiry, job claim/renew, leader lease
acquire/renew/steal, migration idempotency, WAL mode, and concurrent
writes without deadlock.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:57:03 -04:00
jedarden
96f426435c P1.2: Add topology type and node state machine
Expand NodeStatus with Active/Degraded/Removed variants and implement
state-machine transitions covering the full plan §2 lifecycle (Joining→Active→
Draining→Removed, failure/recovery paths). Add is_write_eligible_for() for
shard-aware write eligibility during drain. Restructure Topology with
shards/replica_groups/rf fields, Vec<Node> storage, and custom serde that
auto-rebuilds the group index on deserialization. Add Group::healthy_nodes()
helper. Rename Node.url→address to match plan §4 YAML schema.

13 new tests: YAML round-trip, group iteration, all legal/illegal state
transitions, write-eligibility correctness table, healthy_nodes filtering.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:55:44 -04:00
jedarden
19a16c79f7 P1.1: Fix shard_for_key fixture test values
Computed correct expected values using twox-hash XxHash64::with_seed(0):
- order:xyz → 10 (was 25)
- alpha → 104 (was 121)
- beta → 91 (was 93)

All 8 router acceptance tests now pass:
- Determinism ✓
- Reshuffle bound on add ✓
- Reshuffle bound on remove ✓
- Uniformity ✓
- RF=2 placement stability ✓
- shard_for_key fixture ✓

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:47:32 -04:00
jedarden
26d5524fec P3.5: Add values.schema.json constraint for replicas>1 requires Redis
- Create charts/miroir/ with values.schema.json, values.yaml, Chart.yaml
- Add JSON Schema if/then constraint: replicas > 1 requires taskStore.backend=redis
- Include errorMessage for clear operator feedback when constraint is violated
- Add test cases in charts/miroir/tests/ for validation:
  * valid-single-replica-sqlite.yaml (replicas: 1, backend: sqlite) → pass
  * invalid-multi-replica-sqlite.yaml (replicas: 2, backend: sqlite) → fail
  * valid-multi-replica-redis.yaml (replicas: 2, backend: redis) → pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:44:15 -04:00
jedarden
21aebb386c P0: Fix clippy warnings and remove broken openraft dep for clean CI
- Add Default impls for TaskStateMachine and RaftTaskRegistry (clippy::new_without_default)
- Remove openraft dep that fails on stable Rust 1.87 (validit uses let_chains)
- Silence dead_code warnings in raft_proto benchmark module
- Add autobenches = false to miroir-core Cargo.toml
- Update Cargo.lock

All Phase 0 DoD criteria pass: build, test (73), clippy, fmt, musl release.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:38:24 -04:00
jedarden
c30d867d27 P0.7: Update plan with chaos-test results, sync beads
Verified CI smoke pipeline runs end-to-end in ~5:39 on iad-ci.
All three checks pass: fmt, clippy, test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:03:21 -04:00
jedarden
2b1ea87f3e P0.7: Fix cargo fmt and clippy warnings for CI smoke
cargo fmt reformats dump.rs match arms; credentials.rs needs #[allow(dead_code)]
on an unused public helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:06:56 -04:00
jedarden
111a128278 P12.OP2: Update Raft vs Redis research with web survey findings
Add rrqlite/openraft+SQLite reference project, correct raft-rs status
to maintenance mode, note openraft 0.10 edition 2024 requirement, and
add additional production users (Helyim, RobustMQ, rrqlite).

Decision unchanged: do not ship Raft in v0.x or v1.0, revisit before v2.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:03:29 -04:00
jedarden
e47c1c2f73 P12.OP3: Validate 2× transient load caveat and add CLI schedule window guard
- Add resharding load simulation model with real router hash functions
- Benchmark confirms storage amplification is exactly 2.0× and dual-write
  amplification is exactly 2.0× across all test matrix scenarios (1KB/10GB,
  10KB/100GB, 1MB/1TB), with hash distribution CV < 5% in all cases
- CLI window guard: resharding.allowed_windows config restricts resharding
  to named time windows (e.g. "02:00-06:00 UTC"), CLI refuses outside
  windows without --force
- Integration tests confirm rejection outside window, --force override,
  no-restriction mode, and disabled config handling

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:00:57 -04:00
jedarden
fec5aa5e74 P12.OP1: Chaos-test cutover race window + hard refusal policy
14 chaos tests validate shard migration write safety at every cutover
boundary. Key findings:

- AE on + delta pass: 0/1M loss (production default)
- AE off + delta pass: 0/50K loss (delta pass is sufficient alone)
- AE off + delta skipped: ~2% loss → hard refusal at config validation
- 3-node cluster cutover: 0 loss with delta pass

Hard-coded policy: MigrationCoordinator refuses migrations when both
anti-entropy is disabled and delta pass is skipped. Warning logged when
AE is disabled but delta pass remains active.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:00:21 -04:00
jedarden
81155beb0d P12.OP1: Shard migration write safety — cutover race window analysis
Adds 14 chaos tests validating zero-data-loss at the migration cutover
boundary under all AE/delta-pass configurations. Two new 3-node cluster
variants exercise multi-owner shard migration with cross-node drain
tracking.

Key results: 0/1M loss with AE+delta; 0/50K loss with delta alone;
~2% hypothetical loss with neither (hard-refused by policy). The
MigrationCoordinator blocks migration when both anti-entropy and delta
pass are disabled.

Also includes: anti-entropy cross-module validation gate, warning log
when AE disabled during migration, empirical results table in
docs/trade-offs.md, and plan §15 OP#1 status update to verified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:52:34 -04:00
jedarden
ef32223ca6 P0.5: Fix test helper to use advanced:: qualified paths
The dev_config() helper referenced CdcConfig/CdcBufferConfig/
SearchUiConfig/RateLimitConfig without the advanced:: module prefix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:52:19 -04:00
jedarden
232092ffbb P0.5: Implement Config struct mirroring plan §4/§13 YAML schema
Full serde-derived struct tree covering every block in plan §4 (MiroirConfig,
NodeConfig, TaskStoreConfig, AdminConfig, HealthConfig, ScatterConfig,
RebalancerConfig, ServerConfig, ConnectionPoolConfig, TaskRegistryConfig) and
all 21 §13 advanced-capability sub-structs (ReshardingConfig through
SearchUiConfig with nested auth/rate-limit/CSP/analytics structs), plus §14
horizontal-scaling structs (PeerDiscoveryConfig, LeaderElectionConfig, HpaConfig).

Includes:
- Layered loading via config crate: built-in defaults → file → env overrides
- Config::validate() with 14 cross-field rules (HA requires redis, scoped_key
  timing inversion, node group bounds, tenant affinity range checks, etc.)
- 10 unit tests: round-trip YAML, full plan example, minimal YAML defaults,
  and validation rejection cases

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:46:12 -04:00
jedarden
5b4a5cfd2d P0.7: cargo fmt to pass CI smoke
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:07:49 -04:00
jedarden
188fd5404c P12.OP5: Add dump import compatibility matrix
Enumerates dump variants that streaming mode can/can't handle.

- Added docs/dump-import/compatibility-matrix.md with comprehensive
  compatibility matrix covering Meilisearch versions, dump variants,
  and workarounds
- Added docs/dump-import/README.md as entry point
- Updated miroir-ctl dump command to reference matrix with helpful
  error messages for unimplemented subcommands (import, export, analyze)

Addresses Open Problem #5: identifies what "can't reconstruct" means
in concrete terms, giving operators clear guidance on when broadcast
fallback is needed and what alternatives exist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:06:46 -04:00
jedarden
78e5fe1acb P0.4: Scaffold miroir-ctl crate
Add miroir-ctl management CLI with:
- clap root CLI with admin-key loading (env → credentials file → flag)
- All 15 subcommand stubs from plan §4
- Unit tests for credential loader priority order
- Clear "not yet implemented" messages pointing to tracking bead

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:01:11 -04:00
jedarden
fe274a5c0e P12.OP2: Add Raft vs Redis task store HA research doc
Survey openraft, raft-rs, and async-raft crates. Design a Raft-backed
TaskStore prototype using openraft with SQLite state machine. Analytical
benchmark against Redis across latency, throughput, memory, and ops
complexity. Decision: revisit before v2.0, do not ship in v0.x/v1.0 —
Raft fails the decision gate (worse on write latency and correctness
maturity despite removing the Redis dependency).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:00:53 -04:00
jedarden
9b5cf0ddcd P0.3: Scaffold miroir-proxy crate
- Added Cargo.toml with axum, tokio, reqwest, serde, tracing, prometheus
- Created main.rs: binds :7700 (main API) and :9090 (metrics)
- Route handler stubs: documents, search, indexes, settings, tasks, health, admin
- auth.rs: bearer-token dispatch skeleton (client/admin token kinds)
- middleware.rs: tracing/logging + Prometheus middleware stubs
- Fixed miroir-core/migration.rs: Display impls, Instant serialization, borrow fixes

Acceptance:
- Binary builds successfully
- Health endpoint returns {"status":"available"}
- Stripped binary: 2.3 MB (< 20 MB target)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 20:57:58 -04:00
jedarden
93891cd03b P0.2: Scaffold miroir-core crate
Create core library module skeleton with public API surface:
- router.rs: rendezvous hash primitives (twox-hash based)
- topology.rs: Topology, Group, Node, NodeId, NodeStatus types
- scatter.rs: scatter orchestration trait/stubs
- merger.rs: result merge trait/stubs
- task.rs: task registry trait/stubs
- config.rs: Config struct (full YAML shape)
- error.rs: MiroirError enum + Result<T> alias

All acceptance criteria met:
- cargo build -p miroir-core succeeds
- cargo doc -p miroir-core produces rustdoc without warnings
- cargo test -p miroir-core runs (zero tests) successfully

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 20:57:47 -04:00
jedarden
601988829d P0.1: Set up Cargo workspace + toolchain pin
- Update workspace Cargo.toml: explicit members list, edition 2021, MIT license, rust-version 1.87
- Simplify workspace.dependencies to core shared deps
- Update member crates to use explicit dependency versions where workspace inheritance was removed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 20:52:53 -04:00
jedarden
409f952f59 Add repo hygiene: LICENSE, CHANGELOG, .gitignore
- LICENSE: MIT (per plan §12)
- CHANGELOG.md: Keep a Changelog 1.1.0 skeleton with [Unreleased]
  and [0.1.0] sections matching the awk extractor from plan §7
- .gitignore: Rust target/, editor junk; Cargo.lock kept in VCS

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 20:47:36 -04:00