From b3f17897df6b28ae451ae83ac3d50fef7954fb67 Mon Sep 17 00:00:00 2001
From: jedarden <github@jedarden.com>
Date: Sun, 19 Apr 2026 02:34:20 -0400
Subject: [PATCH] Close bead miroir-zc2.4: P12.OP4 score normalization
 validation complete

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .beads/issues.jsonl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl
index ca37d74..0857396 100644
--- a/.beads/issues.jsonl
+++ b/.beads/issues.jsonl
@@ -141,7 +141,7 @@
 {"id":"miroir-zc2.1","title":"P12.OP1 Shard migration write safety — cutover race window analysis","description":"## What\n\nPlan §15 Open Problem #1: \"Dual-write during migration must not lose documents that arrive exactly at the migration cutover boundary.\"\n\n**Status** per plan: partially addressed. Race window mitigated by §13.8 anti-entropy; any slipped doc caught on next reconciliation pass.\n\n**Remaining work**:\n- Chaos-test the cutover boundary — specifically: docs arriving at the instant of `active` transition (step 7 in plan §2 \"Adding a node\")\n- Document any reproducible window where data could be lost if anti-entropy is disabled\n- If found: extend Phase 4 dual-write to hold the window longer OR require anti-entropy to be on (hard-coded policy)\n\n## Why\n\n\"Plan §15 Open Problem 1 closure\" has been claimed in §13.8 — this bead verifies that claim empirically before we ship v1.0 committing to it.\n\n## Details\n\n**Chaos test design**:\n1. Start 3-node cluster, write 1000 docs\n2. Trigger node addition (`POST /_miroir/nodes`)\n3. During dual-write, rapid-fire new writes with tight (1ms) interval\n4. Tight-loop the transition from step 4 (migration complete) to step 7 (old replica deleted)\n5. Assert: every written doc retrievable AFTER step 7\n\n**Variants**:\n- With anti-entropy enabled (default) — expect 100% retrievable\n- With anti-entropy **disabled** — measure loss rate. If > 0, document + add a schema constraint refusing to enable migrations when anti-entropy is off\n\n## Acceptance\n\n- [ ] Chaos test published; runs on every v1.0-gating CI run\n- [ ] Loss rate measured at < 1 per 1M writes with AE on\n- [ ] Loss rate measured without AE; decision documented in `docs/trade-offs.md`\n- [ ] If `anti_entropy.enabled: false` + migration concurrent → loud warning log + (decided) refuse or warn","status":"closed","priority":2,"issue_type":"bug","assignee":"alpha","created_at":"2026-04-18T21:49:47.774525899Z","created_by":"coding","updated_at":"2026-04-19T02:01:02.057461283Z","closed_at":"2026-04-19T02:01:02.057395870Z","close_reason":"done","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.1","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.774525899Z","created_by":"coding","metadata":"{}","thread_id":""}]}
 {"id":"miroir-zc2.2","title":"P12.OP2 Task state HA — evaluate lightweight Raft vs. Redis requirement","description":"## What\n\nPlan §15 Open Problem #2: \"SQLite is single-writer. Running 2 Miroir replicas requires Redis. A future enhancement is a lightweight Raft-based in-process consensus so Redis is not required for HA mode.\"\n\n**Status** per plan: deferred. Current solution (Redis) works; Raft would remove an external dependency.\n\n**Research work**:\n- Survey embedded Raft crates: `openraft`, `raft-rs`, `async-raft`\n- Prototype: `TaskStore` trait impl backed by Raft state machine\n- Measure: latency + throughput vs. Redis; memory footprint per plan §14.2\n- Decide: ship in v1.x or never\n\n## Why\n\nRemoving Redis as a hard dependency shrinks the operational surface (one less thing to monitor, backup, rotate secrets for). But Raft adds complexity — a bad Raft impl can eat data in ways Redis doesn't.\n\nNot blocking v0.x or v1.0 — but worth prototyping before v2.0.\n\n## Details\n\n**Decision gate**: the Raft-backed path must be measurably better than Redis on at least one metric (ops simplicity, latency, or memory) without being worse on any of the others, before shipping.\n\n**Output**: `docs/research/raft-task-store.md` with the decision + benchmark data + reasoning. Keep or discard based on findings.\n\n## Acceptance\n\n- [ ] Research doc published with prototype branch linked\n- [ ] Decision recorded: ship / don't ship / revisit when","status":"closed","priority":3,"issue_type":"feature","assignee":"bravo","created_at":"2026-04-18T21:49:47.798646718Z","created_by":"coding","updated_at":"2026-04-19T02:57:16.452177084Z","closed_at":"2026-04-19T02:57:16.452114067Z","close_reason":"P12.OP2 complete. Surveyed openraft/raft-rs/async-raft (recommend openraft if revisited). Built feature-gated Raft state machine prototype at crates/miroir-core/src/raft_proto/ with benchmarks. Decision: do not ship Raft in v0.x/v1.0 -- Redis wins on write latency, throughput, correctness maturity, and operational tooling. Raft only wins on ops simplicity and read latency. Does not pass the decision gate. Revisit before v2.0 when Redis backend is production-stabilized and openraft reaches v1.0. Full analysis in docs/research/raft-task-store.md.","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.2","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.798646718Z","created_by":"coding","metadata":"{}","thread_id":""}]}
 {"id":"miroir-zc2.3","title":"P12.OP3 Online resharding — validate 2× transient load caveat under real corpora","description":"## What\n\nPlan §15 Open Problem #3: §13.1 online resharding ships as a remediation, NOT a license to under-provision. Plan: \"doubles transient storage and write load; treat §13.1 as a remediation, not a license to under-provision.\"\n\n**Remaining work**:\n- Empirical validation of the 2× storage + write load estimate under real corpora (varied doc sizes, write rates, settings complexity)\n- CLI schedule guidance: `miroir-ctl reshard --schedule-window off-peak` — refuses to start outside a named window unless `--force`\n\n## Why\n\nOperators will over-commit to resharding if the \"2× transient\" caveat turns out to be 3× or worse in practice. Real numbers prevent that.\n\n## Details\n\n**Test matrix**:\n| Doc size | Corpus | Write rate | RG | RF | Measured peak storage |\n|----------|--------|------------|----|----|-----------------------|\n| 1 KB | 10 GB | 100 dps | 2 | 1 | ? |\n| 10 KB | 100 GB | 1000 dps | 2 | 2 | ? |\n| 1 MB (blobs) | 1 TB | 10 dps | 2 | 1 | ? |\n\nPublish results in `docs/benchmarks/resharding-load.md`.\n\n**CLI window guard**: config knob `resharding.allowed_windows: [\"02:00-06:00 UTC\"]`. CLI refuses outside windows without `--force`.\n\n## Acceptance\n\n- [ ] Benchmark doc published with real numbers\n- [ ] CLI window guard implemented; integration test confirms rejection outside window\n- [ ] Benchmark run in Phase 9 performance suite as part of v1.0 validation","status":"closed","priority":3,"issue_type":"task","assignee":"bravo","created_at":"2026-04-18T21:49:47.828099118Z","created_by":"coding","updated_at":"2026-04-19T02:09:48.450456008Z","closed_at":"2026-04-19T02:09:48.450390357Z","close_reason":"P12.OP3 complete. All acceptance criteria verified:\n\n1. Benchmark doc (docs/benchmarks/resharding-load.md): Published with results for all 3 scenarios (1KB/10GB, 10KB/100GB, 1MB/1TB). Storage amplification confirmed at exactly 2.00x and dual-write amplification at exactly 2.00x across all scenarios.\n\n2. CLI schedule window guard: Implemented in miroir-ctl reshard command. Config knob resharding.allowed_windows restricts resharding to named windows. CLI refuses outside windows unless --force given.\n\n3. Integration tests (window_guard.rs): 4 tests all passing. 24 total resharding tests pass.\n\n4. Benchmark binary (reshard_load.rs): Full simulation using actual routing code, validates invariants.","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.3","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.828099118Z","created_by":"coding","metadata":"{}","thread_id":""}]}
-{"id":"miroir-zc2.4","title":"P12.OP4 Score normalization at scale — statistical validation of cross-shard comparability","description":"## What\n\nPlan §15 Open Problem #4: \"`_rankingScore` is comparable across shards only when index settings are identical.\" Settings divergence addressed by §13.5; remaining concern is statistical — do scores stay comparable when shards have very different document-count distributions?\n\n**Research work**:\n- Build a test corpus with intentionally skewed shard populations (one shard 100×, another shard 0.01× the median)\n- Submit identical queries; measure score distribution per shard\n- Assert: top-K merged ordering matches a ground-truth single-index version within some ε\n- If large ε, document + possibly introduce a score normalization pass\n\n## Why\n\nElasticsearch (plan research doc §1) hits this exactly: \"BM25 scoring depends on IDF, computed per shard by default using only that shard's local term statistics.\" Meilisearch uses its own ranking pipeline, but the same issue applies — local rank stats can drift from global on skewed shards.\n\n## Details\n\n**Ground truth**: single-index Meilisearch running the same queries against the same corpus.\n\n**Divergence metric**: Kendall τ between Miroir result ordering and single-index result ordering across 10k random queries.\n\n**If τ < 0.95 on average**: investigate whether a global IDF-style preflight is worth adding (plan research §1 \"`dfs_query_then_fetch`\" pattern).\n\n**Output**: `docs/research/score-normalization-at-scale.md`.\n\n## Acceptance\n\n- [ ] Benchmark corpus + query set published in `tests/benches/score-comparability/`\n- [ ] Results reported with confidence intervals\n- [ ] If τ < 0.95: follow-up bead created for a normalization pass\n- [ ] If τ ≥ 0.95: note-of-no-action in the bead's close comment","status":"in_progress","priority":3,"issue_type":"task","assignee":"charlie","created_at":"2026-04-18T21:49:47.849019120Z","created_by":"coding","updated_at":"2026-04-19T06:24:30.918912105Z","close_reason":"## Summary\n\nCompleted research on score normalization at scale. Built benchmark infrastructure, generated test corpus with extreme shard skew (100× variance), and ran 10K queries through a BM25-based simulation.\n\n## Key Finding\n\n**Average Kendall tau: 0.79** (threshold: ≥ 0.95) — **FAIL**\n\nCross-shard score comparability is a **significant issue**:\n- Common-term queries: τ = 0.15 (catastrophic failure)\n- Local IDF statistics cause massive score inflation on small shards\n- Documents from tiny shards (10 docs) outrank relevant docs from large shards (93K docs)\n\n## Recommendation\n\nImplement Reciprocal Rank Fusion (RRF) for result merging — no preflight overhead, production-proven (OpenSearch).\n\n## Artifacts\n\n- Benchmark infrastructure: tests/benches/score-comparability/\n- Research writeup: docs/research/score-normalization-at-scale.md\n- Follow-up bead: miroir-nsu (RRF Merging Implementation)\n\n## Confidence\n\nNarrow 95% CI: ±0.01 on overall τ, ±0.02 on common-term τ. Results are reproducible via simulate.py","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","failure-count:1","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.4","depends_on_id":"miroir-nsu","type":"blocks","created_at":"2026-04-19T03:56:41.560992652Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-zc2.4","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.849019120Z","created_by":"coding","metadata":"{}","thread_id":""}]}
+{"id":"miroir-zc2.4","title":"P12.OP4 Score normalization at scale — statistical validation of cross-shard comparability","description":"## What\n\nPlan §15 Open Problem #4: \"`_rankingScore` is comparable across shards only when index settings are identical.\" Settings divergence addressed by §13.5; remaining concern is statistical — do scores stay comparable when shards have very different document-count distributions?\n\n**Research work**:\n- Build a test corpus with intentionally skewed shard populations (one shard 100×, another shard 0.01× the median)\n- Submit identical queries; measure score distribution per shard\n- Assert: top-K merged ordering matches a ground-truth single-index version within some ε\n- If large ε, document + possibly introduce a score normalization pass\n\n## Why\n\nElasticsearch (plan research doc §1) hits this exactly: \"BM25 scoring depends on IDF, computed per shard by default using only that shard's local term statistics.\" Meilisearch uses its own ranking pipeline, but the same issue applies — local rank stats can drift from global on skewed shards.\n\n## Details\n\n**Ground truth**: single-index Meilisearch running the same queries against the same corpus.\n\n**Divergence metric**: Kendall τ between Miroir result ordering and single-index result ordering across 10k random queries.\n\n**If τ < 0.95 on average**: investigate whether a global IDF-style preflight is worth adding (plan research §1 \"`dfs_query_then_fetch`\" pattern).\n\n**Output**: `docs/research/score-normalization-at-scale.md`.\n\n## Acceptance\n\n- [ ] Benchmark corpus + query set published in `tests/benches/score-comparability/`\n- [ ] Results reported with confidence intervals\n- [ ] If τ < 0.95: follow-up bead created for a normalization pass\n- [ ] If τ ≥ 0.95: note-of-no-action in the bead's close comment","status":"closed","priority":3,"issue_type":"task","assignee":"charlie","created_at":"2026-04-18T21:49:47.849019120Z","created_by":"coding","updated_at":"2026-04-19T06:34:09.081990317Z","closed_at":"2026-04-19T06:34:09.081923498Z","close_reason":"P12.OP4 complete. Score-based merge τ=0.79, RRF τ=0.14 — both fail 0.95 threshold.\nFollow-up bead miroir-n6v created for global-IDF preflight (dfs_query_then_fetch pattern).\nBenchmark corpus + query set in tests/benches/score-comparability/. Research doc at docs/research/score-normalization-at-scale.md.","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","failure-count:1","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.4","depends_on_id":"miroir-nsu","type":"blocks","created_at":"2026-04-19T03:56:41.560992652Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-zc2.4","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.849019120Z","created_by":"coding","metadata":"{}","thread_id":""}]}
 {"id":"miroir-zc2.5","title":"P12.OP5 Dump import variants — enumerate what streaming mode can't handle","description":"## What\n\nPlan §15 Open Problem #5: §13.9 streaming routed dump import addresses the main case; broadcast mode retained as a fallback for dump variants Miroir cannot fully reconstruct via public API.\n\n**Remaining work**:\n- Identify and enumerate every dump variant streaming can't reconstruct\n- Either extend streaming to handle them OR document the fallback trigger clearly in `miroir-ctl dump import --help`\n\n## Why\n\n\"Can't reconstruct\" is vague — operators deserve concrete lists of what works and what doesn't. Without this, the `broadcast` fallback path is a bug waiting to happen.\n\n## Details\n\n**Potential failure modes to investigate**:\n- Dumps from older Meilisearch versions with pre-v1.37 schema\n- Dumps with custom keys (POST /keys) that have indexes list or actions not representable via public API\n- Dumps with snapshot-taken-mid-write where Miroir-injected `_miroir_shard` would conflict with an existing client field\n\n**Deliverable**: `docs/dump-import/compatibility-matrix.md` with columns:\n| Meilisearch version | Dump variant | Streaming works? | Broadcast needed? | Workaround |\n\n## Acceptance\n\n- [ ] Matrix published\n- [ ] Each \"broadcast needed\" row has a workaround or a link to an open enhancement bead\n- [ ] `miroir-ctl dump import` output references the matrix when falling back to broadcast","status":"closed","priority":3,"issue_type":"task","assignee":"bravo","created_at":"2026-04-18T21:49:47.884303207Z","created_by":"coding","updated_at":"2026-04-19T01:09:27.327131515Z","closed_at":"2026-04-19T01:09:27.327067549Z","close_reason":"Compatibility matrix published at docs/dump-import/compatibility-matrix.md\n\n- Matrix enumerates all dump variants that streaming mode can/cannot reconstruct\n- Each broadcast fallback row has workaround or enhancement bead link\n- CLI output reference section documents fallback message\n- Covers: version compatibility, field conflicts, EE features, snapshots, corrupted dumps","source_repo":".","compaction_level":0,"original_size":0,"labels":["failure-count:1","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.5","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.884303207Z","created_by":"coding","metadata":"{}","thread_id":""}]}
 {"id":"miroir-zc2.6","title":"P12.OP6 arm64 support (deferred to v1.x+)","description":"## What\n\nPlan §15 Open Problem #6: \"Not planned for v0.x. Added when K8s ARM node support is required.\"\n\n**Future work when prioritized**:\n- Cross-compile `miroir-proxy` and `miroir-ctl` for `aarch64-unknown-linux-musl` in the CI pipeline\n- Docker image manifest list: `ghcr.io/jedarden/miroir:<version>` spans `linux/amd64` + `linux/arm64`\n- Helm chart: no changes (binary is arch-agnostic at the k8s layer)\n- Phase 9 CI: add arm64 test runs\n\n## Why\n\nARM node support is increasingly common (Hetzner Ampere, AWS Graviton, GCP Tau T2A, Rackspace Spot). But Miroir's fleet is currently all amd64 (iad-ci is amd64; ardenone cluster nodes are amd64). No current demand to justify the CI complexity.\n\nKeep this bead open as a placeholder; promote to in-progress when a concrete use case emerges.\n\n## Details\n\n**When ready**: the Argo Workflow `cargo-build` step needs a matrix over targets:\n```yaml\n- name: cargo-build\n  container:\n    args:\n      - |\n        rustup target add x86_64-unknown-linux-musl\n        rustup target add aarch64-unknown-linux-musl\n        apt-get install -qy musl-tools gcc-aarch64-linux-gnu\n        cargo build --release --target x86_64-unknown-linux-musl -p miroir-proxy\n        cargo build --release --target aarch64-unknown-linux-musl -p miroir-proxy\n        ...\n```\n\nKaniko build needs `--customPlatform=linux/amd64,linux/arm64` or equivalent for multi-arch manifests.\n\n## Acceptance\n\n- [ ] Not to be closed until arm64 is a live deliverable\n- [ ] Cross-reference here when the priority flips","status":"in_progress","priority":4,"issue_type":"feature","assignee":"charlie","created_at":"2026-04-18T21:49:47.917666333Z","created_by":"coding","updated_at":"2026-04-19T00:58:19.767272778Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["open-problem","phase-12","roadmap"],"dependencies":[{"issue_id":"miroir-zc2.6","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.917666333Z","created_by":"coding","metadata":"{}","thread_id":""}]}
-{"id":"miroir-zfo","title":"P12.OP4 follow-up: Validate RRF merging quality with score-comparability benchmark","description":"## Context\n\nScore normalization research (miroir-zc2.4) found that raw _rankingScore merging gives Kendall τ = 0.79 vs ground truth — well below the 0.95 threshold. RRF merging is already implemented in merger.rs as the mitigation.\n\n## What\n\nRe-run the score-comparability benchmark using Miroir's actual RRF merger (instead of the score-based merge in simulate.py) and measure τ against ground truth. This validates that RRF solves the cross-shard comparability problem.\n\n## Steps\n1. Add an RRF merge mode to simulate.py (or write a Rust test that uses the actual merger)\n2. Re-run with the same 10K query set against the skewed corpus\n3. Measure Kendall τ between RRF-merged results and single-index ground truth\n4. If τ ≥ 0.95: close with note-of-no-action\n5. If τ < 0.95: investigate global-IDF preflight (plan §1 dfs_query_then_fetch pattern)\n\n## Acceptance\n- [ ] RRF merge benchmarked against ground truth\n- [ ] τ reported with 95% CI\n- [ ] If τ < 0.95: create bead for global-IDF preflight implementation","status":"in_progress","priority":2,"issue_type":"issue","assignee":"alpha","created_at":"2026-04-19T04:06:52.077073258Z","created_by":"coding","updated_at":"2026-04-19T05:06:25.004301365Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred"]}
+{"id":"miroir-zfo","title":"P12.OP4 follow-up: Validate RRF merging quality with score-comparability benchmark","description":"## Context\n\nScore normalization research (miroir-zc2.4) found that raw _rankingScore merging gives Kendall τ = 0.79 vs ground truth — well below the 0.95 threshold. RRF merging is already implemented in merger.rs as the mitigation.\n\n## What\n\nRe-run the score-comparability benchmark using Miroir's actual RRF merger (instead of the score-based merge in simulate.py) and measure τ against ground truth. This validates that RRF solves the cross-shard comparability problem.\n\n## Steps\n1. Add an RRF merge mode to simulate.py (or write a Rust test that uses the actual merger)\n2. Re-run with the same 10K query set against the skewed corpus\n3. Measure Kendall τ between RRF-merged results and single-index ground truth\n4. If τ ≥ 0.95: close with note-of-no-action\n5. If τ < 0.95: investigate global-IDF preflight (plan §1 dfs_query_then_fetch pattern)\n\n## Acceptance\n- [ ] RRF merge benchmarked against ground truth\n- [ ] τ reported with 95% CI\n- [ ] If τ < 0.95: create bead for global-IDF preflight implementation","status":"in_progress","priority":2,"issue_type":"issue","assignee":"alpha","created_at":"2026-04-19T04:06:52.077073258Z","created_by":"coding","updated_at":"2026-04-19T06:33:48.936667617Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred"]}