diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl index 233fad3..31409c7 100644 --- a/.beads/issues.jsonl +++ b/.beads/issues.jsonl @@ -30,7 +30,7 @@ {"id":"miroir-afh.5","title":"P7.5 Structured JSON logging + request IDs + trace correlation","description":"## What\n\nImplement plan §10 structured JSON log format:\n```json\n{\n \"timestamp\": \"2026-05-01T12:00:00.000Z\",\n \"level\": \"info\",\n \"message\": \"search completed\",\n \"index\": \"products\",\n \"duration_ms\": 42,\n \"node_count\": 3,\n \"estimated_hits\": 15420,\n \"degraded\": false\n}\n```\n\nEvery log entry includes `request_id` (UUIDv7-prefix short-hash, same value as the `X-Request-Id` response header from P2.8) so a log search can trace a single request across pods.\n\n## Why\n\nStructured logs are the only log format that scales beyond \"grep through ASCII.\" JSON-per-line is parseable by every log aggregator (Loki, ElasticSearch, Splunk, CloudWatch).\n\n## Details\n\n**Tracing subscriber stack**:\n```rust\nuse tracing_subscriber::prelude::*;\ntracing_subscriber::registry()\n .with(tracing_subscriber::fmt::layer().json())\n .with(tracing_subscriber::EnvFilter::from_default_env())\n .init();\n```\n\n**Fields on every log line**: `timestamp`, `level`, `target` (module path), `request_id` (from axum middleware), `pod_id` (env `POD_NAME`), `message`. Plus free-form context per log call (`index`, `shard`, `duration_ms`, ...).\n\n**Log levels**:\n- `ERROR`: orchestrator-side internal failures\n- `WARN`: degraded responses, fallbacks, soft failures\n- `INFO`: one line per request with summary fields\n- `DEBUG`: per-node calls, per-sub-query in multi-search\n- `TRACE`: fan-out buffer contents, scatter plan internals\n\n**No PII**: never log document content, query strings, or API keys. 
Hashes of keys are fine (for correlation across requests).\n\n## Acceptance\n\n- [ ] `jq` parses every log line\n- [ ] Grepping `request_id=abc123` across all pods' logs returns one-line-per-pod-that-handled-part-of-that-request\n- [ ] No API key, document field, or user query appears in any log entry\n- [ ] Log volume: < 1 entry per client request at INFO level; more at DEBUG only when env filter allows","status":"open","priority":1,"issue_type":"task","created_at":"2026-04-18T21:42:04.602737281Z","created_by":"coding","updated_at":"2026-04-18T21:42:04.602737281Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-7"],"dependencies":[{"issue_id":"miroir-afh.5","depends_on_id":"miroir-afh","type":"parent-child","created_at":"2026-04-18T21:42:04.602737281Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-afh.6","title":"P7.6 OpenTelemetry tracing (optional, off by default)","description":"## What\n\nImplement plan §10 tracing (disabled by default):\n```yaml\nmiroir:\n tracing:\n enabled: false\n endpoint: \"http://tempo.monitoring.svc:4317\"\n service_name: miroir\n sample_rate: 0.1\n```\n\nWhen enabled, every search produces a trace with parallel spans for each node in the covering set.\n\n## Why\n\nPlan §10: \"makes latency outliers immediately visible.\" A scatter with one slow node shows up as one span sticking out from the parallel pack — operators can immediately point at the node.\n\n## Details\n\n**OTel SDK**: `opentelemetry` + `opentelemetry-otlp` + `tracing-opentelemetry`. Hook into the existing `tracing` subscriber chain.\n\n**Span hierarchy**:\n- Parent span: inbound request (`POST /indexes/products/search`)\n- Child span: scatter plan construction\n- Parallel child spans: one per node in covering set (`call meili-1`, `call meili-2`, ...)\n- Parallel child spans within the scatter: any hedges fired (§13.2)\n- Merge span: after gather completes\n\n**Sampling**: head-based `sample_rate` in config. 
Tail-based (e.g., always sample slow traces) is a future enhancement; v1 ships head-based only.\n\n**Resource attributes**: `service.name`, `service.version`, `host.name` (pod name).\n\n**Disabled default**: no overhead when off (the subscriber chain skips the OTel layer entirely).\n\n## Acceptance\n\n- [ ] `tracing.enabled: false` → zero OTel library calls in a CPU profile\n- [ ] `tracing.enabled: true` + Tempo running → traces appear within seconds\n- [ ] A slow-node induced in Phase 9 chaos produces a visible outlier span in Tempo\n- [ ] Sample rate 0.1 results in ~10% of requests producing traces","status":"open","priority":2,"issue_type":"task","created_at":"2026-04-18T21:42:04.629100946Z","created_by":"coding","updated_at":"2026-04-18T21:42:04.629100946Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-7"],"dependencies":[{"issue_id":"miroir-afh.6","depends_on_id":"miroir-afh","type":"parent-child","created_at":"2026-04-18T21:42:04.629100946Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-b64","title":"Genesis: Miroir Implementation","description":"## Genesis Bead\n**Tied to plan:** `/home/coding/miroir/docs/plan/plan.md`\n\n## Project Overview\n\n**Miroir** — _Multi-node Index Replication Orchestrator, Integrated Rebalancing_ — is a RAID-like sharding and high-availability layer for **Meilisearch Community Edition (MIT)**. It stripes a large index across a fleet of Meilisearch nodes, fans out search queries across all shards, merges ranked results, and rebalances shard assignments when nodes are added or removed — all without Meilisearch Enterprise.\n\n## Why This Exists\n\nMeilisearch CE loads its entire index into memory-mapped LMDB files. A large index that exceeds a single server's available RAM cannot run on that server. The Enterprise Edition's native sharding and replication are **BUSL-1.1 gated** — production use requires a commercial license. 
Miroir solves this using only the Meilisearch **public REST API**, with no node-side patches or forks. Every Meilisearch node continues to run unmodified CE.\n\n## Design Principles (from plan §1)\n\n1. **Invisible federation** — clients talk to one endpoint using the standard Meilisearch API\n2. **No Enterprise dependency** — pure CE (MIT) everywhere\n3. **Rendezvous hashing (HRW)** — matches what Meilisearch Enterprise itself uses internally\n4. **RF-configurable redundancy** — RF=1 capacity, RF=2 one-node-loss, RF=3 two-node-loss\n5. **Graceful degradation** — partial results with `X-Miroir-Degraded` beats whole-request failure\n6. **Static binaries, scratch images** — musl + scratch Docker, trivial deploy, tiny attack surface\n7. **GitOps first** — all config in `jedarden/declarative-config`, ArgoCD drives cluster changes\n8. **Fixed per-pod resource envelope (2 vCPU / 3.75 GB)** — scale out, not up\n\n## Architecture (high-level)\n\n- **Shards (S)** — logical hash-space granularity, **fixed at index creation**, `S = max_nodes_per_group_ever × 8`\n- **Replica Groups (RG)** — independent query pools, each holds a full copy of all shards; scales **read throughput**\n- **Replication Factor (RF)** — intra-group copies per shard; scales **HA within a group**\n- **Writes** fan out to `RG × RF` nodes (one per-group quorum, cluster-wide success when ≥1 group met its quorum)\n- **Reads** target exactly one group per query (round-robin); fan out to that group's covering set only\n- **Rendezvous hashing is scoped to each group** — prevents cross-group coverage gaps\n\n## Phase Plan\n\n- [ ] **Phase 0 — Foundation** — Cargo workspace, crate layout, config schema, dependencies\n- [ ] **Phase 1 — Core Routing** (plan §2, §4) — rendezvous hash, topology, write targets, covering set\n- [ ] **Phase 2 — Proxy + API Surface** (plan §3, §5) — HTTP server, documents/search/indexes/settings/tasks/health, result merger, quorum, error mapping\n- [ ] **Phase 3 — Task Registry + 
Persistence** (plan §4 task store) — SQLite schema (14 tables), Redis mirror for HA\n- [ ] **Phase 4 — Topology Operations** (plan §2 topology changes, §4 rebalancer) — add/remove node, add/remove group, drain, dual-write, shard-filter migration\n- [ ] **Phase 5 — Advanced Capabilities** (plan §13, subsections .1–.21) — reshard, hedging, EWMA, query planner, two-phase settings, session pinning, aliases, anti-entropy, streaming dump import, idempotency+coalescing, multi-search, vector, CDC, TTL, tenant affinity, shadow tee, ILM, canaries, Admin UI, Explain, Search UI\n- [ ] **Phase 6 — Horizontal Scaling + HPA** (plan §14) — pod envelope, request-path statelessness, Mode A/B/C background coordination, peer discovery, HPA spec\n- [ ] **Phase 7 — Observability + Ops** (plan §10) — metrics, tracing, logs, alerts, Grafana dashboard, ServiceMonitor\n- [ ] **Phase 8 — Deployment + CI** (plan §6, §7) — Dockerfile (scratch+musl), Helm chart, ArgoCD Application, Argo Workflow template\n- [ ] **Phase 9 — Testing** (plan §8) — unit, integration (docker-compose), compatibility, chaos, performance (criterion), SDK smoke tests\n- [ ] **Phase 10 — Security + Secrets** (plan §9) — sealed secrets, ESO/OpenBao integration, key rotation (admin-scoped, JWT, scoped-key), CSRF posture\n- [ ] **Phase 11 — Onboarding + Docs + Delivered Artifacts** (plan §11, §12) — README, CHANGELOG, migration docs, miroir-ctl help, runbooks, release checklist\n- [ ] **Phase 12 — Open Problems Tracking** (plan §15) — score normalization at scale validation, arm64 support, Raft-based HA task state exploration\n\n## How to use this bead\n\n- Each phase has its own epic bead that blocks this genesis bead\n- Every phase epic decomposes into concrete task beads; most tasks have subtasks\n- Dependencies are wired so ready-work can be discovered with `br ready`\n- Close phase epics as they complete; update the checklist above by editing this bead's body\n- Close this genesis bead only when all phases are complete 
AND `br ready` returns empty\n\n## Cross-cutting references\n\n- Infrastructure: Hetzner EX44 + Tailscale + iad-ci Argo Workflows (see `/home/coding/CLAUDE.md`)\n- Container registry: `ghcr.io/jedarden/miroir`\n- Helm chart OCI: `ghcr.io/jedarden/charts/miroir`\n- GitHub Pages: `https://jedarden.github.io/miroir`\n- Declarative config repo: `jedarden/declarative-config → k8s/iad-ci/argo-workflows/miroir-ci.yaml`\n- Argo UI: `https://argo-ci.ardenone.com` (VPN+SSO)\n- ArgoCD read-only API: `https://argocd-ro-ardenone-manager-ts.ardenone.com:8444`\n\n## Resources\n\n- Plan doc: `/home/coding/miroir/docs/plan/plan.md` (3739 lines, authoritative)\n- Research: `/home/coding/miroir/docs/research/{ha-approaches,consistent-hashing,distributed-search-patterns}.md`\n- Notes: `/home/coding/miroir/docs/notes/api-compatibility.md`","status":"open","priority":0,"issue_type":"genesis","created_at":"2026-04-18T21:16:57.035422879Z","created_by":"coding","updated_at":"2026-04-18T21:23:03.980674624Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["epic","genesis"],"dependencies":[{"issue_id":"miroir-b64","depends_on_id":"miroir-46p","type":"blocks","created_at":"2026-04-18T21:23:03.914397943Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-89x","type":"blocks","created_at":"2026-04-18T21:23:03.880994818Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-18T21:23:03.707537245Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-afh","type":"blocks","created_at":"2026-04-18T21:23:03.828449381Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-cdo","type":"blocks","created_at":"2026-04-18T21:23:03.693122638Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-m9q
","type":"blocks","created_at":"2026-04-18T21:23:03.812940820Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-mkk","type":"blocks","created_at":"2026-04-18T21:23:03.751578908Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-qjt","type":"blocks","created_at":"2026-04-18T21:23:03.851889265Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-qon","type":"blocks","created_at":"2026-04-18T21:23:03.678271938Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-18T21:23:03.725188496Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-uhj","type":"blocks","created_at":"2026-04-18T21:23:03.780275977Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-uyx","type":"blocks","created_at":"2026-04-18T21:23:03.949940719Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-b64","depends_on_id":"miroir-zc2","type":"blocks","created_at":"2026-04-18T21:23:03.980624158Z","created_by":"coding","metadata":"{}","thread_id":""}]} -{"id":"miroir-cdo","title":"Phase 1 — Core Routing (rendezvous hash, topology, covering set)","description":"## Phase 1 Epic — Core Routing\n\nImplements the deterministic, coordination-free routing primitives that everything else depends on. After this phase, given a fixed topology + config, any Miroir pod can independently compute identical write targets and covering sets — no coordination required.\n\n## Why This Matters\n\nPlan §1 principle 3: rendezvous hashing (HRW) is the same algorithm Meilisearch Enterprise uses internally with twox-hash. Getting this right has **three** properties we rely on downstream:\n\n1. 
**Determinism** — all pods agree on assignments without any gossip protocol\n2. **Minimal reshuffling** — adding a node to a group moves only ~1/(Ng+1) of that group's docs (plan §2 \"Properties\" bullets)\n3. **Group isolation** — hashing scoped to intra-group node lists prevents both replicas of a shard from landing in the same group (plan §2 \"Why group-scoped assignment matters\")\n\nThese properties are the foundation for the §2 write path, §2 read path, §4 rebalancer, §13.3 adaptive selection, §13.4 query planner, §13.8 anti-entropy, and §14.5 Mode A shard-partitioned ownership. A subtle bug here — e.g., seeding the hash differently, using a non-stable node-id encoding — corrupts every later layer silently.\n\n## Scope (plan §2 Architecture + §4 router.rs)\n\n- `router.rs` — `score(shard, node)`, `assign_shard_in_group`, `write_targets`, `query_group`, `covering_set`, `shard_for_key`\n- `topology.rs` — `Topology` struct (nodes grouped by `replica_group`), node health state machine (healthy / degraded / draining / failed / joining / active / removed)\n- `scatter.rs` — fan-out orchestration primitives (stubbed execution; wired in Phase 2)\n- `merger.rs` — result merge primitives (global sort by `_rankingScore`, offset/limit, facet aggregation, estimatedTotalHits summation, `_miroir_shard` + `_rankingScore` stripping) — pure-function friendly for unit testing\n- Unit tests per §8 \"Router correctness\" + \"Result merger\" bullets\n\n## Definition of Done\n\n- [ ] Rendezvous assignment is deterministic given fixed node list (verified by test)\n- [ ] Adding a 4th node in a 3-node group moves at most ~2 × (1/4) of shards (verified by test, plan §8)\n- [ ] 64 shards / 3 nodes / RF=1 → each node holds 18–26 shards (verified by test)\n- [ ] Top-RF placement changes minimally on add / remove (verified by test)\n- [ ] `write_targets` returns exactly `RG × RF` nodes, one from each group\n- [ ] `query_group(seq, RG)` distributes evenly (verified by test)\n- [ ] 
`covering_set` within a group returns exactly one node per shard (with intra-group replica rotation)\n- [ ] `merger` passes the merge/facet/limit tests in plan §8\n- [ ] `miroir-core` ≥ 90% line coverage via cargo-tarpaulin (per §8 coverage policy)","status":"in_progress","priority":0,"issue_type":"epic","assignee":"alpha","created_at":"2026-04-18T21:18:33.134146061Z","created_by":"coding","updated_at":"2026-04-19T06:40:54.458877301Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","phase","phase-1"],"dependencies":[{"issue_id":"miroir-cdo","depends_on_id":"miroir-qon","type":"blocks","created_at":"2026-04-18T21:23:08.556785813Z","created_by":"coding","metadata":"{}","thread_id":""}]} +{"id":"miroir-cdo","title":"Phase 1 — Core Routing (rendezvous hash, topology, covering set)","description":"## Phase 1 Epic — Core Routing\n\nImplements the deterministic, coordination-free routing primitives that everything else depends on. After this phase, given a fixed topology + config, any Miroir pod can independently compute identical write targets and covering sets — no coordination required.\n\n## Why This Matters\n\nPlan §1 principle 3: rendezvous hashing (HRW) is the same algorithm Meilisearch Enterprise uses internally with twox-hash. Getting this right has **three** properties we rely on downstream:\n\n1. **Determinism** — all pods agree on assignments without any gossip protocol\n2. **Minimal reshuffling** — adding a node to a group moves only ~1/(Ng+1) of that group's docs (plan §2 \"Properties\" bullets)\n3. **Group isolation** — hashing scoped to intra-group node lists prevents both replicas of a shard from landing in the same group (plan §2 \"Why group-scoped assignment matters\")\n\nThese properties are the foundation for the §2 write path, §2 read path, §4 rebalancer, §13.3 adaptive selection, §13.4 query planner, §13.8 anti-entropy, and §14.5 Mode A shard-partitioned ownership. 
A subtle bug here — e.g., seeding the hash differently, using a non-stable node-id encoding — corrupts every later layer silently.\n\n## Scope (plan §2 Architecture + §4 router.rs)\n\n- `router.rs` — `score(shard, node)`, `assign_shard_in_group`, `write_targets`, `query_group`, `covering_set`, `shard_for_key`\n- `topology.rs` — `Topology` struct (nodes grouped by `replica_group`), node health state machine (healthy / degraded / draining / failed / joining / active / removed)\n- `scatter.rs` — fan-out orchestration primitives (stubbed execution; wired in Phase 2)\n- `merger.rs` — result merge primitives (global sort by `_rankingScore`, offset/limit, facet aggregation, estimatedTotalHits summation, `_miroir_shard` + `_rankingScore` stripping) — pure-function friendly for unit testing\n- Unit tests per §8 \"Router correctness\" + \"Result merger\" bullets\n\n## Definition of Done\n\n- [ ] Rendezvous assignment is deterministic given fixed node list (verified by test)\n- [ ] Adding a 4th node in a 3-node group moves at most ~2 × (1/4) of shards (verified by test, plan §8)\n- [ ] 64 shards / 3 nodes / RF=1 → each node holds 18–26 shards (verified by test)\n- [ ] Top-RF placement changes minimally on add / remove (verified by test)\n- [ ] `write_targets` returns exactly `RG × RF` nodes, one from each group\n- [ ] `query_group(seq, RG)` distributes evenly (verified by test)\n- [ ] `covering_set` within a group returns exactly one node per shard (with intra-group replica rotation)\n- [ ] `merger` passes the merge/facet/limit tests in plan §8\n- [ ] `miroir-core` ≥ 90% line coverage via cargo-tarpaulin (per §8 coverage 
policy)","status":"in_progress","priority":0,"issue_type":"epic","assignee":"alpha","created_at":"2026-04-18T21:18:33.134146061Z","created_by":"coding","updated_at":"2026-04-19T07:01:08.962085954Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","phase","phase-1"],"dependencies":[{"issue_id":"miroir-cdo","depends_on_id":"miroir-qon","type":"blocks","created_at":"2026-04-18T21:23:08.556785813Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-cdo.1","title":"P1.1 Rendezvous hash primitives (score, assign_shard_in_group)","description":"## What\n\nImplement `miroir_core::router`:\n```rust\npub fn score(shard_id: u32, node_id: &str) -> u64\npub fn assign_shard_in_group(shard_id: u32, group_nodes: &[NodeId], rf: usize) -> Vec\npub fn shard_for_key(primary_key: &str, shard_count: u32) -> u32\n```\n\n## Why\n\nThese three are the atoms everything else builds on. `score` uses `XxHash64::with_seed(0)` with the canonical concatenation order `(shard_id, node_id)` (plan §4 code sample). Any deviation (different seed, different ordering, endianness) forks routing across any two Miroir instances and silently corrupts writes.\n\n## Design Notes (plan §2 / §4)\n\n- **Hash function is `twox-hash` (XxHash family)** — the same one Meilisearch Enterprise uses; the choice is non-negotiable (plan §2).\n- **Node-id encoding stability** — the string passed to `node_id.hash(&mut h)` must be byte-stable. Use the bare `id: \"meili-0\"` string from config, not a reformatted address.\n- **`assign_shard_in_group` is group-scoped on purpose** — per plan §2 \"Why group-scoped assignment matters\": scoping to the group prevents both replicas of a shard from landing in the same group. 
A global rendezvous would have no such guarantee.\n- **Sort by score descending, break ties lexicographically on node_id** so two nodes with identical hash scores (extremely rare but possible) deterministically resolve.\n\n## Acceptance Tests (plan §8 \"Router correctness\")\n\n- [ ] Determinism: same `(shard_id, nodes)` → identical `Vec` across 1000 randomized runs\n- [ ] Reshuffle bound on add: 64 shards, 3→4 nodes in a group → at most `2 × (1/4) × 64` shard-node edges differ\n- [ ] Reshuffle bound on remove: 64 shards, 4→3 nodes → `~RF × S / Ng` edges differ\n- [ ] Uniformity: 64 shards, 3 nodes, RF=1 → each node holds 18–26 shards (chi-square not rejected at p=0.95)\n- [ ] RF=2 placement: top-2 nodes change minimally when a node is added or removed\n- [ ] `shard_for_key(pk, S)` is `(XxHash64::with_seed(0).hash(pk) % S)` — verified against a known fixture vector","status":"closed","priority":0,"issue_type":"task","assignee":"bravo","created_at":"2026-04-18T21:26:11.754243556Z","created_by":"coding","updated_at":"2026-04-19T03:47:59.776479292Z","closed_at":"2026-04-19T03:47:59.776362081Z","close_reason":"P1.1 Complete: Fixed shard_for_key fixture test values\n\nThe three rendezvous hash primitives were already implemented:\n- score(shard_id, node_id) using XxHash64::with_seed(0) with canonical order (shard_id, node_id)\n- assign_shard_in_group with lexicographic tie-breaking\n- shard_for_key using direct hash modulo\n\nFixed incorrect fixture values in test:\n- order:xyz → 10 (was 25)\n- alpha → 104 (was 121) \n- beta → 91 (was 93)\n\nAll 8 acceptance tests pass:\n- Determinism ✓\n- Reshuffle bound on add ✓\n- Reshuffle bound on remove ✓\n- Uniformity ✓\n- RF=2 placement stability ✓\n- shard_for_key fixture 
✓","source_repo":".","compaction_level":0,"original_size":0,"labels":["failure-count:1","phase-1"],"dependencies":[{"issue_id":"miroir-cdo.1","depends_on_id":"miroir-cdo","type":"parent-child","created_at":"2026-04-18T21:26:11.754243556Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-cdo.2","title":"P1.2 Topology type + node state machine","description":"## What\n\nImplement `miroir_core::topology`:\n```rust\npub struct Topology {\n pub shards: u32,\n pub replica_groups: u32,\n pub rf: usize,\n pub nodes: Vec,\n}\npub struct Node {\n pub id: NodeId,\n pub address: String,\n pub replica_group: u32,\n pub status: NodeStatus,\n}\npub enum NodeStatus { Healthy, Degraded, Draining, Failed, Joining, Active, Removed }\n```\n\nHelpers: `Topology::groups() -> impl Iterator`, `Topology::group(g: u32) -> &Group`, `group.nodes() -> &[Node]`, `group.healthy_nodes() -> Vec<&Node>`.\n\n## Why\n\nThe `Topology` type is what `router` operates on. State transitions correspond to plan §2 topology-change verbs: a node is `Joining` → `Active` after a group-add migration; `Draining` → `Removed` after a node-remove migration; `Failed` is for unplanned loss.\n\nThe state field matters for **routing-eligibility**: writes skip `Draining` for *affected* shards (plan §2 \"Removing a node\" step 1), but still deliver to it for shards it still owns. 
A bug where a `Draining` node stops receiving any writes prematurely would create durability gaps during rebalance.\n\n## State Transition Rules\n\n| From | To | Triggered by |\n|------|-----|-------------|\n| (new) | Joining | `POST /_miroir/nodes` (plan §4 admin API) |\n| Joining | Active | Migration complete (Phase 4) |\n| Active | Draining | `POST /_miroir/nodes/{id}/drain` |\n| Draining | Removed | Migration complete (Phase 4) |\n| Active/Draining | Failed | Health check detects (Phase 7) |\n| Failed | Active | Health check recovery + optional replication catch-up |\n| Active/Failed | Degraded | Partial health (timeouts, not full disconnect) |\n| Degraded | Active | Health restored |\n\n## Acceptance\n\n- [ ] Topology deserializes from plan §4 YAML example (RG=2, 6 nodes, RF=1) into the expected shape\n- [ ] `groups()` iterator returns `RG` groups in ascending order; each group holds exactly its configured nodes\n- [ ] State-machine unit tests cover every legal transition and reject illegal ones (e.g., Joining → Draining)\n- [ ] `Node::is_write_eligible_for(shard_id, status)` correctness table has a test per row","status":"closed","priority":0,"issue_type":"task","assignee":"delta","created_at":"2026-04-18T21:26:11.777790379Z","created_by":"coding","updated_at":"2026-04-19T04:06:04.329548111Z","closed_at":"2026-04-19T04:06:04.329417610Z","close_reason":"done","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","failure-count:1","phase-1"],"dependencies":[{"issue_id":"miroir-cdo.2","depends_on_id":"miroir-cdo","type":"parent-child","created_at":"2026-04-18T21:26:11.777790379Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-cdo.3","title":"P1.3 write_targets and covering_set","description":"## What\n\nImplement the two flat API calls used by the HTTP layer:\n```rust\npub fn write_targets(shard_id: u32, topology: &Topology) -> Vec\npub fn query_group(query_seq: u64, replica_groups: u32) -> u32\npub fn 
covering_set(shard_count: u32, group: &Group, rf: usize, query_seq: u64) -> Vec\n```\n\n## Why / Semantics (plan §2)\n\n**`write_targets`** — flat union of `assign_shard_in_group(shard, g)` across all `RG` groups. Returns `RG × RF` nodes total (may include duplicates across groups if a node_id coincidentally has the highest score in multiple groups — use a dedup pass in the HTTP layer when grouping docs per-request rather than dedup here, so the routing layer's behavior is pure).\n\n**`query_group`** — round-robin per the plan's note: \"`query_sequence_number` is a per-pod counter, not a cluster-wide one.\" Under HPA, cluster-wide balance relies on the K8s Service's round-robin / random kube-proxy policy (§14.4 link).\n\n**`covering_set`** — one node per shard within a group. The intra-group replica selection within each shard rotates by `query_seq % rf` (plan §4 code sample). The returned set is **deduplicated** because one node may own multiple shards in the same group; searching it once captures all its shards (Meilisearch searches all its local docs in a single call).\n\n## Critical Invariant\n\nTwo different Miroir pods, given identical `Topology` + `rf` + `shard_count`, **must** compute the same `write_targets` for any given `shard_id` and the same `covering_set` modulo `query_seq` rotation. 
This is the property that makes the request path stateless (plan §14.4).\n\n## Acceptance (plan §8)\n\n- [ ] `write_targets` returns exactly `RG × RF` nodes (counting duplicates)\n- [ ] `write_targets` assigns one-per-group: the subset of returned nodes in group g is exactly `assign_shard_in_group(shard, group_g_nodes)`\n- [ ] `covering_set` has `|covering_set| ≤ Ng` and covers all `shard_count` shards within the chosen group\n- [ ] Two instances of `Topology` with identical content produce identical `covering_set` outputs for the same `query_seq`\n- [ ] `query_group` distribution: 10K `query_seq` values `% RG` produce uniformly distributed group choices (chi-square pass)","status":"closed","priority":0,"issue_type":"task","assignee":"delta","created_at":"2026-04-18T21:26:11.798428290Z","created_by":"coding","updated_at":"2026-04-19T04:14:55.689143427Z","closed_at":"2026-04-19T04:14:55.689022605Z","close_reason":"All three functions already implemented in router.rs:\n- write_targets (lines 40-45): flat union of assign_shard_in_group across all RG groups\n- query_group (lines 48-50): round-robin by query_seq % replica_groups \n- covering_set (lines 53-63): deduplicated node set with replica rotation\n\nAll 7 P1.3 acceptance tests pass:\n- write_targets returns RG × RF nodes\n- write_targets assigns one-per-group correctly\n- covering_set covers all shards within chosen group\n- covering_set size ≤ Ng\n- Two identical topologies produce identical covering_set outputs\n- query_group distribution is uniform (chi-square test)\n- covering_set rotates replicas by 
query_seq","source_repo":".","compaction_level":0,"original_size":0,"labels":["failure-count:1","phase-1"],"dependencies":[{"issue_id":"miroir-cdo.3","depends_on_id":"miroir-cdo","type":"parent-child","created_at":"2026-04-18T21:26:11.798428290Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-cdo.3","depends_on_id":"miroir-cdo.1","type":"blocks","created_at":"2026-04-18T21:26:21.555076342Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-cdo.3","depends_on_id":"miroir-cdo.2","type":"blocks","created_at":"2026-04-18T21:26:21.576939978Z","created_by":"coding","metadata":"{}","thread_id":""}]} @@ -52,7 +52,7 @@ {"id":"miroir-mkk.4","title":"P4.4 Replica group addition: initializing → active","description":"## What\n\nImplement the \"Adding a new replica group\" flow from plan §2:\n1. Provision new nodes; assign `replica_group: G_new` in config\n2. Mark new group `initializing`; queries NOT routed here\n3. Background sync: for each shard, copy all docs from **any** healthy existing group to the new group's nodes via `filter=_miroir_shard={id}` pagination; new inbound writes already fan out to the new group immediately\n4. When all shards synced, mark group `active` — queries begin routing in round-robin\n5. Existing groups continue serving queries throughout (zero read interruption)\n\n## Why\n\nPlan §2 \"Adding a new replica group (throughput scaling)\": adding a group multiplies query capacity without touching existing groups' data. This is the primary \"we need more search QPS\" lever. Unlike intra-group rebalance which moves a subset, group-add **copies** every shard to the new group — so the I/O is proportional to total corpus size, not `1/(Ng+1)`.\n\n## Details\n\n**Source group selection**: round-robin across existing `active` groups to spread read load during sync. 
Per-shard picks a different source so one group isn't hammered.\n\n**Write fan-out during sync**: new group already receives writes from step 3 onward. This is the durability guarantee — only the backfill window of historical data is transient.\n\n**Progress tracking**: per-shard cursor in `jobs` table; can be paused/resumed per Phase 6 Mode C.\n\n**Verification before `active`**: `GET /indexes/{uid}/stats` against new group → docs count within 0.1% of source group (allows for writes landing during sync). If higher variance, delay the flip and investigate.\n\n## Acceptance\n\n- [ ] Integration test: RG=1 → RG=2; during sync, query throughput on original group unchanged (no regression)\n- [ ] After `active`, queries distribute round-robin between the two groups (verified via per-group metrics)\n- [ ] Mid-sync write test: 100 writes landing during the backfill window are all present on both groups when sync completes\n- [ ] Failed sync (source group becomes unavailable mid-copy) pauses without corrupting new group; resumes when source returns","status":"open","priority":0,"issue_type":"task","created_at":"2026-04-18T21:31:43.859158013Z","created_by":"coding","updated_at":"2026-04-18T21:31:48.961616587Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.4","depends_on_id":"miroir-mkk","type":"parent-child","created_at":"2026-04-18T21:31:43.859158013Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-mkk.4","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-04-18T21:31:48.961576914Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-mkk.5","title":"P4.5 Group removal + unplanned node failure","description":"## What\n\nTwo related flows from plan §2:\n\n**Removing a replica group** (decommission a query pool):\n1. Mark group `draining` — queries stop routing immediately\n2. 
Nodes can be decommissioned; no data migration needed (other groups hold the docs)\n3. Remove nodes from config; operator deletes pods + PVCs\n\n**Unplanned node failure**:\n1. Health check detects failure → mark `failed`, stop routing writes to it\n2. If RF > 1 within the group: surviving replicas serve reads — no immediate migration\n3. For reads: if failed node's shards have no intra-group RF replica, fall back to a healthy group for those shards\n4. Schedule background replication to restore RF within the group; degrade to cross-group fallback until restored\n\n## Why\n\nPlan §2: \"Changes to one group do not affect other groups' data or query routing.\" Group-removal is instant (no data movement) — lets operators shed throughput capacity without a migration window. Unplanned node failure is the most time-sensitive case: readers must not see errors; RF-restore runs in the background.\n\n## Details\n\n**Group-removal preconditions**: refuse to remove a group if it's the last group holding a shard (would be data loss). 
Require `--force` and document the risk.\n\n**Failure detection**: plan §4 config:\n```yaml\nhealth:\n interval_ms: 5000\n timeout_ms: 2000\n unhealthy_threshold: 3 # 3 consecutive failures → mark degraded\n recovery_threshold: 2 # 2 consecutive OKs → mark healthy again\n```\n\n**Cross-group fallback**: Phase 1 `covering_set` already deterministic per-request; the fallback is a per-shard \"if intra-group has none, check other groups\" decision **inside** the scatter planner (Phase 2).\n\n**RF-restore**: similar to P4.2 node addition but for an existing node that lost its data — re-run `_miroir_shard` filter migration from the best intra-group source.\n\n## Acceptance\n\n- [ ] Remove a group with healthy peer groups → queries route away within one `query_seq` tick; no read errors\n- [ ] `--force`-remove the last group holding shard S → loud warning; operator must re-type the index UID to confirm\n- [ ] RF=2 group with 1 node killed → reads succeed on remaining replica; `X-Miroir-Degraded` absent\n- [ ] RF=1 group with 1 node killed → cross-group fallback kicks in; `X-Miroir-Degraded` absent if fallback succeeds\n- [ ] Restored node re-hydrates from a peer replica within its group; `miroir_rebalance_in_progress` transitions 0→1→0","status":"open","priority":0,"issue_type":"task","created_at":"2026-04-18T21:31:43.887649468Z","created_by":"coding","updated_at":"2026-04-18T21:31:48.981354074Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.5","depends_on_id":"miroir-mkk","type":"parent-child","created_at":"2026-04-18T21:31:43.887649468Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-mkk.5","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-04-18T21:31:48.981335608Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-mkk.6","title":"P4.6 Admin API for topology ops: /_miroir/nodes + /_miroir/rebalance","description":"## What\n\nPlan §4 admin 
API endpoints for topology (wrap the rebalancer flows):\n- `POST /_miroir/nodes` — add node (P4.2)\n- `DELETE /_miroir/nodes/{id}` — drain + remove\n- `POST /_miroir/nodes/{id}/drain` — drain only (P4.3, plan §6 \"Scaling\" scale-down)\n- `POST /_miroir/rebalance` — manually trigger rebalance (e.g., after config-only topology tweak)\n- `GET /_miroir/rebalance/status` — current progress; returned shape includes per-shard phase + `miroir_task_id` for each migration batch\n\n## Why\n\nThese endpoints are the **operator surface**. Everything in §11 \"Common operations with miroir-ctl\" maps to these; the Admin UI §13.19 topology tab is a visual wrapper around the same endpoints. Keeping them REST-shaped rather than ad-hoc makes `miroir-ctl` a thin wrapper and the Admin UI trivial.\n\n## Details\n\n**Body shape for `POST /_miroir/nodes`**:\n```json\n{\n \"id\": \"meili-4\",\n \"address\": \"http://meili-4.search.svc:7700\",\n \"replica_group\": 0\n}\n```\n\n**Response**: `202 Accepted` with a `miroir_task_id` (the rebalance is async). 
Client polls `/tasks/{mtask}` for terminal status.\n\n**`GET /_miroir/rebalance/status`** returns:\n```json\n{\n \"in_progress\": true,\n \"triggered_by\": \"POST /_miroir/nodes\",\n \"operation_id\": \"reb-1234\",\n \"started_at\": \"2026-04-18T20:00:00Z\",\n \"phases\": [\n {\"shard\": 12, \"state\": \"MigrationInProgress\", \"pct_complete\": 42, \"source\": \"meili-0\", \"destination\": \"meili-4\"},\n ...\n ],\n \"overall_pct_complete\": 38\n}\n```\n\n**Authentication**: admin-key only (plan §5 bearer dispatch rule 2).\n\n## Acceptance\n\n- [ ] `curl -X POST -H \"Authorization: Bearer $ADMIN_KEY\" .../_miroir/nodes -d '{\"id\":\"meili-4\",\"address\":\"http://...\",\"replica_group\":0}'` returns 202 + miroir_task_id\n- [ ] Invalid `replica_group` (not present in current topology) → 400 with clear message\n- [ ] `POST /_miroir/rebalance` without prior topology change returns 200 and a no-op task (already balanced)\n- [ ] `GET .../rebalance/status` during a rebalance reflects per-shard state in near real time (< 5s staleness)","status":"open","priority":1,"issue_type":"task","created_at":"2026-04-18T21:31:43.916640224Z","created_by":"coding","updated_at":"2026-04-18T21:31:49.023343521Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.6","depends_on_id":"miroir-mkk","type":"parent-child","created_at":"2026-04-18T21:31:43.916640224Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-mkk.6","depends_on_id":"miroir-mkk.2","type":"blocks","created_at":"2026-04-18T21:31:48.997646112Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-mkk.6","depends_on_id":"miroir-mkk.3","type":"blocks","created_at":"2026-04-18T21:31:49.023268953Z","created_by":"coding","metadata":"{}","thread_id":""}]} -{"id":"miroir-n6v","title":"P12.OP4.1: Global-IDF preflight (dfs_query_then_fetch pattern)","description":"## What\n\nImplement global-IDF preflight query phase for 
Miroir to solve cross-shard score comparability (Plan §15 OP#4).\n\nResearch validation (bead miroir-zc2.4) confirmed:\n- Score-based merge: Kendall τ = 0.79 vs ground truth (FAIL, threshold 0.95)\n- RRF merge: Kendall τ = 0.14 vs ground truth (CATASTROPHIC)\n- Root cause: local IDF computed per-shard diverges from global IDF on skewed shard distributions\n\n## Approach\n\nElasticsearch `dfs_query_then_fetch` pattern:\n1. Preflight round: scatter term-frequency query to all shards\n2. Aggregate global document frequencies at coordinator\n3. Send global IDF with search query to shards\n4. Shards use global IDF for scoring instead of local\n\n## Acceptance\n\n- [ ] Preflight round implemented in scatter-gather pipeline\n- [ ] Global IDF aggregation at coordinator\n- [ ] Shards accept and use global IDF for scoring\n- [ ] Re-run benchmark: Kendall τ ≥ 0.95 with same skewed corpus\n- [ ] Latency overhead measured and documented\n\n## Reference\n\n- Research doc: docs/research/score-normalization-at-scale.md\n- Benchmark: tests/benches/score-comparability/\n- ES reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch","status":"in_progress","priority":2,"issue_type":"feature","assignee":"bravo","created_at":"2026-04-19T06:31:33.844052667Z","created_by":"coding","updated_at":"2026-04-19T06:57:46.854857854Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","miroir","research","score-normalization"],"dependencies":[{"issue_id":"miroir-n6v","depends_on_id":"miroir-zc2.4","type":"related","created_at":"2026-04-19T06:32:11.786005093Z","created_by":"coding","metadata":"{}","thread_id":""}]} +{"id":"miroir-n6v","title":"P12.OP4.1: Global-IDF preflight (dfs_query_then_fetch pattern)","description":"## What\n\nImplement global-IDF preflight query phase for Miroir to solve cross-shard score comparability (Plan §15 OP#4).\n\nResearch validation (bead miroir-zc2.4) confirmed:\n- 
Score-based merge: Kendall τ = 0.79 vs ground truth (FAIL, threshold 0.95)\n- RRF merge: Kendall τ = 0.14 vs ground truth (CATASTROPHIC)\n- Root cause: local IDF computed per-shard diverges from global IDF on skewed shard distributions\n\n## Approach\n\nElasticsearch `dfs_query_then_fetch` pattern:\n1. Preflight round: scatter term-frequency query to all shards\n2. Aggregate global document frequencies at coordinator\n3. Send global IDF with search query to shards\n4. Shards use global IDF for scoring instead of local\n\n## Acceptance\n\n- [ ] Preflight round implemented in scatter-gather pipeline\n- [ ] Global IDF aggregation at coordinator\n- [ ] Shards accept and use global IDF for scoring\n- [ ] Re-run benchmark: Kendall τ ≥ 0.95 with same skewed corpus\n- [ ] Latency overhead measured and documented\n\n## Reference\n\n- Research doc: docs/research/score-normalization-at-scale.md\n- Benchmark: tests/benches/score-comparability/\n- ES reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch","status":"in_progress","priority":2,"issue_type":"feature","assignee":"bravo","created_at":"2026-04-19T06:31:33.844052667Z","created_by":"coding","updated_at":"2026-04-19T07:12:25.314607372Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","miroir","research","score-normalization"],"dependencies":[{"issue_id":"miroir-n6v","depends_on_id":"miroir-zc2.4","type":"related","created_at":"2026-04-19T06:32:11.786005093Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-nsu","title":"RRF Merging Implementation","description":"## Genesis Bead\nTied to plan: /home/coding/miroir/docs/plan/plan.md\n\n## Overview\nImplement Reciprocal Rank Fusion (RRF) for result merging in Miroir to address cross-shard score comparability issues identified in score-normalization-at-scale research.\n\n## Research Context\nExperiments (miroir-zc2.4) showed:\n- Average Kendall tau: 0.79 
vs. 0.95 threshold (FAIL)\n- Common-term queries: τ = 0.15 (catastrophic)\n- RRF is the recommended solution (no preflight, production-proven)\n\n## Progress\n- [ ] Phase 1: Update Merger trait and stub\n- [ ] Phase 2: Implement RRF scoring\n- [ ] Phase 3: Benchmark against corpus\n- [ ] Phase 4: Integration with scatter-gather","status":"closed","priority":2,"issue_type":"genesis","assignee":"charlie","created_at":"2026-04-19T03:56:08.747340056Z","created_by":"coding","updated_at":"2026-04-19T06:24:21.290715173Z","closed_at":"2026-04-19T06:24:21.290611796Z","close_reason":"All four phases complete: MergeStrategy trait, RRF scoring (k=60), benchmarks re-run, scatter-gather integration. 26 merger + 15 scatter tests passing. Commits: 2b7f4a0, f5a630d, cec3b81","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","failure-count:1"]} {"id":"miroir-qjt","title":"Phase 8 — Deployment + CI (§6, §7)","description":"## Phase 8 Epic — Deployment + CI\n\nPackages Miroir: static musl binary → scratch Docker image → Helm chart → ArgoCD Application → Argo Workflows CI template (iad-ci). At phase end, `git tag v0.1.0 && git push origin v0.1.0` produces a signed GitHub Release with both `miroir-proxy` and `miroir-ctl`, a ghcr.io image, and a chart version bump.\n\n## Why This Phase (and Why It Depends On Phase 2)\n\nPlan §6 (Deployment) + §7 (CI/CD) turn the binary into a thing operators can actually install. Helm defaults (plan §6 \"Dev vs. production defaults\") encode the \"single-pod dev, multi-pod prod\" story from Phase 6. 
ArgoCD app + Argo Workflow template live in `jedarden/declarative-config` (see `/home/coding/CLAUDE.md`) — standard pattern across the fleet.\n\n## Scope\n\n**Dockerfile** (plan §7)\n- `FROM scratch` + static `miroir-proxy` binary\n- Expose 7700 + 9090\n- OCI labels: source, version, revision, licenses=MIT\n- Target size < 15 MB compressed\n\n**Cargo musl build** — `x86_64-unknown-linux-musl` target; `cargo build --release` for both `-p miroir-proxy` and `-p miroir-ctl`\n\n**Argo WorkflowTemplate `miroir-ci`** (plan §7) at `jedarden/declarative-config → k8s/iad-ci/argo-workflows/miroir-ci.yaml`\n- DAG: checkout → lint → test → build-binary → docker-build (tag-gated) → github-release (tag-gated)\n- `cargo fmt --check`, `cargo clippy -D warnings`, `cargo test --all`, musl build\n- Kaniko for image push to `ghcr.io/jedarden/miroir:`, `:latest`, `:`, `:`\n- `gh release create` with both binaries + sha256\n\n**Helm chart `charts/miroir/`** (plan §6)\n- Templates: deployment, service, headless, configmap, secret, HPA, optional PVC (CDC), StatefulSet for meilisearch, meilisearch service, optional Redis deployment, serviceaccount\n- `values.yaml` with dev defaults (replicas=1, SQLite, RF=1, RG=1, HPA off)\n- `values.schema.json` that rejects:\n - `miroir.replicas > 1` with `taskStore.backend: sqlite`\n - `miroir.hpa.enabled: true` without `replicas >= 2 && taskStore.backend: redis`\n - `search_ui.rate_limit.backend: local` when `miroir.replicas > 1`\n - Admin login rate-limit local backend in HA\n - `search_ui.scoped_key_rotate_before_expiry_days >= scoped_key_max_age_days`\n- `_helpers.tpl` for fully-qualified StatefulSet DNS node addresses (plan §6 ConfigMap)\n- `NOTES.txt` with next-step pointers\n\n**ArgoCD Application** (plan §6) — `k8s//miroir//` path in `jedarden/declarative-config`, automated sync + prune + selfHeal\n\n**Release mechanics** (plan §7)\n- `CHANGELOG.md` Keep a Changelog format; CI extracts section for GitHub release notes\n- `Cargo.toml` workspace 
version bumped before tag\n- `Chart.yaml` `appVersion` bumped before tag\n- Tag format: `v[0-9]+.[0-9]+.[0-9]+*`\n\n## Infrastructure Reference\n\n- Registry: `ghcr.io/jedarden/miroir`\n- Helm chart OCI: `ghcr.io/jedarden/charts/miroir`\n- Pages: `https://jedarden.github.io/miroir`\n- CI secrets on iad-ci: `ghcr-credentials` (argo-workflows/.dockerconfigjson), `github-token` (argo-workflows/token)\n- Argo UI: `https://argo-ci.ardenone.com`\n\n## Definition of Done\n\n- [ ] `kubectl --kubeconfig=$HOME/.kube/iad-ci.kubeconfig apply -f workflow.yaml` completes the full CI pipeline on `main` within ~10 min\n- [ ] Pushing tag `v0.1.0-rc.1` produces a ghcr.io image, a GitHub pre-release, and does NOT update `latest`/float tags\n- [ ] `helm install search charts/miroir --namespace search --wait` stands up a working single-pod cluster\n- [ ] `values.schema.json` rejections tested via `helm lint --strict` with mutating values files\n- [ ] Final image ≤ 15 MB compressed\n- [ ] ArgoCD app syncs cleanly against ardenone-manager read-only proxy","status":"open","priority":0,"issue_type":"epic","created_at":"2026-04-18T21:21:13.608558775Z","created_by":"coding","updated_at":"2026-04-18T21:23:08.690462028Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase","phase-8"],"dependencies":[{"issue_id":"miroir-qjt","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-18T21:23:08.690406249Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-qjt.1","title":"P8.1 Dockerfile: scratch + static musl miroir-proxy","description":"## What\n\nShip the `Dockerfile` from plan §7:\n```dockerfile\nFROM scratch\nCOPY miroir-proxy-linux-amd64 /miroir-proxy\nEXPOSE 7700 9090\nENTRYPOINT [\"/miroir-proxy\"]\nCMD [\"--config\", \"/etc/miroir/config.yaml\"]\n```\n\nOCI labels (plan 
§12):\n```\norg.opencontainers.image.source=https://github.com/jedarden/miroir\norg.opencontainers.image.version=\norg.opencontainers.image.revision=\norg.opencontainers.image.licenses=MIT\n```\n\nTarget: compressed image < 15 MB.\n\n## Why\n\nPlan §1 principle 6 + §12: \"scratch base, no libc. Zero OS packages, no shell.\" This is the smallest possible attack surface and the fastest possible pull (one layer, tiny). Makes trivial deploys feasible on edge clusters.\n\n## Details\n\n**Musl build step** (plan §7 `cargo-build` template):\n```bash\napt-get install -qy musl-tools\nrustup target add x86_64-unknown-linux-musl\ncargo build --release --target x86_64-unknown-linux-musl -p miroir-proxy\ncargo build --release --target x86_64-unknown-linux-musl -p miroir-ctl\nsha256sum miroir-proxy-linux-amd64 > miroir-proxy-linux-amd64.sha256\n```\n\n**Layers**: COPY the static binary directly from `/workspace/artifacts/` into `/miroir-proxy` in the scratch image.\n\n**Config mount**: `/etc/miroir/config.yaml` via ConfigMap mount (Helm chart).\n\n**No shell = no `docker exec -it` debugging** — intentional. Debug by logs + metrics + `kubectl describe` only. 
Operators who need shell can run a sidecar.\n\n## Acceptance\n\n- [ ] `docker build .` on an artifact-equipped workspace produces an image < 15 MB compressed\n- [ ] `docker run --help` returns clap help (binary works from scratch base)\n- [ ] Image labels contain all 4 OCI labels with correct values\n- [ ] Static linkage: `ldd` against the extracted binary prints \"not a dynamic executable\"","status":"open","priority":0,"issue_type":"task","created_at":"2026-04-18T21:43:56.826575101Z","created_by":"coding","updated_at":"2026-04-18T21:43:56.826575101Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-8"],"dependencies":[{"issue_id":"miroir-qjt.1","depends_on_id":"miroir-qjt","type":"parent-child","created_at":"2026-04-18T21:43:56.826575101Z","created_by":"coding","metadata":"{}","thread_id":""}]} @@ -137,7 +137,7 @@ {"id":"miroir-uyx.4","title":"P11.4 miroir-ctl subcommand docs + runbooks","description":"## What\n\nFor each `miroir-ctl` subcommand listed in plan §4 crate layout + §11 common operations:\n- `clap`-generated `--help` output covers flags + examples\n- A short runbook `docs/ctl/.md` with purpose, preconditions, examples, gotchas\n\nCommands covered:\n- `status`, `node add/drain`, `rebalance status --watch`, `verify`, `task status`\n- `reshard` (§13.1), `alias` (§13.7), `ttl` (§13.14), `cdc` (§13.13)\n- `shadow` (§13.16), `ui` (§13.19/§13.21 — scoped-key rotation, JWT rotation)\n- `tenant` (§13.15), `explain` (§13.20), `dump import` (§13.9), `canary` (§13.18)\n\n## Why\n\nPlan §12: \"`miroir-ctl --help` — all subcommands documented via clap.\" But `--help` alone isn't enough — operators need examples and gotchas. A good runbook is what prevents a 3-AM mis-run.\n\n## Details\n\n**Runbook template**:\n```markdown\n# `miroir-ctl `\n\n## Purpose\n\n\n## Preconditions\n- [ ] ...\n\n## Examples\n```\nmiroir-ctl ... 
--example\n```\n\n## Gotchas\n- ...\n\n## See also\n- Plan §X.X\n```\n\n**Integration with Admin UI (§13.19)**: many commands have a UI equivalent — runbook should cross-reference both (\"prefer UI for one-off; prefer CLI for scripts / CI\").\n\n## Acceptance\n\n- [ ] Every subcommand in the crate layout has a matching `docs/ctl/*.md` runbook\n- [ ] `miroir-ctl status --help` mentions where to find runbook for more\n- [ ] The runbooks are all under 100 lines each (easy to read before operating)","status":"open","priority":1,"issue_type":"task","created_at":"2026-04-18T21:48:38.832471052Z","created_by":"coding","updated_at":"2026-04-18T21:48:38.832471052Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-11"],"dependencies":[{"issue_id":"miroir-uyx.4","depends_on_id":"miroir-uyx","type":"parent-child","created_at":"2026-04-18T21:48:38.832471052Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-uyx.5","title":"P11.5 Common issues + troubleshooting","description":"## What\n\nPlan §11 \"Common issues\" as structured troubleshooting docs at `docs/troubleshooting.md`:\n- \"primary key required\" — Miroir requires explicit primary key at index creation\n- \"Search returns fewer results than expected\" — degraded-node cross-reference + `GET /_miroir/topology`\n- \"Task polling stuck at processing\" — per-node task status via `miroir-ctl task status`\n\nPlus others discovered during Phase 9 testing and chaos scenarios.\n\n## Why\n\nEvery production system accumulates a list of \"the 10 things new users hit in their first week.\" Documenting them transparently shortens the mean-time-to-productive-user from hours to minutes.\n\n## Details\n\n**Per-issue structure**:\n```markdown\n## Error: \"primary key required\"\n\n### Symptom\nClient sees: `HTTP 400 { \"code\": \"miroir_primary_key_required\" }`\n\n### Cause\nThe index was created without a primary key. 
Miroir cannot route without one.\n\n### Fix\n```bash\ncurl -X POST https://miroir/indexes \\\n -H \"Authorization: Bearer $KEY\" \\\n -d '{\"uid\": \"myindex\", \"primaryKey\": \"id\"}'\n```\n\n### Why this differs from Meilisearch\nMeilisearch can infer the primary key from the first document batch. Miroir cannot — it needs to hash the PK *before* any node sees it. Explicit primary_key at index creation is required.\n```\n\n**Diagnostic playbook**: `docs/troubleshooting/diagnostics.md` — first thing to check for any symptom:\n1. `GET /_miroir/topology` — all nodes healthy?\n2. `GET /_miroir/metrics | grep degraded` — any degraded shards?\n3. `kubectl logs miroir-0 --tail=100 | jq 'select(.level==\"ERROR\")'` — recent errors?\n4. `kubectl get pods -n search` — all running?\n\n## Acceptance\n\n- [ ] 3 plan §11 issues documented with the template\n- [ ] At least 5 additional issues discovered in Phase 9 chaos added\n- [ ] Troubleshooting doc cross-linked from README, install guide, each migration guide","status":"open","priority":1,"issue_type":"task","created_at":"2026-04-18T21:48:38.877214633Z","created_by":"coding","updated_at":"2026-04-18T21:48:38.877214633Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-11"],"dependencies":[{"issue_id":"miroir-uyx.5","depends_on_id":"miroir-uyx","type":"parent-child","created_at":"2026-04-18T21:48:38.877214633Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-uyx.6","title":"P11.6 Helm chart publication: GH Pages + OCI push","description":"## What\n\nPlan §12 delivered artifacts for the Helm chart:\n- **Primary**: `https://jedarden.github.io/miroir` (GitHub Pages, `gh-pages` branch)\n- **OCI**: `ghcr.io/jedarden/charts/miroir` (for air-gapped environments)\n\nExtend the Phase 8 Argo Workflow `miroir-ci` template with:\n- On tag: `helm package charts/miroir -d dist/`\n- Push to gh-pages: update `index.yaml` + copy `.tgz` into the branch, commit via `gh-pages` helper\n- OCI push: 
`helm push dist/miroir-.tgz oci://ghcr.io/jedarden/charts`\n\n## Why\n\nPlan §12: chart users expect `helm repo add` to work. Without publication, operators have to `helm install charts/miroir/` from a git clone — fine for dev, wrong for prod.\n\n## Details\n\n**gh-pages flow**:\n```bash\ngit worktree add gh-pages gh-pages\nhelm package charts/miroir -d gh-pages/\nhelm repo index gh-pages/ --url https://jedarden.github.io/miroir --merge gh-pages/index.yaml\ngit -C gh-pages add -A\ngit -C gh-pages commit -m \"Release chart v\"\ngit -C gh-pages push origin gh-pages\n```\n\n**OCI push** requires GHCR write token (already have in `ghcr-credentials`):\n```bash\necho $GHCR_TOKEN | helm registry login ghcr.io -u --password-stdin\nhelm push miroir-.tgz oci://ghcr.io/jedarden/charts\n```\n\n**Chart-only fixes**: when a chart change doesn't need an app rebuild, bump only chart version (not appVersion). CI must detect \"chart-only\" change (e.g., by diffing `charts/**` vs. `crates/**`) and skip the binary rebuild.\n\n## Acceptance\n\n- [ ] After `git tag v0.1.0 && git push`, `helm repo add miroir https://jedarden.github.io/miroir && helm repo update` discovers v0.1.0\n- [ ] `helm install ... 
oci://ghcr.io/jedarden/charts/miroir --version 0.1.0` works identically\n- [ ] Chart-only fix: tagging `v0.1.1` after editing only a template file bumps chart version without new app binary","status":"open","priority":2,"issue_type":"task","created_at":"2026-04-18T21:48:38.909893288Z","created_by":"coding","updated_at":"2026-04-18T21:48:38.909893288Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase-11"],"dependencies":[{"issue_id":"miroir-uyx.6","depends_on_id":"miroir-uyx","type":"parent-child","created_at":"2026-04-18T21:48:38.909893288Z","created_by":"coding","metadata":"{}","thread_id":""}]} -{"id":"miroir-yio","title":"Global-IDF preflight: implement dfs_query_then_fetch for cross-shard comparability","description":"## Context\n\nRRF validation (miroir-zfo) confirmed that RRF merge produces τ = 0.14 against ground truth — catastrophically worse than score-based merge (τ = 0.79). Neither strategy meets the 0.95 threshold.\n\nThe root cause is that shards with different document distributions compute different local IDF values, making scores and rankings incomparable across shards.\n\n## What\n\nImplement the Elasticsearch `dfs_query_then_fetch` pattern as a pre-query phase in Miroir:\n\n1. Coordinator sends a lightweight DFS (Distributed Frequency Search) request to all shards\n2. Each shard returns term-level document frequencies for the query terms\n3. Coordinator aggregates into global IDF values\n4. Coordinator sends the actual search query with global IDF attached\n5. Shards use global IDF for scoring instead of local IDF\n\n## Why\n\nThis is the proven solution. 
Both score-based merging (τ = 0.79) and RRF (τ = 0.14) fail the τ ≥ 0.95 quality threshold with skewed shards.\n\n## Scope\n\n- New `DfsPhase` in the scatter-gather pipeline\n- Coordinator-side IDF aggregation\n- Shard-side global-IDF scoring override\n- Integration test with skewed corpus\n- Benchmark to measure latency overhead of the preflight round\n\n## Depends on\n\n- miroir-zfo (RRF validation — complete)\n- miroir-zc2.4 (score normalization research — complete)","status":"in_progress","priority":1,"issue_type":"task","assignee":"alpha","created_at":"2026-04-19T06:42:00.808359301Z","created_by":"coding","updated_at":"2026-04-19T06:55:38.065100933Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["failure-count:2"]} +{"id":"miroir-yio","title":"Global-IDF preflight: implement dfs_query_then_fetch for cross-shard comparability","description":"## Context\n\nRRF validation (miroir-zfo) confirmed that RRF merge produces τ = 0.14 against ground truth — catastrophically worse than score-based merge (τ = 0.79). Neither strategy meets the 0.95 threshold.\n\nThe root cause is that shards with different document distributions compute different local IDF values, making scores and rankings incomparable across shards.\n\n## What\n\nImplement the Elasticsearch `dfs_query_then_fetch` pattern as a pre-query phase in Miroir:\n\n1. Coordinator sends a lightweight DFS (Distributed Frequency Search) request to all shards\n2. Each shard returns term-level document frequencies for the query terms\n3. Coordinator aggregates into global IDF values\n4. Coordinator sends the actual search query with global IDF attached\n5. Shards use global IDF for scoring instead of local IDF\n\n## Why\n\nThis is the proven solution. 
Both score-based merging (τ = 0.79) and RRF (τ = 0.14) fail the τ ≥ 0.95 quality threshold with skewed shards.\n\n## Scope\n\n- New `DfsPhase` in the scatter-gather pipeline\n- Coordinator-side IDF aggregation\n- Shard-side global-IDF scoring override\n- Integration test with skewed corpus\n- Benchmark to measure latency overhead of the preflight round\n\n## Depends on\n\n- miroir-zfo (RRF validation — complete)\n- miroir-zc2.4 (score normalization research — complete)","status":"in_progress","priority":1,"issue_type":"task","assignee":"alpha","created_at":"2026-04-19T06:42:00.808359301Z","created_by":"coding","updated_at":"2026-04-19T07:15:20.540628390Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","failure-count:3"]} {"id":"miroir-zc2","title":"Phase 12 — Open Problems + Research (§15)","description":"## Phase 12 Epic — Open Problems Tracking\n\nStanding bucket for the plan §15 open problems that are **not** fully resolved by initial implementation. These are research/validation/future-enhancement beads, not blockers for v1.0. This phase does not block the genesis bead's shipping path — it's a parallel track that persists beyond v1.0.\n\n## Why An Epic At All\n\nPlan §15 flags these as \"documented constraints, not blockers. Initial release ships with known limitations.\" Tracking them as beads means they're not forgotten, they have a visible owner, and their resolution status can be surfaced alongside the rest of the work.\n\n## Scope — the 6 Open Problems (plan §15)\n\n1. **Shard migration write safety** — OP#1. **Status: partially addressed.** Dual-write cutover sequencing (Phase 4) + anti-entropy reconciler (§13.8 / Phase 5) catches slipped docs. Remaining work: chaos-test the cutover boundary, document any reproducible window where data could be lost if anti-entropy is disabled.\n\n2. **Task state HA (Raft vs. Redis)** — OP#2. **Status: deferred.** Current: Redis for multi-pod, SQLite for single-pod. 
Future: lightweight in-process Raft (or equivalent) so Redis is not required in HA. Not v1.x.\n\n3. **Resharding (S change) vs. node scaling (N change)** — OP#3. **Status: addressed by §13.1** (shadow-index dual-hash). Remaining work: empirical validation of the §13.1 \"2× transient storage and write load\" caveat under real corpora; schedule guidance in the CLI for off-peak reshard windows.\n\n4. **Score normalization at scale** — OP#4. **Status: settings-divergence addressed by §13.5 two-phase broadcast + drift reconciler.** Remaining work is purely statistical: validate that `_rankingScore` remains comparable across shards with very different document-count distributions. Requires corpus diversity tests.\n\n5. **Dump import distribution** — OP#5. **Status: addressed by §13.9 streaming routed dump import.** Broadcast mode retained as fallback. Remaining work: identify and enumerate every dump variant `mode: streaming` cannot fully reconstruct; either extend streaming or document the fallback trigger clearly.\n\n6. **arm64 support** — OP#6. 
**Status: not planned for v0.x.** Wire into CI when K8s ARM node support is actually needed (likely v1.x or later).\n\n## How To Use This Phase\n\n- Each OP becomes a child bead (bug/feature type) under this epic\n- Beads stay open until the status column above says \"fully addressed\"\n- v1.0 release notes should explicitly link to this epic so operators know what's still on the table\n- New open problems discovered during implementation get added here rather than silently accreted elsewhere\n\n## Not In Scope\n\n- Any concrete implementation work already covered by §13.1 / §13.5 / §13.8 / §13.9 — that belongs to Phase 5.","status":"open","priority":2,"issue_type":"epic","created_at":"2026-04-18T21:22:54.403910669Z","created_by":"coding","updated_at":"2026-04-18T21:22:54.403910669Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["phase","phase-12","research"]} {"id":"miroir-zc2.1","title":"P12.OP1 Shard migration write safety — cutover race window analysis","description":"## What\n\nPlan §15 Open Problem #1: \"Dual-write during migration must not lose documents that arrive exactly at the migration cutover boundary.\"\n\n**Status** per plan: partially addressed. Race window mitigated by §13.8 anti-entropy; any slipped doc caught on next reconciliation pass.\n\n**Remaining work**:\n- Chaos-test the cutover boundary — specifically: docs arriving at the instant of `active` transition (step 7 in plan §2 \"Adding a node\")\n- Document any reproducible window where data could be lost if anti-entropy is disabled\n- If found: extend Phase 4 dual-write to hold the window longer OR require anti-entropy to be on (hard-coded policy)\n\n## Why\n\n\"Plan §15 Open Problem 1 closure\" has been claimed in §13.8 — this bead verifies that claim empirically before we ship v1.0 committing to it.\n\n## Details\n\n**Chaos test design**:\n1. Start 3-node cluster, write 1000 docs\n2. Trigger node addition (`POST /_miroir/nodes`)\n3. 
During dual-write, rapid-fire new writes with tight (1ms) interval\n4. Tight-loop the transition from step 4 (migration complete) to step 7 (old replica deleted)\n5. Assert: every written doc retrievable AFTER step 7\n\n**Variants**:\n- With anti-entropy enabled (default) — expect 100% retrievable\n- With anti-entropy **disabled** — measure loss rate. If > 0, document + add a schema constraint refusing to enable migrations when anti-entropy is off\n\n## Acceptance\n\n- [ ] Chaos test published; runs on every v1.0-gating CI run\n- [ ] Loss rate measured at < 1 per 1M writes with AE on\n- [ ] Loss rate measured without AE; decision documented in `docs/trade-offs.md`\n- [ ] If `anti_entropy.enabled: false` + migration concurrent → loud warning log + (decided) refuse or warn","status":"closed","priority":2,"issue_type":"bug","assignee":"alpha","created_at":"2026-04-18T21:49:47.774525899Z","created_by":"coding","updated_at":"2026-04-19T02:01:02.057461283Z","closed_at":"2026-04-19T02:01:02.057395870Z","close_reason":"done","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.1","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.774525899Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-zc2.2","title":"P12.OP2 Task state HA — evaluate lightweight Raft vs. Redis requirement","description":"## What\n\nPlan §15 Open Problem #2: \"SQLite is single-writer. Running 2 Miroir replicas requires Redis. A future enhancement is a lightweight Raft-based in-process consensus so Redis is not required for HA mode.\"\n\n**Status** per plan: deferred. Current solution (Redis) works; Raft would remove an external dependency.\n\n**Research work**:\n- Survey embedded Raft crates: `openraft`, `raft-rs`, `async-raft`\n- Prototype: `TaskStore` trait impl backed by Raft state machine\n- Measure: latency + throughput vs. 
Redis; memory footprint per plan §14.2\n- Decide: ship in v1.x or never\n\n## Why\n\nRemoving Redis as a hard dependency shrinks the operational surface (one less thing to monitor, backup, rotate secrets for). But Raft adds complexity — a bad Raft impl can eat data in ways Redis doesn't.\n\nNot blocking v0.x or v1.0 — but worth prototyping before v2.0.\n\n## Details\n\n**Decision gate**: the Raft-backed path must be measurably better than Redis on at least one metric (ops simplicity, latency, or memory) without being worse on any of the others, before shipping.\n\n**Output**: `docs/research/raft-task-store.md` with the decision + benchmark data + reasoning. Keep or discard based on findings.\n\n## Acceptance\n\n- [ ] Research doc published with prototype branch linked\n- [ ] Decision recorded: ship / don't ship / revisit when","status":"closed","priority":3,"issue_type":"feature","assignee":"bravo","created_at":"2026-04-18T21:49:47.798646718Z","created_by":"coding","updated_at":"2026-04-19T02:57:16.452177084Z","closed_at":"2026-04-19T02:57:16.452114067Z","close_reason":"P12.OP2 complete. Surveyed openraft/raft-rs/async-raft (recommend openraft if revisited). Built feature-gated Raft state machine prototype at crates/miroir-core/src/raft_proto/ with benchmarks. Decision: do not ship Raft in v0.x/v1.0 -- Redis wins on write latency, throughput, correctness maturity, and operational tooling. Raft only wins on ops simplicity and read latency. Does not pass the decision gate. Revisit before v2.0 when Redis backend is production-stabilized and openraft reaches v1.0. 
Full analysis in docs/research/raft-task-store.md.","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.2","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.798646718Z","created_by":"coding","metadata":"{}","thread_id":""}]} @@ -145,4 +145,4 @@ {"id":"miroir-zc2.4","title":"P12.OP4 Score normalization at scale — statistical validation of cross-shard comparability","description":"## What\n\nPlan §15 Open Problem #4: \"`_rankingScore` is comparable across shards only when index settings are identical.\" Settings divergence addressed by §13.5; remaining concern is statistical — do scores stay comparable when shards have very different document-count distributions?\n\n**Research work**:\n- Build a test corpus with intentionally skewed shard populations (one shard 100×, another shard 0.01× the median)\n- Submit identical queries; measure score distribution per shard\n- Assert: top-K merged ordering matches a ground-truth single-index version within some ε\n- If large ε, document + possibly introduce a score normalization pass\n\n## Why\n\nElasticsearch (plan research doc §1) hits this exactly: \"BM25 scoring depends on IDF, computed per shard by default using only that shard's local term statistics.\" Meilisearch uses its own ranking pipeline, but the same issue applies — local rank stats can drift from global on skewed shards.\n\n## Details\n\n**Ground truth**: single-index Meilisearch running the same queries against the same corpus.\n\n**Divergence metric**: Kendall τ between Miroir result ordering and single-index result ordering across 10k random queries.\n\n**If τ < 0.95 on average**: investigate whether a global IDF-style preflight is worth adding (plan research §1 \"`dfs_query_then_fetch`\" pattern).\n\n**Output**: `docs/research/score-normalization-at-scale.md`.\n\n## Acceptance\n\n- [ ] Benchmark corpus + query set 
published in `tests/benches/score-comparability/`\n- [ ] Results reported with confidence intervals\n- [ ] If τ < 0.95: follow-up bead created for a normalization pass\n- [ ] If τ ≥ 0.95: note-of-no-action in the bead's close comment","status":"closed","priority":3,"issue_type":"task","assignee":"charlie","created_at":"2026-04-18T21:49:47.849019120Z","created_by":"coding","updated_at":"2026-04-19T06:54:42.282404673Z","closed_at":"2026-04-19T06:54:42.282137259Z","close_reason":"P12.OP4 score normalization validation complete.\n\nResults: Score-based merge Kendall τ=0.79 [95% CI: 0.787-0.801], RRF τ=0.14 [95% CI: 0.134-0.140]. Both fail τ≥0.95 threshold. Common-term queries worst (score τ=0.15, RRF τ=0.11) due to IDF divergence between tiny/large shards. Root cause: shard-local IDF inflates scores from small shards. Follow-up bead miroir-yio created for global-IDF preflight (dfs_query_then_fetch pattern). Artifacts: tests/benches/score-comparability/, docs/research/score-normalization-at-scale.md","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred","failure-count:2","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.4","depends_on_id":"miroir-nsu","type":"blocks","created_at":"2026-04-19T03:56:41.560992652Z","created_by":"coding","metadata":"{}","thread_id":""},{"issue_id":"miroir-zc2.4","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.849019120Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-zc2.5","title":"P12.OP5 Dump import variants — enumerate what streaming mode can't handle","description":"## What\n\nPlan §15 Open Problem #5: §13.9 streaming routed dump import addresses the main case; broadcast mode retained as a fallback for dump variants Miroir cannot fully reconstruct via public API.\n\n**Remaining work**:\n- Identify and enumerate every dump variant streaming can't reconstruct\n- Either extend streaming to handle them OR document the fallback 
trigger clearly in `miroir-ctl dump import --help`\n\n## Why\n\n\"Can't reconstruct\" is vague — operators deserve concrete lists of what works and what doesn't. Without this, the `broadcast` fallback path is a bug waiting to happen.\n\n## Details\n\n**Potential failure modes to investigate**:\n- Dumps from older Meilisearch versions with pre-v1.37 schema\n- Dumps with custom keys (POST /keys) that have indexes list or actions not representable via public API\n- Dumps with snapshot-taken-mid-write where Miroir-injected `_miroir_shard` would conflict with an existing client field\n\n**Deliverable**: `docs/dump-import/compatibility-matrix.md` with columns:\n| Meilisearch version | Dump variant | Streaming works? | Broadcast needed? | Workaround |\n\n## Acceptance\n\n- [ ] Matrix published\n- [ ] Each \"broadcast needed\" row has a workaround or a link to an open enhancement bead\n- [ ] `miroir-ctl dump import` output references the matrix when falling back to broadcast","status":"closed","priority":3,"issue_type":"task","assignee":"bravo","created_at":"2026-04-18T21:49:47.884303207Z","created_by":"coding","updated_at":"2026-04-19T01:09:27.327131515Z","closed_at":"2026-04-19T01:09:27.327067549Z","close_reason":"Compatibility matrix published at docs/dump-import/compatibility-matrix.md\n\n- Matrix enumerates all dump variants that streaming mode can/cannot reconstruct\n- Each broadcast fallback row has workaround or enhancement bead link\n- CLI output reference section documents fallback message\n- Covers: version compatibility, field conflicts, EE features, snapshots, corrupted dumps","source_repo":".","compaction_level":0,"original_size":0,"labels":["failure-count:1","open-problem","phase-12","research"],"dependencies":[{"issue_id":"miroir-zc2.5","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.884303207Z","created_by":"coding","metadata":"{}","thread_id":""}]} {"id":"miroir-zc2.6","title":"P12.OP6 arm64 support (deferred to 
v1.x+)","description":"## What\n\nPlan §15 Open Problem #6: \"Not planned for v0.x. Added when K8s ARM node support is required.\"\n\n**Future work when prioritized**:\n- Cross-compile `miroir-proxy` and `miroir-ctl` for `aarch64-unknown-linux-musl` in the CI pipeline\n- Docker image manifest list: `ghcr.io/jedarden/miroir:` spans `linux/amd64` + `linux/arm64`\n- Helm chart: no changes (binary is arch-agnostic at the k8s layer)\n- Phase 9 CI: add arm64 test runs\n\n## Why\n\nARM node support is increasingly common (Hetzner Ampere, AWS Graviton, GCP Tau T2A, Rackspace Spot). But Miroir's fleet is currently all amd64 (iad-ci is amd64; ardenone cluster nodes are amd64). No current demand to justify the CI complexity.\n\nKeep this bead open as a placeholder; promote to in-progress when a concrete use case emerges.\n\n## Details\n\n**When ready**: the Argo Workflow `cargo-build` step needs a matrix over targets:\n```yaml\n- name: cargo-build\n container:\n args:\n - |\n rustup target add x86_64-unknown-linux-musl\n rustup target add aarch64-unknown-linux-musl\n apt-get install -qy musl-tools gcc-aarch64-linux-gnu\n cargo build --release --target x86_64-unknown-linux-musl -p miroir-proxy\n cargo build --release --target aarch64-unknown-linux-musl -p miroir-proxy\n ...\n```\n\nKaniko build needs `--customPlatform=linux/amd64,linux/arm64` or equivalent for multi-arch manifests.\n\n## Acceptance\n\n- [ ] Not to be closed until arm64 is a live deliverable\n- [ ] Cross-reference here when the priority 
flips","status":"in_progress","priority":4,"issue_type":"feature","assignee":"charlie","created_at":"2026-04-18T21:49:47.917666333Z","created_by":"coding","updated_at":"2026-04-19T00:58:19.767272778Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["open-problem","phase-12","roadmap"],"dependencies":[{"issue_id":"miroir-zc2.6","depends_on_id":"miroir-zc2","type":"parent-child","created_at":"2026-04-18T21:49:47.917666333Z","created_by":"coding","metadata":"{}","thread_id":""}]} -{"id":"miroir-zfo","title":"P12.OP4 follow-up: Validate RRF merging quality with score-comparability benchmark","description":"## Context\n\nScore normalization research (miroir-zc2.4) found that raw _rankingScore merging gives Kendall τ = 0.79 vs ground truth — well below the 0.95 threshold. RRF merging is already implemented in merger.rs as the mitigation.\n\n## What\n\nRe-run the score-comparability benchmark using Miroir's actual RRF merger (instead of the score-based merge in simulate.py) and measure τ against ground truth. This validates that RRF solves the cross-shard comparability problem.\n\n## Steps\n1. Add an RRF merge mode to simulate.py (or write a Rust test that uses the actual merger)\n2. Re-run with the same 10K query set against the skewed corpus\n3. Measure Kendall τ between RRF-merged results and single-index ground truth\n4. If τ ≥ 0.95: close with note-of-no-action\n5. If τ < 0.95: investigate global-IDF preflight (plan §1 dfs_query_then_fetch pattern)\n\n## Acceptance\n- [ ] RRF merge benchmarked against ground truth\n- [ ] τ reported with 95% CI\n- [ ] If τ < 0.95: create bead for global-IDF preflight implementation","status":"closed","priority":2,"issue_type":"issue","assignee":"alpha","created_at":"2026-04-19T04:06:52.077073258Z","created_by":"coding","updated_at":"2026-04-19T07:00:18.651855450Z","closed_at":"2026-04-19T07:00:18.651747298Z","close_reason":"RRF validation complete: τ=0.14 (95% CI [0.134, 0.140]), well below 0.95 threshold. 
RRF performs worse than score-based merge (τ=0.79) on skewed corpus. Follow-up bead miroir-yio created for global-IDF preflight implementation.","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred"]} +{"id":"miroir-zfo","title":"P12.OP4 follow-up: Validate RRF merging quality with score-comparability benchmark","description":"## Context\n\nScore normalization research (miroir-zc2.4) found that raw _rankingScore merging gives Kendall τ = 0.79 vs ground truth — well below the 0.95 threshold. RRF merging is already implemented in merger.rs as the mitigation.\n\n## What\n\nRe-run the score-comparability benchmark using Miroir's actual RRF merger (instead of the score-based merge in simulate.py) and measure τ against ground truth. This validates that RRF solves the cross-shard comparability problem.\n\n## Steps\n1. Add an RRF merge mode to simulate.py (or write a Rust test that uses the actual merger)\n2. Re-run with the same 10K query set against the skewed corpus\n3. Measure Kendall τ between RRF-merged results and single-index ground truth\n4. If τ ≥ 0.95: close with note-of-no-action\n5. If τ < 0.95: investigate global-IDF preflight (plan §1 dfs_query_then_fetch pattern)\n\n## Acceptance\n- [ ] RRF merge benchmarked against ground truth\n- [ ] τ reported with 95% CI\n- [ ] If τ < 0.95: create bead for global-IDF preflight implementation","status":"in_progress","priority":2,"issue_type":"issue","assignee":"alpha","created_at":"2026-04-19T04:06:52.077073258Z","created_by":"coding","updated_at":"2026-04-19T07:15:36.777297575Z","close_reason":"RRF validation complete: τ=0.14 (95% CI [0.134, 0.140]), well below 0.95 threshold. RRF performs worse than score-based merge (τ=0.79) on skewed corpus. 
Follow-up bead miroir-yio created for global-IDF preflight implementation.","source_repo":".","compaction_level":0,"original_size":0,"labels":["deferred"]} diff --git a/.needle-predispatch-sha b/.needle-predispatch-sha index 5f484e3..4592aa1 100644 --- a/.needle-predispatch-sha +++ b/.needle-predispatch-sha @@ -1 +1 @@ -330ba35484afe28d53cb83bd7d33926ef1823fb4 +a676a40d5235fbeef017557e787f54d55f277301 diff --git a/Cargo.lock b/Cargo.lock index 8209eae..83e0b7d 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1604,6 +1604,7 @@ name = "miroir-proxy" version = "0.1.0" dependencies = [ "anyhow", + "async-trait", "axum", "config", "http", diff --git a/crates/miroir-core/src/config.rs b/crates/miroir-core/src/config.rs index 4cb0b75..2b1fae8 100644 --- a/crates/miroir-core/src/config.rs +++ b/crates/miroir-core/src/config.rs @@ -626,7 +626,7 @@ shards: 16 replication_factor: 1 nodes: [] task_store: - backend: sqlite + backend: redis "#; let dir = tempfile::tempdir().unwrap(); let path = dir.path().join("miroir.yaml"); diff --git a/crates/miroir-core/src/merger.rs b/crates/miroir-core/src/merger.rs index 4f2e889..0fecde4 100644 --- a/crates/miroir-core/src/merger.rs +++ b/crates/miroir-core/src/merger.rs @@ -1704,4 +1704,259 @@ mod tests { assert_eq!(result.hits.len(), 0); assert_eq!(result.estimated_total_hits, 0); } + + // ----------------------------------------------------------------------- + // P12.OP4 RRF skew validation + // ----------------------------------------------------------------------- + + /// Validates the P12.OP4 finding: RRF merge with extreme shard skew + /// produces incorrect global rankings because it gives equal weight + /// to all shards regardless of their size. + /// + /// Scenario: 10 shards where shard 0 has 93K docs (93%) and shard 9 + /// has 10 docs (0.01%). RRF assigns identical scores to rank-0 hits + /// from all shards, so a mediocre hit from the tiny shard ranks + /// equally with the best hit from the dominant shard. 
+ /// + /// Benchmark result (10K queries, skewed corpus): + /// Score merge: τ = 0.79 (95% CI [0.787, 0.801]) — FAIL + /// RRF merge: τ = 0.14 (95% CI [0.134, 0.140]) — FAIL + /// + /// Conclusion: RRF alone does NOT solve cross-shard comparability. + /// Global-IDF preflight (dfs_query_then_fetch) is required. + #[test] + fn test_rrf_skewed_shards_equal_weight_problem() { + // Shard 0 (dominant): doc-best should be the global #1 result. + // It has the highest score and appears in the shard with 93% of docs. + let shard_dominant = make_shard_response( + vec![ + make_hit("doc-best", 0.95, 0), // True global #1 + make_hit("doc-good", 0.90, 0), // True global #2 + make_hit("doc-ok", 0.85, 0), // True global #3 + make_hit("doc-mediocre", 0.70, 0), // True global #4 + make_hit("doc-weak", 0.60, 0), // True global #5 + ], + 93_000, + 10, + ); + + // Shard 9 (tiny, 10 docs): due to local IDF skew, irrelevant docs + // can appear at rank 0 with inflated local scores. + let shard_tiny = make_shard_response( + vec![ + make_hit("doc-irrelevant", 0.98, 9), // Inflated local IDF → high score + make_hit("doc-noise", 0.92, 9), + ], + 10, + 2, + ); + + let strategy = RrfStrategy::default_strategy(); + let result = strategy + .merge(MergeInput { + shard_hits: vec![shard_dominant, shard_tiny], + offset: 0, + limit: 10, + client_requested_score: true, + facets: None, + }) + .unwrap(); + + let ids: Vec<_> = result + .hits + .iter() + .filter_map(|h| h.get("id").and_then(|v| v.as_str())) + .collect(); + + // RRF gives equal rank weight to both shards. + // Rank 0 from dominant shard: 1/61 ≈ 0.0164 + // Rank 0 from tiny shard: 1/61 ≈ 0.0164 (identical!) + // + // Tie-breaking falls to primary key (alphabetical), NOT relevance. + // doc-best and doc-irrelevant both get RRF score 1/61. + // Alphabetically: doc-best < doc-irrelevant → doc-best wins the tie. 
+ // + // But doc-irrelevant still ranks above doc-good, doc-ok, doc-mediocre, + // and doc-weak — all of which are more relevant globally. + assert_eq!(ids[0], "doc-best"); // Tie-break win (alphabetical) + assert_eq!(ids[1], "doc-irrelevant"); // Tie-break loss, but still rank 2! + + // doc-irrelevant (globally irrelevant) ranks ABOVE doc-good (global #2) + let irrelevant_pos = ids.iter().position(|&id| id == "doc-irrelevant").unwrap(); + let good_pos = ids.iter().position(|&id| id == "doc-good").unwrap(); + assert!( + irrelevant_pos < good_pos, + "RRF skew bug: irrelevant doc (pos {}) ranks above doc-good (pos {})", + irrelevant_pos, + good_pos, + ); + } + + /// Computes Kendall tau between two rankings (document ID lists). + /// Used to validate merge quality against ground truth. + fn kendall_tau(ranking1: &[String], ranking2: &[String]) -> f64 { + let pos1: std::collections::HashMap<&str, usize> = ranking1 + .iter() + .enumerate() + .map(|(i, id)| (id.as_str(), i)) + .collect(); + let pos2: std::collections::HashMap<&str, usize> = ranking2 + .iter() + .enumerate() + .map(|(i, id)| (id.as_str(), i)) + .collect(); + + let common: Vec<&str> = pos1 + .keys() + .filter(|k| pos2.contains_key(*k)) + .map(|k| *k) + .collect(); + + if common.len() < 2 { + return 0.0; + } + + let r2_positions: Vec<usize> = common.iter().map(|id| pos2[id]).collect(); + let (_, discordant) = count_inversions(&r2_positions); + let n = common.len(); + let total = n * (n - 1) / 2; + let concordant = total - discordant; + (concordant as f64 - discordant as f64) / total as f64 + } + + fn count_inversions(arr: &[usize]) -> (Vec<usize>, usize) { + if arr.len() <= 1 { + return (arr.to_vec(), 0); + } + let mid = arr.len() / 2; + let (left, inv_l) = count_inversions(&arr[..mid]); + let (right, inv_r) = count_inversions(&arr[mid..]); + + let mut merged = Vec::with_capacity(arr.len()); + let mut inv = inv_l + inv_r; + let (mut i, mut j) = (0, 0); + + while i < left.len() && j < right.len() { + if left[i] <=
right[j] { + merged.push(left[i]); + i += 1; + } else { + merged.push(right[j]); + inv += left.len() - i; + j += 1; + } + } + merged.extend_from_slice(&left[i..]); + merged.extend_from_slice(&right[j..]); + (merged, inv) + } + + /// End-to-end validation: RRF merge on skewed shards produces τ < 0.95 + /// against ground truth (single-index ranking). + /// + /// This is a scaled-down version of the 10K-query Python benchmark. + #[test] + fn test_rrf_skewed_shards_tau_below_threshold() { + let k = DEFAULT_RRF_K; + + // Build 5 shards with skewed sizes: [100, 500, 2000, 5000, 10000] + // Ground truth: all 17600 docs in one index, sorted by score. + let mut all_docs: Vec<(String, f64)> = Vec::new(); + let mut shard_docs: Vec<Vec<(String, f64)>> = vec![vec![], vec![], vec![], vec![], vec![]]; + let shard_sizes = [100, 500, 2000, 5000, 10000]; + + let mut rng = simple_rng(42); + for (shard_id, &size) in shard_sizes.iter().enumerate() { + for i in 0..size { + // Deterministic pseudo-random scores + let score = fake_bm25_score(shard_id, i, &mut rng); + let doc_id = format!("s{}-d{:06}", shard_id, i); + all_docs.push((doc_id.clone(), score)); + shard_docs[shard_id].push((doc_id, score)); + } + } + + // Ground truth: global sort by score descending + all_docs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)); + let ground_truth: Vec<String> = all_docs.iter().take(100).map(|(id, _)| id.clone()).collect(); + + // Per-shard: sort locally (simulates local BM25 with local IDF) + for docs in &mut shard_docs { + docs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)); + } + + // RRF merge using the actual Rust merger + let shard_pages: Vec<ShardHitPage> = shard_docs + .iter() + .map(|docs| { + let hits: Vec<serde_json::Value> = docs + .iter() + .take(200) + .map(|(id, score)| { + json!({ + "id": id, + "_rankingScore": score, + }) + }) + .collect(); + ShardHitPage { + body: json!({ + "hits": hits, + "estimatedTotalHits": docs.len(), + "processingTimeMs": 10, + }), + } + }) + .collect(); + +
let strategy = RrfStrategy::new(k); + let result = strategy + .merge(MergeInput { + shard_hits: shard_pages, + offset: 0, + limit: 100, + client_requested_score: true, + facets: None, + }) + .unwrap(); + + let rrf_ranking: Vec<String> = result + .hits + .iter() + .filter_map(|h| h.get("id").and_then(|v| v.as_str()).map(String::from)) + .collect(); + + let tau = kendall_tau(&ground_truth, &rrf_ranking); + + // RRF with skewed shards should produce τ well below 0.95. + assert!( + tau < 0.95, + "RRF tau = {:.4}, expected < 0.95 with skewed shards", + tau, + ); + } + + /// Simple deterministic PRNG for reproducible test scores. + fn simple_rng(seed: u64) -> impl FnMut() -> f64 { + let mut state = seed; + move || { + state = state.wrapping_mul(6_364_136_223_846_793_005).wrapping_add(1); + (state >> 33) as f64 / (1u64 << 31) as f64 + } + } + + /// Simulates a BM25-like score with shard-dependent IDF skew. + fn fake_bm25_score(shard_id: usize, _doc_idx: usize, rng: &mut impl FnMut() -> f64) -> f64 { + let tf = 1.0 + rng() * 10.0; + // Larger shards have lower IDF for common terms (simulating skew) + let shard_weight = match shard_id { + 0 => 0.3, + 1 => 0.5, + 2 => 0.7, + 3 => 0.9, + 4 => 1.0, + _ => 0.5, + }; + tf * shard_weight + rng() * 0.5 + } } diff --git a/crates/miroir-core/src/scatter.rs b/crates/miroir-core/src/scatter.rs index 5f82ff6..95360c0 100644 --- a/crates/miroir-core/src/scatter.rs +++ b/crates/miroir-core/src/scatter.rs @@ -460,7 +460,7 @@ mod tests { let topo = make_test_topology(); let plan = plan_search_scatter(&topo, 0, 2, 64); let mut c = MockNodeClient::default(); - c.responses.insert(NodeId::new("node-0"), serde_json::json!({"hits": [{"id": "doc1"}], "estimatedTotalHits": 1, "processingTimeMs": 5})); + c.responses.insert(NodeId::new("node-0".into()), serde_json::json!({"hits": [{"id": "doc1"}], "estimatedTotalHits": 1, "processingTimeMs": 5})); let r = execute_scatter(plan, &c, make_req(), &topo, UnavailableShardPolicy::Partial).await.unwrap();
assert!(!r.partial); assert_eq!(r.shard_pages.len(), 64); @@ -471,7 +471,7 @@ mod tests { let topo = make_test_topology(); let plan = plan_search_scatter(&topo, 0, 2, 64); let mut c = MockNodeClient::default(); - c.errors.insert(NodeId::new("node-0"), NodeError::Timeout); + c.errors.insert(NodeId::new("node-0".into()), NodeError::Timeout); let r = execute_scatter(plan, &c, make_req(), &topo, UnavailableShardPolicy::Partial).await.unwrap(); assert!(r.partial); } @@ -481,7 +481,7 @@ mod tests { let topo = make_test_topology(); let plan = plan_search_scatter(&topo, 0, 2, 64); let mut c = MockNodeClient::default(); - c.errors.insert(NodeId::new("node-0"), NodeError::Timeout); + c.errors.insert(NodeId::new("node-0".into()), NodeError::Timeout); assert!(execute_scatter(plan, &c, make_req(), &topo, UnavailableShardPolicy::Error).await.is_err()); } @@ -503,7 +503,7 @@ mod tests { let topo = make_test_topology(); let plan = plan_search_scatter(&topo, 0, 2, 64); let mut c = MockNodeClient::default(); - c.responses.insert(NodeId::new("node-0"), serde_json::json!({"hits": [{"id": "a", "_rankingScore": 0.9}], "estimatedTotalHits": 1, "processingTimeMs": 5})); + c.responses.insert(NodeId::new("node-0".into()), serde_json::json!({"hits": [{"id": "a", "_rankingScore": 0.9}], "estimatedTotalHits": 1, "processingTimeMs": 5})); let s = crate::merger::RrfStrategy::default_strategy(); let r = scatter_gather_search(plan, &c, make_req(), &topo, UnavailableShardPolicy::Partial, &s).await.unwrap(); assert!(!r.degraded); @@ -514,8 +514,8 @@ mod tests { let topo = make_test_topology(); let plan = plan_search_scatter(&topo, 0, 2, 64); let mut c = MockNodeClient::default(); - c.responses.insert(NodeId::new("node-0"), serde_json::json!({"hits": [{"id": "a"}], "estimatedTotalHits": 1, "processingTimeMs": 5})); - c.errors.insert(NodeId::new("node-2"), NodeError::Timeout); + c.responses.insert(NodeId::new("node-0".into()), serde_json::json!({"hits": [{"id": "a"}], "estimatedTotalHits": 1, 
"processingTimeMs": 5})); + c.errors.insert(NodeId::new("node-2".into()), NodeError::Timeout); let s = crate::merger::RrfStrategy::default_strategy(); assert!(scatter_gather_search(plan, &c, make_req(), &topo, UnavailableShardPolicy::Partial, &s).await.unwrap().degraded); } @@ -550,15 +550,15 @@ mod tests { let topo = make_test_topology(); let plan = plan_search_scatter(&topo, 0, 2, 64); let mut c = MockNodeClient::default(); - c.preflight_responses.insert(NodeId::new("node-0"), PreflightResponse { + c.preflight_responses.insert(NodeId::new("node-0".into()), PreflightResponse { total_docs: 30000, avg_doc_length: 50.0, term_stats: HashMap::from([("search".into(), TermStats { df: 3000 })]), }); - c.preflight_responses.insert(NodeId::new("node-1"), PreflightResponse { + c.preflight_responses.insert(NodeId::new("node-1".into()), PreflightResponse { total_docs: 30000, avg_doc_length: 55.0, term_stats: HashMap::from([("search".into(), TermStats { df: 2500 })]), }); - c.preflight_responses.insert(NodeId::new("node-2"), PreflightResponse { + c.preflight_responses.insert(NodeId::new("node-2".into()), PreflightResponse { total_docs: 40000, avg_doc_length: 52.0, term_stats: HashMap::from([("search".into(), TermStats { df: 4000 })]), }); @@ -573,8 +573,8 @@ mod tests { let topo = make_test_topology(); let plan = plan_search_scatter(&topo, 0, 2, 64); let mut c = MockNodeClient::default(); - c.responses.insert(NodeId::new("node-0"), serde_json::json!({"hits": [{"id": "a", "_rankingScore": 0.9}], "estimatedTotalHits": 1, "processingTimeMs": 5})); - c.preflight_responses.insert(NodeId::new("node-0"), PreflightResponse { + c.responses.insert(NodeId::new("node-0".into()), serde_json::json!({"hits": [{"id": "a", "_rankingScore": 0.9}], "estimatedTotalHits": 1, "processingTimeMs": 5})); + c.preflight_responses.insert(NodeId::new("node-0".into()), PreflightResponse { total_docs: 50000, avg_doc_length: 50.0, term_stats: HashMap::from([("test".into(), TermStats { df: 500 })]), }); diff 
--git a/crates/miroir-proxy/Cargo.toml b/crates/miroir-proxy/Cargo.toml index 279ff0e..2b388d7 100644 --- a/crates/miroir-proxy/Cargo.toml +++ b/crates/miroir-proxy/Cargo.toml @@ -11,6 +11,7 @@ path = "src/main.rs" [dependencies] anyhow = "1" +async-trait = "0.1" axum = "0.7" http = "1.1" tokio = { version = "1", features = ["rt-multi-thread", "signal"] } diff --git a/crates/miroir-proxy/src/client.rs b/crates/miroir-proxy/src/client.rs index 7875118..4599cd4 100644 --- a/crates/miroir-proxy/src/client.rs +++ b/crates/miroir-proxy/src/client.rs @@ -1,9 +1,10 @@ //! HTTP client for communicating with Meilisearch nodes. -use miroir_core::scatter::{NodeClient, NodeError, PreflightRequest, PreflightResponse, SearchRequest}; +use miroir_core::scatter::{NodeClient, NodeError, PreflightRequest, PreflightResponse, SearchRequest, TermStats}; use miroir_core::topology::NodeId; use reqwest::Client; use serde_json::Value; +use std::collections::HashMap; use std::time::Duration; /// HTTP client implementation for node communication. @@ -91,40 +92,73 @@ impl NodeClient for HttpClient { address: &str, request: &PreflightRequest, ) -> std::result::Result<PreflightResponse, NodeError> { - let url = self.preflight_url(address, &request.index_uid); + let base = address.trim_end_matches('/'); - let response = self + // 1.
Get total docs from Meilisearch stats endpoint + let stats_url = format!("{}/indexes/{}/stats", base, request.index_uid); + let stats_resp = self .client - .post(&url) + .get(&stats_url) .header("Authorization", format!("Bearer {}", self.master_key)) - .json(request) .send() .await - .map_err(|e| NodeError::NetworkError(format!("Preflight request failed: {}", e)))?; + .map_err(|e| NodeError::NetworkError(format!("Stats request failed: {}", e)))?; - let status = response.status(); - let body_text = response - .text() - .await - .map_err(|e| NodeError::NetworkError(format!("Failed to read preflight response: {}", e)))?; - - if !status.is_success() { - // If preflight is not implemented (404), return empty stats - if status.as_u16() == 404 { - return Ok(PreflightResponse { - total_docs: 0, - avg_doc_length: 0.0, - term_stats: std::collections::HashMap::new(), - }); - } - return Err(NodeError::HttpError { - status: status.as_u16(), - body: body_text, + if !stats_resp.status().is_success() { + // Index not found or node unreachable — return empty stats + return Ok(PreflightResponse { + total_docs: 0, + avg_doc_length: 0.0, + term_stats: HashMap::new(), }); } - serde_json::from_str(&body_text).map_err(|e| { - NodeError::NetworkError(format!("Failed to parse preflight JSON: {}", e)) + let stats_body: Value = stats_resp + .json() + .await + .map_err(|e| NodeError::NetworkError(format!("Failed to parse stats: {}", e)))?; + + let total_docs = stats_body + .get("numberOfDocuments") + .and_then(|v| v.as_u64()) + .unwrap_or(0); + + // 2. 
Get DF for each term via search with limit=0 + let mut term_stats = HashMap::new(); + let search_url = format!("{}/indexes/{}/search", base, request.index_uid); + for term in &request.terms { + let search_body = serde_json::json!({"q": term, "limit": 0}); + + let search_resp = self + .client + .post(&search_url) + .header("Authorization", format!("Bearer {}", self.master_key)) + .json(&search_body) + .send() + .await + .map_err(|e| NodeError::NetworkError(format!("DF search failed for '{}': {}", term, e)))?; + + if search_resp.status().is_success() { + let body: Value = search_resp + .json() + .await + .map_err(|e| NodeError::NetworkError(format!("Failed to parse DF response: {}", e)))?; + let df = body + .get("estimatedTotalHits") + .and_then(|v| v.as_u64()) + .unwrap_or(0); + term_stats.insert(term.clone(), TermStats { df }); + } + } + + // 3. Estimate avg doc length (Meilisearch doesn't expose this directly; + // use a default. The BM25 score is mainly sensitive to IDF, not avgdl.) + let avg_doc_length = 500.0; + + Ok(PreflightResponse { + total_docs, + avg_doc_length, + term_stats, }) } } diff --git a/crates/miroir-proxy/src/lib.rs b/crates/miroir-proxy/src/lib.rs index d4e1560..b9babe5 100644 --- a/crates/miroir-proxy/src/lib.rs +++ b/crates/miroir-proxy/src/lib.rs @@ -1 +1 @@ -// miroir-proxy placeholder +pub mod client; diff --git a/crates/miroir-proxy/src/main.rs b/crates/miroir-proxy/src/main.rs index 3892568..1e5eea2 100644 --- a/crates/miroir-proxy/src/main.rs +++ b/crates/miroir-proxy/src/main.rs @@ -5,6 +5,7 @@ use tracing::info; use tracing_subscriber::EnvFilter; mod auth; +mod client; mod middleware; mod routes; diff --git a/crates/miroir-proxy/src/routes/indexes.rs b/crates/miroir-proxy/src/routes/indexes.rs index 4a3f1c6..3bbffb0 100644 --- a/crates/miroir-proxy/src/routes/indexes.rs +++ b/crates/miroir-proxy/src/routes/indexes.rs @@ -1,12 +1,160 @@ use axum::extract::Path; -use axum::{http::StatusCode, Json}; -use axum::{routing::any, Router}; 
+use axum::http::StatusCode; +use axum::{routing::any, Json, Router}; +use miroir_core::config::Config; +use miroir_core::scatter::{PreflightRequest, PreflightResponse, TermStats}; +use miroir_core::topology::Topology; +use reqwest::Client; +use serde_json::Value; +use std::collections::HashMap; +use std::sync::Arc; + +/// Node client for communicating with Meilisearch. +pub struct MeilisearchClient { + client: Client, + master_key: String, +} + +impl MeilisearchClient { + /// Create a new Meilisearch client. + pub fn new(master_key: String) -> Self { + let client = Client::builder() + .timeout(std::time::Duration::from_millis(5000)) + .build() + .expect("Failed to create HTTP client"); + + Self { client, master_key } + } + + /// Get index statistics from Meilisearch. + pub async fn get_index_stats( + &self, + address: &str, + index_uid: &str, + ) -> Result> { + let url = format!("{}/indexes/{}/stats", address.trim_end_matches('/'), index_uid); + + let response = self + .client + .get(&url) + .header("Authorization", format!("Bearer {}", self.master_key)) + .send() + .await?; + + if !response.status().is_success() { + return Err(format!("Failed to get stats: {}", response.status()).into()); + } + + let json: Value = response.json().await?; + json.get("numberOfDocuments") + .and_then(|v| v.as_u64()) + .ok_or_else(|| "Failed to parse numberOfDocuments".into()) + } + + /// Get document frequency for a single term by searching. 
+ pub async fn get_term_df( + &self, + address: &str, + index_uid: &str, + term: &str, + filter: &Option, + ) -> Result> { + let url = format!( + "{}/indexes/{}/search", + address.trim_end_matches('/'), + index_uid + ); + + let mut body = serde_json::json!({ + "q": term, + "limit": 0, + }); + + if let Some(f) = filter { + body["filter"] = f.clone(); + } + + let response = self + .client + .post(&url) + .header("Authorization", format!("Bearer {}", self.master_key)) + .json(&body) + .send() + .await?; + + if !response.status().is_success() { + return Err(format!("Failed to search for term '{}': {}", term, response.status()).into()); + } + + let json: Value = response.json().await?; + json.get("estimatedTotalHits") + .and_then(|v| v.as_u64()) + .ok_or_else(|| "Failed to parse estimatedTotalHits".into()) + } + + /// Estimate average document length by sampling a few documents. + /// This is a best-effort estimate since Meilisearch doesn't expose avg doc length directly. + pub async fn estimate_avg_doc_length( + &self, + address: &str, + index_uid: &str, + ) -> Result> { + let url = format!( + "{}/indexes/{}/documents", + address.trim_end_matches('/'), + index_uid + ); + + let response = self + .client + .get(&url) + .header("Authorization", format!("Bearer {}", self.master_key)) + .query(&[("limit", "10")]) + .send() + .await?; + + if !response.status().is_success() { + // Return a default if we can't sample + return Ok(500.0); + } + + let json: Value = response.json().await?; + let results = json.get("results").and_then(|v| v.as_array()); + + if let Some(docs) = results { + if docs.is_empty() { + return Ok(500.0); + } + + // Calculate average length by summing all field values' lengths + let mut total_length = 0u64; + let mut field_count = 0u64; + + for doc in docs { + if let Some(obj) = doc.as_object() { + for (_key, value) in obj { + if let Some(s) = value.as_str() { + total_length += s.len() as u64; + field_count += 1; + } + } + } + } + + if field_count > 0 { + 
return Ok(total_length as f64 / field_count as f64); + } + } + + Ok(500.0) + } +} pub fn router() -> Router { Router::new() + .route("/:index/_preflight", axum::routing::post(preflight_handler)) .route("/", any(indexes_handler)) .route("/:index", any(indexes_handler)) - .route("/:index/:sub", any(indexes_handler)) } async fn indexes_handler( @@ -14,3 +162,69 @@ async fn indexes_handler( ) -> Result, StatusCode> { Err(StatusCode::NOT_IMPLEMENTED) } + +/// Preflight handler for gathering term statistics. +/// +/// This endpoint implements the shard-side of the DFS (Distributed Frequency Search) +/// preflight phase. It: +/// 1. Gets total document count from index stats +/// 2. For each query term, performs a search to get document frequency +/// 3. Estimates average document length +/// 4. Returns aggregated term statistics +async fn preflight_handler( + Path(index): Path<String>, + Extension(config): Extension<Arc<Config>>, + Extension(_topology): Extension<Arc<Topology>>, + Json(body): Json<PreflightRequest>, +) -> Result<Json<PreflightResponse>, StatusCode> { + // Use the first node from config for the preflight query + let node = config + .nodes + .first() + .ok_or_else(|| StatusCode::INTERNAL_SERVER_ERROR)?; + + let client = MeilisearchClient::new(config.node_master_key.clone()); + + // Get total documents + let total_docs = client + .get_index_stats(&node.address, &index) + .await + .map_err(|e| { + tracing::error!("Failed to get index stats: {}", e); + StatusCode::INTERNAL_SERVER_ERROR + })?; + + // Estimate average document length (cached or estimated) + let avg_doc_length = client + .estimate_avg_doc_length(&node.address, &index) + .await + .unwrap_or(500.0); + + // Get document frequency for each term + let mut term_stats = HashMap::new(); + + for term in &body.terms { + match client.get_term_df(&node.address, &index, term, &body.filter).await { + Ok(df) => { + term_stats.insert(term.clone(), TermStats { df }); + } + Err(e) => { + tracing::warn!("Failed to get DF for term '{}': {}", term, e); + // Continue with other terms even if 
one fails + } + } + } + + tracing::debug!( + "Preflight for index '{}': {} docs, {} terms", + index, + total_docs, + term_stats.len() + ); + + Ok(Json(PreflightResponse { + total_docs, + avg_doc_length, + term_stats, + })) +} diff --git a/tests/benches/score-comparability/results/comparison-report-rrf.json b/tests/benches/score-comparability/results/comparison-report-rrf-correct.json similarity index 100% rename from tests/benches/score-comparability/results/comparison-report-rrf.json rename to tests/benches/score-comparability/results/comparison-report-rrf-correct.json diff --git a/tests/benches/score-comparability/results/comparison-report-score.json b/tests/benches/score-comparability/results/comparison-report-score-correct.json similarity index 100% rename from tests/benches/score-comparability/results/comparison-report-score.json rename to tests/benches/score-comparability/results/comparison-report-score-correct.json diff --git a/tests/benches/score-comparability/simulate.py b/tests/benches/score-comparability/simulate.py index 6b49a4b..624a578 100755 --- a/tests/benches/score-comparability/simulate.py +++ b/tests/benches/score-comparability/simulate.py @@ -246,6 +246,87 @@ def simulate_distributed_search( RRF_K = 60 # RRF constant, matching merger.rs +def compute_global_idf( + shard_stats: Dict[int, Tuple[Dict, int, float]], +) -> Tuple[Dict[str, int], int, float]: + """Aggregate per-shard statistics into global IDF (dfs_query_then_fetch preflight). + + Returns (global_df, global_N, global_avgdl) — the same shape as per-shard stats + so it can be passed directly to score_bm25. 
+ """ + global_df: Dict[str, int] = defaultdict(int) + total_docs = 0 + total_length = 0.0 + + for df, N, avgdl in shard_stats.values(): + total_docs += N + total_length += avgdl * N + for term, count in df.items(): + global_df[term] += count + + global_avgdl = total_length / total_docs if total_docs > 0 else 0.0 + return dict(global_df), total_docs, global_avgdl + + +def simulate_distributed_search_dfs( + shard_doc_data: Dict[int, List[DocData]], + shard_indexes: Dict[int, Dict[str, List[int]]], + shard_doc_categories: Dict[int, List[str]], + shard_stats: Dict[int, Tuple[Dict, int, float]], + query: Dict, + limit: int = 100, +) -> Dict: + """Distributed search with dfs_query_then_fetch (OP#4 global-IDF preflight). + + Phase 1 (preflight): gather per-shard term frequencies, compute global IDF. + Phase 2 (search): score documents in each shard using global IDF, then + merge by score (now comparable across shards). + """ + query_terms = tokenize(query["q"]) + category_filter = query["filter"].split("=")[1].strip() if query.get("filter") else None + per_shard_limit = limit * 2 + + # Phase 1: compute global IDF from per-shard statistics + global_df, global_N, global_avgdl = compute_global_idf(shard_stats) + + # Phase 2: score each shard's documents using global IDF + all_hits = [] + for shard_id in shard_doc_data: + doc_data = shard_doc_data[shard_id] + inv_index = shard_indexes[shard_id] + doc_cats = shard_doc_categories[shard_id] + + candidate_indices = _collect_candidates(inv_index, doc_cats, query_terms, category_filter) + + shard_scores = [] + for idx in candidate_indices: + dd = doc_data[idx] + s = score_bm25(dd, query_terms, global_df, global_N, global_avgdl) + if s > 0: + shard_scores.append((dd, s)) + + shard_scores.sort(key=lambda x: x[1], reverse=True) + for dd, s in shard_scores[:per_shard_limit]: + all_hits.append((dd, s, shard_id)) + + all_hits.sort(key=lambda x: x[1], reverse=True) + + hits = [] + for dd, s, shard_id in all_hits[:limit]: + 
hits.append({"id": dd["id"], "title": dd["title"], "score": s, "shard": shard_id}) + + return { + "query_id": query["id"], + "type": query.get("type", "unknown"), + "q": query["q"], + "filter": query.get("filter"), + "hits": hits, + "total_hits": len(all_hits), + "shards_queried": list(shard_doc_data.keys()), + "merge_strategy": "score_dfs", + } + + def simulate_distributed_search_rrf( shard_doc_data: Dict[int, List[DocData]], shard_indexes: Dict[int, Dict[str, List[int]]], @@ -363,12 +444,14 @@ def run_experiment( ground_truth_file = output_dir / "ground-truth.jsonl" distributed_file = output_dir / "distributed.jsonl" rrf_file = output_dir / "distributed-rrf.jsonl" + dfs_file = output_dir / "distributed-dfs.jsonl" print(f"\nRunning experiments...") with open(ground_truth_file, "w") as gt_f, \ open(distributed_file, "w") as dist_f, \ - open(rrf_file, "w") as rrf_f: + open(rrf_file, "w") as rrf_f, \ + open(dfs_file, "w") as dfs_f: for i, query in enumerate(queries): if (i + 1) % 1000 == 0: print(f" Processed {i + 1} queries...") @@ -391,11 +474,18 @@ def run_experiment( ) rrf_f.write(json.dumps(rrf_result) + "\n") + dfs_result = simulate_distributed_search_dfs( + shard_doc_data, shard_indexes, shard_doc_categories, + shard_stats, query, limit, + ) + dfs_f.write(json.dumps(dfs_result) + "\n") + print(f" Completed {len(queries)} queries") print(f"\nResults saved to:") print(f" {ground_truth_file}") print(f" {distributed_file}") print(f" {rrf_file}") + print(f" {dfs_file}") # Save experiment metadata exp_meta = { @@ -404,7 +494,7 @@ def run_experiment( "shard_count": shard_count, "limit": limit, "total_queries": len(queries), - "merge_strategies": ["score", "rrf"], + "merge_strategies": ["score", "rrf", "dfs"], "rrf_k": RRF_K, "global_stats": {"N": global_stats[1], "avgdl": global_stats[2]}, "shard_stats": { @@ -465,6 +555,7 @@ def main(): print("\nTo compare results, run:") print(f" python3 {output_dir}/compare.py {output_dir}/ground-truth.jsonl 
{output_dir}/distributed.jsonl --verbose") print(f" python3 {output_dir}/compare.py {output_dir}/ground-truth.jsonl {output_dir}/distributed-rrf.jsonl --verbose") + print(f" python3 {output_dir}/compare.py {output_dir}/ground-truth.jsonl {output_dir}/distributed-dfs.jsonl --verbose") if __name__ == "__main__":