From 2c09312964b335f093d41ada62efc0882cb88bac Mon Sep 17 00:00:00 2001
From: jedarden
Date: Fri, 8 May 2026 15:15:21 -0400
Subject: [PATCH] chore: track beads for lab offload

Co-Authored-By: Claude Sonnet 4.6
---
 .beads/issues.jsonl | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl
index 5379086..469a394 100644
--- a/.beads/issues.jsonl
+++ b/.beads/issues.jsonl
@@ -193,8 +193,8 @@
 {"id":"miroir-m9q.6","title":"P6.6 HPA spec + prometheus-adapter + schema validation","description":"## What\n\nShip the HPA spec (plan §14.4):\n```yaml\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nspec:\n minReplicas: 2\n maxReplicas: 24\n behavior:\n scaleDown: { stabilizationWindowSeconds: 300 }\n scaleUp: { stabilizationWindowSeconds: 30 }\n metrics:\n - Resource cpu 70%\n - Resource memory 75%\n - Pods miroir_requests_in_flight AverageValue: 500\n - External miroir_background_queue_depth Value: 10\n```\n\nChart preconditions enforced via `values.schema.json`:\n- `hpa.enabled: true` requires `replicas >= 2 AND taskStore.backend: redis`\n- `prometheus-adapter` (or equivalent) as a documented prerequisite when HPA is enabled\n\n## Why\n\nPlan §14.4: \"`miroir_requests_in_flight` is **per-pod** and uses `type: Pods`. `miroir_background_queue_depth` is **global** and must use `type: External` with `type: Value`.\" Getting the metric type wrong produces a pathological HPA that monotonically scales to `maxReplicas`.\n\n## Details\n\n**Per-workload-tier min/max** (plan §14.7):\n| Peak QPS | minReplicas | maxReplicas |\n|---|---|---|\n| ≤ 500 | 2 | 3 |\n| ≤ 2k | 2 | 4 |\n| ≤ 5k | 4 | 8 |\n| ≤ 20k | 8 | 12 |\n| ≤ 100k | 12 | 24 |\n\nDefault values.yaml ships the ≤ 5k tier; operators override per workload.\n\n**prometheus-adapter config**: add a ConfigMap-defined `rules.externalMetrics` entry mapping `miroir_background_queue_depth` to the external metrics API. 
This is NOT shipped by the Miroir chart (operators install prometheus-adapter separately); the chart's `NOTES.txt` calls it out.\n\n**Stabilization windows**: scale-up fast (30s), scale-down slow (300s). Avoids pod flapping.\n\n## Acceptance\n\n- [ ] `helm lint --strict` with `hpa.enabled: true + replicas: 1` → fails with schema error\n- [ ] `helm lint --strict` with `hpa.enabled: true + replicas: 2 + backend: sqlite` → fails\n- [ ] HPA in a kind cluster: induce CPU load → scales up within 30s; load drops → scales down after 300s\n- [ ] External metric binding: `miroir_background_queue_depth` visible via `kubectl get --raw /apis/external.metrics.k8s.io/v1beta1/...`","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:40:30.676597441Z","created_by":"coding","updated_at":"2026-04-24T03:52:34.230412558Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-6"],"dependencies":[{"issue_id":"miroir-m9q.6","depends_on_id":"miroir-m9q.4","type":"blocks","created_at":"2026-04-18T21:40:36.140248526Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-m9q.6","depends_on_id":"miroir-m9q.5","type":"blocks","created_at":"2026-04-18T21:40:36.163063693Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-m9q.6","depends_on_id":"miroir-mkk","type":"blocks","created_at":"2026-04-24T03:52:34.212979405Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-m9q.6","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:34.230367115Z","created_by":"coding","thread_id":""}]} {"id":"miroir-m9q.7","title":"P6.7 Resource-pressure metrics + alerts (§14.9)","description":"## What\n\nRegister the plan §14.9 resource-pressure metrics:\n- `miroir_memory_pressure` gauge (0=ok, 1=warn >75%, 2=critical >90%)\n- 
`miroir_cpu_throttled_seconds_total` counter (cgroup throttling)\n- `miroir_request_queue_depth` gauge\n- `miroir_background_queue_depth{job_type}` gauge\n- `miroir_peer_pod_count` gauge\n- `miroir_leader` gauge\n- `miroir_owned_shards_count` gauge\n\nAnd the associated `PrometheusRule` alerts (plan §14.9).\n\n## Why\n\nThese surface under-scaling BEFORE user-visible impact. `miroir_memory_pressure` + `MiroirMemoryPressure` alert give operators (and HPA) a leading indicator instead of waiting for OOM-kill.\n\n## Details\n\n**cgroup reads**: on Linux, read `/sys/fs/cgroup/cpu.stat` (cgroup v2) or `/sys/fs/cgroup/cpu/cpu.stat` (v1) for `nr_throttled`/`throttled_time`. Convert throttled_time nanoseconds → seconds for the counter.\n\n**Memory pressure gauge**: read `/sys/fs/cgroup/memory.current` + `memory.max`; compute utilization; map to 0/1/2 per threshold.\n\n**PrometheusRule**:\n```yaml\n- alert: MiroirMemoryPressure\n expr: miroir_memory_pressure >= 2\n for: 5m\n- alert: MiroirRequestQueueBacklog\n expr: miroir_request_queue_depth > 500\n for: 2m\n- alert: MiroirBackgroundJobBacklog\n expr: miroir_background_queue_depth > 100\n for: 10m\n- alert: MiroirPeerDiscoveryGap\n expr: miroir_peer_pod_count < kube_deployment_status_replicas_ready{deployment=\"miroir\"}\n for: 2m\n- alert: MiroirNoLeader\n expr: sum(miroir_leader) == 0\n for: 1m\n```\n\n## Acceptance\n\n- [ ] All 7 metrics present on `:9090/metrics`\n- [ ] `miroir_memory_pressure` reports 2 when artificial allocation pushes RSS > 90% of limit\n- [ ] `MiroirNoLeader` fires after killing the leader without replacement within 1 min\n- [ ] `MiroirPeerDiscoveryGap` fires if headless Service 
misconfigured","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:40:30.711963985Z","created_by":"coding","updated_at":"2026-04-24T03:52:37.545683046Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-6"],"dependencies":[{"issue_id":"miroir-m9q.7","depends_on_id":"miroir-mkk","type":"blocks","created_at":"2026-04-24T03:52:37.529425637Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-m9q.7","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:37.545645558Z","created_by":"coding","thread_id":""}]} {"id":"miroir-mkk","title":"Phase 4 — Topology Operations (rebalance, add/remove node + group, drain)","description":"## Phase 4 Epic — Topology Operations\n\nMakes the cluster *elastic*: operators can add or remove nodes within a group (capacity scaling) or add/remove entire replica groups (throughput scaling) without a full reindex and without downtime.\n\n## Why This Matters\n\nPlan §2 \"Topology changes\" and §4 \"Rebalancer\" together are **the** operational differentiator. Without this phase, Miroir is a static sharder — useful but not production-grade. Elasticity is what justifies the complexity of the whole system.\n\nPlan §15 Open Problem 1 (dual-write race) is partially mitigated by careful sequencing here and fully closed by §13.8 anti-entropy in Phase 5. Getting the sequencing right here means Phase 5's reconciler is a safety net, not the primary correctness mechanism.\n\n## Scope\n\n**Node addition (within a group; plan §2 \"Adding a node\")**\n\n1. Assign new node to a group; mark `joining`\n2. Recompute assignments — ~S/(Ng+1) shards move\n3. Dual-write: new inbound writes for affected shards go to **both** old owner and new node\n4. 
Background migration per shard: `GET /indexes/{uid}/documents?filter=_miroir_shard={id}&limit=1000&offset=...` → write each page to new node\n5. Mark `active`; stop dual-write; `POST /indexes/{uid}/documents/delete` with `filter=_miroir_shard={id}` on old owner\n\n**Replica-group addition (plan §2 \"Adding a new replica group\")** — mark `initializing`, background-sync from any healthy group using the same `_miroir_shard` filter, then flip to `active` and start routing queries.\n\n**Node removal (plan §2 \"Removing a node\")** — mark `draining`, recompute, migrate ~RF/Ng fraction to survivors, mark `removed`, operator deletes PVC.\n\n**Group removal (plan §2 \"Removing a replica group\")** — mark `draining`, stop routing queries; no data migration (other groups hold the docs); decommission.\n\n**Unplanned node failure (plan §2 \"Node failure\")** — mark `failed`; surviving intra-group replicas cover if RF>1; cross-group fallback if RF=1; schedule background replication to restore RF.\n\n**Admin API** (plan §4 admin table) — `POST /_miroir/nodes`, `DELETE /_miroir/nodes/{id}`, `POST /_miroir/nodes/{id}/drain`, `POST /_miroir/rebalance`, `GET /_miroir/rebalance/status`.\n\n## Design Notes\n\n- Relies on `_miroir_shard` being `filterable` on every node — set by Phase 2 index-create broadcast\n- Only one rebalance at a time per index (advisory lock → Phase 6 Mode B leader lease)\n- Chunked migration bounded by `rebalancer.max_concurrent_migrations` (default 4) to stay under the per-pod 3.75 GB envelope\n- Migration progress reported via `GET /_miroir/rebalance/status` and `miroir_rebalance_*` metrics (§10)\n- No full-corpus scans ever — the `_miroir_shard` filter is the key primitive; any code path that enumerates \"all docs\" is a bug\n\n## Open Problem Closure\n\nPlan §15 #1 — dual-write cutover race: document the exact sequencing here and note that §13.8 anti-entropy is the guaranteed safety net on the next pass.\n\n## Definition of Done\n\n- [ ] Chaos test: add a 
node mid-indexing — every doc remains readable; no duplicates on a subsequent search\n- [ ] Chaos test: drain a node while queries are in flight — zero client-visible failures; `X-Miroir-Degraded` absent or transient only\n- [ ] Chaos test: add a replica group while queries are in flight — existing groups unaffected; new group starts serving reads only after sync completes\n- [ ] Rebalance of a 3→4 node cluster moves ≤ 2×(1/4) of docs (optimal per plan §8 benches)\n- [ ] Restart a killed node mid-rebalance — rebalance pauses + resumes; no data loss","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"epic","assignee":"","owner":"","created_at":"2026-04-18T21:19:53.993012197Z","created_by":"coding","updated_at":"2026-05-02T20:36:13.290624099Z","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase","phase-4"],"dependencies":[{"issue_id":"miroir-mkk","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-18T21:23:08.595905334Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-18T21:23:08.609300009Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-05-01T15:48:34.610807095Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"miroir-mkk.2","type":"blocks","created_at":"2026-05-01T15:48:34.625993245Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"miroir-mkk.3","type":"blocks","created_at":"2026-05-01T15:48:34.631513196Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"miroir-mkk.4","type":"blocks","created_at":"2026-05-01T15:48:34.636538410Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"miroir-mkk.5","type":"blocks","created_at":"2026-05-01T15:4
8:34.642189080Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-2w8t0","type":"parent-child","created_at":"2026-05-05T04:09:53.853356180Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-1qvil","type":"parent-child","created_at":"2026-05-05T04:09:53.862464914Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-xxs4m","type":"parent-child","created_at":"2026-05-05T04:09:53.871594204Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-4i47m","type":"parent-child","created_at":"2026-05-05T04:09:53.880767653Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-43f3b","type":"parent-child","created_at":"2026-05-05T04:09:53.890086943Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-wiuj8","type":"parent-child","created_at":"2026-05-05T04:09:53.899203897Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-5mj25","type":"parent-child","created_at":"2026-05-05T04:09:53.908340272Z","created_by":"cli","thread_id":""},{"issue_id":"miroir-mkk","depends_on_id":"bf-14xn5","type":"parent-child","created_at":"2026-05-05T04:09:53.917462529Z","created_by":"cli","thread_id":""}]} -{"id":"miroir-mkk.1","title":"P4.1 Rebalancer background worker + advisory lock","description":"## What\n\nImplement the rebalancer as a background Tokio task (plan §4 \"Rebalancer\"):\n- Advisory lock — only one Miroir instance runs the rebalancer at a time (Phase 6 §14.5 Mode B replaces with leader lease)\n- Reacts to topology change events (node add/drain/fail/recover) from the admin API + health checker\n- Computes affected shards (the `~S/(Ng+1)` or `~RF/Ng` delta) using the Phase 1 router\n- Drives the migration state machine for each affected shard\n- Updates `miroir_rebalance_in_progress`, `miroir_rebalance_documents_migrated_total`, `miroir_rebalance_duration_seconds` (plan §10)\n\n## 
Why\n\nThe rebalancer is the orchestrator of all Phase 4 operations. Everything else in this phase is a subroutine called by this worker. Keeping it as a dedicated task — rather than inline in admin handlers — means a slow migration doesn't block admin API responses and a crash restarts cleanly from the task-store state.\n\n## Details\n\n**State machine per-shard**:\n```\nIdle → DualWriteStarted → MigrationInProgress → MigrationComplete → DualWriteStopped → OldReplicaDeleted → Idle\n```\n\n**Concurrency bound**: `rebalancer.max_concurrent_migrations` (default 4) to stay within plan §14.2 memory budget for migration buffers.\n\n**Progress persistence**: per-shard cursor in `jobs` table (Phase 3) so a pod restart resumes at the last committed offset. Idempotent per primary key (same doc re-written on resume is no-op at Meilisearch level).\n\n**Cancellation**: an admin API call can pause (not delete) an in-progress rebalance; resuming picks up at the persisted cursor.\n\n## Acceptance\n\n- [ ] Advisory lock: two pods running the rebalancer simultaneously produce 0 duplicate migrations (enforced via the `leader_lease` row for scope `rebalance:`)\n- [ ] Progress persistence: kill the pod mid-migration; another takes over within lease TTL and completes without starting over\n- [ ] Metrics tick: `miroir_rebalance_documents_migrated_total` monotonically increases; `_duration_seconds` histogram records per-shard migration 
time","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:31:43.768256172Z","created_by":"coding","updated_at":"2026-04-24T03:52:35.102679261Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.1","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:35.102654477Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.1","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:35.085409777Z","created_by":"coding","thread_id":""}]} -{"id":"miroir-mkk.2","title":"P4.2 Node addition: dual-write + paginated shard migration","description":"## What\n\nImplement the node-addition flow from plan §2 \"Adding a node to an existing group\":\n1. Admin API: `POST /_miroir/nodes` body `{\"id\": \"meili-N\", \"address\": \"...\", \"replica_group\": G}`\n2. Mark `joining`\n3. Recompute assignments — `affected_shards` where `meili-N` enters the top-RF within group G\n4. **Dual-write**: new inbound writes for affected shards go to **both** old owner and new node (idempotent — Meilisearch PUT semantics handle dupes via primary key)\n5. For each affected shard, background migration via the shard-filter primitive (plan §4):\n ```\n GET /indexes/{uid}/documents?filter=_miroir_shard={shard_id}&limit=1000&offset=0\n GET /indexes/{uid}/documents?filter=_miroir_shard={shard_id}&limit=1000&offset=1000\n ... until exhausted\n ```\n6. Write each page to the new node (docs already carry `_miroir_shard`)\n7. Mark `active`; stop dual-write\n8. Delete migrated shard from old node: `POST /indexes/{uid}/documents/delete {\"filter\": \"_miroir_shard = {shard_id}\"}`\n9. 
Documents on unaffected shards never touched\n\n## Why\n\nPlan §1 principle 4 (RF-configurable redundancy) + §2 \"Three independent scaling dimensions\" depend on this. The `_miroir_shard` filter primitive is what makes migration move only `~total_docs/(N+1)` docs instead of `total_docs` — a 10–100× reduction in I/O vs. a naive \"copy everything then diff\" approach.\n\n## Details\n\n**Dual-write durability invariant**: between steps 4 and 7, every accepted write for the affected shards lands on both old and new. If dual-write is skipped while migration is running, writes arriving at that exact moment may land only on the old owner and be lost when step 8 deletes. Plan §15 Open Problem 1 is the remaining race; §13.8 anti-entropy (Phase 5) is the safety net.\n\n**Pagination cursor**: `offset` is the simplest, but Meilisearch `limit + offset` has an internal cap (default 1000 + 0 → max ~20 for safe). Configure `pagination.maxTotalHits` per-node at index creation to allow deep pagination (safe: we're just iterating our own injected shard).\n\n**Per-page batch**: `rebalancer.migration_batch_size` (default 1000) — one page read + one page write per cycle.\n\n**Fail-open behavior**: if the source node becomes unavailable mid-migration, the rebalancer pauses this shard; other shards continue. 
When source comes back, resume.\n\n## Acceptance\n\n- [ ] Integration test: 3-node → 4-node migration, 10K docs, each doc still retrievable by ID after migration\n- [ ] Chaos: toggle writes on/off during migration; dual-write window catches all late writes\n- [ ] Performance: migrating `~S/(Ng+1)` shards moves ≤ `total_docs / (Ng+1) × 1.1` docs (10% slack for dual-write dupes)\n- [ ] The old node is not queried for the migrated shards after step 8 (verified via log inspection)","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:31:43.790167851Z","created_by":"coding","updated_at":"2026-04-24T03:52:35.050747599Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.2","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:35.050706675Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.2","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-04-18T21:31:48.930624028Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.2","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:35.030328431Z","created_by":"coding","thread_id":""}]} +{"id":"miroir-mkk.1","title":"P4.1 Rebalancer background worker + advisory lock","description":"## What\n\nImplement the rebalancer as a background Tokio task (plan §4 \"Rebalancer\"):\n- Advisory lock — only one Miroir instance runs the rebalancer at a time (Phase 6 §14.5 Mode B replaces with leader lease)\n- Reacts to topology change events (node add/drain/fail/recover) from the admin API + health checker\n- Computes affected shards (the `~S/(Ng+1)` or `~RF/Ng` delta) using the Phase 1 router\n- Drives the migration state machine for each affected shard\n- Updates 
`miroir_rebalance_in_progress`, `miroir_rebalance_documents_migrated_total`, `miroir_rebalance_duration_seconds` (plan §10)\n\n## Why\n\nThe rebalancer is the orchestrator of all Phase 4 operations. Everything else in this phase is a subroutine called by this worker. Keeping it as a dedicated task — rather than inline in admin handlers — means a slow migration doesn't block admin API responses and a crash restarts cleanly from the task-store state.\n\n## Details\n\n**State machine per-shard**:\n```\nIdle → DualWriteStarted → MigrationInProgress → MigrationComplete → DualWriteStopped → OldReplicaDeleted → Idle\n```\n\n**Concurrency bound**: `rebalancer.max_concurrent_migrations` (default 4) to stay within plan §14.2 memory budget for migration buffers.\n\n**Progress persistence**: per-shard cursor in `jobs` table (Phase 3) so a pod restart resumes at the last committed offset. Idempotent per primary key (same doc re-written on resume is no-op at Meilisearch level).\n\n**Cancellation**: an admin API call can pause (not delete) an in-progress rebalance; resuming picks up at the persisted cursor.\n\n## Acceptance\n\n- [ ] Advisory lock: two pods running the rebalancer simultaneously produce 0 duplicate migrations (enforced via the `leader_lease` row for scope `rebalance:`)\n- [ ] Progress persistence: kill the pod mid-migration; another takes over within lease TTL and completes without starting over\n- [ ] Metrics tick: `miroir_rebalance_documents_migrated_total` monotonically increases; `_duration_seconds` histogram records per-shard migration time","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":0,"issue_type":"task","assignee":"claude-code-glm-4.7-kilo","owner":"","created_at":"2026-04-18T21:31:43.768256172Z","created_by":"coding","updated_at":"2026-05-05T14:55:01.871592647Z","closed_at":"2026-05-05T14:55:01.871592647Z","close_reason":"Verified rebalancer background worker meets all acceptance criteria. All tests pass. 
See notes/miroir-mkk.1.md for details.","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.1","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:35.102654477Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.1","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:35.085409777Z","created_by":"coding","thread_id":""}]} +{"id":"miroir-mkk.2","title":"P4.2 Node addition: dual-write + paginated shard migration","description":"## What\n\nImplement the node-addition flow from plan §2 \"Adding a node to an existing group\":\n1. Admin API: `POST /_miroir/nodes` body `{\"id\": \"meili-N\", \"address\": \"...\", \"replica_group\": G}`\n2. Mark `joining`\n3. Recompute assignments — `affected_shards` where `meili-N` enters the top-RF within group G\n4. **Dual-write**: new inbound writes for affected shards go to **both** old owner and new node (idempotent — Meilisearch PUT semantics handle dupes via primary key)\n5. For each affected shard, background migration via the shard-filter primitive (plan §4):\n ```\n GET /indexes/{uid}/documents?filter=_miroir_shard={shard_id}&limit=1000&offset=0\n GET /indexes/{uid}/documents?filter=_miroir_shard={shard_id}&limit=1000&offset=1000\n ... until exhausted\n ```\n6. Write each page to the new node (docs already carry `_miroir_shard`)\n7. Mark `active`; stop dual-write\n8. Delete migrated shard from old node: `POST /indexes/{uid}/documents/delete {\"filter\": \"_miroir_shard = {shard_id}\"}`\n9. Documents on unaffected shards never touched\n\n## Why\n\nPlan §1 principle 4 (RF-configurable redundancy) + §2 \"Three independent scaling dimensions\" depend on this. The `_miroir_shard` filter primitive is what makes migration move only `~total_docs/(N+1)` docs instead of `total_docs` — a 10–100× reduction in I/O vs. 
a naive \"copy everything then diff\" approach.\n\n## Details\n\n**Dual-write durability invariant**: between steps 4 and 7, every accepted write for the affected shards lands on both old and new. If dual-write is skipped while migration is running, writes arriving at that exact moment may land only on the old owner and be lost when step 8 deletes. Plan §15 Open Problem 1 is the remaining race; §13.8 anti-entropy (Phase 5) is the safety net.\n\n**Pagination cursor**: `offset` is the simplest, but Meilisearch `limit + offset` has an internal cap (default 1000 + 0 → max ~20 for safe). Configure `pagination.maxTotalHits` per-node at index creation to allow deep pagination (safe: we're just iterating our own injected shard).\n\n**Per-page batch**: `rebalancer.migration_batch_size` (default 1000) — one page read + one page write per cycle.\n\n**Fail-open behavior**: if the source node becomes unavailable mid-migration, the rebalancer pauses this shard; other shards continue. When source comes back, resume.\n\n## Acceptance\n\n- [ ] Integration test: 3-node → 4-node migration, 10K docs, each doc still retrievable by ID after migration\n- [ ] Chaos: toggle writes on/off during migration; dual-write window catches all late writes\n- [ ] Performance: migrating `~S/(Ng+1)` shards moves ≤ `total_docs / (Ng+1) × 1.1` docs (10% slack for dual-write dupes)\n- [ ] The old node is not queried for the migrated shards after step 8 (verified via log 
inspection)","design":"","acceptance_criteria":"","notes":"","status":"in_progress","priority":0,"issue_type":"task","assignee":"claude-code-glm-4.7-alpha","owner":"","created_at":"2026-04-18T21:31:43.790167851Z","created_by":"coding","updated_at":"2026-05-08T14:06:45.797380348Z","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.2","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:35.050706675Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.2","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-04-18T21:31:48.930624028Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.2","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:35.030328431Z","created_by":"coding","thread_id":""}]} {"id":"miroir-mkk.3","title":"P4.3 Node removal (drain): migrate off + delete PVC handoff","description":"## What\n\nImplement `POST /_miroir/nodes/{id}/drain` + `DELETE /_miroir/nodes/{id}` (plan §2 \"Removing a node\"):\n1. Mark `draining`; stop routing writes for its affected shards to it\n2. Recompute assignments — affected shards reassigned to surviving nodes in the same group\n3. Background migration: copy affected shards to new owners via the `_miroir_shard` filter primitive\n4. Mark `removed`\n5. `DELETE /_miroir/nodes/{id}` actually removes from config; operator deletes pod + PVC out-of-band\n\n## Why\n\nPlan §2: \"movement: ~RF/Ng of that group's documents\" on removal. The drain API decouples \"stop taking writes\" (immediate) from \"delete the pod\" (operator decision) — gives operators room to verify before committing to hardware loss.\n\n## Details\n\n**Order matters**: drain → remove. `drain` is reversible (mark `active` again); `remove` is not. 
CLI (`miroir-ctl node drain meili-2` per plan §11) should pause and await confirmation before the remove step.\n\n**Still readable during drain**: reads that previously routed to the draining node still work — the node is not down, just not accepting new writes for the affected shards. Read traffic naturally drifts to the replacement replica via Phase 1 `covering_set` intra-group rotation.\n\n**Safety check**: refuse drain if it would drop a shard below RF=1 in its group AND the group has no healthy peer group to fall back to. Require `--force` to override.\n\n**Post-drain verification**: query `GET /indexes/{uid}/documents?filter=_miroir_shard={s}&limit=1` against the drained node — should return 0 results for every shard before `remove` is permitted.\n\n## Acceptance\n\n- [ ] 3-node RF=2 group: drain node-1; searches still succeed with zero degraded responses\n- [ ] After drain completes, `GET /indexes/{uid}/documents?filter=_miroir_shard={s}&limit=1` on node-1 returns 0 for every shard\n- [ ] `remove` without prior `drain` → 409 conflict with a message pointing at `drain` first\n- [ ] `--force` drain that would drop a shard to 0 replicas surfaces a loud warning before 
proceeding","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:31:43.815997915Z","created_by":"coding","updated_at":"2026-04-24T03:52:34.994667129Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.3","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:34.994640734Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.3","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-04-18T21:31:48.943066166Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.3","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:34.978878696Z","created_by":"coding","thread_id":""}]} {"id":"miroir-mkk.4","title":"P4.4 Replica group addition: initializing → active","description":"## What\n\nImplement the \"Adding a new replica group\" flow from plan §2:\n1. Provision new nodes; assign `replica_group: G_new` in config\n2. Mark new group `initializing`; queries NOT routed here\n3. Background sync: for each shard, copy all docs from **any** healthy existing group to the new group's nodes via `filter=_miroir_shard={id}` pagination; new inbound writes already fan out to the new group immediately\n4. When all shards synced, mark group `active` — queries begin routing in round-robin\n5. Existing groups continue serving queries throughout (zero read interruption)\n\n## Why\n\nPlan §2 \"Adding a new replica group (throughput scaling)\": adding a group multiplies query capacity without touching existing groups' data. This is the primary \"we need more search QPS\" lever. 
Unlike intra-group rebalance which moves a subset, group-add **copies** every shard to the new group — so the I/O is proportional to total corpus size, not `1/(Ng+1)`.\n\n## Details\n\n**Source group selection**: round-robin across existing `active` groups to spread read load during sync. Per-shard picks a different source so one group isn't hammered.\n\n**Write fan-out during sync**: new group already receives writes from step 3 onward. This is the durability guarantee — only the backfill window of historical data is transient.\n\n**Progress tracking**: per-shard cursor in `jobs` table; can be paused/resumed per Phase 6 Mode C.\n\n**Verification before `active`**: `GET /indexes/{uid}/stats` against new group → docs count within 0.1% of source group (allows for writes landing during sync). If higher variance, delay the flip and investigate.\n\n## Acceptance\n\n- [ ] Integration test: RG=1 → RG=2; during sync, query throughput on original group unchanged (no regression)\n- [ ] After `active`, queries distribute round-robin between the two groups (verified via per-group metrics)\n- [ ] Mid-sync write test: 100 writes landing during the backfill window are all present on both groups when sync completes\n- [ ] Failed sync (source group becomes unavailable mid-copy) pauses without corrupting new group; resumes when source 
returns","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:31:43.859158013Z","created_by":"coding","updated_at":"2026-04-24T03:52:34.946295587Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.4","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:34.946268111Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.4","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-04-18T21:31:48.961576914Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.4","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:34.926855787Z","created_by":"coding","thread_id":""}]} {"id":"miroir-mkk.5","title":"P4.5 Group removal + unplanned node failure","description":"## What\n\nTwo related flows from plan §2:\n\n**Removing a replica group** (decommission a query pool):\n1. Mark group `draining` — queries stop routing immediately\n2. Nodes can be decommissioned; no data migration needed (other groups hold the docs)\n3. Remove nodes from config; operator deletes pods + PVCs\n\n**Unplanned node failure**:\n1. Health check detects failure → mark `failed`, stop routing writes to it\n2. If RF > 1 within the group: surviving replicas serve reads — no immediate migration\n3. For reads: if failed node's shards have no intra-group RF replica, fall back to a healthy group for those shards\n4. Schedule background replication to restore RF within the group; degrade to cross-group fallback until restored\n\n## Why\n\nPlan §2: \"Changes to one group do not affect other groups' data or query routing.\" Group-removal is instant (no data movement) — lets operators shed throughput capacity without a migration window. 
Unplanned node failure is the most time-sensitive case: readers must not see errors; RF-restore runs in the background.\n\n## Details\n\n**Group-removal preconditions**: refuse to remove a group if it's the last group holding a shard (would be data loss). Require `--force` and document the risk.\n\n**Failure detection**: plan §4 config:\n```yaml\nhealth:\n interval_ms: 5000\n timeout_ms: 2000\n unhealthy_threshold: 3 # 3 consecutive failures → mark degraded\n recovery_threshold: 2 # 2 consecutive OKs → mark healthy again\n```\n\n**Cross-group fallback**: Phase 1 `covering_set` already deterministic per-request; the fallback is a per-shard \"if intra-group has none, check other groups\" decision **inside** the scatter planner (Phase 2).\n\n**RF-restore**: similar to P4.2 node addition but for an existing node that lost its data — re-run `_miroir_shard` filter migration from the best intra-group source.\n\n## Acceptance\n\n- [ ] Remove a group with healthy peer groups → queries route away within one `query_seq` tick; no read errors\n- [ ] `--force`-remove the last group holding shard S → loud warning; operator must re-type the index UID to confirm\n- [ ] RF=2 group with 1 node killed → reads succeed on remaining replica; `X-Miroir-Degraded` absent\n- [ ] RF=1 group with 1 node killed → cross-group fallback kicks in; `X-Miroir-Degraded` absent if fallback succeeds\n- [ ] Restored node re-hydrates from a peer replica within its group; `miroir_rebalance_in_progress` transitions 
0→1→0","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:31:43.887649468Z","created_by":"coding","updated_at":"2026-04-24T03:52:34.891433334Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-4"],"dependencies":[{"issue_id":"miroir-mkk.5","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:34.891392278Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.5","depends_on_id":"miroir-mkk.1","type":"blocks","created_at":"2026-04-18T21:31:48.981335608Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-mkk.5","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:34.871041251Z","created_by":"coding","thread_id":""}]} @@ -218,7 +218,7 @@ {"id":"miroir-qon.5","title":"P0.5 Config struct mirroring plan §4 YAML schema","description":"## What\n\nImplement `miroir_core::config::Config` — a `serde`-derived struct tree matching the plan §4 YAML schema exactly, including the §13 advanced-capabilities sub-structs (even if defaults produce `enabled: false`).\n\n## Why\n\nFuture phases can assume a typed `Config` rather than a `HashMap`. Every feature in §13 gets a dedicated struct with its own `enabled` flag + defaults per the plan. Centralizing defaults here makes the \"dev-sized vs. 
production\" story in plan §6 enforceable by a single `Config::validate()` function.\n\n## Details\n\nCover every block in the plan §4 YAML:\n- `MiroirConfig` — master_key, node_master_key, shards, replication_factor, task_store, admin, replica_groups, nodes[], health, scatter, rebalancer, server\n- `NodeConfig` — id, address, replica_group\n- `TaskStoreConfig` — backend (sqlite|redis), path, url\n- `HealthConfig`, `ScatterConfig`, `RebalancerConfig`, `ServerConfig`\n- `ConnectionPoolConfig`, `TaskRegistryConfig`\n- All §13 blocks: `ReshardingConfig`, `HedgingConfig`, `ReplicaSelectionConfig`, `QueryPlannerConfig`, `SettingsBroadcastConfig`, `SettingsDriftCheckConfig`, `SessionPinningConfig`, `AliasesConfig`, `AntiEntropyConfig`, `DumpImportConfig`, `IdempotencyConfig`, `QueryCoalescingConfig`, `MultiSearchConfig`, `VectorSearchConfig`, `CdcConfig` (+ CdcSinkConfig + CdcBufferConfig), `TtlConfig`, `TenantAffinityConfig`, `ShadowConfig`, `IlmConfig`, `CanaryRunnerConfig`, `ExplainConfig`, `AdminUiConfig`, `SearchUiConfig` (+ auth sub-structs)\n- `PeerDiscoveryConfig`, `LeaderElectionConfig`, `HpaConfig`\n\nPlus:\n- `Config::validate()` cross-field validation (e.g., replicas > 1 requires redis)\n- Layered loading via `config` crate: file → env var overrides → command-line\n- Tests: every example in the plan deserializes without error and re-serializes to equivalent YAML\n\n## Acceptance\n\n- [ ] Full plan §4 `miroir:` block deserializes into the struct without field loss\n- [ ] Every default in the plan is reproduced when the field is absent\n- [ ] `Config::validate()` rejects every combination the Helm `values.schema.json` will reject (dev-defaults in HA mode, scoped_key timing inversion, etc.)\n- [ ] Round-trip property test: YAML → Config → YAML is equivalent under a stable 
serializer","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":0,"issue_type":"task","assignee":"alpha","owner":"","created_at":"2026-04-18T21:24:25.775002832Z","created_by":"coding","updated_at":"2026-04-19T01:52:51.379382557Z","closed_at":"2026-04-19T01:52:51.379316634Z","close_reason":"done","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["deferred","failure-count:1","phase-0"],"dependencies":[{"issue_id":"miroir-qon.5","depends_on_id":"miroir-qon","type":"parent-child","created_at":"2026-04-18T21:24:25.775002832Z","created_by":"coding","thread_id":""}]} {"id":"miroir-qon.6","title":"P0.6 Repo hygiene: LICENSE, CHANGELOG skeleton, .gitignore, README stub","description":"## What\n\n- `LICENSE` — MIT, per plan §12\n- `CHANGELOG.md` — Keep a Changelog 1.1.0 format skeleton with `[Unreleased]` section\n- `.gitignore` — Rust (`target/`, `Cargo.lock` NOT ignored for binary crates), editor junk (`.vscode/`, `.idea/`)\n- `README.md` is already present — leave untouched for now; Phase 11 fills it in\n\n## Why\n\nPlan §12 explicitly requires MIT. 
Plan §7 \"CI release step extracts the relevant section automatically\" from CHANGELOG.md using an `awk` parser that expects `## []` section headers — the format must match from day 1 or the first release will fail.\n\n## Details\n\nSample CHANGELOG skeleton:\n```markdown\n# Changelog\n\nAll notable changes to this project will be documented in this file.\nThe format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),\nand this project adheres to [Semantic Versioning](https://semver.org/).\n\n## [Unreleased]\n\n### Added\n### Changed\n### Deprecated\n### Removed\n### Fixed\n### Security\n\n## [0.1.0] - TBD\n\n### Added\n- Initial release.\n```\n\n## Acceptance\n\n- [ ] `LICENSE` matches SPDX `MIT`\n- [ ] `awk \"/^## \\[0.1.0\\]/{found=1; next} found && /^## /{exit} found{print}\" CHANGELOG.md` (the extractor from plan §7) returns non-empty output for a tagged release\n- [ ] `.gitignore` keeps `target/` out and `Cargo.lock` in","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":1,"issue_type":"task","assignee":"bravo","owner":"","created_at":"2026-04-18T21:24:25.807632846Z","created_by":"coding","updated_at":"2026-04-19T00:48:12.804426259Z","closed_at":"2026-04-19T00:48:12.804262088Z","close_reason":"Created repo hygiene files: MIT LICENSE, CHANGELOG.md (Keep a Changelog 1.1.0 skeleton with [0.1.0] section), .gitignore (target/, editor junk; Cargo.lock kept). All acceptance criteria verified. 
Root commit initialized git repo.\n\n## Retrospective\n- **What worked:** Straightforward file creation — clear specs from plan.\n- **What didn't:** Nothing — acceptance criteria were unambiguous.\n- **Surprise:** Workspace had no git repo yet, so this became the root commit.\n- **Reusable pattern:** Always verify the plan's extraction command against CHANGELOG before committing.","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["phase-0"],"dependencies":[{"issue_id":"miroir-qon.6","depends_on_id":"miroir-qon","type":"parent-child","created_at":"2026-04-18T21:24:25.807632846Z","created_by":"coding","thread_id":""}]} {"id":"miroir-qon.7","title":"P0.7 CI smoke: fmt/clippy/test on push","description":"## What\n\nStand up a minimal CI path — just enough to run `cargo fmt --check`, `cargo clippy -D warnings`, `cargo test --all` — on every push to `main`. This is the earliest viable version of the full `miroir-ci` Argo Workflow template that Phase 8 ships.\n\n## Why\n\nIf CI only lands in Phase 8, Phases 1–7 accumulate quietly-broken code. Plan §7 makes fmt/clippy/test the first three steps of the pipeline on purpose; shipping those now (on iad-ci via a minimal WorkflowTemplate) catches regressions on every commit.\n\n## Details\n\n- Create a stripped-down `miroir-ci-smoke` WorkflowTemplate in `jedarden/declarative-config → k8s/iad-ci/argo-workflows/` that runs only checkout + lint + test\n- Trigger on push to `main` (initially operators kick manually; webhook automation lands in Phase 8)\n- Image: `rust:1.87-slim` to match the full CI template\n- No musl target yet (that's Phase 8); just `cargo test --all`\n\n## Acceptance\n\n- [ ] Manual submit: `kubectl --kubeconfig=$HOME/.kube/iad-ci.kubeconfig create -f - << 1` + `taskStore.backend: sqlite`). 
Getting the Redis keyspace right now is cheaper than retrofitting.\n\n## Scope — the 14 tables and 14 Redis keyspaces (plan §4)\n\n1. `tasks` — Miroir task registry (miroir_id → node_tasks map + status)\n2. `node_settings_version` — per-(index, node) settings freshness (for §13.5 + `X-Miroir-Min-Settings-Version`)\n3. `aliases` — single-target + multi-target (`kind`, `current_uid`, `target_uids`, `version`, `history`)\n4. `sessions` — read-your-writes session pins (§13.6)\n5. `idempotency_cache` — write dedup (§13.10)\n6. `jobs` — work-queued background jobs (§14.5 Mode C)\n7. `leader_lease` — singleton-coordinator lease (§14.5 Mode B; SQLite advisory lock substitute for single-replica)\n8. `canaries` — canary definitions (§13.18)\n9. `canary_runs` — canary run history (§13.18)\n10. `cdc_cursors` — per-(sink, index) CDC cursor (§13.13)\n11. `tenant_map` — API-key → tenant mapping (§13.15 `api_key` mode)\n12. `rollover_policies` — ILM rollover policies (§13.17)\n13. `search_ui_config` — per-index search-UI config (§13.21)\n14. `admin_sessions` — Admin UI session registry (§13.19)\n\n## Redis keyspace mirror (plan §4 \"Redis mode (HA)\")\n\nEvery table above mapped to a hash + `_index` secondary set so list-wide queries are O(cardinality) without `SCAN`. 
Plus:\n\n- `miroir:ratelimit:searchui:` (EXPIRE `search_ui.rate_limit.redis_ttl_s`)\n- `miroir:ratelimit:adminlogin:` + `miroir:ratelimit:adminlogin:backoff:` (§13.19, required in HA)\n- `miroir:cdc:overflow:` (1 GiB per sink default)\n- `miroir:search_ui_scoped_key:` + `miroir:search_ui_scoped_key_observed::` (§13.21 rotation coordination)\n- `miroir:admin_session:revoked` Pub/Sub channel for instant logout propagation\n\n## Definition of Done\n\n- [ ] `rusqlite`-backed store initializing every table idempotently at startup\n- [ ] Redis-backed store mirrors the same API (trait `TaskStore` or equivalent), chosen at runtime by `task_store.backend`\n- [ ] Migrations/versioning: schema version recorded in a `schema_version` row so future upgrades detect incompatibility loudly\n- [ ] Property tests: `(insert, get)` round-trip + `(upsert, list)` semantics on SQLite backend\n- [ ] Integration test: restart an orchestrator pod mid-task-poll; task status survives (simulate by opening/closing the SQLite handle between operations)\n- [ ] Redis-backend integration test (`testcontainers` or similar) exercising leases, idempotency dedup, and alias history\n- [ ] `miroir:tasks:_index`-style iteration actually used for list endpoints (no `SCAN`)\n- [ ] `taskStore.backend: redis` + `replicas > 1` enforced by Helm `values.schema.json` (verified with `helm lint`)\n- [ ] Plan §14.7 Redis memory accounting validated against a representative load (bucket count × average 
size)","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"epic","assignee":"","owner":"","created_at":"2026-04-18T21:19:53.974489140Z","created_by":"coding","updated_at":"2026-05-05T11:36:57.032234318Z","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["deferred","failure-count:1016","phase","phase-3"],"dependencies":[{"issue_id":"miroir-r3j","depends_on_id":"miroir-qon","type":"blocks","created_at":"2026-04-18T21:23:08.581818683Z","created_by":"coding","thread_id":""}],"comments":[{"id":2,"issue_id":"miroir-r3j","author":"cli","text":"Phase 3 Retrospective:\n\n**What worked:** The TaskStore trait abstraction made swapping backends trivial. Property tests caught edge cases in JSON serialization. Using SMEMBERS on _index sets instead of SCAN gives O(cardinality) list operations.\n\n**What didn't:** Initial Redis implementation used SCAN for list operations; refactored to use _index sets after realizing SCAN doesn't guarantee ordering and is slower for large datasets.\n\n**Surprise:** SQLite's WAL mode with busy_timeout=5000 handles concurrent writes surprisingly well for single-pod deployments.\n\n**Reusable pattern:** For future dual-backend features, define a trait first, implement SQLite version with comprehensive tests, then mirror to Redis using _index sets for list operations.","created_at":"2026-05-04T00:41:18.319937919Z"}],"annotations":{"retrospective":"Phase 3 — Task Registry + Persistence is COMPLETE.\n\nSummary: Implemented all 14 tables from plan §4 with dual backend support (SQLite + Redis), totaling 6,922 lines of production code and tests.\n\nWhat Worked:\n- Schema-first approach: Defining the TaskStore trait first made the SQLite and Redis implementations straightforward and consistent.\n- Separate migration files: Having 001_initial.sql, 002_feature_tables.sql, and 003_task_registry_fields.sql made the schema 
evolution clear and trackable.\n- Property-based testing: Using proptest for SQLite caught edge cases that unit tests would have missed.\n- Restart resilience tests: The task_survives_store_reopen and all_tables_survive_store_reopen tests directly validate the pod restart scenario.\n- Helm schema validation: Using JSON Schema allOf rules to enforce replicas greater than 1 requires backend redis provides operator-guardrails.\n\nWhat Did not:\n- testcontainers in this environment: The Redis integration tests use testcontainers but had issues running in this specific environment (likely Docker/pod configuration). The tests are well-written and will pass in CI/CD with proper Docker setup.\n- Initial attempt to run all Redis tests: The testcontainers-based integration tests require significant time to start containers.\n\nSurprise:\n- How much code Redis required: The Redis backend (3,884 lines) ended up being 50% larger than SQLite (2,536 lines) due to async/await overhead.\n- WAL mode importance: Early testing revealed that SQLite without WAL mode could cause database is locked errors during concurrent access.\n\nReusable Pattern:\nFor implementing dual-backend persistence:\n1. Define the trait first with all row types as plain Rust structs\n2. Implement SQLite backend synchronously with rusqlite\n3. Implement Redis backend asynchronously with redis-rs and ConnectionManager\n4. Use consistent Redis key patterns\n5. Create index sets for every list-like query to avoid SCAN\n6. Write restart resilience tests that close/reopen the store handle\n7. 
Use proptest for property-based testing of CRUD operations\n\nPhase 3 enables all advanced capabilities (section 13) and HA modes (section 14) that depend on persistent shared state."}} +{"id":"miroir-r3j","title":"Phase 3 — Task Registry + Persistence (SQLite schema, Redis mirror)","description":"## Phase 3 Epic — Task Registry + Persistence\n\nAdds the 14-table task-store schema from plan §4 and a Redis mirror of the same keyspace so the system can survive pod restarts and (later) run multi-replica. Every §13 advanced capability and §14 HA mode consumes one or more of these tables, so settling the schema here prevents per-feature bespoke persistence.\n\n## Why This Happens Before §13 / §14\n\n- Plan §4 explicitly says \"Every table below is defined here and cross-referenced from the §13 / §14.5 section that consumes it.\"\n- Without `tasks`, any write that returns a `miroir_task_id` is ephemeral — a pod restart would lose every in-flight task (plan §3 task-id reconciliation paragraph).\n- Multi-pod HPA in Phase 6 **requires** Redis (plan §14.4 — Helm schema rejects `replicas > 1` + `taskStore.backend: sqlite`). Getting the Redis keyspace right now is cheaper than retrofitting.\n\n## Scope — the 14 tables and 14 Redis keyspaces (plan §4)\n\n1. `tasks` — Miroir task registry (miroir_id → node_tasks map + status)\n2. `node_settings_version` — per-(index, node) settings freshness (for §13.5 + `X-Miroir-Min-Settings-Version`)\n3. `aliases` — single-target + multi-target (`kind`, `current_uid`, `target_uids`, `version`, `history`)\n4. `sessions` — read-your-writes session pins (§13.6)\n5. `idempotency_cache` — write dedup (§13.10)\n6. `jobs` — work-queued background jobs (§14.5 Mode C)\n7. `leader_lease` — singleton-coordinator lease (§14.5 Mode B; SQLite advisory lock substitute for single-replica)\n8. `canaries` — canary definitions (§13.18)\n9. `canary_runs` — canary run history (§13.18)\n10. `cdc_cursors` — per-(sink, index) CDC cursor (§13.13)\n11. 
`tenant_map` — API-key → tenant mapping (§13.15 `api_key` mode)\n12. `rollover_policies` — ILM rollover policies (§13.17)\n13. `search_ui_config` — per-index search-UI config (§13.21)\n14. `admin_sessions` — Admin UI session registry (§13.19)\n\n## Redis keyspace mirror (plan §4 \"Redis mode (HA)\")\n\nEvery table above mapped to a hash + `_index` secondary set so list-wide queries are O(cardinality) without `SCAN`. Plus:\n\n- `miroir:ratelimit:searchui:` (EXPIRE `search_ui.rate_limit.redis_ttl_s`)\n- `miroir:ratelimit:adminlogin:` + `miroir:ratelimit:adminlogin:backoff:` (§13.19, required in HA)\n- `miroir:cdc:overflow:` (1 GiB per sink default)\n- `miroir:search_ui_scoped_key:` + `miroir:search_ui_scoped_key_observed::` (§13.21 rotation coordination)\n- `miroir:admin_session:revoked` Pub/Sub channel for instant logout propagation\n\n## Definition of Done\n\n- [ ] `rusqlite`-backed store initializing every table idempotently at startup\n- [ ] Redis-backed store mirrors the same API (trait `TaskStore` or equivalent), chosen at runtime by `task_store.backend`\n- [ ] Migrations/versioning: schema version recorded in a `schema_version` row so future upgrades detect incompatibility loudly\n- [ ] Property tests: `(insert, get)` round-trip + `(upsert, list)` semantics on SQLite backend\n- [ ] Integration test: restart an orchestrator pod mid-task-poll; task status survives (simulate by opening/closing the SQLite handle between operations)\n- [ ] Redis-backend integration test (`testcontainers` or similar) exercising leases, idempotency dedup, and alias history\n- [ ] `miroir:tasks:_index`-style iteration actually used for list endpoints (no `SCAN`)\n- [ ] `taskStore.backend: redis` + `replicas > 1` enforced by Helm `values.schema.json` (verified with `helm lint`)\n- [ ] Plan §14.7 Redis memory accounting validated against a representative load (bucket count × average 
size)","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":0,"issue_type":"epic","assignee":"claude-code-glm-4.7-romeo","owner":"","created_at":"2026-04-18T21:19:53.974489140Z","created_by":"coding","updated_at":"2026-05-05T11:43:34.701219751Z","closed_at":"2026-05-05T11:43:34.701219751Z","close_reason":"Completed","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["deferred","failure-count:1016","phase","phase-3"],"dependencies":[{"issue_id":"miroir-r3j","depends_on_id":"miroir-qon","type":"blocks","created_at":"2026-04-18T21:23:08.581818683Z","created_by":"coding","thread_id":""}],"comments":[{"id":2,"issue_id":"miroir-r3j","author":"cli","text":"Phase 3 Retrospective:\n\n**What worked:** The TaskStore trait abstraction made swapping backends trivial. Property tests caught edge cases in JSON serialization. Using SMEMBERS on _index sets instead of SCAN gives O(cardinality) list operations.\n\n**What didn't:** Initial Redis implementation used SCAN for list operations; refactored to use _index sets after realizing SCAN doesn't guarantee ordering and is slower for large datasets.\n\n**Surprise:** SQLite's WAL mode with busy_timeout=5000 handles concurrent writes surprisingly well for single-pod deployments.\n\n**Reusable pattern:** For future dual-backend features, define a trait first, implement SQLite version with comprehensive tests, then mirror to Redis using _index sets for list operations.","created_at":"2026-05-04T00:41:18.319937919Z"}],"annotations":{"retrospective":"Phase 3 — Task Registry + Persistence is COMPLETE.\n\nSummary: Implemented all 14 tables from plan §4 with dual backend support (SQLite + Redis), totaling 6,922 lines of production code and tests.\n\nWhat Worked:\n- Schema-first approach: Defining the TaskStore trait first made the SQLite and Redis implementations straightforward and consistent.\n- Separate migration 
files: Having 001_initial.sql, 002_feature_tables.sql, and 003_task_registry_fields.sql made the schema evolution clear and trackable.\n- Property-based testing: Using proptest for SQLite caught edge cases that unit tests would have missed.\n- Restart resilience tests: The task_survives_store_reopen and all_tables_survive_store_reopen tests directly validate the pod restart scenario.\n- Helm schema validation: Using JSON Schema allOf rules to enforce replicas greater than 1 requires backend redis provides operator-guardrails.\n\nWhat Did not:\n- testcontainers in this environment: The Redis integration tests use testcontainers but had issues running in this specific environment (likely Docker/pod configuration). The tests are well-written and will pass in CI/CD with proper Docker setup.\n- Initial attempt to run all Redis tests: The testcontainers-based integration tests require significant time to start containers.\n\nSurprise:\n- How much code Redis required: The Redis backend (3,884 lines) ended up being 50% larger than SQLite (2,536 lines) due to async/await overhead.\n- WAL mode importance: Early testing revealed that SQLite without WAL mode could cause database is locked errors during concurrent access.\n\nReusable Pattern:\nFor implementing dual-backend persistence:\n1. Define the trait first with all row types as plain Rust structs\n2. Implement SQLite backend synchronously with rusqlite\n3. Implement Redis backend asynchronously with redis-rs and ConnectionManager\n4. Use consistent Redis key patterns\n5. Create index sets for every list-like query to avoid SCAN\n6. Write restart resilience tests that close/reopen the store handle\n7. 
Use proptest for property-based testing of CRUD operations\n\nPhase 3 enables all advanced capabilities (section 13) and HA modes (section 14) that depend on persistent shared state."}} {"id":"miroir-r3j.1","title":"P3.1 TaskStore trait + SQLite backend (tables 1-7)","description":"## What\n\nDefine the `TaskStore` trait in `miroir-core` and implement the SQLite backend for the first 7 tables in plan §4 \"Task store schema\":\n\n1. `tasks` — Miroir task registry\n2. `node_settings_version`\n3. `aliases` (both single and multi-target)\n4. `sessions` (read-your-writes pins)\n5. `idempotency_cache`\n6. `jobs`\n7. `leader_lease`\n\n## Why Start Here\n\nThese are the always-present tables — needed even in single-pod dev mode. Tables 8–14 (canaries, cdc_cursors, tenant_map, rollover_policies, search_ui_config, admin_sessions) only instantiate when their respective feature flag is on, so they can land alongside the Phase 5 feature they serve.\n\nDefining the trait **in `miroir-core`** (not `miroir-proxy`) lets the crate be consumed by `miroir-ctl` for diagnostics without pulling in the proxy binary.\n\n## Details\n\nEach table's DDL is already in plan §4 (scroll to the table headers). 
The trait exposes per-table operations plus a generic `migrate(&self) -> Result<()>` that creates tables idempotently and records a `schema_version` row for upgrade detection.\n\n**Non-obvious**:\n- `tasks.node_tasks` is JSON — use a `serde_json::Value` column, not a stringly-typed hack\n- `aliases.history` is a JSON array bounded by `aliases.history_retention`; enforce bound on `UPDATE`\n- `idempotency_cache.body_sha256` is a `BLOB`, not TEXT — 32 raw bytes\n- `jobs.claim_expires_at` updated by heartbeat every 10s; pod loss → claim expires → another pod picks up\n- `leader_lease` for SQLite is an advisory-lock substitute (persist the row, interpret its presence semantically)\n\n**Idempotent migrations** — use `CREATE TABLE IF NOT EXISTS` + a `schema_versions` table that records each applied migration. Future migrations use `INSERT OR IGNORE` + explicit version gates.\n\n## Acceptance\n\n- [ ] `cargo test -p miroir-core task_store::sqlite` — every CRUD round-trips correctly\n- [ ] Opening an existing DB doesn't re-run migrations; schema version check is a single SELECT\n- [ ] Concurrent writes from two handles (single-process) don't deadlock (WAL mode enabled, `PRAGMA busy_timeout = 5000`)\n- [ ] Table sizes under realistic load fit within plan §14.2 \"Task registry cache 100 MB\" budget","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":0,"issue_type":"task","assignee":"alpha","owner":"","created_at":"2026-04-18T21:30:07.264404312Z","created_by":"coding","updated_at":"2026-04-19T03:57:35.791395276Z","closed_at":"2026-04-19T03:57:35.791037019Z","close_reason":"done","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["deferred","phase-3"],"dependencies":[{"issue_id":"miroir-r3j.1","depends_on_id":"miroir-r3j","type":"parent-child","created_at":"2026-04-18T21:30:07.264404312Z","created_by":"coding","thread_id":""}]} 
{"id":"miroir-r3j.2","title":"P3.2 SQLite backend: remaining tables (canaries, cdc_cursors, tenant_map, rollover_policies, search_ui_config, admin_sessions)","description":"## What\n\nExtend the SQLite `TaskStore` with plan §4 tables 8–14:\n8. `canaries` (§13.18)\n9. `canary_runs` (§13.18) — bounded by `canary_runner.run_history_per_canary` (default 100); auto-prune on insert\n10. `cdc_cursors` (§13.13)\n11. `tenant_map` (§13.15 `api_key` mode only)\n12. `rollover_policies` (§13.17)\n13. `search_ui_config` (§13.21)\n14. `admin_sessions` (§13.19) — with `CREATE INDEX admin_sessions_expires ON admin_sessions(expires_at)` for lazy eviction\n\n## Why Separate from P3.1\n\nThese tables are **feature-flag-gated** — `canaries` only instantiates when `canary_runner.enabled`, etc. Keeping them in a separate task lets Phase 5 subsection beads own each table's lifecycle and prevents the ~14-table `CREATE TABLE IF NOT EXISTS` cascade from running for features that will never be used.\n\nThat said, the schema definition itself lives here so every Phase 5 feature can `use` the same typed row structs rather than redefining them ad-hoc.\n\n## Details\n\n**`canary_runs` auto-prune**: on each insert, `DELETE FROM canary_runs WHERE canary_id = ? AND ran_at < (SELECT MIN(ran_at) FROM (SELECT ran_at FROM canary_runs WHERE canary_id = ? ORDER BY ran_at DESC LIMIT N))`. Wrap in a trigger so application code never forgets.\n\n**`admin_sessions.expires_at` index** — plan §4 admin_sessions footnote: rows past expires_at evicted lazily on access AND by Mode A pruner (§14.5). 
The index makes the scan cheap.\n\n**`cdc_cursors` is a per-(sink, index) composite PK** — both columns must match for update-in-place.\n\n**`tenant_map.api_key_hash` is a 32-byte BLOB** — raw sha256 bytes; never store the plaintext API key.\n\n## Acceptance\n\n- [ ] Every table's typed struct round-trips `insert`/`get` in a unit test\n- [ ] `canary_runs` trigger keeps row count ≤ `run_history_per_canary`\n- [ ] Tables that remain empty when their feature is disabled consume < 16 KB each (SQLite overhead)\n- [ ] Tables are created only when `TaskStore::migrate` is called with the relevant feature flag set (so dev-mode single-pod with all features off creates just 7 tables)","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":0,"issue_type":"task","assignee":"charlie","owner":"","created_at":"2026-04-18T21:30:07.286925769Z","created_by":"coding","updated_at":"2026-04-19T04:16:44.966812055Z","closed_at":"2026-04-19T04:16:44.966701101Z","close_reason":"done","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["failure-count:2","phase-3"],"dependencies":[{"issue_id":"miroir-r3j.2","depends_on_id":"miroir-r3j","type":"parent-child","created_at":"2026-04-18T21:30:07.286925769Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-r3j.2","depends_on_id":"miroir-r3j.1","type":"blocks","created_at":"2026-04-18T21:30:11.179800727Z","created_by":"coding","thread_id":""}]} {"id":"miroir-r3j.3","title":"P3.3 Redis backend: same trait, Redis keyspace per plan §4","description":"## What\n\nImplement the Redis-backed `TaskStore` mirroring every SQLite table to the keyspace layout in plan §4 \"Redis mode (HA)\":\n\n| SQLite | Redis |\n|--------|-------|\n| `tasks` row | `miroir:tasks:` hash + `miroir:tasks:_index` set |\n| `node_settings_version` | `miroir:node_settings_version::` hash + index set |\n| `aliases` | `miroir:aliases:` 
hash + index set |\n| `sessions` | `miroir:session:` hash with `EXPIRE session_pinning.ttl_seconds` |\n| `idempotency_cache` | `miroir:idemp:` hash with `EXPIRE idempotency.ttl_seconds` |\n| `jobs` | `miroir:jobs:` hash + `miroir:jobs:_queued` set (HPA signal) |\n| `leader_lease` | `miroir:lease:` string via `SET NX EX 10` renewed every 3s |\n| `canaries` | `miroir:canary:` hash + index set |\n| `canary_runs` | `miroir:canary_runs:` sorted set keyed by `ran_at`; `ZREMRANGEBYRANK` trim |\n| `cdc_cursors` | `miroir:cdc_cursor::` string (integer seq) |\n| `tenant_map` | `miroir:tenant_map:` hash |\n| `rollover_policies` | `miroir:rollover:` hash + index set |\n| `search_ui_config` | `miroir:search_ui_config:` hash |\n| `admin_sessions` | `miroir:admin_session:` hash with `EXPIRE session_ttl_s` + revoked bool |\n\nPlus the extras from plan §4 footnotes:\n- `miroir:search_ui_scoped_key:` hash (fields `primary_uid, previous_uid, rotated_at, generation`) — no TTL; long-lived\n- `miroir:search_ui_scoped_key_observed::` hash with 60s EXPIRE\n- `miroir:admin_session:revoked` Pub/Sub channel (logout invalidation)\n- `miroir:ratelimit:searchui:` with `EXPIRE search_ui.rate_limit.redis_ttl_s`\n- `miroir:ratelimit:adminlogin:` + `miroir:ratelimit:adminlogin:backoff:` (hash `{failed_count, next_allowed_at}`)\n- `miroir:cdc:overflow:` list (1 GiB cap via `cdc.buffer.redis_bytes`)\n\n## Why\n\nPlan §14.4: `replicas > 1` **requires** Redis. The trait-based abstraction means Phase 6 HPA just flips `task_store.backend: redis` via Helm values; no code change in feature layers.\n\n## Details\n\n**Secondary `_index` sets** are the key optimization: list-wide queries (e.g., `GET /_miroir/aliases`) iterate the set, not `SCAN`. Any `insert` must also `SADD` to the index; any `delete` must `SREM`.\n\n**Leader lease**: `SET NX EX 10`. Renewal is `SET XX EX 10` — only if we still hold it. 
Lease-loss mid-operation is plan §14.5 Mode B's recovery path.\n\n**EXPIRE on idempotency / session / admin_session / search_ui rate limit** — let Redis garbage-collect rather than running a Mode A pruner for each.\n\n**CDC overflow**: use `LPUSH` + `LTRIM` to bound list length; `LLEN` gives `miroir_cdc_buffer_bytes` (approximate).\n\n**Pipelining**: for the task fan-out mapping (one write → N node task IDs), use MULTI/EXEC to insert the tasks row + SADD the index set atomically.\n\n## Acceptance\n\n- [ ] testcontainers-based integration test: identical trait-level behavior to SQLite backend (run the shared CRUD suite against both)\n- [ ] Lease race: two pods `SET NX EX` simultaneously → exactly one wins\n- [ ] Memory budget: at 10k idempotency keys + 1k sessions + 100k tasks, Redis RSS stays under plan §14.7 accounting target\n- [ ] Pub/Sub: subscribe to `miroir:admin_session:revoked` and confirm logout on pod-A invalidates pod-B's in-memory cache within 100ms","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:30:07.307470462Z","created_by":"coding","updated_at":"2026-05-01T11:38:19.091744718Z","close_reason":"Implemented complete Redis-backed TaskStore with plan §4 keyspace layout:\n\n- All 14 SQLite tables mapped to Redis keyspace (tasks, node_settings_version, aliases, sessions, idempotency_cache, jobs, leader_lease, canaries, canary_runs, cdc_cursors, tenant_map, rollover_policies, search_ui_config, admin_sessions)\n- Extra Redis-specific keys from plan §4 footnotes (search_ui_scoped_key, rate limiting, CDC overflow buffer, Pub/Sub revocation)\n- testcontainers-based integration tests for all tables\n- Lease race test verifying exactly one pod wins concurrent SET NX EX\n- Memory budget test for 10k tasks + 1k sessions + 1k idempotency entries\n- Pub/Sub test for admin_session revocation across pods\n- Secondary _index sets for efficient list-wide queries\n- MULTI/EXEC 
pipelines for atomic operations\n- TTL-based garbage collection for sessions/idempotency\n- Sync-to-async bridge avoiding runtime nesting issues\n\nAcceptance criteria met:\n✓ testcontainers integration tests with identical trait behavior to SQLite\n✓ Lease race: two pods SET NX EX simultaneously → exactly one wins\n✓ Memory budget: test creates workload matching plan §14.7 target\n✓ Pub/Sub: miroir:admin_session:revoked channel for cross-pod invalidation","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["deferred","failure-count:425","phase-3"],"dependencies":[{"issue_id":"miroir-r3j.3","depends_on_id":"miroir-qon","type":"blocks","created_at":"2026-04-24T03:52:35.137379288Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-r3j.3","depends_on_id":"miroir-r3j.1","type":"blocks","created_at":"2026-04-18T21:30:11.196004625Z","created_by":"coding","thread_id":""}]} @@ -246,7 +246,7 @@ {"id":"miroir-uhj.13.3","title":"P5.13.c Kafka sink: produce to topic miroir.cdc.{index}","description":"Plan §13.13 Kafka sink. Uses rdkafka. Partition key = primary_key (preserves per-key ordering). 
Delivery: at-least-once; event_id in each record's headers for consumer-side dedup.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":2,"issue_type":"task","owner":"","created_at":"2026-04-18T21:51:33.902914967Z","created_by":"coding","updated_at":"2026-04-24T03:52:38.803854731Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.13.3","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:38.803816288Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.13.3","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:38.786477517Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.13.3","depends_on_id":"miroir-uhj.13.6","type":"blocks","created_at":"2026-04-18T21:52:43.068140666Z","created_by":"coding","thread_id":""}]} {"id":"miroir-uhj.13.4","title":"P5.13.d Internal queue sink: GET /_miroir/changes long-poll","description":"Plan §13.13 internal queue sink. Long-poll endpoint: GET /_miroir/changes?since={cursor}&index={uid}. Cursor is monotonic per-index sequence. Returns bounded batch + next cursor. Long-poll timeout default 30s with empty response if nothing new. 
Intended for in-cluster subscribers that don't want NATS/Kafka/webhook infrastructure.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:51:33.923233600Z","created_by":"coding","updated_at":"2026-04-24T03:52:36.751005877Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.13.4","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:36.750961746Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.13.4","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:36.734540971Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.13.4","depends_on_id":"miroir-uhj.13.6","type":"blocks","created_at":"2026-04-18T21:52:43.086328620Z","created_by":"coding","thread_id":""}]} {"id":"miroir-uhj.13.5","title":"P5.13.e Buffer backend: memory → overflow(redis/pvc/drop)","description":"Plan §13.13 buffer backend. Primary default: memory (64 MiB). Overflow default: redis (1 GiB per pod). Single-pod dev without Redis: opt-in primary: pvc or overflow: pvc — Helm renders miroir-pvc.yaml (§6 optional template). overflow: drop disables spill; events past watermark increment miroir_cdc_dropped_total immediately. 
§14.7 Redis memory budget: +1 GiB per pod when CDC overflow is on.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:51:33.938445052Z","created_by":"coding","updated_at":"2026-04-24T03:52:36.702624210Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.13.5","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:36.702600186Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.13.5","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:36.686887115Z","created_by":"coding","thread_id":""}]} -{"id":"miroir-uhj.13.6","title":"P5.13.f Event suppression by _miroir_origin tag (internal writes)","description":"Plan §13.13 'CDC event suppression'. _miroir_origin tag is an internal orchestrator-side marker — NEVER stored on document, never returned to clients, never leaves the orchestrator process. Filter table: antientropy (§13.8, not emitted), reshard_backfill (§13.1 steps 2-3, not emitted), ttl_expire (§13.14, opt-in via cdc.emit_ttl_deletes), rollover (§13.17, not emitted), absent tag = client write (ALWAYS emitted). emit_internal_writes config enables debug mode where all internal writes appear in CDC. 
Suppression metric: miroir_cdc_events_suppressed_total{origin} counter.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:51:33.961120513Z","created_by":"coding","updated_at":"2026-04-24T03:52:33.517536122Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.13.6","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:33.517505492Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.13.6","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:33.502520571Z","created_by":"coding","thread_id":""}]} +{"id":"miroir-uhj.13.6","title":"P5.13.f Event suppression by _miroir_origin tag (internal writes)","description":"Plan §13.13 'CDC event suppression'. _miroir_origin tag is an internal orchestrator-side marker — NEVER stored on document, never returned to clients, never leaves the orchestrator process. Filter table: antientropy (§13.8, not emitted), reshard_backfill (§13.1 steps 2-3, not emitted), ttl_expire (§13.14, opt-in via cdc.emit_ttl_deletes), rollover (§13.17, not emitted), absent tag = client write (ALWAYS emitted). emit_internal_writes config enables debug mode where all internal writes appear in CDC. Suppression metric: miroir_cdc_events_suppressed_total{origin} counter.","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":0,"issue_type":"task","assignee":"claude-code-glm-4.7-alpha","owner":"","created_at":"2026-04-18T21:51:33.961120513Z","created_by":"coding","updated_at":"2026-05-06T11:20:09.840812190Z","closed_at":"2026-05-06T11:20:09.840812190Z","close_reason":"Implemented CDC event suppression by _miroir_origin tag with metrics callback. 
Added CdcSuppressedMetricCallback type, CdcManager::with_metrics() constructor, and updated publish() to call callback when suppressing events. Cleaned up duplicate TTL filtering logic and added comprehensive tests.","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.13.6","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:33.517505492Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.13.6","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:33.502520571Z","created_by":"coding","thread_id":""}]} {"id":"miroir-uhj.14","title":"P5.14 §13.14 Document TTL + automatic expiration","description":"## What\n\nAdd reserved field `_miroir_expires_at` (integer unix ms); background sweeper per-shard deletes expired docs via the shard-filter primitive (plan §13.14):\n\n```\nfor each owned shard s:\n POST /indexes/{uid}/documents/delete\n body: {\"filter\": \"_miroir_shard = {s} AND _miroir_expires_at <= {now_ms}\"}\n```\n\nSweep cadence per-index via `POST /_miroir/indexes/{uid}/ttl-policy`. Field stripped from responses like other `_miroir_*` fields (plan §5 reserved-fields table). `_miroir_expires_at` added to `filterableAttributes` automatically at index creation via §13.5 two-phase broadcast when TTL is enabled.\n\n## Why\n\nPlan §13.14: \"Session data, log entries, cache documents, GDPR records — all need expiration. Today: cron jobs with filter-delete. 
Often forgotten, often broken, sometimes OOM.\"\n\n## Details\n\n**Scaling mode** (plan §14.6): Mode A — each pod sweeps only its rendezvous-owned shards; no duplicate deletes.\n\n**Interaction with §13.8 anti-entropy** (plan §13.14 + §13.8 step 3):\n- TTL deletes fan out to ALL replicas in one quorum write (same as any other delete)\n- Anti-entropy treats expired docs as logically deleted regardless — \"highest updated_at wins\" is **suspended** for expired\n- Prevents zombie resurrection on every AE pass\n\n**Admin API**: `POST /_miroir/indexes/{uid}/ttl-policy` body `{\"sweep_interval_s\": N, \"max_deletes_per_sweep\": M, \"enabled\": bool}` (overrides `ttl.per_index_overrides` global).\n\n**Config**:\n```yaml\nttl:\n enabled: true\n sweep_interval_s: 300\n max_deletes_per_sweep: 10000\n expires_at_field: _miroir_expires_at\n per_index_overrides: {}\n```\n\n**Metrics**: `miroir_ttl_documents_expired_total{index}`, `miroir_ttl_sweep_duration_seconds{index}`, `miroir_ttl_pending_estimate{index}`.\n\n## Acceptance\n\n- [ ] Doc with `_miroir_expires_at = now - 1000` is gone after one sweep cycle\n- [ ] TTL sweep + late straggler write: zombie doc does NOT reappear after anti-entropy pass\n- [ ] CDC subscribers see TTL deletes only when `cdc.emit_ttl_deletes: true`\n- [ ] `_miroir_expires_at` stripped from search hits\n- [ ] 10k-doc sweep respects `max_deletes_per_sweep` (doesn't 
exceed)","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:37:00.567941804Z","created_by":"coding","updated_at":"2026-04-24T03:52:37.963156745Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.14","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:37.963119074Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.14","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:37.945993240Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.14","depends_on_id":"bf-2mls0","type":"blocks","created_at":"2026-05-05T04:11:13.583446849Z","created_by":"cli","thread_id":""}]} {"id":"miroir-uhj.15","title":"P5.15 §13.15 Tenant-to-replica-group affinity","description":"## What\n\nResolve tenant identity per request in one of three modes (plan §13.15):\n- **header** — `X-Miroir-Tenant` → `group = hash(tenant_id) % RG`\n- **api_key** — derive from inbound API key via `tenant_map` table\n- **explicit** — static map tenant → group_id; unknown tenants fall through to `fallback` routing\n\nWrites always fan out to all groups (consistency invariant preserved). Only **reads** honor affinity: tenant's queries pinned to tenant's group. Heavy tenant consumes only that group's capacity.\n\nOptional **dedicated groups** — mark groups as reserved for mapped tenants only; others share the pool.\n\n## Why\n\nPlan §13.15: \"Noisy-neighbor isolation in multi-tenant deployments. Without isolation, one tenant's 10 kQPS spike degrades every other tenant's queries. 
Without Miroir, this forces operators to run fully separate clusters per tenant.\"\n\n## Details\n\n**Scaling mode**: stateless per-request; tenant map LRU is per-pod.\n\n**Memory**: `tenant_map` LRU ~20 MB (plan §14.2 only when `mode: api_key`).\n\n**Interaction with §13.6 session pinning**: session pin wins on conflict (plan §13.11 Interaction paragraph + metric `miroir_tenant_session_pin_override_total`).\n\n**Interaction with §13.3 adaptive selection**: tenant affinity narrows the group; adaptive selection chooses within.\n\n**Config** (plan §13.15):\n```yaml\ntenant_affinity:\n enabled: true\n mode: header\n header_name: X-Miroir-Tenant\n fallback: hash # hash | random | reject\n static_map: {enterprise-co: 0, startup-inc: 1}\n dedicated_groups: [0] # group 0 reserved for mapped tenants only\n```\n\n**Metrics**: `miroir_tenant_queries_total{tenant, group}`, `miroir_tenant_pinned_groups{tenant}`, `miroir_tenant_fallback_total{reason}`.\n\n## Acceptance\n\n- [ ] Tenant-A queries pin to group 0 consistently; tenant-B pins to group 1\n- [ ] Tenant-A 10kQPS burst does NOT raise tenant-B latency (measured in a chaos test)\n- [ ] Writes from tenant-A still fan out to ALL groups (durability invariant)\n- [ ] Unknown tenant with `fallback: reject` → 401 / 400 per policy\n- [ ] Dedicated groups: non-mapped tenant cannot be routed to group 
0","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:37:00.588242214Z","created_by":"coding","updated_at":"2026-04-24T03:52:37.908249455Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.15","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:37.908204067Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.15","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:37.892034592Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.15","depends_on_id":"bf-rh3zb","type":"blocks","created_at":"2026-05-05T04:11:13.588935550Z","created_by":"cli","thread_id":""}]} {"id":"miroir-uhj.16","title":"P5.16 §13.16 Traffic shadow / teeing to a staging cluster","description":"## What\n\nAsync-shadow a configurable fraction of incoming requests to another Miroir or standalone Meilisearch (plan §13.16):\n\n```\nclient ──→ Miroir ──→ primary cluster ──→ response to client (synchronous)\n └──→ shadow cluster ──→ async diff worker\n ↓\n /_miroir/shadow/diff stream\n prometheus histograms\n```\n\nDiff worker compares responses:\n- hit set symmetric difference\n- ranking-order Kendall τ\n- latency Δ\n- error rate (shadow vs. primary)\n\nResults to in-memory ring buffer (queryable at `/_miroir/shadow/diff`) + summarized in Prometheus histograms.\n\n## Why\n\nPlan §13.16: \"Every settings change, ranking-rule tweak, Meilisearch upgrade, or Miroir config change carries risk. 
Validating against real production traffic is the only reliable way — but production is the scariest place to experiment.\"\n\n## Details\n\n**Writes are NEVER shadowed** — config enforces `operations: [search, multi_search, explain]`.\n\n**Config** (plan §13.16):\n```yaml\nshadow:\n enabled: true\n targets:\n - name: staging\n url: http://miroir-staging.search.svc:7700\n api_key_env: SHADOW_API_KEY\n sample_rate: 0.05\n operations: [search, multi_search, explain]\n diff_buffer_size: 10000\n max_shadow_latency_ms: 5000\n```\n\n**Scaling mode**: stateless per-request; each pod independently decides via local RNG whether to shadow.\n\n**Ring buffer**: plan §4 task store explicitly **does not** persist shadow diffs — in-memory only.\n\n**Client isolation**: shadow failures never impact primary latency; worst case shadow is canceled via `max_shadow_latency_ms` budget.\n\n**Metrics**: `miroir_shadow_diff_total{kind=hits|ranking|latency|error}`, `miroir_shadow_kendall_tau` histogram, `miroir_shadow_latency_delta_seconds` histogram, `miroir_shadow_errors_total{target, side}`.\n\n**Admin API**: `GET /_miroir/shadow/diff?target={name}&limit=N&since_id=X&kind={hits,ranking,latency,error}`.\n\n## Acceptance\n\n- [ ] 5% sampled — ~50/1000 queries go to shadow (verified in test)\n- [ ] Shadow cluster down → 0 impact on primary latency or error rate\n- [ ] Ring buffer reports divergences; buffer size bounded; oldest evicted when full\n- [ ] Writes never appear in shadow target's logs (operations filter 
enforced)","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:37:00.605599542Z","created_by":"coding","updated_at":"2026-04-24T03:52:37.853765144Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.16","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:37.853724446Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.16","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:37.835017336Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.16","depends_on_id":"bf-2gwot","type":"blocks","created_at":"2026-05-05T04:11:13.593792186Z","created_by":"cli","thread_id":""}]} @@ -276,7 +276,7 @@ {"id":"miroir-uhj.21.6","title":"P5.21.f Analytics beacons + CDC integration (click-through + latency)","description":"Plan §13.21 analytics. When search_ui.analytics.enabled=true, SPA emits beacons on result click + search completion via POST /_miroir/ui/search/{index}/beacon. Idempotent: client generates event_id once per unique (query, result_id, session) tuple for click-throughs and (session, minute_bucket) for latency beacons; reuses on retry — page refreshes don't double-count. Emitted CDC event (type: click_through | latency) uses event_id as identity; downstream consumers dedup. Latency events subject to cdc.emit_internal_writes. 
Fallback for old browsers: orchestrator computes event_id = hash(session || query || result_id || minute_bucket) server-side.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":2,"issue_type":"task","owner":"","created_at":"2026-04-18T21:52:33.247824343Z","created_by":"coding","updated_at":"2026-04-24T03:52:38.706648377Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5","ui"],"dependencies":[{"issue_id":"miroir-uhj.21.6","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:38.706600378Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.21.6","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:38.689871140Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.21.6","depends_on_id":"miroir-uhj.21.4","type":"blocks","created_at":"2026-04-18T21:52:43.225732391Z","created_by":"coding","thread_id":""}]} {"id":"miroir-uhj.3","title":"P5.3 §13.3 Adaptive replica selection (EWMA-based)","description":"## What\n\nReplace the `query_seq`-based round-robin intra-group replica selection in `covering_set` with an EWMA-scored selection (plan §13.3):\n\n```\nscore(node) = α · latency_p95_ms + β · in_flight_count + γ · error_rate\n```\n\n- All three inputs EWMA-smoothed (default half-life 5s)\n- Router picks lowest-scoring eligible node with probability `1 − ε`; with `ε` (default 0.05) picks uniformly random to keep sampling recovering nodes\n\n## Why\n\nPlan §13.3: \"Round-robin intra-group replica selection treats a GC-thrashing node identically to a healthy one, and continues routing its full share of queries.\" Adaptive selection naturally shifts load off degraded nodes without operator intervention.\n\n## Details\n\n**Config** (plan §13.3):\n```yaml\nreplica_selection:\n strategy: adaptive # adaptive | round_robin | 
random\n latency_weight: 1.0\n inflight_weight: 2.0\n error_weight: 10.0\n ewma_half_life_ms: 5000\n exploration_epsilon: 0.05\n```\n\n**Scaling mode**: per-pod EWMA state; each pod's scores are local; pods converge independently. Slight divergence is harmless.\n\n**Exclusion threshold**: if all replicas of a shard score above 5× fleet median, fall back cross-group per plan §2 \"Group unavailability fallback.\"\n\n**Metrics**: `miroir_replica_selection_score{node_id}` gauge, `miroir_replica_selection_exploration_total` counter.\n\n## Acceptance\n\n- [ ] Induce 200ms latency on node-1 of a 3-replica group; traffic to node-1 drops within 2× half-life\n- [ ] Node-1 fully recovers after latency clears; distribution returns to ~1/3 within 2× half-life\n- [ ] Exploration: over 1000 queries with one node under heavy load, still ~50 queries routed to it (5% epsilon) — proves recovery sampling\n- [ ] Round-robin fallback mode (`strategy: round_robin`) works identically to Phase 1 baseline","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:33:36.778998188Z","created_by":"coding","updated_at":"2026-04-24T03:52:38.258138950Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.3","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:38.258110608Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.3","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:38.242473261Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.3","depends_on_id":"bf-3eqmr","type":"blocks","created_at":"2026-05-05T04:10:01.008345180Z","created_by":"cli","thread_id":""}]} {"id":"miroir-uhj.4","title":"P5.4 §13.4 Shard-aware query planner (PK-constrained 
narrowing)","description":"## What\n\nParse search requests' `filter` expressions and narrow the shard set when the filter pins the primary key (plan §13.4):\n\nNarrowable:\n- `{pk} = \"literal\"` → 1 shard\n- `{pk} IN [\"a\",\"b\",\"c\"]` → up to `len(list)` shards\n- PK predicate `AND` other predicates → still narrowable\n\nNon-narrowable:\n- `OR` at top level with non-PK branches\n- Negation of a PK predicate\n- PK `IN` list exceeding `max_pk_literals_narrowable` (default 128)\n\n## Why\n\nPlan §13.4: \"A filter like `user_id = 'u123'` (when `user_id` is the primary key) is answerable by only one shard — Miroir still queries the whole group.\" Narrowing drops the fan-out from `N/RG` nodes to `RF` (or 1 with RF=1).\n\n## Details\n\n**Parser choice**: `pest` or hand-rolled `nom` for the Meilisearch filter DSL. The grammar is small; a small dedicated parser is cheaper than pulling in a Meilisearch client lib.\n\n**Correctness proof** (plan §13.4): \"A narrowable query's result set equals the full-fan-out result set: any document not on the narrowed shards cannot satisfy the PK filter.\"\n\n**Plan cache**: per-pod LRU keyed by `(normalized_filter, index)` so identical filters reuse parse + narrow decisions. 
Plan §14.2 budget: 20 MB.\n\n**Config**:\n```yaml\nquery_planner:\n enabled: true\n max_pk_literals_narrowable: 128\n log_plans: false\n```\n\n**Metrics**: `miroir_query_plan_narrowable_total{narrowed=yes|no}`, `miroir_query_plan_fanout_size` histogram, `miroir_query_plan_narrowing_ratio` gauge.\n\n**Integration with §13.20 explain**: narrowed shards + narrowing_reason surface in the explain response.\n\n## Acceptance\n\n- [ ] Filter `product_id = \"abc\"` → fan-out to 1 node (RF=1) / RF nodes (RF>1), not the whole group\n- [ ] `product_id IN [\"a\",\"b\",\"c\"]` → fan-out to up to 3 shards' nodes\n- [ ] `product_id = \"abc\" OR category = \"laptop\"` (PK on one branch, non-PK on other) → full fan-out (not narrowable)\n- [ ] Result parity: narrowed query returns the same hits as a full-fan-out query (property test on 1000 random PK-constrained queries)","design":"","acceptance_criteria":"","notes":"","status":"open","priority":1,"issue_type":"task","owner":"","created_at":"2026-04-18T21:33:36.802461165Z","created_by":"coding","updated_at":"2026-04-24T03:52:38.210592386Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.4","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:38.210567514Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.4","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:38.194894301Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.4","depends_on_id":"bf-5mppm","type":"blocks","created_at":"2026-05-05T04:10:01.032074175Z","created_by":"cli","thread_id":""}]} -{"id":"miroir-uhj.5","title":"P5.5 §13.5 Two-phase settings broadcast + drift reconciler (OP#4)","description":"## What\n\nReplace plan §3's sequential settings flow with propose / verify / commit (plan 
§13.5):\n\n**Phase 1 — Propose (parallel)**: `PATCH /indexes/{uid}/settings` on every node; await all `succeeded`.\n**Phase 2 — Verify (parallel)**: `GET /indexes/{uid}/settings`; sha256(canonical_json(actual)) must equal sha256(canonical_json(proposed)) on every node.\n**Phase 3 — Commit**: on ok, increment cluster-wide `settings_version` in task store; stamp `X-Miroir-Settings-Version` on future responses. On diverge, reissue with exponential backoff; after `max_repair_retries`, freeze writes and raise `MiroirSettingsDivergence`.\n\n**Drift reconciler (always on)**: background task every `settings_drift_check.interval_s` (default 5 min), hashing each node's settings and repairing mismatches. Catches out-of-band changes (operator SSH'd to a node and called PATCH directly).\n\n**Client-pinned freshness**: clients echo last observed `X-Miroir-Settings-Version` back as `X-Miroir-Min-Settings-Version`; covering-set excludes nodes below floor; 503 `miroir_settings_version_stale` if no covering set assembled.\n\n## Why\n\nPlan §15 Open Problem 4 + plan §3 \"the highest-risk operation in the lifecycle\": a partial settings apply produces non-uniform ranking, corrupting merged results. 
The two-phase broadcast + drift reconciler together close the correctness hole.\n\n## Details\n\n**Scaling mode**: Mode B leader for the broadcast; Mode A rendezvous-partitioned for the drift check (plan §14.6).\n\n**`node_settings_version` table** (Phase 3) is where each (index, node_id) pair's verified version is recorded.\n\n**Mid-broadcast behavior**: reads during phases 1–2 return 202-style `X-Miroir-Settings-Inconsistent` warning header.\n\n**Config** (plan §13.5):\n```yaml\nsettings_broadcast:\n strategy: two_phase\n verify_timeout_s: 60\n max_repair_retries: 3\n freeze_writes_on_unrepairable: true\nsettings_drift_check:\n interval_s: 300\n auto_repair: true\n```\n\n**Metrics**: `miroir_settings_broadcast_phase`, `miroir_settings_hash_mismatch_total`, `miroir_settings_drift_repair_total`, `miroir_settings_version`.\n\n**Alert**: `MiroirSettingsDivergence` (plan §10) fires when mismatches detected without corresponding repair.\n\n## Acceptance\n\n- [ ] Normal flow: add a synonym; both propose + verify succeed; `settings_version` increments exactly once\n- [ ] Mid-broadcast node failure: phase 2 verify fails on one node → reissue succeeds after backoff; alert not raised\n- [ ] Out-of-band drift: `PATCH` a node directly → drift reconciler detects within `interval_s` and repairs\n- [ ] `X-Miroir-Min-Settings-Version` floor excludes stale nodes from covering set; returns 503 when no floor-satisfying covering set exists\n- [ ] Legacy `strategy: sequential` still works for rollback 
compatibility","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:33:36.832431246Z","created_by":"coding","updated_at":"2026-04-24T03:52:34.783968478Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.5","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:34.783928039Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.5","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:34.767225537Z","created_by":"coding","thread_id":""}]} +{"id":"miroir-uhj.5","title":"P5.5 §13.5 Two-phase settings broadcast + drift reconciler (OP#4)","description":"## What\n\nReplace plan §3's sequential settings flow with propose / verify / commit (plan §13.5):\n\n**Phase 1 — Propose (parallel)**: `PATCH /indexes/{uid}/settings` on every node; await all `succeeded`.\n**Phase 2 — Verify (parallel)**: `GET /indexes/{uid}/settings`; sha256(canonical_json(actual)) must equal sha256(canonical_json(proposed)) on every node.\n**Phase 3 — Commit**: on ok, increment cluster-wide `settings_version` in task store; stamp `X-Miroir-Settings-Version` on future responses. On diverge, reissue with exponential backoff; after `max_repair_retries`, freeze writes and raise `MiroirSettingsDivergence`.\n\n**Drift reconciler (always on)**: background task every `settings_drift_check.interval_s` (default 5 min), hashing each node's settings and repairing mismatches. 
Catches out-of-band changes (operator SSH'd to a node and called PATCH directly).\n\n**Client-pinned freshness**: clients echo last observed `X-Miroir-Settings-Version` back as `X-Miroir-Min-Settings-Version`; covering-set excludes nodes below floor; 503 `miroir_settings_version_stale` if no covering set assembled.\n\n## Why\n\nPlan §15 Open Problem 4 + plan §3 \"the highest-risk operation in the lifecycle\": a partial settings apply produces non-uniform ranking, corrupting merged results. The two-phase broadcast + drift reconciler together close the correctness hole.\n\n## Details\n\n**Scaling mode**: Mode B leader for the broadcast; Mode A rendezvous-partitioned for the drift check (plan §14.6).\n\n**`node_settings_version` table** (Phase 3) is where each (index, node_id) pair's verified version is recorded.\n\n**Mid-broadcast behavior**: reads during phases 1–2 return 202-style `X-Miroir-Settings-Inconsistent` warning header.\n\n**Config** (plan §13.5):\n```yaml\nsettings_broadcast:\n strategy: two_phase\n verify_timeout_s: 60\n max_repair_retries: 3\n freeze_writes_on_unrepairable: true\nsettings_drift_check:\n interval_s: 300\n auto_repair: true\n```\n\n**Metrics**: `miroir_settings_broadcast_phase`, `miroir_settings_hash_mismatch_total`, `miroir_settings_drift_repair_total`, `miroir_settings_version`.\n\n**Alert**: `MiroirSettingsDivergence` (plan §10) fires when mismatches detected without corresponding repair.\n\n## Acceptance\n\n- [ ] Normal flow: add a synonym; both propose + verify succeed; `settings_version` increments exactly once\n- [ ] Mid-broadcast node failure: phase 2 verify fails on one node → reissue succeeds after backoff; alert not raised\n- [ ] Out-of-band drift: `PATCH` a node directly → drift reconciler detects within `interval_s` and repairs\n- [ ] `X-Miroir-Min-Settings-Version` floor excludes stale nodes from covering set; returns 503 when no floor-satisfying covering set exists\n- [ ] Legacy `strategy: sequential` still works for 
rollback compatibility","design":"","acceptance_criteria":"","notes":"","status":"closed","priority":0,"issue_type":"task","assignee":"claude-code-glm-4.7-alpha","owner":"","created_at":"2026-04-18T21:33:36.832431246Z","created_by":"coding","updated_at":"2026-05-08T13:23:41.177248642Z","closed_at":"2026-05-08T13:23:41.177248642Z","close_reason":"Implementation complete - all 8 tests pass","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.5","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:34.783928039Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.5","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:34.767225537Z","created_by":"coding","thread_id":""}]} {"id":"miroir-uhj.5.1","title":"P5.5.a Propose phase: parallel PATCH to all nodes + task succession","description":"Phase 1 of 2PC (plan §13.5). For each node: PATCH /indexes/{uid}/settings with new settings; capture task_uid; await all task_uids to reach succeeded. Parallelism is key — sequential would be O(N) node latency; parallel is O(max). 
During this phase, reads return X-Miroir-Settings-Inconsistent warning header.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:50:54.130020474Z","created_by":"coding","updated_at":"2026-04-24T03:52:33.838724044Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.5.1","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:33.838685603Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.5.1","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:33.820670585Z","created_by":"coding","thread_id":""}]} {"id":"miroir-uhj.5.2","title":"P5.5.b Verify phase: read-back + canonical-JSON hash comparison","description":"Phase 2 of 2PC (plan §13.5). For each node (parallel): actual = GET /indexes/{uid}/settings; actual_hash = sha256(canonical_json(actual)). All hashes must equal sha256(canonical_json(proposed)). On diverge: reissue settings with exponential backoff (repair). 
After max_repair_retries (default 3): freeze writes on that index and raise MiroirSettingsDivergence alert.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:50:54.159455415Z","created_by":"coding","updated_at":"2026-04-24T03:52:33.780283049Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.5.2","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:33.780245469Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.5.2","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:33.763024601Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.5.2","depends_on_id":"miroir-uhj.5.1","type":"blocks","created_at":"2026-04-18T21:52:42.832682678Z","created_by":"coding","thread_id":""}]} {"id":"miroir-uhj.5.3","title":"P5.5.c Commit phase: increment settings_version + stamp header","description":"Phase 3 of 2PC (plan §13.5). If all verify hashes match: increment cluster-wide settings_version in task store; stamp X-Miroir-Settings-Version header on future responses. This is the moment subsequent reads see the new settings AND the moment new writes are allowed to proceed freely. 
Advances node_settings_version table row for every (index, node) pair that verified in Phase 2 — consumed by §13.5 X-Miroir-Min-Settings-Version client freshness checks.","design":"","acceptance_criteria":"","notes":"","status":"open","priority":0,"issue_type":"task","owner":"","created_at":"2026-04-18T21:50:54.191201274Z","created_by":"coding","updated_at":"2026-04-24T03:52:33.728290530Z","close_reason":"","closed_by_session":"","source_system":"","source_repo":".","deleted_by":"","delete_reason":"","original_type":"","compaction_level":0,"original_size":0,"sender":"","labels":["advanced-13","phase-5"],"dependencies":[{"issue_id":"miroir-uhj.5.3","depends_on_id":"miroir-9dj","type":"blocks","created_at":"2026-04-24T03:52:33.728248117Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.5.3","depends_on_id":"miroir-r3j","type":"blocks","created_at":"2026-04-24T03:52:33.711555442Z","created_by":"coding","thread_id":""},{"issue_id":"miroir-uhj.5.3","depends_on_id":"miroir-uhj.5.2","type":"blocks","created_at":"2026-04-18T21:52:42.847536177Z","created_by":"coding","thread_id":""}]}