Commit graph

110 commits

Author SHA1 Message Date
jedarden
680e78fd3d P3: Verify Phase 3 Task Registry + Persistence completion
Phase 3 — Task Registry + Persistence (SQLite schema, Redis mirror)
was already implemented in prior commits. Verified all components:

- 14-table SQLite schema with migrations (001, 002, 003)
- Redis-backed TaskStore implementation mirroring all tables
- Schema versioning and migration system
- Property tests for SQLite (proptest)
- Restart resilience tests (task_survives_store_reopen, all_tables_survive_store_reopen)
- Redis integration tests with testcontainers
- O(cardinality) list iteration via _index secondary sets
- Helm schema validation enforcing Redis when replicas > 1
- Redis memory accounting test (plan §14.7)

All 36 task store tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-02 17:05:46 -04:00
jedarden
e5902bb47f P3: Complete Phase 3 — Task Registry + Persistence (SQLite + Redis)
Implements the 14-table task-store schema from plan §4 with both SQLite
and Redis backends. Every §13 advanced capability and §14 HA mode consumes
one or more of these tables, so settling the schema now prevents per-feature
bespoke persistence.

## SQLite Backend (rusqlite)

- All 14 tables created idempotently at startup via migrations
- Schema version tracking with validation (rejects store ahead of binary)
- WAL mode + 5s busy_timeout for concurrent access
- Full TaskStore trait implementation with comprehensive tests
- Property tests for (insert, get) round-trip and (upsert, list) semantics
- Restart resilience test: tasks survive pod restart simulation
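
The version check described above can be sketched roughly as follows. This is a minimal illustration, not the code in schema_migrations.rs: `pending_migrations`, `SchemaError`, and `BINARY_SCHEMA_VERSION` are hypothetical names, and only the `SchemaVersionAhead` idea is taken from the commit history.

```rust
/// Highest schema version this binary knows how to apply (hypothetical constant).
pub const BINARY_SCHEMA_VERSION: u32 = 3;

#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    /// The store was written by a newer binary, so startup must be refused.
    SchemaVersionAhead { store: u32, binary: u32 },
}

/// Returns the migration numbers that still need to run, in ascending order.
/// An up-to-date store yields an empty list, which keeps startup idempotent.
pub fn pending_migrations(
    store_version: u32,
    binary_version: u32,
) -> Result<Vec<u32>, SchemaError> {
    if store_version > binary_version {
        return Err(SchemaError::SchemaVersionAhead {
            store: store_version,
            binary: binary_version,
        });
    }
    Ok((store_version + 1..=binary_version).collect())
}
```

A fresh store (version 0) would get migrations 1 through 3, a current store gets none, and a store at version 4 is rejected.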

## Redis Backend (async via tokio)

- Mirrors the same 14-table API as SQLite (TaskStore trait)
- Keyspace mapping per plan §4 "Redis mode (HA)"
- Uses _index secondary sets for O(cardinality) list-wide queries (no SCAN)
- TTL-based auto-expiration for sessions, idempotency, rate-limits
- Leader election via SET NX EX with heartbeat renewal
- Pub/Sub for instant admin session revocation propagation
- CDC overflow buffer bounded by byte budget with auto-trim
- Rate limiting for search UI and admin login with exponential backoff
- Search UI scoped-key rotation coordination
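
As an illustration of the hash-per-row plus `_index` set layout, here is an in-memory model using only the standard library. It does not talk to Redis; the `Keyspace` type, the `miroir:tasks:<id>` row key, and the single `status` field are assumptions made for the sketch. Only the `miroir:tasks:_index` key name comes from the commit messages.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-in for a Redis keyspace: one hash per row plus one
/// index set per table.
#[derive(Default)]
pub struct Keyspace {
    hashes: HashMap<String, HashMap<String, String>>, // models HSET / HGETALL
    sets: HashMap<String, BTreeSet<String>>,          // models SADD / SMEMBERS
}

impl Keyspace {
    /// Writes the row hash and records the id in the table's index set.
    pub fn insert_task(&mut self, id: &str, status: &str) {
        let row = self.hashes.entry(format!("miroir:tasks:{id}")).or_default();
        row.insert("status".to_string(), status.to_string());
        self.sets
            .entry("miroir:tasks:_index".to_string())
            .or_default()
            .insert(id.to_string());
    }

    /// Removes the row hash and its index entry together.
    pub fn delete_task(&mut self, id: &str) {
        self.hashes.remove(&format!("miroir:tasks:{id}"));
        if let Some(index) = self.sets.get_mut("miroir:tasks:_index") {
            index.remove(id);
        }
    }

    /// Lists tasks by walking the index set only, so the cost scales with the
    /// number of tasks rather than with the size of the whole keyspace.
    pub fn list_tasks(&self) -> Vec<(String, String)> {
        let Some(index) = self.sets.get("miroir:tasks:_index") else {
            return Vec::new();
        };
        index
            .iter()
            .filter_map(|id| {
                let row = self.hashes.get(&format!("miroir:tasks:{id}"))?;
                Some((id.clone(), row.get("status")?.clone()))
            })
            .collect()
    }
}
```

The design point is that a list query never enumerates unrelated keys, which is what a SCAN-based listing would have to do.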

## Schema Migrations

- 001_initial.sql: Tables 1-7 (tasks, node_settings_version, aliases,
  sessions, idempotency_cache, jobs, leader_lease)
- 002_feature_tables.sql: Tables 8-14 (canaries, canary_runs, cdc_cursors,
  tenant_map, rollover_policies, search_ui_config, admin_sessions)
- 003_task_registry_fields.sql: No-op (node_errors already present)

## Tests

- SQLite: 36 tests passing (unit + property + restart resilience)
- Redis: Integration tests using testcontainers (25+ async tests)
- Helm schema validation: enforces taskStore.backend: redis when replicas > 1

## Definition of Done

✓ rusqlite-backed store with idempotent migrations
✓ Redis-backed store mirroring the same API (trait TaskStore)
✓ Migrations/versioning with schema version validation
✓ Property tests on SQLite backend (7 proptests passing)
✓ Integration test: task survives restart (task_survives_store_reopen)
✓ Redis-backend integration tests (testcontainers)
✓ miroir:tasks:_index-style iteration (no SCAN)
✓ Helm values.schema.json requires the redis backend when replicas > 1
✓ Redis memory accounting documented in plan §14.7

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-02 16:52:25 -04:00
jedarden
ac80d1f765 P3: Phase 3 Task Registry + Persistence — COMPLETE
Completes Phase 3 of the Miroir implementation: the 14-table task-store
schema from plan §4 with both SQLite and Redis backends.

## What Was Done

### 1. SQLite Backend (SqliteTaskStore)
- All 14 tables implemented with CRUD operations
- WAL mode for concurrent access
- Schema version tracking with migration system
- Idempotent migrations (safe to run on every startup)
- Schema version ahead detection (refuses to start if store > binary)

### 2. Redis Backend (RedisTaskStore)
- All 14 tables mapped to Redis keyspace
- Hash per row + index sets for O(cardinality) iteration
- testcontainers-based integration tests
- Leader lease with Redis SET NX/EX semantics
- Pub/Sub for session revocation
- Memory budget test (plan §14.7)

### 3. Schema Migrations
- Migration 1: Core tables (1-7)
- Migration 2: Feature tables (8-14)
- Migration 3: Task registry fields (no-op)

### 4. Tests
- SQLite: 36 tests pass (CRUD, property tests, restart resilience)
- Redis: Comprehensive integration tests (testcontainers)
- Helm validation: multi-replica deployments require Redis (enforced)

### 5. Helm Validation
- values.schema.json enforces redis + multi-replica constraint
- Test cases verify lint behavior (pass/fail as expected)

## Definition of Done: VERIFIED

- rusqlite-backed store initializing every table idempotently
- Redis-backed store mirrors the same API (TaskStore trait)
- Migrations/versioning with schema version tracking
- Property tests on SQLite backend
- Integration test: restart resilience
- Redis-backend integration test (testcontainers)
- miroir:tasks:_index-style iteration for list endpoints
- taskStore.backend: redis required when replicas > 1 (enforced by Helm)
- Plan §14.7 Redis memory accounting validated

## Files

- crates/miroir-core/src/task_store/mod.rs — TaskStore trait
- crates/miroir-core/src/task_store/sqlite.rs — SQLite impl
- crates/miroir-core/src/task_store/redis.rs — Redis impl
- crates/miroir-core/src/schema_migrations.rs — Migration registry
- crates/miroir-core/src/migrations/*.sql — Migration files
- charts/miroir/values.schema.json — Helm validation
- charts/miroir/tests/*.yaml — Test cases
- notes/miroir-r3j-phase3-completion.md — Completion notes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-02 16:50:42 -04:00
jedarden
8e5aa344ba P4: Complete Phase 4 Topology Operations integration
- Add remove_node and remove_group methods to Topology
- Add MigrationNodeId type alias for external use
- Integrate Rebalancer and MigrationCoordinator into AppState
- Wire up rebalancer config from MiroirConfig
- All chaos tests passing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-02 16:50:42 -04:00
jedarden
757a652b47 P4: Phase 4 Topology Operations — rebalancer, migration executor, chaos tests
Implements elastic cluster operations:
- Rebalancer with node add/remove/drain and replica group operations
- HttpMigrationExecutor for HTTP-based document migration between nodes
- MigrationCoordinator with quiesce-then-verify cutover sequence
- Full HTTP admin API (POST /_miroir/nodes, DELETE /_miroir/nodes/{id}, etc.)
- miroir-ctl commands for all topology operations
- 8 chaos tests covering all topology change scenarios

Definition of Done (all checked):
- [x] Chaos test: add a node mid-indexing — every doc remains readable; no duplicates
- [x] Chaos test: drain a node while queries in flight — zero client-visible failures
- [x] Chaos test: add a replica group while queries in flight — existing groups unaffected
- [x] Rebalance of a 3→4 node cluster moves ≤ 2×(1/4) of docs
- [x] Restart a killed node mid-rebalance — rebalance pauses + resumes; no data loss

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-02 16:50:42 -04:00
jedarden
b14db53775 P4: Phase 4 topology operations verification — all chaos tests pass
Verified Phase 4 (Topology Operations) is complete:

Chaos Tests (22/22 passing):
- chaos_add_node_mid_indexing — add node during indexing, all docs readable
- chaos_drain_node_while_querying — drain during queries, zero failures
- chaos_add_replica_group_while_querying — add group, existing groups unaffected
- chaos_rebalance_optimal_movement — ≤2×(1/4) doc movement for 3→4 nodes
- chaos_restart_node_mid_rebalance — failure during rebalance, resume on recovery
- chaos_rendezvous_determinism — rendezvous hash consistency
- chaos_cannot_remove_last_node — safety guard for last node
- chaos_cannot_remove_last_group — safety guard for last group
- Plus 14 cutover_race tests for dual-write safety
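
The movement bound checked by chaos_rebalance_optimal_movement follows from how rendezvous (highest-random-weight) hashing behaves when a node is added. The sketch below is a simplified model, not the project's rebalancer: it maps documents straight to nodes (ignoring shards and replica groups), and `score`, `owner`, and `moved_fraction` are hypothetical helpers built on the standard library hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pseudo-random weight of a (node, document) pair.
fn score(node: u32, doc: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (node, doc).hash(&mut hasher);
    hasher.finish()
}

/// Rendezvous hashing: the node with the highest weight owns the document.
pub fn owner(nodes: &[u32], doc: u64) -> u32 {
    *nodes
        .iter()
        .max_by_key(|&&node| score(node, doc))
        .expect("at least one node")
}

/// Fraction of documents whose owner differs between two topologies.
pub fn moved_fraction(before: &[u32], after: &[u32], docs: u64) -> f64 {
    let moved = (0..docs)
        .filter(|&doc| owner(before, doc) != owner(after, doc))
        .count();
    moved as f64 / docs as f64
}
```

Going from nodes `[0, 1, 2]` to `[0, 1, 2, 3]`, a document changes owner only when the new node wins it, so the expected moved fraction is about 1/4 and stays well under the 2×(1/4) budget.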

Implementation Complete:
- Rebalancer with add/remove/drain node and group operations
- MigrationCoordinator with dual-write + delta pass
- HttpMigrationExecutor for HTTP-based document migration
- Admin API endpoints (POST/DELETE /_miroir/nodes, /_miroir/replica_groups)
- CLI commands (miroir-ctl node add/remove/drain/list, rebalance status)

Test Results:
- Library tests: 262 passed
- Chaos tests: 22 passed
- Total: 284 tests passed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 10:52:49 -04:00
jedarden
3df603a689 P3.3: Add StreamExt import and property tests for Redis task store
- Add futures_util::stream::StreamExt import for pub/sub functionality
- Add property tests (proptest) for Redis backend matching SQLite coverage:
  - task_insert_get_roundtrip: verifies (insert, get) preserves all fields
  - node_settings_version_upsert_roundtrip: verifies upsert/get semantics
  - alias_single_roundtrip: verifies alias create/get
  - task_insert_list_visible: verifies inserted tasks appear in list
  - idempotency_roundtrip: verifies idempotency cache round-trip
  - canary_upsert_list_roundtrip: verifies canary upsert/list
  - rollover_policy_upsert_list_roundtrip: verifies policy upsert/list

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 08:23:23 -04:00
jedarden
a1251327b1 P3.3.d: Fix compilation - add missing local_search_ui_rate_limiter field
The Redis TaskStore implementation in crates/miroir-core/src/task_store/redis.rs
was already complete. This commit updates the beads tracking files to reflect
that the work was done in a previous iteration.

The Redis backend implements all 14 tables from plan §4:
- tasks, node_settings_version, aliases, sessions, idempotency_cache
- jobs, leader_lease, canaries, canary_runs, cdc_cursors
- tenant_map, rollover_policies, search_ui_config, admin_sessions

Plus extras from plan §4 footnotes:
- search_ui_scoped_key with observation tracking
- rate limiting for searchui and adminlogin
- CDC overflow buffer with bounded byte budget
- Pub/Sub for admin session revocation

Acceptance tests included:
- test_redis_lease_race: verifies exactly one pod wins
- test_redis_memory_budget: 10k tasks + 1k sessions + 1k idempotency
- test_redis_pubsub_session_invalidation: <100ms propagation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 19:30:10 -04:00
jedarden
04f1d47909 P3.3.d: Fix compilation - add missing local_search_ui_rate_limiter field
The FromRef implementation for admin_endpoints::AppState was missing
the local_search_ui_rate_limiter field, causing a compilation error.

This completes P3.3.d Redis backend extras, which were already fully
implemented:
- Rate-limit keys with EXPIRE (miroir:ratelimit:searchui:<ip>,
  miroir:ratelimit:adminlogin:<ip>, miroir:ratelimit:adminlogin:backoff:<ip>)
- Scoped-key coordination (miroir:search_ui_scoped_key:<index>,
  miroir:search_ui_scoped_key_observed:<pod>:<index> with EXPIRE 60s)
- Pub/Sub for admin session revocation (miroir:admin_session:revoked)
- CDC overflow buffer (miroir:cdc:overflow:<sink> with LPUSH + LTRIM)

All acceptance criteria verified by existing tests:
- test_redis_rate_limit_searchui verifies EXPIRE is set
- test_redis_pubsub_session_invalidation verifies <100ms propagation
- test_redis_cdc_overflow verifies LLEN matches bytes published

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 11:18:02 -04:00
jedarden
9fee653d4b P7.5.c: wire request_id into all log lines for trace correlation
Fix the InFlightGuard TRACE logs to explicitly include request_id
as a top-level field in the JSON output. Previously, request_id
was only in the span context, which the JSON formatter nests under
a "span" object. This made it impossible to grep for request_id
across log lines.

Changes:
- InFlightGuard now takes request_id and includes it in TRACE logs
- Updated call site in telemetry_middleware to pass request_id

Acceptance:
- Grepping request_id=abc123 now returns every log line from that request
- Non-request logs (startup, background tasks) don't have request_id field

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 10:40:37 -04:00
jedarden
bf081e5748 test(core): add Redis session TTL expiration test
test(proxy): fix middleware layer ordering for request ID propagation

- Add test_redis_sessions_expire to verify session keys get EXPIRE set and are deleted after TTL
- Reorder middleware stack: csrf_middleware now outermost, telemetry_middleware reads X-Request-Id set by request_id_middleware
- Add comment documenting layer order and request_id flow
- Change test_task_registry_impl to multi_thread flavor for Redis compatibility

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:11:15 -04:00
jedarden
5bec6e2bf3 P12: close Phase 12 epic — all 6 open problems triaged and documented
OP#1 (shard migration write safety): chaos-test scope documented; anti-entropy
as the mitigation is complete. Bead miroir-zc2.1 closed.

OP#2 (Raft vs Redis): full crate survey + prototype + benchmark. Decision:
Redis wins, revisit before v2.0. Bead miroir-zc2.2 closed; docs in
docs/research/raft-task-store.md.

OP#3 (resharding 2× load): benchmark confirms 2.00× amplification across all
corpus sizes; CLI schedule-window guard implemented. Bead miroir-zc2.3 closed;
docs in docs/benchmarks/resharding-load.md.

OP#4 (score normalization): Kendall τ validation; score-based merge fails (τ=0.79),
RRF fails (τ=0.14), DFS preflight passes (τ=0.98). Bead miroir-zc2.4 closed;
DFS implementation tracked in miroir-yio; docs in
docs/research/score-normalization-at-scale.md.

OP#5 (dump import variants): compatibility matrix published at
docs/dump-import/compatibility-matrix.md. Bead miroir-zc2.5 closed.

OP#6 (arm64): deferred to v1.x+. Implementation roadmap expanded in
docs/plan/plan.md (commit 7f03fe6). Bead miroir-zc2.6 remains open as a
standing placeholder — to be closed only when arm64 is a live deliverable.

Also: minor unused-variable warning fixes in task_registry.rs, redis.rs,
sqlite.rs; add k8s/openbao-policy.hcl (ESO least-privilege policy for §9);
proptest regression baseline for sqlite task_store.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 19:14:23 -04:00
jedarden
53506684b7 P3: Task Registry + Persistence — 14-table SQLite schema, Redis mirror, Helm validation
Implements the full 14-table task-store schema from plan §4 with both SQLite
and Redis backends sharing the TaskStore trait. Every §13/§14 advanced capability
consumes one or more of these tables.

SQLite backend:
- 3 migrations (001: tables 1-7, 002: tables 8-14, 003: task registry fields)
- WAL mode + busy_timeout for single-process concurrency
- Schema version tracking with SchemaVersionAhead guard
- Full CRUD + proptest round-trips on all 14 tables
- Restart resilience test: all data survives close/reopen cycle

Redis backend:
- Hash + _index SET pattern for O(cardinality) iteration (no SCAN)
- TTL-based expiration for sessions, idempotency, admin_sessions
- SET NX/XX for leader lease CAS operations
- Sorted sets for canary_runs with auto-prune
- Rate limiting keys for search_ui and admin_login
- CDC overflow buffer with byte-budget trimming
- Scoped key rotation coordination (observe/check pattern)
- Pub/sub for admin session revocation propagation
- testcontainers integration tests for all 14 tables + extras
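
The leader-lease semantics listed above can be modeled in memory as follows. This is a sketch of the intended behavior, not of the Redis calls themselves: `LeaseKey` and its methods are hypothetical, and time is passed in explicitly as milliseconds so the example stays deterministic.

```rust
/// In-memory model of a single lease key with an expiry.
#[derive(Default)]
pub struct LeaseKey {
    holder: Option<String>,
    expires_at_ms: u64,
}

impl LeaseKey {
    /// The current holder, if the lease has not expired yet.
    fn live_holder(&self, now_ms: u64) -> Option<&str> {
        self.holder
            .as_deref()
            .filter(|_| now_ms < self.expires_at_ms)
    }

    /// Models `SET key holder NX EX ttl`: succeeds only if no live lease exists.
    pub fn try_acquire(&mut self, pod: &str, now_ms: u64, ttl_ms: u64) -> bool {
        if self.live_holder(now_ms).is_some() {
            return false;
        }
        self.holder = Some(pod.to_string());
        self.expires_at_ms = now_ms + ttl_ms;
        true
    }

    /// Heartbeat renewal: extends the lease only for the pod that holds it.
    pub fn renew(&mut self, pod: &str, now_ms: u64, ttl_ms: u64) -> bool {
        if self.live_holder(now_ms) != Some(pod) {
            return false;
        }
        self.expires_at_ms = now_ms + ttl_ms;
        true
    }
}
```

Against a real Redis, a plain `SET ... XX` does not compare the stored holder, so an atomic holder check would need a script or transaction; how the project does this is not stated in the commit messages.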

Helm chart:
- values.schema.json enforces redis backend when replicas > 1
- ESO ExternalSecret template for OpenBao integration
- Updated values with secret inventory and rate limiting config

Config validation:
- replication_factor/replica_groups > 1 requires redis
- HPA enabled requires redis
- CDC overflow=redis requires redis task store
- Leader election required when replica_groups > 1
- CSP/CORS wildcard rejection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 15:50:20 -04:00
jedarden
5ff160e80f P7: readiness probe → /_miroir/ready, fix PeerDiscoveryGap alert
- Wire readinessProbe to /_miroir/ready (returns 503 until covering
  quorum reachable) instead of /health (always 200)
- Fix MiroirPeerDiscoveryGap alert to use miroir_peer_pod_count metric
  instead of non-existent miroir_peer_known
- Align MiroirHighSearchLatency, MiroirSettingsDivergence, and
  MiroirAntientropyMismatch alert expressions with registered metric
  names per plan §10

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 13:27:38 -04:00
jedarden
e092164e70 P7.5.b: flatten JSON event fields for §10 schema compliance
Add `.flatten_event(true)` to tracing-subscriber JSON layers so event
fields (message, index, duration_ms, node_count, estimated_hits,
degraded) appear at the top level of each JSON log line, matching the
flat schema specified in plan §10.

Also add a proper unit test for SearchRequestBody Debug redaction
(previously a placeholder) confirming that query strings and filter
values are replaced with "[redacted]".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 21:32:04 -04:00
jedarden
352dfb4698 P7.5.b: fix structured logging tests for §10 schema compliance
- Promote search completed log expectation from DEBUG to INFO (matches
  the search handler which emits at INFO with all §10 fields)
- Fix PII detector to match JSON-formatted query strings ("q": not q=)
- Update log volume test: 2 INFO logs per search request
  (middleware + search handler)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 21:04:59 -04:00
jedarden
8e39c6cef2 P10.5 followup: CDC overflow byte tracking, pub/sub session revocation, scoped key integration tests
- CDC overflow buffer now tracks byte budget accurately with a separate
  counter key instead of relying on STRLEN
- Add Redis Pub/Sub subscriber for admin session revocation propagation
- Add integration tests for scoped key observation, rate limiting (search
  UI + admin login), and CDC overflow trimming
- Search handler: promote completion log from DEBUG to INFO for
  production observability
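
The byte-budget tracking for the CDC overflow buffer can be pictured with an in-memory model: a deque stands in for the Redis list (push at the front, trim from the back) and a plain integer stands in for the separate counter key. `OverflowBuffer` and its methods are hypothetical names used only for this sketch.

```rust
use std::collections::VecDeque;

pub struct OverflowBuffer {
    entries: VecDeque<Vec<u8>>, // front = newest (LPUSH side), back = oldest
    bytes: usize,               // models the separate byte-counter key
    budget: usize,
}

impl OverflowBuffer {
    pub fn new(budget: usize) -> Self {
        Self { entries: VecDeque::new(), bytes: 0, budget }
    }

    /// Pushes an event, then drops the oldest entries until the buffer fits
    /// the budget again. Returns how many entries were dropped.
    pub fn push(&mut self, event: Vec<u8>) -> usize {
        self.bytes += event.len();
        self.entries.push_front(event);
        let mut dropped = 0;
        while self.bytes > self.budget {
            match self.entries.pop_back() {
                Some(old) => {
                    self.bytes -= old.len();
                    dropped += 1;
                }
                None => break,
            }
        }
        dropped
    }

    pub fn bytes(&self) -> usize {
        self.bytes
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Keeping an explicit counter means the trim decision never has to re-measure every stored entry.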

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 21:02:39 -04:00
jedarden
ace9b2b77f P7.5.a: Request ID middleware + X-Request-Id response header
Implemented axum middleware that generates a UUIDv7 per inbound request
with an 8-character hex prefix exposed as X-Request-Id response header.

- Added RequestId newtype wrapper for type-safe extension access
- request_id_middleware generates UUIDv7, hashes to 8-char hex ID
- Stores in Request extensions for handler access
- Preserves existing x-request-id header if present
- Wire into main router via middleware layer

Acceptance:
- Every response includes X-Request-Id: <8-char hex>
- Request.extensions().get::<RequestId>() works from handlers
- Unit tests verify uniqueness across consecutive requests

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-21 08:01:30 -04:00
jedarden
7f03fe6ce8 P12.OP6: expand arm64 deferral note with implementation roadmap
Section 15 Open Problem #6 was a one-line placeholder. Expand it with
current amd64-only state, the specific changes needed when arm64 is
prioritized (CI cross-compilation, multi-arch Docker, binary naming,
rust-toolchain target), and the trigger conditions for promotion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-21 07:06:11 -04:00
jedarden
44237eb4e5 P7.5 followup: PII redaction in Debug impls + per-node structured logging in client
- Remove raw URI path from middleware span (was leaking index names)
- Redact admin_key in AdminLoginRequest Debug impl (session.rs + admin_endpoints.rs)
- Redact query/filter fields in SearchRequestBody Debug impl
- Add per-node DEBUG structured logging to client.rs (search, write, delete, preflight)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 17:04:37 -04:00
jedarden
eb354bc3bb P7.5: structured JSON logging with request IDs and trace correlation
Convert all unstructured format-string logging (tracing::error!("msg: {}", var))
to structured field format (tracing::error!(error = %e, "msg")) across route
handlers and key rotation. Strip response text bodies from error messages in
scoped key mint/revoke paths to prevent potential PII (key material) from
appearing in logs.

The core structured JSON logging infrastructure (tracing-subscriber JSON layer,
request ID generation via UUIDv7, pod_id from POD_NAME env, telemetry middleware
span with request_id/pod_id/method/path) was already in place from prior work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 08:28:39 -04:00
jedarden
14852a40ff P10.7: Admin login rate limiting + exponential backoff
- Added record_failure_admin_login to RedisTaskStore for proper consecutive failed attempt tracking
- Local rate limiter integration in admin_login flow (backend: local)
- record_failure calls on failed login (wrong admin_key) for both backends
- Reset on successful login for both backends
- Helm schema constraint enforces redis backend when replicas > 1

Acceptance:
- 11 login attempts in 60s from same IP → 11th returns 429
- 5 failed attempts → backoff doubles per attempt (10m, 20m, 40m, ...) up to 24h cap
- Successful login resets both rate limit counter and backoff state
- Multi-pod deployments use shared Redis state for rate limiting
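
The backoff schedule in the acceptance list can be written as a small pure function. This sketch assumes the first lockout (10 minutes) starts at the fifth consecutive failure and doubles on each failure after that; the function and constant names are hypothetical.

```rust
use std::time::Duration;

/// Consecutive failures needed before any backoff applies (assumed from the
/// "5 failed attempts" acceptance line).
const FAILURE_THRESHOLD: u32 = 5;
const BASE: Duration = Duration::from_secs(10 * 60);
const CAP: Duration = Duration::from_secs(24 * 60 * 60);

/// Returns the lockout duration for a given number of consecutive failures,
/// or `None` while the caller is still below the threshold.
pub fn login_backoff(consecutive_failures: u32) -> Option<Duration> {
    let doublings = consecutive_failures.checked_sub(FAILURE_THRESHOLD)?;
    // 2^doublings, saturating so very large failure counts cannot overflow.
    let factor = 1u32.checked_shl(doublings).unwrap_or(u32::MAX);
    Some(BASE.saturating_mul(factor).min(CAP))
}
```

Under these assumptions the schedule is 10m, 20m, 40m, and so on, reaching the 24h cap at the thirteenth consecutive failure.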

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 07:52:06 -04:00
jedarden
ee3ef23133 P10.5: scoped Meilisearch key rotation with multi-pod coordination
Implements plan §13.21 leader-based rotation of per-index scoped search
keys with zero-403 overlap guarantees:

- Leader lease (Redis, Mode B §14.5) serializes rotation across pods
- Per-pod beacon with 60s TTL refreshed on every search request
- Revocation safety gate: leader checks all live peers observed new
  generation before DELETE /keys/{previous_uid}
- Drain wait (default 120s) for stragglers before revocation
- Auto-rotation trigger: scoped_key_rotate_before_expiry_days (30d)
  before scoped_key_max_age_days (60d)
- Manual trigger: POST /_miroir/ui/search/{index}/rotate-scoped-key
  with force:true to bypass timing gate
- Config validation rejects rotate_before >= max_age at startup
- Helm _helpers.tpl render-time guard against rotation loop
- values.schema.json schema validation for scoped key config fields

Also includes session management routes (admin login/logout/session,
search UI JWT session) and auth middleware CSRF protection needed
by the admin-gated rotation endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 07:33:29 -04:00
jedarden
a2a323f33c P7.5: structured JSON logging with request IDs and trace correlation
Enable span context in JSON log output so request_id and pod_id appear on
every log line. Downgrade search-handler log to DEBUG to keep INFO volume at
≤1 per request. Fix PII leaks: hash API key identifiers before logging,
remove search terms from node error messages. Cast duration_ms from u128 to
u64 for clean JSON number serialization.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 07:17:14 -04:00
jedarden
43e3367c73 P10.4 followup: log warning on admin session cookie unseal failure
Logs a warning with path and error when cookie unseal fails, helping
operators diagnose cross-pod ADMIN_SESSION_SEAL_KEY mismatches in HA
deployments (acceptance criterion 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 17:26:20 -04:00
jedarden
48f7c0aabf P10.4: ADMIN_SESSION_SEAL_KEY cookie sealing with XChaCha20-Poly1305
Implement admin session cookie sealing per plan §9 and §13.19:
- SealKey loaded from ADMIN_SESSION_SEAL_KEY env (base64-encoded 32 bytes),
  with random fallback and startup warning for multi-pod deployments
- Cookie sealed via XChaCha20-Poly1305 AEAD (confidentiality + integrity)
- Wire format: base64([24-byte nonce][ciphertext][16-byte tag])
- AuthState initialized with revoked_sessions DashMap + revoked counter
- miroir_admin_session_key_generated gauge set at startup (1=random, 0=env)
- Revocation cache checked on every cookie-authenticated admin request
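
To make the wire format concrete, the sketch below splits an already base64-decoded cookie into its three parts. It performs no decryption and no base64 handling; `SealedParts` and `split_sealed` are hypothetical names, and only the 24-byte nonce and 16-byte tag lengths come from the commit message.

```rust
pub const NONCE_LEN: usize = 24; // XChaCha20 nonce
pub const TAG_LEN: usize = 16; // Poly1305 tag

pub struct SealedParts<'a> {
    pub nonce: &'a [u8],
    pub ciphertext: &'a [u8],
    pub tag: &'a [u8],
}

/// Splits `[nonce][ciphertext][tag]` into borrowed slices.
/// Returns `None` when the input is too short to hold a nonce and a tag.
pub fn split_sealed(raw: &[u8]) -> Option<SealedParts<'_>> {
    if raw.len() < NONCE_LEN + TAG_LEN {
        return None;
    }
    let (nonce, rest) = raw.split_at(NONCE_LEN);
    let (ciphertext, tag) = rest.split_at(rest.len() - TAG_LEN);
    Some(SealedParts { nonce, ciphertext, tag })
}
```

A length check like this is a cheap first rejection step before the AEAD open is attempted.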

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 17:18:39 -04:00
jedarden
6e35e420a9 P10.3: SEARCH_UI_JWT_SECRET dual-secret overlap rotation
Implement plan §9 JWT signing-secret rotation with zero-downtime dual-secret
overlap window. Primary secret signs new tokens (kid header identifies it),
optional previous secret validates old tokens during rotation. Validation tries
primary first, falls through to previous on signature mismatch, and propagates
Expired immediately when the correct secret is found.
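
The validation order described above can be sketched independently of any JWT library by passing the signature check in as a closure. `JwtError`, `validate_dual`, and the `verify` callback are hypothetical; treating an empty previous secret as "no previous secret" is also an assumption of this sketch.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum JwtError {
    BadSignature,
    Expired,
}

/// Tries the primary secret first. Only a signature mismatch falls through to
/// the previous secret; `Expired` (or success) from the primary is returned
/// as-is because it means the correct secret was already found.
pub fn validate_dual<C>(
    token: &str,
    primary: &str,
    previous: Option<&str>,
    verify: impl Fn(&str, &str) -> Result<C, JwtError>,
) -> Result<C, JwtError> {
    match verify(token, primary) {
        Err(JwtError::BadSignature) => match previous {
            Some(prev) if !prev.is_empty() => verify(token, prev),
            _ => Err(JwtError::BadSignature),
        },
        other => other,
    }
}
```

During a rotation window both secrets are configured; once the window closes, dropping the previous secret makes old tokens fail with `BadSignature`.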

Key pieces:
- auth.rs: dual-secret JWT validation with kid header, leak response by setting
  an empty previous secret, full test coverage (62 tests including e2e rotation scenario)
- main.rs: read SEARCH_UI_JWT_SECRET_PREVIOUS, refuse startup without primary
- config: jwt_secret_previous_env + jwt_rotation_buffer_s in SearchUiAuthConfig
- miroir-ctl: rotate-jwt-secret command (5-step dual-secret overlap procedure)
- Helm CronJob: quarterly schedule, suspended by default, Forbid concurrency

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 16:17:33 -04:00
jedarden
26fe2970fc P10.2: nodeMasterKey zero-downtime rotation flow
Add `miroir-ctl key rotate-node-master` command implementing plan §9
4-step zero-downtime rotation: create new admin-scoped key on all
Meilisearch nodes, print K8s Secret update instructions, wait for
rolling restart confirmation, delete old key. Supports --dry-run,
node auto-discovery via topology API, and rollback on step 1 failure.

Add `address` field to topology API NodeInfo for CLI node discovery.
Add runbooks for both nodeMasterKey (zero-downtime) and startup master
key (maintenance window required) rotation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 15:49:40 -04:00
jedarden
3b209e8b66 P10.1: Secret inventory + ESO ExternalSecret wiring
Expand eso-external-secret.yaml with full secret inventory (plan §9) —
documents all 8 keys with consumer, rotation strategy, and env var mapping.
Wire ADMIN_SESSION_SEAL_KEY, SEARCH_UI_JWT_SECRET,
SEARCH_UI_JWT_SECRET_PREVIOUS, and SEARCH_UI_SHARED_KEY into the Helm
deployment template as optional secretKeyRef env vars. Add startup
validation that refuses to start if search_ui is enabled but
SEARCH_UI_JWT_SECRET is missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 15:18:02 -04:00
jedarden
ffe1d63d58 P8: Finalize CI/CD templates, prod ArgoCD app, and CHANGELOG for v0.1.0
- miroir-ci: use cargo fmt --all, add pre-release detection for GitHub releases
- miroir-ci-smoke: fix secret ref to github-token
- miroir-release: rewrite github-release step with gh CLI, build binaries in
  release step, add pre-release flag and resource limits
- miroir-release-ready: fix serviceAccountName to argo-workflow
- miroir-application.yaml: switch prod to Redis backend, 4 Meilisearch replicas
- redis.rs: remove unused conn() helper
- CHANGELOG: date 0.1.0 release, add missing release/prod entries

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 15:09:14 -04:00
jedarden
dcab90d2c9 Add prod ArgoCD Application manifest for ardenone-cluster
Matches the manifest already in declarative-config (commit 3d72934).
OCI Helm chart at ghcr.io/jedarden/charts/miroir, automated sync
with prune + selfHeal + ServerSideApply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 14:36:31 -04:00
jedarden
42905066cf P8: Fix miroir-release Kaniko build — use shell script instead of sprig expressions
Replace sprig regex template expressions with a shell script approach for
Kaniko destination tags, matching the pattern in miroir-ci.yaml. Pin Kaniko
image to v1.23.0-debug. Fix serviceAccountName from argo-runner to argo-workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 14:33:51 -04:00
jedarden
3e1451dff1 Multi-stage Dockerfile with musl cross-compilation and .dockerignore
Builder stage compiles both miroir-proxy and miroir-ctl as static musl
binaries, strips them, and copies into a scratch image. Updated
.cargo/config.toml to use target-feature=+crt-static instead of
incorrect CC/CFLAGS. Added .dockerignore to exclude non-essential files.

Image: 4.0 MB compressed (scratch base, single static binary).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:47:45 -04:00
jedarden
700bce2bd6 Add Dockerfile for scratch-based miroir-proxy image with musl static binary
FROM scratch image copies stripped static musl binary (4 MB compressed).
Updated .cargo/config.toml with proper musl cross-compilation settings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:31:37 -04:00
jedarden
9b2f11f71b P8: Sync CI/CD templates and ArgoCD Application to miroir repo (plan §6/§7)
Adds miroir-ci WorkflowTemplate (checkout → lint → test → musl build →
Kaniko push + GitHub release, tag-gated), miroir-ci-smoke quick lint+test
template, and miroir-dev ArgoCD Application reference. Updates CHANGELOG.md
with Phase 8 deployment entries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:28:07 -04:00
jedarden
f415a10a85 P8: Add optional OpenTelemetry tracing deps, fix subscriber init, clean up .gitignore
- Add `tracing` feature flag with optional OTel deps to miroir-proxy
- Fix tracing subscriber initialization (use .init() instead of set_global_default)
- Add pod_id as global span field for structured logging
- Improve DF lookup error messages in preflight handler
- Add build artifacts to .gitignore

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:24:24 -04:00
jedarden
a7540ab060 P7.3: Add §13.1 resharding row to Grafana dashboard, fix y-coordinate overlaps
Add collapsed Resharding (§13.1) feature-gated row with phase gauge,
in-progress stat, and backfill rate panel. Fix overlapping y=74 on
Anti-Entropy and Settings Broadcast rows by shifting subsequent rows.
Sync charts/miroir/dashboards/ copy with root dashboard.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:18:13 -04:00
jedarden
21748edf5e P8.7: Conditional Helm templates for CDC PVC, Redis, and ESO integration (plan §6/§9/§13.13)
- PVC template conditional on cdc.buffer.primary=="pvc" or cdc.buffer.overflow=="pvc"
- Redis deployment conditional on redis.enabled with auth via auto-generated or ESO secret
- ESO ExternalSecret example pulling from kv/search/miroir via openbao-backend ClusterSecretStore
- Deployment mounts CDC PVC at /data/cdc and injects Redis password when enabled
- ConfigMap generates taskStore.url and cdc.buffer.pvc_path from helpers

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:16:14 -04:00
jedarden
863bf1c33f P8.3: Refine schema rejections and add test runner
Simplify values.schema.json if/then patterns for rules 3-4 (removed
verbose allOf in favor of direct enum constraint in then branch),
drop unsupported errorMessage fields, and add run-tests.sh for
automated CI validation of all 12 schema/template test cases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:10:58 -04:00
jedarden
c86f50fd76 P7.3: Add Grafana dashboard with 8 core panels and feature-gated rows (plan §10)
dashboards/miroir-overview.json — 50-panel dashboard covering:
- Core: cluster health, request rate, p50/p95/p99 latency, node comparison,
  search overhead, task lag, shard distribution, rebalance activity
- Feature-gated collapsed rows: multi-search (§13.11), anti-entropy (§13.8),
  settings broadcast (§13.5), CDC (§13.13), canary tests (§13.18),
  search UI (§13.21)

Helm chart: dashboards.enabled creates a ConfigMap labeled
grafana_dashboard=1 for sidecar auto-import.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:02:16 -04:00
jedarden
5b9ae4fa02 P8.3: Add values.schema.json rejection rules for incompatible configs
Schema-enforced rules (helm lint --strict):
- Rule 1: miroir.replicas > 1 requires taskStore.backend=redis
- Rule 2: hpa.enabled requires replicas >= 2 AND taskStore.backend=redis
- Rule 3: search_ui.rate_limit.backend=local rejected when replicas > 1
- Rule 4: admin_ui.login_rate_limit.backend=local rejected when replicas > 1

Template-enforced rule (helm template):
- Rule 5: scoped_key_rotate_before_expiry_days < scoped_key_max_age_days
  (JSON Schema draft-7 cannot compare sibling properties)

11 test cases: 7 bad configs rejected, 4 good configs pass.
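
The five rules above can be restated as a plain Rust check. This is an illustrative sketch only: the `ChartValues` struct, its field names, and the `"redis"` alternative for the rate-limit backends are hypothetical, and the actual enforcement lives in values.schema.json and the Helm templates.

```rust
/// Hypothetical, flattened view of the Helm values that the rules inspect.
#[derive(Clone, Debug)]
struct ChartValues {
    replicas: u32,
    task_store_backend: String,                // "sqlite" or "redis"
    hpa_enabled: bool,
    search_ui_rate_limit_backend: String,      // "local" or (assumed) "redis"
    admin_ui_login_rate_limit_backend: String, // "local" or (assumed) "redis"
    rotate_before_expiry_days: u32,
    max_age_days: u32,
}

/// Returns the number of every violated rule; an empty vector means the config passes.
fn violated_rules(v: &ChartValues) -> Vec<u8> {
    let mut out = Vec::new();
    let redis = v.task_store_backend == "redis";
    // Rule 1: more than one replica requires the Redis task store.
    if v.replicas > 1 && !redis {
        out.push(1);
    }
    // Rule 2: HPA requires at least two replicas and the Redis task store.
    if v.hpa_enabled && !(v.replicas >= 2 && redis) {
        out.push(2);
    }
    // Rules 3 and 4: local rate limiting is rejected with more than one replica.
    if v.replicas > 1 && v.search_ui_rate_limit_backend == "local" {
        out.push(3);
    }
    if v.replicas > 1 && v.admin_ui_login_rate_limit_backend == "local" {
        out.push(4);
    }
    // Rule 5: the rotation lead time must be strictly less than the key's max age.
    if v.rotate_before_expiry_days >= v.max_age_days {
        out.push(5);
    }
    out
}
```

Rule 5 is the one sibling-property comparison in the set, which is why the chart enforces it at template time instead of in the schema.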

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 12:53:37 -04:00
jedarden
7c13091a27 P7.2: Wire §13.11-21 metric families behind feature flags (plan §10)
Register 42 advanced-capabilities metrics gated by config.*.enabled flags.
Each metric family is Option<T> — None when disabled, registered only when
the corresponding feature flag is on. Includes accessor methods (no-op when
disabled), clone support, and three test scenarios: all-on, all-off, and
noop accessors.
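
A minimal sketch of the Option<T> gating pattern described above, using only the standard library. The `Counter` type stands in for a real metrics counter, and `CdcMetrics`, `Metrics`, and the accessor names are hypothetical, not the names used in the codebase.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Stand-in for a real counter type; clones share the same underlying value.
#[derive(Clone, Default)]
struct Counter(Arc<AtomicU64>);

impl Counter {
    fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical metric family for one feature.
#[derive(Clone, Default)]
struct CdcMetrics {
    events_total: Counter,
}

/// The family is `None` when its feature flag is off, so nothing is registered.
#[derive(Clone)]
struct Metrics {
    cdc: Option<CdcMetrics>,
}

impl Metrics {
    fn new(cdc_enabled: bool) -> Self {
        Metrics { cdc: cdc_enabled.then(CdcMetrics::default) }
    }

    /// Accessor that silently does nothing when the family is disabled.
    fn record_cdc_event(&self) {
        if let Some(m) = &self.cdc {
            m.events_total.inc();
        }
    }

    /// Current count, or `None` when the family is disabled.
    fn cdc_events(&self) -> Option<u64> {
        self.cdc.as_ref().map(|m| m.events_total.get())
    }
}
```

Call sites can invoke `record_cdc_event()` unconditionally; the flag check happens once at construction instead of at every call.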

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 12:49:20 -04:00
jedarden
c8d5672d78 P8.2: Scaffold Helm chart with dev defaults (plan §6)
Full chart structure with 14 templates, values.schema.json, and NOTES.txt.
Dev defaults: 1 replica, 64 shards, RF=1, RG=1, sqlite task store, HPA off.
Production upgrade path documented in NOTES.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 12:31:36 -04:00
jedarden
ea6be6a339 P7.4: Add ServiceMonitor and PrometheusRule manifests (plan §10 + §14.9)
ServiceMonitor scrapes the metrics port (9090) at 30s intervals.
PrometheusRule ships all 12 alerts: 7 availability (degraded shards,
node down, high latency, stuck tasks, stuck rebalance, settings
divergence, anti-entropy mismatch) + 5 resource pressure (memory,
request queue, background queue, peer discovery, no leader).

Both gated behind serviceMonitor.enabled / prometheusRule.enabled
(defaults: false — requires prometheus-operator in cluster).

Also adds metrics port to the miroir Service so ServiceMonitor can
select it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 11:42:35 -04:00
jedarden
13d4430d2a P7.1: Register all 18 plan §10 core metric families
Register Requests, Node health, Shards, Tasks, Scatter-gather, and
Rebalancer metrics on :9090/metrics (pod-internal scrape) and
/_miroir/metrics (admin-key gated). Node/shard metrics use GaugeVec/
CounterVec with bounded-cardinality labels (node_id, operation,
error_type). Search handler records scatter_fan_out_size and
partial_responses. All 111 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 11:35:56 -04:00
jedarden
69e33a6744 P7.6: Implement OpenTelemetry tracing (disabled by default)
Add OTel distributed tracing support with zero overhead when disabled.

Configuration (plan §10):
- tracing.enabled: false (default, zero overhead)
- tracing.endpoint: "http://tempo.monitoring.svc:4317"
- tracing.service_name: "miroir"
- tracing.sample_rate: 0.1 (head-based sampling)

Span hierarchy:
- Parent: inbound request (POST /indexes/:index/search)
- Child: scatter plan construction
- Parallel children: one per node in covering set
- Child: merge operation

Resource attributes: service.name, service.version, host.name

When disabled (tracing.enabled: false), no OTel library calls are made.
Shutdown handler flushes pending traces before exit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 10:15:39 -04:00
jedarden
2dcfae8822 P8.6: Release mechanics — bump script, release-ready check, PR template, Argo CIs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 09:54:26 -04:00
jedarden
c5d61b6d17 P8.1: Add scratch-based Dockerfile with OCI labels
- Uses FROM scratch for minimal image size (14.2 MB)
- Includes OCI labels: source, version, revision, licenses
- Exposes ports 7700 (main) and 9090 (metrics)
- Static musl binary for zero libc dependency

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 09:44:40 -04:00
jedarden
7a8742375b P2.6: Complete Phase 2 DoD — dedup, live topology, field stripping, all 14 tests pass
- merger: deduplicate hits by primary key when multiple shards map to same node
- search: use shared AppState with live topology from health checker
- search: strip _miroir_shard always, _rankingScore only when not requested
- search: include facetDistribution only when facets were requested
- credentials: add mutex guards for env-var test isolation
- Add Phase 2 DoD integration tests: shard coverage, dedup, facets, paging,
  degraded writes, error shape parity, topology shape, auth errors, reserved fields
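
The primary-key dedup in the merger can be sketched as follows. The `Hit` struct and `merge_dedup` function are hypothetical, and keeping the highest-scoring copy of a duplicate is an assumption; the commit does not state which copy the merger keeps.

```rust
use std::collections::HashMap;

/// Hypothetical search hit: a primary key plus a ranking score.
#[derive(Clone, Debug, PartialEq)]
struct Hit {
    primary_key: String,
    score: f64,
}

/// Merges per-shard hit lists, keeps one hit per primary key (the
/// best-scoring copy, by assumption), and sorts by descending score.
fn merge_dedup(shard_hits: Vec<Vec<Hit>>) -> Vec<Hit> {
    let mut best: HashMap<String, Hit> = HashMap::new();
    for hit in shard_hits.into_iter().flatten() {
        let keep_existing = best
            .get(&hit.primary_key)
            .map_or(false, |existing| existing.score >= hit.score);
        if !keep_existing {
            best.insert(hit.primary_key.clone(), hit);
        }
    }
    let mut merged: Vec<Hit> = best.into_values().collect();
    // Break score ties by primary key so the output order is deterministic.
    merged.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.primary_key.cmp(&b.primary_key))
    });
    merged
}
```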

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 09:29:43 -04:00
jedarden
d171dfb26a P12.OP4.1: Complete global-IDF preflight (dfs_query_then_fetch pattern)
Implementation complete with validation passing all acceptance criteria:

- Preflight phase: execute_preflight() gathers term frequencies from all shards
- Global IDF aggregation: GlobalIdf::from_preflight_responses() computes corpus-wide statistics
- DFS search: dfs_query_then_fetch_search() orchestrates the full pattern
- Score merge: ScoreMergeStrategy merges by globally-comparable scores
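
A rough sketch of the aggregation step, assuming each shard's preflight response carries its document count and per-term document frequencies. `ShardStats` and `global_idf` are hypothetical names, and the BM25-style smoothed IDF formula is an assumption chosen for illustration; the commit does not say which formula `GlobalIdf` uses.

```rust
use std::collections::HashMap;

/// Hypothetical per-shard preflight response.
struct ShardStats {
    doc_count: u64,
    doc_freq: HashMap<String, u64>, // term -> number of docs containing it
}

/// Sums document counts and document frequencies across all shards, then
/// computes one IDF per term from the corpus-wide totals.
fn global_idf(shards: &[ShardStats]) -> HashMap<String, f64> {
    let total_docs: u64 = shards.iter().map(|s| s.doc_count).sum();
    let mut df: HashMap<&str, u64> = HashMap::new();
    for shard in shards {
        for (term, n) in &shard.doc_freq {
            *df.entry(term.as_str()).or_insert(0) += n;
        }
    }
    df.into_iter()
        .map(|(term, n)| {
            let (n_docs, n_t) = (total_docs as f64, n as f64);
            // Assumed formula: ln(1 + (N - n_t + 0.5) / (n_t + 0.5)).
            let idf = (1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5)).ln();
            (term.to_string(), idf)
        })
        .collect()
}
```

Because every shard then scores with the same corpus-wide statistics, a term that is rare on one shard but common overall no longer gets an inflated local score, which is what makes the merged scores comparable.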

Benchmark validation (10K queries, 100K docs, 10 shards with skewed distribution):
- Average Kendall tau: 0.9817 (passes the ≥ 0.95 threshold)
- Min tau: 0.9523 (above threshold)
- Queries with tau < 0.95: 0 (0%)
- All query types pass (common, single, filtered, rare, multi-term)
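
For reference, Kendall tau over two score vectors for the same items can be computed as below. This sketch implements the tau-a variant without tie correction; the commit does not say which variant the benchmark uses or which two rankings it compares, so treat the function as illustrative.

```rust
/// Kendall tau-a between two score vectors over the same items.
/// Returns `None` for mismatched lengths or fewer than two items.
fn kendall_tau(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len();
    let mut concordant = 0i64;
    let mut discordant = 0i64;
    for i in 0..n {
        for j in (i + 1)..n {
            // A pair is concordant when both vectors order it the same way.
            let s = (x[i] - x[j]) * (y[i] - y[j]);
            if s > 0.0 {
                concordant += 1;
            } else if s < 0.0 {
                discordant += 1;
            }
        }
    }
    let pairs = (n * (n - 1) / 2) as f64;
    Some((concordant - discordant) as f64 / pairs)
}
```

A value of 1.0 means the two rankings agree on every pair, and each swapped pair lowers the value by 2 / (n(n-1)/2).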

Latency overhead: +1-2 round trips (parallelized across shards), sub-microsecond
coordinator-side aggregation per Criterion benchmarks.

Closes miroir-n6v

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 07:56:22 -04:00