Convert all unstructured format-string logging (tracing::error!("msg: {}", var))
to structured field format (tracing::error!(error = %e, "msg")) across route
handlers and key rotation. Strip response text bodies from error messages in
scoped key mint/revoke paths so that sensitive data (key material) cannot
appear in logs.
The core structured JSON logging infrastructure (tracing-subscriber JSON layer,
request ID generation via UUIDv7, pod_id from POD_NAME env, telemetry middleware
span with request_id/pod_id/method/path) was already in place from prior work.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Added record_failure_admin_login to RedisTaskStore for proper consecutive failed attempt tracking
- Local rate limiter integration in admin_login flow (backend: local)
- record_failure calls on failed login (wrong admin_key) for both backends
- Reset on successful login for both backends
- Helm schema constraint enforces redis backend when replicas > 1
Acceptance:
- 11 login attempts in 60s from same IP → 11th returns 429
- 5 failed attempts → backoff doubles per attempt (10m, 20m, 40m, ...) up to 24h cap
- Successful login resets both rate limit counter and backoff state
- Multi-pod deployments use shared Redis state for rate limiting
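The backoff schedule in the acceptance list can be sketched as follows. This is a minimal illustration, not the code in this commit: the function name `lockout_for` and the constants are hypothetical, and it assumes the first 10m lockout applies at the 5th consecutive failure.

```rust
use std::time::Duration;

// Hypothetical constants mirroring the acceptance criteria above.
const FAILURE_THRESHOLD: u32 = 5;
const BASE_LOCKOUT: Duration = Duration::from_secs(10 * 60);
const MAX_LOCKOUT: Duration = Duration::from_secs(24 * 60 * 60);

/// Returns the lockout for a given number of consecutive failed logins,
/// or None while the count is still below the threshold.
/// Assumption: the 5th failure yields 10m, and each later failure doubles it.
pub fn lockout_for(consecutive_failures: u32) -> Option<Duration> {
    if consecutive_failures < FAILURE_THRESHOLD {
        return None;
    }
    let doublings = consecutive_failures - FAILURE_THRESHOLD;
    // 2^doublings, saturating instead of overflowing for very large counts.
    let factor = 1u64.checked_shl(doublings).unwrap_or(u64::MAX);
    let secs = BASE_LOCKOUT.as_secs().saturating_mul(factor);
    Some(Duration::from_secs(secs.min(MAX_LOCKOUT.as_secs())))
}
```

A successful login would reset the failure count to zero, so the next lookup returns None again.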
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements plan §13.21 leader-based rotation of per-index scoped search
keys with zero-403 overlap guarantees:
- Leader lease (Redis, Mode B §14.5) serializes rotation across pods
- Per-pod beacon with 60s TTL refreshed on every search request
- Revocation safety gate: leader verifies that all live peers have observed
the new generation before issuing DELETE /keys/{previous_uid}
- Drain wait (default 120s) for stragglers before revocation
- Auto-rotation trigger: scoped_key_rotate_before_expiry_days (30d)
before scoped_key_max_age_days (60d)
- Manual trigger: POST /_miroir/ui/search/{index}/rotate-scoped-key
with force:true to bypass timing gate
- Config validation rejects rotate_before >= max_age at startup
- Helm _helpers.tpl render-time guard against rotation loop
- values.schema.json schema validation for scoped key config fields
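Two of the checks listed above can be sketched in a few lines. This is an illustrative sketch only: the function names and the beacon map shape (pod id to last observed key generation) are assumptions, and the map is assumed to contain only pods whose 60s beacon has not expired.

```rust
use std::collections::HashMap;

/// Startup check: the rotation window must open strictly before the key
/// reaches its maximum age, otherwise every check would trigger a rotation.
pub fn validate_rotation_window(
    rotate_before_days: u32,
    max_age_days: u32,
) -> Result<(), String> {
    if rotate_before_days >= max_age_days {
        return Err(format!(
            "scoped_key_rotate_before_expiry_days ({}) must be < scoped_key_max_age_days ({})",
            rotate_before_days, max_age_days
        ));
    }
    Ok(())
}

/// Revocation safety gate: the previous key may be deleted only once every
/// live peer has reported the new generation.
/// Assumption: an empty map is treated as unsafe, since the leader should
/// at least see its own beacon.
pub fn safe_to_revoke(live_beacons: &HashMap<String, u64>, new_generation: u64) -> bool {
    !live_beacons.is_empty() && live_beacons.values().all(|&g| g >= new_generation)
}
```

If the gate returns false, the leader would keep waiting (up to the drain window) rather than revoke.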
Also includes session management routes (admin login/logout/session,
search UI JWT session) and auth middleware CSRF protection needed
by the admin-gated rotation endpoint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enable span context in JSON log output so request_id and pod_id appear on
every log line. Downgrade the search-handler log to DEBUG so each request
emits at most one INFO line. Fix PII leaks: hash API key identifiers before logging,
remove search terms from node error messages. Cast duration_ms from u128 to
u64 for clean JSON number serialization.
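For context, `Duration::as_millis()` returns a `u128`, which is why a conversion is needed before the value is recorded as a JSON number. A minimal sketch, using a saturating conversion rather than a plain `as` cast (the helper name `duration_ms` is hypothetical, and the commit itself may simply use `as u64`):

```rust
use std::convert::TryFrom;
use std::time::Duration;

/// Converts an elapsed duration to whole milliseconds as u64.
/// Saturates at u64::MAX instead of silently truncating.
pub fn duration_ms(elapsed: Duration) -> u64 {
    u64::try_from(elapsed.as_millis()).unwrap_or(u64::MAX)
}
```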
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Logs a warning with path and error when cookie unseal fails, helping
operators diagnose cross-pod ADMIN_SESSION_SEAL_KEY mismatches in HA
deployments (acceptance criterion 2).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement admin session cookie sealing per plan §9 and §13.19:
- SealKey loaded from ADMIN_SESSION_SEAL_KEY env (base64-encoded 32 bytes),
with random fallback and startup warning for multi-pod deployments
- Cookie sealed via XChaCha20-Poly1305 AEAD (confidentiality + integrity)
- Wire format: base64([24-byte nonce][ciphertext][16-byte tag])
- AuthState initialized with revoked_sessions DashMap + revoked counter
- miroir_admin_session_key_generated gauge set at startup (1=random, 0=env)
- Revocation cache checked on every cookie-authenticated admin request
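The wire format above implies a fixed layout once the cookie has been base64-decoded. A minimal sketch of splitting the decoded bytes, assuming decoding has already happened and leaving the actual AEAD open call out; `SealedParts` and `split_sealed` are hypothetical names:

```rust
pub const NONCE_LEN: usize = 24; // XChaCha20 nonce
pub const TAG_LEN: usize = 16; // Poly1305 tag

/// Borrowed views into a decoded cookie: [nonce][ciphertext][tag].
#[derive(Debug, PartialEq)]
pub struct SealedParts<'a> {
    pub nonce: &'a [u8],
    pub ciphertext: &'a [u8],
    pub tag: &'a [u8],
}

/// Splits decoded cookie bytes into their three parts.
/// Returns None when the input is too short to hold a nonce and a tag.
pub fn split_sealed(raw: &[u8]) -> Option<SealedParts<'_>> {
    if raw.len() < NONCE_LEN + TAG_LEN {
        return None;
    }
    let (nonce, rest) = raw.split_at(NONCE_LEN);
    let (ciphertext, tag) = rest.split_at(rest.len() - TAG_LEN);
    Some(SealedParts { nonce, ciphertext, tag })
}
```

Rejecting short inputs before touching the cipher keeps malformed cookies on a cheap path.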
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add `miroir-ctl key rotate-node-master` command implementing plan §9
4-step zero-downtime rotation: create new admin-scoped key on all
Meilisearch nodes, print K8s Secret update instructions, wait for
rolling restart confirmation, delete old key. Supports --dry-run,
node auto-discovery via topology API, and rollback on step 1 failure.
Add `address` field to topology API NodeInfo for CLI node discovery.
Add runbooks for both nodeMasterKey (zero-downtime) and startup master
key (maintenance window required) rotation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Expand eso-external-secret.yaml with the full secret inventory (plan §9),
documenting all 8 keys with consumer, rotation strategy, and env var mapping.
Wire ADMIN_SESSION_SEAL_KEY, SEARCH_UI_JWT_SECRET,
SEARCH_UI_JWT_SECRET_PREVIOUS, and SEARCH_UI_SHARED_KEY into the Helm
deployment template as optional secretKeyRef env vars. Add startup
validation that refuses to start if search_ui is enabled but
SEARCH_UI_JWT_SECRET is missing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Matches the manifest already in declarative-config (commit 3d72934).
OCI Helm chart at ghcr.io/jedarden/charts/miroir, automated sync
with prune + selfHeal + ServerSideApply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace sprig regex template expressions with a shell script approach for
Kaniko destination tags, matching the pattern in miroir-ci.yaml. Pin Kaniko
image to v1.23.0-debug. Fix serviceAccountName from argo-runner to argo-workflow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Builder stage compiles both miroir-proxy and miroir-ctl as static musl
binaries, strips them, and copies into a scratch image. Updated
.cargo/config.toml to use target-feature=+crt-static instead of
incorrect CC/CFLAGS. Added .dockerignore to exclude non-essential files.
Image: 4.0 MB compressed (scratch base, single static binary).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add `tracing` feature flag with optional OTel deps to miroir-proxy
- Fix tracing subscriber initialization (use .init() instead of set_global_default)
- Add pod_id as global span field for structured logging
- Improve DF lookup error messages in preflight handler
- Add build artifacts to .gitignore
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add collapsed Resharding (§13.1) feature-gated row with phase gauge,
in-progress stat, and backfill rate panel. Fix overlapping y=74 on
Anti-Entropy and Settings Broadcast rows by shifting subsequent rows.
Sync charts/miroir/dashboards/ copy with root dashboard.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- PVC template conditional on cdc.buffer.primary=="pvc" or cdc.buffer.overflow=="pvc"
- Redis deployment conditional on redis.enabled with auth via auto-generated or ESO secret
- ESO ExternalSecret example pulling from kv/search/miroir via openbao-backend ClusterSecretStore
- Deployment mounts CDC PVC at /data/cdc and injects Redis password when enabled
- ConfigMap generates taskStore.url and cdc.buffer.pvc_path from helpers
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Simplify the values.schema.json if/then patterns for rules 3-4 (replace the
verbose allOf with a direct enum constraint in the then branch),
drop unsupported errorMessage fields, and add run-tests.sh for
automated CI validation of all 12 schema/template test cases.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Register 42 advanced-capabilities metrics gated by config.*.enabled flags.
Each metric family is an Option<T>: None when disabled, registered only when
the corresponding feature flag is on. Includes accessor methods (no-op when
disabled), clone support, and three test scenarios: all-on, all-off, and
noop accessors.
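The Option-gated pattern can be illustrated with a std-only sketch. `Counter` here is a stand-in for a Prometheus counter, and `AdvancedMetrics` with its single `resharding_backfill` family is a hypothetical, reduced version of the real registry:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Stand-in for a Prometheus counter; clones share the same value.
#[derive(Clone, Default)]
pub struct Counter(Arc<AtomicU64>);

impl Counter {
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// One metric family, present only when its feature flag is on.
#[derive(Clone)]
pub struct AdvancedMetrics {
    resharding_backfill: Option<Counter>,
}

impl AdvancedMetrics {
    pub fn new(resharding_enabled: bool) -> Self {
        let resharding_backfill = if resharding_enabled {
            Some(Counter::default())
        } else {
            None
        };
        AdvancedMetrics { resharding_backfill }
    }

    /// No-op accessor: callers never need to check the flag themselves.
    pub fn inc_resharding_backfill(&self) {
        if let Some(counter) = &self.resharding_backfill {
            counter.inc();
        }
    }

    /// Current value, or None when the family is disabled.
    pub fn resharding_backfill_total(&self) -> Option<u64> {
        self.resharding_backfill.as_ref().map(|c| c.get())
    }
}
```

This mirrors the three test scenarios named above: an enabled family counts, a disabled family stays None, and the accessor is safe to call in both cases.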
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full chart structure with 14 templates, values.schema.json, and NOTES.txt.
Dev defaults: 1 replica, 64 shards, RF=1, RG=1, sqlite task store, HPA off.
Production upgrade path documented in NOTES.txt.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ServiceMonitor scrapes the metrics port (9090) at 30s intervals.
PrometheusRule ships all 12 alerts: 7 availability (degraded shards,
node down, high latency, stuck tasks, stuck rebalance, settings
divergence, anti-entropy mismatch) + 5 resource pressure (memory,
request queue, background queue, peer discovery, no leader).
Both gated behind serviceMonitor.enabled / prometheusRule.enabled
(defaults: false — requires prometheus-operator in cluster).
Also adds metrics port to the miroir Service so ServiceMonitor can
select it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Register Requests, Node health, Shards, Tasks, Scatter-gather, and
Rebalancer metrics on :9090/metrics (pod-internal scrape) and
/_miroir/metrics (admin-key gated). Node/shard metrics use GaugeVec/
CounterVec with bounded-cardinality labels (node_id, operation,
error_type). Search handler records scatter_fan_out_size and
partial_responses. All 111 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add OTel distributed tracing support with zero overhead when disabled.
Configuration (plan §10):
- tracing.enabled: false (default, zero overhead)
- tracing.endpoint: "http://tempo.monitoring.svc:4317"
- tracing.service_name: "miroir"
- tracing.sample_rate: 0.1 (head-based sampling)
Span hierarchy:
- Parent: inbound request (POST /indexes/:index/search)
- Child: scatter plan construction
- Parallel children: one per node in covering set
- Child: merge operation
Resource attributes: service.name, service.version, host.name
When disabled (tracing.enabled: false), no OTel library calls are made.
Shutdown handler flushes pending traces before exit.
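To illustrate what head-based sampling at tracing.sample_rate: 0.1 means, here is a std-only sketch of a ratio decision derived from the trace ID. It is not the OTel SDK's sampler, and `should_sample` is a hypothetical name. The point is that the decision is made once at the root span and is deterministic per trace ID.

```rust
/// Decides at the start of a trace whether to record it.
/// The low 64 bits of the trace ID are compared against a threshold
/// derived from the sample rate, so every span of the same trace
/// reaches the same decision.
pub fn should_sample(trace_id: u128, sample_rate: f64) -> bool {
    if sample_rate >= 1.0 {
        return true;
    }
    if sample_rate <= 0.0 {
        return false;
    }
    let low_bits = trace_id as u64; // intentional truncation to the low 64 bits
    let threshold = (sample_rate * u64::MAX as f64) as u64;
    low_bits < threshold
}
```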
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Uses FROM scratch for minimal image size (14.2 MB)
- Includes OCI labels: source, version, revision, licenses
- Exposes ports 7700 (main) and 9090 (metrics)
- Static musl binary for zero libc dependency
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- merger: deduplicate hits by primary key when multiple shards map to same node
- search: use shared AppState with live topology from health checker
- search: strip _miroir_shard always, _rankingScore only when not requested
- search: include facetDistribution only when facets were requested
- credentials: add mutex guards for env-var test isolation
- Add Phase 2 DoD integration tests: shard coverage, dedup, facets, paging,
degraded writes, error shape parity, topology shape, auth errors, reserved fields
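The merger fix in the first bullet amounts to keeping one hit per primary key. A minimal sketch, assuming hits arrive already ordered by rank so that the first occurrence is the one to keep; the `Hit` struct is a hypothetical simplification:

```rust
use std::collections::HashSet;

/// Simplified search hit; the real type carries the full document.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub primary_key: String,
    pub score: f64,
}

/// Removes later duplicates of the same primary key, preserving order.
pub fn dedup_by_primary_key(hits: Vec<Hit>) -> Vec<Hit> {
    let mut seen = HashSet::new();
    hits.into_iter()
        // HashSet::insert returns false when the key was already present.
        .filter(|hit| seen.insert(hit.primary_key.clone()))
        .collect()
}
```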
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fix middleware module export from lib.rs so the crate compiles as a library.
Remove unused settings mock assertions from test_create_index_broadcasts_to_all_nodes
(the settings injection flow is already covered by test_miroir_shard_in_filterable_attributes).
All 11 acceptance tests pass:
- POST /indexes broadcasts to all nodes with rollback on failure
- _miroir_shard in filterableAttributes after creation
- GET /indexes/{uid}/stats returns the logical doc count (physical count divided by RG*RF)
- Settings broadcast sequential with rollback
- DELETE /indexes broadcasts to all nodes
- PATCH /indexes/{uid} snapshot and rollback
- /keys CRUD broadcasts with all-or-nothing semantics
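The all-or-nothing semantics tested above can be sketched generically. This is an assumption-laden illustration, not the handler code: `broadcast_with_rollback` is a hypothetical helper, and the per-node `apply` and `undo` steps are passed in as closures instead of real HTTP calls.

```rust
/// Applies an operation to each node in order. On the first failure,
/// undoes the already-applied nodes in reverse order and returns the error.
pub fn broadcast_with_rollback<N, A, U>(
    nodes: &[N],
    mut apply: A,
    mut undo: U,
) -> Result<(), String>
where
    A: FnMut(&N) -> Result<(), String>,
    U: FnMut(&N),
{
    for (i, node) in nodes.iter().enumerate() {
        if let Err(err) = apply(node) {
            // Roll back only the nodes that succeeded before the failure.
            for done in nodes[..i].iter().rev() {
                undo(done);
            }
            return Err(err);
        }
    }
    Ok(())
}
```

For index creation, `undo` would correspond to deleting the index on a node where creation had already succeeded.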
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements POST/PUT /indexes/{uid}/documents and DELETE /indexes/{uid}/documents:
- Primary key extraction on hot path with 400 miroir_primary_key_required if missing
- _miroir_shard injection into every document before forwarding to nodes
- Rejection of _miroir_shard in client-submitted docs (400 miroir_reserved_field)
- Two-rule quorum: per-group floor(RF/2)+1 ACKs, success if ≥1 group meets quorum
- X-Miroir-Degraded header when any group misses quorum
- 503 miroir_no_quorum only when NO group meets quorum
- Per-batch grouping by target shard for efficient HTTP fan-out
- DELETE by IDs routes each ID independently to its shard
- DELETE by filter broadcasts to all nodes
Acceptance tests pass:
- Primary key validation before any writes
- Reserved field rejection
- Shard distribution uniformity (17-26 shards/node with 64 shards/3 nodes)
- Quorum calculation: floor(RF/2)+1
- Meilisearch-compatible error shape
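The two-rule quorum described above can be written down directly. A minimal sketch, assuming the caller has already counted ACKs per replica group; `WriteOutcome` and `evaluate_write` are hypothetical names:

```rust
/// Per-group quorum: floor(RF/2) + 1 acknowledgements.
pub fn quorum(rf: u32) -> u32 {
    rf / 2 + 1
}

#[derive(Debug, PartialEq, Eq)]
pub enum WriteOutcome {
    /// Every group met quorum.
    Ok,
    /// At least one group met quorum, at least one did not
    /// (success with the X-Miroir-Degraded header).
    Degraded,
    /// No group met quorum (503 miroir_no_quorum).
    NoQuorum,
}

/// Evaluates a write given the ACK count of each replica group.
pub fn evaluate_write(acks_per_group: &[u32], rf: u32) -> WriteOutcome {
    let needed = quorum(rf);
    let met = acks_per_group.iter().filter(|&&acks| acks >= needed).count();
    if met == 0 {
        WriteOutcome::NoQuorum
    } else if met < acks_per_group.len() {
        WriteOutcome::Degraded
    } else {
        WriteOutcome::Ok
    }
}
```

With RF=3 the per-group quorum is 2, so ACK counts of [2, 1] succeed in degraded mode while [1, 1] fail.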
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Remove unused ShardHitPage import from p23_search_read_path.rs
- All 10 acceptance tests pass:
  - Unique-keyword search returns exactly 1 hit (RRF deduplication)
  - Facet counts sum correctly across shards
  - Paging with no dupes/gaps (5 pages of 10 = 50 unique results)
  - Node down with RF=2: search still covers all shards
  - Group down with fallback: uses other group, not degraded
  - X-Miroir-Degraded header includes actual shard IDs
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the search read path with scatter-gather + merge + group selection:
1. Group-unavailability fallback: When a shard has no available replica
   in the primary group, the Fallback policy tries other replica groups
   before failing. This provides full results (not degraded) when an
   alternate group is healthy.
2. X-Miroir-Degraded header: Now includes actual shard IDs in the format
   "X-Miroir-Degraded: shards=3,7,11" instead of just "partial".
3. Acceptance tests for P2.3:
   - Unique-keyword search deduplicates correctly (RRF)
   - Facet counts sum across shards
   - Paging with no dupes/gaps
   - Node down with RF=2 still covers all shards
   - Group down falls back to other group (not degraded)
   - Degraded header includes actual shard IDs
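The header value format from item 2 can be sketched as a small helper. The function name is hypothetical, and sorting plus de-duplicating the IDs is an assumption made here for stable output:

```rust
/// Builds the X-Miroir-Degraded header value, e.g. "shards=3,7,11".
/// Returns None when every shard was covered, in which case no header is set.
pub fn degraded_header_value(uncovered_shards: &[u32]) -> Option<String> {
    if uncovered_shards.is_empty() {
        return None;
    }
    let mut ids = uncovered_shards.to_vec();
    ids.sort_unstable();
    ids.dedup();
    let joined = ids
        .iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(",");
    Some(format!("shards={}", joined))
}
```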
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Load Config (file + env + CLI args overlay) via MiroirConfig::load()
- Initialize tracing with JSON-to-stdout format (plan §10)
- Start two axum listeners: :7700 (client API) + :9090 (metrics, unauthenticated)
- Signal handlers for graceful shutdown (SIGTERM → drain → exit)
- GET /health returns {"status":"available"} immediately (Meilisearch-compatible)
- GET /version returns Meilisearch version from healthy node (60s TTL cache)
- GET /_miroir/ready returns 503 during startup, 200 once covering quorum reachable
- GET /_miroir/topology returns cluster state per plan §10 JSON shape
- GET /_miroir/shards returns shard → node mapping table
- GET /_miroir/metrics returns admin-key-gated Prometheus metrics
- Background health checker promotes nodes to Active when reachable
- UnifiedState bundles AuthState, Metrics, and admin_endpoints::AppState
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per plan §10, GET /_miroir/metrics is admin-key-gated so it can be
exposed outside the cluster. It was incorrectly marked as dispatch-exempt
with the comment "admin-key-optional"; it now requires admin authentication.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
RRF merge (k=60) benchmarked against ground truth with 10K queries on
skewed 10-shard corpus (93% on shard 1). Result: Kendall τ = 0.1369
(95% CI [0.1339, 0.1399]), far below the 0.95 threshold. 9,998 of 10,000
queries fell below τ=0.95, confirming RRF alone is insufficient for
cross-shard ranking quality with skewed distributions.
DFS preflight (already implemented) achieves τ = 0.9818, passing the
threshold. Add full 10K-query DFS comparison report and fix paths in
experiment.json.
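For reference, reciprocal rank fusion scores each document as the sum of 1/(k + rank) over the per-shard rankings, with 1-based ranks. A minimal sketch of the merge being benchmarked, with k passed in (60 above); `rrf_merge` is a hypothetical name, and breaking ties by document ID is an assumption:

```rust
use std::collections::HashMap;

/// Merges per-shard rankings with reciprocal rank fusion.
/// Each inner Vec lists document IDs from best to worst for one shard.
pub fn rrf_merge(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc_id) in ranking.iter().enumerate() {
            let rank = (i + 1) as f64; // ranks are 1-based
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut merged: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest fused score first; ties broken by document ID.
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged
}
```

Because the score depends only on per-shard rank, a document ranked first on a tiny shard scores the same as one ranked first on the shard holding most of the corpus, which is consistent with the low τ reported for the skewed distribution.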
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add axum feature flag to miroir-core with IntoResponse impl for MeilisearchError
- Refactor auth middleware to use MeilisearchError::new() + MiroirCode instead of
manual JSON construction, ensuring consistent error shape across all auth errors
- Add proxy error.rs re-export alias for ApiError
- Implement full telemetry middleware with Prometheus metrics (request duration,
in-flight gauge, scatter counters, node health)
- Reorder middleware layers: auth before telemetry so 401s are also instrumented
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the API error response format from plan §5:
- ErrorType enum: invalid_request, auth, internal, system
- MiroirCode enum with all 10 miroir_* codes and their HTTP status mappings
- MeilisearchError struct with Meilisearch-compatible JSON shape
- Forwarding support for Meilisearch-native node errors (verbatim passthrough)
- Doc links pointing to docs/errors.md#<code>
- 21 unit tests covering every code's JSON shape, HTTP status, and forwarding
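A reduced sketch of the code-to-status mapping, covering only three of the ten codes, namely the ones whose statuses are stated in the document-write commit (400 miroir_primary_key_required, 400 miroir_reserved_field, 503 miroir_no_quorum). The variant and method names are assumptions, not the actual definitions:

```rust
/// Subset of the miroir_* error codes; the real enum has ten variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MiroirCode {
    PrimaryKeyRequired,
    ReservedField,
    NoQuorum,
}

impl MiroirCode {
    /// The string placed in the JSON "code" field.
    pub fn as_str(self) -> &'static str {
        match self {
            MiroirCode::PrimaryKeyRequired => "miroir_primary_key_required",
            MiroirCode::ReservedField => "miroir_reserved_field",
            MiroirCode::NoQuorum => "miroir_no_quorum",
        }
    }

    /// HTTP status returned with this code.
    pub fn http_status(self) -> u16 {
        match self {
            MiroirCode::PrimaryKeyRequired | MiroirCode::ReservedField => 400,
            MiroirCode::NoQuorum => 503,
        }
    }

    /// Doc link in the docs/errors.md#<code> form.
    pub fn doc_link(self) -> String {
        format!("docs/errors.md#{}", self.as_str())
    }
}
```

Keeping status and link derivation on the enum means a new code cannot be added without the compiler forcing a status to be chosen for it.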
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>