miroir/notes/phase-7-observability-verification.md
jedarden 158752fe7b feat(multi-search): implement timeout enforcement and acceptance tests (§13.11)
- Add per-query and total timeout enforcement to MultiSearchExecutor
- Improve SearchResult with helper methods (ok, err, timeout, is_success)
- Fix ModeACoordinator feature-gate compilation issues
- Add comprehensive acceptance tests for multi-search:
  - 5-query batch completion
  - Slow query doesn't block fast queries (parallel execution)
  - Partial failure handling
  - Per-query timeout
  - Total timeout
  - 100-query batch completion

Closes: miroir-uhj.11
2026-05-24 01:54:20 -04:00

8.9 KiB
Raw Blame History

Phase 7 - Observability + Ops (§10) Verification Summary

Implementation Status: COMPLETE

Definition of Done Checklist

Item Status Notes
Every metric in plan §10 + §14.9 registered and scraping on port 9090 Complete All metrics implemented in middleware.rs with proper Prometheus registry
/_miroir/metrics on port 7700 returns identical data when admin-key-authenticated Complete Endpoint exists in admin_endpoints.rs, routed in admin.rs
Grafana dashboard JSON imports cleanly; all 8 core panels render Complete Dashboard validated, 8 core panels + feature-gated panels
All 12 alerts live in the shipped PrometheusRule manifest Complete All alerts present in miroir-prometheusrule.yaml
OTel trace contains one parent span per request and one child per node call Complete Implemented in otel.rs with proper span propagation
Log entries match the schema verbatim (parseable as JSON) Complete Structured JSON logging in main.rs with tracing-subscriber
ServiceMonitor picks up the metrics service Complete Configured in miroir-servicemonitor.yaml

Implementation Details

Health Endpoints

Endpoint Purpose Location
GET /health Meilisearch-compatible liveness health.rs → livenessProbe
GET /_miroir/ready Readiness (503 until covering quorum reachable) admin_endpoints.rs → readinessProbe
GET /_miroir/topology Full cluster state per plan §10 admin_endpoints.rs
GET /_miroir/metrics Admin-key-gated Prometheus metrics admin_endpoints.rs
GET /metrics Unauthenticated metrics on port 9090 middleware.rs

Prometheus Metrics (58 metrics total)

Core Metrics (18) - Plan §10

  • Request metrics: miroir_request_duration_seconds, miroir_requests_total, miroir_requests_in_flight
  • Node health: miroir_node_healthy, miroir_node_request_duration_seconds, miroir_node_errors_total
  • Shard metrics: miroir_shard_coverage, miroir_degraded_shards_total, miroir_shard_distribution
  • Task metrics: miroir_task_processing_age_seconds, miroir_tasks_total, miroir_task_registry_size
  • Scatter-gather: miroir_scatter_fan_out_size, miroir_scatter_partial_responses_total, miroir_scatter_retries_total
  • Rebalancer: miroir_rebalance_in_progress, miroir_rebalance_documents_migrated_total, miroir_rebalance_duration_seconds

Advanced Capabilities Metrics (33) - Plan §13.11§13.21

  • Multi-search (§13.11): miroir_multisearch_queries_per_batch, miroir_multisearch_batches_total, miroir_multisearch_partial_failures_total, miroir_tenant_session_pin_override_total
  • Vector search (§13.12): miroir_vector_search_over_fetched_total, miroir_vector_merge_strategy, miroir_vector_embedder_drift_total
  • CDC (§13.13): miroir_cdc_events_published_total, miroir_cdc_lag_seconds, miroir_cdc_buffer_bytes, miroir_cdc_dropped_total, miroir_cdc_events_suppressed_total
  • TTL (§13.14): miroir_ttl_documents_expired_total, miroir_ttl_sweep_duration_seconds, miroir_ttl_pending_estimate
  • Tenant affinity (§13.15): miroir_tenant_queries_total, miroir_tenant_pinned_groups, miroir_tenant_fallback_total
  • Shadow traffic (§13.16): miroir_shadow_diff_total, miroir_shadow_kendall_tau, miroir_shadow_latency_delta_seconds, miroir_shadow_errors_total
  • ILM (§13.17): miroir_rollover_events_total, miroir_rollover_active_indexes, miroir_rollover_documents_expired_total, miroir_rollover_last_action_seconds
  • Canary (§13.18): miroir_canary_runs_total, miroir_canary_latency_ms, miroir_canary_assertion_failures_total
  • Admin UI (§13.19): miroir_admin_ui_sessions_total, miroir_admin_ui_action_total, miroir_admin_ui_destructive_action_total
  • Explain (§13.20): miroir_explain_requests_total, miroir_explain_warnings_total, miroir_explain_execute_total
  • Search UI (§13.21): miroir_search_ui_sessions_total, miroir_search_ui_queries_total, miroir_search_ui_zero_hits_total, miroir_search_ui_click_through_total, miroir_search_ui_p95_ms

Resource-Pressure Metrics (7) - Plan §14.9

  • miroir_memory_pressure, miroir_cpu_throttled_seconds_total, miroir_request_queue_depth, miroir_background_queue_depth, miroir_peer_pod_count, miroir_leader, miroir_owned_shards_count

Grafana Dashboard

File: charts/miroir/dashboards/miroir-overview.json

Core Panels (8):

  1. Cluster Health - Degraded Shards, Shard Coverage, Node Health table
  2. Request Rate - Requests/sec by Path, Requests/sec by Status
  3. Request Latency - p50/p95/p99
  4. Node Latency - Per-Node p99, Node Error Rate
  5. Search Overhead - Scatter Fan-Out, Partial Responses/Retries, Requests in Flight
  6. Task Lag - Processing Age, Tasks by Status, Registry Size
  7. Shard Distribution - Shards per Node, Shard Imbalance
  8. Rebalance Activity - In Progress, Documents Migrated, Duration

Feature-Gated Panels (collapsible rows):

  • Resharding (§13.1)
  • Multi-Search (§13.11)
  • Anti-Entropy (§13.8)
  • Settings Broadcast (§13.5)
  • CDC (§13.13)
  • Canary Tests (§13.18)
  • Search UI (§13.21)

PrometheusRule Alerts (12)

Availability Alerts (7):

  1. MiroirDegradedShards - Degraded shard count > 0 for 2m
  2. MiroirNodeDown - Node unhealthy for 5m
  3. MiroirHighSearchLatency - p95 search latency > 2s for 5m
  4. MiroirTaskStuck - Task processing age > 1h for 10m
  5. MiroirRebalanceStuck - Rebalance in progress for > 2h
  6. MiroirSettingsDivergence - Settings divergence without repair
  7. MiroirAntientropyMismatch - Persistent replica divergence across 3 passes

Resource-Pressure Alerts (5):

  1. MiroirMemoryPressure - Memory pressure >= 2 for 5m
  2. MiroirRequestQueueBacklog - Queue depth > 500 for 2m
  3. MiroirBackgroundJobBacklog - Background queue > 100 for 10m
  4. MiroirPeerDiscoveryGap - Peer count mismatch for 2m
  5. MiroirNoLeader - No leader elected for 1m

Ports Configuration

Port Purpose Access Path
7700 Main API External + admin-key /_miroir/metrics
9090 Metrics Pod-internal only /metrics

Test Coverage

P7.1 Core Metrics Tests (tests/p7_1_core_metrics.rs):

  • test_all_core_metrics_registered
  • test_scatter_fan_out_metric_records_correctly
  • test_node_health_metrics_have_correct_labels
  • test_node_request_duration_has_operation_label
  • test_task_metrics_have_status_label

P7.5 Structured Logging Tests (tests/p7_5_structured_logging.rs):

  • JSON logs parseable by jq
  • Request ID format and correlation
  • No PII in logs (API keys, query strings, document content)
  • Log volume (2 INFO entries per search request)
  • Request ID response header propagation
  • Request ID appears in all log lines within request

Files Modified/Verified

Core Implementation

  • crates/miroir-proxy/src/middleware.rs - Metrics registry and middleware
  • crates/miroir-proxy/src/otel.rs - OpenTelemetry tracing
  • crates/miroir-proxy/src/main.rs - Structured logging initialization
  • crates/miroir-proxy/src/routes/health.rs - Health endpoint
  • crates/miroir-proxy/src/routes/admin_endpoints.rs - Admin endpoints (topology, ready, metrics)
  • crates/miroir-proxy/src/routes/admin.rs - Admin router wiring

Kubernetes Manifests

  • charts/miroir/templates/miroir-deployment.yaml - Health probes, metrics port
  • charts/miroir/templates/miroir-service.yaml - HTTP and metrics ports
  • charts/miroir/templates/miroir-servicemonitor.yaml - Prometheus scraping
  • charts/miroir/templates/miroir-prometheusrule.yaml - Alerting rules
  • charts/miroir/templates/miroir-grafana-dashboard.yaml - Dashboard ConfigMap

Dashboard

  • charts/miroir/dashboards/miroir-overview.json - Grafana dashboard definition

Tests

  • crates/miroir-proxy/tests/p7_1_core_metrics.rs - Metrics acceptance tests
  • crates/miroir-proxy/tests/p7_5_structured_logging.rs - Logging acceptance tests

Verification Commands

# Run metrics tests
cargo test --package miroir-proxy --test p7_1_core_metrics

# Run logging tests
cargo test --package miroir-proxy --test p7_5_structured_logging

# Validate dashboard JSON
python3 -c "import json; json.load(open('charts/miroir/dashboards/miroir-overview.json'))"

# List all alerts
grep -E 'alert: Miroir' charts/miroir/templates/miroir-prometheusrule.yaml

# Verify ServiceMonitor structure
grep -E '^apiVersion:|^kind:|selector:|endpoints:|port:|path:' charts/miroir/templates/miroir-servicemonitor.yaml

Notes

  • All metrics are prefixed with miroir_ for easy identification
  • Feature-gated metrics (§13.11§13.21) are only registered when the corresponding feature is enabled
  • Resource-pressure metrics (§14.9) are always present
  • Structured logging uses tracing-subscriber with JSON formatter
  • Request IDs are 8-character hex values, propagated via X-Request-Id header
  • OTel tracing is disabled by default, enabled via tracing.enabled config