- Add per-query and total timeout enforcement to MultiSearchExecutor - Improve SearchResult with helper methods (ok, err, timeout, is_success) - Fix ModeACoordinator feature-gate compilation issues - Add comprehensive acceptance tests for multi-search: - 5-query batch completion - Slow query doesn't block fast queries (parallel execution) - Partial failure handling - Per-query timeout - Total timeout - 100-query batch completion Closes: miroir-uhj.11
8.9 KiB
8.9 KiB
Phase 7 - Observability + Ops (§10) Verification Summary
Implementation Status: COMPLETE ✅
Definition of Done Checklist
| Item | Status | Notes |
|---|---|---|
| ✅ Every metric in plan §10 + §14.9 registered and scraping on port 9090 | Complete | All metrics implemented in middleware.rs with proper Prometheus registry |
✅ /_miroir/metrics on port 7700 returns identical data when admin-key-authenticated |
Complete | Endpoint exists in admin_endpoints.rs, routed in admin.rs |
| ✅ Grafana dashboard JSON imports cleanly; all 8 core panels render | Complete | Dashboard validated, 8 core panels + feature-gated panels |
| ✅ All 12 alerts live in the shipped PrometheusRule manifest | Complete | All alerts present in miroir-prometheusrule.yaml |
| ✅ OTel trace contains one parent span per request and one child per node call | Complete | Implemented in otel.rs with proper span propagation |
| ✅ Log entries match the schema verbatim (parseable as JSON) | Complete | Structured JSON logging in main.rs with tracing-subscriber |
| ✅ ServiceMonitor picks up the metrics service | Complete | Configured in miroir-servicemonitor.yaml |
Implementation Details
Health Endpoints
| Endpoint | Purpose | Location |
|---|---|---|
GET /health |
Meilisearch-compatible liveness | health.rs → livenessProbe |
GET /_miroir/ready |
Readiness (503 until covering quorum reachable) | admin_endpoints.rs → readinessProbe |
GET /_miroir/topology |
Full cluster state per plan §10 | admin_endpoints.rs |
GET /_miroir/metrics |
Admin-key-gated Prometheus metrics | admin_endpoints.rs |
GET /metrics |
Unauthenticated metrics on port 9090 | middleware.rs |
Prometheus Metrics (58 metrics total)
Core Metrics (18) - Plan §10
- Request metrics:
miroir_request_duration_seconds,miroir_requests_total,miroir_requests_in_flight - Node health:
miroir_node_healthy,miroir_node_request_duration_seconds,miroir_node_errors_total - Shard metrics:
miroir_shard_coverage,miroir_degraded_shards_total,miroir_shard_distribution - Task metrics:
miroir_task_processing_age_seconds,miroir_tasks_total,miroir_task_registry_size - Scatter-gather:
miroir_scatter_fan_out_size,miroir_scatter_partial_responses_total,miroir_scatter_retries_total - Rebalancer:
miroir_rebalance_in_progress,miroir_rebalance_documents_migrated_total,miroir_rebalance_duration_seconds
Advanced Capabilities Metrics (33) - Plan §13.11–§13.21
- Multi-search (§13.11):
miroir_multisearch_queries_per_batch,miroir_multisearch_batches_total,miroir_multisearch_partial_failures_total,miroir_tenant_session_pin_override_total - Vector search (§13.12):
miroir_vector_search_over_fetched_total,miroir_vector_merge_strategy,miroir_vector_embedder_drift_total - CDC (§13.13):
miroir_cdc_events_published_total,miroir_cdc_lag_seconds,miroir_cdc_buffer_bytes,miroir_cdc_dropped_total,miroir_cdc_events_suppressed_total - TTL (§13.14):
miroir_ttl_documents_expired_total,miroir_ttl_sweep_duration_seconds,miroir_ttl_pending_estimate - Tenant affinity (§13.15):
miroir_tenant_queries_total,miroir_tenant_pinned_groups,miroir_tenant_fallback_total - Shadow traffic (§13.16):
miroir_shadow_diff_total,miroir_shadow_kendall_tau,miroir_shadow_latency_delta_seconds,miroir_shadow_errors_total - ILM (§13.17):
miroir_rollover_events_total,miroir_rollover_active_indexes,miroir_rollover_documents_expired_total,miroir_rollover_last_action_seconds - Canary (§13.18):
miroir_canary_runs_total,miroir_canary_latency_ms,miroir_canary_assertion_failures_total - Admin UI (§13.19):
miroir_admin_ui_sessions_total,miroir_admin_ui_action_total,miroir_admin_ui_destructive_action_total - Explain (§13.20):
miroir_explain_requests_total,miroir_explain_warnings_total,miroir_explain_execute_total - Search UI (§13.21):
miroir_search_ui_sessions_total,miroir_search_ui_queries_total,miroir_search_ui_zero_hits_total,miroir_search_ui_click_through_total,miroir_search_ui_p95_ms
Resource-Pressure Metrics (7) - Plan §14.9
miroir_memory_pressure,miroir_cpu_throttled_seconds_total,miroir_request_queue_depth,miroir_background_queue_depth,miroir_peer_pod_count,miroir_leader,miroir_owned_shards_count
Grafana Dashboard
File: charts/miroir/dashboards/miroir-overview.json
Core Panels (8):
- Cluster Health - Degraded Shards, Shard Coverage, Node Health table
- Request Rate - Requests/sec by Path, Requests/sec by Status
- Request Latency - p50/p95/p99
- Node Latency - Per-Node p99, Node Error Rate
- Search Overhead - Scatter Fan-Out, Partial Responses/Retries, Requests in Flight
- Task Lag - Processing Age, Tasks by Status, Registry Size
- Shard Distribution - Shards per Node, Shard Imbalance
- Rebalance Activity - In Progress, Documents Migrated, Duration
Feature-Gated Panels (collapsible rows):
- Resharding (§13.1)
- Multi-Search (§13.11)
- Anti-Entropy (§13.8)
- Settings Broadcast (§13.5)
- CDC (§13.13)
- Canary Tests (§13.18)
- Search UI (§13.21)
PrometheusRule Alerts (12)
Availability Alerts (7):
MiroirDegradedShards- Degraded shard count > 0 for 2mMiroirNodeDown- Node unhealthy for 5mMiroirHighSearchLatency- p95 search latency > 2s for 5mMiroirTaskStuck- Task processing age > 1h for 10mMiroirRebalanceStuck- Rebalance in progress for > 2hMiroirSettingsDivergence- Settings divergence without repairMiroirAntientropyMismatch- Persistent replica divergence across 3 passes
Resource-Pressure Alerts (5):
MiroirMemoryPressure- Memory pressure >= 2 for 5mMiroirRequestQueueBacklog- Queue depth > 500 for 2mMiroirBackgroundJobBacklog- Background queue > 100 for 10mMiroirPeerDiscoveryGap- Peer count mismatch for 2mMiroirNoLeader- No leader elected for 1m
Ports Configuration
| Port | Purpose | Access | Path |
|---|---|---|---|
| 7700 | Main API | External + admin-key | /_miroir/metrics |
| 9090 | Metrics | Pod-internal only | /metrics |
Test Coverage
P7.1 Core Metrics Tests (tests/p7_1_core_metrics.rs):
- ✅ test_all_core_metrics_registered
- ✅ test_scatter_fan_out_metric_records_correctly
- ✅ test_node_health_metrics_have_correct_labels
- ✅ test_node_request_duration_has_operation_label
- ✅ test_task_metrics_have_status_label
P7.5 Structured Logging Tests (tests/p7_5_structured_logging.rs):
- ✅ JSON logs parseable by jq
- ✅ Request ID format and correlation
- ✅ No PII in logs (API keys, query strings, document content)
- ✅ Log volume (2 INFO entries per search request)
- ✅ Request ID response header propagation
- ✅ Request ID appears in all log lines within request
Files Modified/Verified
Core Implementation
crates/miroir-proxy/src/middleware.rs- Metrics registry and middlewarecrates/miroir-proxy/src/otel.rs- OpenTelemetry tracingcrates/miroir-proxy/src/main.rs- Structured logging initializationcrates/miroir-proxy/src/routes/health.rs- Health endpointcrates/miroir-proxy/src/routes/admin_endpoints.rs- Admin endpoints (topology, ready, metrics)crates/miroir-proxy/src/routes/admin.rs- Admin router wiring
Kubernetes Manifests
charts/miroir/templates/miroir-deployment.yaml- Health probes, metrics portcharts/miroir/templates/miroir-service.yaml- HTTP and metrics portscharts/miroir/templates/miroir-servicemonitor.yaml- Prometheus scrapingcharts/miroir/templates/miroir-prometheusrule.yaml- Alerting rulescharts/miroir/templates/miroir-grafana-dashboard.yaml- Dashboard ConfigMap
Dashboard
charts/miroir/dashboards/miroir-overview.json- Grafana dashboard definition
Tests
crates/miroir-proxy/tests/p7_1_core_metrics.rs- Metrics acceptance testscrates/miroir-proxy/tests/p7_5_structured_logging.rs- Logging acceptance tests
Verification Commands
# Run metrics tests
cargo test --package miroir-proxy --test p7_1_core_metrics
# Run logging tests
cargo test --package miroir-proxy --test p7_5_structured_logging
# Validate dashboard JSON
python3 -c "import json; json.load(open('charts/miroir/dashboards/miroir-overview.json'))"
# List all alerts
grep -E 'alert: Miroir' charts/miroir/templates/miroir-prometheusrule.yaml
# Verify ServiceMonitor structure
grep -E '^apiVersion:|^kind:|selector:|endpoints:|port:|path:' charts/miroir/templates/miroir-servicemonitor.yaml
Notes
- All metrics are prefixed with
miroir_for easy identification - Feature-gated metrics (§13.11–§13.21) are only registered when the corresponding feature is enabled
- Resource-pressure metrics (§14.9) are always present
- Structured logging uses tracing-subscriber with JSON formatter
- Request IDs are 8-character hex values, propagated via X-Request-Id header
- OTel tracing is disabled by default, enabled via
tracing.enabledconfig