miroir/notes/phase-7-observability-verification.md
jedarden 158752fe7b feat(multi-search): implement timeout enforcement and acceptance tests (§13.11)
- Add per-query and total timeout enforcement to MultiSearchExecutor
- Improve SearchResult with helper methods (ok, err, timeout, is_success)
- Fix ModeACoordinator feature-gate compilation issues
- Add comprehensive acceptance tests for multi-search:
  - 5-query batch completion
  - Slow query doesn't block fast queries (parallel execution)
  - Partial failure handling
  - Per-query timeout
  - Total timeout
  - 100-query batch completion

Closes: miroir-uhj.11
2026-05-24 01:54:20 -04:00

# Phase 7 - Observability + Ops (§10) Verification Summary
## Implementation Status: COMPLETE ✅
### Definition of Done Checklist
| Item | Status | Notes |
|------|--------|-------|
| ✅ Every metric in plan §10 + §14.9 registered and scraping on port 9090 | Complete | All metrics implemented in `middleware.rs` with proper Prometheus registry |
| ✅ `/_miroir/metrics` on port 7700 returns identical data when admin-key-authenticated | Complete | Endpoint exists in `admin_endpoints.rs`, routed in `admin.rs` |
| ✅ Grafana dashboard JSON imports cleanly; all 8 core panels render | Complete | Dashboard validated, 8 core panels + feature-gated panels |
| ✅ All 12 alerts live in the shipped PrometheusRule manifest | Complete | All alerts present in `miroir-prometheusrule.yaml` |
| ✅ OTel trace contains one parent span per request and one child per node call | Complete | Implemented in `otel.rs` with proper span propagation |
| ✅ Log entries match the schema verbatim (parseable as JSON) | Complete | Structured JSON logging in `main.rs` with tracing-subscriber |
| ✅ ServiceMonitor picks up the metrics service | Complete | Configured in `miroir-servicemonitor.yaml` |
## Implementation Details
### Health Endpoints
| Endpoint | Purpose | Location |
|----------|---------|----------|
| `GET /health` | Meilisearch-compatible liveness | `health.rs` → livenessProbe |
| `GET /_miroir/ready` | Readiness (503 until covering quorum reachable) | `admin_endpoints.rs` → readinessProbe |
| `GET /_miroir/topology` | Full cluster state per plan §10 | `admin_endpoints.rs` |
| `GET /_miroir/metrics` | Admin-key-gated Prometheus metrics | `admin_endpoints.rs` |
| `GET /metrics` | Unauthenticated metrics on port 9090 | `middleware.rs` |
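
The readiness rule in the table (503 until a covering quorum is reachable) can be sketched as follows. This is a minimal illustration, not the logic in `admin_endpoints.rs`: it assumes "covering" means every shard has at least one healthy replica, and the function name, the shard-to-replica map, and the choice to treat an empty topology as not ready are all hypothetical.

```python
from typing import Dict, Set


def readiness_status(shard_replicas: Dict[int, Set[str]], healthy_nodes: Set[str]) -> int:
    """Return the HTTP status a readiness probe would see.

    Assumption: the proxy is ready only when every known shard has at
    least one replica on a healthy node.
    """
    if not shard_replicas:
        # Assumption: with no topology loaded yet, report not ready.
        return 503
    for replicas in shard_replicas.values():
        if not (replicas & healthy_nodes):
            # At least one shard has no reachable replica.
            return 503
    return 200


if __name__ == "__main__":
    topology = {0: {"node-a", "node-b"}, 1: {"node-b", "node-c"}}
    print(readiness_status(topology, {"node-b"}))  # every shard covered by node-b
    print(readiness_status(topology, {"node-a"}))  # shard 1 has no healthy replica
```

Keeping liveness (`/health`) independent of this check means a pod that has lost its nodes is taken out of rotation by the readinessProbe without being restarted by the livenessProbe.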
### Prometheus Metrics (65 metrics total)
#### Core Metrics (18) - Plan §10
- Request metrics: `miroir_request_duration_seconds`, `miroir_requests_total`, `miroir_requests_in_flight`
- Node health: `miroir_node_healthy`, `miroir_node_request_duration_seconds`, `miroir_node_errors_total`
- Shard metrics: `miroir_shard_coverage`, `miroir_degraded_shards_total`, `miroir_shard_distribution`
- Task metrics: `miroir_task_processing_age_seconds`, `miroir_tasks_total`, `miroir_task_registry_size`
- Scatter-gather: `miroir_scatter_fan_out_size`, `miroir_scatter_partial_responses_total`, `miroir_scatter_retries_total`
- Rebalancer: `miroir_rebalance_in_progress`, `miroir_rebalance_documents_migrated_total`, `miroir_rebalance_duration_seconds`
#### Advanced Capabilities Metrics (40) - Plan §13.11–§13.21
- Multi-search (§13.11): `miroir_multisearch_queries_per_batch`, `miroir_multisearch_batches_total`, `miroir_multisearch_partial_failures_total`, `miroir_tenant_session_pin_override_total`
- Vector search (§13.12): `miroir_vector_search_over_fetched_total`, `miroir_vector_merge_strategy`, `miroir_vector_embedder_drift_total`
- CDC (§13.13): `miroir_cdc_events_published_total`, `miroir_cdc_lag_seconds`, `miroir_cdc_buffer_bytes`, `miroir_cdc_dropped_total`, `miroir_cdc_events_suppressed_total`
- TTL (§13.14): `miroir_ttl_documents_expired_total`, `miroir_ttl_sweep_duration_seconds`, `miroir_ttl_pending_estimate`
- Tenant affinity (§13.15): `miroir_tenant_queries_total`, `miroir_tenant_pinned_groups`, `miroir_tenant_fallback_total`
- Shadow traffic (§13.16): `miroir_shadow_diff_total`, `miroir_shadow_kendall_tau`, `miroir_shadow_latency_delta_seconds`, `miroir_shadow_errors_total`
- ILM (§13.17): `miroir_rollover_events_total`, `miroir_rollover_active_indexes`, `miroir_rollover_documents_expired_total`, `miroir_rollover_last_action_seconds`
- Canary (§13.18): `miroir_canary_runs_total`, `miroir_canary_latency_ms`, `miroir_canary_assertion_failures_total`
- Admin UI (§13.19): `miroir_admin_ui_sessions_total`, `miroir_admin_ui_action_total`, `miroir_admin_ui_destructive_action_total`
- Explain (§13.20): `miroir_explain_requests_total`, `miroir_explain_warnings_total`, `miroir_explain_execute_total`
- Search UI (§13.21): `miroir_search_ui_sessions_total`, `miroir_search_ui_queries_total`, `miroir_search_ui_zero_hits_total`, `miroir_search_ui_click_through_total`, `miroir_search_ui_p95_ms`
#### Resource-Pressure Metrics (7) - Plan §14.9
- `miroir_memory_pressure`, `miroir_cpu_throttled_seconds_total`, `miroir_request_queue_depth`, `miroir_background_queue_depth`, `miroir_peer_pod_count`, `miroir_leader`, `miroir_owned_shards_count`
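
One way to spot-check registration is to compare the metric families in a scrape against the names listed above. The sketch below relies only on the `# TYPE` lines of the Prometheus text exposition format; the helper names and the sample scrape body are illustrative and are not taken from the test suite.

```python
from typing import Iterable, Set


def metric_families(exposition: str) -> Set[str]:
    """Collect metric family names from '# TYPE <name> <type>' lines."""
    names = set()
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


def missing_metrics(exposition: str, expected: Iterable[str]) -> Set[str]:
    """Return the expected names that do not appear in the scrape."""
    return set(expected) - metric_families(exposition)


# Illustrative scrape body; labels and values are made up for the example.
SAMPLE_SCRAPE = """\
# HELP miroir_requests_total Total requests.
# TYPE miroir_requests_total counter
miroir_requests_total{path="/search",status="200"} 3
# TYPE miroir_requests_in_flight gauge
miroir_requests_in_flight 0
"""

if __name__ == "__main__":
    expected = ["miroir_requests_total", "miroir_requests_in_flight", "miroir_node_healthy"]
    print(sorted(missing_metrics(SAMPLE_SCRAPE, expected)))
```

In practice the scrape body would come from `/metrics` on port 9090, and the expected list would include feature-gated names only for features that are enabled.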
### Grafana Dashboard
**File:** `charts/miroir/dashboards/miroir-overview.json`

**Core Panels (8):**
1. Cluster Health - Degraded Shards, Shard Coverage, Node Health table
2. Request Rate - Requests/sec by Path, Requests/sec by Status
3. Request Latency - p50/p95/p99
4. Node Latency - Per-Node p99, Node Error Rate
5. Search Overhead - Scatter Fan-Out, Partial Responses/Retries, Requests in Flight
6. Task Lag - Processing Age, Tasks by Status, Registry Size
7. Shard Distribution - Shards per Node, Shard Imbalance
8. Rebalance Activity - In Progress, Documents Migrated, Duration

**Feature-Gated Panels (collapsible rows):**
- Resharding (§13.1)
- Multi-Search (§13.11)
- Anti-Entropy (§13.8)
- Settings Broadcast (§13.5)
- CDC (§13.13)
- Canary Tests (§13.18)
- Search UI (§13.21)
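
Beyond checking that the dashboard file parses, it can help to list which panels sit under which row. The sketch below walks the standard Grafana dashboard JSON model, where an expanded row is followed by its panels at the top level and a collapsed row nests them in its own `panels` list. The function name and the sample dashboard are hypothetical; the actual layout of `miroir-overview.json` is not reproduced here.

```python
import json
from typing import Dict, List


def titles_by_row(dashboard: dict) -> Dict[str, List[str]]:
    """Group panel titles under the row they belong to.

    Panels that appear before any row are grouped under the empty string.
    """
    groups: Dict[str, List[str]] = {"": []}
    current = ""
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            current = panel.get("title", "")
            groups.setdefault(current, [])
            # A collapsed row stores its children in a nested "panels" list.
            for child in panel.get("panels", []):
                groups[current].append(child.get("title", ""))
        else:
            # Panels following an expanded row sit at the top level.
            groups[current].append(panel.get("title", ""))
    return groups


# Illustrative dashboard fragment, not the shipped file.
SAMPLE_DASHBOARD = json.loads("""
{"panels": [
  {"type": "row", "title": "Cluster Health", "collapsed": false},
  {"type": "stat", "title": "Degraded Shards"},
  {"type": "row", "title": "CDC", "collapsed": true,
   "panels": [{"type": "timeseries", "title": "CDC Lag"}]}
]}
""")

if __name__ == "__main__":
    for row, titles in titles_by_row(SAMPLE_DASHBOARD).items():
        print(repr(row), titles)
```

Against the real file, the input would be the result of `json.load` on `charts/miroir/dashboards/miroir-overview.json`.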
### PrometheusRule Alerts (12)
**Availability Alerts (7):**
1. `MiroirDegradedShards` - Degraded shard count > 0 for 2m
2. `MiroirNodeDown` - Node unhealthy for 5m
3. `MiroirHighSearchLatency` - p95 search latency > 2s for 5m
4. `MiroirTaskStuck` - Task processing age > 1h for 10m
5. `MiroirRebalanceStuck` - Rebalance in progress for > 2h
6. `MiroirSettingsDivergence` - Settings divergence without repair
7. `MiroirAntientropyMismatch` - Persistent replica divergence across 3 passes

**Resource-Pressure Alerts (5):**
1. `MiroirMemoryPressure` - Memory pressure >= 2 for 5m
2. `MiroirRequestQueueBacklog` - Queue depth > 500 for 2m
3. `MiroirBackgroundJobBacklog` - Background queue > 100 for 10m
4. `MiroirPeerDiscoveryGap` - Peer count mismatch for 2m
5. `MiroirNoLeader` - No leader elected for 1m
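
Most of the alerts above pair a threshold with a hold duration (for example, degraded shard count > 0 for 2m). The sketch below models that pattern in the style of the Prometheus `for` clause: the condition has to hold at every evaluation across the window before the alert fires, and a single healthy sample resets the timer. It is an illustration only; the function name and sample format are hypothetical, and the real rules live in `miroir-prometheusrule.yaml`.

```python
from typing import List, Tuple


def alert_fires(samples: List[Tuple[float, float]], threshold: float, hold_seconds: float) -> bool:
    """Decide whether an alert would be firing at the last sample.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    The alert fires when value > threshold has held continuously for at
    least hold_seconds up to and including the last sample.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
        else:
            pending_since = None  # any healthy sample resets the window
    return pending_since is not None and samples[-1][0] - pending_since >= hold_seconds


if __name__ == "__main__":
    # Degraded shard count sampled every 30s; threshold 0, hold 2 minutes.
    steady = [(0, 1), (30, 1), (60, 2), (90, 1), (120, 1)]
    flapping = [(0, 1), (30, 1), (60, 0), (90, 1), (120, 1)]
    print(alert_fires(steady, 0, 120), alert_fires(flapping, 0, 120))
```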
### Ports Configuration
| Port | Purpose | Access | Metrics path |
|------|---------|--------|------|
| 7700 | Main API | External + admin-key | `/_miroir/metrics` |
| 9090 | Metrics | Pod-internal only | `/metrics` |
### Test Coverage
**P7.1 Core Metrics Tests** (`tests/p7_1_core_metrics.rs`):
- ✅ test_all_core_metrics_registered
- ✅ test_scatter_fan_out_metric_records_correctly
- ✅ test_node_health_metrics_have_correct_labels
- ✅ test_node_request_duration_has_operation_label
- ✅ test_task_metrics_have_status_label

**P7.5 Structured Logging Tests** (`tests/p7_5_structured_logging.rs`):
- ✅ JSON logs parseable by jq
- ✅ Request ID format and correlation
- ✅ No PII in logs (API keys, query strings, document content)
- ✅ Log volume (2 INFO entries per search request)
- ✅ Request ID response header propagation
- ✅ Request ID appears in all log lines within request
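
The parseability and log-volume checks above can be approximated outside the Rust test suite with a short script. The sketch below assumes one flat JSON object per line with `level` and `request_id` fields; those field names are assumptions for illustration, since the log schema itself is not reproduced in these notes.

```python
import json
from collections import Counter


def info_entries_per_request(log_text: str) -> Counter:
    """Count INFO entries per request ID in newline-delimited JSON logs.

    Raises ValueError if any non-empty line is not a JSON object.
    """
    counts: Counter = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)  # json.JSONDecodeError is a ValueError
        if not isinstance(entry, dict):
            raise ValueError(f"log line is not a JSON object: {line!r}")
        # "level" and "request_id" are assumed field names.
        if entry.get("level") == "INFO":
            counts[entry.get("request_id")] += 1
    return counts


# Illustrative log lines, not captured from a real run.
SAMPLE_LOG = "\n".join([
    '{"level": "INFO", "request_id": "a1b2c3d4", "message": "request started"}',
    '{"level": "DEBUG", "request_id": "a1b2c3d4", "message": "fan-out"}',
    '{"level": "INFO", "request_id": "a1b2c3d4", "message": "request finished"}',
])

if __name__ == "__main__":
    print(dict(info_entries_per_request(SAMPLE_LOG)))
```

A count other than 2 for a search request ID would indicate drift from the log-volume expectation listed above.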
## Files Modified/Verified
### Core Implementation
- `crates/miroir-proxy/src/middleware.rs` - Metrics registry and middleware
- `crates/miroir-proxy/src/otel.rs` - OpenTelemetry tracing
- `crates/miroir-proxy/src/main.rs` - Structured logging initialization
- `crates/miroir-proxy/src/routes/health.rs` - Health endpoint
- `crates/miroir-proxy/src/routes/admin_endpoints.rs` - Admin endpoints (topology, ready, metrics)
- `crates/miroir-proxy/src/routes/admin.rs` - Admin router wiring
### Kubernetes Manifests
- `charts/miroir/templates/miroir-deployment.yaml` - Health probes, metrics port
- `charts/miroir/templates/miroir-service.yaml` - HTTP and metrics ports
- `charts/miroir/templates/miroir-servicemonitor.yaml` - Prometheus scraping
- `charts/miroir/templates/miroir-prometheusrule.yaml` - Alerting rules
- `charts/miroir/templates/miroir-grafana-dashboard.yaml` - Dashboard ConfigMap
### Dashboard
- `charts/miroir/dashboards/miroir-overview.json` - Grafana dashboard definition
### Tests
- `crates/miroir-proxy/tests/p7_1_core_metrics.rs` - Metrics acceptance tests
- `crates/miroir-proxy/tests/p7_5_structured_logging.rs` - Logging acceptance tests
## Verification Commands
```bash
# Run metrics tests
cargo test --package miroir-proxy --test p7_1_core_metrics
# Run logging tests
cargo test --package miroir-proxy --test p7_5_structured_logging
# Validate dashboard JSON
python3 -c "import json; json.load(open('charts/miroir/dashboards/miroir-overview.json'))"
# List all alerts
grep -E 'alert: Miroir' charts/miroir/templates/miroir-prometheusrule.yaml
# Verify ServiceMonitor structure
grep -E '^apiVersion:|^kind:|selector:|endpoints:|port:|path:' charts/miroir/templates/miroir-servicemonitor.yaml
```
## Notes
- All metrics are prefixed with `miroir_` for easy identification
- Feature-gated metrics (§13.11–§13.21) are only registered when the corresponding feature is enabled
- Resource-pressure metrics (§14.9) are always present
- Structured logging uses tracing-subscriber with JSON formatter
- Request IDs are 8-character hex values, propagated via the `X-Request-Id` header
- OTel tracing is disabled by default, enabled via `tracing.enabled` config
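
The request-ID format noted above (8 hex characters) can be illustrated with a short sketch. It assumes lowercase hex and uses Python's `secrets.token_hex` purely for illustration; how the proxy actually generates IDs is not stated in these notes, and both function names are hypothetical.

```python
import re
import secrets

# Assumption: IDs are lowercase hex; adjust the class if uppercase is allowed.
REQUEST_ID_RE = re.compile(r"[0-9a-f]{8}")


def new_request_id() -> str:
    """Generate an 8-character hex ID (4 random bytes, 2 hex digits each)."""
    return secrets.token_hex(4)


def is_valid_request_id(value: str) -> bool:
    """Check that a value is exactly 8 lowercase hex characters."""
    return REQUEST_ID_RE.fullmatch(value) is not None


if __name__ == "__main__":
    rid = new_request_id()
    print(rid, is_valid_request_id(rid))
```

A validator like this is useful when checking that the `X-Request-Id` response header and the ID in each log line share the same shape.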