miroir/docs/onboarding/production.md
jedarden f20c1bae4d bf-1p4v: Verify compile error already fixed
The borrow-of-moved-value error for `state` was already fixed in the codebase.
Line 568 uses `.with_state(state.clone())` and `UnifiedState` derives Clone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 06:49:04 -04:00

188 lines
6.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Production Deployment Guide
This guide covers operational considerations for running Miroir in production.
## Sizing
Start with the [Deployment Sizing Guide](../horizontal-scaling/sizing.md) to determine the number of orchestrator pods and task store configuration for your workload.
## Resource envelope
Each orchestrator pod is designed for **2 vCPU / 3.75 GB RAM**. This matches common small-instance tiers across cloud providers:
- AWS: t3.medium
- GCP: e2-medium
- Hetzner: CX22
- Rackspace Spot: 2c/3.75GB
See [plan.md §14](../plan/plan.md#14-resource-envelope-and-horizontal-scaling) for the full resource budget breakdown.
## Task store selection
| Replicas | Recommended task store |
|----------|----------------------|
| 1 | SQLite (default) |
| 2+ | Redis (required) |
The Helm chart's `values.schema.json` enforces this requirement — it rejects configurations where `miroir.replicas > 1` and `taskStore.backend: sqlite`.
For Redis deployment modes:
- **Small deployments** (≤ 200 GB corpus): Single Redis instance is sufficient
- **Medium deployments** (≤ 1 TB corpus): Redis with persistence enabled
- **Large deployments** (≤ 5 TB corpus): Redis Cluster or Sentinel for HA
## Horizontal Pod Autoscaler
Enable HPA for production deployments with variable traffic:
```yaml
miroir:
hpa:
enabled: true
minReplicas: 2
maxReplicas: 24
```
HPA requires `taskStore.backend: redis` and `miroir.replicas >= 2`. The chart enforces these constraints.
See the [sizing matrix](../horizontal-scaling/sizing.md#sizing-matrix) for recommended replica ranges per workload tier.
## High availability
### Minimum HA configuration
- **2 orchestrator pods** (handles one pod failure without downtime)
- **Redis task store** (shared state across pods)
- **Replication factor ≥ 2** for Meilisearch (handles one node failure per shard group)
### Leader election
The orchestrator uses Redis-based leader election for operations that require exactly-one semantics (settings broadcast, ILM rollover). Leader lease TTL is 30s with heartbeat renewal at 10s intervals — a new leader is elected within 30s if the current leader fails.
### Pod disruption budgets
The chart includes a PodDisruptionBudget that ensures minimum availability during voluntary disruptions:
```yaml
minAvailable: 1
```
For stricter guarantees, adjust to `minAvailable: 2` or use percentage-based policies.
## Monitoring
### Resource pressure metrics
Key metrics to monitor:
| Metric | Meaning | Alert threshold |
|--------|---------|-----------------|
| `miroir_requests_in_flight` | Concurrent requests | > 80% of `server.max_concurrent_requests` |
| `miroir_background_queue_depth` | Pending background jobs | > 10 for > 5min |
| `miroir_task_store_latency_seconds` | Task store operation p99 | > 100ms |
| Container CPU | Orchestrator CPU utilization | > 80% sustained |
| Container memory | Orchestrator RAM usage | > 85% sustained |
See [plan.md §14.9](../plan/plan.md#149-resource-pressure-metrics-and-alerts) for the full alerting catalog.
### Search UI rate limiting
When `miroir.replicas > 1`, the Search UI rate limiter **must** use Redis backend:
```yaml
search_ui:
rate_limit:
backend: redis # Required for multi-pod deployments
```
With `backend: local`, the effective cluster-wide rate is `per_ip × pod_count` because each pod maintains its own bucket table.
## Vertical scaling escape valve
If your environment cannot support horizontal scaling (e.g., single-node Kubernetes), you can use a larger pod size (e.g., 4 vCPU / 8 GB). This is supported but not recommended for production — horizontal scaling provides better fault isolation and elasticity.
See [plan.md §14.10](../plan/plan.md#1410-vertical-scaling-escape-valve) for trade-offs.
## Feature flags for memory-constrained deployments
Every advanced capability has an `enabled: true` knob that can be flipped off to reclaim memory:
| Feature | Memory savings | Config path |
|---------|---------------|-------------|
| Vector search | ~30 MB per pod | `vector_search.enabled` |
| Multi-search | ~5 MB per pod | `multi_search.enabled` |
| Shadow tee | ~50 MB per pod (5% sample) | `shadow.enabled` |
| CDC publisher | ~64 MB per pod | `cdc.enabled` |
See [plan.md §13](../plan/plan.md#13-feature-scaling-modes) for the full feature list and scaling modes.
## Backup and recovery
### Meilisearch data
Meilisearch persists index data to disk at `/meilisearch-data`. Backup strategy:
1. **Snapshot volumes** — Take volume snapshots before index operations (reshard, ILM rollover)
2. **Dump exports** — Use `/_indexes/{uid}/documents/export` for logical backups
3. **RG/RF awareness** — With replication factor > 1, you can survive node loss without data loss
### Task store (Redis)
Redis task store contains:
- Idempotency keys (ephemeral, TTL 24h)
- Session pinning state (ephemeral, TTL per session)
- Background job queues (rebuildable from operations log)
- Leader lease (ephemeral, auto-recoverable)
Redis data is **rebuildable** — a fresh Redis instance is acceptable after failure. The orchestrator will reconstruct queues and lease state on startup.
For persistence:
- Enable RDB snapshots (`save 900 1`, `save 300 10`, `save 60 10000`)
- Or use AOF (`appendonly yes`) for stronger durability
### Configuration
Store your Helm values in git. The chart supports external secrets for sensitive values (API keys, JWT secrets).
## Upgrade path
See [versioning-policy.md](../versioning-policy.md) for backward compatibility commitments.
Upgrade procedure:
1. Read the changelog for breaking changes
2. Test in non-production first
3. Roll out orchestrator pods first (Meilisearch API compatibility)
4. Update Meilisearch nodes independently
## Troubleshooting
### Orchestrator pods stuck in CrashLoopBack
Check task store connectivity:
```bash
kubectl logs -n search deployment/miroir --tail 100 | grep -i task
```
If Redis is unreachable, verify `taskStore.backend` and connection settings.
### High memory usage
Identify the culprit:
1. Check `miroir_idempotency_cache_entries` — idempotency cache may be oversized
2. Check `miroir_session_pinning_active_sessions` — session pinning may be leaking
3. Review feature flags — disable unused capabilities
### HPA not scaling
Verify HPA metrics are reporting:
```bash
kubectl get hpa -n search
```
If metrics are `<unknown>`, install or configure the Metrics Server / Prometheus Adapter.
## Further reading
- [Deployment Sizing Guide](../horizontal-scaling/sizing.md) — Corpus/QPS → pod count matrix
- [plan.md §14](../plan/plan.md#14-resource-envelope-and-horizontal-scaling) — Full resource envelope documentation
- [plan.md §13](../plan/plan.md#13-feature-scaling-modes) — Per-feature scaling behavior