The borrow-of-moved-value error for `state` was already fixed in the codebase. Line 568 uses `.with_state(state.clone())` and `UnifiedState` derives Clone. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
188 lines
6.7 KiB
Markdown
188 lines
6.7 KiB
Markdown
# Production Deployment Guide
|
||
|
||
This guide covers operational considerations for running Miroir in production.
|
||
|
||
## Sizing
|
||
|
||
Start with the [Deployment Sizing Guide](../horizontal-scaling/sizing.md) to determine the number of orchestrator pods and task store configuration for your workload.
|
||
|
||
## Resource envelope
|
||
|
||
Each orchestrator pod is designed for **2 vCPU / 3.75 GB RAM**. This matches common small-instance tiers across cloud providers:
|
||
|
||
- AWS: t3.medium
|
||
- GCP: e2-medium
|
||
- Hetzner: CX22
|
||
- Rackspace Spot: 2c/3.75GB
|
||
|
||
See [plan.md §14](../plan/plan.md#14-resource-envelope-and-horizontal-scaling) for the full resource budget breakdown.
|
||
|
||
## Task store selection
|
||
|
||
| Replicas | Recommended task store |
|
||
|----------|----------------------|
|
||
| 1 | SQLite (default) |
|
||
| 2+ | Redis (required) |
|
||
|
||
The Helm chart's `values.schema.json` enforces this requirement — it rejects configurations where `miroir.replicas > 1` and `taskStore.backend: sqlite`.
|
||
|
||
For Redis deployment modes:
|
||
- **Small deployments** (≤ 200 GB corpus): Single Redis instance is sufficient
|
||
- **Medium deployments** (≤ 1 TB corpus): Redis with persistence enabled
|
||
- **Large deployments** (≤ 5 TB corpus): Redis Cluster or Sentinel for HA
|
||
|
||
## Horizontal Pod Autoscaler
|
||
|
||
Enable HPA for production deployments with variable traffic:
|
||
|
||
```yaml
|
||
miroir:
|
||
hpa:
|
||
enabled: true
|
||
minReplicas: 2
|
||
maxReplicas: 24
|
||
```
|
||
|
||
HPA requires `taskStore.backend: redis` and `miroir.replicas >= 2`. The chart enforces these constraints.
|
||
|
||
See the [sizing matrix](../horizontal-scaling/sizing.md#sizing-matrix) for recommended replica ranges per workload tier.
|
||
|
||
## High availability
|
||
|
||
### Minimum HA configuration
|
||
|
||
- **2 orchestrator pods** (handles one pod failure without downtime)
|
||
- **Redis task store** (shared state across pods)
|
||
- **Replication factor ≥ 2** for Meilisearch (handles one node failure per shard group)
|
||
|
||
### Leader election
|
||
|
||
The orchestrator uses Redis-based leader election for operations that require exactly-one semantics (settings broadcast, ILM rollover). Leader lease TTL is 30s with heartbeat renewal at 10s intervals — a new leader is elected within 30s if the current leader fails.
|
||
|
||
### Pod disruption budgets
|
||
|
||
The chart includes a PodDisruptionBudget that ensures minimum availability during voluntary disruptions:
|
||
|
||
```yaml
|
||
minAvailable: 1
|
||
```
|
||
|
||
For stricter guarantees, adjust to `minAvailable: 2` or use percentage-based policies.
|
||
|
||
## Monitoring
|
||
|
||
### Resource pressure metrics
|
||
|
||
Key metrics to monitor:
|
||
|
||
| Metric | Meaning | Alert threshold |
|
||
|--------|---------|-----------------|
|
||
| `miroir_requests_in_flight` | Concurrent requests | > 80% of `server.max_concurrent_requests` |
|
||
| `miroir_background_queue_depth` | Pending background jobs | > 10 for > 5min |
|
||
| `miroir_task_store_latency_seconds` | Task store operation p99 | > 100ms |
|
||
| Container CPU | Orchestrator CPU utilization | > 80% sustained |
|
||
| Container memory | Orchestrator RAM usage | > 85% sustained |
|
||
|
||
See [plan.md §14.9](../plan/plan.md#149-resource-pressure-metrics-and-alerts) for the full alerting catalog.
|
||
|
||
### Search UI rate limiting
|
||
|
||
When `miroir.replicas > 1`, the Search UI rate limiter **must** use Redis backend:
|
||
|
||
```yaml
|
||
search_ui:
|
||
rate_limit:
|
||
backend: redis # Required for multi-pod deployments
|
||
```
|
||
|
||
With `backend: local`, the effective cluster-wide rate is `per_ip × pod_count` because each pod maintains its own bucket table.
|
||
|
||
## Vertical scaling escape valve
|
||
|
||
If your environment cannot support horizontal scaling (e.g., single-node Kubernetes), you can use a larger pod size (e.g., 4 vCPU / 8 GB). This is supported but not recommended for production — horizontal scaling provides better fault isolation and elasticity.
|
||
|
||
See [plan.md §14.10](../plan/plan.md#1410-vertical-scaling-escape-valve) for trade-offs.
|
||
|
||
## Feature flags for memory-constrained deployments
|
||
|
||
Every advanced capability has an `enabled: true` knob that can be flipped off to reclaim memory:
|
||
|
||
| Feature | Memory savings | Config path |
|
||
|---------|---------------|-------------|
|
||
| Vector search | ~30 MB per pod | `vector_search.enabled` |
|
||
| Multi-search | ~5 MB per pod | `multi_search.enabled` |
|
||
| Shadow tee | ~50 MB per pod (5% sample) | `shadow.enabled` |
|
||
| CDC publisher | ~64 MB per pod | `cdc.enabled` |
|
||
|
||
See [plan.md §13](../plan/plan.md#13-feature-scaling-modes) for the full feature list and scaling modes.
|
||
|
||
## Backup and recovery
|
||
|
||
### Meilisearch data
|
||
|
||
Meilisearch persists index data to disk at `/meilisearch-data`. Backup strategy:
|
||
|
||
1. **Snapshot volumes** — Take volume snapshots before index operations (reshard, ILM rollover)
|
||
2. **Dump exports** — Use `/_indexes/{uid}/documents/export` for logical backups
|
||
3. **RG/RF awareness** — With replication factor > 1, you can survive node loss without data loss
|
||
|
||
### Task store (Redis)
|
||
|
||
Redis task store contains:
|
||
- Idempotency keys (ephemeral, TTL 24h)
|
||
- Session pinning state (ephemeral, TTL per session)
|
||
- Background job queues (rebuildable from operations log)
|
||
- Leader lease (ephemeral, auto-recoverable)
|
||
|
||
Redis data is **rebuildable** — a fresh Redis instance is acceptable after failure. The orchestrator will reconstruct queues and lease state on startup.
|
||
|
||
For persistence:
|
||
- Enable RDB snapshots (`save 900 1`, `save 300 10`, `save 60 10000`)
|
||
- Or use AOF (`appendonly yes`) for stronger durability
|
||
|
||
### Configuration
|
||
|
||
Store your Helm values in git. The chart supports external secrets for sensitive values (API keys, JWT secrets).
|
||
|
||
## Upgrade path
|
||
|
||
See [versioning-policy.md](../versioning-policy.md) for backward compatibility commitments.
|
||
|
||
Upgrade procedure:
|
||
1. Read the changelog for breaking changes
|
||
2. Test in non-production first
|
||
3. Roll out orchestrator pods first (Meilisearch API compatibility)
|
||
4. Update Meilisearch nodes independently
|
||
|
||
## Troubleshooting
|
||
|
||
### Orchestrator pods stuck in CrashLoopBack
|
||
|
||
Check task store connectivity:
|
||
```bash
|
||
kubectl logs -n search deployment/miroir --tail 100 | grep -i task
|
||
```
|
||
|
||
If Redis is unreachable, verify `taskStore.backend` and connection settings.
|
||
|
||
### High memory usage
|
||
|
||
Identify the culprit:
|
||
1. Check `miroir_idempotency_cache_entries` — idempotency cache may be oversized
|
||
2. Check `miroir_session_pinning_active_sessions` — session pinning may be leaking
|
||
3. Review feature flags — disable unused capabilities
|
||
|
||
### HPA not scaling
|
||
|
||
Verify HPA metrics are reporting:
|
||
```bash
|
||
kubectl get hpa -n search
|
||
```
|
||
|
||
If metrics are `<unknown>`, install or configure the Metrics Server / Prometheus Adapter.
|
||
|
||
## Further reading
|
||
|
||
- [Deployment Sizing Guide](../horizontal-scaling/sizing.md) — Corpus/QPS → pod count matrix
|
||
- [plan.md §14](../plan/plan.md#14-resource-envelope-and-horizontal-scaling) — Full resource envelope documentation
|
||
- [plan.md §13](../plan/plan.md#13-feature-scaling-modes) — Per-feature scaling behavior
|