Add cross-links from the production deployment guide and Docker Compose examples README to the main troubleshooting guide and diagnostic playbook. This completes the cross-linking requirement for P11.5. Changes: - docs/onboarding/production.md: Add cross-link to troubleshooting guide - examples/README.md: Add cross-link to troubleshooting guide Closes: miroir-uyx.5 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6.8 KiB
Production Deployment Guide
This guide covers operational considerations for running Miroir in production.
Sizing
Start with the Deployment Sizing Guide to determine the number of orchestrator pods and task store configuration for your workload.
Resource envelope
Each orchestrator pod is designed for 2 vCPU / 3.75 GB RAM. This matches common small-instance tiers across cloud providers:
- AWS: t3.medium
- GCP: e2-medium
- Hetzner: CX22
- Rackspace Spot: 2c/3.75GB
See plan.md §14 for the full resource budget breakdown.
Task store selection
| Replicas | Recommended task store |
|---|---|
| 1 | SQLite (default) |
| 2+ | Redis (required) |
The Helm chart's values.schema.json enforces this requirement — it rejects configurations where miroir.replicas > 1 and taskStore.backend: sqlite.
For Redis deployment modes:
- Small deployments (≤ 200 GB corpus): Single Redis instance is sufficient
- Medium deployments (≤ 1 TB corpus): Redis with persistence enabled
- Large deployments (≤ 5 TB corpus): Redis Cluster or Sentinel for HA
Horizontal Pod Autoscaler
Enable HPA for production deployments with variable traffic:
miroir:
hpa:
enabled: true
minReplicas: 2
maxReplicas: 24
HPA requires taskStore.backend: redis and miroir.replicas >= 2. The chart enforces these constraints.
See the sizing matrix for recommended replica ranges per workload tier.
High availability
Minimum HA configuration
- 2 orchestrator pods (handles one pod failure without downtime)
- Redis task store (shared state across pods)
- Replication factor ≥ 2 for Meilisearch (handles one node failure per shard group)
Leader election
The orchestrator uses Redis-based leader election for operations that require exactly-one semantics (settings broadcast, ILM rollover). Leader lease TTL is 30s with heartbeat renewal at 10s intervals — a new leader is elected within 30s if the current leader fails.
Pod disruption budgets
The chart includes a PodDisruptionBudget that ensures minimum availability during voluntary disruptions:
minAvailable: 1
For stricter guarantees, adjust to minAvailable: 2 or use percentage-based policies.
Monitoring
Resource pressure metrics
Key metrics to monitor:
| Metric | Meaning | Alert threshold |
|---|---|---|
miroir_requests_in_flight |
Concurrent requests | > 80% of server.max_concurrent_requests |
miroir_background_queue_depth |
Pending background jobs | > 10 for > 5min |
miroir_task_store_latency_seconds |
Task store operation p99 | > 100ms |
| Container CPU | Orchestrator CPU utilization | > 80% sustained |
| Container memory | Orchestrator RAM usage | > 85% sustained |
See plan.md §14.9 for the full alerting catalog.
Search UI rate limiting
When miroir.replicas > 1, the Search UI rate limiter must use Redis backend:
search_ui:
rate_limit:
backend: redis # Required for multi-pod deployments
With backend: local, the effective cluster-wide rate is per_ip × pod_count because each pod maintains its own bucket table.
Vertical scaling escape valve
If your environment cannot support horizontal scaling (e.g., single-node Kubernetes), you can use a larger pod size (e.g., 4 vCPU / 8 GB). This is supported but not recommended for production — horizontal scaling provides better fault isolation and elasticity.
See plan.md §14.10 for trade-offs.
Feature flags for memory-constrained deployments
Every advanced capability has an enabled: true knob that can be flipped off to reclaim memory:
| Feature | Memory savings | Config path |
|---|---|---|
| Vector search | ~30 MB per pod | vector_search.enabled |
| Multi-search | ~5 MB per pod | multi_search.enabled |
| Shadow tee | ~50 MB per pod (5% sample) | shadow.enabled |
| CDC publisher | ~64 MB per pod | cdc.enabled |
See plan.md §13 for the full feature list and scaling modes.
Backup and recovery
Meilisearch data
Meilisearch persists index data to disk at /meilisearch-data. Backup strategy:
- Snapshot volumes — Take volume snapshots before index operations (reshard, ILM rollover)
- Dump exports — Use
/_indexes/{uid}/documents/exportfor logical backups - RG/RF awareness — With replication factor > 1, you can survive node loss without data loss
Task store (Redis)
Redis task store contains:
- Idempotency keys (ephemeral, TTL 24h)
- Session pinning state (ephemeral, TTL per session)
- Background job queues (rebuildable from operations log)
- Leader lease (ephemeral, auto-recoverable)
Redis data is rebuildable — a fresh Redis instance is acceptable after failure. The orchestrator will reconstruct queues and lease state on startup.
For persistence:
- Enable RDB snapshots (
save 900 1,save 300 10,save 60 10000) - Or use AOF (
appendonly yes) for stronger durability
Configuration
Store your Helm values in git. The chart supports external secrets for sensitive values (API keys, JWT secrets).
Upgrade path
See versioning-policy.md for backward compatibility commitments.
Upgrade procedure:
- Read the changelog for breaking changes
- Test in non-production first
- Roll out orchestrator pods first (Meilisearch API compatibility)
- Update Meilisearch nodes independently
Troubleshooting
For comprehensive troubleshooting of common issues, see the Troubleshooting Guide and Diagnostic Playbook.
Orchestrator pods stuck in CrashLoopBack
Check task store connectivity:
kubectl logs -n search deployment/miroir --tail 100 | grep -i task
If Redis is unreachable, verify taskStore.backend and connection settings.
High memory usage
Identify the culprit:
- Check
miroir_idempotency_cache_entries— idempotency cache may be oversized - Check
miroir_session_pinning_active_sessions— session pinning may be leaking - Review feature flags — disable unused capabilities
HPA not scaling
Verify HPA metrics are reporting:
kubectl get hpa -n search
If metrics are <unknown>, install or configure the Metrics Server / Prometheus Adapter.
Further reading
- Deployment Sizing Guide — Corpus/QPS → pod count matrix
- plan.md §14 — Full resource envelope documentation
- plan.md §13 — Per-feature scaling behavior