Commit graph

15 commits

Author SHA1 Message Date
jedarden
dae7cdd07a P3: Add Helm schema validation - Redis requires replicas > 1
Add Rule 0 to values.schema.json enforcing miroir.replicas > 1 when
taskStore.backend is redis (HA mode requires multiple replicas).

This completes the Phase 3 Task Registry + Persistence epic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-02 18:01:32 -04:00
jedarden
ba70cd25c0 P3: Complete Phase 3 — Task Registry + Persistence (SQLite + Redis)
Implements all 14 tables from plan §4 with dual backend support.

## Implementation

### TaskStore Trait (502 lines)
- Complete API covering all 14 tables
- Runtime backend selection (sqlite | redis)

### SQLite Backend (2,536 lines)
- rusqlite-based with WAL mode
- Idempotent migrations (schema_versions table)
- 36 tests passing (proptest + integration)

### Redis Backend (3,884 lines)
- Full TaskStore trait implementation
- Uses `_index` sets for list queries: O(N) in set cardinality, no keyspace SCAN
- 33 integration tests (testcontainers)

### Schema Files
- 001_initial.sql: Tables 1-7
- 002_feature_tables.sql: Tables 8-14
- 003_task_registry_fields.sql: No-op marker

### Validation
- Helm values.schema.json enforces HA constraints:
  - replicas > 1 requires backend: redis
  - HPA requires replicas >= 2 + redis
- Verified with helm lint

### Documentation
- REDIS_MEMORY_ACCOUNTING.md: Complete sizing guide

## Definition of Done — Complete
- rusqlite store with idempotent table initialization
- Redis store mirrors TaskStore API
- Migrations/versioning with schema_version row
- Property tests (proptest) for SQLite
- Restart resilience integration tests
- Redis integration tests (testcontainers)
- `_index` pattern for list queries
- Helm schema enforces HA requirements
- Redis memory accounting (plan §14.7)

Total: 6,922 lines of production code + tests

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-02 17:14:29 -04:00
jedarden
53506684b7 P3: Task Registry + Persistence — 14-table SQLite schema, Redis mirror, Helm validation
Implements the full 14-table task-store schema from plan §4 with both SQLite
and Redis backends sharing the TaskStore trait. Every §13/§14 advanced capability
consumes one or more of these tables.

SQLite backend:
- 3 migrations (001: tables 1-7, 002: tables 8-14, 003: task registry fields)
- WAL mode + busy_timeout for single-process concurrency
- Schema version tracking with SchemaVersionAhead guard
- Full CRUD + proptest round-trips on all 14 tables
- Restart resilience test: all data survives close/reopen cycle

Redis backend:
- Hash + _index SET pattern for O(cardinality) iteration (no SCAN)
- TTL-based expiration for sessions, idempotency, admin_sessions
- SET NX/XX for leader lease CAS operations
- Sorted sets for canary_runs with auto-prune
- Rate limiting keys for search_ui and admin_login
- CDC overflow buffer with byte-budget trimming
- Scoped key rotation coordination (observe/check pattern)
- Pub/sub for admin session revocation propagation
- testcontainers integration tests for all 14 tables + extras
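
The hash + `_index` set layout above can be illustrated with an in-memory model. This is only a sketch of the pattern, not the repository's Redis code: the `tasks:` key prefix, the `IndexedStore` type, and its method names are assumptions made for illustration, and standard collections stand in for Redis hashes and sets.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-in for the layout: one hash per record plus an
/// `_index` set naming every live record id. All names are hypothetical.
#[derive(Default)]
struct IndexedStore {
    hashes: HashMap<String, HashMap<String, String>>, // models HSET key field value
    index: BTreeSet<String>,                          // models SADD on the `_index` set
}

impl IndexedStore {
    /// Write the record fields and register the id in the index.
    fn put(&mut self, id: &str, fields: &[(&str, &str)]) {
        let hash = self.hashes.entry(format!("tasks:{id}")).or_default();
        for (field, value) in fields {
            hash.insert(field.to_string(), value.to_string());
        }
        self.index.insert(id.to_string());
    }

    /// Read one field of one record.
    fn get(&self, id: &str, field: &str) -> Option<&str> {
        self.hashes
            .get(&format!("tasks:{id}"))?
            .get(field)
            .map(String::as_str)
    }

    /// Remove the record and its index entry together so the index
    /// never points at a missing hash.
    fn delete(&mut self, id: &str) {
        self.hashes.remove(&format!("tasks:{id}"));
        self.index.remove(id);
    }

    /// List ids by walking the index (cost proportional to its
    /// cardinality) instead of scanning the whole keyspace.
    fn list_ids(&self) -> Vec<String> {
        self.index.iter().cloned().collect()
    }
}
```

Against a real Redis server the write and the index update would presumably need to be issued atomically (for example in a transaction); that detail is not described in the commit message and is left out here.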

Helm chart:
- values.schema.json enforces redis backend when replicas > 1
- ESO ExternalSecret template for OpenBao integration
- Updated values with secret inventory and rate limiting config

Config validation:
- replication_factor/replica_groups > 1 requires redis
- HPA enabled requires redis
- CDC overflow=redis requires redis task store
- Leader election required when replica_groups > 1
- CSP/CORS wildcard rejection
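
A minimal sketch of how the first four rules above might be checked, assuming a flattened `Config` struct whose field names are hypothetical. It reads "replication_factor/replica_groups > 1" as "either value above 1", which is an interpretation rather than something the commit states, and it omits the CSP/CORS rule.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Sqlite,
    Redis,
}

/// Hypothetical, flattened view of the settings the rules mention.
struct Config {
    task_store: Backend,
    replication_factor: u32,
    replica_groups: u32,
    hpa_enabled: bool,
    cdc_overflow_redis: bool,
    leader_election: bool,
}

/// Collect every violated rule instead of stopping at the first one,
/// so an operator sees all problems in a single startup failure.
fn validate_config(cfg: &Config) -> Vec<&'static str> {
    let mut errors = Vec::new();
    let redis = cfg.task_store == Backend::Redis;

    if (cfg.replication_factor > 1 || cfg.replica_groups > 1) && !redis {
        errors.push("replication_factor/replica_groups > 1 requires redis");
    }
    if cfg.hpa_enabled && !redis {
        errors.push("HPA enabled requires redis");
    }
    if cfg.cdc_overflow_redis && !redis {
        errors.push("CDC overflow=redis requires redis task store");
    }
    if cfg.replica_groups > 1 && !cfg.leader_election {
        errors.push("leader election required when replica_groups > 1");
    }
    errors
}
```

With the dev defaults (one replica group, SQLite, HPA off) the returned list is empty; a SQLite config with two replica groups, HPA on, and leader election off reports three violations.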

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 15:50:20 -04:00
jedarden
5ff160e80f P7: readiness probe → /_miroir/ready, fix PeerDiscoveryGap alert
- Wire readinessProbe to /_miroir/ready (returns 503 until covering
  quorum reachable) instead of /health (always 200)
- Fix MiroirPeerDiscoveryGap alert to use miroir_peer_pod_count metric
  instead of non-existent miroir_peer_known
- Align MiroirHighSearchLatency, MiroirSettingsDivergence, and
  MiroirAntientropyMismatch alert expressions with registered metric
  names per plan §10

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 13:27:38 -04:00
jedarden
ee3ef23133 P10.5: scoped Meilisearch key rotation with multi-pod coordination
Implements plan §13.21 leader-based rotation of per-index scoped search
keys with zero-403 overlap guarantees:

- Leader lease (Redis, Mode B §14.5) serializes rotation across pods
- Per-pod beacon with 60s TTL refreshed on every search request
- Revocation safety gate: leader checks all live peers observed new
  generation before DELETE /keys/{previous_uid}
- Drain wait (default 120s) for stragglers before revocation
- Auto-rotation trigger: scoped_key_rotate_before_expiry_days (30d)
  before scoped_key_max_age_days (60d)
- Manual trigger: POST /_miroir/ui/search/{index}/rotate-scoped-key
  with force:true to bypass timing gate
- Config validation rejects rotate_before >= max_age at startup
- Helm _helpers.tpl render-time guard against rotation loop
- values.schema.json schema validation for scoped key config fields
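
The revocation safety gate in the list above could look roughly like the following. This is a sketch under assumptions: the `Beacon` struct, its fields, and `safe_to_revoke` are hypothetical names, a beacon older than the 60s TTL is treated as a pod that is no longer live, and the drain wait is not modeled.

```rust
/// Hypothetical per-pod beacon: when it was last refreshed and the newest
/// scoped-key generation that pod has observed.
struct Beacon {
    last_seen_s: u64,
    observed_generation: u64,
}

/// Beacon TTL quoted in the commit message.
const BEACON_TTL_S: u64 = 60;

/// The leader may delete the previous key only when every live peer
/// (beacon refreshed within the TTL) has observed `new_generation`.
/// Expired beacons are ignored, mirroring a key that Redis has dropped.
fn safe_to_revoke(beacons: &[Beacon], new_generation: u64, now_s: u64) -> bool {
    beacons
        .iter()
        .filter(|b| now_s.saturating_sub(b.last_seen_s) < BEACON_TTL_S)
        .all(|b| b.observed_generation >= new_generation)
}
```

One consequence of this formulation is that an empty set of live peers passes the gate vacuously, so a real implementation would likely pair it with the drain wait before issuing the DELETE.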

Also includes session management routes (admin login/logout/session,
search UI JWT session) and auth middleware CSRF protection needed
by the admin-gated rotation endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 07:33:29 -04:00
jedarden
6e35e420a9 P10.3: SEARCH_UI_JWT_SECRET dual-secret overlap rotation
Implement plan §9 JWT signing-secret rotation with zero-downtime dual-secret
overlap window. Primary secret signs new tokens (kid header identifies it),
optional previous secret validates old tokens during rotation. Validation tries
primary first, falls through to previous on signature mismatch, and propagates
Expired immediately when the correct secret is found.
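
The validation order described above can be sketched without a crypto dependency. In this toy model the "signature" is just the name of the secret that signed the token, so only the control flow is shown; `Token`, `JwtError`, `check`, and `validate_token` are hypothetical names, not the contents of auth.rs.

```rust
#[derive(Debug, PartialEq)]
enum JwtError {
    BadSignature,
    Expired,
}

/// Toy token: a real JWT carries an HMAC, here we only record which
/// secret signed it and when it expires.
struct Token<'a> {
    signed_with: &'a str,
    expires_at_s: u64,
}

/// Check one secret: the signature is verified first, expiry second.
fn check(token: &Token<'_>, secret: &str, now_s: u64) -> Result<(), JwtError> {
    if token.signed_with != secret {
        return Err(JwtError::BadSignature);
    }
    if now_s >= token.expires_at_s {
        return Err(JwtError::Expired);
    }
    Ok(())
}

/// Try the primary secret; fall through to the previous secret only on a
/// signature mismatch. `Expired` means the right secret was found, so it
/// is returned immediately without consulting the other secret.
fn validate_token(
    token: &Token<'_>,
    primary: &str,
    previous: Option<&str>,
    now_s: u64,
) -> Result<(), JwtError> {
    match check(token, primary, now_s) {
        Err(JwtError::BadSignature) => match previous {
            Some(prev) => check(token, prev, now_s),
            None => Err(JwtError::BadSignature),
        },
        other => other,
    }
}
```

During the overlap window a token signed with the old secret still validates when `previous` is set, and is rejected as soon as `previous` is `None`.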

Key pieces:
- auth.rs: dual-secret JWT validation with kid header, leak response by leaving
  the previous secret empty, full test coverage (62 tests including e2e rotation scenario)
- main.rs: read SEARCH_UI_JWT_SECRET_PREVIOUS, refuse startup without primary
- config: jwt_secret_previous_env + jwt_rotation_buffer_s in SearchUiAuthConfig
- miroir-ctl: rotate-jwt-secret command (5-step dual-secret overlap procedure)
- Helm CronJob: quarterly schedule, suspended by default, Forbid concurrency

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 16:17:33 -04:00
jedarden
3b209e8b66 P10.1: Secret inventory + ESO ExternalSecret wiring
Expand eso-external-secret.yaml with full secret inventory (plan §9) —
documents all 8 keys with consumer, rotation strategy, and env var mapping.
Wire ADMIN_SESSION_SEAL_KEY, SEARCH_UI_JWT_SECRET,
SEARCH_UI_JWT_SECRET_PREVIOUS, and SEARCH_UI_SHARED_KEY into the Helm
deployment template as optional secretKeyRef env vars. Add startup
validation that refuses to start if search_ui is enabled but
SEARCH_UI_JWT_SECRET is missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 15:18:02 -04:00
jedarden
a7540ab060 P7.3: Add §13.1 resharding row to Grafana dashboard, fix y-coordinate overlaps
Add collapsed Resharding (§13.1) feature-gated row with phase gauge,
in-progress stat, and backfill rate panel. Fix overlapping y=74 on
Anti-Entropy and Settings Broadcast rows by shifting subsequent rows.
Sync charts/miroir/dashboards/ copy with root dashboard.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:18:13 -04:00
jedarden
21748edf5e P8.7: Conditional Helm templates for CDC PVC, Redis, and ESO integration (plan §6/§9/§13.13)
- PVC template conditional on cdc.buffer.primary=="pvc" or cdc.buffer.overflow=="pvc"
- Redis deployment conditional on redis.enabled with auth via auto-generated or ESO secret
- ESO ExternalSecret example pulling from kv/search/miroir via openbao-backend ClusterSecretStore
- Deployment mounts CDC PVC at /data/cdc and injects Redis password when enabled
- ConfigMap generates taskStore.url and cdc.buffer.pvc_path from helpers

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:16:14 -04:00
jedarden
863bf1c33f P8.3: Refine schema rejections and add test runner
Simplify values.schema.json if/then patterns for rules 3-4 (removed
verbose allOf in favor of direct enum constraint in then branch),
drop unsupported errorMessage fields, and add run-tests.sh for
automated CI validation of all 12 schema/template test cases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:10:58 -04:00
jedarden
c86f50fd76 P7.3: Add Grafana dashboard with 8 core panels and feature-gated rows (plan §10)
dashboards/miroir-overview.json — 50-panel dashboard covering:
- Core: cluster health, request rate, p50/p95/p99 latency, node comparison,
  search overhead, task lag, shard distribution, rebalance activity
- Feature-gated collapsed rows: multi-search (§13.11), anti-entropy (§13.8),
  settings broadcast (§13.5), CDC (§13.13), canary tests (§13.18),
  search UI (§13.21)

Helm chart: dashboards.enabled creates a ConfigMap labeled
grafana_dashboard=1 for sidecar auto-import.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 13:02:16 -04:00
jedarden
5b9ae4fa02 P8.3: Add values.schema.json rejection rules for incompatible configs
Schema-enforced rules (helm lint --strict):
- Rule 1: miroir.replicas > 1 requires taskStore.backend=redis
- Rule 2: hpa.enabled requires replicas >= 2 AND taskStore.backend=redis
- Rule 3: search_ui.rate_limit.backend=local rejected when replicas > 1
- Rule 4: admin_ui.login_rate_limit.backend=local rejected when replicas > 1

Template-enforced rule (helm template):
- Rule 5: scoped_key_rotate_before_expiry_days < scoped_key_max_age_days
  (JSON Schema draft-7 cannot compare sibling properties)
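
For illustration, the sibling comparison behind Rule 5 is a one-line check once both values are available to ordinary code. The chart enforces it in a Helm template; the Rust function below is only a sketch of the same comparison, and `check_rotation_window` is a hypothetical name.

```rust
/// Reject a rotation lead time that is not strictly shorter than the
/// key's maximum age.
fn check_rotation_window(
    rotate_before_expiry_days: u32,
    max_age_days: u32,
) -> Result<(), String> {
    if rotate_before_expiry_days >= max_age_days {
        return Err(format!(
            "scoped_key_rotate_before_expiry_days ({rotate_before_expiry_days}) \
             must be less than scoped_key_max_age_days ({max_age_days})"
        ));
    }
    Ok(())
}
```

A 30-day lead time with a 60-day maximum age passes; equal values or a lead time longer than the maximum age are rejected.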

11 test cases: 7 bad configs rejected, 4 good configs pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 12:53:37 -04:00
jedarden
c8d5672d78 P8.2: Scaffold Helm chart with dev defaults (plan §6)
Full chart structure with 14 templates, values.schema.json, and NOTES.txt.
Dev defaults: 1 replica, 64 shards, RF=1, RG=1, sqlite task store, HPA off.
Production upgrade path documented in NOTES.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 12:31:36 -04:00
jedarden
ea6be6a339 P7.4: Add ServiceMonitor and PrometheusRule manifests (plan §10 + §14.9)
ServiceMonitor scrapes the metrics port (9090) at 30s intervals.
PrometheusRule ships all 12 alerts: 7 availability (degraded shards,
node down, high latency, stuck tasks, stuck rebalance, settings
divergence, anti-entropy mismatch) + 5 resource pressure (memory,
request queue, background queue, peer discovery, no leader).

Both gated behind serviceMonitor.enabled / prometheusRule.enabled
(defaults: false — requires prometheus-operator in cluster).

Also adds metrics port to the miroir Service so ServiceMonitor can
select it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 11:42:35 -04:00
jedarden
26d5524fec P3.5: Add values.schema.json constraint for replicas>1 requires Redis
- Create charts/miroir/ with values.schema.json, values.yaml, Chart.yaml
- Add JSON Schema if/then constraint: replicas > 1 requires taskStore.backend=redis
- Include errorMessage for clear operator feedback when constraint is violated
- Add test cases in charts/miroir/tests/ for validation:
  * valid-single-replica-sqlite.yaml (replicas: 1, backend: sqlite) → pass
  * invalid-multi-replica-sqlite.yaml (replicas: 2, backend: sqlite) → fail
  * valid-multi-replica-redis.yaml (replicas: 2, backend: redis) → pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 23:44:15 -04:00