Verified the rebalancer worker implementation with advisory lock is complete and all acceptance tests pass: - Advisory lock via leader_lease (scope: rebalance:<index>) - Progress persistence via jobs table for pod restart resumption - Metrics: rebalance_in_progress, documents_migrated_total, duration_seconds All 24 rebalancer worker tests pass including 4 acceptance tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2.4 KiB
2.4 KiB
P4.1 Rebalancer Background Worker - Implementation Summary
Task Completed
The rebalancer background worker with advisory lock was already implemented in the codebase. Verified all acceptance criteria pass.
Implementation Location
crates/miroir-core/src/rebalancer_worker/mod.rs- Main worker implementationcrates/miroir-core/src/rebalancer_worker/acceptance_tests.rs- Acceptance tests
Key Components
-
Advisory Lock (Leader Lease)
- Uses
try_acquire_leader_leasewith scoperebalance:<index> - Only one pod can hold the lease at a time
- Lease renewal every 2 seconds (configurable)
- TTL of 10 seconds (configurable)
- Uses
-
Topology Change Events
NodeAdded- Triggers shard migration to new nodeNodeDraining- Triggers shard migration away from draining nodeNodeFailed- Marks node as failedNodeRecovered- Marks node as active
-
Shard Migration State Machine
Idle → DualWriteStarted → MigrationInProgress → MigrationComplete → DualWriteStopped → OldReplicaDeleted → Idle -
Progress Persistence
- Jobs persisted to
jobstable in task store - Each shard tracks: phase, docs_migrated, last_offset
load_persisted_jobs()loads state on startup
- Jobs persisted to
-
Metrics (Plan §10)
miroir_rebalance_in_progress- Gauge (0 or 1)miroir_rebalance_documents_migrated_total- Counter (monotonically increasing)miroir_rebalance_duration_seconds- Histogram (per-shard migration time)
Acceptance Tests Verified
- P4.1-A1: Advisory lock prevents duplicate migrations ✓
- P4.1-A2: Progress persistence allows pod restart resumption ✓
- P4.1-A3: Metrics monotonically increase ✓
- P4.1-A4: Two workers produce 0 duplicate migrations ✓
Integration
- Started as background task in
main.rs(line 320-337) - Loads persisted jobs on startup
- Metrics callback wired up in
admin_endpoints.rs - Health checker syncs metrics to Prometheus
Configuration
RebalancerWorkerConfig {
max_concurrent_migrations: 4, // Plan §14.2 memory budget
lease_ttl_secs: 10,
lease_renewal_interval_ms: 2000,
migration_batch_size: 1000,
migration_batch_delay_ms: 100,
event_channel_capacity: 100,
}
Test Results
All 24 rebalancer worker tests pass:
- 4 acceptance tests (P4.1-A1 through P4.1-A4)
- 6 anti-entropy worker tests
- 7 settings broadcast tests
- 7 other unit tests