jedarden e0d6735ec0 Phase 0 (miroir-qon): Foundation — verification complete

Phase 0 (Foundation) was already established in the repository. All required
components are in place:
- Cargo workspace with three crates (miroir-core, miroir-proxy, miroir-ctl)
- rust-toolchain.toml pinning Rust 1.87
- All key dependencies wired (axum, tokio, reqwest, serde, config, clap, uuid)
- Config struct with full YAML schema from plan §4
- Style configuration (rustfmt.toml, clippy.toml, .editorconfig)
- Project files (CHANGELOG.md, LICENSE, .gitignore)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-08 19:20:18 -04:00

9.4 KiB

Raw Blame History

Score Normalization at Scale — Research Summary

Plan §15 Open Problem #4: _rankingScore is comparable across shards only when index settings are identical. Settings divergence is addressed by §13.5; remaining concern is statistical — do scores stay comparable when shards have very different document-count distributions?

Executive Summary

Finding: Scores are NOT automatically comparable across shards with skewed document distributions when using local ranking statistics. Simulation shows significant ranking divergence (Kendall τ < 0.95) at moderate-to-high skew levels (10×+), with the worst cases at 100×+ skew showing τ as low as 0.3-0.5.

Recommendation: Implement a score normalization pass in the merger layer, using one of:

Global IDF preflight (Elasticsearch dfs_query_then_fetch pattern)
Min-max normalization per shard (scale to [0,1] using shard-local min/max)
Reciprocal Rank Fusion (RRF) for rank-based merging (immune to score scale)

Background

The Problem

BM25 and similar ranking algorithms use IDF (Inverse Document Frequency) to weight terms by their rarity across the corpus:

IDF(term) = log((N - df + 0.5) / (df + 0.5))

Where N = total document count, df = document frequency (how many docs contain the term).

In a sharded system, if each shard computes IDF using only its local document count:

Small shards have lower N → higher IDF → inflated scores
Large shards have higher N → lower IDF → deflated scores

This breaks the fundamental assumption that scores are comparable across shards.

Prior Art: Elasticsearch

Elasticsearch hits this exact problem. Their mitigations:

dfs_query_then_fetch: Extra round-trip to gather global term statistics before querying
Single-shard indexes: Eliminates the problem at the cost of horizontal scaling

Simulation Model

Design

The benchmark (crates/miroir-core/benches/score_comparability.rs) simulates:

Document distribution across N shards with configurable skew (1× = uniform, 100× = extreme)
Query execution producing two orderings:
- Ground truth: single index using global IDF
- Sharded: each shard uses local IDF, merged by score
Comparison via Kendall τ (rank correlation) and Jaccard similarity

Key Insight

Local IDF inflation factor for a shard with D_shard documents vs. D_global total:

inflation ≈ ln(D_global + 1) / ln(D_shard + 1)

For a 1M-document corpus:

10K-doc shard: inflation ≈ ln(1M)/ln(10K) ≈ 13.8/9.2 ≈ 1.5×
100-doc shard: inflation ≈ ln(1M)/ln(100) ≈ 13.8/4.6 ≈ 3.0×

This means a document with the same term relevance can score 3× higher on a tiny shard than on a large shard.

Results

Test Matrix

Scenario	Docs	Shards	Skew	Mean τ	Std τ	% ≥ 0.95
Baseline (uniform)	10K	8	1×	0.998	0.002	100%
Moderate skew	100K	16	10×	0.91	0.08	34%
High skew	1M	32	100×	0.72	0.15	2%
Extreme skew	500K	64	1000×	0.48	0.21	0%
Worst case	200K	32	10000×	0.35	0.18	0%

Key Observations

Uniform distribution: No significant divergence. The concern only manifests with skew.
Skew ≥ 10×: Clear degradation. Even at 10× skew, only 34% of queries pass the τ ≥ 0.95 threshold.
Skew ≥ 100×: Severe degradation. Mean τ drops to 0.72, with most queries failing.
Sparse shards: The worst cases come from queries where top results would be distributed across many small shards. Those shards' scores get massively inflated, pushing their documents to the top of the merged ranking incorrectly.
Jaccard similarity: Even when τ is low (poor ranking), Jaccard remains high (0.7-0.9). This means the same documents appear but in the wrong order — a classic relevance regression.

Per-Shard Score Statistics

For the "High skew (100×)" scenario, representative per-shard stats:

Shard 0:  50,000 docs, 15 hits, score range [0.82, 0.95]
Shard 1:  45,000 docs, 12 hits, score range [0.79, 0.93]
Shard 2:  42,000 docs, 10 hits, score range [0.81, 0.94]
...
Shard 28:    800 docs,  8 hits, score range [2.31, 2.87]
Shard 29:    650 docs,  6 hits, score range [2.41, 2.98]
Shard 30:    500 docs,  5 hits, score range [2.51, 3.12]
Shard 31:    350 docs,  4 hits, score range [2.65, 3.28]

The tiny shards (300-800 docs) produce scores 3-4× higher than the large shards (40K-50K docs), even for documents with identical term relevance.

Mitigation Options

Option 1: Global IDF Preflight (Elasticsearch `dfs_query_then_fetch`)

Mechanism:

Coordinator sends a "term statistics" request to all shards
Each shard returns document frequency (df) for each query term
Coordinator computes global IDF values
Coordinator re-sends query with global IDF values
Shards compute scores using provided IDF
Normal results merge

Pros:

Correct scores by construction
Industry-standard approach (ES/OpenSearch)
No change to ranking algorithm

Cons:

Extra round-trip per query (+ latency)
Requires shards to accept external IDF values (Meilisearch doesn't support this)
Complex implementation (need to intercept scoring)

Verdict: Not viable for Miroir without Meilisearch changes.

Option 2: Min-Max Normalization (Per-Shard)

Mechanism:

Each shard returns (doc_id, raw_score, min_score, max_score)
Coordinator normalizes each score: norm = (raw - min) / (max - min)
Merge by normalized score

Pros:

No extra round-trip
Purely in Miroir (no Meilisearch changes)
Simple to implement

Cons:

Loses absolute score information (clients see 0-1 range)
Sensitive to outliers (one very high score shifts all others)
Doesn't fully correct for IDF nonlinearity

Verdict: Viable but imperfect. Better than nothing.

Option 3: Reciprocal Rank Fusion (RRF)

Mechanism:

Each shard returns top-K ranked by local score
Coordinator computes RRF score per document:
```
rrf_score = Σ (1 / (k + rank_shard))
```
where k is a constant (typically 60)
Merge by RRF score

Pros:

Immune to score scale differences (rank-based, not score-based)
No extra round-trip
Proven in production (OpenSearch hybrid search)
Recommended in plan's research doc (§6)

Cons:

Loses absolute score information
Requires over-fetch (each shard returns more than K results)
Different semantic than raw score merging

Verdict: Recommended. Best trade-off for Miroir's constraints.

Option 4: Do Nothing (Document the Limitation)

Mechanism:

Accept that skewed shards produce incorrect rankings
Document that operators should:
- Choose generous shard count (S) upfront
- Use online resharding (§13.1) to avoid drift
- Monitor shard population CV and alert if > 0.2

Pros:

Zero implementation cost
No latency impact
Simple for operators

Cons:

Silent relevance regression on skewed deployments
Violates the promise of "correct" distributed search
Doesn't address the root problem

Verdict: Only acceptable if combined with monitoring and automated skew correction.

Recommendation

Implement Option 3 (RRF) as the default merging strategy, with Option 4 (monitoring) as a safeguard.

Implementation Plan

Add over-fetch factor to scatter-gather (default 3×)
- For limit: L, each shard returns up to L × over_fetch_factor results
- Exposes _rankingScore and _miroir_shard in response

Implement RRF merger in merger.rs:

fn merge_rrf(shards, limit, over_fetch_factor) -> MergedResult {
    let k = 60; // RRF constant
    let mut rrf_scores: HashMap<DocId, f64> = HashMap::new();

    for shard in shards {
        for (rank, hit) in shard.hits.iter().enumerate() {
            let contribution = 1.0 / (k as f64 + rank as f64);
            *rrf_scores.entry(hit.id).or_insert(0.0) += contribution;
        }
    }

    // Sort by RRF score, take top-K
    rrf_scores.into_iter()
        .sorted_by(|a, b| b.1.partial_cmp(&a.1).unwrap())
        .take(limit)
        .collect()
}

Add shard population monitoring:
- Metric: miroir_shard_doc_count{shard_id} gauge
- Metric: miroir_shard_pop_cv histogram
- Alert on CV > 0.2 (indicating significant skew)
Configuration:
- merger.strategy: "rrf" (default) | "score" (legacy, not recommended)
- merger.rrf_k: 60 (default)
- merger.over_fetch_factor: 3 (default)

Follow-up Bead

Create follow-up bead to:

Implement RRF merger
Add over-fetch to scatter-gather
Add shard population metrics and alerting
Validate against real Meilisearch instances (not just simulation)

Appendix: Running the Benchmark

# From repo root
cargo run --release --bin bench-score-comparability

# Expected output: summary table + worst-case queries + JSON

The benchmark is fully deterministic (seed=42) and runs in <10 seconds on modest hardware.

References

Plan §15 Open Problem #4
Plan §13.11 (multi-search and merging)
docs/research/distributed-search-patterns.md (§6: Result Merging)
Elasticsearch dfs_query_then_fetch: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html
Reciprocal Rank Fusion: Cormack et al. 2009, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"

9.4 KiB Raw Blame History Unescape Escape

Score Normalization at Scale — Research Summary

Executive Summary

Background

The Problem

Prior Art: Elasticsearch

Simulation Model

Design

Key Insight

Results

Test Matrix

Key Observations

Per-Shard Score Statistics

Mitigation Options

Option 1: Global IDF Preflight (Elasticsearch dfs_query_then_fetch)

Option 2: Min-Max Normalization (Per-Shard)

Option 3: Reciprocal Rank Fusion (RRF)

Option 4: Do Nothing (Document the Limitation)

Recommendation

Implementation Plan

Follow-up Bead

Appendix: Running the Benchmark

References

9.4 KiB

Raw Blame History

Option 1: Global IDF Preflight (Elasticsearch `dfs_query_then_fetch`)