Research complete: score-based and RRF merges both fail the 0.95 Kendall τ threshold.
Updated research doc with full RRF validation results and confidence intervals.
Added benchmark result reports and helper tests. Follow-up bead miroir-n6v
created for global-IDF preflight (dfs_query_then_fetch pattern).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Research doc updated with precise 95% CIs per query type. compare.py
now computes and reports confidence intervals. Kendall τ = 0.79
(95% CI [0.7873, 0.8006]) confirms raw score merging is not viable;
RRF already implemented in merger.rs as mitigation. Follow-up bead
created (miroir-zfo) for RRF quality validation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Completed Plan §15 Open Problem #4 research on cross-shard score comparability.
## Key Finding
Average Kendall tau: 0.79 vs. 0.95 threshold — FAIL
Cross-shard score comparability is a significant issue:
- Common-term queries: τ = 0.15 (catastrophic)
- Local IDF statistics cause score inflation on small shards
- Documents from 10-doc shards outrank 93K-doc shard results
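
The inflation can be reproduced with a minimal sketch, assuming the Lucene-style BM25 IDF formula. Only the shard sizes (10 and 93K documents) come from the finding above; the document frequencies and the `bm25_idf` helper are hypothetical and are not taken from the benchmark code.

```python
import math


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# (shard size, document frequency of one common term).
# Shard sizes follow the finding; document frequencies are made up.
shards = {"small": (10, 1), "large": (93_000, 60_000)}

# Local IDF: each shard only sees its own statistics.
local_idf = {name: bm25_idf(n, df) for name, (n, df) in shards.items()}

# Global IDF: statistics summed across shards before scoring.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = bm25_idf(total_docs, total_df)

for name, idf in local_idf.items():
    print(f"{name}: local IDF = {idf:.3f}, global IDF = {global_idf:.3f}")
```

With these assumed frequencies the small shard assigns the term several times the IDF that the large shard does, so its single match can outrank better matches from the large shard when raw scores are merged.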
## Recommendation
Implement Reciprocal Rank Fusion (RRF) for result merging.
Follow-up bead: miroir-nsu
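
For reference, a minimal sketch of RRF over per-shard rankings, assuming the commonly used constant k = 60 and disjoint shards. The `rrf_merge` name, the tie-break on document ID, and the sample IDs are illustrative assumptions, not the project's implementation.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60):
    """Fuse rankings by summing 1 / (k + rank) per document (ranks start at 1)."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical per-shard results, each already sorted by local score.
shard_a = ["a1", "a2", "a3"]
shard_b = ["b1", "b2"]

merged = rrf_merge([shard_a, shard_b])
print(merged)  # ['a1', 'b1', 'a2', 'b2', 'a3']
```

Because only ranks are used, locally inflated scores cannot change the merge order. With disjoint shards each document appears in exactly one list, so the result is a rank-wise interleaving of the shard rankings.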
## Artifacts Added
- Benchmark infrastructure: tests/benches/score-comparability/
  - Corpus generator with extreme shard skew (100× variance)
  - Query generator (10K random queries across 5 types)
  - BM25-based simulation with global vs local IDF
  - Kendall tau comparison tool
- Full experimental results (τ = 0.79 ± 0.01, 95% CI)
- Research writeup: docs/research/score-normalization-at-scale.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>