jedarden 9cdd659c73 miroir-zc2.4: Verify score normalization at scale (note-of-no-action)

Verified that the global-IDF preflight (dfs_query_then_fetch) implementation
achieves τ = 0.9818, well above the 0.95 pass threshold.

Acceptance criteria:
- ✅ Benchmark corpus + query set in tests/benches/score-comparability/
- ✅ Results with 95% CI: [0.9815, 0.9820]
- ✅ τ ≥ 0.95: note-of-no-action (DFS implementation already correct)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-20 07:12:51 -04:00


Bead miroir-zc2.4: Score Normalization at Scale — Verification

Date

2026-05-20

Summary

Verified that the score normalization research and the DFS (global-IDF preflight) implementation from beads miroir-zc2.4, miroir-zfo (RRF validation), and miroir-n6v (DFS implementation) are complete and that the benchmark passes.

Acceptance Criteria Verification

✅ Benchmark corpus + query set published

Location: tests/benches/score-comparability/

corpus/generate.py — Synthetic corpus generator with intentional shard skew
queries/generate.py — Random query set generator (10K queries across 5 types)
simulate.py — BM25-based score simulation with local/DFS variants
results/compare.py — Kendall tau comparison tool

✅ Results reported with confidence intervals

Metric	Value
Total queries	10,000
Average Kendall τ	0.9818
95% CI	[0.9815, 0.9820]
Min τ	0.9523
Max τ	1.0000
Queries with τ < 0.95	0 (0%)
Pass criterion (τ ≥ 0.95)	✓ PASS
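
For readers who want to reproduce the shape of these numbers, the following is a minimal sketch of a per-query Kendall τ and a confidence interval on its mean. It is not the code in results/compare.py: the function names are hypothetical, the τ variant is tau-a over tie-free rankings, and the normal-approximation interval (z = 1.96) is an assumption, since this note does not state how the 95% CI was computed.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two tie-free rankings of the same document IDs."""
    pos_b = {doc: i for i, doc in enumerate(rank_b)}
    # Positions in rank_b, listed in rank_a order.
    seq = [pos_b[doc] for doc in rank_a]
    concordant = discordant = 0
    for i, j in combinations(range(len(seq)), 2):
        if seq[i] < seq[j]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0


def mean_with_ci(values, z=1.96):
    """Sample mean and a normal-approximation interval around it (assumed method)."""
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m, (m - half_width, m + half_width)


# Hypothetical usage: one tau per query, then aggregate.
taus = [
    kendall_tau(["d1", "d2", "d3"], ["d1", "d2", "d3"]),
    kendall_tau(["d1", "d2", "d3"], ["d1", "d3", "d2"]),
]
avg_tau, ci = mean_with_ci(taus)
```

The pairwise loop is quadratic in the ranking length, which is acceptable for top-k result lists but would need a merge-sort-based count for long rankings.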

Per-query type results:

Query Type	Avg τ	95% CI
multi_term	0.9956	[0.9955, 0.9958]
common_term	0.9845	[0.9842, 0.9848]
filtered	0.9792	[0.9789, 0.9795]
single_term	0.9774	[0.9771, 0.9777]
rare_term	0.9666	[0.9663, 0.9670]

✅ τ ≥ 0.95: Note-of-no-action

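The global-IDF preflight (dfs_query_then_fetch) achieves an average Kendall τ of 0.9818 across 10,000 queries, well above the 0.95 threshold, and no individual query falls below 0.95 (minimum τ = 0.9523). No further action is required: the implementation is correct and performs as expected.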
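
The idea behind a global-IDF preflight can be sketched as follows. This is an illustration only, not the code in crates/miroir-core/src/scatter.rs: the function names and shard statistics are hypothetical, and the Lucene-style BM25 IDF formula is an assumption about the scoring function. Each shard reports its document count and the term's document frequency; the coordinator sums both and computes a single IDF that every shard then uses.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Lucene-style BM25 IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def global_idf(shard_stats):
    """Preflight step: sum N and df over all shards, then compute one IDF."""
    total_docs = sum(n for n, _ in shard_stats)
    total_df = sum(df for _, df in shard_stats)
    return bm25_idf(total_docs, total_df)


# Hypothetical (doc_count, doc_freq) for one term on two skewed shards.
shards = [(50_000, 5_000), (500, 5)]

# Local IDF: each shard weighs the same term differently,
# so scores from different shards are not comparable.
local = [bm25_idf(n, df) for n, df in shards]

# Global IDF: one shared weight for the term across all shards.
shared = global_idf(shards)
```

With local statistics, the term is weighted far more heavily on the small shard than on the large one; with the shared value, a document's score no longer depends on which shard holds it.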

Conclusion

This bead (miroir-zc2.4) established the score comparability problem and the benchmark used to measure it. Follow-up beads evaluated candidate solutions and implemented the one that passed:

miroir-zfo: Evaluated RRF merge, which failed catastrophically (τ = 0.14)
miroir-n6v: Implemented global-IDF preflight, which succeeds (τ = 0.98)
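
For reference, reciprocal rank fusion can be sketched as below. This is the textbook formulation with the conventional constant k = 60, not necessarily the variant evaluated in miroir-zfo; the function name and document IDs are hypothetical. When shards hold disjoint documents, each document appears in exactly one list, so its fused score depends only on its rank within its own shard. That is one plausible reading of why a rank-only merge performs poorly on skewed shards, but the full analysis is in the research document.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60):
    """Fuse rankings by summing 1 / (k + rank) per document (rank starts at 1)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-shard rankings over disjoint documents.
merged = rrf_merge([["a1", "a2"], ["b1", "b2"]])
```

In this example the top hits of both shards tie exactly, regardless of how relevant each one is relative to the other.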

The research document docs/research/score-normalization-at-scale.md contains the full analysis, including:

Problem statement (local IDF causes score divergence on skewed shards)
Experimental design (100K docs, 10 shards with 100× skew)
Results for score merge, RRF merge, and DFS merge
Implementation details in crates/miroir-core/src/scatter.rs
