miroir/notes/miroir-zc2.4.md
jedarden 9cdd659c73 miroir-zc2.4: Verify score normalization at scale (note-of-no-action)
Verified that the global-IDF preflight (dfs_query_then_fetch) implementation
achieves τ = 0.9818, well above the 0.95 pass threshold.

Acceptance criteria:
-  Benchmark corpus + query set in tests/benches/score-comparability/
-  Results with 95% CI: [0.9815, 0.9820]
-  τ ≥ 0.95: note-of-no-action (DFS implementation already correct)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 07:12:51 -04:00

2.2 KiB
Raw Blame History

Bead miroir-zc2.4: Score Normalization at Scale — Verification

Date

2026-05-20

Summary

Verified that the score normalization research and DFS (global-IDF preflight) implementation from beads miroir-zc2.4, miroir-zfo (RRF validation), and miroir-n6v (DFS implementation) is complete and passing.

Acceptance Criteria Verification

Benchmark corpus + query set published

Location: tests/benches/score-comparability/

  • corpus/generate.py — Synthetic corpus generator with intentional shard skew
  • queries/generate.py — Random query set generator (10K queries across 5 types)
  • simulate.py — BM25-based score simulation with local/DFS variants
  • results/compare.py — Kendall tau comparison tool

Results reported with confidence intervals

Metric Value
Total queries 10,000
Average Kendall τ 0.9818
95% CI [0.9815, 0.9820]
Min τ 0.9523
Max τ 1.0000
Queries with τ < 0.95 0 (0%)
Pass criteria (≥ 0.95) ✓ PASS

Per-query type results:

Query Type Avg τ 95% CI
multi_term 0.9956 [0.9955, 0.9958]
common_term 0.9845 [0.9842, 0.9848]
filtered 0.9792 [0.9789, 0.9795]
single_term 0.9774 [0.9771, 0.9777]
rare_term 0.9666 [0.9663, 0.9670]

τ ≥ 0.95: Note-of-no-action

The global-IDF preflight (dfs_query_then_fetch) achieves τ = 0.9818, well above the 0.95 threshold. No further action required — the implementation is correct and performing as expected.

Conclusion

This bead (miroir-zc2.4) validated the score comparability problem. Follow-up beads implemented and verified the solution:

  • miroir-zfo: Validated RRF merge (failed catastrophically with τ = 0.14)
  • miroir-n6v: Implemented global-IDF preflight (succeeds with τ = 0.98)

The research document docs/research/score-normalization-at-scale.md contains the full analysis, including:

  • Problem statement (local IDF causes score divergence on skewed shards)
  • Experimental design (100K docs, 10 shards with 100× skew)
  • Results for score merge, RRF merge, and DFS merge
  • Implementation details in crates/miroir-core/src/scatter.rs