The score_span_readability function was already fully implemented in readability.rs. This commit adds comprehensive tests for the acceptance criteria of bead pdftract-1q4ku:

- AC1: All-printable English high coverage -> > 0.9
- AC2: All-U+FFFD -> significantly reduced (< 0.7)
- AC3: All-whitespace -> whitespace_score=0 (binary penalty)
- AC4: Low confidence -> scaled by confidence_floor
- AC5: Non-English -> dict_coverage forced to 1.0
- AC6: Ligature split -> integrity 0 lowers score

Also adds tests verifying:

- Empty span returns 0.0
- Confidence threshold (0.6 -> 1.0)
- Whitespace bounds [0.05, 0.40]
- Printable fraction calculation
- Dict coverage enabled/disabled behavior
- Non-English lang tag handling (en, en-US, zh, None)

All tests pass. The implementation correctly computes:

- 0.35 * printable_fraction
- 0.30 * dict_coverage (disabled for non-English)
- 0.15 * whitespace_score (binary in/out bounds)
- 0.10 * ligature_integrity (binary split detection)
- 0.10 * confidence_floor (min(1.0, conf/0.6))

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

| Name |
|---|
| caption.rs |
| code.rs |
| columns.rs |
| correction.rs |
| header_footer.rs |
| line.rs |
| mod.rs |
| readability.rs |
| reading_order.rs |
| watermark_formula.rs |
| wordlist.rs |
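
The weighted sum listed in the commit message can be sketched as follows. This is a minimal illustration, not the code in readability.rs: the `SpanSignals` struct, the `score_span_sketch` and `is_english_tag` names, the `dict_enabled` flag, and the inclusive treatment of the whitespace bounds are all assumptions made for the example. How the real implementation treats a missing lang tag (`None`) is not stated in the commit message, so the sketch leaves that decision to the caller.

```rust
/// Hypothetical per-span inputs; the real function may derive these differently.
#[derive(Debug, Clone, Copy)]
struct SpanSignals {
    char_count: usize,        // number of characters in the span
    printable_fraction: f64,  // share of printable characters, in [0, 1]
    dict_coverage: f64,       // share of words found in the wordlist, in [0, 1]
    whitespace_fraction: f64, // share of whitespace characters, in [0, 1]
    ligature_split: bool,     // true if a split ligature was detected
    confidence: f64,          // extraction confidence, in [0, 1]
}

/// Assumed helper: treats "en" and region variants such as "en-US" as English.
fn is_english_tag(tag: &str) -> bool {
    let primary = tag.split('-').next().unwrap_or("");
    primary.eq_ignore_ascii_case("en")
}

/// Sketch of the weighted sum from the commit message (weights add up to 1.0).
fn score_span_sketch(s: &SpanSignals, dict_enabled: bool) -> f64 {
    // An empty span scores 0.0.
    if s.char_count == 0 {
        return 0.0;
    }
    // Dictionary coverage is forced to 1.0 when the check is disabled (non-English).
    let dict = if dict_enabled { s.dict_coverage } else { 1.0 };
    // Binary whitespace score: 1.0 inside [0.05, 0.40], else 0.0.
    // Inclusive bounds are an assumption.
    let whitespace = if (0.05..=0.40).contains(&s.whitespace_fraction) {
        1.0
    } else {
        0.0
    };
    // Binary ligature integrity: a detected split drops it to 0.0.
    let ligature = if s.ligature_split { 0.0 } else { 1.0 };
    // Confidence floor: min(1.0, conf / 0.6), so 0.6 and above counts as full.
    let confidence_floor = (s.confidence / 0.6).min(1.0);

    0.35 * s.printable_fraction
        + 0.30 * dict
        + 0.15 * whitespace
        + 0.10 * ligature
        + 0.10 * confidence_floor
}
```

Under these assumptions, a caller would compute `dict_enabled` from the span's language tag (for example with `is_english_tag`) and pass it in. A clean English span with every signal at its best lands near 1.0, and each binary penalty removes exactly its weight from the total.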