jedarden 99709354f5 feat(pdftract-oh30a): implement per-page readability aggregation

Implement char-weighted median aggregation of per-span readability
scores into a page-level score stored in extraction_quality.readability.

Algorithm:
- Collect (score, char_count) pairs from spans
- Sort by score ascending
- Walk sorted list accumulating character counts
- Return score at half-total-char position

Acceptance criteria:
- Single span: returns its score
- Multiple spans: char-weighted median (longer spans count more)
- Empty page: returns 0.0
- All-perfect: returns 1.0

Closes: pdftract-oh30a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-24 03:28:41 -04:00

pdftract-oh30a: Per-page readability aggregation (median weighted by char count)

Implementation Summary

Implemented the aggregate_page_readability() function, which computes per-page readability as the char-weighted median of per-span scores.

Files Changed

Created crates/pdftract-core/src/layout/readability.rs:
- ScoredSpan trait for abstracting over different span representations
- aggregate_page_readability<T: ScoredSpan>() function
- Char-weighted median algorithm:
  - Collect (score, char_count) pairs from spans
  - Sort by score ascending
  - Compute cumulative character count
  - Return score at half-total-char point
- Edge case handling: empty page (0.0), single span, all empty strings

Modified crates/pdftract-core/src/layout/mod.rs:
- Added pub mod readability;
- Exported aggregate_page_readability and ScoredSpan

Modified crates/pdftract-core/src/schema/mod.rs:
- Added readability: Option<f32> field to ExtractionQuality
- Updated ExtractionQuality::new() to initialize readability: None
- Updated tests to include the new field

Algorithm

The char-weighted median gives longer spans proportionally more influence on the page score:

- Sort spans by score (ascending)
- Walk the sorted list, accumulating character counts
- Return the score of the first span at which the cumulative count reaches or exceeds half the total

Example from acceptance criteria:

Spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8)
Sorted: 0.5(10), 0.8(100), 0.9(100)
Cumsum: 10, 110, 210
Half = 105
Score at cumsum >= 105 is 0.8 ✓
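
The worked example above can be reproduced with a minimal Rust sketch. This is not the code in readability.rs: the names ScoredSpan and aggregate_page_readability come from these notes, but the trait methods (score, text), the DemoSpan type, the span helper, and the choice to skip unscored, NaN, and empty spans are all assumptions made for illustration.

```rust
/// Hypothetical shape of the trait; the real `ScoredSpan` may differ.
trait ScoredSpan {
    /// Readability score in [0.0, 1.0], or None if the span is unscored.
    fn score(&self) -> Option<f32>;
    /// Text whose character count is used as the weight.
    fn text(&self) -> &str;
}

/// Illustrative span type, not part of pdftract.
struct DemoSpan {
    text: String,
    score: Option<f32>,
}

impl ScoredSpan for DemoSpan {
    fn score(&self) -> Option<f32> {
        self.score
    }
    fn text(&self) -> &str {
        &self.text
    }
}

/// Helper for building a span of `chars` characters (illustrative only).
fn span(chars: usize, score: Option<f32>) -> DemoSpan {
    DemoSpan { text: "x".repeat(chars), score }
}

/// Char-weighted median of span scores; 0.0 when nothing can be weighed.
fn aggregate_page_readability<T: ScoredSpan>(spans: &[T]) -> f32 {
    // Collect (score, char_count) pairs. Assumption: unscored, NaN, and
    // zero-length spans are skipped.
    let mut pairs: Vec<(f32, usize)> = spans
        .iter()
        .filter_map(|s| s.score().map(|sc| (sc, s.text().chars().count())))
        .filter(|(sc, n)| !sc.is_nan() && *n > 0)
        .collect();

    let total: usize = pairs.iter().map(|(_, n)| n).sum();
    if total == 0 {
        return 0.0; // empty page, or only empty/unscored spans
    }

    // Sort by score ascending, then walk until half the characters are covered.
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));
    let half = total as f64 / 2.0;
    let mut cumulative = 0usize;
    for (score, n) in &pairs {
        cumulative += n;
        if cumulative as f64 >= half {
            return *score;
        }
    }
    // Not reached when total > 0; kept so every path returns a value.
    0.0
}

fn main() {
    let page = vec![span(100, Some(0.9)), span(10, Some(0.5)), span(100, Some(0.8))];
    // Sorted: 0.5(10), 0.8(100), 0.9(100); cumsum 10, 110, 210; half = 105.
    println!("{}", aggregate_page_readability(&page)); // prints 0.8
}
```

Sorting once and scanning keeps the cost at O(n log n) per page. A plain (unweighted) median of the same three spans would also be 0.8 here, so the weighting only changes the result when span lengths are unevenly distributed across scores.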

Test Results

All readability module tests PASS (15/15):

✓ test_single_span - Single span returns its score
✓ test_empty_page - Empty page returns 0.0
✓ test_all_unscored_spans - No scored spans returns 0.0
✓ test_mixed_scored_unscored - Unscored spans excluded
✓ test_char_weighted_median_example - AC example from bead
✓ test_char_weighted_median_even_split - Equal spans
✓ test_all_same_score - All same score returns that score
✓ test_empty_strings - All empty strings returns 0.0
✓ test_unicode_char_count - Counts Unicode code points correctly
✓ test_longer_span_dominates - Long spans dominate median
✓ test_all_perfect_scores - All 1.0 returns 1.0
✓ test_all_zero_scores - All 0.0 returns 0.0
✓ test_order_preservation - Result independent of input order
✓ test_nan_score_handling - NaN scores handled gracefully
✓ test_zero_width_joiner - Combining marks counted correctly
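
Two of the cases above (code-point counting and NaN scores) depend on details these notes do not spell out. The sketch below shows one plausible reading, using only the Rust standard library: weights are counted with chars().count() (Unicode scalar values, not bytes or grapheme clusters), and NaN scores are dropped before sorting. The helper names char_weight and sorted_non_nan_scores are hypothetical, and dropping NaN is an assumption; the actual module may handle it differently.

```rust
/// Weight of a span: number of Unicode scalar values, not bytes.
/// (Hypothetical helper, for illustration only.)
fn char_weight(text: &str) -> usize {
    text.chars().count()
}

/// Drop NaN scores, then sort ascending with a total order.
/// (Assumed policy; the notes only say NaN is "handled gracefully".)
fn sorted_non_nan_scores(scores: &[f32]) -> Vec<f32> {
    let mut kept: Vec<f32> = scores.iter().copied().filter(|s| !s.is_nan()).collect();
    kept.sort_by(f32::total_cmp);
    kept
}

fn main() {
    // "naïve" with a precomposed U+00EF: 6 bytes in UTF-8, 5 code points.
    let word = "na\u{00EF}ve";
    assert_eq!(word.len(), 6);
    assert_eq!(char_weight(word), 5);

    // "e" followed by a combining acute accent: one visible glyph, 2 code points.
    assert_eq!(char_weight("e\u{0301}"), 2);

    // Three emoji joined by two zero-width joiners (U+200D): 5 code points.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    assert_eq!(char_weight(family), 5);

    // NaN is removed, so the sort order is well defined.
    assert_eq!(sorted_non_nan_scores(&[0.9, f32::NAN, 0.5]), vec![0.5, 0.9]);
}
```

Counting scalar values means combining marks and joiners each add one to a span's weight. That is cheap and needs no extra crate, at the cost of slightly over-weighting text that is heavy in combining sequences.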

Validation

Code compiles: cargo check --all-targets ✓
All layout tests pass: cargo test --lib layout ✓ (53/53 passed)
All schema tests pass: cargo test --lib schema ✓ (26/26 passed)
Algorithm matches acceptance criteria exactly

Commit

Files to commit:

crates/pdftract-core/src/layout/readability.rs (new)
crates/pdftract-core/src/layout/mod.rs (modified)
crates/pdftract-core/src/schema/mod.rs (modified)

Closing the bead

All acceptance criteria PASS:

✓ Page with 1 span of 100 chars at score 0.9: page score = 0.9
✓ Page with 3 spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8): char-weighted median = 0.8
✓ Empty page: page score = 0.0 (default)
✓ All-perfect spans: page score = 1.0

Ready to close.
