Implement char-weighted median aggregation of per-span readability scores into a page-level score stored in extraction_quality.readability.

Algorithm:
- Collect (score, char_count) pairs from spans
- Sort by score ascending
- Walk sorted list accumulating character counts
- Return score at half-total-char position

Acceptance criteria:
- Single span: returns its score
- Multiple spans: char-weighted median (longer spans count more)
- Empty page: returns 0.0
- All-perfect: returns 1.0

Closes: pdftract-oh30a
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-oh30a: Per-page readability aggregation (median weighted by char count)
Implementation Summary
Implemented aggregate_page_readability() function that computes per-page readability as the char-weighted median of per-span scores.
Files Changed
- Created crates/pdftract-core/src/layout/readability.rs:
  - ScoredSpan trait for abstracting over different span representations
  - aggregate_page_readability<T: ScoredSpan>() function
  - Char-weighted median algorithm:
    - Collect (score, char_count) pairs from spans
    - Sort by score ascending
    - Compute cumulative character count
    - Return score at half-total-char point
  - Edge case handling: empty page (0.0), single span, all empty strings
- Modified crates/pdftract-core/src/layout/mod.rs:
  - Added pub mod readability;
  - Exported aggregate_page_readability and ScoredSpan
- Modified crates/pdftract-core/src/schema/mod.rs:
  - Added readability: Option<f32> field to ExtractionQuality
  - Updated ExtractionQuality::new() to initialize readability: None
  - Updated tests to include the new field
Algorithm
The char-weighted median correctly weights longer spans more heavily:
- Sort spans by score (ascending)
- Walk sorted list accumulating character counts
- Return the score at the first position where the cumulative count reaches or exceeds half the total
Example from acceptance criteria:
- Spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8)
- Sorted: 0.5(10), 0.8(100), 0.9(100)
- Cumsum: 10, 110, 210
- Half = 105
- Score at cumsum >= 105 is 0.8 ✓
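The steps and worked example above can be sketched in Rust as follows. This is a minimal illustration, not the code in readability.rs: the method names on ScoredSpan (score, char_count), the tuple impl, and the choice to drop NaN scores and zero-length spans before sorting are all assumptions made for the example.

```rust
/// Hypothetical shape of the trait; the real ScoredSpan may differ.
pub trait ScoredSpan {
    /// Readability score, or None if the span was never scored.
    fn score(&self) -> Option<f32>;
    /// Span length in characters.
    fn char_count(&self) -> usize;
}

/// Illustrative impl for (char_count, score) tuples, used only in this sketch.
impl ScoredSpan for (usize, f32) {
    fn score(&self) -> Option<f32> {
        Some(self.1)
    }
    fn char_count(&self) -> usize {
        self.0
    }
}

/// Char-weighted median of span scores; returns 0.0 when nothing can be scored.
pub fn aggregate_page_readability<T: ScoredSpan>(spans: &[T]) -> f32 {
    // Collect (score, char_count) pairs.
    // Assumption: unscored, NaN, and zero-length spans are skipped.
    let mut pairs: Vec<(f32, usize)> = spans
        .iter()
        .filter_map(|s| s.score().map(|sc| (sc, s.char_count())))
        .filter(|(sc, n)| !sc.is_nan() && *n > 0)
        .collect();

    let total: usize = pairs.iter().map(|(_, n)| n).sum();
    if total == 0 {
        return 0.0; // empty page, or only empty strings
    }

    // Sort by score ascending; total_cmp gives a total order over f32.
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Walk the sorted list until the cumulative count reaches half the total.
    // Comparing 2 * cumulative against total avoids integer-division rounding.
    let mut cumulative = 0usize;
    for (score, n) in &pairs {
        cumulative += n;
        if 2 * cumulative >= total {
            return *score;
        }
    }
    // Not reached when total > 0; kept as a defensive fallback.
    0.0
}
```

With the acceptance-criteria input, the cumulative counts are 10, 110, 210 and the walk stops at the second entry, so the function returns 0.8.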
Test Results
All readability module tests PASS (15/15):
- ✓ test_single_span - Single span returns its score
- ✓ test_empty_page - Empty page returns 0.0
- ✓ test_all_unscored_spans - No scored spans returns 0.0
- ✓ test_mixed_scored_unscored - Unscored spans excluded
- ✓ test_char_weighted_median_example - AC example from bead
- ✓ test_char_weighted_median_even_split - Equal spans
- ✓ test_all_same_score - All same score returns that score
- ✓ test_empty_strings - All empty strings returns 0.0
- ✓ test_unicode_char_count - Counts Unicode code points correctly
- ✓ test_longer_span_dominates - Long spans dominate median
- ✓ test_all_perfect_scores - All 1.0 returns 1.0
- ✓ test_all_zero_scores - All 0.0 returns 0.0
- ✓ test_order_preservation - Result independent of input order
- ✓ test_nan_score_handling - NaN scores handled gracefully
- ✓ test_zero_width_joiner - Combining marks counted correctly
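As a rough illustration of what the Unicode-related tests above are guarding against, the sketch below contrasts a character count with a byte count. The helper name char_weight is hypothetical, and using str::chars().count() (which counts Unicode scalar values) is an assumption about how a span's weight might be measured, not a statement about the actual implementation.

```rust
/// Hypothetical helper: weight of a span's text, measured in Unicode
/// scalar values rather than UTF-8 bytes.
pub fn char_weight(text: &str) -> usize {
    text.chars().count()
}

/// Byte length, shown only for comparison: it overweights non-ASCII text.
pub fn byte_weight(text: &str) -> usize {
    text.len()
}
```

Under this counting, a base letter followed by a combining mark contributes two characters, and a zero-width joiner counts as a character of its own, so an emoji sequence joined by one ZWJ contributes three.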
Validation
- Code compiles: cargo check --all-targets ✓
- All layout tests pass: cargo test --lib layout ✓ (53/53 passed)
- All schema tests pass: cargo test --lib schema ✓ (26/26 passed)
- Algorithm matches acceptance criteria exactly
Commit
Files to commit:
- crates/pdftract-core/src/layout/readability.rs (new)
- crates/pdftract-core/src/layout/mod.rs (modified)
- crates/pdftract-core/src/schema/mod.rs (modified)
Closing the bead
All acceptance criteria PASS:
- ✓ Page with 1 span of 100 chars at score 0.9: page score = 0.9
- ✓ Page with 3 spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8): char-weighted median = 0.8
- ✓ Empty page: page score = 0.0 (default)
- ✓ All-perfect spans: page score = 1.0
Ready to close.