pdftract/notes/pdftract-oh30a.md
jedarden 99709354f5 feat(pdftract-oh30a): implement per-page readability aggregation
Implement char-weighted median aggregation of per-span readability
scores into a page-level score stored in extraction_quality.readability.

Algorithm:
- Collect (score, char_count) pairs from spans
- Sort by score ascending
- Walk sorted list accumulating character counts
- Return score at half-total-char position

Acceptance criteria:
- Single span: returns its score
- Multiple spans: char-weighted median (longer spans count more)
- Empty page: returns 0.0
- All-perfect: returns 1.0

Closes: pdftract-oh30a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 03:28:41 -04:00


# pdftract-oh30a: Per-page readability aggregation (median weighted by char count)
## Implementation Summary
Implemented `aggregate_page_readability()` function that computes per-page readability as the char-weighted median of per-span scores.
### Files Changed
1. **Created** `crates/pdftract-core/src/layout/readability.rs`:
   - `ScoredSpan` trait for abstracting over different span representations
   - `aggregate_page_readability<T: ScoredSpan>()` function
   - Char-weighted median algorithm:
     - Collect `(score, char_count)` pairs from spans
     - Sort by score ascending
     - Compute cumulative character count
     - Return score at half-total-char point
   - Edge case handling: empty page (0.0), single span, all empty strings
2. **Modified** `crates/pdftract-core/src/layout/mod.rs`:
   - Added `pub mod readability;`
   - Exported `aggregate_page_readability` and `ScoredSpan`
3. **Modified** `crates/pdftract-core/src/schema/mod.rs`:
   - Added `readability: Option<f32>` field to `ExtractionQuality`
   - Updated `ExtractionQuality::new()` to initialize `readability: None`
   - Updated tests to include the new field
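
For orientation, the pieces listed above might fit together roughly as follows. This is only a sketch: the method names on `ScoredSpan` (`readability_score`, `text`), the `PlainSpan` type, and the reduced `ExtractionQuality` struct are assumptions made for illustration, not the actual definitions in `pdftract-core`.

```rust
/// Hypothetical shape of the span abstraction; the method names are assumptions.
pub trait ScoredSpan {
    /// Per-span readability score, or `None` if the span was not scored.
    fn readability_score(&self) -> Option<f32>;
    /// Span text, needed only to derive the character count used as weight.
    fn text(&self) -> &str;
}

/// Reduced stand-in for the schema type; the real struct has other fields.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractionQuality {
    /// Page-level readability; `None` until aggregation has run.
    pub readability: Option<f32>,
}

impl ExtractionQuality {
    /// Starts with no readability score, matching `readability: None`.
    pub fn new() -> Self {
        Self { readability: None }
    }
}

/// Toy span type (hypothetical) showing how a concrete representation
/// could plug into the trait.
pub struct PlainSpan {
    pub text: String,
    pub score: Option<f32>,
}

impl ScoredSpan for PlainSpan {
    fn readability_score(&self) -> Option<f32> {
        self.score
    }

    fn text(&self) -> &str {
        &self.text
    }
}
```

Keeping the field as `Option<f32>` lets callers distinguish "not computed" (`None`) from a computed score of `0.0`.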
### Algorithm
The char-weighted median gives each span a weight equal to its character count, so longer spans influence the page score more:
- Sort spans by score (ascending)
- Walk sorted list accumulating character counts
- Return the score at the first position where the cumulative count reaches or exceeds half the total

Example from acceptance criteria:
- Spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8)
- Sorted: 0.5(10), 0.8(100), 0.9(100)
- Cumulative char counts: 10, 110, 210
- Half of the total: 210 / 2 = 105
- First score whose cumulative count is >= 105: **0.8**
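
The steps above can be sketched in Rust as follows. This is a minimal illustration, not the code in `readability.rs`: the helper name `char_weighted_median` and its `(score, char_count)` slice input are assumptions, as is the choice to skip NaN scores and zero-length spans before sorting.

```rust
/// Hypothetical helper: char-weighted median over `(score, char_count)` pairs.
/// Returns 0.0 when nothing carries weight (assumed default for empty pages).
pub fn char_weighted_median(pairs: &[(f32, usize)]) -> f32 {
    // Keep only pairs that carry weight and have a comparable score.
    let mut sorted: Vec<(f32, usize)> = pairs
        .iter()
        .copied()
        .filter(|&(score, chars)| chars > 0 && !score.is_nan())
        .collect();

    let total: usize = sorted.iter().map(|&(_, chars)| chars).sum();
    if total == 0 {
        return 0.0;
    }

    // Sort by score ascending; `total_cmp` provides a total order on f32.
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Walk the sorted pairs until the running count reaches half the total.
    // `cumulative * 2 >= total` is `cumulative >= total / 2` without rounding.
    let mut cumulative = 0usize;
    for &(score, chars) in &sorted {
        cumulative += chars;
        if cumulative * 2 >= total {
            return score;
        }
    }
    unreachable!("the last pair always brings cumulative up to total")
}
```

Called with the pairs from the example, `char_weighted_median(&[(0.9, 100), (0.5, 10), (0.8, 100)])` returns `0.8`: the 10-character span at 0.5 is too short to hold the half-way point.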
### Test Results
All readability module tests PASS (15/15):

- `test_single_span` - Single span returns its score
- `test_empty_page` - Empty page returns 0.0
- `test_all_unscored_spans` - No scored spans returns 0.0
- `test_mixed_scored_unscored` - Unscored spans excluded
- `test_char_weighted_median_example` - AC example from bead
- `test_char_weighted_median_even_split` - Equal spans
- `test_all_same_score` - All same score returns that score
- `test_empty_strings` - All empty strings returns 0.0
- `test_unicode_char_count` - Counts Unicode code points correctly
- `test_longer_span_dominates` - Long spans dominate median
- `test_all_perfect_scores` - All 1.0 returns 1.0
- `test_all_zero_scores` - All 0.0 returns 0.0
- `test_order_preservation` - Result independent of input order
- `test_nan_score_handling` - NaN scores handled gracefully
- `test_zero_width_joiner` - Combining marks counted correctly
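
Two of the cases above (Unicode counting and NaN scores) rest on standard-library behavior that is easy to get wrong. The sketch below shows one plausible approach. Both helper names are hypothetical, and dropping NaN scores is an assumed policy rather than a description of what the module does.

```rust
/// Hypothetical helper: counts Unicode scalar values (`char`s), not bytes.
/// Grapheme clusters are not merged, so a base letter plus a combining mark
/// counts as two.
pub fn char_count(text: &str) -> usize {
    text.chars().count()
}

/// Hypothetical helper: drops NaN scores (assumed policy), then sorts the
/// rest ascending using the total order from `f32::total_cmp`.
pub fn sorted_finite_scores(scores: &[f32]) -> Vec<f32> {
    let mut kept: Vec<f32> = scores.iter().copied().filter(|s| !s.is_nan()).collect();
    kept.sort_by(|a, b| a.total_cmp(b));
    kept
}
```

Using `str::len()` instead would weight spans by UTF-8 byte length, which overweights non-ASCII text: `"h\u{e9}llo"` is 5 `char`s but 6 bytes.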
### Validation
- [x] Code compiles: `cargo check --all-targets`
- [x] All layout tests pass: `cargo test --lib layout` ✓ (53/53 passed)
- [x] All schema tests pass: `cargo test --lib schema` ✓ (26/26 passed)
- [x] Algorithm matches acceptance criteria exactly
### Commit
Files to commit:
- `crates/pdftract-core/src/layout/readability.rs` (new)
- `crates/pdftract-core/src/layout/mod.rs` (modified)
- `crates/pdftract-core/src/schema/mod.rs` (modified)
### Closing the bead
All acceptance criteria PASS:
- ✓ Page with 1 span of 100 chars at score 0.9: page score = 0.9
- ✓ Page with 3 spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8): char-weighted median = 0.8
- ✓ Empty page: page score = 0.0 (default)
- ✓ All-perfect spans: page score = 1.0
Ready to close.