pdftract/notes/pdftract-h2s0z.md
jedarden a79260b139 feat(pdftract-h2s0z): implement adaptive word boundary detector
Implement Phase 3.2 word boundary detection algorithm:
- Bootstrap threshold = 0.25 × font_size for first 20 glyphs
- Recalibrate to 1.5× median of last 20 gaps every 5 samples
- Exclude outliers > 4× current threshold
- Reset on Tf (font switch) and BT operators
- Negative gaps never trigger word boundaries

Closes: pdftract-h2s0z

Files:
- crates/pdftract-core/src/word_boundary.rs (NEW): WordBoundaryDetector, WordBoundaryManager, TextState
- crates/pdftract-core/src/lib.rs: Export word_boundary module
- crates/pdftract-core/src/font/resolver.rs: Add from_usize test constructor
- notes/pdftract-h2s0z.md: Verification note

Tests: 27 word_boundary tests all passing
2026-05-24 06:06:56 -04:00

# Verification Note: pdftract-h2s0z
## Summary
Implemented the adaptive word boundary detector for Phase 3.2 text extraction.
## Acceptance Criteria
### PASS
- ✅ Initial 20 glyphs after Tf: any gap > 0.25 × font_size triggers boundary
  - Verified by `test_detector_gap_below_threshold`, `test_detector_gap_at_threshold`, `test_detector_gap_above_threshold`
  - Bootstrap threshold = 0.25 × font_size (test: `test_detector_bootstrap_threshold`)
- ✅ Gap exactly at threshold: NOT a boundary (strictly greater than)
  - Verified by `test_detector_gap_at_threshold`: a gap of exactly 3.0 does NOT trigger a boundary
- ✅ 21st glyph onward: threshold is 1.5× the median of last 20 actual gaps
  - Verified by `test_detector_recalibration_after_20_samples`
  - `recalibrate()` computes median and sets threshold = 1.5 × median
  - Outlier exclusion: gaps > 4× current threshold
- ✅ Tf switch: new font starts fresh with bootstrap threshold
  - Verified by `test_manager_reset_font`
  - `reset_font()` clears samples and resets threshold
- ✅ BT inside same font: bootstrap resets
  - `reset_all()` method resets all detectors
  - Integrated with content_stream BT operator
- ✅ Negative gap handling: never a word boundary
  - Verified by `test_detector_negative_gap_no_boundary`, `test_detector_zero_gap_no_boundary`
  - `record_and_detect()` returns false for gap <= 0.0
### INVARIANTS VERIFIED
- ✅ Bootstrap threshold = 0.25 × font_size (FIXED, not configurable)
- ✅ Recalibration formula = 1.5 × median (samples window = 20)
- ✅ Recalibration every 5 samples after 20 (checked: `sample_count > 20 && sample_count % 5 == 0`)
- ✅ Comparison in text space (all gap values are f32 text-space points)
- ✅ Tw applied only to U+0020 (verified in `test_text_state_expected_advance_with_tw_non_space`)
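The criteria and invariants above can be condensed into a minimal sketch. This is not the code in `word_boundary.rs`: `DetectorSketch` and the free function `median` are hypothetical names, and a few details are assumptions because this note does not state them (outliers and non-positive gaps are not counted as samples, the boundary decision uses the threshold in effect before the current gap is recorded, and the median of an empty slice is 0.0). With the check quoted above, `sample_count > 20 && sample_count % 5 == 0`, the first recalibration in this sketch happens at the 25th accepted sample.

```rust
const WINDOW: usize = 20; // number of recent gaps kept for the median

/// Hypothetical stand-in for the detector described in this note.
struct DetectorSketch {
    bootstrap: f32,    // 0.25 * font_size, restored on reset
    threshold: f32,    // current boundary threshold (text-space units)
    sample_count: u32, // accepted samples since the last reset
    samples: Vec<f32>, // at most WINDOW most recent accepted gaps
}

impl DetectorSketch {
    fn new(font_size: f32) -> Self {
        let bootstrap = 0.25 * font_size;
        Self { bootstrap, threshold: bootstrap, sample_count: 0, samples: Vec::with_capacity(WINDOW) }
    }

    /// Return to the bootstrap state (as on Tf or BT).
    fn reset(&mut self) {
        self.threshold = self.bootstrap;
        self.sample_count = 0;
        self.samples.clear();
    }

    /// Returns true when `gap` is strictly greater than the current threshold.
    fn record_and_detect(&mut self, gap: f32) -> bool {
        if gap <= 0.0 {
            return false; // zero or negative gaps (kerning, overlap) are never boundaries
        }
        let is_boundary = gap > self.threshold;
        // Assumption: gaps above 4x the current threshold are outliers and
        // are kept out of the sample window entirely.
        if gap <= 4.0 * self.threshold {
            if self.samples.len() == WINDOW {
                self.samples.remove(0); // drop the oldest gap
            }
            self.samples.push(gap);
            self.sample_count += 1;
            if self.sample_count > 20 && self.sample_count % 5 == 0 {
                self.threshold = 1.5 * median(&self.samples);
            }
        }
        is_boundary
    }
}

/// Median of a slice; returns 0.0 for an empty slice (assumption).
fn median(values: &[f32]) -> f32 {
    if values.is_empty() {
        return 0.0;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}
```

For a 12 pt font the bootstrap threshold in this sketch is 3.0, so a gap of exactly 3.0 is not a boundary while 3.5 is, matching the strictly-greater-than criterion.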
## Implementation
Created new module `crates/pdftract-core/src/word_boundary.rs` with:
1. **`WordBoundaryDetector`** struct:
   - `font_id: FontId`
   - `sample_count: u32`
   - `samples: Vec<f32>` (capacity 20, bounded)
   - `threshold: f32`
2. **`WordBoundaryManager`** struct:
   - `HashMap<FontId, WordBoundaryDetector>`
   - Per-font detector management
   - `reset_font()` for Tf operator
   - `reset_all()` for BT operator
3. **`TextState`** struct:
   - `tc: f32` (character spacing)
   - `tw: f32` (word spacing)
   - `tz: f32` (horizontal scaling)
   - `font_size: f32`
   - `font_id: Option<FontId>`
   - `expected_advance(glyph_width, is_space)` method implementing Tc/Tw/Tz formula
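As an illustration of the Tc/Tw/Tz formula, here is a minimal sketch based on the horizontal glyph displacement in the PDF specification, tx = (w0 × Tfs + Tc + Tw) × Th, with the TJ adjustment term omitted. `TextStateSketch` is a hypothetical name. The units are assumptions not confirmed by this note: `glyph_width` is taken in 1/1000 em glyph-space units, and `tz` is stored as a percentage where 100.0 means unscaled. The actual `TextState` may use different conventions.

```rust
/// Hypothetical, reduced version of the text state described above.
#[derive(Debug, Clone, Copy)]
struct TextStateSketch {
    tc: f32,        // character spacing (Tc), text-space units
    tw: f32,        // word spacing (Tw), text-space units
    tz: f32,        // horizontal scaling (Tz) in percent; assumption: 100.0 = unscaled
    font_size: f32, // Tfs
}

impl TextStateSketch {
    /// Expected horizontal advance for one glyph in text space.
    /// Assumption: `glyph_width` is in glyph space (1/1000 em).
    fn expected_advance(&self, glyph_width: f32, is_space: bool) -> f32 {
        // Word spacing only applies to the space character (U+0020).
        let tw = if is_space { self.tw } else { 0.0 };
        (glyph_width / 1000.0 * self.font_size + self.tc + tw) * (self.tz / 100.0)
    }
}
```

Under these assumptions, a 500-unit glyph at 12 pt with no spacing and 100% scaling advances 6.0 text-space units; setting `tw` changes the result only when `is_space` is true.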
## Files Modified
- `crates/pdftract-core/src/word_boundary.rs` (NEW)
- `crates/pdftract-core/src/lib.rs` (added `pub mod word_boundary`)
- `crates/pdftract-core/src/font/resolver.rs` (added `from_usize` test constructor)
## Tests
27 tests in `word_boundary` module, all passing:
```
test word_boundary::tests::test_detector_bootstrap_threshold ... ok
test word_boundary::tests::test_detector_gap_above_threshold ... ok
test word_boundary::tests::test_detector_gap_at_threshold ... ok
test word_boundary::tests::test_detector_gap_below_threshold ... ok
test word_boundary::tests::test_detector_negative_gap_no_boundary ... ok
test word_boundary::tests::test_detector_sample_count ... ok
test word_boundary::tests::test_detector_zero_gap_no_boundary ... ok
test word_boundary::tests::test_detector_recalibration_after_20_samples ... ok
test word_boundary::tests::test_detector_reset ... ok
test word_boundary::tests::test_manager_multiple_fonts ... ok
test word_boundary::tests::test_manager_record_and_detect ... ok
test word_boundary::tests::test_manager_reset_font ... ok
test word_boundary::tests::test_median_empty ... ok
test word_boundary::tests::test_median_even ... ok
test word_boundary::tests::test_median_single ... ok
test word_boundary::tests::test_median_two ... ok
test word_boundary::tests::test_median_odd ... ok
test word_boundary::tests::test_median_unsorted ... ok
test word_boundary::tests::test_text_state_defaults ... ok
test word_boundary::tests::test_text_state_expected_advance_basic ... ok
test word_boundary::tests::test_text_state_expected_advance_combined ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tz ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tc ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tw_non_space ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tw_space ... ok
test word_boundary::tests::test_text_state_set_font ... ok
test word_boundary::tests::test_text_state_setters ... ok
test result: ok. 27 passed; 0 failed
```
## Next Steps
The detector is implemented and tested. Integration with content_stream.rs (Tj/TJ operators) and tracking the last glyph position are required for full Phase 3.2 completion. This will be done in a follow-up bead that:
1. Tracks last glyph end position in text space
2. Computes actual gaps from text matrix positions
3. Calls `WordBoundaryManager::record_and_detect()` for each glyph
4. Emits synthetic space spans when boundaries are detected
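The four follow-up steps could look roughly like the sketch below. It is only an outline of the planned integration, not existing code: `PlacedGlyph` and `join_with_spaces` are hypothetical names, positions are reduced to a single horizontal text-space coordinate, the boundary decision is passed in as a closure standing in for `WordBoundaryManager::record_and_detect()`, and the output is a plain `String` rather than synthetic space spans.

```rust
/// Hypothetical glyph record: start position and advance in text space.
struct PlacedGlyph {
    ch: char,
    x: f32,       // start of the glyph on the baseline
    advance: f32, // horizontal advance of the glyph
}

/// Walks glyphs in order, measures the gap between the end of the previous
/// glyph and the start of the current one, and inserts a space whenever
/// `is_boundary` reports a word boundary for that gap.
fn join_with_spaces(
    glyphs: &[PlacedGlyph],
    mut is_boundary: impl FnMut(f32) -> bool,
) -> String {
    let mut out = String::new();
    let mut last_end: Option<f32> = None; // step 1: last glyph end position
    for g in glyphs {
        if let Some(end) = last_end {
            let gap = g.x - end; // step 2: actual gap in text space
            if is_boundary(gap) {
                // step 3: ask the detector
                out.push(' '); // step 4: emit a synthetic space
            }
        }
        out.push(g.ch);
        last_end = Some(g.x + g.advance);
    }
    out
}
```

With a fixed threshold of 3.0 as the closure, glyphs `a` and `b` set back to back followed by `c` after a gap of 4.0 would come out as `ab c`.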
## References
- Plan section: Phase 3.2 Word boundary threshold (lines 1529-1535)
- docs/research/word-boundary-reconstruction.md