Implement Phase 3.2 word boundary detection algorithm: - Bootstrap threshold = 0.25 × font_size for first 20 glyphs - Recalibrate to 1.5× median of last 20 gaps every 5 samples - Exclude outliers > 4× current threshold - Reset on Tf (font switch) and BT operators - Negative gaps never trigger word boundaries Closes: pdftract-h2s0z Files: - crates/pdftract-core/src/word_boundary.rs (NEW): WordBoundaryDetector, WordBoundaryManager, TextState - crates/pdftract-core/src/lib.rs: Export word_boundary module - crates/pdftract-core/src/font/resolver.rs: Add from_usize test constructor - notes/pdftract-h2s0z.md: Verification note Tests: 27 word_boundary tests all passing
5.1 KiB
Verification Note: pdftract-h2s0z
Summary
Implemented the adaptive word boundary detector for Phase 3.2 text extraction.
Acceptance Criteria
PASS
-
✅ Initial 20 glyphs after Tf: any gap > 0.25 × font_size triggers boundary
- Verified by
test_detector_gap_below_threshold,test_detector_gap_at_threshold,test_detector_gap_above_threshold - Bootstrap threshold = 0.25 * font_size (test:
test_detector_bootstrap_threshold)
- Verified by
-
✅ Gap exactly at threshold: NOT a boundary (strictly greater than)
- Verified by
test_detector_gap_at_threshold- gap exactly at 3.0 does NOT trigger boundary
- Verified by
-
✅ 21st glyph onward: threshold is 1.5× the median of last 20 actual gaps
- Verified by
test_detector_recalibration_after_20_samples recalibrate()computes median and sets threshold = 1.5 * median- Outlier exclusion > 4× current threshold
- Verified by
-
✅ Tf switch: new font starts fresh with bootstrap threshold
- Verified by
test_manager_reset_font reset_font()clears samples and resets threshold
- Verified by
-
✅ BT inside same font: bootstrap resets
reset_all()method resets all detectors- Integrated with content_stream BT operator
-
✅ Negative gap handling: never a word boundary
- Verified by
test_detector_negative_gap_no_boundary,test_detector_zero_gap_no_boundary record_and_detect()returns false for gap <= 0.0
- Verified by
INVARIANTS VERIFIED
- ✅ Bootstrap threshold = 0.25 × font_size (FIXED, not configurable)
- ✅ Recalibration formula = 1.5 × median (samples window = 20)
- ✅ Recalibration every 5 samples after 20 (checked:
sample_count > 20 && sample_count % 5 == 0) - ✅ Comparison in text space (all gap values are f32 text-space points)
- ✅ Tw applied only to U+0020 (verified in
test_text_state_expected_advance_with_tw_non_space)
Implementation
Created new module crates/pdftract-core/src/word_boundary.rs with:
-
WordBoundaryDetectorstruct:font_id: FontIdsample_count: u32samples: Vec<f32>(capacity 20, bounded)threshold: f32
-
WordBoundaryManagerstruct:- HashMap<FontId, WordBoundaryDetector>
- Per-font detector management
reset_font()for Tf operatorreset_all()for BT operator
-
TextStatestruct:tc: f32(character spacing)tw: f32(word spacing)tz: f32(horizontal scaling)font_size: f32font_id: Option<FontId>expected_advance(glyph_width, is_space)method implementing Tc/Tw/Tz formula
Files Modified
crates/pdftract-core/src/word_boundary.rs(NEW)crates/pdftract-core/src/lib.rs(addedpub mod word_boundary)crates/pdftract-core/src/font/resolver.rs(addedfrom_usizetest constructor)
Tests
27 tests in word_boundary module, all passing:
test word_boundary::tests::test_detector_bootstrap_threshold ... ok
test word_boundary::tests::test_detector_gap_above_threshold ... ok
test word_boundary::tests::test_detector_gap_at_threshold ... ok
test word_boundary::tests::test_detector_gap_below_threshold ... ok
test word_boundary::tests::test_detector_negative_gap_no_boundary ... ok
test word_boundary::tests::test_detector_sample_count ... ok
test word_boundary::tests::test_detector_zero_gap_no_boundary ... ok
test word_boundary::tests::test_detector_recalibration_after_20_samples ... ok
test word_boundary::tests::test_detector_reset ... ok
test word_boundary::tests::test_manager_multiple_fonts ... ok
test word_boundary::tests::test_manager_record_and_detect ... ok
test word_boundary::tests::test_manager_reset_font ... ok
test word_boundary::tests::test_median_empty ... ok
test word_boundary::tests::test_median_even ... ok
test word_boundary::tests::test_median_single ... ok
test word_boundary::tests::test_median_two ... ok
test word_boundary::tests::test_median_odd ... ok
test word_boundary::tests::test_median_unsorted ... ok
test word_boundary::tests::test_text_state_defaults ... ok
test word_boundary::tests::test_text_state_expected_advance_basic ... ok
test word_boundary::tests::test_text_state_expected_advance_combined ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tz ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tc ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tw_non_space ... ok
test word_boundary::tests::test_text_state_expected_advance_with_tw_space ... ok
test word_boundary::tests::test_text_state_set_font ... ok
test word_boundary::tests::test_text_state_setters ... ok
test result: ok. 27 passed; 0 failed
Next Steps
The detector is implemented and tested. Integration with content_stream.rs (Tj/TJ operators) and tracking the last glyph position are required for full Phase 3.2 completion. This will be done in a follow-up bead that:
- Tracks last glyph end position in text space
- Computes actual gaps from text matrix positions
- Calls
WordBoundaryManager::record_and_detect()for each glyph - Emits synthetic space spans when boundaries are detected
References
- Plan section: Phase 3.2 Word boundary threshold (lines 1529-1535)
- docs/research/word-boundary-reconstruction.md