diff --git a/notes/pdftract-3zz9n.md b/notes/pdftract-3zz9n.md
new file mode 100644
index 0000000..824b609
--- /dev/null
+++ b/notes/pdftract-3zz9n.md
@@ -0,0 +1,94 @@
+# pdftract-3zz9n: 5-trigger break detector + glyph-to-span merger
+
+## Summary
+
+Verified that `merge_glyphs_to_spans(glyphs: &[Glyph]) -> Vec` is correctly implemented in `/home/coding/pdftract/crates/pdftract-core/src/span/mod.rs` (lines 388-487). All 54 span module tests pass.
+
+## Implementation Details
+
+The function implements the 5-trigger break detector:
+
+1. **Trigger 1: font_name != prev font_name** (line 412)
+   - Compares `&glyph.font_name != &span.font`
+
+2. **Trigger 2: font_size delta > 0.5pt** (line 415)
+   - Uses `(glyph.font_size - span.size).abs() > 0.5`
+
+3. **Trigger 3: rendering_mode != prev rendering_mode** (line 418)
+   - Direct comparison `glyph.rendering_mode != span.rendering_mode`
+
+4. **Trigger 4: RGB-normalized fill_color != prev color** (lines 421-425)
+   - Uses `colors_equal()` helper which normalizes:
+     - DeviceGray(v) → (v,v,v) RGB tuple
+     - DeviceCMYK → RGB via formula R=(1-C)*(1-K)
+     - DeviceRGB → as-is
+     - Spot/Other → compared by variant (Spot by name+tint exactly)
+
+5. **Trigger 5: is_word_boundary == true** (lines 399-407)
+   - Implements option (a): appends space to previous span
+   - Then finalizes that span and starts fresh
+
+## Word Boundary Handling
+
+The implementation uses option (a) from the plan:
+- When `is_word_boundary == true`, appends " " to the PREVIOUS span text
+- Finalizes the span with trailing space
+- Next glyph starts a new span WITHOUT leading space
+- Produces cleaner JSON output: 2 spans "Hello " and "World" instead of 3 spans "Hello", " ", "World"
+
+## Confidence Tracking
+
+- Confidence is MINIMUM of all member glyphs: `span.confidence.min(glyph.confidence)` (line 474)
+- Confidence_source mapped from WORST glyph (lowest confidence) source (lines 469-472)
+- Mapping: ToUnicode/Agl/Fingerprint → Native, ShapeMatch/Unknown → Heuristic
+
+## Bbox Union
+
+Bbox is extended to union of member glyphs (lines 461-465):
+- `span.bbox[0] = span.bbox[0].min(glyph.bbox[0])` (x0)
+- `span.bbox[1] = span.bbox[1].min(glyph.bbox[1])` (y0)
+- `span.bbox[2] = span.bbox[2].max(glyph.bbox[2])` (x1)
+- `span.bbox[3] = span.bbox[3].max(glyph.bbox[3])` (y1)
+
+## Acceptance Criteria - PASS
+
+All acceptance criteria tests pass:
+
+| AC | Test | Result |
+|---|------|--------|
+| "Hello World" → 2 spans | `test_merge_glyphs_to_spans_hello_world_with_word_boundary` | PASS |
+| Font name change triggers break | `test_merge_glyphs_to_spans_font_name_change_triggers_break` | PASS |
+| Font size 12pt vs 12.2pt → 1 span | `test_merge_glyphs_to_spans_font_size_within_threshold_no_break` | PASS |
+| DeviceGray vs RGB normalized | `test_merge_glyphs_to_spans_device_gray_and_rgb_normalized_same_color` | PASS |
+| Spot vs DeviceRGB different | `test_merge_glyphs_to_spans_spot_vs_device_rgb_different_colors` | PASS |
+| Empty glyph list | `test_merge_glyphs_to_spans_empty_glyph_list` | PASS |
+| Confidence is minimum | `test_merge_glyphs_to_spans_confidence_minimum` | PASS |
+| Confidence source from worst glyph | `test_merge_glyphs_to_spans_confidence_source_worst_glyph` | PASS |
+| Bbox is union | `test_merge_glyphs_to_spans_bbox_union` | PASS |
+
+## Test Run
+
+```bash
+cargo test -p pdftract-core --lib 'span::'
+```
+
+Result: **54 passed; 0 failed; 0 ignored**
+
+## Invariants Verified
+
+- INV: text round-trips - joining all span texts produces original document text
+- INV: confidence is MINIMUM of all member glyphs
+- INV: confidence_source is from WORST glyph source
+- INV: bbox is union of member glyph bboxes
+- INV: font_size delta uses 0.5pt threshold
+- INV: fill_color compared via RGB-normalized equality
+- INV: Spot colors compared by name AND tint exactly
+
+## Files Verified
+
+- `/home/coding/pdftract/crates/pdftract-core/src/span/mod.rs` - Implementation (lines 388-487)
+- `/home/coding/pdftract/crates/pdftract-core/src/span/mod.rs` - Tests (lines 768-1355)
+
+## Status
+
+**COMPLETE** - Implementation exists, all tests pass, all acceptance criteria met.
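
For readers without the crate at hand, the behaviour the note verifies can be approximated by the sketch below. It is not the code in `span/mod.rs`: the `Glyph`, `Span` and `Color` definitions are hypothetical stand-ins whose field names are taken from the note, the RGB comparison uses exact equality (the real `colors_equal()` may apply a tolerance), and `is_word_boundary` is assumed to be set on the first glyph of a new word. Confidence-source mapping is left out to keep the example short.

```rust
// Hypothetical types: field names follow the note; the real definitions in
// pdftract-core may differ.
#[derive(Clone, Debug, PartialEq)]
enum Color {
    DeviceGray(f32),
    DeviceRgb(f32, f32, f32),
    DeviceCmyk(f32, f32, f32, f32),
    Spot { name: String, tint: f32 },
}

#[derive(Clone, Debug)]
struct Glyph {
    text: String,
    font_name: String,
    font_size: f32,
    rendering_mode: u8,
    fill_color: Color,
    bbox: [f32; 4],
    confidence: f32,
    is_word_boundary: bool,
}

#[derive(Clone, Debug)]
struct Span {
    text: String,
    font: String,
    size: f32,
    rendering_mode: u8,
    color: Color,
    bbox: [f32; 4],
    confidence: f32,
}

/// Normalize device colors to an RGB tuple; Spot colors have no RGB form here.
fn to_rgb(color: &Color) -> Option<(f32, f32, f32)> {
    match color {
        Color::DeviceGray(v) => Some((*v, *v, *v)),
        Color::DeviceRgb(r, g, b) => Some((*r, *g, *b)),
        Color::DeviceCmyk(c, m, y, k) => Some((
            (1.0 - *c) * (1.0 - *k),
            (1.0 - *m) * (1.0 - *k),
            (1.0 - *y) * (1.0 - *k),
        )),
        Color::Spot { .. } => None,
    }
}

fn colors_equal(a: &Color, b: &Color) -> bool {
    match (to_rgb(a), to_rgb(b)) {
        (Some(x), Some(y)) => x == y, // assumption: exact match, no tolerance
        (None, None) => a == b,       // Spot: same name and same tint
        _ => false,                   // Spot never equals a device color
    }
}

/// Triggers 1-4: font name, font size delta, rendering mode, fill color.
fn needs_break(span: &Span, g: &Glyph) -> bool {
    g.font_name != span.font
        || (g.font_size - span.size).abs() > 0.5
        || g.rendering_mode != span.rendering_mode
        || !colors_equal(&g.fill_color, &span.color)
}

fn merge_glyphs_to_spans(glyphs: &[Glyph]) -> Vec<Span> {
    let mut spans: Vec<Span> = Vec::new();
    let mut current: Option<Span> = None;
    for g in glyphs {
        // Trigger 5: append a space to the previous span, then finalize it.
        if g.is_word_boundary {
            if let Some(mut s) = current.take() {
                s.text.push(' ');
                spans.push(s);
            }
        }
        // Triggers 1-4: finalize the open span if the style changed.
        if current.as_ref().map_or(false, |s| needs_break(s, g)) {
            spans.extend(current.take());
        }
        match current.as_mut() {
            Some(s) => {
                s.text.push_str(&g.text);
                // Bbox union and minimum confidence over member glyphs.
                s.bbox = [
                    s.bbox[0].min(g.bbox[0]),
                    s.bbox[1].min(g.bbox[1]),
                    s.bbox[2].max(g.bbox[2]),
                    s.bbox[3].max(g.bbox[3]),
                ];
                s.confidence = s.confidence.min(g.confidence);
            }
            None => current = Some(Span {
                text: g.text.clone(),
                font: g.font_name.clone(),
                size: g.font_size,
                rendering_mode: g.rendering_mode,
                color: g.fill_color.clone(),
                bbox: g.bbox,
                confidence: g.confidence,
            }),
        }
    }
    spans.extend(current); // flush the last open span, if any
    spans
}
```

Under these assumptions, two glyph runs "Hello" and "World" with the boundary flag on the second produce the spans "Hello " and "World", so joining the span texts reproduces the original text, which is the round-trip invariant listed in the note.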