Verified that merge_glyphs_to_spans() correctly implements:
- 5-trigger break detector (font_name, font_size delta >0.5pt, rendering_mode, RGB-normalized color, is_word_boundary)
- Word boundary handling option (a): append space to previous span
- Confidence tracking: minimum of all glyphs, source from worst glyph
- Bbox union of member glyphs

All 54 span module tests pass.

Acceptance criteria:
- "Hello World" → 2 spans "Hello " and "World" ✓
- Font name change triggers break ✓
- Font size 12pt vs 12.2pt → 1 span (delta < 0.5pt) ✓
- DeviceGray vs DeviceRGB normalized same color ✓
- Spot vs DeviceRGB different colors ✓

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-3zz9n: 5-trigger break detector + glyph-to-span merger
Summary
Verified that merge_glyphs_to_spans(glyphs: &[Glyph]) -> Vec<Span> is correctly implemented in /home/coding/pdftract/crates/pdftract-core/src/span/mod.rs (lines 388-487). All 54 span module tests pass.
Implementation Details
The function implements the 5-trigger break detector:
- Trigger 1: font_name != prev font_name (line 412)
  - Compares &glyph.font_name != &span.font
- Trigger 2: font_size delta > 0.5pt (line 415)
  - Uses (glyph.font_size - span.size).abs() > 0.5
- Trigger 3: rendering_mode != prev rendering_mode (line 418)
  - Direct comparison glyph.rendering_mode != span.rendering_mode
- Trigger 4: RGB-normalized fill_color != prev color (lines 421-425)
  - Uses the colors_equal() helper, which normalizes:
    - DeviceGray(v) → (v,v,v) RGB tuple
    - DeviceCMYK → RGB via formula R=(1-C)*(1-K)
    - DeviceRGB → as-is
    - Spot/Other → compared by variant (Spot by name+tint exactly)
- Trigger 5: is_word_boundary == true (lines 399-407)
  - Implements option (a): appends space to previous span
  - Then finalizes that span and starts fresh
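As a rough illustration, triggers 1-4 can be read as a single predicate over the open span and the incoming glyph. The sketch below is not the code in span/mod.rs: the `Glyph` and `Span` structs are simplified stand-ins, the `fill_rgb` and `color_rgb` fields are hypothetical and assumed to hold colors that are already RGB-normalized, and `attribute_break` is an illustrative name. Trigger 5 is left out because the note describes it as being handled separately.

```rust
// Simplified stand-ins for the crate's types. Only the fields the
// triggers read are modelled; the real definitions may differ.
#[derive(Debug, Clone, PartialEq)]
struct Glyph {
    font_name: String,
    font_size: f32,
    rendering_mode: u8,
    // Hypothetical field: fill color, assumed already RGB-normalized.
    fill_rgb: (f32, f32, f32),
}

#[derive(Debug, Clone, PartialEq)]
struct Span {
    font: String,
    size: f32,
    rendering_mode: u8,
    // Hypothetical field: span color, assumed already RGB-normalized.
    color_rgb: (f32, f32, f32),
}

/// Font-size threshold from the note: a delta of at most 0.5pt does not break.
const SIZE_THRESHOLD_PT: f32 = 0.5;

/// Returns true when any of triggers 1-4 fires for `glyph` against `span`.
fn attribute_break(span: &Span, glyph: &Glyph) -> bool {
    glyph.font_name != span.font                                      // trigger 1
        || (glyph.font_size - span.size).abs() > SIZE_THRESHOLD_PT    // trigger 2
        || glyph.rendering_mode != span.rendering_mode                // trigger 3
        || glyph.fill_rgb != span.color_rgb                           // trigger 4
}
```

With a 12pt span, a 12.2pt glyph in the same font, mode, and color does not break, while a 13pt glyph does.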
Word Boundary Handling
The implementation uses option (a) from the plan:
- When is_word_boundary == true, appends " " to the PREVIOUS span text
- Finalizes the span with trailing space
- Next glyph starts a new span WITHOUT leading space
- Produces cleaner JSON output: 2 spans "Hello " and "World" instead of 3 spans "Hello", " ", "World"
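A text-only sketch of option (a) follows. It rests on one assumption the note does not confirm: that is_word_boundary is set on the first glyph after a word gap, and that this glyph then opens the new span. The `Glyph` struct and `merge_text` function are hypothetical and ignore every other span attribute.

```rust
// Hypothetical, text-only stand-in for the crate's Glyph.
struct Glyph {
    ch: char,
    // Assumption: set on the first glyph that follows a word gap.
    is_word_boundary: bool,
}

/// Merges glyphs into span texts using option (a): the separating space
/// is appended to the span being closed, never prepended to the new one.
fn merge_text(glyphs: &[Glyph]) -> Vec<String> {
    let mut spans: Vec<String> = Vec::new();
    let mut current: Option<String> = None;

    for glyph in glyphs {
        if glyph.is_word_boundary {
            // Close the open span (if any) with a trailing space.
            if let Some(mut done) = current.take() {
                done.push(' ');
                spans.push(done);
            }
        }
        // Start a span if none is open, then append the glyph's character.
        current.get_or_insert_with(String::new).push(glyph.ch);
    }

    // Flush the last open span; it gets no trailing space.
    if let Some(last) = current {
        spans.push(last);
    }
    spans
}
```

Under that assumption, concatenating the returned strings reproduces the text with its single separating space, which is what the round-trip invariant asks for.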
Confidence Tracking
- Confidence is MINIMUM of all member glyphs: span.confidence.min(glyph.confidence) (line 474)
- Confidence_source mapped from WORST glyph (lowest confidence) source (lines 469-472)
- Mapping: ToUnicode/Agl/Fingerprint → Native, ShapeMatch/Unknown → Heuristic
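The rules above can be sketched as a small accumulator. The variant names come from the mapping listed in this note, but the type names `GlyphSource`, `ConfidenceSource`, and `SpanConfidence` are hypothetical. The tie-breaking rule (the earlier glyph wins when confidences are equal) is also an assumption, since the note does not specify it.

```rust
// Variant names follow the mapping in the note; the enum names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum GlyphSource {
    ToUnicode,
    Agl,
    Fingerprint,
    ShapeMatch,
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConfidenceSource {
    Native,
    Heuristic,
}

/// ToUnicode/Agl/Fingerprint map to Native; ShapeMatch/Unknown map to Heuristic.
fn map_source(source: GlyphSource) -> ConfidenceSource {
    match source {
        GlyphSource::ToUnicode | GlyphSource::Agl | GlyphSource::Fingerprint => {
            ConfidenceSource::Native
        }
        GlyphSource::ShapeMatch | GlyphSource::Unknown => ConfidenceSource::Heuristic,
    }
}

/// Hypothetical accumulator for a span's confidence fields.
struct SpanConfidence {
    confidence: f32,
    source: ConfidenceSource,
}

impl SpanConfidence {
    /// Seeds the accumulator from the span's first glyph.
    fn start(confidence: f32, source: GlyphSource) -> Self {
        SpanConfidence { confidence, source: map_source(source) }
    }

    /// Folds in one more glyph: keep the minimum confidence, and take the
    /// source from whichever glyph currently holds that minimum.
    fn absorb(&mut self, confidence: f32, source: GlyphSource) {
        if confidence < self.confidence {
            // Assumption: on a tie the earlier glyph's source is kept.
            self.source = map_source(source);
        }
        self.confidence = self.confidence.min(confidence);
    }
}
```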
Bbox Union
Bbox is extended to union of member glyphs (lines 461-465):
- span.bbox[0] = span.bbox[0].min(glyph.bbox[0]) (x0)
- span.bbox[1] = span.bbox[1].min(glyph.bbox[1]) (y0)
- span.bbox[2] = span.bbox[2].max(glyph.bbox[2]) (x1)
- span.bbox[3] = span.bbox[3].max(glyph.bbox[3]) (y1)
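The four updates amount to an axis-aligned box union. A minimal standalone sketch follows, assuming the bbox is a four-element array in [x0, y0, x1, y1] order with f32 coordinates; the element type and the helper names `extend_bbox` and `union_all` are assumptions, not taken from the crate.

```rust
/// Bounding box as [x0, y0, x1, y1], matching the index order in the note.
/// The f32 element type is an assumption.
type Bbox = [f32; 4];

/// Grows `span` in place so that it also covers `glyph`.
fn extend_bbox(span: &mut Bbox, glyph: &Bbox) {
    span[0] = span[0].min(glyph[0]); // x0: leftmost edge
    span[1] = span[1].min(glyph[1]); // y0: lowest edge
    span[2] = span[2].max(glyph[2]); // x1: rightmost edge
    span[3] = span[3].max(glyph[3]); // y1: highest edge
}

/// Union of a run of glyph boxes, or None for an empty run.
fn union_all(boxes: &[Bbox]) -> Option<Bbox> {
    let (first, rest) = boxes.split_first()?;
    let mut acc = *first;
    for b in rest {
        extend_bbox(&mut acc, b);
    }
    Some(acc)
}
```

Seeding the accumulator from the first glyph, not from a zeroed box, keeps the origin from leaking into the union.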
Acceptance Criteria - PASS
All acceptance criteria tests pass:
| AC | Test | Result |
|---|---|---|
| "Hello World" → 2 spans | test_merge_glyphs_to_spans_hello_world_with_word_boundary | PASS |
| Font name change triggers break | test_merge_glyphs_to_spans_font_name_change_triggers_break | PASS |
| Font size 12pt vs 12.2pt → 1 span | test_merge_glyphs_to_spans_font_size_within_threshold_no_break | PASS |
| DeviceGray vs RGB normalized | test_merge_glyphs_to_spans_device_gray_and_rgb_normalized_same_color | PASS |
| Spot vs DeviceRGB different | test_merge_glyphs_to_spans_spot_vs_device_rgb_different_colors | PASS |
| Empty glyph list | test_merge_glyphs_to_spans_empty_glyph_list | PASS |
| Confidence is minimum | test_merge_glyphs_to_spans_confidence_minimum | PASS |
| Confidence source from worst glyph | test_merge_glyphs_to_spans_confidence_source_worst_glyph | PASS |
| Bbox is union | test_merge_glyphs_to_spans_bbox_union | PASS |
Test Run
cargo test -p pdftract-core --lib 'span::'
Result: 54 passed; 0 failed; 0 ignored
Invariants Verified
- INV: text round-trips - joining all span texts produces original document text
- INV: confidence is MINIMUM of all member glyphs
- INV: confidence_source is from WORST glyph source
- INV: bbox is union of member glyph bboxes
- INV: font_size delta uses 0.5pt threshold
- INV: fill_color compared via RGB-normalized equality
- INV: Spot colors compared by name AND tint exactly
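The two color invariants can be illustrated with a small normalize-then-compare helper. This is a sketch and not the crate's colors_equal(): the `Color` enum is a hypothetical stand-in that omits the Other variant. The G and B channels of the CMYK conversion are assumed to follow the same pattern as the R formula quoted in this note. Exact float equality is also an assumption; the real helper may compare with a tolerance.

```rust
// Hypothetical stand-in for the crate's color type (the Other variant is omitted).
#[derive(Debug, Clone, PartialEq)]
enum Color {
    DeviceGray(f32),
    DeviceRgb(f32, f32, f32),
    DeviceCmyk(f32, f32, f32, f32),
    Spot { name: String, tint: f32 },
}

/// Normalizes device colors to an RGB tuple; Spot has no RGB form here.
fn to_rgb(color: &Color) -> Option<(f32, f32, f32)> {
    match color {
        Color::DeviceGray(v) => Some((*v, *v, *v)),
        Color::DeviceRgb(r, g, b) => Some((*r, *g, *b)),
        // R = (1-C)*(1-K) is from the note; G and B are assumed analogous.
        Color::DeviceCmyk(c, m, y, k) => Some((
            (1.0 - *c) * (1.0 - *k),
            (1.0 - *m) * (1.0 - *k),
            (1.0 - *y) * (1.0 - *k),
        )),
        Color::Spot { .. } => None,
    }
}

/// Device colors compare by normalized RGB; Spot colors compare by
/// name and tint exactly; a Spot never equals a device color.
fn colors_equal(a: &Color, b: &Color) -> bool {
    match (to_rgb(a), to_rgb(b)) {
        (Some(x), Some(y)) => x == y,
        (None, None) => a == b,
        _ => false,
    }
}
```

Under these assumptions DeviceGray(0.5) equals DeviceRgb(0.5, 0.5, 0.5), while a Spot color differs from any DeviceRGB value and from the same ink at a different tint.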
Files Verified
- /home/coding/pdftract/crates/pdftract-core/src/span/mod.rs - Implementation (lines 388-487)
- /home/coding/pdftract/crates/pdftract-core/src/span/mod.rs - Tests (lines 768-1355)
Status
COMPLETE - Implementation exists, all tests pass, all acceptance criteria met.