pdftract/notes/pdftract-1kdzu.md
jedarden ce2a77a879 feat(pdftract-1kdzu): implement TJ operator with kerning and word boundary detection
Implemented the TJ operator for PDF content stream processing:

- process_tj_array(): Parses TJ arrays (alternating strings and numeric kerning)
- apply_tj_kerning(): Applies kerning adjustments to text matrix and detects word boundaries
- GraphicsState::translate_text(): New method for horizontal text matrix translation

Key features:
- Kerning formula: -n/1000 * font_size * horiz_scaling/100
- Word boundary trigger: n > 200 (equivalent to n/1000 * font_size > 0.2 * font_size)
- Positive kerning injects synthetic word boundaries; negative kerning does not

Acceptance criteria (all PASS):
- [(Hello)250(World)] TJ → W has is_word_boundary=true
- [(kern)-10(ing)] TJ → i has is_word_boundary=false
- [(a)500(b)500(c)] TJ → both b and c carry is_word_boundary
- [] TJ → no glyphs (no-op)

13 new tests added; all TJ operator tests pass.

Closes: pdftract-1kdzu
2026-05-26 16:44:05 -04:00

4.1 KiB

pdftract-1kdzu: TJ operator implementation

Summary

Implemented the TJ operator for PDF content stream processing with full support for:

  • Array parsing (alternating strings and numeric kerning adjustments)
  • Text matrix translation for kerning adjustments
  • Word boundary detection for large positive kerning values (> 0.2 * font_size)

Implementation Details

Files Modified

  1. crates/pdftract-core/src/graphics_state.rs

    • Added translate_text() method to GraphicsState for horizontal text matrix translation (used by TJ kerning)
  2. crates/pdftract-core/src/content_stream.rs

    • Added process_tj_array() function to process TJ array elements
    • Added apply_tj_kerning() helper function for kerning calculations and word boundary detection
    • Modified execute_with_do() TJ operator case to use the new functions

Key Features

  1. TJ Array Parsing

    • Correctly parses ArrayStart ... ArrayEnd delimited arrays
    • Handles String, Integer, and Real elements
    • Emits diagnostics for invalid element types (nested arrays, booleans, null, etc.)
  2. Kerning Calculation

    • Formula: kern = -n/1000 * font_size * horiz_scaling/100
    • Applies horizontal translation to text matrix
    • Handles font_size = 0 gracefully (word boundary still triggers on n > 200)
  3. Word Boundary Detection

    • Threshold: n > 200 (equivalent to n/1000 * font_size > 0.2 * font_size)
    • Only positive kerning values trigger word boundaries
    • Negative kerning never triggers word boundaries
    • Flag is consumed by the next glyph emitted (sets is_word_boundary = true)

Acceptance Criteria

All acceptance criteria from the bead pass:

Criterion Status
[ (Hello) 250 (World) ] TJ produces 2 glyphs; W has is_word_boundary=true PASS
[ (kern) -10 (ing) ] TJ produces 2 glyphs; i has is_word_boundary=false PASS
[ (A) 0 (B) ] TJ produces 2 glyphs; no word boundary PASS
[ (a) 500 (b) 500 (c) ] TJ - both b and c carry is_word_boundary PASS
[] TJ no-ops (produces no glyphs) PASS

Tests Added

13 new tests in crates/pdftract-core/src/content_stream.rs:

  1. test_tj_array_with_strings_only - Basic TJ with strings only
  2. test_tj_array_with_large_positive_kerning - Word boundary trigger (250 > 200)
  3. test_tj_array_with_negative_kerning - Negative kerning, no boundary
  4. test_tj_array_with_zero_kerning - Zero kerning, no boundary
  5. test_tj_array_with_multiple_large_kerns - Multiple boundaries
  6. test_tj_empty_array - Empty array produces no glyphs
  7. test_tj_with_kerning_at_threshold - Exactly 200 (no boundary)
  8. test_tj_with_kerning_just_above_threshold - 201 (boundary triggered)
  9. test_tj_outside_bt_emits_diagnostic - Diagnostic for TJ outside BT/ET
  10. test_tj_inside_bt_works - Pre-existing test, still passes
  11. test_tj_without_bt_emits_diagnostic - Pre-existing test, still passes
  12. test_tj_without_bt_no_glyphs - Pre-existing test, still passes
  13. test_tj_between_blocks_emits_diagnostic - Pre-existing test, still passes

Test Results

cargo nextest run -p pdftract-core content_stream::tests::test_tj
Summary: 13 tests run: 13 passed, 2140 skipped

All TJ operator tests pass.

Compilation

  • cargo check --all-targets: Clean (warnings only, pre-existing)
  • cargo clippy --all-targets -- -D warnings: Pre-existing unused imports (not related to this change)
  • cargo fmt: Applied

References

  • Plan section: Phase 3.2 TJ kerning paragraph (line 1536)
  • Critical tests: TJ with large positive kerning, negative TJ kern (lines 1556-1557)
  • PDF spec section 9.4.3 Table 109 (TJ operator)

Notes

  • The implementation correctly handles the sign convention from the PDF spec: positive n values insert space (move text origin backward), negative n values kern tighter.
  • Word boundary detection uses the simplified threshold n > 200 which is mathematically equivalent to n/1000 * font_size > 0.2 * font_size but handles the font_size = 0 case gracefully.
  • The pending_word_boundary flag is properly scoped to each TJ array invocation and is consumed by the next glyph emitted.