pdftract/crates
jedarden aebe37ca84 feat(pdftract-5o6hx): implement hyphenation repair
Implement repair_hyphenation() that detects and repairs end-of-line
hyphenation within blocks. Joins hyphenated words across line breaks
when the hyphen is at the column right edge and the continuation
starts with a lowercase letter.

Key features:
- Detects hyphens: -, ‐ (U+2010), ‑ (U+2011), soft hyphen (U+00AD)
- Right-edge detection: span bbox.x1 within 5% of column width
- Lowercase continuation check to avoid joining sentences
- Column-aware: only joins spans in same column
- Cleans up empty spans/lines after repair

Adds HasBBox and HyphenableSpan traits for flexible span types.
Includes 9 comprehensive tests covering all acceptance criteria.

Fixes pre-existing test cases in schema module (missing column field).

Closes: pdftract-5o6hx

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 17:24:48 -04:00
..
pdftract-cer-diff docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
pdftract-cli feat(pdftract-5pbkp): implement inspect subcommand with clap parsing and axum server 2026-05-24 17:13:05 -04:00
pdftract-core feat(pdftract-5o6hx): implement hyphenation repair 2026-05-24 17:24:48 -04:00
pdftract-libpdftract feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
pdftract-py feat(pdftract-2nu0s): implement Python SDK contract conformance 2026-05-24 08:55:11 -04:00