pdftract/crates
jedarden 19c6328542 feat(pdftract-19oy): codespace range parser + multi-byte tokenizer
Implemented codespace range parsing from begincodespacerange/endcodespacerange
blocks, plus a multi-byte CJK tokenizer that uses widest-first matching per
ISO 32000-1 9.10.3.1.

Changes:
- codespace.rs: Added pending_count handling for count-before-keyword syntax
- codespace.rs: Improved error recovery (skip invalid ranges, continue parsing)
- tokenize.rs: Added cfg guards for cjk feature diagnostic emission
- mod.rs: Added tokenize module exports
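
The count-before-keyword handling and skip-and-continue recovery listed above can be sketched roughly as follows. This is a minimal illustration, not the code in codespace.rs: the names `CodespaceRange`, `parse_hex`, and `parse_codespace_ranges` are hypothetical, the input is assumed to be whitespace-separated CMap text, and the declared count is assumed to be advisory only.

```rust
/// One codespace range: low and high bounds of equal byte length.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CodespaceRange {
    pub lo: Vec<u8>,
    pub hi: Vec<u8>,
}

/// Decode a `<hex>` token (1 to 4 bytes) into bytes; None if malformed.
fn parse_hex(tok: &str) -> Option<Vec<u8>> {
    let inner = tok.strip_prefix('<')?.strip_suffix('>')?;
    let well_formed = !inner.is_empty()
        && inner.len() % 2 == 0
        && inner.len() <= 8
        && inner.bytes().all(|b| b.is_ascii_hexdigit());
    if !well_formed {
        return None;
    }
    (0..inner.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&inner[i..i + 2], 16).ok())
        .collect()
}

/// Returns the valid ranges plus the number of entries that were skipped.
/// Assumption: tokens are whitespace-separated, so `<00><7F>` written
/// without a space is not handled by this sketch.
pub fn parse_codespace_ranges(src: &str) -> (Vec<CodespaceRange>, usize) {
    let mut ranges = Vec::new();
    let mut skipped = 0usize;
    // The integer seen immediately before the keyword, if any.
    let mut pending_count: Option<usize> = None;
    let mut tokens = src.split_whitespace();

    while let Some(tok) = tokens.next() {
        if let Ok(n) = tok.parse::<usize>() {
            pending_count = Some(n);
            continue;
        }
        if tok != "begincodespacerange" {
            pending_count = None; // the count only applies to the next keyword
            continue;
        }
        // Treat the declared count as a capacity hint; the block is
        // terminated by the end keyword, not by the count.
        if let Some(n) = pending_count.take() {
            ranges.reserve(n.min(100));
        }
        loop {
            let Some(lo_tok) = tokens.next() else { break };
            if lo_tok == "endcodespacerange" {
                break;
            }
            let Some(hi_tok) = tokens.next() else {
                skipped += 1; // dangling low bound at end of input
                break;
            };
            if hi_tok == "endcodespacerange" {
                skipped += 1; // dangling low bound before the end keyword
                break;
            }
            match (parse_hex(lo_tok), parse_hex(hi_tok)) {
                // Valid only if both bounds have the same width and every
                // low byte is <= the matching high byte.
                (Some(lo), Some(hi))
                    if lo.len() == hi.len()
                        && lo.iter().zip(&hi).all(|(l, h)| l <= h) =>
                {
                    ranges.push(CodespaceRange { lo, hi })
                }
                // Error recovery: skip the bad pair and keep parsing.
                _ => skipped += 1,
            }
        }
    }
    (ranges, skipped)
}
```

In this sketch an invalid pair only increments a counter; a real parser would presumably emit a diagnostic at that point instead.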

All acceptance criteria PASS:
- Codespace [<00>-<7F>, <8140>-<FEFE>]: bytes 41 82 A0 42 tokenize to [0x41, 0x82A0, 0x42]
- Codespace [<00>-<7F>, <8000>-<FFFF>]: bytes 41 82 A0 42 tokenize to [0x41, 0x82A0, 0x42]
- Widest-first matching for overlapping ranges
- Unrecognized bytes emit U+FFFD + diagnostic
- 1-byte-only codespace handles ASCII correctly
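
The criteria above can be illustrated with a small widest-first tokenizer. This is a sketch under assumptions, not the contents of tokenize.rs: `CodespaceRange`, `tokenize`, and `REPLACEMENT` are hypothetical names, codes are assumed to be at most 4 bytes wide, and the "diagnostic" is reduced to a list of offending byte offsets.

```rust
/// One codespace range with per-byte low and high bounds.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone)]
pub struct CodespaceRange {
    pub lo: Vec<u8>,
    pub hi: Vec<u8>,
}

impl CodespaceRange {
    /// A candidate matches when it has the range's width and every byte
    /// lies within the corresponding [lo, hi] byte bounds.
    fn matches(&self, bytes: &[u8]) -> bool {
        bytes.len() == self.lo.len()
            && bytes
                .iter()
                .zip(self.lo.iter().zip(&self.hi))
                .all(|(b, (l, h))| l <= b && b <= h)
    }
}

/// Emitted in place of a byte that no range accepts (U+FFFD).
pub const REPLACEMENT: u32 = 0xFFFD;

/// Split `input` into character codes, trying the widest candidate first.
/// Returns the codes plus the offsets of bytes that matched no range.
pub fn tokenize(input: &[u8], ranges: &[CodespaceRange]) -> (Vec<u32>, Vec<usize>) {
    let max_width = ranges.iter().map(|r| r.lo.len()).max().unwrap_or(1);
    let mut codes = Vec::new();
    let mut bad_offsets = Vec::new();
    let mut pos = 0;

    while pos < input.len() {
        // Never look past the end of the input.
        let widest = max_width.min(input.len() - pos);
        // Widest-first: try the longest candidate, then shorter ones.
        let hit = (1..=widest).rev().find(|&w| {
            let cand = &input[pos..pos + w];
            ranges.iter().any(|r| r.matches(cand))
        });
        match hit {
            Some(w) => {
                // Pack the matched bytes big-endian into one code.
                let code = input[pos..pos + w]
                    .iter()
                    .fold(0u32, |acc, &b| (acc << 8) | u32::from(b));
                codes.push(code);
                pos += w;
            }
            None => {
                // No range accepts this byte: substitute and resynchronize
                // by advancing a single byte.
                codes.push(REPLACEMENT);
                bad_offsets.push(pos);
                pos += 1;
            }
        }
    }
    (codes, bad_offsets)
}
```

Assuming the input bytes are 41 82 A0 42, both two-range codespaces listed above yield [0x41, 0x82A0, 0x42] here: at offset 0 the two-byte candidate 41 82 fails the lead-byte bound, so the one-byte range takes 41, and at offset 1 the two-byte candidate 82 A0 matches.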

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:26:25 -04:00
pdftract-cer-diff docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
pdftract-cli fix(pdftract-25igv): fix emit! macro usage in codespace parser 2026-05-28 07:29:33 -04:00
pdftract-core feat(pdftract-19oy): codespace range parser + multi-byte tokenizer 2026-05-28 12:26:25 -04:00
pdftract-libpdftract feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
pdftract-py feat(pdftract-30ahi): configure maturin for 5-target wheel builds 2026-05-28 08:04:32 -04:00