pdftract

History

jedarden 19c6328542 feat(pdftract-19oy): codespace range parser + multi-byte tokenizer	2026-05-28 12:26:25 -04:00

Implemented codespace range parsing from begincodespacerange/endcodespacerange blocks and multi-byte CJK tokenizer with widest-first matching per ISO 32000-1 9.10.3.1.

Changes:
- codespace.rs: Added pending_count handling for count-before-keyword syntax
- codespace.rs: Improved error recovery (skip invalid ranges, continue parsing)
- tokenize.rs: Added cfg guards for cjk feature diagnostic emission
- mod.rs: Added tokenize module exports

All acceptance criteria PASS:
- [<00>-<7F>, <8140>-<FEFE>] tokenizes to [0x41, 0x82A0, 0x42]
- [<00>-<7F>, <8000>-<FFFF>] tokenizes to [0x41, 0x82A0, 0x42]
- Widest-first matching for overlapping ranges
- Unrecognized bytes emit U+FFFD + diagnostic
- 1-byte-only codespace handles ASCII correctly

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
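
The widest-first matching named in the commit message above can be illustrated with a minimal sketch. This is not the code in codespace.rs or tokenize.rs. The names `CodespaceRange`, `Token`, and `tokenize` are hypothetical. The sketch assumes that a range matches when each input byte falls within the corresponding low/high byte, and that an unrecognized byte is skipped one byte at a time. The commit does not state the input bytes behind its acceptance criteria, so 0x41 0x82 0xA0 0x42 is also an assumption.

```rust
/// Hypothetical codespace range: inclusive low/high bounds of equal byte width.
#[derive(Debug, Clone)]
struct CodespaceRange {
    lo: Vec<u8>,
    hi: Vec<u8>,
}

impl CodespaceRange {
    fn width(&self) -> usize {
        self.lo.len()
    }

    /// Assumption: a candidate matches when every byte lies within the
    /// bounds given by the corresponding bytes of `lo` and `hi`.
    fn contains(&self, bytes: &[u8]) -> bool {
        bytes.len() == self.width()
            && bytes
                .iter()
                .zip(self.lo.iter().zip(self.hi.iter()))
                .all(|(b, (lo, hi))| lo <= b && b <= hi)
    }
}

/// Hypothetical token type for this sketch.
#[derive(Debug, PartialEq)]
enum Token {
    /// Character code assembled big-endian from the matched bytes.
    Code(u32),
    /// Byte outside every range; a caller could map this to U+FFFD
    /// and emit a diagnostic.
    Invalid,
}

/// Widest-first tokenizer: at each position, try the longest candidate
/// first and fall back to shorter ones.
fn tokenize(input: &[u8], ranges: &[CodespaceRange]) -> Vec<Token> {
    let max_width = ranges.iter().map(|r| r.width()).max().unwrap_or(0);
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let remaining = input.len() - pos;
        // Find the widest candidate at `pos` that some range accepts.
        let matched = (1..=max_width.min(remaining))
            .rev()
            .find(|&w| ranges.iter().any(|r| r.contains(&input[pos..pos + w])));
        match matched {
            Some(w) => {
                let code = input[pos..pos + w]
                    .iter()
                    .fold(0u32, |acc, &b| (acc << 8) | u32::from(b));
                out.push(Token::Code(code));
                pos += w;
            }
            None => {
                // Assumption: resynchronize by skipping a single byte.
                out.push(Token::Invalid);
                pos += 1;
            }
        }
    }
    out
}
```

Under these assumptions, calling `tokenize` on the bytes 0x41 0x82 0xA0 0x42 with the ranges <00>-<7F> and <8140>-<FEFE> gives the codes 0x41, 0x82A0, 0x42, which is the sequence listed in the acceptance criteria.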
..
pdftract-cer-diff	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
pdftract-cli	fix(pdftract-25igv): fix emit! macro usage in codespace parser	2026-05-28 07:29:33 -04:00
pdftract-core	feat(pdftract-19oy): codespace range parser + multi-byte tokenizer	2026-05-28 12:26:25 -04:00
pdftract-libpdftract	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter	2026-05-24 04:57:17 -04:00
pdftract-py	feat(pdftract-30ahi): configure maturin for 5-target wheel builds	2026-05-28 08:04:32 -04:00