jedarden
|
1dfaf73aa4
|
feat(pdftract-3g6ne): implement CMap codespace range parser
This commit adds the codespace range parser for CMap streams. The parser
extracts the begincodespacerange / endcodespacerange blocks, which define
the valid byte ranges (and therefore the 1-4 byte widths) of character
codes in a CMap.
## Implementation
- CodespaceRange: Single range with lo/hi bounds (stored as [u8; 4]) and width (1-4 bytes)
- CodespaceRanges: Collection with SmallVec<[CodespaceRange; 8]>
- CodespaceParser: PostScript-style tokenizer for begincodespacerange blocks
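
A minimal sketch of how the two data types above might look, assuming the field names `lo`, `hi`, and `width` and using a plain `Vec` in place of `SmallVec` so the example needs no external crate. The `new`, `contains`, and `match_width` helpers are hypothetical and are not taken from this commit; they only illustrate the usual per-byte matching rule for codespace ranges.

```rust
/// One codespace range: per-byte lower/upper bounds plus the code width.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CodespaceRange {
    pub lo: [u8; 4],
    pub hi: [u8; 4],
    pub width: u8, // number of significant bytes in lo/hi, 1..=4
}

impl CodespaceRange {
    /// Hypothetical constructor: returns None when the bounds differ in
    /// length or the width falls outside 1..=4.
    pub fn new(lo: &[u8], hi: &[u8]) -> Option<Self> {
        if lo.len() != hi.len() || lo.is_empty() || lo.len() > 4 {
            return None;
        }
        let mut r = CodespaceRange { lo: [0; 4], hi: [0; 4], width: lo.len() as u8 };
        r.lo[..lo.len()].copy_from_slice(lo);
        r.hi[..hi.len()].copy_from_slice(hi);
        Some(r)
    }

    /// A code matches when it has the range's width and every byte lies
    /// between the corresponding lo and hi bytes (inclusive).
    pub fn contains(&self, code: &[u8]) -> bool {
        let w = self.width as usize;
        code.len() == w
            && code
                .iter()
                .zip(self.lo[..w].iter().zip(&self.hi[..w]))
                .all(|(b, (lo, hi))| lo <= b && b <= hi)
    }
}

/// Collection of ranges. Vec stands in for SmallVec<[CodespaceRange; 8]>.
#[derive(Debug, Default)]
pub struct CodespaceRanges(pub Vec<CodespaceRange>);

impl CodespaceRanges {
    /// Width of the first range that matches a prefix of `bytes`, if any.
    pub fn match_width(&self, bytes: &[u8]) -> Option<u8> {
        self.0
            .iter()
            .find(|r| {
                let w = r.width as usize;
                bytes.len() >= w && r.contains(&bytes[..w])
            })
            .map(|r| r.width)
    }
}
```

Note that matching is done byte by byte rather than on the code as a single integer, so `<8140> <FEFE>` accepts `81 40` but rejects `81 3F`.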
## Acceptance Criteria (all PASS)
- Parse <00> <7F> → 1 range, width=1 ✅
- Parse <00> <7F> <8000> <FFFF> in one block → 2 ranges ✅
- Width inference: 2-char hex → width=1; 4-char hex → width=2 ✅
- Case-insensitive hex (<C0> and <c0> equivalent) ✅
- Malformed range (width mismatch) → diagnostic + skipped ✅
- Empty CMap → empty ranges ✅
- JIS range <8140> <FEFE> → 2-byte CJK ✅
- 3-byte and 4-byte range support ✅
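
The criteria above can be exercised with a sketch like the following. It is an illustration only, not the parser from this commit: `ParsedRange`, `decode_hex`, and `parse_codespace_ranges` are hypothetical names, tokens are assumed to be whitespace-separated, odd-length hex strings are treated as malformed, and diagnostics are returned as plain strings.

```rust
/// Parsed range; lo and hi always have the same length (1..=4 bytes).
#[derive(Debug, PartialEq, Eq)]
pub struct ParsedRange {
    pub lo: Vec<u8>,
    pub hi: Vec<u8>,
}

/// Decode a `<hex>` token into bytes. Upper- and lowercase digits are
/// both accepted; two hex characters yield one byte.
fn decode_hex(tok: &str) -> Option<Vec<u8>> {
    let digits = tok.strip_prefix('<')?.strip_suffix('>')?;
    if digits.is_empty()
        || digits.len() % 2 != 0
        || !digits.bytes().all(|b| b.is_ascii_hexdigit())
    {
        return None;
    }
    (0..digits.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&digits[i..i + 2], 16).ok())
        .collect()
}

/// Collect ranges from every begincodespacerange ... endcodespacerange
/// block in `src`. Malformed pairs are skipped and reported.
pub fn parse_codespace_ranges(src: &str) -> (Vec<ParsedRange>, Vec<String>) {
    let mut ranges = Vec::new();
    let mut diagnostics = Vec::new();
    let mut block: Option<Vec<&str>> = None; // Some(..) while inside a block

    for tok in src.split_whitespace() {
        match tok {
            "begincodespacerange" => block = Some(Vec::new()),
            "endcodespacerange" => {
                // Bounds come in lo/hi pairs; width is inferred from length.
                for pair in block.take().unwrap_or_default().chunks(2) {
                    match pair {
                        [lo_tok, hi_tok] => match (decode_hex(lo_tok), decode_hex(hi_tok)) {
                            (Some(lo), Some(hi)) if lo.len() == hi.len() && lo.len() <= 4 => {
                                ranges.push(ParsedRange { lo, hi })
                            }
                            (Some(lo), Some(hi)) => diagnostics.push(format!(
                                "skipped {lo_tok} {hi_tok}: widths {} and {} are unusable",
                                lo.len(),
                                hi.len()
                            )),
                            _ => diagnostics.push(format!("skipped {lo_tok} {hi_tok}: invalid hex")),
                        },
                        _ => diagnostics.push("skipped dangling bound without a partner".to_string()),
                    }
                }
            }
            _ => {
                // Tokens outside a block (such as the leading count) are ignored.
                if let Some(toks) = block.as_mut() {
                    toks.push(tok);
                }
            }
        }
    }
    (ranges, diagnostics)
}
```

Returning the diagnostics alongside the ranges keeps a single bad pair from aborting the whole CMap, which matches the "diagnostic + skipped" behavior listed above.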
Also adds encrypted fixture provenance entries to PROVENANCE.md.
Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-28 05:47:07 -04:00