Documents the implementation, acceptance criteria status, and design decisions for the CMap codespace range parser. Co-Authored-By: Claude Code <noreply@anthropic.com>
pdftract-3g6ne: CMap Codespace Range Parser
Bead Summary
Implemented the CMap codespace range parser for extracting byte-width boundaries from begincodespacerange / endcodespacerange PostScript blocks.
Implementation Location
- Module: crates/pdftract-core/src/font/codespace.rs
- Exported from: crates/pdftract-core/src/font/mod.rs
Structures Implemented
CodespaceRange
pub struct CodespaceRange {
pub lo: [u8; 4], // Low bound (big-endian, 4-byte storage)
pub hi: [u8; 4], // High bound (big-endian, 4-byte storage)
pub width: u8, // Byte width (1-4)
}
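As a minimal sketch of how bounds of differing widths might be packed into this struct: the helper names `pad4` and `from_bounds` below are hypothetical (they do not come from the module), and the sketch assumes "leading zeros" means the significant bytes are right-aligned in the 4-byte array.

```rust
/// Mirrors the struct documented above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CodespaceRange {
    pub lo: [u8; 4],
    pub hi: [u8; 4],
    pub width: u8,
}

/// Hypothetical helper: right-align 1 to 4 bytes in a zero-filled array.
/// Callers must ensure `bytes.len() <= 4`.
fn pad4(bytes: &[u8]) -> [u8; 4] {
    let mut out = [0u8; 4];
    out[4 - bytes.len()..].copy_from_slice(bytes);
    out
}

impl CodespaceRange {
    /// Hypothetical constructor: returns `None` when the bounds differ in
    /// length or the width falls outside 1..=4.
    pub fn from_bounds(lo: &[u8], hi: &[u8]) -> Option<Self> {
        if lo.len() != hi.len() || lo.is_empty() || lo.len() > 4 {
            return None;
        }
        Some(Self {
            lo: pad4(lo),
            hi: pad4(hi),
            width: lo.len() as u8,
        })
    }
}
```

Under these assumptions, a 2-byte range such as <8140> <FEFE> would be stored as lo = [0, 0, 0x81, 0x40] with width = 2, so the width field is what distinguishes <00> from <0000>.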
CodespaceRanges
pub struct CodespaceRanges {
pub ranges: SmallVec<[CodespaceRange; 8]>,
}
CodespaceParser
PostScript-style tokenizer that:
- Recognizes begincodespacerange / endcodespacerange keywords
- Parses hex string pairs <lo> <hi>
- Validates width matching (lo.len() == hi.len())
- Emits diagnostics on malformed entries
- Continues parsing after errors (recovery)
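The behaviour listed above can be sketched roughly as follows. This is an illustrative, simplified stand-in rather than the module's code: the `Range` type, the `Token` enum, the `decode_hex`, `tokenize`, and `parse` functions, and the plain `String` diagnostics are all assumptions made for the example.

```rust
#[derive(Debug, PartialEq)]
pub struct Range {
    pub lo: Vec<u8>,
    pub hi: Vec<u8>,
}

enum Token {
    Hex(Option<Vec<u8>>), // None when the string held a non-hex character
    Word(Vec<u8>),
}

/// Decode hex digits (either case), skipping whitespace; an odd trailing
/// nibble is padded with 0.
fn decode_hex(body: &[u8]) -> Option<Vec<u8>> {
    let mut nibbles = Vec::new();
    for &b in body {
        if b.is_ascii_whitespace() {
            continue;
        }
        nibbles.push((b as char).to_digit(16)? as u8);
    }
    if nibbles.len() % 2 == 1 {
        nibbles.push(0);
    }
    Some(nibbles.chunks(2).map(|p| (p[0] << 4) | p[1]).collect())
}

fn tokenize(input: &[u8]) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < input.len() {
        match input[i] {
            b if b.is_ascii_whitespace() => i += 1,
            b'%' => {
                // PostScript comment: skip to end of line.
                while i < input.len() && input[i] != b'\n' && input[i] != b'\r' {
                    i += 1;
                }
            }
            b'<' => {
                let start = i + 1;
                while i < input.len() && input[i] != b'>' {
                    i += 1;
                }
                tokens.push(Token::Hex(decode_hex(&input[start..i])));
                i += 1; // step past '>'
            }
            _ => {
                let start = i;
                while i < input.len()
                    && !input[i].is_ascii_whitespace()
                    && input[i] != b'<'
                    && input[i] != b'%'
                {
                    i += 1;
                }
                tokens.push(Token::Word(input[start..i].to_vec()));
            }
        }
    }
    tokens
}

/// Collect valid ranges; malformed pairs produce a diagnostic and are skipped.
pub fn parse(input: &[u8]) -> (Vec<Range>, Vec<String>) {
    let (mut ranges, mut diags) = (Vec::new(), Vec::new());
    let mut in_block = false;
    let mut pending: Option<Option<Vec<u8>>> = None; // low bound awaiting its high bound
    for tok in tokenize(input) {
        match tok {
            Token::Word(w) if w.as_slice() == b"begincodespacerange" => {
                in_block = true;
                pending = None;
            }
            Token::Word(w) if w.as_slice() == b"endcodespacerange" => {
                if pending.take().is_some() {
                    diags.push("low bound without high bound".to_string());
                }
                in_block = false;
            }
            Token::Hex(h) if in_block => match pending.take() {
                None => pending = Some(h),
                Some(lo) => match (lo, h) {
                    (Some(lo), Some(hi))
                        if lo.len() == hi.len() && (1..=4).contains(&lo.len()) =>
                    {
                        ranges.push(Range { lo, hi })
                    }
                    (Some(lo), Some(hi)) => diags.push(format!(
                        "invalid range widths: lo={} hi={}",
                        lo.len(),
                        hi.len()
                    )),
                    _ => diags.push("non-hex character in range bound".to_string()),
                },
            },
            _ => {} // tokens outside a block, counts, and other operators are ignored
        }
    }
    (ranges, diags)
}
```

In this sketch a malformed pair is consumed as a unit, so the next hex string is treated as a fresh low bound. That is one possible recovery policy; the real parser may resynchronize differently.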
Acceptance Criteria Status
| Criterion | Status | Test |
|---|---|---|
| Parse <00> <7F> → 1 range, width=1 | PASS | test_parse_single_range_one_byte |
| Parse <00> <7F> <8000> in one block → 2 ranges | PASS | test_parse_two_ranges_mixed_width |
| Width inference: 2-char hex → width=1; 4-char hex → width=2 | PASS | test_width_inference |
| Case-insensitive hex (<C0> and <c0> equivalent) | PASS | test_case_insensitive_hex |
| Malformed range (width mismatch) → diagnostic + skipped | PASS | test_malformed_range_width_mismatch |
| Empty CMap → empty ranges | PASS | test_empty_cmap, test_no_codespace_block |
| Round-trip with Identity-H CMap fixture | N/A | No standalone CMap fixtures exist; tests cover parsing logic |
Additional Tests
- test_jis_range: JIS lead/trail 2-byte pattern <8140> <FEFE>
- test_three_byte_range: 3-byte codespace support
- test_four_byte_range: 4-byte codespace support
- test_invalid_width_too_large: Rejects 5+ byte ranges
- test_find_range: Utility to match byte sequences to ranges
- test_comment_in_block: PostScript comment stripping
- test_hex_string_with_whitespace: Internal whitespace handling
- test_odd_length_hex_string: Dangling nibble padding
- test_recovery_on_error: Continues after malformed entries
- test_convenience_function: Public API entry points
Public API
// Parse without diagnostics (for internal use)
pub fn parse_codespace_ranges(input: &[u8]) -> CodespaceRanges
// Parse with diagnostics (for error reporting)
pub fn parse_codespace_ranges_with_diags(input: &[u8]) -> (CodespaceRanges, Vec<Diagnostic>)
Design Decisions
- 4-byte storage for bounds: Ranges up to 4 bytes are stored in fixed [u8; 4] arrays with leading zeros, simplifying comparison logic
- SmallVec capacity 8: Most predefined CMaps (Identity-H/V, UTF-16 variants) have 1-2 ranges; an inline capacity of 8 keeps typical cases on the stack without spilling to the heap
- Recovery over hard failure: Malformed entries emit diagnostics but don't stop parsing; subsequent valid ranges are still collected
- Case-insensitive hex: Both <C0> and <c0> parse to 0xC0 per PDF spec
- Width validation: Rejects ranges where lo.len() != hi.len() or width > 4
Integration Notes
- Module is imported in font/mod.rs but not yet exported at crate level
- Sibling tokenizer bead will consume CodespaceRanges for multi-byte walking
- Coordinator pdftract-19oy (CMap parser + tokenizer) depends on this module
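To illustrate what "multi-byte walking" could look like on the consumer side, here is a hedged sketch. The functions `matches`, `next_code_width`, and `split_codes` are hypothetical names, not the sibling bead's API. The sketch assumes bounds are right-aligned in the 4-byte arrays and compares byte by byte, which is how the Adobe CMap specification describes codespace matching. The one-byte fallback for unmatched input is also an assumption.

```rust
#[derive(Debug, Clone, Copy)]
pub struct CodespaceRange {
    pub lo: [u8; 4],
    pub hi: [u8; 4],
    pub width: u8,
}

impl CodespaceRange {
    /// True when `code` has this range's width and every byte lies within
    /// the corresponding lo/hi byte.
    fn matches(&self, code: &[u8]) -> bool {
        let w = self.width as usize;
        code.len() == w
            && code
                .iter()
                .zip(&self.lo[4 - w..])
                .zip(&self.hi[4 - w..])
                .all(|((&b, &lo), &hi)| lo <= b && b <= hi)
    }
}

/// Width of the character code at the start of `bytes`, trying the
/// shortest width first; `None` when no range matches.
pub fn next_code_width(ranges: &[CodespaceRange], bytes: &[u8]) -> Option<usize> {
    (1..=bytes.len().min(4)).find(|&w| ranges.iter().any(|r| r.matches(&bytes[..w])))
}

/// Split a byte string into character codes.
pub fn split_codes<'a>(ranges: &[CodespaceRange], mut bytes: &'a [u8]) -> Vec<&'a [u8]> {
    let mut out = Vec::new();
    while !bytes.is_empty() {
        // Assumed fallback: an unmatched byte is emitted as a 1-byte code
        // so the loop always makes progress.
        let w = next_code_width(ranges, bytes).unwrap_or(1);
        let (code, rest) = bytes.split_at(w);
        out.push(code);
        bytes = rest;
    }
    out
}
```

With a 1-byte range <00> <7F> and a 2-byte range <8140> <FEFE>, the bytes 41 81 40 42 would split into 41, 8140, and 42. Note that per-byte matching rejects 903F even though it lies numerically between 8140 and FEFE, because its second byte falls below 40.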
Commit
- Hash: 1dfaf73
- Message: feat(pdftract-3g6ne): implement CMap codespace range parser
- Pushed: forgejo main