jedarden dbe5e3d5b8

docs(pdftract-3g6ne): add verification note

Documents the implementation, acceptance criteria status, and design
decisions for the CMap codespace range parser.

Co-Authored-By: Claude Code <noreply@anthropic.com>

2026-05-28 05:48:59 -04:00

pdftract-3g6ne: CMap Codespace Range Parser

Bead Summary

Implemented the CMap codespace range parser for extracting byte-width boundaries from begincodespacerange / endcodespacerange PostScript blocks.

Implementation Location

Module: crates/pdftract-core/src/font/codespace.rs
Exported from: crates/pdftract-core/src/font/mod.rs

Structures Implemented

CodespaceRange

pub struct CodespaceRange {
    pub lo: [u8; 4],   // Low bound (big-endian, 4-byte storage)
    pub hi: [u8; 4],   // High bound (big-endian, 4-byte storage)
    pub width: u8,     // Byte width (1-4)
}

CodespaceRanges

pub struct CodespaceRanges {
    pub ranges: SmallVec<[CodespaceRange; 8]>,
}

CodespaceParser

PostScript-style tokenizer that:

Recognizes begincodespacerange / endcodespacerange keywords
Parses hex string pairs <lo> <hi>
Validates width matching (lo.len() == hi.len())
Emits diagnostics on malformed entries
Continues parsing after errors (recovery)
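
The behavior listed above can be illustrated with a minimal, self-contained sketch. It is not the code in codespace.rs: `Range`, `decode_hex`, and `parse_blocks` are hypothetical names, diagnostics are plain strings, the input is assumed to be UTF-8 text, and comments or whitespace inside hex strings are not handled.

```rust
/// Hypothetical stand-in for the note's `CodespaceRange` (illustration only).
#[derive(Debug, PartialEq)]
struct Range {
    lo: Vec<u8>,
    hi: Vec<u8>,
}

/// Decode a non-empty, even-length hex body such as "8140" (no angle brackets).
/// Upper- and lower-case digits are both accepted.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.is_empty() || s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

/// Scan every begincodespacerange / endcodespacerange block, collect valid
/// <lo> <hi> pairs, and record a diagnostic for each malformed pair instead
/// of aborting.
fn parse_blocks(input: &str) -> (Vec<Range>, Vec<String>) {
    const BEGIN: &str = "begincodespacerange";
    const END: &str = "endcodespacerange";
    let mut ranges = Vec::new();
    let mut diags = Vec::new();
    let mut rest = input;

    while let Some(start) = rest.find(BEGIN) {
        let after = &rest[start + BEGIN.len()..];
        let end = after.find(END).unwrap_or(after.len());
        let body = &after[..end];

        // Collect the text between each '<' and the next '>'.
        let mut hexes: Vec<&str> = Vec::new();
        let mut cursor = body;
        while let Some(open) = cursor.find('<') {
            let tail = &cursor[open + 1..];
            match tail.find('>') {
                Some(close) => {
                    hexes.push(&tail[..close]);
                    cursor = &tail[close + 1..];
                }
                None => {
                    diags.push("unterminated hex string".to_string());
                    break;
                }
            }
        }

        // Pair the hex strings up as (lo, hi) and validate each pair on its own.
        for pair in hexes.chunks(2) {
            match pair {
                [lo, hi] => match (decode_hex(lo), decode_hex(hi)) {
                    (Some(l), Some(h)) if l.len() == h.len() && l.len() <= 4 => {
                        ranges.push(Range { lo: l, hi: h })
                    }
                    (Some(l), Some(h)) => diags.push(format!(
                        "bad width: lo has {} byte(s), hi has {} byte(s)",
                        l.len(),
                        h.len()
                    )),
                    _ => diags.push(format!("invalid hex string in <{}> <{}>", lo, hi)),
                },
                _ => diags.push("low bound without a matching high bound".to_string()),
            }
        }
        rest = &after[end..];
    }
    (ranges, diags)
}
```

Because each pair is validated independently, a width mismatch in one pair produces a diagnostic and the next pair in the same block is still collected.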

Acceptance Criteria Status

| Criterion | Status | Test |
| --- | --- | --- |
| Parse `<00>` `<7F>` → 1 range, width=1 | PASS | `test_parse_single_range_one_byte` |
| Parse `<00>` `<7F>` `<8000>` in one block → 2 ranges | PASS | `test_parse_two_ranges_mixed_width` |
| Width inference: 2-char hex → width=1; 4-char hex → width=2 | PASS | `test_width_inference` |
| Case-insensitive hex (`<C0>` and `<c0>` equivalent) | PASS | `test_case_insensitive_hex` |
| Malformed range (width mismatch) → diagnostic + skipped | PASS | `test_malformed_range_width_mismatch` |
| Empty CMap → empty ranges | PASS | `test_empty_cmap`, `test_no_codespace_block` |
| Round-trip with Identity-H CMap fixture | N/A | No standalone CMap fixtures exist; tests cover parsing logic |

Additional Tests

test_jis_range: JIS lead/trail 2-byte pattern <8140> <FEFE>
test_three_byte_range: 3-byte codespace support
test_four_byte_range: 4-byte codespace support
test_invalid_width_too_large: Rejects 5+ byte ranges
test_find_range: Utility to match byte sequences to ranges
test_comment_in_block: PostScript comment stripping
test_hex_string_with_whitespace: Internal whitespace handling
test_odd_length_hex_string: Dangling nibble padding
test_recovery_on_error: Continues after malformed entries
test_convenience_function: Public API entry points
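
Several of these tests concern hex-string decoding edge cases. The sketch below shows one way such decoding could behave, assuming the PDF convention that a missing final digit is treated as 0. `decode_hex_body` is a hypothetical helper, not the function in codespace.rs.

```rust
/// Decode the body of a hex string (the bytes between '<' and '>').
/// Whitespace is skipped, digits may be upper or lower case, and a dangling
/// final nibble is padded with a low-order zero, so "ABC" decodes like "ABC0".
/// Returns None if any other character appears.
fn decode_hex_body(body: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(body.len() / 2 + 1);
    let mut pending: Option<u8> = None; // high nibble waiting for its partner

    for &b in body {
        if b.is_ascii_whitespace() {
            continue;
        }
        let nibble = (b as char).to_digit(16)? as u8;
        match pending.take() {
            Some(high) => out.push((high << 4) | nibble),
            None => pending = Some(nibble),
        }
    }
    if let Some(high) = pending {
        out.push(high << 4);
    }
    Some(out)
}
```

The byte width of a bound is then the length of the decoded vector, which matches the 2-char → width 1 and 4-char → width 2 inference in the acceptance table.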

Public API

// Parse without diagnostics (for internal use)
pub fn parse_codespace_ranges(input: &[u8]) -> CodespaceRanges

// Parse with diagnostics (for error reporting)
pub fn parse_codespace_ranges_with_diags(input: &[u8]) -> (CodespaceRanges, Vec<Diagnostic>)

Design Decisions

4-byte storage for bounds: Ranges up to 4 bytes are stored in fixed [u8; 4] arrays with leading zeros, simplifying comparison logic
SmallVec capacity 8: Most predefined CMaps (Identity-H/V, UTF-16 variants) have 1-2 ranges; an inline capacity of 8 keeps typical cases on the stack, and larger sets spill to the heap
Recovery over hard failure: Malformed entries emit diagnostics but don't stop parsing; subsequent valid ranges are still collected
Case-insensitive hex: Both <C0> and <c0> parse to 0xC0 per PDF spec
Width validation: Rejects ranges where lo.len() != hi.len() or width > 4
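
As a rough illustration of the fixed-width storage and width validation decisions, the sketch below right-aligns bounds into 4-byte arrays and checks membership byte by byte. The right-alignment reading of "leading zeros" and the per-byte comparison (the usual CMap codespace rule) are assumptions about the implementation; `Range::new` and `Range::contains` are hypothetical names.

```rust
/// Illustrative stand-in for `CodespaceRange`, with the same field layout.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Range {
    lo: [u8; 4],
    hi: [u8; 4],
    width: u8,
}

impl Range {
    /// Build a range from equal-length bounds of 1 to 4 bytes.
    /// Shorter bounds are right-aligned, leaving leading zero bytes.
    fn new(lo: &[u8], hi: &[u8]) -> Option<Range> {
        let width = lo.len();
        if width == 0 || width > 4 || hi.len() != width {
            return None;
        }
        let mut lo4 = [0u8; 4];
        let mut hi4 = [0u8; 4];
        lo4[4 - width..].copy_from_slice(lo);
        hi4[4 - width..].copy_from_slice(hi);
        Some(Range { lo: lo4, hi: hi4, width: width as u8 })
    }

    /// A code matches when it has the range's width and every byte lies
    /// between the corresponding bytes of `lo` and `hi`.
    fn contains(&self, code: &[u8]) -> bool {
        if code.len() != self.width as usize {
            return false;
        }
        let off = 4 - code.len();
        code.iter()
            .enumerate()
            .all(|(i, &b)| self.lo[off + i] <= b && b <= self.hi[off + i])
    }
}
```

With `<8140> <FEFE>`, the code `90 50` matches while `90 30` does not, because the second byte falls below `40` even though the value lies numerically between the bounds.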

Integration Notes

Module is declared and exported from font/mod.rs but not yet re-exported at crate level
Sibling tokenizer bead will consume CodespaceRanges for multi-byte walking
Coordinator pdftract-19oy (CMap parser + tokenizer) depends on this module
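
To show how a consumer might use codespace ranges for multi-byte walking, here is a hedged sketch that splits a byte string into character codes. The sibling tokenizer is a separate bead, so this is only one plausible approach: `Range`, `matches`, and `split_codes` are hypothetical, and consuming a single byte when nothing matches is a simplifying assumption.

```rust
/// Simplified range with variable-length bounds (illustration only).
struct Range {
    lo: Vec<u8>,
    hi: Vec<u8>,
}

/// True when `code` has the range's width and each byte is within bounds.
fn matches(range: &Range, code: &[u8]) -> bool {
    code.len() == range.lo.len()
        && code
            .iter()
            .zip(range.lo.iter().zip(&range.hi))
            .all(|(&b, (&lo, &hi))| lo <= b && b <= hi)
}

/// Split `input` into codes. At each position, try widths 1 through 4 and
/// take the first width for which some range matches; if none matches,
/// consume one byte so the walk always makes progress.
fn split_codes<'a>(ranges: &[Range], input: &'a [u8]) -> Vec<&'a [u8]> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let max = (input.len() - pos).min(4);
        let len = (1..=max)
            .find(|&w| ranges.iter().any(|r| matches(r, &input[pos..pos + w])))
            .unwrap_or(1);
        out.push(&input[pos..pos + len]);
        pos += len;
    }
    out
}
```

With a 1-byte range `<00> <7F>` and a 2-byte range `<8140> <FEFE>`, the input `41 81 40 42` splits into `41`, `81 40`, and `42`.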

Commit

Hash: 1dfaf73
Message: feat(pdftract-3g6ne): implement CMap codespace range parser
Pushed: forgejo main
