pdftract/notes/pdftract-hzuc.md
jedarden 8ec7cae1fd docs(pdftract-hzuc): add coordinator verification note
All 3 children closed with verified acceptance criteria:
- Predefined CMap registry (Identity-H/V + 8 UTF16 CMaps)
- encoding_rs adapter for Shift-JIS / GB18030 / Big5 / EUC-KR
- Codespace range parser + multi-byte content-stream tokenizer

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:04:51 -04:00

Verification Note: pdftract-hzuc (Phase 2.3 CJK Encoding Coordinator)

Summary

Coordinator bead for Phase 2.3: CJK Encoding. All 3 children are closed with verified implementation.

Children Status

| Bead | Title | Status |
| --- | --- | --- |
| pdftract-43ry | Predefined CMap registry (Identity-H/V + 8 UTF16 CMaps) | CLOSED |
| pdftract-5rmc | encoding_rs adapter for Shift-JIS / GB18030 / Big5 / EUC-KR | CLOSED |
| pdftract-19oy | Codespace range parser + multi-byte content-stream tokenizer | CLOSED |

The pdftract-19oy bead has its own children, which are also closed:

| Bead | Title | Status |
| --- | --- | --- |
| pdftract-3g6ne | Codespace range parser | CLOSED |
| pdftract-3wbls | Multi-byte content-stream tokenizer | CLOSED |

Acceptance Criteria Status

| Criterion | Status | Verification |
| --- | --- | --- |
| All 3 children closed | PASS | All children show Status: closed |
| Identity-H Type 0 font with ToUnicode extracts text via passthrough | PASS | pdftract-43ry: test_identity_h_decode_bytes verifies CID 65 from [0x00, 0x41]; Identity-H/V return None for cid_to_unicode (caller uses ToUnicode) |
| Type 0 font with /Encoding /UniJIS-UTF16-H extracts Japanese text | PASS | pdftract-43ry: test_unijis_utf16_h_cid_to_unicode verifies CID 236 -> U+3042 (あ) |
| Mixed 0x41 0x82 0xA0 0x42 tokenizes to [0x41, 0x82A0, 0x42] | PASS | pdftract-3wbls: test_mixed_1_and_2_byte covers this exact case |
| All 4 encoding_rs adapters decode correctly | PASS | pdftract-5rmc: 15 tests pass, including round-trip tests for all 4 encodings |

Components Delivered

1. Predefined CMap Registry (pdftract-43ry)

  • crates/pdftract-core/src/font/predefined_cmap.rs
  • All 10 predefined CMap names implemented
  • Identity-H/V passthrough working
  • UTF16 CMaps with CID->Unicode mapping (gated behind cjk feature)
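
The Identity-H/V passthrough can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from predefined_cmap.rs; the sketch only assumes what this note states: each code is a 2-byte big-endian value used directly as the CID, and the identity CMaps return no Unicode value so the caller falls back to ToUnicode.

```rust
/// Hypothetical helper (not the crate's API): split a string operand into
/// 2-byte big-endian codes and use each code directly as the CID, which is
/// what an identity CMap amounts to.
/// Assumption: a trailing odd byte is silently dropped in this sketch.
fn identity_decode_bytes(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}

/// Identity CMaps carry no Unicode information, so the lookup always
/// yields None and the caller is expected to consult the font's ToUnicode.
fn identity_cid_to_unicode(_cid: u16) -> Option<char> {
    None
}
```

With this sketch, the bytes [0x00, 0x41] decode to the single CID 65, matching the case named in the acceptance criteria.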

2. encoding_rs Adapter (pdftract-5rmc)

  • crates/pdftract-core/src/font/cjk_encoding.rs
  • Shift-JIS, GB18030, Big5, EUC-KR decoding
  • Malformed input handling with U+FFFD

3. Codespace Range Parser (pdftract-3g6ne)

  • crates/pdftract-core/src/cmap/codespace.rs
  • PostScript-style parsing of begincodespacerange/endcodespacerange
  • 1-4 byte width support
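
A rough sketch of parsing a codespace block is shown below. It is not the code in codespace.rs: the type and function names are hypothetical, and it assumes the tokens are whitespace-separated (real CMap files may write `<00><80>` with no space, which this sketch does not handle).

```rust
/// Hypothetical type: one codespace range with equal-length bounds.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CodespaceRange {
    low: Vec<u8>,
    high: Vec<u8>,
}

/// Parse a hex string token such as "<8140>" into 1 to 4 bytes.
fn parse_hex_token(tok: &str) -> Option<Vec<u8>> {
    let hex = tok.strip_prefix('<')?.strip_suffix('>')?;
    let valid_len = !hex.is_empty() && hex.len() % 2 == 0 && hex.len() <= 8;
    if !valid_len || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// Parse the first begincodespacerange ... endcodespacerange block.
/// Returns None on malformed input (odd token count, mismatched widths,
/// missing end operator).
fn parse_codespace_ranges(src: &str) -> Option<Vec<CodespaceRange>> {
    let mut tokens = src.split_whitespace();
    // Skip everything up to the operator, including the leading count.
    tokens.by_ref().find(|t| *t == "begincodespacerange")?;
    let mut ranges = Vec::new();
    loop {
        let tok = tokens.next()?;
        if tok == "endcodespacerange" {
            return Some(ranges);
        }
        let low = parse_hex_token(tok)?;
        let high = parse_hex_token(tokens.next()?)?;
        if low.len() != high.len() {
            return None;
        }
        ranges.push(CodespaceRange { low, high });
    }
}
```

For example, `2 begincodespacerange <00> <80> <8140> <9FFC> endcodespacerange` yields one 1-byte range and one 2-byte range.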

4. Multi-byte Content-stream Tokenizer (pdftract-3wbls)

  • crates/pdftract-core/src/cmap/tokenize.rs
  • Widest-first matching per ISO 32000-1 9.10.3.1
  • Mixed 1+2 byte code handling
  • Benchmark: 100KB tokenized in < 10ms
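
The widest-first matching described above can be sketched as follows. This is an illustration under stated assumptions, not the code in tokenize.rs: the function names are hypothetical, ranges are passed as plain (low, high) byte-slice pairs, and a byte that matches no range is consumed as a 1-byte code (the real fallback behaviour may differ).

```rust
/// True if `code` has the same width as the range and every byte lies
/// within the corresponding low/high byte of the range.
fn matches(code: &[u8], low: &[u8], high: &[u8]) -> bool {
    code.len() == low.len()
        && low.len() == high.len()
        && code
            .iter()
            .zip(low)
            .zip(high)
            .all(|((b, lo), hi)| lo <= b && b <= hi)
}

/// Hypothetical tokenizer: at each position try the widest candidate
/// (up to 4 bytes) first and accept the first width that fits a range.
fn tokenize(input: &[u8], ranges: &[(&[u8], &[u8])]) -> Vec<u32> {
    let mut codes = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let max_width = (input.len() - pos).min(4);
        let width = (1..=max_width)
            .rev()
            .find(|&w| {
                let candidate = &input[pos..pos + w];
                ranges.iter().any(|(low, high)| matches(candidate, low, high))
            })
            // Assumption: unmatched bytes are emitted as 1-byte codes.
            .unwrap_or(1);
        // Pack the matched bytes big-endian into a single code value.
        let code = input[pos..pos + width]
            .iter()
            .fold(0u32, |acc, &b| (acc << 8) | u32::from(b));
        codes.push(code);
        pos += width;
    }
    codes
}
```

With Shift-JIS-style ranges (<00>-<80>, <8140>-<9FFC>, <A0>-<DF>, <E040>-<FCFC>), the input 0x41 0x82 0xA0 0x42 splits into 0x41, 0x82A0, 0x42: the pair 0x41 0x82 fits no 2-byte range, so 0x41 is taken alone, while 0x82 0xA0 fits the first 2-byte range.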

Integration Notes

  • Type0 font module (crates/pdftract-core/src/font/type0.rs) has encoding_name field for storing the encoding CMap name
  • Predefined CMap registry is exported from font/mod.rs
  • Tokenizer is exported from cmap/mod.rs
  • All modules delivered under this bead compile successfully; the remaining compilation errors are in unrelated modules (document.rs, extract.rs)

Test Coverage Summary

  • Predefined CMap: 20 tests (13 base + 7 cjk-specific)
  • encoding_rs adapter: 15 tests
  • Codespace parser: 20 tests
  • Tokenizer: 14 tests + benchmarks

Commits

This coordinator bead consolidates work from child beads. Each child bead has its own verification note with commit details:

  • pdftract-43ry: feat(pdftract-43ry): implement predefined CMap registry
  • pdftract-5rmc: feat(pdftract-5rmc): implement encoding_rs adapter for CJK encodings
  • pdftract-3g6ne: feat(pdftract-3g6ne): implement CMap codespace range parser
  • pdftract-3wbls: feat(pdftract-3wbls): implement multi-byte CJK content-stream tokenizer

Coordinator Closure

All acceptance criteria verified. Coordinator can be closed.