pdftract/crates/pdftract-core/src
jedarden 1dfaf73aa4
feat(pdftract-3g6ne): implement CMap codespace range parser
This commit adds the codespace range parser for CMap streams. The parser
extracts the begincodespacerange / endcodespacerange blocks, which define
the valid byte widths (1-4 bytes) and per-byte bounds for character codes
in a CMap.

## Implementation

- CodespaceRange: Single range with lo/hi bounds (stored as [u8; 4]) and width (1-4 bytes)
- CodespaceRanges: Collection with SmallVec<[CodespaceRange; 8]>
- CodespaceParser: PostScript-style tokenizer for begincodespacerange blocks
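
As a rough illustration of the data model listed above, the following is a minimal sketch, not the code from this commit. It assumes bounds are left-aligned in the `[u8; 4]` arrays and uses a plain slice instead of `SmallVec` to stay dependency-free. The `new`, `contains`, and `code_width` names are hypothetical. Matching is done byte by byte, as the PDF specification describes for codespace ranges.

```rust
/// Sketch of a single codespace range; field layout is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CodespaceRange {
    pub lo: [u8; 4],
    pub hi: [u8; 4],
    pub width: u8, // number of significant bytes, 1 to 4
}

impl CodespaceRange {
    /// Build a range from equal-length bounds; None on width mismatch.
    pub fn new(lo: &[u8], hi: &[u8]) -> Option<Self> {
        if lo.len() != hi.len() || lo.is_empty() || lo.len() > 4 {
            return None;
        }
        let mut l = [0u8; 4];
        let mut h = [0u8; 4];
        l[..lo.len()].copy_from_slice(lo);
        h[..hi.len()].copy_from_slice(hi);
        Some(Self { lo: l, hi: h, width: lo.len() as u8 })
    }

    /// A code matches when it has the range's width and every byte lies
    /// within the corresponding lo/hi byte (not a numeric comparison).
    pub fn contains(&self, code: &[u8]) -> bool {
        let w = self.width as usize;
        code.len() == w
            && code
                .iter()
                .zip(&self.lo[..w])
                .zip(&self.hi[..w])
                .all(|((c, lo), hi)| lo <= c && c <= hi)
    }
}

/// Hypothetical helper: how many bytes the next character code occupies,
/// trying the shortest prefix first. None if no range matches.
pub fn code_width(ranges: &[CodespaceRange], input: &[u8]) -> Option<usize> {
    (1..=input.len().min(4)).find(|&w| ranges.iter().any(|r| r.contains(&input[..w])))
}
```

With ranges `<00> <7F>` and `<8140> <FEFE>`, the input bytes `41 42` would resolve to a 1-byte code and `81 40` to a 2-byte code, while `82 FF` matches nothing because its second byte exceeds `FE`.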

## Acceptance Criteria (all PASS)

- Parse <00> <7F> → 1 range, width=1 
- Parse <00> <7F> <8000> <FFFF> in one block → 2 ranges 
- Width inference: 2-char hex → width=1; 4-char hex → width=2 
- Case-insensitive hex (<C0> and <c0> equivalent) 
- Malformed range (width mismatch) → diagnostic + skipped 
- Empty CMap → empty ranges 
- JIS range <8140> <FEFE> → 2-byte CJK 
- 3-byte and 4-byte range support 
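
The criteria above can be sketched with a small standalone parser. This is an illustrative approximation under stated assumptions, not the `CodespaceParser` from this commit: it scans text with `str::find` rather than a full PostScript tokenizer, reports diagnostics as plain strings, and the `ParsedRange`, `decode_hex`, and `parse_codespace_ranges` names are hypothetical.

```rust
/// One parsed range; lo and hi have the same length (1 to 4 bytes).
#[derive(Debug, PartialEq)]
pub struct ParsedRange {
    pub lo: Vec<u8>,
    pub hi: Vec<u8>,
}

/// Decode a hex token body such as "8140" into bytes (case-insensitive).
/// Returns None for empty, odd-length, or non-hex input.
fn decode_hex(body: &str) -> Option<Vec<u8>> {
    if body.is_empty() || body.len() % 2 != 0 || !body.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..body.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&body[i..i + 2], 16).ok())
        .collect()
}

/// Collect ranges from every begincodespacerange / endcodespacerange block.
/// Malformed pairs are skipped and recorded as diagnostics.
pub fn parse_codespace_ranges(src: &str) -> (Vec<ParsedRange>, Vec<String>) {
    const BEGIN: &str = "begincodespacerange";
    const END: &str = "endcodespacerange";
    let mut ranges = Vec::new();
    let mut diagnostics = Vec::new();
    let mut rest = src;
    while let Some(start) = rest.find(BEGIN) {
        let after = &rest[start + BEGIN.len()..];
        let end = after.find(END).unwrap_or(after.len());
        // Gather every <...> token in the block, decoded or marked invalid.
        let mut tokens: Vec<Option<Vec<u8>>> = Vec::new();
        let mut cursor = &after[..end];
        while let Some(open) = cursor.find('<') {
            let Some(close) = cursor[open..].find('>') else { break };
            tokens.push(decode_hex(cursor[open + 1..open + close].trim()));
            cursor = &cursor[open + close + 1..];
        }
        // Tokens come in lo/hi pairs; the width is the shared byte length.
        for pair in tokens.chunks(2) {
            match pair {
                [Some(lo), Some(hi)] if lo.len() == hi.len() && lo.len() <= 4 => {
                    ranges.push(ParsedRange { lo: lo.clone(), hi: hi.clone() });
                }
                _ => diagnostics.push(format!("skipped malformed codespace range: {:?}", pair)),
            }
        }
        rest = &after[end..];
    }
    (ranges, diagnostics)
}
```

In this sketch the width is simply `lo.len()`, so a 2-character hex token yields width 1 and a 4-character token yields width 2. A pair such as `<00> <FFFF>` produces one diagnostic and no range.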

Also adds encrypted fixture provenance entries to PROVENANCE.md.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-28 05:47:07 -04:00
annotation feat(pdftract-4hle): implement 7.6.4 links + annotations JSON output 2026-05-25 07:44:12 -04:00
attachment feat(pdftract-3j2u): implement 50 MB size limit + base64 encoding for attachments 2026-05-25 11:42:28 -04:00
cache feat(pdftract-2okbq): implement TH-10 cache poisoning protection 2026-05-26 21:09:54 -04:00
decoder feat(pdftract-36glh): implement JPXDecode passthrough with JP2 validation 2026-05-28 05:11:19 -04:00
encryption feat(pdftract-1z0qt): implement encryption detection + RC4/AES-128/AES-256 decryption 2026-05-28 03:22:36 -04:00
fingerprint feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
font feat(pdftract-3g6ne): implement CMap codespace range parser 2026-05-28 05:47:07 -04:00
forms feat(pdftract-5qca): implement form_fields JSON output + schema integration 2026-05-24 14:36:03 -04:00
glyph feat(pdftract-1q19p): implement OCG /OC tag tracking with is_hidden flag 2026-05-26 22:25:27 -04:00
layout feat(pdftract-4bylb): implement Docstrum fallback for reading order 2026-05-28 04:16:24 -04:00
ocr/preprocessing feat(pdftract-3h9xo): implement threads JSON output + schema integration 2026-05-25 13:40:15 -04:00
output docs(pdftract-1t5sj): verify book_chapter profile implementation complete 2026-05-27 22:38:46 -04:00
parser chore(pdftract-36glh): remove unused JpxDecoder import and add verification note 2026-05-28 05:23:13 -04:00
profiles feat(pdftract-64p5): implement classify CLI subcommand and --auto flag 2026-05-24 15:16:56 -04:00
receipts feat(pdftract-4yspv): implement OCR receipt fallback 2026-05-25 19:53:42 -04:00
render feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
schema test(pdftract-4c8qu): add page_label tests and fix JSON schema 2026-05-25 14:43:31 -04:00
signature feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
source chore(pdftract-36glh): remove unused JpxDecoder import and add verification note 2026-05-28 05:23:13 -04:00
span fix(pdftract-37qim): fix span compilation errors, verify multi-output CLI parsing 2026-05-28 01:29:07 -04:00
table feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
threads feat(pdftract-4li3d): implement security constraints for serve mode 2026-05-26 18:47:51 -04:00
atomic_file_writer.rs feat(pdftract-68wfa): implement AtomicFileWriter for atomic file writes 2026-05-24 13:02:37 -04:00
audit.rs fix: resolve compilation errors across codebase 2026-05-25 08:38:04 -04:00
classify.rs feat(pdftract-4li3d): implement security constraints for serve mode 2026-05-26 18:47:51 -04:00
confidence.rs feat(pdftract-2etcd): implement map_confidence_source function 2026-05-28 00:46:19 -04:00
conformance.rs feat(pdftract-2bs4j): implement PDF/A conformance detection via XMP parsing 2026-05-28 03:36:59 -04:00
content_stream.rs fix(pdftract-63ka2): AES-128 test buffer allocation for PKCS#7 padding 2026-05-28 01:30:33 -04:00
detection.rs chore(pdftract-36glh): remove unused JpxDecoder import and add verification note 2026-05-28 05:23:13 -04:00
diagnostics.rs feat(pdftract-36glh): implement JPXDecode passthrough with JP2 validation 2026-05-28 05:11:19 -04:00
document.rs chore(pdftract-36glh): remove unused JpxDecoder import and add verification note 2026-05-28 05:23:13 -04:00
dpi.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
extract.rs feat(pdftract-1z0qt): implement encryption detection + RC4/AES-128/AES-256 decryption 2026-05-28 03:22:36 -04:00
graphics_state.rs feat(pdftract-1kdzu): implement TJ operator with kerning and word boundary detection 2026-05-26 16:44:05 -04:00
hybrid.rs feat(pdftract-1t5sj): implement book_chapter profile with fixtures and tests 2026-05-27 22:30:09 -04:00
javascript.rs feat(pdftract-4li3d): implement security constraints for serve mode 2026-05-26 18:47:51 -04:00
lib.rs chore(pdftract-36glh): remove unused JpxDecoder import and add verification note 2026-05-28 05:23:13 -04:00
markdown.rs chore(pdftract-36glh): remove unused JpxDecoder import and add verification note 2026-05-28 05:23:13 -04:00
ocr.rs feat(pdftract-6dki1): implement histogram stretch contrast normalization 2026-05-24 10:30:20 -04:00
options.rs chore(pdftract-36glh): remove unused JpxDecoder import and add verification note 2026-05-28 05:23:13 -04:00
page_class.rs fix(pdftract-tuky): fix color clamping test and verify Phase 3.1 coordinator 2026-05-26 16:36:01 -04:00
pages.rs fix(pdftract-63ka2): AES-128 test buffer allocation for PKCS#7 padding 2026-05-28 01:30:33 -04:00
preprocess.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
render.rs feat(pdftract-axcri): record inline images as ImageXObject entries 2026-05-24 07:41:50 -04:00
semaphore.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
span_flags.rs feat(pdftract-cbrbg): implement span flag detector for Phase 4.1 2026-05-24 07:28:25 -04:00
text.rs fix(pdftract-38p8h): add fallback for empty block.spans in invisible text filter 2026-05-28 00:39:37 -04:00
url_validation.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
word_boundary.rs feat(pdftract-h2s0z): implement adaptive word boundary detector 2026-05-24 06:06:56 -04:00