This commit adds the codespace range parser for CMap streams. The parser extracts the begincodespacerange / endcodespacerange blocks that define legal byte-width boundaries for character codes in a CMap.

## Implementation
- CodespaceRange: Single range with lo/hi bounds (stored as [u8; 4]) and width (1-4 bytes)
- CodespaceRanges: Collection with SmallVec<[CodespaceRange; 8]>
- CodespaceParser: PostScript-style tokenizer for begincodespacerange blocks

## Acceptance Criteria (all PASS)
- Parse <00> <7F> → 1 range, width=1 ✅
- Parse <00> <7F> <8000> <FFFF> in one block → 2 ranges ✅
- Width inference: 2-char hex → width=1; 4-char hex → width=2 ✅
- Case-insensitive hex (<C0> and <c0> equivalent) ✅
- Malformed range (width mismatch) → diagnostic + skipped ✅
- Empty CMap → empty ranges ✅
- JIS range <8140> <FEFE> → 2-byte CJK ✅
- 3-byte and 4-byte range support ✅

Also adds encrypted fixture provenance entries to PROVENANCE.md.

Co-Authored-By: Claude Code <noreply@anthropic.com>
pdftract-57np8: Image Filter Passthroughs Verification
Task
Implement DCTDecode / JBIG2Decode / JPXDecode / CCITTFaxDecode passthroughs with SOI/EOI validation + OCR_*_UNSUPPORTED diagnostics
Status: COMPLETE
All four image filter passthroughs are implemented in crates/pdftract-core/src/parser/stream.rs with proper validation and diagnostic emission.
Implementation Summary
1. DCTDecode (JPEG) Passthrough
- Location: crates/pdftract-core/src/parser/stream.rs lines 3718-3743
- SOI Marker Validation: Checks first 2 bytes are 0xFF 0xD8 (SOI = Start Of Image)
- EOI Marker Validation: Checks last 2 bytes are 0xFF 0xD9 (EOI = End Of Image)
- Diagnostic: STREAM_INVALID_JPEG emitted for missing SOI or EOI markers
- Passthrough: Raw JPEG bytes passed through unchanged
- Tests:
  - test_dctdecode_passthrough_valid_jpeg: verifies bytes unchanged with SOI/EOI
  - test_dctdecode_passthrough_missing_soi: verifies warning without SOI
  - test_dctdecode_passthrough_missing_eoi: verifies warning without EOI
  - prop_dct_decode_never_panics: proptest for random input
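The SOI/EOI checks described above can be sketched as follows. This is a minimal illustration, not the code in stream.rs: the names JpegMarkerIssue, check_jpeg_markers, and dct_passthrough are hypothetical, and only the marker bytes and the passthrough behavior come from the report.

```rust
/// Which JPEG framing marker is absent. These variant names are hypothetical;
/// the report only names the STREAM_INVALID_JPEG diagnostic.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JpegMarkerIssue {
    MissingSoi,
    MissingEoi,
}

/// SOI (Start Of Image) and EOI (End Of Image) marker bytes.
const SOI: [u8; 2] = [0xFF, 0xD8];
const EOI: [u8; 2] = [0xFF, 0xD9];

/// Returns every marker problem found; an empty vector means both markers are present.
pub fn check_jpeg_markers(data: &[u8]) -> Vec<JpegMarkerIssue> {
    let mut issues = Vec::new();
    if !data.starts_with(&SOI) {
        issues.push(JpegMarkerIssue::MissingSoi);
    }
    if !data.ends_with(&EOI) {
        issues.push(JpegMarkerIssue::MissingEoi);
    }
    issues
}

/// Passthrough: the bytes are returned unchanged whether or not they validate.
/// A caller would map any returned issue to a STREAM_INVALID_JPEG warning.
pub fn dct_passthrough(data: &[u8]) -> (Vec<u8>, Vec<JpegMarkerIssue>) {
    (data.to_vec(), check_jpeg_markers(data))
}
```

Validation and passthrough are kept separate here so that a failed check can never alter the output bytes.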
2. JBIG2Decode Passthrough
- Location: crates/pdftract-core/src/parser/stream.rs lines 3697-3716
- Diagnostic: OCR_JBIG2_UNSUPPORTED emitted when full-render feature is disabled
- Passthrough: Raw JBIG2 bytes passed through unchanged
- Globals Recording: /JBIG2Globals reference extracted and stored in StreamMeta
- Tests:
  - test_jbig2_passthrough: integration test for passthrough
  - prop_jbig2_decode_never_panics: proptest for random input
  - prop_jbig2_passthrough_never_panics: proptest via get_decoder
3. JPXDecode (JPEG2000) Passthrough
- Location: crates/pdftract-core/src/parser/stream.rs lines 3745-3757
- JP2 Box Magic Validation: Checks first 12 bytes match JP2 signature (00 00 00 0C 6A 50 20 20 0D 0A 87 0A)
- Diagnostics:
  - OCR_JPX_UNSUPPORTED emitted when full-render AND libopenjp2 are unavailable
  - STREAM_INVALID_JPX emitted when JP2 box magic doesn't match (raw J2K or corrupt)
- Passthrough: Raw JPEG2000 bytes passed through unchanged
- Tests:
  - test_jpxstream_passthrough_valid_jp2: verifies JP2 passthrough
  - test_jpxstream_passthrough_raw_j2k: verifies raw J2K passthrough
  - test_jpxstream_passthrough_empty: edge case
  - prop_jpx_decode_never_panics: proptest for random input
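The JP2 box magic check can be illustrated with a short sketch. The function names has_jp2_signature and jpx_passthrough are hypothetical, and treating empty input as a signature mismatch is an assumption of this sketch; the report does not say how the real code handles the empty case.

```rust
/// The 12-byte JP2 signature box quoted in the report.
pub const JP2_SIGNATURE: [u8; 12] = [
    0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, 0x87, 0x0A,
];

/// Diagnostic code named in the report for a failed magic check.
pub const STREAM_INVALID_JPX: &str = "STREAM_INVALID_JPX";

/// True when the data begins with the JP2 signature box.
/// Input shorter than 12 bytes (including empty input) returns false.
pub fn has_jp2_signature(data: &[u8]) -> bool {
    data.starts_with(&JP2_SIGNATURE)
}

/// Passthrough: bytes are returned unchanged, paired with an optional
/// diagnostic code when the signature is absent (raw J2K codestream or corrupt data).
pub fn jpx_passthrough(data: &[u8]) -> (Vec<u8>, Option<&'static str>) {
    let diagnostic = if has_jp2_signature(data) {
        None
    } else {
        Some(STREAM_INVALID_JPX)
    };
    (data.to_vec(), diagnostic)
}
```

A raw J2K codestream begins with the SOC marker (FF 4F) rather than a signature box, so it fails this check while still passing through intact.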
4. CCITTFaxDecode Passthrough
- Location: crates/pdftract-core/src/parser/stream.rs lines 3667-3695
- Diagnostic: OCR_CCITT_UNSUPPORTED emitted when full-render AND libtiff are unavailable
- Parameter Parsing: Parses /K, /Columns, /Rows, /EncodedByteAlign, /EndOfLine, /BlackIs1
- Defaults: Uses DEFAULT_COLUMNS (1728) when /Columns missing
- Passthrough: Raw CCITT bytes passed through unchanged
- Tests:
  - test_ccittfax_passthrough_with_columns: verifies passthrough with params
  - test_ccittfax_passthrough_missing_columns: verifies default used
  - test_ccittfax_parse_params_with_all_fields: verifies parameter parsing
  - prop_ccitt_decode_never_panics: proptest for random input
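Parameter parsing with defaults might look like the sketch below. The Param enum, the CcittParams struct, and the HashMap-based dictionary are hypothetical stand-ins for the crate's own PDF object types. Only DEFAULT_COLUMNS (1728) comes from the report; the other defaults follow the PDF specification's CCITTFaxDecode parameter table. Falling back to the default on an out-of-range integer is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Default named in the report for a missing /Columns entry.
pub const DEFAULT_COLUMNS: u32 = 1728;

/// Hypothetical stand-in for a parsed PDF object; only the two kinds
/// needed by the CCITT parameters are modeled.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Param {
    Int(i64),
    Bool(bool),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CcittParams {
    pub k: i64,
    pub columns: u32,
    pub rows: u32,
    pub encoded_byte_align: bool,
    pub end_of_line: bool,
    pub black_is_1: bool,
}

/// Integer lookup; a missing or wrongly typed entry yields the default.
fn int_or(dict: &HashMap<&str, Param>, key: &str, default: i64) -> i64 {
    match dict.get(key) {
        Some(Param::Int(v)) => *v,
        _ => default,
    }
}

/// Boolean lookup; a missing or wrongly typed entry yields the default.
fn bool_or(dict: &HashMap<&str, Param>, key: &str, default: bool) -> bool {
    match dict.get(key) {
        Some(Param::Bool(v)) => *v,
        _ => default,
    }
}

/// Converts to u32, using the fallback for negative or oversized values.
fn to_u32(v: i64, fallback: u32) -> u32 {
    if (0..=i64::from(u32::MAX)).contains(&v) {
        v as u32
    } else {
        fallback
    }
}

pub fn parse_ccitt_params(dict: &HashMap<&str, Param>) -> CcittParams {
    let columns = int_or(dict, "Columns", i64::from(DEFAULT_COLUMNS));
    CcittParams {
        k: int_or(dict, "K", 0),
        columns: to_u32(columns, DEFAULT_COLUMNS),
        rows: to_u32(int_or(dict, "Rows", 0), 0),
        encoded_byte_align: bool_or(dict, "EncodedByteAlign", false),
        end_of_line: bool_or(dict, "EndOfLine", false),
        black_is_1: bool_or(dict, "BlackIs1", false),
    }
}
```

Because every lookup has a fallback, parsing never fails, which fits a passthrough filter that must not reject a stream over malformed parameters.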
Acceptance Criteria Status
Critical Test
- PASS: DCTDecode fixture with known JPEG: bytes unchanged, SOI marker present
  - Test: test_dctdecode_passthrough_valid_jpeg (line 1951)
Diagnostics
- PASS: JPEG without EOI marker passes through with STREAM_INVALID_JPEG warning
  - Test: test_dctdecode_passthrough_missing_eoi (line 1982)
- PASS: JBIG2Decode without full-render emits OCR_JBIG2_UNSUPPORTED
  - Emission at line 3703 (emits when cfg!(feature = "full-render") is false)
- PASS: JPXDecode without full-render emits OCR_JPX_UNSUPPORTED
  - Emission at line 3750 (via JpxDecoder::emit_unsupported_diagnostic)
- PASS: CCITTFaxDecode without libtiff emits OCR_CCITT_UNSUPPORTED
  - Emission at line 3690 (emits when !has_full_render && !has_libtiff)
Validation
- PASS: JP2 box magic check detects malformed JPX with STREAM_INVALID_JPX
  - Validation at line 3754 (via JpxDecoder::validate_jp2_magic)
INV-8 Compliance
- PASS: Proptest random byte sequences for each filter never panic
  - Tests: prop_dct_decode_never_panics, prop_jbig2_decode_never_panics, prop_jpx_decode_never_panics, prop_ccitt_decode_never_panics
Files Modified
Core Implementation
- crates/pdftract-core/src/parser/stream.rs: Diagnostic emissions for all 4 filters
- crates/pdftract-core/src/decoder/jbig2.rs: JBIG2Decoder with diagnostic emission
- crates/pdftract-core/src/decoder/jpx.rs: JpxDecoder with JP2 validation and diagnostics
Tests
- tests/proptest/stream.rs: Added proptest coverage for all 4 filters
- 14 new property tests verifying never-panic and passthrough behavior
Feature Gate Behavior
With full-render feature
- All diagnostics suppressed
- Image data passed to OCR pipeline for pdfium-render decoding
Without full-render feature
- OCR_JBIG2_UNSUPPORTED emitted per JBIG2 stream (EC-11)
- OCR_JPX_UNSUPPORTED emitted per JPX stream (EC-12)
- OCR_CCITT_UNSUPPORTED emitted per CCITT stream (EC-13)
- Data still passed through for downstream consumption
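The gating rules above can be summarized in one decision function. This sketch models capabilities as runtime booleans so it can be exercised directly; the report indicates the real code uses cfg!(feature = "full-render") and flags such as has_libtiff. The ImageFilter and Capabilities types and the unsupported_diagnostic function are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFilter {
    DctDecode,
    Jbig2Decode,
    JpxDecode,
    CcittFaxDecode,
}

/// Hypothetical runtime view of the build's capabilities.
#[derive(Debug, Clone, Copy, Default)]
pub struct Capabilities {
    pub full_render: bool,
    pub libopenjp2: bool,
    pub libtiff: bool,
}

/// Returns the OCR_*_UNSUPPORTED code to emit for a stream, if any.
/// The data is passed through in every case; this only decides the diagnostic.
pub fn unsupported_diagnostic(filter: ImageFilter, caps: Capabilities) -> Option<&'static str> {
    // With full-render, the diagnostics are suppressed for every filter.
    if caps.full_render {
        return None;
    }
    match filter {
        ImageFilter::Jbig2Decode => Some("OCR_JBIG2_UNSUPPORTED"),
        ImageFilter::JpxDecode if !caps.libopenjp2 => Some("OCR_JPX_UNSUPPORTED"),
        ImageFilter::CcittFaxDecode if !caps.libtiff => Some("OCR_CCITT_UNSUPPORTED"),
        // DCTDecode has no "unsupported" code; it only reports marker problems.
        _ => None,
    }
}
```

Keeping the decision in a pure function makes each row of the acceptance criteria checkable without building the crate under several feature combinations.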
Verification Date
2026-05-28
Notes
- Diagnostics are emitted in the decode_stream_impl function, not in the individual decoder implementations
  - This is because the StreamDecoder trait doesn't provide a way to return diagnostics
- Passthrough pattern preserves all bytes unchanged, including malformed data (INV-8)
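The design choice in these notes can be illustrated with a simplified sketch. The trait signature, the Diagnostic struct, and the parameters of decode_stream_impl shown here are assumptions made for illustration; only the names StreamDecoder, decode_stream_impl, and STREAM_INVALID_JPEG appear in the report.

```rust
/// Assumed shape of the trait: decode returns bytes only, so an
/// implementation has no channel for reporting diagnostics.
pub trait StreamDecoder {
    fn decode(&self, data: &[u8]) -> Vec<u8>;
}

/// A passthrough decoder that returns its input unchanged.
pub struct Passthrough;

impl StreamDecoder for Passthrough {
    fn decode(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}

/// Hypothetical diagnostic record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: &'static str,
}

/// The caller owns the diagnostic sink, so validation happens here,
/// around the decoder call rather than inside it.
pub fn decode_stream_impl(
    filter: &str,
    decoder: &dyn StreamDecoder,
    data: &[u8],
    diagnostics: &mut Vec<Diagnostic>,
) -> Vec<u8> {
    if filter == "DCTDecode" {
        let has_soi = data.starts_with(&[0xFF, 0xD8]);
        let has_eoi = data.ends_with(&[0xFF, 0xD9]);
        if !has_soi || !has_eoi {
            diagnostics.push(Diagnostic { code: "STREAM_INVALID_JPEG" });
        }
    }
    // Malformed data is still handed to the decoder and returned unchanged.
    decoder.decode(data)
}
```

An alternative would be to change the trait so decode returns diagnostics alongside the bytes, at the cost of touching every implementation.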