pdftract/notes/pdftract-57np8.md
jedarden 1dfaf73aa4
feat(pdftract-3g6ne): implement CMap codespace range parser
This commit adds the codespace range parser for CMap streams. The parser
extracts the begincodespacerange / endcodespacerange blocks that define
legal byte-width boundaries for character codes in a CMap.

## Implementation

- CodespaceRange: Single range with lo/hi bounds (stored as [u8; 4]) and width (1-4 bytes)
- CodespaceRanges: Collection with SmallVec<[CodespaceRange; 8]>
- CodespaceParser: PostScript-style tokenizer for begincodespacerange blocks

## Acceptance Criteria (all PASS)

- Parse <00> <7F> → 1 range, width=1 
- Parse <00> <7F> <8000> <FFFF> in one block → 2 ranges 
- Width inference: 2-char hex → width=1; 4-char hex → width=2 
- Case-insensitive hex (<C0> and <c0> equivalent) 
- Malformed range (width mismatch) → diagnostic + skipped 
- Empty CMap → empty ranges 
- JIS range <8140> <FEFE> → 2-byte CJK 
- 3-byte and 4-byte range support 

Also adds encrypted fixture provenance entries to PROVENANCE.md.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-28 05:47:07 -04:00


pdftract-57np8: Image Filter Passthroughs Verification

Task

Implement DCTDecode / JBIG2Decode / JPXDecode / CCITTFaxDecode passthroughs with SOI/EOI validation + OCR_*_UNSUPPORTED diagnostics

Status: COMPLETE

All four image filter passthroughs are implemented in crates/pdftract-core/src/parser/stream.rs with proper validation and diagnostic emission.

Implementation Summary

1. DCTDecode (JPEG) Passthrough

  • Location: crates/pdftract-core/src/parser/stream.rs lines 3718-3743
  • SOI Marker Validation: Checks first 2 bytes are 0xFF 0xD8 (SOI = Start Of Image)
  • EOI Marker Validation: Checks last 2 bytes are 0xFF 0xD9 (EOI = End Of Image)
  • Diagnostic: STREAM_INVALID_JPEG emitted for missing SOI or EOI markers
  • Passthrough: Raw JPEG bytes passed through unchanged
  • Tests:
    • test_dctdecode_passthrough_valid_jpeg - verifies bytes unchanged with SOI/EOI
    • test_dctdecode_passthrough_missing_soi - verifies warning without SOI
    • test_dctdecode_passthrough_missing_eoi - verifies warning without EOI
    • prop_dct_decode_never_panics - proptest for random input
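
The marker checks above can be sketched as follows. This is a minimal illustration, not the code in stream.rs: the function name dct_passthrough and the use of plain string codes for diagnostics are assumptions, as is the choice to emit one warning per missing marker.

```rust
/// Hypothetical passthrough for DCTDecode data.
/// Returns the bytes unchanged plus any diagnostic codes raised.
fn dct_passthrough(data: &[u8]) -> (Vec<u8>, Vec<&'static str>) {
    let mut diags = Vec::new();

    // SOI (Start Of Image): the first two bytes must be FF D8.
    if !data.starts_with(&[0xFF, 0xD8]) {
        diags.push("STREAM_INVALID_JPEG");
    }
    // EOI (End Of Image): the last two bytes must be FF D9.
    if !data.ends_with(&[0xFF, 0xD9]) {
        diags.push("STREAM_INVALID_JPEG");
    }

    // The bytes are copied as-is, even when a marker is missing.
    (data.to_vec(), diags)
}
```

Slice methods starts_with and ends_with return false on inputs that are too short, so empty or one-byte streams produce warnings instead of an out-of-bounds panic.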

2. JBIG2Decode Passthrough

  • Location: crates/pdftract-core/src/parser/stream.rs lines 3697-3716
  • Diagnostic: OCR_JBIG2_UNSUPPORTED emitted when full-render feature is disabled
  • Passthrough: Raw JBIG2 bytes passed through unchanged
  • Globals Recording: /JBIG2Globals reference extracted and stored in StreamMeta
  • Tests:
    • test_jbig2_passthrough - integration test for passthrough
    • prop_jbig2_decode_never_panics - proptest for random input
    • prop_jbig2_passthrough_never_panics - proptest via get_decoder

3. JPXDecode (JPEG2000) Passthrough

  • Location: crates/pdftract-core/src/parser/stream.rs lines 3745-3757
  • JP2 Box Magic Validation: Checks first 12 bytes match JP2 signature (00 00 00 0C 6A 50 20 20 0D 0A 87 0A)
  • Diagnostics:
    • OCR_JPX_UNSUPPORTED emitted when neither full-render nor libopenjp2 is available
    • STREAM_INVALID_JPX emitted when JP2 box magic doesn't match (raw J2K or corrupt)
  • Passthrough: Raw JPEG2000 bytes passed through unchanged
  • Tests:
    • test_jpxstream_passthrough_valid_jp2 - verifies JP2 passthrough
    • test_jpxstream_passthrough_raw_j2k - verifies raw J2K passthrough
    • test_jpxstream_passthrough_empty - edge case
    • prop_jpx_decode_never_panics - proptest for random input
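
A sketch of the JP2 box magic check, assuming a free function rather than the actual JpxDecoder::validate_jp2_magic signature, which is not shown in these notes. The names JP2_SIGNATURE, has_jp2_magic, and jpx_passthrough are hypothetical. Whether the real code also warns on empty input is not stated here, so that behavior is an assumption of the sketch.

```rust
/// The 12-byte JP2 signature box: length 0x0000000C, type "jP  ", then 0D 0A 87 0A.
const JP2_SIGNATURE: [u8; 12] = [
    0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, 0x87, 0x0A,
];

/// True when the stream begins with the JP2 signature box.
fn has_jp2_magic(data: &[u8]) -> bool {
    data.starts_with(&JP2_SIGNATURE)
}

/// Hypothetical passthrough: bytes are returned unchanged, with a warning
/// code when the magic is absent (raw J2K codestream, corrupt, or empty data).
fn jpx_passthrough(data: &[u8]) -> (Vec<u8>, Option<&'static str>) {
    let diag = if has_jp2_magic(data) {
        None
    } else {
        Some("STREAM_INVALID_JPX")
    };
    (data.to_vec(), diag)
}
```

A raw J2K codestream begins with the SOC marker (FF 4F) instead of the signature box, so it fails the check but is still passed through.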

4. CCITTFaxDecode Passthrough

  • Location: crates/pdftract-core/src/parser/stream.rs lines 3667-3695
  • Diagnostic: OCR_CCITT_UNSUPPORTED emitted when neither full-render nor libtiff is available
  • Parameter Parsing: Parses /K, /Columns, /Rows, /EncodedByteAlign, /EndOfLine, /BlackIs1
  • Defaults: Uses DEFAULT_COLUMNS (1728) when /Columns missing
  • Passthrough: Raw CCITT bytes passed through unchanged
  • Tests:
    • test_ccittfax_passthrough_with_columns - verifies passthrough with params
    • test_ccittfax_passthrough_missing_columns - verifies default used
    • test_ccittfax_parse_params_with_all_fields - verifies parameter parsing
    • prop_ccitt_decode_never_panics - proptest for random input
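
The parameter handling might look roughly like this. The Value enum, the CcittParams struct, and the map-based dictionary are stand-ins invented for the sketch, since the real PDF object types are not shown here. Only DEFAULT_COLUMNS (1728) comes from the notes; the other defaults follow the PDF specification's CCITTFaxDecode parameter table.

```rust
use std::collections::HashMap;

const DEFAULT_COLUMNS: u32 = 1728;

/// Hypothetical stand-in for a parsed PDF object.
enum Value {
    Int(i64),
    Bool(bool),
}

#[derive(Debug, PartialEq)]
struct CcittParams {
    k: i64,
    columns: u32,
    rows: u32,
    encoded_byte_align: bool,
    end_of_line: bool,
    black_is_1: bool,
}

/// Read a non-negative integer entry, falling back to `default` when the
/// key is missing, has the wrong type, or is out of range.
fn get_u32(dict: &HashMap<&str, Value>, key: &str, default: u32) -> u32 {
    match dict.get(key) {
        Some(Value::Int(n)) if *n >= 0 && *n <= i64::from(u32::MAX) => *n as u32,
        _ => default,
    }
}

/// Read a boolean entry; anything other than an explicit `true` is false.
fn get_flag(dict: &HashMap<&str, Value>, key: &str) -> bool {
    matches!(dict.get(key), Some(Value::Bool(true)))
}

/// Keys are written without the leading slash for brevity.
fn parse_ccitt_params(dict: &HashMap<&str, Value>) -> CcittParams {
    let k = match dict.get("K") {
        Some(Value::Int(n)) => *n,
        _ => 0,
    };
    CcittParams {
        k,
        columns: get_u32(dict, "Columns", DEFAULT_COLUMNS),
        rows: get_u32(dict, "Rows", 0),
        encoded_byte_align: get_flag(dict, "EncodedByteAlign"),
        end_of_line: get_flag(dict, "EndOfLine"),
        black_is_1: get_flag(dict, "BlackIs1"),
    }
}
```

Falling back to defaults on wrong-typed or negative values keeps parsing total, which fits the never-panic requirement for malformed input.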

Acceptance Criteria Status

Critical Test

  • PASS: DCTDecode fixture with known JPEG — bytes unchanged, SOI marker present
    • Test: test_dctdecode_passthrough_valid_jpeg (line 1951)

Diagnostics

  • PASS: JPEG without EOI marker passes through with STREAM_INVALID_JPEG warning
    • Test: test_dctdecode_passthrough_missing_eoi (line 1982)
  • PASS: JBIG2Decode without full-render emits OCR_JBIG2_UNSUPPORTED
    • Emission at line 3703 (emits when cfg!(feature = "full-render") is false)
  • PASS: JPXDecode without full-render emits OCR_JPX_UNSUPPORTED
    • Emission at line 3750 (via JpxDecoder::emit_unsupported_diagnostic)
  • PASS: CCITTFaxDecode without libtiff emits OCR_CCITT_UNSUPPORTED
    • Emission at line 3690 (emits when !has_full_render && !has_libtiff)

Validation

  • PASS: JP2 box magic check detects malformed JPX with STREAM_INVALID_JPX
    • Validation at line 3754 (via JpxDecoder::validate_jp2_magic)

INV-8 Compliance

  • PASS: Proptest random byte sequences for each filter never panic
    • Tests: prop_dct_decode_never_panics, prop_jbig2_decode_never_panics, prop_jpx_decode_never_panics, prop_ccitt_decode_never_panics
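
The property being checked can be illustrated without the proptest crate. The sketch below swaps in a small xorshift generator from the standard library only, so it is a simplified stand-in for the actual property tests, not a copy of them. The passthrough function is hypothetical.

```rust
/// Tiny xorshift64 generator used in place of proptest's input strategies.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Produce a byte vector of pseudo-random length in 0..=max_len.
    fn bytes(&mut self, max_len: usize) -> Vec<u8> {
        let len = (self.next_u64() as usize) % (max_len + 1);
        (0..len).map(|_| self.next_u64() as u8).collect()
    }
}

/// Hypothetical passthrough: validates JPEG markers, returns bytes unchanged.
fn passthrough(data: &[u8]) -> (Vec<u8>, bool) {
    let valid = data.starts_with(&[0xFF, 0xD8]) && data.ends_with(&[0xFF, 0xD9]);
    (data.to_vec(), valid)
}

/// Feed `cases` pseudo-random inputs through the passthrough and assert the
/// output equals the input. A panic anywhere fails the check.
fn check_never_panics(cases: usize, seed: u64) -> usize {
    let mut rng = XorShift(seed);
    let mut checked = 0;
    for _ in 0..cases {
        let input = rng.bytes(64);
        let (out, _valid) = passthrough(&input);
        assert_eq!(out, input);
        checked += 1;
    }
    checked
}
```

Unlike proptest, this loop does not shrink failing inputs, so it shows only the shape of the invariant.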

Files Modified

Core Implementation

  • crates/pdftract-core/src/parser/stream.rs: Diagnostic emissions for all 4 filters
  • crates/pdftract-core/src/decoder/jbig2.rs: JBIG2Decoder with diagnostic emission
  • crates/pdftract-core/src/decoder/jpx.rs: JpxDecoder with JP2 validation and diagnostics

Tests

  • tests/proptest/stream.rs: Added proptest coverage for all 4 filters
    • 14 new property tests verifying never-panic and passthrough behavior

Feature Gate Behavior

With full-render feature

  • All OCR_*_UNSUPPORTED diagnostics suppressed
  • Image data passed to OCR pipeline for pdfium-render decoding

Without full-render feature

  • OCR_JBIG2_UNSUPPORTED emitted per JBIG2 stream (EC-11)
  • OCR_JPX_UNSUPPORTED emitted per JPX stream (EC-12)
  • OCR_CCITT_UNSUPPORTED emitted per CCITT stream (EC-13)
  • Data still passed through for downstream consumption
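
The gating rules can be summarized in one function. In the real code the full-render check is a compile-time cfg!(feature = "full-render"). Here it is modeled as a runtime flag so the sketch stays self-contained. The Capabilities struct and the function name are assumptions.

```rust
/// Hypothetical runtime view of what the build can decode.
#[derive(Clone, Copy)]
struct Capabilities {
    full_render: bool,
    libopenjp2: bool,
    libtiff: bool,
}

/// Return the OCR_*_UNSUPPORTED code to emit for a filter, if any.
/// With full_render set, every case falls through to None.
fn unsupported_diagnostic(filter: &str, caps: Capabilities) -> Option<&'static str> {
    match filter {
        "JBIG2Decode" if !caps.full_render => Some("OCR_JBIG2_UNSUPPORTED"),
        "JPXDecode" if !caps.full_render && !caps.libopenjp2 => Some("OCR_JPX_UNSUPPORTED"),
        "CCITTFaxDecode" if !caps.full_render && !caps.libtiff => Some("OCR_CCITT_UNSUPPORTED"),
        _ => None,
    }
}
```

The result only decides whether a diagnostic is recorded. The stream bytes are passed through in every case.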

Verification Date

2026-05-28

Notes

  • Diagnostics are emitted in the decode_stream_impl function rather than in the individual decoder implementations, because the StreamDecoder trait provides no way to return diagnostics
  • Passthrough pattern preserves all bytes unchanged, including malformed data (INV-8)
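
To illustrate the first note, here is a rough sketch of why emission lives in the wrapper. The trait and function signatures are guesses made for the example. The notes state only that the trait cannot return diagnostics, not what its methods look like.

```rust
/// Assumed shape of the trait: decode returns bytes only, so an
/// implementation has no channel for reporting warnings.
trait StreamDecoder {
    fn decode(&self, data: &[u8]) -> Vec<u8>;
}

/// Passthrough decoder: hands the bytes back unchanged.
struct DctPassthrough;

impl StreamDecoder for DctPassthrough {
    fn decode(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}

/// Hypothetical wrapper: it owns the diagnostics sink, so validation and
/// emission happen here, before delegating to the decoder.
fn decode_stream_impl<D: StreamDecoder>(
    filter: &str,
    decoder: &D,
    data: &[u8],
    diags: &mut Vec<&'static str>,
) -> Vec<u8> {
    let jpeg_ok = data.starts_with(&[0xFF, 0xD8]) && data.ends_with(&[0xFF, 0xD9]);
    if filter == "DCTDecode" && !jpeg_ok {
        diags.push("STREAM_INVALID_JPEG");
    }
    decoder.decode(data)
}
```

Keeping the sink in the wrapper avoids changing the trait signature for every decoder, at the cost of filter-specific checks living outside the decoder they describe.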