pdftract/notes/pdftract-57np8.md
jedarden 1dfaf73aa4
feat(pdftract-3g6ne): implement CMap codespace range parser
This commit adds the codespace range parser for CMap streams. The parser
extracts the begincodespacerange / endcodespacerange blocks that define
legal byte-width boundaries for character codes in a CMap.

## Implementation

- CodespaceRange: Single range with lo/hi bounds (stored as [u8; 4]) and width (1-4 bytes)
- CodespaceRanges: Collection with SmallVec<[CodespaceRange; 8]>
- CodespaceParser: PostScript-style tokenizer for begincodespacerange blocks

## Acceptance Criteria (all PASS)

- Parse <00> <7F> → 1 range, width=1 
- Parse <00> <7F> <8000> <FFFF> in one block → 2 ranges 
- Width inference: 2-char hex → width=1; 4-char hex → width=2 
- Case-insensitive hex (<C0> and <c0> equivalent) 
- Malformed range (width mismatch) → diagnostic + skipped 
- Empty CMap → empty ranges 
- JIS range <8140> <FEFE> → 2-byte CJK 
- 3-byte and 4-byte range support 

Also adds encrypted fixture provenance entries to PROVENANCE.md.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-28 05:47:07 -04:00


# pdftract-57np8: Image Filter Passthroughs Verification
## Task
Implement DCTDecode / JBIG2Decode / JPXDecode / CCITTFaxDecode passthroughs with SOI/EOI validation and `OCR_*_UNSUPPORTED` diagnostics.
## Status: COMPLETE
All four image filter passthroughs are implemented in `crates/pdftract-core/src/parser/stream.rs` with proper validation and diagnostic emission.
## Implementation Summary
### 1. DCTDecode (JPEG) Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3718-3743
- **SOI Marker Validation**: Checks first 2 bytes are 0xFF 0xD8 (SOI = Start Of Image)
- **EOI Marker Validation**: Checks last 2 bytes are 0xFF 0xD9 (EOI = End Of Image)
- **Diagnostic**: `STREAM_INVALID_JPEG` emitted for missing SOI or EOI markers
- **Passthrough**: Raw JPEG bytes passed through unchanged
- **Tests**:
  - `test_dctdecode_passthrough_valid_jpeg` - verifies bytes unchanged with SOI/EOI
  - `test_dctdecode_passthrough_missing_soi` - verifies warning without SOI
  - `test_dctdecode_passthrough_missing_eoi` - verifies warning without EOI
  - `prop_dct_decode_never_panics` - proptest for random input
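
The SOI/EOI check described above can be sketched as follows. This is a minimal illustration, not the code in `stream.rs`: the names `JpegMarkerCheck`, `check_jpeg_markers`, and `dct_passthrough` are hypothetical, and returning the diagnostic code as an `Option<&'static str>` is an assumption made to keep the example self-contained.

```rust
/// Outcome of checking a DCTDecode stream for JPEG framing markers.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum JpegMarkerCheck {
    Valid,
    MissingSoi,
    MissingEoi,
}

/// SOI (Start Of Image) marker: first two bytes of a JPEG.
const SOI: [u8; 2] = [0xFF, 0xD8];
/// EOI (End Of Image) marker: last two bytes of a JPEG.
const EOI: [u8; 2] = [0xFF, 0xD9];

/// Checks only the first and last two bytes; it does not parse the JPEG.
pub fn check_jpeg_markers(data: &[u8]) -> JpegMarkerCheck {
    if !data.starts_with(&SOI) {
        JpegMarkerCheck::MissingSoi
    } else if !data.ends_with(&EOI) {
        JpegMarkerCheck::MissingEoi
    } else {
        JpegMarkerCheck::Valid
    }
}

/// Returns the input bytes unchanged plus an optional diagnostic code.
/// The bytes are passed through even when validation fails.
pub fn dct_passthrough(data: &[u8]) -> (Vec<u8>, Option<&'static str>) {
    let diagnostic = match check_jpeg_markers(data) {
        JpegMarkerCheck::Valid => None,
        JpegMarkerCheck::MissingSoi | JpegMarkerCheck::MissingEoi => {
            Some("STREAM_INVALID_JPEG")
        }
    };
    (data.to_vec(), diagnostic)
}
```

Because the check uses `starts_with` / `ends_with` on slices, empty or one-byte inputs fall into the `MissingSoi` branch without any indexing that could panic.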
### 2. JBIG2Decode Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3697-3716
- **Diagnostic**: `OCR_JBIG2_UNSUPPORTED` emitted when full-render feature is disabled
- **Passthrough**: Raw JBIG2 bytes passed through unchanged
- **Globals Recording**: `/JBIG2Globals` reference extracted and stored in StreamMeta
- **Tests**:
  - `test_jbig2_passthrough` - integration test for passthrough
  - `prop_jbig2_decode_never_panics` - proptest for random input
  - `prop_jbig2_passthrough_never_panics` - proptest via get_decoder
### 3. JPXDecode (JPEG2000) Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3745-3757
- **JP2 Box Magic Validation**: Checks first 12 bytes match JP2 signature (00 00 00 0C 6A 50 20 20 0D 0A 87 0A)
- **Diagnostics**:
  - `OCR_JPX_UNSUPPORTED` emitted when neither full-render nor libopenjp2 is available
  - `STREAM_INVALID_JPX` emitted when JP2 box magic doesn't match (raw J2K or corrupt)
- **Passthrough**: Raw JPEG2000 bytes passed through unchanged
- **Tests**:
  - `test_jpxstream_passthrough_valid_jp2` - verifies JP2 passthrough
  - `test_jpxstream_passthrough_raw_j2k` - verifies raw J2K passthrough
  - `test_jpxstream_passthrough_empty` - edge case
  - `prop_jpx_decode_never_panics` - proptest for random input
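
A minimal sketch of the JP2 box magic check, assuming the 12-byte signature quoted above. The `JpxKind` enum and both function names are hypothetical. Telling a raw J2K codestream apart from corrupt data by its SOC+SIZ marker prefix (`FF 4F FF 51`) is an addition for illustration; per the notes, both cases produce `STREAM_INVALID_JPX`.

```rust
/// JP2 signature box: length 12, type "jP  ", content 0D 0A 87 0A.
const JP2_SIGNATURE: [u8; 12] = [
    0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, 0x87, 0x0A,
];

/// A raw JPEG 2000 codestream begins with SOC (FF 4F) followed by SIZ (FF 51).
const J2K_CODESTREAM_START: [u8; 4] = [0xFF, 0x4F, 0xFF, 0x51];

/// Hypothetical classification of a JPXDecode payload.
#[derive(Debug, PartialEq, Eq)]
pub enum JpxKind {
    Jp2,
    RawCodestream,
    Unrecognized,
}

pub fn classify_jpx(data: &[u8]) -> JpxKind {
    if data.starts_with(&JP2_SIGNATURE) {
        JpxKind::Jp2
    } else if data.starts_with(&J2K_CODESTREAM_START) {
        JpxKind::RawCodestream
    } else {
        JpxKind::Unrecognized
    }
}

/// Diagnostic code for the magic check; anything that is not a JP2 file
/// (raw J2K or corrupt) is flagged, matching the behavior described above.
pub fn jpx_magic_diagnostic(data: &[u8]) -> Option<&'static str> {
    match classify_jpx(data) {
        JpxKind::Jp2 => None,
        JpxKind::RawCodestream | JpxKind::Unrecognized => Some("STREAM_INVALID_JPX"),
    }
}
```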
### 4. CCITTFaxDecode Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3667-3695
- **Diagnostic**: `OCR_CCITT_UNSUPPORTED` emitted when neither full-render nor libtiff is available
- **Parameter Parsing**: Parses /K, /Columns, /Rows, /EncodedByteAlign, /EndOfLine, /BlackIs1
- **Defaults**: Uses DEFAULT_COLUMNS (1728) when /Columns missing
- **Passthrough**: Raw CCITT bytes passed through unchanged
- **Tests**:
  - `test_ccittfax_passthrough_with_columns` - verifies passthrough with params
  - `test_ccittfax_passthrough_missing_columns` - verifies default used
  - `test_ccittfax_parse_params_with_all_fields` - verifies parameter parsing
  - `prop_ccitt_decode_never_panics` - proptest for random input
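
The parameter parsing with defaults might look roughly like the sketch below. `ParamValue`, `CcittParams`, and `parse_ccitt_params` are hypothetical names, and the `HashMap` stands in for whatever dictionary type the parser actually uses. Only the 1728 default for `/Columns` comes from these notes; the other defaults follow the PDF specification's CCITTFaxDecode parameter table. Falling back to the default when a value has the wrong type or is out of range is an assumption.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a PDF object in a /DecodeParms dictionary.
#[derive(Debug, Clone, PartialEq)]
pub enum ParamValue {
    Int(i64),
    Bool(bool),
}

pub const DEFAULT_COLUMNS: u32 = 1728;

#[derive(Debug, Clone, PartialEq)]
pub struct CcittParams {
    pub k: i32,
    pub columns: u32,
    pub rows: u32,
    pub encoded_byte_align: bool,
    pub end_of_line: bool,
    pub black_is_1: bool,
}

fn int(dict: &HashMap<&str, ParamValue>, key: &str) -> Option<i64> {
    match dict.get(key) {
        Some(ParamValue::Int(v)) => Some(*v),
        _ => None, // missing or wrong type: caller applies the default
    }
}

fn boolean(dict: &HashMap<&str, ParamValue>, key: &str) -> bool {
    // All boolean parameters parsed here default to false.
    matches!(dict.get(key), Some(ParamValue::Bool(true)))
}

pub fn parse_ccitt_params(dict: &HashMap<&str, ParamValue>) -> CcittParams {
    CcittParams {
        k: int(dict, "K").and_then(|v| i32::try_from(v).ok()).unwrap_or(0),
        columns: int(dict, "Columns")
            .and_then(|v| u32::try_from(v).ok())
            .unwrap_or(DEFAULT_COLUMNS),
        rows: int(dict, "Rows").and_then(|v| u32::try_from(v).ok()).unwrap_or(0),
        encoded_byte_align: boolean(dict, "EncodedByteAlign"),
        end_of_line: boolean(dict, "EndOfLine"),
        black_is_1: boolean(dict, "BlackIs1"),
    }
}
```

Using `try_from` rather than `as` casts means a negative or oversized `/Columns` falls back to the default instead of silently wrapping.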
## Acceptance Criteria Status
### Critical Test
- **PASS**: DCTDecode fixture with known JPEG: bytes unchanged, SOI marker present
  - Test: `test_dctdecode_passthrough_valid_jpeg` (line 1951)
### Diagnostics
- **PASS**: JPEG without EOI marker passes through with `STREAM_INVALID_JPEG` warning
  - Test: `test_dctdecode_passthrough_missing_eoi` (line 1982)
- **PASS**: JBIG2Decode without full-render emits `OCR_JBIG2_UNSUPPORTED`
  - Emission at line 3703 (emits when `cfg!(feature = "full-render")` is false)
- **PASS**: JPXDecode without full-render emits `OCR_JPX_UNSUPPORTED`
  - Emission at line 3750 (via `JpxDecoder::emit_unsupported_diagnostic`)
- **PASS**: CCITTFaxDecode without libtiff emits `OCR_CCITT_UNSUPPORTED`
  - Emission at line 3690 (emits when `!has_full_render && !has_libtiff`)
### Validation
- **PASS**: JP2 box magic check detects malformed JPX with `STREAM_INVALID_JPX`
  - Validation at line 3754 (via `JpxDecoder::validate_jp2_magic`)
### INV-8 Compliance
- **PASS**: Proptests feeding random byte sequences to each filter never panic
  - Tests: `prop_dct_decode_never_panics`, `prop_jbig2_decode_never_panics`,
    `prop_jpx_decode_never_panics`, `prop_ccitt_decode_never_panics`
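
The never-panic property can be illustrated without the proptest crate. The sketch below deliberately swaps proptest for a small hand-rolled xorshift generator so that it needs only the standard library; it is not how the listed `prop_*` tests are written. `passthrough_with_check` is a hypothetical stand-in for a filter under test.

```rust
use std::panic;

/// Tiny xorshift64 generator; a stand-in for proptest's input strategies.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Produces a byte vector of pseudo-random length in 0..=max_len.
    fn bytes(&mut self, max_len: usize) -> Vec<u8> {
        let len = (self.next_u64() as usize) % (max_len + 1);
        (0..len).map(|_| (self.next_u64() & 0xFF) as u8).collect()
    }
}

/// Hypothetical filter under test: passes bytes through and reports
/// whether they are framed by JPEG SOI/EOI markers.
fn passthrough_with_check(data: &[u8]) -> (Vec<u8>, bool) {
    let framed = data.starts_with(&[0xFF, 0xD8]) && data.ends_with(&[0xFF, 0xD9]);
    (data.to_vec(), framed)
}

/// Returns true if no generated input caused a panic and every output
/// equals its input byte-for-byte.
pub fn fuzz_never_panics(seed: u64, cases: usize) -> bool {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    (0..cases).all(|_| {
        let input = rng.bytes(64);
        let result = panic::catch_unwind(|| passthrough_with_check(&input));
        matches!(result, Ok((out, _)) if out == input)
    })
}
```

Unlike proptest, this loop does no shrinking, so a failing case would be reported only through the `false` return value.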
## Files Modified
### Core Implementation
- `crates/pdftract-core/src/parser/stream.rs`: Diagnostic emissions for all 4 filters
- `crates/pdftract-core/src/decoder/jbig2.rs`: JBIG2Decoder with diagnostic emission
- `crates/pdftract-core/src/decoder/jpx.rs`: JpxDecoder with JP2 validation and diagnostics
### Tests
- `tests/proptest/stream.rs`: Added proptest coverage for all 4 filters
  - 14 new property tests verifying never-panic and passthrough behavior
## Feature Gate Behavior
### With full-render feature
- All `OCR_*_UNSUPPORTED` diagnostics suppressed
- Image data passed to OCR pipeline for pdfium-render decoding
### Without full-render feature
- `OCR_JBIG2_UNSUPPORTED` emitted per JBIG2 stream (EC-11)
- `OCR_JPX_UNSUPPORTED` emitted per JPX stream when libopenjp2 is also unavailable (EC-12)
- `OCR_CCITT_UNSUPPORTED` emitted per CCITT stream when libtiff is also unavailable (EC-13)
- Data still passed through for downstream consumption
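
The gating rules above can be summarized as a small decision function. This is a sketch under assumptions: the notes describe the real check as `cfg!(feature = "full-render")` plus native-library availability, whereas the hypothetical `Capabilities` struct here takes plain booleans so the logic can be exercised directly. `ImageFilter` and `unsupported_diagnostic` are also illustrative names.

```rust
/// Image filters covered by the passthroughs in these notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFilter {
    Dct,
    Jbig2,
    Jpx,
    CcittFax,
}

/// Hypothetical capability flags standing in for compile-time feature
/// and native-library checks.
#[derive(Debug, Clone, Copy)]
pub struct Capabilities {
    pub full_render: bool,
    pub libopenjp2: bool,
    pub libtiff: bool,
}

/// Returns the OCR_*_UNSUPPORTED code to emit for a stream, if any.
/// The stream bytes are passed through regardless of the result.
pub fn unsupported_diagnostic(
    filter: ImageFilter,
    caps: Capabilities,
) -> Option<&'static str> {
    match filter {
        ImageFilter::Jbig2 if !caps.full_render => Some("OCR_JBIG2_UNSUPPORTED"),
        ImageFilter::Jpx if !caps.full_render && !caps.libopenjp2 => {
            Some("OCR_JPX_UNSUPPORTED")
        }
        ImageFilter::CcittFax if !caps.full_render && !caps.libtiff => {
            Some("OCR_CCITT_UNSUPPORTED")
        }
        // DCTDecode has no OCR_* diagnostic; everything else is supported.
        _ => None,
    }
}
```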
## Verification Date
2026-05-28
## Notes
- Diagnostics are emitted in the `decode_stream_impl` function rather than in the individual decoder implementations, because the `StreamDecoder` trait provides no way to return diagnostics
- Passthrough pattern preserves all bytes unchanged, including malformed data (INV-8)
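
To illustrate the design point in these notes, the sketch below shows a decoder trait that returns bytes only, with a wrapper function that owns the diagnostics sink. The trait signature here is a guess, not the real `StreamDecoder` definition, and `decode_stream_sketch` is a hypothetical simplification of `decode_stream_impl` covering just two filters.

```rust
/// Hypothetical stand-in for the `StreamDecoder` trait: it returns bytes
/// only, so implementations have no channel for reporting diagnostics.
pub trait StreamDecoder {
    fn decode(&self, data: &[u8]) -> Vec<u8>;
}

/// Passthrough decoder: hands the input back unchanged.
pub struct PassthroughDecoder;

impl StreamDecoder for PassthroughDecoder {
    fn decode(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}

/// The wrapper decides which diagnostics to record, then delegates to the
/// decoder. Malformed input is still passed through byte-for-byte.
pub fn decode_stream_sketch(
    filter: &str,
    data: &[u8],
    full_render: bool,
    diagnostics: &mut Vec<&'static str>,
) -> Vec<u8> {
    match filter {
        "DCTDecode" => {
            let framed =
                data.starts_with(&[0xFF, 0xD8]) && data.ends_with(&[0xFF, 0xD9]);
            if !framed {
                diagnostics.push("STREAM_INVALID_JPEG");
            }
        }
        "JBIG2Decode" if !full_render => diagnostics.push("OCR_JBIG2_UNSUPPORTED"),
        _ => {}
    }
    PassthroughDecoder.decode(data)
}
```

Keeping the sink in the wrapper means the trait stays minimal, at the cost of the wrapper needing to know about each filter's validation rules.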