This commit adds the codespace range parser for CMap streams. The parser extracts the begincodespacerange / endcodespacerange blocks that define legal byte-width boundaries for character codes in a CMap.

## Implementation

- CodespaceRange: Single range with lo/hi bounds (stored as [u8; 4]) and width (1-4 bytes)
- CodespaceRanges: Collection with SmallVec<[CodespaceRange; 8]>
- CodespaceParser: PostScript-style tokenizer for begincodespacerange blocks

## Acceptance Criteria (all PASS)

- Parse <00> <7F> → 1 range, width=1 ✅
- Parse <00> <7F> <8000> <FFFF> in one block → 2 ranges ✅
- Width inference: 2-char hex → width=1; 4-char hex → width=2 ✅
- Case-insensitive hex (<C0> and <c0> equivalent) ✅
- Malformed range (width mismatch) → diagnostic + skipped ✅
- Empty CMap → empty ranges ✅
- JIS range <8140> <FEFE> → 2-byte CJK ✅
- 3-byte and 4-byte range support ✅

Also adds encrypted fixture provenance entries to PROVENANCE.md.

Co-Authored-By: Claude Code <noreply@anthropic.com>

# pdftract-57np8: Image Filter Passthroughs Verification

## Task
Implement DCTDecode / JBIG2Decode / JPXDecode / CCITTFaxDecode passthroughs with SOI/EOI validation + `OCR_*_UNSUPPORTED` diagnostics

## Status: COMPLETE

All four image filter passthroughs are implemented in `crates/pdftract-core/src/parser/stream.rs` with proper validation and diagnostic emission.

## Implementation Summary

### 1. DCTDecode (JPEG) Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3718-3743
- **SOI Marker Validation**: Checks first 2 bytes are 0xFF 0xD8 (SOI = Start Of Image)
- **EOI Marker Validation**: Checks last 2 bytes are 0xFF 0xD9 (EOI = End Of Image)
- **Diagnostic**: `STREAM_INVALID_JPEG` emitted for missing SOI or EOI markers
- **Passthrough**: Raw JPEG bytes passed through unchanged
- **Tests**:
  - `test_dctdecode_passthrough_valid_jpeg` - verifies bytes unchanged with SOI/EOI
  - `test_dctdecode_passthrough_missing_soi` - verifies warning without SOI
  - `test_dctdecode_passthrough_missing_eoi` - verifies warning without EOI
  - `prop_dct_decode_never_panics` - proptest for random input

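The marker checks listed above can be sketched as follows. This is a minimal illustration, not the code in `stream.rs`: the `JpegIssue` enum and the `dct_passthrough` function are hypothetical names, and the real implementation reports a single `STREAM_INVALID_JPEG` diagnostic rather than distinguishing the two cases.

```rust
/// Hypothetical issue type for this sketch; pdftract's real diagnostic
/// API is not shown in this note.
#[derive(Debug, PartialEq, Eq)]
pub enum JpegIssue {
    MissingSoi,
    MissingEoi,
}

/// JPEG Start Of Image marker.
const SOI: [u8; 2] = [0xFF, 0xD8];
/// JPEG End Of Image marker.
const EOI: [u8; 2] = [0xFF, 0xD9];

/// Returns the input slice untouched, together with any marker problems.
/// The bytes are never modified, even when a marker is missing.
pub fn dct_passthrough(data: &[u8]) -> (&[u8], Vec<JpegIssue>) {
    let mut issues = Vec::new();
    if !data.starts_with(&SOI) {
        issues.push(JpegIssue::MissingSoi);
    }
    if !data.ends_with(&EOI) {
        issues.push(JpegIssue::MissingEoi);
    }
    (data, issues)
}
```

In this sketch an empty stream reports both issues, and a truncated JPEG still comes back byte-for-byte so a downstream consumer can attempt its own recovery.
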
### 2. JBIG2Decode Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3697-3716
- **Diagnostic**: `OCR_JBIG2_UNSUPPORTED` emitted when full-render feature is disabled
- **Passthrough**: Raw JBIG2 bytes passed through unchanged
- **Globals Recording**: `/JBIG2Globals` reference extracted and stored in `StreamMeta`
- **Tests**:
  - `test_jbig2_passthrough` - integration test for passthrough
  - `prop_jbig2_decode_never_panics` - proptest for random input
  - `prop_jbig2_passthrough_never_panics` - proptest via `get_decoder`

### 3. JPXDecode (JPEG2000) Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3745-3757
- **JP2 Box Magic Validation**: Checks first 12 bytes match JP2 signature (00 00 00 0C 6A 50 20 20 0D 0A 87 0A)
- **Diagnostics**:
  - `OCR_JPX_UNSUPPORTED` emitted when both full-render and libopenjp2 are unavailable
  - `STREAM_INVALID_JPX` emitted when JP2 box magic doesn't match (raw J2K or corrupt)
- **Passthrough**: Raw JPEG2000 bytes passed through unchanged
- **Tests**:
  - `test_jpxstream_passthrough_valid_jp2` - verifies JP2 passthrough
  - `test_jpxstream_passthrough_raw_j2k` - verifies raw J2K passthrough
  - `test_jpxstream_passthrough_empty` - edge case
  - `prop_jpx_decode_never_panics` - proptest for random input

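A minimal sketch of the 12-byte signature check described above. The function and constant names are hypothetical and do not mirror `JpxDecoder::validate_jp2_magic`. The `looks_like_raw_j2k` helper is an addition for illustration only: a raw JPEG 2000 codestream begins with the SOC marker (FF 4F) followed by SIZ (FF 51), but per this note both raw J2K and corrupt data produce the same `STREAM_INVALID_JPX` diagnostic.

```rust
/// JP2 signature box: box length 12, box type "jP  ", payload 0D 0A 87 0A.
pub const JP2_SIGNATURE: [u8; 12] = [
    0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, 0x87, 0x0A,
];

/// Start of a raw JPEG 2000 codestream: SOC marker, then SIZ marker.
/// Not part of the check described in this note; shown for contrast.
pub const J2K_CODESTREAM_START: [u8; 4] = [0xFF, 0x4F, 0xFF, 0x51];

/// True when the stream starts with the JP2 signature box.
pub fn has_jp2_magic(data: &[u8]) -> bool {
    data.starts_with(&JP2_SIGNATURE)
}

/// True when the stream looks like a bare codestream without JP2 boxes.
pub fn looks_like_raw_j2k(data: &[u8]) -> bool {
    data.starts_with(&J2K_CODESTREAM_START)
}

/// Diagnostic code for a stream that lacks the JP2 signature, if any.
/// The bytes themselves are passed through regardless of the result.
pub fn jpx_validation_diagnostic(data: &[u8]) -> Option<&'static str> {
    if has_jp2_magic(data) {
        None
    } else {
        Some("STREAM_INVALID_JPX")
    }
}
```
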
### 4. CCITTFaxDecode Passthrough
- **Location**: `crates/pdftract-core/src/parser/stream.rs` lines 3667-3695
- **Diagnostic**: `OCR_CCITT_UNSUPPORTED` emitted when both full-render and libtiff are unavailable
- **Parameter Parsing**: Parses `/K`, `/Columns`, `/Rows`, `/EncodedByteAlign`, `/EndOfLine`, `/BlackIs1`
- **Defaults**: Uses `DEFAULT_COLUMNS` (1728) when `/Columns` is missing
- **Passthrough**: Raw CCITT bytes passed through unchanged
- **Tests**:
  - `test_ccittfax_passthrough_with_columns` - verifies passthrough with params
  - `test_ccittfax_passthrough_missing_columns` - verifies default used
  - `test_ccittfax_parse_params_with_all_fields` - verifies parameter parsing
  - `prop_ccitt_decode_never_panics` - proptest for random input

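The parameter parsing with defaults might look roughly like the sketch below. Everything except `DEFAULT_COLUMNS` and the six key names is an assumption: the `Param` enum stands in for pdftract's real PDF object type, and `CcittParams` and `parse_ccitt_params` are hypothetical. The fallback values other than 1728 are the defaults the PDF specification gives for `CCITTFaxDecode` (`/K` 0, `/Rows` 0, booleans false); this note only confirms the `/Columns` default.

```rust
use std::collections::HashMap;

/// Default image width in pixels when /Columns is absent.
pub const DEFAULT_COLUMNS: u32 = 1728;

/// Stand-in for a PDF object value (hypothetical; illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Param {
    Int(i64),
    Bool(bool),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CcittParams {
    pub k: i64,
    pub columns: u32,
    pub rows: u32,
    pub encoded_byte_align: bool,
    pub end_of_line: bool,
    pub black_is_1: bool,
}

fn get_int(dict: &HashMap<&str, Param>, key: &str, default: i64) -> i64 {
    match dict.get(key) {
        Some(Param::Int(v)) => *v,
        _ => default, // missing or wrong type: fall back
    }
}

fn get_dimension(dict: &HashMap<&str, Param>, key: &str, default: u32) -> u32 {
    match dict.get(key) {
        // Reject negative or oversized values instead of wrapping.
        Some(Param::Int(v)) if *v >= 0 && *v <= i64::from(u32::MAX) => *v as u32,
        _ => default,
    }
}

fn get_bool(dict: &HashMap<&str, Param>, key: &str, default: bool) -> bool {
    match dict.get(key) {
        Some(Param::Bool(v)) => *v,
        _ => default,
    }
}

/// Reads the six parameters, substituting a default for anything missing
/// or malformed so that parsing itself can never fail.
pub fn parse_ccitt_params(dict: &HashMap<&str, Param>) -> CcittParams {
    CcittParams {
        k: get_int(dict, "K", 0),
        columns: get_dimension(dict, "Columns", DEFAULT_COLUMNS),
        rows: get_dimension(dict, "Rows", 0),
        encoded_byte_align: get_bool(dict, "EncodedByteAlign", false),
        end_of_line: get_bool(dict, "EndOfLine", false),
        black_is_1: get_bool(dict, "BlackIs1", false),
    }
}
```

Falling back instead of returning an error keeps the passthrough usable on damaged parameter dictionaries, which fits the never-panic requirement discussed under INV-8.
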
## Acceptance Criteria Status

### Critical Test
- **PASS**: DCTDecode fixture with known JPEG: bytes unchanged, SOI marker present
  - Test: `test_dctdecode_passthrough_valid_jpeg` (line 1951)

### Diagnostics
- **PASS**: JPEG without EOI marker passes through with `STREAM_INVALID_JPEG` warning
  - Test: `test_dctdecode_passthrough_missing_eoi` (line 1982)
- **PASS**: JBIG2Decode without full-render emits `OCR_JBIG2_UNSUPPORTED`
  - Emission at line 3703 (emits when `cfg!(feature = "full-render")` is false)
- **PASS**: JPXDecode without full-render emits `OCR_JPX_UNSUPPORTED`
  - Emission at line 3750 (via `JpxDecoder::emit_unsupported_diagnostic`)
- **PASS**: CCITTFaxDecode without libtiff emits `OCR_CCITT_UNSUPPORTED`
  - Emission at line 3690 (emits when `!has_full_render && !has_libtiff`)

### Validation
- **PASS**: JP2 box magic check detects malformed JPX with `STREAM_INVALID_JPX`
  - Validation at line 3754 (via `JpxDecoder::validate_jp2_magic`)

### INV-8 Compliance
- **PASS**: Proptest-generated random byte sequences never cause a panic in any of the four filters
  - Tests: `prop_dct_decode_never_panics`, `prop_jbig2_decode_never_panics`,
    `prop_jpx_decode_never_panics`, `prop_ccitt_decode_never_panics`

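To illustrate what a never-panic property checks, here is a dependency-free stand-in. It is not the project's test code: the real tests use the proptest crate, while this sketch swaps in a small xorshift generator and `std::panic::catch_unwind` so it needs only the standard library. `XorShift64` and `first_panicking_input` are hypothetical names.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Tiny xorshift64 generator so the sketch needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Feeds `cases` random byte buffers (length 0..=max_len) to `decode` and
/// returns the first input that makes it panic, or `None` if none does.
pub fn first_panicking_input<F>(
    decode: F,
    cases: usize,
    max_len: usize,
    seed: u64,
) -> Option<Vec<u8>>
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    let mut rng = XorShift64::new(seed);
    for _ in 0..cases {
        let len = (rng.next_u64() as usize) % (max_len + 1);
        let input: Vec<u8> = (0..len).map(|_| (rng.next_u64() & 0xFF) as u8).collect();
        // A panic inside `decode` is caught and reported as a failing input.
        if catch_unwind(AssertUnwindSafe(|| decode(&input))).is_err() {
            return Some(input);
        }
    }
    None
}
```

A pure passthrough such as `|d: &[u8]| d.to_vec()` yields `None` for any seed, while a decoder that indexes `d[0]` without a length check is caught as soon as an empty buffer is generated. Unlike proptest, this sketch does not shrink failing inputs.
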
## Files Modified

### Core Implementation
- `crates/pdftract-core/src/parser/stream.rs`: Diagnostic emissions for all 4 filters
- `crates/pdftract-core/src/decoder/jbig2.rs`: JBIG2Decoder with diagnostic emission
- `crates/pdftract-core/src/decoder/jpx.rs`: JpxDecoder with JP2 validation and diagnostics

### Tests
- `tests/proptest/stream.rs`: Added proptest coverage for all 4 filters
  - 14 new property tests verifying never-panic and passthrough behavior

## Feature Gate Behavior

### With full-render feature
- All diagnostics suppressed
- Image data passed to OCR pipeline for pdfium-render decoding

### Without full-render feature
- `OCR_JBIG2_UNSUPPORTED` emitted per JBIG2 stream (EC-11)
- `OCR_JPX_UNSUPPORTED` emitted per JPX stream (EC-12)
- `OCR_CCITT_UNSUPPORTED` emitted per CCITT stream (EC-13)
- Data still passed through for downstream consumption

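The gating rules can be summarised as a small decision function. This is a sketch under assumptions: `ImageFilter`, `Capabilities`, and `unsupported_diagnostic` are hypothetical names, and the native-library conditions for JPX and CCITT follow the Implementation Summary (libopenjp2 and libtiff respectively). In real code the `full_render` flag would presumably come from `cfg!(feature = "full-render")`; it is a plain field here so that both branches can be exercised.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFilter {
    Jbig2,
    Jpx,
    CcittFax,
}

/// What the current build can decode (hypothetical struct for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Capabilities {
    pub full_render: bool,
    pub libopenjp2: bool,
    pub libtiff: bool,
}

/// Returns the diagnostic code to emit for one stream, if any.
/// The stream bytes are passed through either way.
pub fn unsupported_diagnostic(filter: ImageFilter, caps: Capabilities) -> Option<&'static str> {
    // With full-render enabled, none of the unsupported diagnostics fire.
    if caps.full_render {
        return None;
    }
    match filter {
        ImageFilter::Jbig2 => Some("OCR_JBIG2_UNSUPPORTED"),
        ImageFilter::Jpx if !caps.libopenjp2 => Some("OCR_JPX_UNSUPPORTED"),
        ImageFilter::CcittFax if !caps.libtiff => Some("OCR_CCITT_UNSUPPORTED"),
        _ => None,
    }
}
```
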
## Verification Date
2026-05-28

## Notes
- Diagnostics are emitted in the `decode_stream_impl` function, not in the individual decoder implementations
  - This is because the `StreamDecoder` trait doesn't provide a way to return diagnostics
- The passthrough pattern preserves all bytes unchanged, including malformed data (INV-8)

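The design constraint in the first note can be pictured with the sketch below. The trait and function signatures are assumptions made for illustration; the real `StreamDecoder` and `decode_stream_impl` signatures are not shown in this note. The point is only that when `decode` returns nothing but bytes, the caller is the one place left to run checks and collect diagnostic codes.

```rust
/// Assumed shape of the trait: it returns bytes and nothing else,
/// so a decoder has no channel for reporting diagnostics.
pub trait StreamDecoder {
    fn decode(&self, data: &[u8]) -> Vec<u8>;
}

/// Passthrough decoder: copies the input unchanged.
pub struct Passthrough;

impl StreamDecoder for Passthrough {
    fn decode(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}

/// Example check (hypothetical): flags streams without a JPEG SOI marker.
pub fn require_soi(data: &[u8]) -> Option<&'static str> {
    if data.starts_with(&[0xFF, 0xD8]) {
        None
    } else {
        Some("STREAM_INVALID_JPEG")
    }
}

/// Caller-side wrapper: runs the check, records any diagnostic code,
/// then delegates to the decoder. Malformed input is still decoded.
pub fn decode_stream_impl(
    decoder: &dyn StreamDecoder,
    check: fn(&[u8]) -> Option<&'static str>,
    data: &[u8],
    diagnostics: &mut Vec<&'static str>,
) -> Vec<u8> {
    if let Some(code) = check(data) {
        diagnostics.push(code);
    }
    decoder.decode(data)
}
```

An alternative would be to change the trait so that `decode` returns its diagnostics alongside the bytes; keeping the checks in the caller avoids touching every existing decoder implementation.
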