pdftract-1f0cj: ID-to-EI raw-bytes scanner verification
Summary
The ID-to-EI raw-bytes scanner (scan_inline_image_data in crates/pdftract-core/src/parser/inline_image.rs) is already fully implemented and meets all acceptance criteria.
Implementation Details
Location: crates/pdftract-core/src/parser/inline_image.rs:335-390
How it works
- Cursor positioning: Starts immediately after the `ID` keyword and its required whitespace byte
- Scanning algorithm: Byte-by-byte scan looking for pattern `[ws, 0x45, 0x49]` where:
  - `ws` is any PDF whitespace byte (0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20)
  - `0x45` is 'E', `0x49` is 'I'
- Returns: `(image_bytes: Vec<u8>, bytes_consumed: usize)` where:
  - `image_bytes` excludes the preceding whitespace and EI itself
  - `bytes_consumed` includes everything from ID end to EI end
- Lexer advancement: `lexer.skip_bytes(bytes_consumed as u64)` positions cursor after EI
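The scan described above can be sketched as follows. This is a minimal illustration, not the code in `inline_image.rs`: the names `is_pdf_whitespace` and `scan_for_ei_sketch` are hypothetical, the third tuple element stands in for the diagnostic, and treating a leading `EI` as the empty-image case (with the whitespace byte after `ID` acting as the preceding whitespace) is an assumption about how the real scanner handles it.

```rust
/// The six PDF whitespace bytes listed in the note above.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// Hypothetical stand-in for `scan_inline_image_data`.
/// `data` begins at the first byte after `ID` and its single whitespace byte.
/// Returns (image_bytes, bytes_consumed, found_ei).
fn scan_for_ei_sketch(data: &[u8]) -> (Vec<u8>, usize, bool) {
    // Assumption: `ID<ws>EI` is an empty image, because the whitespace
    // byte after `ID` also serves as the whitespace before `EI`.
    if data.starts_with(b"EI") {
        return (Vec::new(), 2, true);
    }
    // Slide a 3-byte window over the data looking for [ws, 'E', 'I'].
    for (i, w) in data.windows(3).enumerate() {
        if is_pdf_whitespace(w[0]) && w[1] == b'E' && w[2] == b'I' {
            // Image bytes stop before the whitespace; consumption runs
            // through the 'I' so the caller can skip past the terminator.
            return (data[..i].to_vec(), i + 3, true);
        }
    }
    // No terminator: return everything; the caller would emit a diagnostic.
    (data.to_vec(), data.len(), false)
}
```

For the input `b"ABCDEI\nEI"`, the sketch skips the inner `EI` (preceded by `D`, not whitespace) and stops at the one after the line feed, returning `b"ABCDEI"` with 9 bytes consumed.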
Key design decisions
- Whitespace-preceded rule: The EI delimiter must be preceded by whitespace per PDF spec 8.9.7. This distinguishes the terminator from spurious `0x45 0x49` sequences that may appear in compressed image data.
- End-of-stream handling: If no EI is found, the scanner returns all remaining bytes and emits an `InlineImageNoEi` diagnostic. This handles malformed PDFs gracefully.
- Empty image: Valid per spec - `ID EI` immediately returns an empty slice.
Acceptance Criteria Verification
| Criterion | Status | Notes |
|---|---|---|
| `ABCD<ws>EI` → returns `b"ABCD"` | PASS | Test at line 868-876 |
| `ABCDEI<ws>EI` → returns `b"ABCDEI"` | PASS | Test at line 879-888 (inner EI not preceded by ws) |
| No EI → returns remaining bytes + diagnostic | PASS | Test at line 902-917 |
| Lexer positioned after EI | PASS | Test at line 973-985 |
Test Coverage
The module includes comprehensive tests in crates/pdftract-core/src/parser/inline_image.rs:749-986:
- `test_scan_inline_image_data_basic` - Basic case
- `test_scan_inline_image_data_with_embedded_ei` - EI in data not preceded by ws
- `test_scan_inline_image_data_empty` - Empty image
- `test_scan_inline_image_data_no_ei` - No terminator
- `test_scan_inline_image_data_various_whitespace` - All 6 ws bytes
- `test_scan_inline_image_data_binary_content` - Binary data with 0x45/0x49 bytes
- `test_scan_inline_image_data_lexer_position` - Lexer advancement verification
Known Limitations
Per the task description's "Critical considerations":
> Image data may contain the pattern `<ws>EI` SPURIOUSLY (e.g., a JBIG2 stream might have such bytes); this is RARE but possible. Acceptable solution: trust the spec's filter+dimensions-determine-length convention OR adopt the whitespace-EI heuristic and accept that malformed images may cause early termination. The plan picks the whitespace heuristic; document as a known limitation.
This implementation uses the whitespace-EI heuristic. In the rare case that compressed image data contains a literal <ws>EI sequence, the scanner will terminate early. A more robust solution would use the inline image header's width/height/bpc/colorspace to compute the exact expected byte length, but that is deferred to a future version (v0.2.0+ per ADR).
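To make the deferred alternative concrete, here is a rough sketch of the length computation it would rely on. The function name and signature are hypothetical and not part of the crate. It assumes the caller has already resolved the colour space to a component count (for example 1 for DeviceGray, 3 for DeviceRGB, 4 for DeviceCMYK). The result is the size of the decoded sample data, so it equals the raw byte count between `ID` and `EI` only when no filter is applied; with a filter, the decoder would have to run until that many bytes have been produced.

```rust
/// Hypothetical helper: expected size in bytes of the decoded samples
/// of an inline image. Returns `None` if the arithmetic overflows.
fn expected_decoded_len(
    width: u64,
    height: u64,
    bits_per_component: u64,
    components: u64,
) -> Option<u64> {
    // Bits needed for one row of samples.
    let bits_per_row = width
        .checked_mul(components)?
        .checked_mul(bits_per_component)?;
    // Each row is padded up to a whole number of bytes.
    let bytes_per_row = bits_per_row.checked_add(7)? / 8;
    bytes_per_row.checked_mul(height)
}
```

For example, a 10x2 image at 1 bit per component with one component needs 2 bytes per row (10 bits rounded up), so 4 bytes in total.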
References
- Plan section: Phase 3.5 Parsing (line 1610-1620)
- PDF spec: ISO 32000-1:2008, section 8.9.7 "Inline Images"