Two fixes:
1. Hex string lexer now flushes dangling nibble when encountering invalid
characters. For `<4X8Y>`, the X and Y are invalid, so we flush nibble 4
as 0x40, then flush nibble 8 as 0x80, producing `\x40\x80`.
2. Fixed skip_whitespace_and_comments() to properly handle whitespace
after comments. The previous logic only continued looping if the next
byte was `%`, missing cases where whitespace follows a comment.
All 52 lexer tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add test_cycle_detection_in_page_tree to verify that circular references
in the /Pages tree are detected and handled gracefully without panicking.
The test creates a page tree with a cycle (parent -> child1 -> child2 -> child1)
and verifies that the flattener returns the valid pages while pruning the
cyclic portion.
Acceptance criteria verified:
- 3-level /Pages inheritance with MediaBox: PASS
- EC-09 missing MediaBox defaults to US Letter: PASS
- /Pages tree with cycles detected: PASS
- /Rotate value 45 clamped to 0: PASS
- Page count validation: PASS
- proptest random shapes never panic: PASS
- INV-8 no panics on invalid input: PASS
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5tmcg
Bead-Id: pdftract-4iier
- Remove incorrect #[cfg(feature = "proptest")] since proptest is not behind a feature
- Update verification note to reflect 30 passing tests (includes 2 proptest tests)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix stream.rs test cases to use PdfStream::new() correctly (takes PdfDict directly, not wrapped in PdfObject::Dict)
- Fix catalog.rs test cases to use PdfObject::Dict(Box::new(dict)) (API change)
- Update parse_catalog to return Ok(empty_catalog) with STRUCT_MISSING_KEY diagnostic instead of Err when /Pages is missing (per bead acceptance criteria)
All catalog parser tests pass:
- 27 tests including 6 proptests for INV-8 compliance
- PageLabels number tree with mixed roman/arabic styles
- Tagged PDF detection via /MarkInfo
- Optional fields (Outlines, Version, etc.)
- proptest: random PdfObject as /Root never panics
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.
Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields
Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences
All 27 tests pass including proptests for fuzzing robustness (INV-8).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Changed Diagnostic::msg from String to Cow<'static, str> to avoid
allocations for static error messages. Static messages now use
Cow::Borrowed, while dynamic formatted messages use Cow::Owned.
Also fixed peek_token lifetime issue - was returning reference to
local variable, now returns reference from cache.
Acceptance criteria:
- Token enum with all required variants
- Lexer struct with position tracking and diagnostics
- Diagnostic uses Cow<'static, str> for zero-allocation static messages
- All public methods implemented: new, next_token, peek_token, position, take_diagnostics
- All internal helpers implemented
Refs: pdftract-4hn1
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-4hn1