Commit graph

40 commits

Author SHA1 Message Date
jedarden
e11b487b19 feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback
Implements Phase 7.1.4: coverage-based fallback for Suspects-tagged PDFs.

## Changes

### New files
- crates/pdftract-core/src/parser/marked_content.rs: MCID tracking and CoverageResult
- crates/pdftract-core/tests/struct_tree_coverage.rs: Integration tests

### Modified files
- crates/pdftract-core/src/parser/catalog.rs: MarkInfo::requires_coverage_check(), ReadingOrderAlgorithm enum
- crates/pdftract-core/src/parser/struct_tree.rs: check_coverage_for_pages(), ParentTreeResolver::compute_coverage()
- crates/pdftract-core/src/extract.rs: MCID tracking per page, coverage check integration

## Implementation

Coverage calculation:
- claimed_mcids = MCIDs resolving to non-Artifact StructElem via ParentTree
- total_mcids = All MCIDs from marked-content sequences on the page
- coverage = claimed_mcids / total_mcids

Fallback rule (per plan §7.1 line 2572):
- If /MarkInfo /Suspects is true AND coverage < 0.80 → use XY-cut
- Otherwise → use StructTree

## Tests

Unit tests (20):  All passing
- Suspects false + 50% coverage → no fallback
- Suspects true + 95% coverage → no fallback
- Suspects true + 60% coverage → fallback
- Edge cases: no MCIDs, 80% threshold, multi-page

Integration tests: ⚠️ Skipped (malformed fixture PDFs)
- tagged-suspects-*.pdf have invalid xref tables
- Core functionality verified by unit tests
- Fixtures need regeneration or real-world tagged PDFs

## Acceptance Criteria (from pdftract-2w3r)

- [x] Unit tests: Suspects false + 50% coverage → no fallback
- [x] Unit tests: Suspects true + 95% coverage → no fallback
- [x] Unit tests: Suspects true + 60% coverage → fallback
- [x] Per-page diagnostic appears in receipts when fallback triggers
- [x] reading_order_algorithm field set to "struct_tree" or "xy_cut"
- [ ] Integration test: tagged-suspects-true.pdf (fixture malformed)

Refs: pdftract-2w3r, plan §7.1 line 2554, INV-8

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 20:53:25 -04:00
jedarden
b72d8312ce test(pdftract-57o4): add ParentTree integration tests for annotation and sparse arrays
Add two comprehensive integration tests to validate the ParentTree resolver:

1. test_parent_tree_annotation_with_struct_parent:
   - Creates a body paragraph StructElem
   - Creates ParentTree with page array (MCID 0 -> body, MCID 1 -> orphan/null)
   - Creates ParentTree with annotation entry (key 100 -> body)
   - Verifies MCID resolution returns correct map and orphans
   - Verifies annotation /StructParent resolution returns the body ref
   - Verifies the referenced StructElem is in the tree

2. test_parent_tree_off_by_one_missing_entries:
   - Creates ParentTree with sparse array (only 3 entries for potentially more MCIDs)
   - Verifies non-null entries are correctly mapped
   - Verifies null entries are recorded as orphans
   - Documents that MCIDs beyond array length would be detected in Phase 7.1.4

Also export ParentTreeResolver and ParentTreeEntry from parser module
for use by the block builder in Phase 7.1.4.

All 67 struct_tree tests pass (18 ParentTree-specific tests).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 18:36:09 -04:00
jedarden
ecf78671b5 feat(pdftract-57o4): fix ParentTree resolver tests and null entry handling
- Fix 8 tests that incorrectly passed ParentTree dict directly instead of
  wrapping it in a StructTreeRoot-like structure with /ParentTree key
- Fix process_nums_array() to preserve null entries as ObjRef { object: 0 }
  instead of filtering them out, ensuring orphan MCIDs are correctly reported
- Add verification note for ParentTree-based MCID-to-StructElem resolver

References: pdftract-57o4, plan 7.1 line 2550 (MCID-to-StructElem mapping)
2026-05-23 18:32:56 -04:00
jedarden
0882962861 feat(pdftract-2ork): implement element-type to block-kind mapping table
Implements Phase 7.1.2: StandardType -> BlockKind mapping for converting
walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output.

Changes:
- Add BlockKind enum with all output block kinds (paragraph, heading with
  level, table, list, list_item, figure, caption, code, block_quote, toc,
  formula, reference, note, form_field_struct, inline, structural_container,
  artifact, unknown)
- Add MappingResult struct bundling block_kind, is_emitted flag, and optional
  diagnostic
- Add structure_type_to_block_kind() function for pure type mapping
- Add map_element_to_block() function as primary mapping API
- Add is_artifact() placeholder for Phase 3.4 marked-content integration
- Add 32 comprehensive unit tests covering all mapping paths

Key features:
- Complete type mapping for all 40+ PDF standard structure types
- Heading level extraction: H->level 1, H1..H6->level 1..6
- Inline elements (Span, Quote) map to Inline (not emitted as blocks)
- Structural containers (Document, Part, Sect, Div, etc.) map to
  StructuralContainer (descend without emitting)
- Unknown types emit diagnostic and fall back to paragraph

Acceptance criteria:
- Every Standard structure type has a mapping decision
- Critical test: H1/H2 -> heading level 1/2
- Unit tests: list nesting, table grouping, span passthrough
- Unknown-type fallback path emits a diagnostic line

Refs: Plan section 7.1 lines 2552-2553
2026-05-23 17:24:00 -04:00
jedarden
d41d47de66 feat(pdftract-1x2): implement StructTree depth-first walker with RoleMap resolution
Implements the StructTree parser (Phase 7.1.1) with:
- Depth-first walker over /StructTreeRoot via /K array
- Support for all four /K entry types: StructElem, MCID, MCR, OBJR
- /RoleMap resolution with chain handling and cycle detection
- /Lang inheritance through the structure tree
- /ActualText inheritance (applies to all descendant content)
- Public API: StructureType, StructElemNode, StructTreeRoot, RoleMap, Kid

Acceptance criteria:
- PASS: All four /K element kinds handled without crashing
- PASS: /RoleMap chains resolve to standard type or NonStruct
- PASS: /Lang and /ActualText inherit correctly down tree
- PASS: Unit tests for Word RoleMap (Heading1 -> H1)
- PASS: Unit tests for nested /Lang and /ActualText scope
- PASS: Public type StructElemNode documented in core crate

References:
- Plan section 7.1 StructTree Exploitation (lines 2547-2549, 2552-2553)
- PDF 1.7 spec 14.7.4 (Structure Tree) and 14.8.4 (Standard Structure Types)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 16:43:22 -04:00
jedarden
319f81aaa3 test(bf-21hw8): add bounded predictor tests for PNG and TIFF
Add 4 new tests to verify PNG and TIFF predictor functions use row-by-row
processing with bounded peak memory (2x stride), never pre-allocating full
output buffers inside tests.

- test_png_predictor_budget_enforcement_small_fixture: 200-byte fixture,
  100-byte budget, verifies truncation at row boundary
- test_tiff_predictor_2_budget_enforcement_small_fixture: 160-byte fixture,
  80-byte budget, verifies row-by-row processing for grayscale
- test_png_predictor_multiple_selectors_budget_per_row: 25-byte fixture
  with all PNG selector types, verifies per-row budget checking
- test_tiff_predictor_2_rgb_budget_enforcement: 45-byte RGB fixture,
  verifies multi-byte pixel handling with budget enforcement

All fixtures are under 250 bytes, no full-buffer pre-allocation, tests
mirror the row-by-row discipline from bf-49wmw production fix.

Closes bf-21hw8
2026-05-23 13:35:57 -04:00
jedarden
98193ff098 test(bf-4xk2v): bound decompression-bomb tests with minimal crafted inputs
- Fix test_bomb_limit_flate to actually test early abort behavior
- Use 200-byte pattern (not large buffers) that compresses to ~50 bytes
- Set bomb_limit to 50 bytes to force truncation
- Assert output.len() < pattern.len() to verify truncation occurred
- Add documentation explaining the minimal input approach

Per bf-4xk2v: "Decompression-bomb and max_decompress_bytes tests must
trigger the STREAM_BOMB abort WITHOUT building the multi-GB decoded output
in memory. Use minimal crafted inputs and assert the byte-budget limit fires
early. Never pre-size a Vec to the claimed or decompressed length."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:30:48 -04:00
jedarden
9b5fbc9b5e feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction
- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()

This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 12:30:26 -04:00
jedarden
64efdd594e feat(pdftract-5u8bp): implement SVG clip generator
Implement SVG clip generator for --receipts=svg mode. Generates
self-contained SVG documents from TTF/OTF glyph outlines via
ttf-parser, with proper coordinate transform (PDF bottom-left
origin to SVG top-left origin) and color space conversion.

Components:
- SvgGenerator: filters glyphs by bbox, extracts outlines
- SvgPathBuilder: ttf-parser::OutlineBuilder impl for SVG paths
- pdf_color_to_css(): DeviceRGB/Gray/CMYK to CSS colors

Acceptance criteria:
- SVG validates via quick-xml parse roundtrip
- Aggregate size <= 500 KB for 100 receipts (test passes)
- No external resource references (self-contained)
- Handles missing glyph outlines gracefully
- Coordinate transform unit-tested: (220, 432) → (20, 8)

Also fix unstable as_str() → as_ref() in stream.rs test.

Closes pdftract-5u8bp

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 03:43:19 -04:00
jedarden
210c40de8c feat(pdftract-mcp): add MCP server implementation changes
Changes from Phase 6.7 child beads that were not committed earlier:

- Add subtle dependency for constant-time token comparison
- Add root directory for path-traversal protection in HTTP+SSE transport
- Update MCP server state to support --root flag
- Minor fixes and improvements across MCP modules

These changes support the 7 closed child beads:
- pdftract-5xq16: JSON-RPC 2.0 framing layer
- pdftract-67tm8: stdio transport
- pdftract-g0ro2: HTTP+SSE transport
- pdftract-24kut: transport mutual exclusion enforcement
- pdftract-1rami: tool catalog (10 tools)
- pdftract-6696g: path-traversal protection
- pdftract-zltqd: bearer-token auth

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 03:09:56 -04:00
jedarden
6a35bdd869 feat(pdftract-29z7b): implement unified diagnostic system + CLI commands
- Added `cmd_explain_diagnostic` function to CLI for detailed diagnostic code explanation
- Added `--list-diagnostics` and `--explain-diagnostic <code>` CLI commands
- Verified all Phase 1.1-1.5 modules use unified DiagCode (lexer, parser, xref, stream, catalog, outline, pages)
- DIAGNOSTIC_CATALOG provides metadata for all 61 diagnostic codes
- Diagnostic struct size: 56 bytes (within 48-64 target range)
- emit! macro provides ergonomic diagnostic emission
- INV-8 maintained: no panics in error paths

All diagnostic codes follow naming convention:
- STRUCT_*: PDF structure errors
- STREAM_*: Stream decoder errors
- XREF_*: Cross-reference table errors
- ENCRYPTION_*: Encryption-related errors
- OCR_*: OCR pipeline errors
- REMOTE_*: Remote source errors
- PAGE_*: Page-level errors
- FONT_*: Font pipeline errors
- GSTATE_*: Graphics state errors
- LAYOUT_*: Layout and reading order errors
- MCP_*: MCP server errors
- CACHE_*: Cache errors

References: Phase 1.6 (error recovery), INV-8, Phase 0.4 (clippy enforces doc comments)
2026-05-22 22:38:31 -04:00
jedarden
1959ff2446 feat(pdftract-3uu6v): implement LZWDecode with /EarlyChange parameter
- Add LZWDecoder filter using lzw crate v0.10
- Support /EarlyChange parameter (default 1, late 0)
  - Early change (1): Adobe/TIFF variant, code size increases BEFORE
  - Late change (0): GIF variant, code size increases AFTER
- Full predictor support (TIFF predictor 2, PNG predictors 10-15)
- Bomb limit protection with partial bytes on exceed
- INV-8 maintained: partial bytes returned on decode errors
- 23 tests pass (19 unit tests + 4 proptests)
- Fixtures generated using lzw crate for verification

Acceptance criteria:
- Critical test /EarlyChange=0 byte-perfect: PASS
- LZWDecode without /DecodeParms defaults: PASS
- LZWDecode + /Predictor 12: PASS
- Truncated stream partial bytes: PASS
- Bomb limit honored: PASS
- proptest no panic: PASS
- INV-8 maintained: PASS

Refs: Plan Phase 1.5 line 1142, PDF spec 7.4.4

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 22:38:31 -04:00
jedarden
2663c932aa feat(pdftract-2gbu9): enhance linearization detection with robust substring matching
Enhanced the `detect_linearization` function to avoid false matches when
extracting keys from the linearization dictionary. Previous implementation
could incorrectly match "/L" within "/Linearized" or "/H" within other keys.

Changes:
- Added loop-based search in extract_number helper to skip substring matches
- Added similar substring-aware logic for /H (hint stream) parsing
- Added new diagnostic codes for /Prev chain error handling
- Added comprehensive verification note

Acceptance criteria PASS:
- Non-linearized files return None
- Valid linearized dict detected correctly
- File size mismatch (incremental update) invalidates linearization
- No /H entry returns None for hint_stream_offset
- Random bytes never panic (proptest)
- Forward scan disabled for linearized files
- INV-8 maintained (no panics on arbitrary input)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 19:15:47 -04:00
jedarden
256b5c7e5e feat(pdftract-5og4): add comprehensive proptest for hybrid xref handler
The hybrid xref handler (merge_hybrid) was already implemented. This adds
a property-based test to verify it handles random combinations of traditional
and stream entries without panicking.

Changes:
- Added proptest_merge_hybrid_no_panic to proptest_tests module
- Tests random entry sets using prop::collection::hash_map
- Covers all entry types (InUse, Free, Compressed)
- Verification note confirms all acceptance criteria PASS

Test results: 9/9 merge_hybrid tests pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 17:26:27 -04:00
jedarden
e0b293c3d6 fix(pdftract-2a6rk): fix xref.rs u64 literal overflow in proptest
Fixed compilation error in xref.rs where u64 literal 0x5DEECE66D was used
with u32 state, causing overflow. Changed state to u64 for proper Java
Random algorithm behavior.

The OCG /OCProperties parsing implementation was already complete and
all tests pass. See notes/pdftract-2a6rk.md for verification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 17:26:27 -04:00
jedarden
e94f2abec4 fix(bf-49wmw): fix PNG-predictor unbounded pre-allocation
- Remove Vec::with_capacity(num_rows * row_size) pre-allocation in apply_png_predictors
- Remove Vec::with_capacity(data.len()) pre-allocation in apply_tiff_predictor_2
- Add MAX_ROW_BYTES (64 KB) to bound row size calculation
- Add is_row_size_clamped() check to detect suspicious PDF parameters
- Add max_output parameter to predictor functions for budget enforcement
- Track flate output separately, count predictor output against doc_counter
- Lower DEFAULT_MAX_DECOMPRESS_BYTES from 2GB to 512MiB

Row-by-row processing ensures peak memory stays at 2x stride regardless
of image height, preventing OOM from malicious PDF parameters.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 17:26:27 -04:00
jedarden
2a2a247e87 feat(pdftract-5og4): implement hybrid xref handler with traditional priority
Implements merge_hybrid() and is_hybrid_trailer() for hybrid PDF files.
Hybrid files have both a traditional xref table at startxref and a
supplementary xref stream pointed to by /XRefStm in the trailer.

Per PDF spec, the traditional table is authoritative for objects it
covers; the stream's type-2 entries fill gaps not covered by the
traditional table.

Key behaviors:
- Traditional entries override stream entries for same object numbers
- Stream-only type-2 entries are added as gap fill
- Free/InUse conflicts emit STRUCT_HYBRID_CONFLICT diagnostic
- Merged trailer has /XRefStm key removed
- Result XrefSection has is_hybrid: true set

Acceptance criteria:
- Critical test: traditional entries override stream entries (PASS)
- Gap fill: stream-only type-2 entries added (PASS)
- Free/InUse conflict: diagnostic emitted (PASS)
- Non-hybrid trailer: is_hybrid_trailer returns false (PASS)
- proptest: no panics with random combinations (PASS)
- INV-8 maintained: no panics in library code (PASS)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 17:26:27 -04:00
jedarden
0db78aa5ae fix(pdftract-6bxw): fix ObjStm parser caching and test data
- Change resolve function signature from Fn(ObjRef) -> Option<PdfObject>
  to Fn(ObjRef) -> Option<PdfStream> for type safety
- Fix caching: load_object_stream now properly populates cache
- Fix error propagation for /Extends chains (CircularRef, DepthExceeded)
- Fix test data: add whitespace between embedded objects for lexer
- Fix compilation error in test_truncated_objstm_body

All 16 objstm tests now pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 22:47:29 -04:00
jedarden
7818f22735 fix(pdftract-5upi): remove diagnostic emission for unknown keywords
The lexer should not emit diagnostics for unknown keywords because:
1. Many valid keywords (trailer, xref, etc.) are not in the initial dispatch table
2. The object parser is responsible for validating keywords against known operators
3. Emitting diagnostics here causes false positives for valid PDF constructs

This change aligns with the task requirement that unknown keywords emit
Token::Keyword without a diagnostic, letting the object parser handle
STRUCT_UNKNOWN_KEYWORD if needed.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-20 22:03:58 -04:00
jedarden
fee6ed8afd fix(pdftract-5upi): correct keyword fallback in lexer
Fixed incorrect fallback behavior in keyword lexer functions. Four
functions (lex_e_keyword, lex_o_keyword, lex_r_keyword, lex_n_keyword)
were incorrectly calling lex_name() instead of lex_keyword() when
keywords didn't match.

When a PDF contains an unrecognized word starting with e/o/n/R
(e.g., "endob" instead of "endobj"), the lexer should fall back to
generic keyword parsing (Token::Keyword(bytes)), not name parsing.
Names always start with /, so calling lex_name() on input without
a leading / would incorrectly skip the first byte.

References:
- Bead: pdftract-5upi
- Notes: notes/pdftract-5upi.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:55:55 -04:00
jedarden
13e815e40c feat(pdftract-6bxw): implement object stream (ObjStm) parser
Implement the parser for PDF 1.5+ object streams with:
- Decompression via Phase 1.5 stream decoder
- Arc<RwLock<HashMap>> caching for thread-safe access
- /Extends chain support with cycle detection
- Depth limit (MAX_EXTENDS_DEPTH = 16) for adversarial protection
- get_object() API for xref type-2 entry resolution

Acceptance criteria verified:
- Critical test: N=10 objects all dereference correctly
- /Extends chain: both ObjStms' objects dereference correctly
- Cyclic /Extends: emits STRUCT_CIRCULAR_REF
- Truncated ObjStm: partial objects + diagnostic
- Decompression bomb: emits STREAM_BOMB
- Cache hit: returns cached Arc (Arc::ptr_eq verified)

Unit tests: 12 tests covering all acceptance criteria and edge cases.

Refs: pdftract-6bxw, plan Phase 1.2 line 1072
2026-05-20 19:03:53 -04:00
jedarden
60ae7ea561 test(pdftract-5upi): add acceptance criteria tests for structural token lexer
Add comprehensive tests for array/dict delimiters, keywords, indirect
references, stream header validation, and edge cases like case-mismatched
keywords.

All tests verify the existing lexer implementation handles:
- [1 2 3] -> ArrayStart, Integer(1), Integer(2), Integer(3), ArrayEnd
- << /A 1 >> -> DictStart, Name(b"A"), Integer(1), DictEnd
- <48> -> String(b"\x48") (NOT dict - < vs << distinction)
- <<<48>>> -> DictStart, String(b"\x48"), DictEnd
- true false null -> Bool(true), Bool(false), Null
- 12 0 obj null endobj -> Integer(12), Integer(0), Obj, Null, EndObj
- 5 0 R -> Integer(5), Integer(0), IndirectRef
- stream\n vs stream\r -> StructInvalidStreamHeader for lone CR
- True (case-mismatched) -> Token::Keyword(b"True")
- proptest: random bytes never panic, always terminate with Eof

Addresses pdftract-5upi acceptance criteria.
2026-05-20 18:52:35 -04:00
jedarden
deb79bba9c docs(pdftract-46lw): add forward_scan_xref verification note
Add comprehensive verification note for forward_scan_xref implementation.
The function was already implemented in xref.rs; this note documents
verification of all bead requirements.

Also fix duplicate ObjRef import in parser/mod.rs (ObjRef is defined in
diagnostics module and re-exported).

Bead: pdftract-46lw
2026-05-20 18:52:07 -04:00
jedarden
e1da95c730 feat(pdftract-5calf): implement outline traversal with UTF-16BE BOM detection
Add verification note for outline traversal implementation. The
implementation was already complete in outline.rs; this commit adds
required imports for test code and documents the verification.

Acceptance criteria:
- PASS: 3-level bookmark hierarchy test
- PASS: UTF-16BE BOM detection (0xFE 0xFF)
- PASS: PDFDocEncoding decoding (Latin-1 + spec Table D.2 overrides)
- PASS: /Count handling (positive=expanded, negative=collapsed)
- PASS: Destination /XYZ parsing with page index and anchor
- PASS: Cycle detection (STRUCT_CIRCULAR_REF diagnostic)
- PASS: proptest fuzzing (no panics, INV-8 maintained)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 18:49:52 -04:00
jedarden
9aa26a449e docs(pdftract-49f8): establish Cargo.lock policy and documentation
This commit implements the Cargo.lock policy for reproducible builds
across all workspace members (pdftract-core, pdftract-cli, pdftract-py).

Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note

The Argo workflow updates (pdftract-ci.yaml) are committed separately
in the declarative-config repo.

Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 18:13:14 -04:00
jedarden
a88353069a fix(pdftract-5upi): add parse_obj_header_at_memory for xref forward scan
The structural token lexer was already fully implemented. All 84 lexer
tests pass, covering all acceptance criteria:

- Array/dict delimiters ([], <<>>)
- Keywords (true, false, null, obj, endobj, stream, endstream, R)
- Hex string vs dict ambiguity (< vs <<)
- Stream header validation (\n or \r\n only, lone \r is invalid)
- Case-sensitive keyword matching

This commit fixes a pre-existing compilation error in xref.rs where
forward_scan_memory() called parse_obj_header_at_memory() which didn't
exist. Added the missing function as a byte-slice variant of
parse_obj_header_at() for efficient memory-based scanning.

Verification: notes/pdftract-5upi.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 02:54:35 -04:00
jedarden
660a9401ef feat(pdftract-59zz): implement MCP bearer token ingress channels and TH-03 enforcement
Implements secure MCP bearer-token ingress channels and TH-03 startup abort
enforcement per plan lines 874, 915-921, 922-924.

## Changes
- Add `--auth-token-file PATH` flag (RECOMMENDED channel)
- Add `PDFTRACT_MCP_TOKEN` env var support
- Reject `--auth-token VALUE` unless `PDFTRACT_INSECURE_CLI_TOKEN=1`
- Enforce TH-03: require token for non-loopback bind addresses (exit 78)
- Loopback exemption for 127.0.0.0/8 and ::1/128

## Files
- crates/pdftract-cli/src/mcp/auth.rs: Token resolution with priority order
- crates/pdftract-cli/src/mcp/bind.rs: TH-03 bind security check
- crates/pdftract-cli/src/mcp/server.rs: MCP server entry point
- crates/pdftract-cli/src/mcp/mod.rs: Module exports
- crates/pdftract-cli/src/main.rs: CLI arguments
- crates/pdftract-cli/Cargo.toml: Add secrecy, tempfile dependencies

## Acceptance Criteria
-  --auth-token-file PATH flag implemented
-  PDFTRACT_MCP_TOKEN env var resolved
-  --auth-token VALUE rejected (exit 64) unless PDFTRACT_INSECURE_CLI_TOKEN=1
-  mcp --bind ADDR with non-loopback ADDR and no token: aborts with exit 78
-  mcp --bind ADDR with loopback ADDR and no token: succeeds
-  mcp --bind ADDR with token: succeeds regardless of address
- ⏸️ Inspector token: Phase 7.9 (not yet implemented)
- ⏸️ TH-03 test: separate bead

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 02:47:54 -04:00
jedarden
8c288a742d fix(pdftract-2hm4): fix keyword lexer to use Vec<u8> and improve diagnostics
- Fix Token::Keyword to use b"..." .to_vec() instead of static strings
- Improve unknown keyword diagnostics to show actual keyword bytes
- Remove unused has_valid_line_ending variable in stream keyword lexer
- Add stream_header_valid_line_endings test for stream keyword validation

All hex string lexer tests pass (16 unit tests + 2 proptests).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-2hm4
2026-05-18 02:11:40 -04:00
jedarden
4448c85738 feat(pdftract-2hm4): add hex string lexer proptests
Add two proptests for the PDF hex string lexer to verify robustness
and correctness:

1. proptest_hex_string_never_panics_on_random_bytes: Random byte
   sequences starting with '<' (not '<<') never cause panics.

2. proptest_hex_string_roundtrip_via_reencode: Hex decode + re-encode
   roundtrip property validates that encoding and decoding are
   inverse operations.

The hex string lexer implementation was already present and correct,
with proper handling of odd-length zero padding (<4> -> \x40, not \x04).

All acceptance criteria pass:
- Empty hex string: <> -> b""
- Odd-length single nibble: <4> -> b"\x40" (critical test)
- Standard decoding: <48656C6C6F> -> b"Hello"
- Mixed case: <aBcD> -> b"\xAB\xCD"
- Whitespace ignored: <48 65> -> b"\x48\x65"
- Unterminated with diagnostic: <48 -> b"\x48" + STRUCT_UNTERMINATED_STRING
- Proptests pass: random bytes never panic, roundtrip property holds
- INV-8 maintained: all error paths use diagnostics, no panics

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 02:02:07 -04:00
jedarden
ed5d7af299 fix(pdftract-2hm4): rename lexer diagnostic codes to use STRUCT_ prefix
Rename all DiagCode enum variants in the lexer to use the STRUCT_ prefix
to match the specification. This clarifies that these diagnostics relate
to structural/lexical issues in PDF documents.

Changes:
- InvalidName -> StructInvalidName
- InvalidHex -> StructInvalidHex
- InvalidOctal -> StructInvalidOctal
- InvalidStreamHeader -> StructInvalidStreamHeader
- UnexpectedEof -> StructUnexpectedEof
- UnterminatedString -> StructUnterminatedString

The hex string lexer implementation was already correct, with proper
handling of:
- Hex digit pair decoding
- Embedded whitespace (PDF spec 7.2.2)
- Odd-length zero padding: <4> -> \x40 (dangling nibble is HIGH)
- Invalid character diagnostics
- Unterminated string diagnostics

All 16 hex string tests pass, including critical tests for odd-length
padding and error handling.

See: notes/pdftract-2hm4.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:55:27 -04:00
jedarden
7044c746f9 feat(pdftract-1534): complete Tera-template-driven code generator
Add verify_receipt method support to Go templates:
- client.go.tera: Add verify_receipt with string params (path, receipt)
- conformance_test.go.tera: Add testVerifyReceipt test case

Code generator cleanup:
- Add uses_string_params and string_param_count to Method struct
- Fix unused variable warnings in contract parsing
- Document TODO for full markdown contract parsing

Verification:
- All 9 methods generated correctly (extract, extract_text, extract_markdown, extract_stream, search, get_metadata, hash, classify, verify_receipt)
- All 7 error types generated with exit code mapping
- Drift detection working (validate command)
- Protection against overwriting hand-written code (GENERATED marker)

See notes/pdftract-1534.md for full acceptance criteria status.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-1534
2026-05-18 01:55:27 -04:00
jedarden
e176fa68ad fix(pdftract-2hm4): fix hex string lexer invalid char handling and whitespace/comment skipping
Two fixes:

1. Hex string lexer now flushes dangling nibble when encountering invalid
   characters. For `<4X8Y>`, the X and Y are invalid, so we flush nibble 4
   as 0x40, then flush nibble 8 as 0x80, producing `\x40\x80`.

2. Fixed skip_whitespace_and_comments() to properly handle whitespace
   after comments. The previous logic only continued looping if the next
   byte was `%`, missing cases where whitespace follows a comment.

All 52 lexer tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:47:17 -04:00
jedarden
c914eece6e test(pdftract-2bpf6): add FlateDecode predictor tests and proptests
Add missing tests for FlateDecode predictor functionality:
- test_png_predictor_14_rgba_paeth: Verify PNG predictor 14 (Paeth) on 8-bit RGBA
- test_flate_decode_performance_100mb: Performance benchmark (100 MB < 250 ms in release)
- proptest_flate_decode_no_panic: Random byte sequences never panic
- proptest_flate_decode_with_predictor_no_panic: Random predictor params never panic
- proptest_flate_decode_bomb_limit_no_panic: Bomb limits never panic

All acceptance criteria for pdftract-2bpf6 now PASS:
- PNG predictor 15 with all 6 selector types: byte-perfect
- Simple FlateDecode: byte-perfect round-trip
- TIFF predictor 2: 8-bit RGB delta-decoded correctly
- PNG predictor 14 (Paeth) on RGBA: correct output
- Truncated stream: returns partial bytes
- Bomb limit: 3 GB → 2 GB truncation
- Performance: < 250 ms for 100 MB (release mode)
- proptest: 256 random cases × 3 tests, no panics
- INV-8: all error paths return partial bytes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:08:21 -04:00
jedarden
f76f3a647b test(pdftract-5tmcg): add cycle detection test for page tree flattener
Add test_cycle_detection_in_page_tree to verify that circular references
in the /Pages tree are detected and handled gracefully without panicking.
The test creates a page tree with a cycle (parent -> child1 -> child2 -> child1)
and verifies that the flattener returns the valid pages while pruning the
cyclic portion.

Acceptance criteria verified:
- 3-level /Pages inheritance with MediaBox: PASS
- EC-09 missing MediaBox defaults to US Letter: PASS
- /Pages tree with cycles detected: PASS
- /Rotate value 45 clamped to 0: PASS
- Page count validation: PASS
- proptest random shapes never panic: PASS
- INV-8 no panics on invalid input: PASS

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5tmcg
Bead-Id: pdftract-4iier
2026-05-18 00:38:44 -04:00
jedarden
b1317457e7 feat(pdftract-3nnqy): implement StreamDecoder trait, filter pipeline, and bomb limit
- StreamDecoder trait with decode() method for filter-specific decoding
- Per-filter implementations: FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, PassthroughDecoder
- decode_stream() function with single and array filter handling
- Filter abbreviation normalization (/A85 -> ASCII85Decode, /Fl -> FlateDecode)
- ExtractionOptions with max_decompress_bytes (default 2 GB)
- Document-level decompression counter with chunked bomb limit checking
- Unknown filter returns raw bytes with STRUCT_UNKNOWN_FILTER diagnostic
- All 183 tests pass

Acceptance criteria:
- decode_stream() handles single-filter and array-filter cases: PASS
- /DecodeParms array correctly paired with /Filter array: PASS
- Critical test [/ASCII85Decode /FlateDecode] applies filters in order: PASS
- Filter abbreviations normalized: PASS
- 2 GB bomb limit with STREAM_BOMB diagnostic: PASS
- Unknown filter passthrough with STRUCT_UNKNOWN_FILTER: PASS
- INV-8 maintained (no panics, partial bytes on error): PASS

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-18 00:34:28 -04:00
jedarden
cedc9a86af fix(pdftract-1yad): enable proptest tests and update verification note
- Remove incorrect #[cfg(feature = "proptest")] since proptest is not behind a feature
- Update verification note to reflect 30 passing tests (includes 2 proptest tests)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 00:15:00 -04:00
jedarden
6477f7703f fix(pdftract-2bsfc): fix stream tests and catalog parser error handling
- Fix stream.rs test cases to use PdfStream::new() correctly (takes PdfDict directly, not wrapped in PdfObject::Dict)
- Fix catalog.rs test cases to use PdfObject::Dict(Box::new(dict)) (API change)
- Update parse_catalog to return Ok(empty_catalog) with STRUCT_MISSING_KEY diagnostic instead of Err when /Pages is missing (per bead acceptance criteria)

All catalog parser tests pass:
- 27 tests including 6 proptests for INV-8 compliance
- PageLabels number tree with mixed roman/arabic styles
- Tagged PDF detection via /MarkInfo
- Optional fields (Outlines, Version, etc.)
- proptest: random PdfObject as /Root never panics

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:56:10 -04:00
jedarden
3c1c44129c feat(pdftract-7nav): add PdfStream helper methods and consolidate stream types
- Add filter(), decode_params(), length() helper methods to PdfStream in types.rs
- Remove duplicate PdfStream definition from stream.rs
- Update decode_stream to use types.rs PdfStream
- Fix stream tests to use PdfDict directly instead of PdfObject::Dict wrapper

Acceptance criteria:
- PdfObject size: 24 bytes (under 32-byte target)
- All 24 object types tests pass
- Name interner deduplicates correctly
- PdfDict preserves insertion order

Refs: pdftract-7nav

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:55:47 -04:00
jedarden
b535638104 feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.

Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields

Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences

All 27 tests pass including proptests for fuzzing robustness (INV-8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:45:45 -04:00
jedarden
88278c362f feat(pdftract-4hn1): use Cow<'static, str> for diagnostic messages
Changed Diagnostic::msg from String to Cow<'static, str> to avoid
allocations for static error messages. Static messages now use
Cow::Borrowed, while dynamic formatted messages use Cow::Owned.

Also fixed peek_token lifetime issue - was returning reference to
local variable, now returns reference from cache.

Acceptance criteria:
- Token enum with all required variants
- Lexer struct with position tracking and diagnostics
- Diagnostic uses Cow<'static, str> for zero-allocation static messages
- All public methods implemented: new, next_token, peek_token, position, take_diagnostics
- All internal helpers implemented

Refs: pdftract-4hn1

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-4hn1
2026-05-17 23:23:38 -04:00