Commit graph

11 commits

Author SHA1 Message Date
jedarden
857f928732 feat(pdftract-5omc): implement SDK conformance test runner pattern
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.

- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
  * Full test suite loader and executor
  * Comparison engine with min/max, string constraints, tolerances
  * Skip logic for unsupported features and schema versions
  * Report generation in JSON format

- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
  * pdftract compare - Compare actual vs expected with tolerances
  * Cross-language comparison tool to avoid reimplementations

- Documentation (docs/conformance/sdk-contract.md)
  * Complete pattern specification with pseudocode
  * Per-language runner locations
  * CI integration requirements

- Python reference stub (tests/python-conformance/test_conformance.py)
  * Full pytest-based implementation following the pattern

Closes: pdftract-5omc
2026-05-18 01:22:23 -04:00
jedarden
02488a354c fix(pdftract-2t9): update regression-corpus step image and secret
Changes:
- Use pdftract-test-glibc:1.78 image (has aws/b2 CLI preinstalled)
- Use b2-readonly secret instead of armor-secrets
- Update env var names to ARMOR_ACCESS_KEY_ID/ARMOR_SECRET_ACCESS_KEY
- Remove apt-get install step (tools already in image)

The cer-diff tool was already implemented in a previous commit.
This commit fixes the image and secret references per the bead spec.

References pdftract-2t9 acceptance criteria:
- regression-corpus step runs on every PR (✓ already in workflow)
- Uses pdftract-test-glibc:1.78 image (✓ fixed)
- Uses b2-readonly secret (✓ fixed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:20:53 -04:00
jedarden
c914eece6e test(pdftract-2bpf6): add FlateDecode predictor tests and proptests
Add missing tests for FlateDecode predictor functionality:
- test_png_predictor_14_rgba_paeth: Verify PNG predictor 14 (Paeth) on 8-bit RGBA
- test_flate_decode_performance_100mb: Performance benchmark (100 MB < 250 ms in release)
- proptest_flate_decode_no_panic: Random byte sequences never panic
- proptest_flate_decode_with_predictor_no_panic: Random predictor params never panic
- proptest_flate_decode_bomb_limit_no_panic: Bomb limits never panic

All acceptance criteria for pdftract-2bpf6 now PASS:
- PNG predictor 15 with all 6 selector types: byte-perfect
- Simple FlateDecode: byte-perfect round-trip
- TIFF predictor 2: 8-bit RGB delta-decoded correctly
- PNG predictor 14 (Paeth) on RGBA: correct output
- Truncated stream: returns partial bytes
- Bomb limit: 3 GB → 2 GB truncation
- Performance: < 250 ms for 100 MB (release mode)
- proptest: 256 random cases × 3 tests, no panics
- INV-8: all error paths return partial bytes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:08:21 -04:00
jedarden
6aabfa0c96 feat(pdftract-q15sh): implement v1 fingerprint algorithm
Implement Merkle SHA-256 fingerprint algorithm for PDF structural
fingerprinting as specified in Phase 1.7 of the plan.

Components:
- FingerprintInput struct with page data and catalog flags
- Per-page hashing: content streams (normalized), resources (sorted),
  geometry (4dp banker's rounding)
- Structure tree hash for tagged PDFs
- Catalog feature flag byte (encryption, JS, XFA, OCG)

Acceptance criteria:
- INV-3: 100% reproducible fingerprints (test passes)
- INV-13: Output format ^pdftract-v1:[0-9a-f]{64}$ (test passes)
- Performance: 100-page PDF in < 1ms (test passes)
- KU-7: WARN - no linearized fixtures available

Closes pdftract-q15sh

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:02:30 -04:00
jedarden
f76f3a647b test(pdftract-5tmcg): add cycle detection test for page tree flattener
Add test_cycle_detection_in_page_tree to verify that circular references
in the /Pages tree are detected and handled gracefully without panicking.
The test creates a page tree with a cycle (parent -> child1 -> child2 -> child1)
and verifies that the flattener returns the valid pages while pruning the
cyclic portion.

Acceptance criteria verified:
- 3-level /Pages inheritance with MediaBox: PASS
- EC-09 missing MediaBox defaults to US Letter: PASS
- /Pages tree with cycles detected: PASS
- /Rotate value 45 clamped to 0: PASS
- Page count validation: PASS
- proptest random shapes never panic: PASS
- INV-8 no panics on invalid input: PASS

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5tmcg
Bead-Id: pdftract-4iier
2026-05-18 00:38:44 -04:00
jedarden
b1317457e7 feat(pdftract-3nnqy): implement StreamDecoder trait, filter pipeline, and bomb limit
- StreamDecoder trait with decode() method for filter-specific decoding
- Per-filter implementations: FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, PassthroughDecoder
- decode_stream() function with single and array filter handling
- Filter abbreviation normalization (/A85 -> ASCII85Decode, /Fl -> FlateDecode)
- ExtractionOptions with max_decompress_bytes (default 2 GB)
- Document-level decompression counter with chunked bomb limit checking
- Unknown filter returns raw bytes with STRUCT_UNKNOWN_FILTER diagnostic
- All 183 tests pass

Acceptance criteria:
- decode_stream() handles single-filter and array-filter cases: PASS
- /DecodeParms array correctly paired with /Filter array: PASS
- Critical test [/ASCII85Decode /FlateDecode] applies filters in order: PASS
- Filter abbreviations normalized: PASS
- 2 GB bomb limit with STREAM_BOMB diagnostic: PASS
- Unknown filter passthrough with STRUCT_UNKNOWN_FILTER: PASS
- INV-8 maintained (no panics, partial bytes on error): PASS

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-18 00:34:28 -04:00
jedarden
cedc9a86af fix(pdftract-1yad): enable proptest tests and update verification note
- Remove incorrect #[cfg(feature = "proptest")] since proptest is not behind a feature
- Update verification note to reflect 30 passing tests (includes 2 proptest tests)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 00:15:00 -04:00
jedarden
6477f7703f fix(pdftract-2bsfc): fix stream tests and catalog parser error handling
- Fix stream.rs test cases to use PdfStream::new() correctly (takes PdfDict directly, not wrapped in PdfObject::Dict)
- Fix catalog.rs test cases to use PdfObject::Dict(Box::new(dict)) (API change)
- Update parse_catalog to return Ok(empty_catalog) with STRUCT_MISSING_KEY diagnostic instead of Err when /Pages is missing (per bead acceptance criteria)

All catalog parser tests pass:
- 27 tests including 6 proptests for INV-8 compliance
- PageLabels number tree with mixed roman/arabic styles
- Tagged PDF detection via /MarkInfo
- Optional fields (Outlines, Version, etc.)
- proptest: random PdfObject as /Root never panics

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:56:10 -04:00
jedarden
3c1c44129c feat(pdftract-7nav): add PdfStream helper methods and consolidate stream types
- Add filter(), decode_params(), length() helper methods to PdfStream in types.rs
- Remove duplicate PdfStream definition from stream.rs
- Update decode_stream to use types.rs PdfStream
- Fix stream tests to use PdfDict directly instead of PdfObject::Dict wrapper

Acceptance criteria:
- PdfObject size: 24 bytes (under 32-byte target)
- All 24 object types tests pass
- Name interner deduplicates correctly
- PdfDict preserves insertion order

Refs: pdftract-7nav

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:55:47 -04:00
jedarden
b535638104 feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.

Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields

Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences

All 27 tests pass including proptests for fuzzing robustness (INV-8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:45:45 -04:00
jedarden
88278c362f feat(pdftract-4hn1): use Cow<'static, str> for diagnostic messages
Changed Diagnostic::msg from String to Cow<'static, str> to avoid
allocations for static error messages. Static messages now use
Cow::Borrowed, while dynamic formatted messages use Cow::Owned.

Also fixed peek_token lifetime issue - was returning reference to
local variable, now returns reference from cache.

Acceptance criteria:
- Token enum with all required variants
- Lexer struct with position tracking and diagnostics
- Diagnostic uses Cow<'static, str> for zero-allocation static messages
- All public methods implemented: new, next_token, peek_token, position, take_diagnostics
- All internal helpers implemented

Refs: pdftract-4hn1

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-4hn1
2026-05-17 23:23:38 -04:00