jedarden/pdftract

Author	SHA1	Message	Date
jedarden	e176fa68ad	fix(pdftract-2hm4): fix hex string lexer invalid char handling and whitespace/comment skipping Two fixes: 1. Hex string lexer now flushes dangling nibble when encountering invalid characters. For `<4X8Y>`, the X and Y are invalid, so we flush nibble 4 as 0x40, then flush nibble 8 as 0x80, producing `\x40\x80`. 2. Fixed skip_whitespace_and_comments() to properly handle whitespace after comments. The previous logic only continued looping if the next byte was `%`, missing cases where whitespace follows a comment. All 52 lexer tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 01:47:17 -04:00
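The two lexer fixes described in e176fa68ad can be sketched roughly as follows. This is a minimal illustration, not the project's lexer: `decode_hex_body`, `is_pdf_whitespace`, and the free-function form of `skip_whitespace_and_comments` are hypothetical shapes chosen for this example, and the handling of whitespace inside a hex string follows the PDF specification rather than anything stated in the commit.

```rust
/// PDF whitespace bytes: NUL, HT, LF, FF, CR, SP.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | b'\t' | b'\n' | 0x0C | b'\r' | b' ')
}

/// Decode the bytes between `<` and `>` of a hex string.
/// Hypothetical helper: an invalid character flushes a dangling
/// high nibble as `nibble << 4`, as the commit message describes.
fn decode_hex_body(body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(body.len() / 2 + 1);
    let mut pending: Option<u8> = None; // high nibble waiting for its partner

    for &b in body {
        let nibble = match b {
            b'0'..=b'9' => Some(b - b'0'),
            b'a'..=b'f' => Some(b - b'a' + 10),
            b'A'..=b'F' => Some(b - b'A' + 10),
            _ => None,
        };
        match nibble {
            Some(n) => match pending.take() {
                Some(hi) => out.push((hi << 4) | n),
                None => pending = Some(n),
            },
            // Assumption: whitespace is skipped and keeps the pending nibble.
            None if is_pdf_whitespace(b) => {}
            // Invalid character: flush the dangling nibble, if any.
            None => {
                if let Some(hi) = pending.take() {
                    out.push(hi << 4);
                }
            }
        }
    }
    // An odd trailing digit is padded with a zero low nibble.
    if let Some(hi) = pending {
        out.push(hi << 4);
    }
    out
}

/// Return the offset of the first byte at or after `pos` that is neither
/// whitespace nor part of a `%` comment. Looping back after each comment
/// is what picks up whitespace that follows a comment.
fn skip_whitespace_and_comments(input: &[u8], mut pos: usize) -> usize {
    loop {
        while pos < input.len() && is_pdf_whitespace(input[pos]) {
            pos += 1;
        }
        if pos < input.len() && input[pos] == b'%' {
            // Consume the comment up to, but not including, the end of line.
            while pos < input.len() && !matches!(input[pos], b'\n' | b'\r') {
                pos += 1;
            }
        } else {
            return pos;
        }
    }
}
```

With this sketch, `decode_hex_body(b"4X8Y")` yields the bytes `0x40 0x80`, matching the example in the commit message.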
jedarden	9456d8e231	feat(pdftract-5omc): implement per-language conformance test runner pattern Implements the conformance test runner pattern for all 10 SDKs as specified in the plan (line 3547). Each SDK now has a dedicated conformance test runner. Created: - tests/sdk-conformance/report-schema.json: JSON schema for conformance reports - docs/notes/sdk-conformance-runner.md: Pattern documentation and reference - crates/pdftract-cli/tests/conformance.rs: Rust cargo test target - tests/conformance/test_conformance.py: Python pytest harness - tests/conformance/conformance.test.ts: Node.js vitest runner - tests/conformance/conformance_test.go: Go go test runner - tests/conformance/ConformanceTest.java: Java JUnit 5 runner - tests/conformance/ConformanceTests.cs: .NET xUnit runner - tests/conformance/conformance.c: C standalone binary - tests/conformance/conformance_test.rb: Ruby minitest runner - tests/conformance/ConformanceTest.php: PHP PHPUnit runner - tests/conformance/ConformanceTests.swift: Swift XCTest runner All runners implement: - Loading of tests/sdk-conformance/cases.json - Execution of test cases with language-native method invocations - Comparison of results against expected values with numeric tolerances - Emission of machine-readable conformance-report.json - Non-zero exit on failures/errors for CI gating Acceptance criteria: - PASS: All 10 SDKs have language-specific runners - PASS: Runners consume shared cases.json - PASS: Runners emit JSON reports matching schema - PASS: Runners exit non-zero on failure - WARN: README integration pending SDK repo creation - WARN: Stub implementations return placeholder results References: - Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner" - Plan line 3589: "Conformance suite results published as Argo artifact" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-5omc	2026-05-18 01:32:24 -04:00
jedarden	857f928732	feat(pdftract-5omc): implement SDK conformance test runner pattern Implement the conformance test runner pattern that every SDK will implement to validate against the shared test suite. - Rust reference implementation (crates/pdftract-core/tests/conformance.rs) * Full test suite loader and executor * Comparison engine with min/max, string constraints, tolerances * Skip logic for unsupported features and schema versions * Report generation in JSON format - CLI compare subcommand (crates/pdftract-cli/src/main.rs) * pdftract compare - Compare actual vs expected with tolerances * Cross-language comparison tool to avoid reimplementations - Documentation (docs/conformance/sdk-contract.md) * Complete pattern specification with pseudocode * Per-language runner locations * CI integration requirements - Python reference stub (tests/python-conformance/test_conformance.py) * Full pytest-based implementation following the pattern Closes: pdftract-5omc	2026-05-18 01:22:23 -04:00
jedarden	02488a354c	fix(pdftract-2t9): update regression-corpus step image and secret Changes: - Use pdftract-test-glibc:1.78 image (has aws/b2 CLI preinstalled) - Use b2-readonly secret instead of armor-secrets - Update env var names to ARMOR_ACCESS_KEY_ID/ARMOR_SECRET_ACCESS_KEY - Remove apt-get install step (tools already in image) The cer-diff tool was already implemented in a previous commit. This commit fixes the image and secret references per the bead spec. References pdftract-2t9 acceptance criteria: - regression-corpus step runs on every PR (✓ already in workflow) - Uses pdftract-test-glibc:1.78 image (✓ fixed) - Uses b2-readonly secret (✓ fixed) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 01:20:53 -04:00
jedarden	c914eece6e	test(pdftract-2bpf6): add FlateDecode predictor tests and proptests Add missing tests for FlateDecode predictor functionality: - test_png_predictor_14_rgba_paeth: Verify PNG predictor 14 (Paeth) on 8-bit RGBA - test_flate_decode_performance_100mb: Performance benchmark (100 MB < 250 ms in release) - proptest_flate_decode_no_panic: Random byte sequences never panic - proptest_flate_decode_with_predictor_no_panic: Random predictor params never panic - proptest_flate_decode_bomb_limit_no_panic: Bomb limits never panic All acceptance criteria for pdftract-2bpf6 now PASS: - PNG predictor 15 with all 6 selector types: byte-perfect - Simple FlateDecode: byte-perfect round-trip - TIFF predictor 2: 8-bit RGB delta-decoded correctly - PNG predictor 14 (Paeth) on RGBA: correct output - Truncated stream: returns partial bytes - Bomb limit: 3 GB → 2 GB truncation - Performance: < 250 ms for 100 MB (release mode) - proptest: 256 random cases × 3 tests, no panics - INV-8: all error paths return partial bytes Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 01:08:21 -04:00
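For readers unfamiliar with the Paeth case exercised by `test_png_predictor_14_rgba_paeth`, the row-level arithmetic can be sketched as below. The function names are hypothetical and are not taken from the crate. The sketch assumes the row's leading filter-type byte has already been stripped, that `prev` is the previously reconstructed row (all zeros for the first row) and at least as long as the current row, and that `bpp` is 4 for 8-bit RGBA.

```rust
/// Paeth predictor as defined by the PNG specification:
/// pick whichever of left (a), up (b), upper-left (c) is closest to a + b - c.
fn paeth(a: u8, b: u8, c: u8) -> u8 {
    let (ai, bi, ci) = (a as i16, b as i16, c as i16);
    let p = ai + bi - ci;
    let pa = (p - ai).abs();
    let pb = (p - bi).abs();
    let pc = (p - ci).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Undo the Paeth filter for one row (hypothetical helper).
fn unfilter_paeth_row(filtered: &[u8], prev: &[u8], bpp: usize) -> Vec<u8> {
    let mut out: Vec<u8> = Vec::with_capacity(filtered.len());
    for i in 0..filtered.len() {
        // Left and upper-left neighbours are zero for the first pixel.
        let a = if i >= bpp { out[i - bpp] } else { 0 };
        let c = if i >= bpp { prev[i - bpp] } else { 0 };
        out.push(filtered[i].wrapping_add(paeth(a, prev[i], c)));
    }
    out
}

/// Apply the Paeth filter to one row; included only so the
/// round trip can be checked.
fn filter_paeth_row(raw: &[u8], prev: &[u8], bpp: usize) -> Vec<u8> {
    (0..raw.len())
        .map(|i| {
            let a = if i >= bpp { raw[i - bpp] } else { 0 };
            let c = if i >= bpp { prev[i - bpp] } else { 0 };
            raw[i].wrapping_sub(paeth(a, prev[i], c))
        })
        .collect()
}
```

Filtering a row and then unfiltering it against the same `prev` should return the original bytes, which is the kind of byte-perfect round trip the acceptance criteria refer to.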
jedarden	6aabfa0c96	feat(pdftract-q15sh): implement v1 fingerprint algorithm Implement Merkle SHA-256 fingerprint algorithm for PDF structural fingerprinting as specified in Phase 1.7 of the plan. Components: - FingerprintInput struct with page data and catalog flags - Per-page hashing: content streams (normalized), resources (sorted), geometry (4dp banker's rounding) - Structure tree hash for tagged PDFs - Catalog feature flag byte (encryption, JS, XFA, OCG) Acceptance criteria: - INV-3: 100% reproducible fingerprints (test passes) - INV-13: Output format ^pdftract-v1:[0-9a-f]{64}$ (test passes) - Performance: 100-page PDF in < 1ms (test passes) - KU-7: WARN - no linearized fixtures available Closes pdftract-q15sh Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 01:02:30 -04:00
jedarden	f76f3a647b	test(pdftract-5tmcg): add cycle detection test for page tree flattener Add test_cycle_detection_in_page_tree to verify that circular references in the /Pages tree are detected and handled gracefully without panicking. The test creates a page tree with a cycle (parent -> child1 -> child2 -> child1) and verifies that the flattener returns the valid pages while pruning the cyclic portion. Acceptance criteria verified: - 3-level /Pages inheritance with MediaBox: PASS - EC-09 missing MediaBox defaults to US Letter: PASS - /Pages tree with cycles detected: PASS - /Rotate value 45 clamped to 0: PASS - Page count validation: PASS - proptest random shapes never panic: PASS - INV-8 no panics on invalid input: PASS Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-5tmcg Bead-Id: pdftract-4iier	2026-05-18 00:38:44 -04:00
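The cycle case in f76f3a647b (parent -> child1 -> child2 -> child1) can be illustrated with a visited set over object ids. This is a simplified sketch under stated assumptions: `Node`, `flatten`, and the use of bare `u32` ids in a `HashMap` are hypothetical stand-ins, not the flattener's real types.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified page-tree node keyed by object id (hypothetical model).
enum Node {
    Pages(Vec<u32>), // intermediate /Pages node with its /Kids
    Page,            // leaf /Page
}

/// Flatten the tree in document order, visiting each object id at most once.
/// A reference back to an id that was already visited is pruned instead of
/// being followed, so a cycle cannot cause unbounded recursion.
fn flatten(nodes: &HashMap<u32, Node>, root: u32) -> Vec<u32> {
    let mut pages = Vec::new();
    let mut visited = HashSet::new();
    let mut stack = vec![root];

    while let Some(id) = stack.pop() {
        if !visited.insert(id) {
            continue; // already seen: cyclic or duplicate reference
        }
        match nodes.get(&id) {
            Some(Node::Page) => pages.push(id),
            // Push kids in reverse so they pop in document order.
            Some(Node::Pages(kids)) => stack.extend(kids.iter().rev().copied()),
            None => {} // dangling reference: skip it
        }
    }
    pages
}
```

A global visited set also drops a page that is legitimately referenced twice; tracking only the current ancestor chain would be the alternative if that distinction matters.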
jedarden	b1317457e7	feat(pdftract-3nnqy): implement StreamDecoder trait, filter pipeline, and bomb limit - StreamDecoder trait with decode() method for filter-specific decoding - Per-filter implementations: FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, PassthroughDecoder - decode_stream() function with single and array filter handling - Filter abbreviation normalization (/A85 -> ASCII85Decode, /Fl -> FlateDecode) - ExtractionOptions with max_decompress_bytes (default 2 GB) - Document-level decompression counter with chunked bomb limit checking - Unknown filter returns raw bytes with STRUCT_UNKNOWN_FILTER diagnostic - All 183 tests pass Acceptance criteria: - decode_stream() handles single-filter and array-filter cases: PASS - /DecodeParms array correctly paired with /Filter array: PASS - Critical test [/ASCII85Decode /FlateDecode] applies filters in order: PASS - Filter abbreviations normalized: PASS - 2 GB bomb limit with STREAM_BOMB diagnostic: PASS - Unknown filter passthrough with STRUCT_UNKNOWN_FILTER: PASS - INV-8 maintained (no panics, partial bytes on error): PASS Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-05-18 00:34:28 -04:00
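The pipeline behaviour listed in b1317457e7 (abbreviation normalization, in-order filter application, unknown-filter passthrough, and a byte budget) might look roughly like the sketch below. The real trait and function signatures are not shown in the log, so everything here is an assumed shape: the trait returns a plain `Vec<u8>`, diagnostics are bare string codes, the budget is checked per call rather than per document, and only ASCIIHexDecode is implemented, so any other filter (including FlateDecode) falls through to the passthrough arm in this sketch.

```rust
/// Illustrative decoder interface (assumed signature).
trait StreamDecoder {
    fn decode(&self, input: &[u8]) -> Vec<u8>;
}

struct AsciiHexDecoder;
struct PassthroughDecoder;

impl StreamDecoder for AsciiHexDecoder {
    fn decode(&self, input: &[u8]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut hi: Option<u8> = None;
        for &b in input {
            if b == b'>' {
                break; // end-of-data marker
            }
            // Non-hex bytes (whitespace in practice) are skipped.
            let Some(n) = (b as char).to_digit(16) else { continue };
            match hi.take() {
                Some(h) => out.push((h << 4) | n as u8),
                None => hi = Some(n as u8),
            }
        }
        if let Some(h) = hi {
            out.push(h << 4); // odd digit count: pad with zero
        }
        out
    }
}

impl StreamDecoder for PassthroughDecoder {
    fn decode(&self, input: &[u8]) -> Vec<u8> {
        input.to_vec()
    }
}

/// Map the abbreviated filter names from the PDF specification to full names.
fn normalize_filter(name: &str) -> &str {
    match name {
        "AHx" => "ASCIIHexDecode",
        "A85" => "ASCII85Decode",
        "LZW" => "LZWDecode",
        "Fl" => "FlateDecode",
        "RL" => "RunLengthDecode",
        "CCF" => "CCITTFaxDecode",
        "DCT" => "DCTDecode",
        other => other,
    }
}

/// Apply `filters` left to right; returns decoded bytes plus diagnostic codes.
fn decode_stream(filters: &[&str], data: &[u8], max_bytes: usize) -> (Vec<u8>, Vec<&'static str>) {
    let mut bytes = data.to_vec();
    let mut diagnostics = Vec::new();
    for name in filters {
        let decoder: Box<dyn StreamDecoder> = match normalize_filter(name) {
            "ASCIIHexDecode" => Box::new(AsciiHexDecoder),
            _ => {
                diagnostics.push("STRUCT_UNKNOWN_FILTER");
                Box::new(PassthroughDecoder)
            }
        };
        bytes = decoder.decode(&bytes);
        if bytes.len() > max_bytes {
            bytes.truncate(max_bytes); // keep partial output, never panic
            diagnostics.push("STREAM_BOMB");
            break;
        }
    }
    (bytes, diagnostics)
}
```

Applying the filter array in listed order is the important part: the first entry decodes the raw stream bytes, and each later entry decodes the output of the one before it.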
jedarden	cedc9a86af	fix(pdftract-1yad): enable proptest tests and update verification note - Remove incorrect #[cfg(feature = "proptest")] since proptest is not behind a feature - Update verification note to reflect 30 passing tests (includes 2 proptest tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 00:15:00 -04:00
jedarden	6477f7703f	fix(pdftract-2bsfc): fix stream tests and catalog parser error handling - Fix stream.rs test cases to use PdfStream::new() correctly (takes PdfDict directly, not wrapped in PdfObject::Dict) - Fix catalog.rs test cases to use PdfObject::Dict(Box::new(dict)) (API change) - Update parse_catalog to return Ok(empty_catalog) with STRUCT_MISSING_KEY diagnostic instead of Err when /Pages is missing (per bead acceptance criteria) All catalog parser tests pass: - 27 tests including 6 proptests for INV-8 compliance - PageLabels number tree with mixed roman/arabic styles - Tagged PDF detection via /MarkInfo - Optional fields (Outlines, Version, etc.) - proptest: random PdfObject as /Root never panics Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:56:10 -04:00
jedarden	3c1c44129c	feat(pdftract-7nav): add PdfStream helper methods and consolidate stream types - Add filter(), decode_params(), length() helper methods to PdfStream in types.rs - Remove duplicate PdfStream definition from stream.rs - Update decode_stream to use types.rs PdfStream - Fix stream tests to use PdfDict directly instead of PdfObject::Dict wrapper Acceptance criteria: - PdfObject size: 24 bytes (under 32-byte target) - All 24 object types tests pass - Name interner deduplicates correctly - PdfDict preserves insertion order Refs: pdftract-7nav Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:55:47 -04:00
jedarden	b535638104	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree Implement the document catalog parser (/Root traversal) for PDF documents. The catalog parser extracts all key entries from the document catalog including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names, Metadata, PageLabels, OCProperties, OpenAction, AA, and Version. Key structures: - MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects - PageLabelStyle: enum for all label styles (D, R, r, A, a) - PageLabel: single page label with style, prefix, and start value - PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support - OcProperties: stub for OCG implementation (delegated to dedicated bead) - Catalog: main catalog struct with all required and optional fields Number tree implementation: - Parses /Nums arrays (leaf nodes with alternating key-value pairs) - Supports /Kids arrays (internal nodes for recursive tree traversal) - Provides get_label_with_start() and get_label() methods for lookup - Correctly formats roman numerals (uppercase/lowercase) and letter sequences All 27 tests pass including proptests for fuzzing robustness (INV-8). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:45:45 -04:00
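As an illustration of the label formatting and number-tree lookup described in b535638104, here is a minimal sketch. The type and function names mirror the commit message only loosely: `format_label` and `label_for` are hypothetical, and a flat `BTreeMap` stands in for the already-flattened /Nums entries rather than the real `PageLabelsTree`. The letter style follows the PDF specification (A to Z, then AA to ZZ, and so on).

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
enum PageLabelStyle {
    Decimal,    // /D
    RomanUpper, // /R
    RomanLower, // /r
    AlphaUpper, // /A
    AlphaLower, // /a
}

struct PageLabel {
    style: PageLabelStyle,
    prefix: String,
    start: u32,
}

/// Uppercase roman numeral for `n` (empty string for 0).
fn roman(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
        (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ];
    let mut s = String::new();
    for (value, glyph) in TABLE {
        while n >= value {
            s.push_str(glyph);
            n -= value;
        }
    }
    s
}

/// Letter label for a 1-based `n`: A..Z, AA..ZZ, AAA..ZZZ, ...
fn letters(n: u32) -> String {
    if n == 0 {
        return String::new();
    }
    let letter = (b'A' + ((n - 1) % 26) as u8) as char;
    let count = ((n - 1) / 26 + 1) as usize;
    std::iter::repeat(letter).take(count).collect()
}

fn format_label(style: PageLabelStyle, prefix: &str, number: u32) -> String {
    let body = match style {
        PageLabelStyle::Decimal => number.to_string(),
        PageLabelStyle::RomanUpper => roman(number),
        PageLabelStyle::RomanLower => roman(number).to_lowercase(),
        PageLabelStyle::AlphaUpper => letters(number),
        PageLabelStyle::AlphaLower => letters(number).to_lowercase(),
    };
    format!("{prefix}{body}")
}

/// Look up the label for a 0-based page index: the range with the greatest
/// starting index <= `page_index` applies, counting up from its start value.
fn label_for(nums: &BTreeMap<u32, PageLabel>, page_index: u32) -> Option<String> {
    let (&first, label) = nums.range(..=page_index).next_back()?;
    Some(format_label(label.style, &label.prefix, label.start + (page_index - first)))
}
```

For a document whose first four pages use lowercase roman numerals and whose body restarts at decimal 1 on page index 4, `label_for` would give `iii` for index 2 and `1` for index 4.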
jedarden	88278c362f	feat(pdftract-4hn1): use Cow<'static, str> for diagnostic messages Changed Diagnostic::msg from String to Cow<'static, str> to avoid allocations for static error messages. Static messages now use Cow::Borrowed, while dynamic formatted messages use Cow::Owned. Also fixed peek_token lifetime issue - was returning reference to local variable, now returns reference from cache. Acceptance criteria: - Token enum with all required variants - Lexer struct with position tracking and diagnostics - Diagnostic uses Cow<'static, str> for zero-allocation static messages - All public methods implemented: new, next_token, peek_token, position, take_diagnostics - All internal helpers implemented Refs: pdftract-4hn1 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-4hn1	2026-05-17 23:23:38 -04:00
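The `Cow<'static, str>` change in 88278c362f amounts to the pattern sketched below. Only the `msg` field type comes from the commit message; the `offset` field, the constructor names, and the two example diagnostics are hypothetical.

```rust
use std::borrow::Cow;

/// Minimal diagnostic record (assumed shape; only `msg`'s type is from the log).
struct Diagnostic {
    offset: usize,
    msg: Cow<'static, str>,
}

impl Diagnostic {
    /// Static message: borrows the literal, so no heap allocation.
    fn fixed(offset: usize, msg: &'static str) -> Self {
        Diagnostic { offset, msg: Cow::Borrowed(msg) }
    }

    /// Dynamic message: takes ownership of an already formatted String.
    fn formatted(offset: usize, msg: String) -> Self {
        Diagnostic { offset, msg: Cow::Owned(msg) }
    }
}

fn unterminated_string(offset: usize) -> Diagnostic {
    Diagnostic::fixed(offset, "unterminated string")
}

fn unexpected_byte(offset: usize, byte: u8) -> Diagnostic {
    Diagnostic::formatted(offset, format!("unexpected byte 0x{byte:02X}"))
}
```

Callers can treat both variants uniformly because `Cow<'static, str>` dereferences to `&str`; only the formatted case pays for an allocation.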

13 commits