The apostrophe in 'banker's_rounding' is invalid Rust 2021 syntax.
Changed to 'bankers_rounding' to fix compilation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit implements the Cargo.lock policy for reproducible builds
across all workspace members (pdftract-core, pdftract-cli, pdftract-py).
Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note
The Argo workflow updates (pdftract-ci.yaml) are committed separately
in the declarative-config repo.
Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The structural token lexer was already fully implemented. All 84 lexer
tests pass, covering all acceptance criteria:
- Array/dict delimiters ([], <<>>)
- Keywords (true, false, null, obj, endobj, stream, endstream, R)
- Hex string vs dict ambiguity (< vs <<)
- Stream header validation (\n or \r\n only, lone \r is invalid)
- Case-sensitive keyword matching
This commit fixes a pre-existing compilation error in xref.rs where
forward_scan_memory() called parse_obj_header_at_memory() which didn't
exist. Added the missing function as a byte-slice variant of
parse_obj_header_at() for efficient memory-based scanning.
Verification: notes/pdftract-5upi.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement TH-07 password ingress channels for CLI:
- --password-stdin flag (reads one line from stdin)
- PDFTRACT_PASSWORD env var
- --password VALUE (rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1)
Exit code 64 for insecure password usage with stderr hint.
Stderr warning emitted when --password VALUE accepted via opt-in.
Priority order: stdin > env var > value (opt-in) > none.
Empty password (bare newline) treated as no password.
Acceptance criteria:
- --password-stdin: PASS
- PDFTRACT_PASSWORD: PASS
- --password VALUE rejection (exit 64): PASS
- Stderr warning on opt-in: PASS
- Exit codes: PASS
- Python/MCP/Serve: N/A (crates don't exist yet)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix Token::Keyword to use b"..." .to_vec() instead of static strings
- Improve unknown keyword diagnostics to show actual keyword bytes
- Remove unused has_valid_line_ending variable in stream keyword lexer
- Add stream_header_valid_line_endings test for stream keyword validation
All hex string lexer tests pass (16 unit tests + 2 proptests).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-2hm4
Add two proptests for the PDF hex string lexer to verify robustness
and correctness:
1. proptest_hex_string_never_panics_on_random_bytes: Random byte
sequences starting with '<' (not '<<') never cause panics.
2. proptest_hex_string_roundtrip_via_reencode: Hex decode + re-encode
roundtrip property validates that encoding and decoding are
inverse operations.
The hex string lexer implementation was already present and correct,
with proper handling of odd-length zero padding (<4> -> \x40, not \x04).
All acceptance criteria pass:
- Empty hex string: <> -> b""
- Odd-length single nibble: <4> -> b"\x40" (critical test)
- Standard decoding: <48656C6C6F> -> b"Hello"
- Mixed case: <aBcD> -> b"\xAB\xCD"
- Whitespace ignored: <48 65> -> b"\x48\x65"
- Unterminated with diagnostic: <48 -> b"\x48" + STRUCT_UNTERMINATED_STRING
- Proptests pass: random bytes never panic, roundtrip property holds
- INV-8 maintained: all error paths use diagnostics, no panics
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rename all DiagCode enum variants in the lexer to use the STRUCT_ prefix
to match the specification. This clarifies that these diagnostics relate
to structural/lexical issues in PDF documents.
Changes:
- InvalidName -> StructInvalidName
- InvalidHex -> StructInvalidHex
- InvalidOctal -> StructInvalidOctal
- InvalidStreamHeader -> StructInvalidStreamHeader
- UnexpectedEof -> StructUnexpectedEof
- UnterminatedString -> StructUnterminatedString
The hex string lexer implementation was already correct, with proper
handling of:
- Hex digit pair decoding
- Embedded whitespace (PDF spec 7.2.2)
- Odd-length zero padding: <4> -> \x40 (dangling nibble is HIGH)
- Invalid character diagnostics
- Unterminated string diagnostics
All 16 hex string tests pass, including critical tests for odd-length
padding and error handling.
See: notes/pdftract-2hm4.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add verify_receipt method support to Go templates:
- client.go.tera: Add verify_receipt with string params (path, receipt)
- conformance_test.go.tera: Add testVerifyReceipt test case
Code generator cleanup:
- Add uses_string_params and string_param_count to Method struct
- Fix unused variable warnings in contract parsing
- Document TODO for full markdown contract parsing
Verification:
- All 9 methods generated correctly (extract, extract_text, extract_markdown, extract_stream, search, get_metadata, hash, classify, verify_receipt)
- All 7 error types generated with exit code mapping
- Drift detection working (validate command)
- Protection against overwriting hand-written code (GENERATED marker)
See notes/pdftract-1534.md for full acceptance criteria status.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-1534
Add verify_receipt method support to Go templates:
- client.go.tera: Add verify_receipt with string params (path, receipt)
- conformance_test.go.tera: Add testVerifyReceipt test case
Code generator cleanup:
- Add uses_string_params and string_param_count to Method struct
- Fix unused variable warnings in contract parsing
- Document TODO for full markdown contract parsing
Verification:
- All 9 methods generated correctly (extract, extract_text, extract_markdown, extract_stream, search, get_metadata, hash, classify, verify_receipt)
- All 7 error types generated with exit code mapping
- Drift detection working (validate command)
- Protection against overwriting hand-written code (GENERATED marker)
See notes/pdftract-1534.md for full acceptance criteria status.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two fixes:
1. Hex string lexer now flushes dangling nibble when encountering invalid
characters. For `<4X8Y>`, the X and Y are invalid, so we flush nibble 4
as 0x40, then flush nibble 8 as 0x80, producing `\x40\x80`.
2. Fixed skip_whitespace_and_comments() to properly handle whitespace
after comments. The previous logic only continued looping if the next
byte was `%`, missing cases where whitespace follows a comment.
All 52 lexer tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the conformance test runner pattern for all 10 SDKs as specified
in the plan (line 3547). Each SDK now has a dedicated conformance test runner.
Created:
- tests/sdk-conformance/report-schema.json: JSON schema for conformance reports
- docs/notes/sdk-conformance-runner.md: Pattern documentation and reference
- crates/pdftract-cli/tests/conformance.rs: Rust cargo test target
- tests/conformance/test_conformance.py: Python pytest harness
- tests/conformance/conformance.test.ts: Node.js vitest runner
- tests/conformance/conformance_test.go: Go go test runner
- tests/conformance/ConformanceTest.java: Java JUnit 5 runner
- tests/conformance/ConformanceTests.cs: .NET xUnit runner
- tests/conformance/conformance.c: C standalone binary
- tests/conformance/conformance_test.rb: Ruby minitest runner
- tests/conformance/ConformanceTest.php: PHP PHPUnit runner
- tests/conformance/ConformanceTests.swift: Swift XCTest runner
All runners implement:
- Loading of tests/sdk-conformance/cases.json
- Execution of test cases with language-native method invocations
- Comparison of results against expected values with numeric tolerances
- Emission of machine-readable conformance-report.json
- Non-zero exit on failures/errors for CI gating
Acceptance criteria:
- PASS: All 10 SDKs have language-specific runners
- PASS: Runners consume shared cases.json
- PASS: Runners emit JSON reports matching schema
- PASS: Runners exit non-zero on failure
- WARN: README integration pending SDK repo creation
- WARN: Stub implementations return placeholder results
References:
- Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner"
- Plan line 3589: "Conformance suite results published as Argo artifact"
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5omc
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.
- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
* Full test suite loader and executor
* Comparison engine with min/max, string constraints, tolerances
* Skip logic for unsupported features and schema versions
* Report generation in JSON format
- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
* pdftract compare - Compare actual vs expected with tolerances
* Cross-language comparison tool to avoid reimplementations
- Documentation (docs/conformance/sdk-contract.md)
* Complete pattern specification with pseudocode
* Per-language runner locations
* CI integration requirements
- Python reference stub (tests/python-conformance/test_conformance.py)
* Full pytest-based implementation following the pattern
Closes: pdftract-5omc
Changes:
- Use pdftract-test-glibc:1.78 image (has aws/b2 CLI preinstalled)
- Use b2-readonly secret instead of armor-secrets
- Update env var names to ARMOR_ACCESS_KEY_ID/ARMOR_SECRET_ACCESS_KEY
- Remove apt-get install step (tools already in image)
The cer-diff tool was already implemented in a previous commit.
This commit fixes the image and secret references per the bead spec.
References pdftract-2t9 acceptance criteria:
- regression-corpus step runs on every PR (✓ already in workflow)
- Uses pdftract-test-glibc:1.78 image (✓ fixed)
- Uses b2-readonly secret (✓ fixed)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement Merkle SHA-256 fingerprint algorithm for PDF structural
fingerprinting as specified in Phase 1.7 of the plan.
Components:
- FingerprintInput struct with page data and catalog flags
- Per-page hashing: content streams (normalized), resources (sorted),
geometry (4dp banker's rounding)
- Structure tree hash for tagged PDFs
- Catalog feature flag byte (encryption, JS, XFA, OCG)
Acceptance criteria:
- INV-3: 100% reproducible fingerprints (test passes)
- INV-13: Output format ^pdftract-v1:[0-9a-f]{64}$ (test passes)
- Performance: 100-page PDF in < 1ms (test passes)
- KU-7: WARN - no linearized fixtures available
Closes pdftract-q15sh
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add test_cycle_detection_in_page_tree to verify that circular references
in the /Pages tree are detected and handled gracefully without panicking.
The test creates a page tree with a cycle (parent -> child1 -> child2 -> child1)
and verifies that the flattener returns the valid pages while pruning the
cyclic portion.
Acceptance criteria verified:
- 3-level /Pages inheritance with MediaBox: PASS
- EC-09 missing MediaBox defaults to US Letter: PASS
- /Pages tree with cycles detected: PASS
- /Rotate value 45 clamped to 0: PASS
- Page count validation: PASS
- proptest random shapes never panic: PASS
- INV-8 no panics on invalid input: PASS
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5tmcg
Bead-Id: pdftract-4iier
- Remove incorrect #[cfg(feature = "proptest")] since proptest is not behind a feature
- Update verification note to reflect 30 passing tests (includes 2 proptest tests)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix stream.rs test cases to use PdfStream::new() correctly (takes PdfDict directly, not wrapped in PdfObject::Dict)
- Fix catalog.rs test cases to use PdfObject::Dict(Box::new(dict)) (API change)
- Update parse_catalog to return Ok(empty_catalog) with STRUCT_MISSING_KEY diagnostic instead of Err when /Pages is missing (per bead acceptance criteria)
All catalog parser tests pass:
- 27 tests including 6 proptests for INV-8 compliance
- PageLabels number tree with mixed roman/arabic styles
- Tagged PDF detection via /MarkInfo
- Optional fields (Outlines, Version, etc.)
- proptest: random PdfObject as /Root never panics
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.
Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields
Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences
All 27 tests pass including proptests for fuzzing robustness (INV-8).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Changed Diagnostic::msg from String to Cow<'static, str> to avoid
allocations for static error messages. Static messages now use
Cow::Borrowed, while dynamic formatted messages use Cow::Owned.
Also fixed peek_token lifetime issue - was returning reference to
local variable, now returns reference from cache.
Acceptance criteria:
- Token enum with all required variants
- Lexer struct with position tracking and diagnostics
- Diagnostic uses Cow<'static, str> for zero-allocation static messages
- All public methods implemented: new, next_token, peek_token, position, take_diagnostics
- All internal helpers implemented
Refs: pdftract-4hn1
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-4hn1