The lexer should not emit diagnostics for unknown keywords because:
1. Many valid keywords (trailer, xref, etc.) are not in the initial dispatch table
2. The object parser is responsible for validating keywords against known operators
3. Emitting diagnostics here causes false positives for valid PDF constructs
This change aligns with the task requirement that unknown keywords emit
Token::Keyword without a diagnostic, letting the object parser handle
STRUCT_UNKNOWN_KEYWORD if needed.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Fixed incorrect fallback behavior in keyword lexer functions. Four
functions (lex_e_keyword, lex_o_keyword, lex_r_keyword, lex_n_keyword)
were incorrectly calling lex_name() instead of lex_keyword() when
keywords didn't match.
When a PDF contains an unrecognized word starting with e/o/n/R
(e.g., "endob" instead of "endobj"), the lexer should fall back to
generic keyword parsing (Token::Keyword(bytes)), not name parsing.
Names always start with /, so calling lex_name() on input without
a leading / would incorrectly skip the first byte.
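
The difference between the two fallbacks can be sketched as follows. This is a minimal illustration, not the lexer in this repository: the `Token` shape, the `is_regular` helper, and the function bodies are assumptions, and name escapes (`#xx`) are ignored.

```rust
/// Hypothetical token type; only the two variants discussed here are modelled.
#[derive(Debug, PartialEq)]
enum Token {
    Keyword(Vec<u8>),
    Name(Vec<u8>),
}

/// PDF whitespace and delimiter bytes terminate a run of regular characters.
fn is_regular(b: u8) -> bool {
    !matches!(
        b,
        0x00 | b'\t' | b'\n' | 0x0C | b'\r' | b' '
            | b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
    )
}

/// Generic keyword path: consumes every regular byte, including the first.
fn lex_keyword(input: &[u8]) -> (Token, usize) {
    let len = input.iter().take_while(|&&b| is_regular(b)).count();
    (Token::Keyword(input[..len].to_vec()), len)
}

/// Name path: assumes input[0] is '/', so the first byte is skipped unconditionally.
fn lex_name(input: &[u8]) -> (Token, usize) {
    let body = input.get(1..).unwrap_or(&[]);
    let len = body.iter().take_while(|&&b| is_regular(b)).count();
    (Token::Name(body[..len].to_vec()), len + 1)
}

/// Dispatch for words starting with 'e': exact matches first, then the generic fallback.
fn lex_e_keyword(input: &[u8]) -> (Token, usize) {
    for known in [&b"endstream"[..], &b"endobj"[..]] {
        // A match only counts if the keyword ends at a delimiter, whitespace, or EOF.
        let at_boundary = input.get(known.len()).map_or(true, |&b| !is_regular(b));
        if input.starts_with(known) && at_boundary {
            return (Token::Keyword(known.to_vec()), known.len());
        }
    }
    // Fall back to the generic keyword lexer, not lex_name(), so no byte is dropped.
    lex_keyword(input)
}
```

With this sketch, `lex_e_keyword(b"endob 1")` yields `Token::Keyword(b"endob")`, whereas calling `lex_name` on the same input would produce a name with the leading `e` missing.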
References:
- Bead: pdftract-5upi
- Notes: notes/pdftract-5upi.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents that CycloneDX SBOM generation is fully implemented
in the Argo Workflows (declarative-config). The workflows:
- Generate pdftract-vX.Y.Z.cdx.json using cargo-cyclonedx
- Validate schema with cyclonedx-cli validate
- Attest to Docker images via cosign attest --type cyclonedx
- Attach to GitHub Release as an asset
- Include in SHA256SUMS aggregate
Acceptance criteria: 5 PASS, 1 WARN (grype test requires release)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the completed implementation of pdftract-release-cascade
WorkflowTemplate and pdftract-tag-trigger Argo Events Sensor.
Acceptance criteria:
- PASS: All infrastructure files committed in declarative-config
- WARN: Runtime verification deferred (kubectl not available in env)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the enhancements made to cosign keyless signing:
- Projected service account token with sigstore audience
- Explicit OIDC issuer URL configuration
- Improved digest extraction with fallback strategies
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make diagnostics module visible to fingerprint module and fix
hash_page_geometry signature to match usage.
Changes:
- Add `pub mod diagnostics;` to lib.rs for module visibility
- Modify hash_page_geometry to create diagnostics internally
The canonicalize module already has complete implementation:
- canonicalize_f64: banker's rounding to 4dp for geometry
- normalize_content_stream: whitespace normalization via lexer
- serialize_dict_canonical: sorted-key dict serialization
- hash_resource_dict_canonical: order-independent resource hashing
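
For reference, banker's rounding to four decimal places can be sketched as below. This is an illustration only and not the body of canonicalize_f64: the function name, the assumption of finite inputs, and the negative-zero normalization are all hypothetical. It relies on `f64::round_ties_even`, which is available from Rust 1.77.

```rust
/// Rounds to 4 decimal places, sending exact ties to the even neighbour.
/// Sketch only: assumes a finite input whose scaled value fits comfortably in an f64.
fn canonicalize_f64_sketch(x: f64) -> f64 {
    // Scale so that the fourth decimal place becomes the units digit.
    let scaled = (x * 1e4).round_ties_even();
    let rounded = scaled / 1e4;
    // Assumption: fold -0.0 into 0.0 so both hash identically.
    if rounded == 0.0 {
        0.0
    } else {
        rounded
    }
}
```

A value is only an exact tie when its binary representation is exact: 0.03125 (1/32) rounds down to 0.0312 and 0.09375 (3/32) rounds up to 0.0938, both landing on an even last digit.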
Verification: notes/pdftract-154mz.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the WorkflowTemplate creation for mdBook → Cloudflare Pages CI.
Template committed to declarative-config 4fe4947.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the implementation of the pdftract-github-release
WorkflowTemplate, including artifact taxonomy, release notes
generation, and acceptance criteria status.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add per-PR property tests and nightly fuzz job infrastructure:
CI Changes (declarative-config):
- pdftract-ci.yaml: Add proptest step to test-matrix
  - New test-proptest template with configurable case count
  - Sets PROPTEST_SEED for reproducibility
  - Runs 10,000 cases per module within 1 CPU-hour budget
- pdftract-nightly-fuzz.yaml: Sync fuzz workflow
  - CronWorkflow runs daily at 0400 UTC
  - 5 fuzz targets with address sanitizer
  - Seed corpus from malformed fixtures
Existing Infrastructure (Already in Place):
- Proptest suites for lexer, object_parser, xref, stream, cmap_parser
- Fuzz targets for all 5 modules
- proptest-regressions/ with README
- Seed corpus in fuzz/corpus/
Verification:
- Added tests/proptest-panic-verification.rs
- Proptest infrastructure correctly structured
- Will catch deliberate panics within budget
Closes: pdftract-33v
Documents the implementation of the pdftract-crates-publish WorkflowTemplate
in jedarden/declarative-config.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OCG implementation was already complete in ocg.rs. All 20 tests pass:
- BaseState parsing (ON/OFF/Unchanged)
- /ON and /OFF array override handling
- OCMD policy preservation (AllOn, AnyOn, AllOff, AnyOff)
- INV-8 compliance verified via proptests
Phase 3 will consume OcProperties via is_visible() to suppress
glyphs in /OC /OCGRef BDC blocks when the referenced OCG is OFF.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Add details about the BytesSource cleanup bug fix and clarify that the
contract defines 7 error kinds, not 8 as initially stated in the task.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add a `source Source` parameter to invoke, invokeJSON, invokeString, invokeStream
- Change BytesSource from []byte type to struct with data and tmpPath fields
- Add proper cleanup of temporary files after subprocess execution
- Fix source parameter pass-through in Extract, ExtractText, ExtractMarkdown, GetMetadata, Hash, Classify
This ensures BytesSource temporary files are cleaned up after use, preventing
file descriptor leaks. The BytesSource now creates a temp file on demand and
cleans it up automatically via defer in the invoke methods.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the publish-if-tag step in pdftract-ci that activates on
version tags (v*.*.*) and publishes cross-compiled binaries to
GitHub Releases.
Changes:
- Add tools/extract-release-notes.sh script for CHANGELOG parsing
- Update publish-if-tag template in pdftract-ci.yaml:
  - Downloads all 5 build artifacts from build-matrix
  - Generates SHA256SUMS checksums
  - Extracts release notes from CHANGELOG.md
  - Creates GitHub Release via gh CLI
  - Supports both stable and pre-release tags (--prerelease flag)
  - Uses --clobber for idempotent re-runs
The step uses Chainguard's gh:latest image and authenticates via
github-pdftract-release Secret (GH_TOKEN key). Optional signing
infrastructure is deferred to Release Engineering epic.
Co-Authored-By: Claude Code (glm-4.7) <noreply@anthropic.com>
Add quality-matrix implementation to pdftract-ci with msrv-check step
using rust:1.78-slim to detect usage of newer Rust features.
Changes:
- .ci/argo-workflows/pdftract-ci.yaml: Implement quality-matrix DAG with
  msrv-check, clippy-fmt, and cargo-audit templates
- CHANGELOG.md: New file documenting MSRV bump policy (MINOR version
  event, warning period, update checklist)
The MSRV gate prevents silent drift that would break downstream consumers
on older toolchains. Any Rust 1.79+ feature (e.g., inline const blocks, core::error::Error)
will fail the msrv-check step, triggering a policy review.
See notes/pdftract-2w02.md for acceptance criteria verification.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Add MSRV (Minimum Supported Rust Version) pinning to 1.78 for
pdftract-core and pdftract-cli. The MSRV gate prevents silent
absorption of newer Rust features that would break downstream
consumers on older toolchains.
Changes:
- CI: Add quality-matrix DAG with msrv-check step (rust:1.78-slim)
- CI: Add clippy-check, fmt-check, cargo-audit, cargo-deny templates
- README: Add MSRV badge (shields.io)
- clippy.toml: Enable msrv=1.78 for MSRV-aware lints
- CONTRIBUTING.md: Document MSRV bump policy (MINOR version event)
The rust-version was already declared in workspace Cargo.toml;
this bead adds the CI enforcement and documentation.
Refs: pdftract-2w02
Add comprehensive verification note for forward_scan_xref implementation.
The function was already implemented in xref.rs; this note documents
verification of all bead requirements.
Also fix duplicate ObjRef import in parser/mod.rs (ObjRef is defined in
diagnostics module and re-exported).
Bead: pdftract-46lw
The apostrophe in 'banker's_rounding' is not valid in a Rust identifier.
Changed to 'bankers_rounding' to fix compilation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The stub template was already created in commit 642949b in
jedarden/declarative-config. This note documents the acceptance
criteria verification status.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Crypt filter was already implemented in the codebase. This note
documents the verification of acceptance criteria and test coverage.
Acceptance criteria verified:
- /Identity crypt passes through unchanged
- Custom crypt returns ENCRYPTION_UNSUPPORTED
- Missing /DecodeParms defaults to /Identity
- Works correctly with FlateDecode
- Comprehensive test coverage including proptests
- INV-8 maintained (no panics)
Also add missing malformed fixture entries to PROVENANCE.md.
Co-Authored-By: Claude Code <noreply@anthropic.com>
This commit implements the Cargo.lock policy for reproducible builds
across all workspace members (pdftract-core, pdftract-cli, pdftract-py).
Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note
The Argo workflow updates (pdftract-ci.yaml) are committed separately
in the declarative-config repo.
Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The workspace-level Cargo.lock is checked into version control
for reproducible builds. All Argo build steps enforce --locked
--frozen to ensure dependency versions match exactly.
This commit includes lockfile updates for new dependencies
(lzw, memchr) added during development.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Configure workspace with pdftract-core, pdftract-cli, pdftract-py members
- Add workspace.package metadata: version, edition, rust-version (1.78), license (MIT OR Apache-2.0)
- Add workspace.dependencies for shared external deps (anyhow, flate2, lzw, memchr, secrecy, serde, thiserror, tracing)
- Create .cargo/config.toml with CI and development build aliases
- All member crates reference workspace metadata via workspace = true
- pdftract-py configured as cdylib with pyo3 extension-module feature
Acceptance criteria:
- PASS: 3 workspace members listed by cargo metadata
- PASS: All crates use workspace metadata references
- WARN: cargo build fails due to code compilation errors (separate concern)
Refs: pdftract-279, plan lines 3343-3367
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The structural token lexer was already fully implemented. All 84 lexer
tests pass, covering all acceptance criteria:
- Array/dict delimiters ([], <<>>)
- Keywords (true, false, null, obj, endobj, stream, endstream, R)
- Hex string vs dict ambiguity (< vs <<)
- Stream header validation (\n or \r\n only, lone \r is invalid)
- Case-sensitive keyword matching
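
Two of these checks need only one or two bytes of lookahead. The sketch below illustrates them with hypothetical helper names (`stream_eol_len`, `classify_angle`, `AngleOpen`); it is not the code in the lexer module.

```rust
/// Length of the end-of-line marker that must follow the `stream` keyword,
/// or None when the header is malformed.
fn stream_eol_len(after_keyword: &[u8]) -> Option<usize> {
    match after_keyword {
        [b'\r', b'\n', ..] => Some(2),
        [b'\n', ..] => Some(1),
        // A lone '\r', any other byte, or end of input is not a valid header ending.
        _ => None,
    }
}

/// What an opening '<' introduces.
#[derive(Debug, PartialEq)]
enum AngleOpen {
    Dict,
    HexString,
}

/// Resolves the `<` versus `<<` ambiguity with a single byte of lookahead.
fn classify_angle(input: &[u8]) -> Option<AngleOpen> {
    match input {
        [b'<', b'<', ..] => Some(AngleOpen::Dict),
        [b'<', ..] => Some(AngleOpen::HexString),
        _ => None,
    }
}
```

Checking the two-byte patterns before the one-byte patterns is what keeps `<<` from being read as the start of a hex string and `\r\n` from being read as a lone `\r`.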
This commit fixes a pre-existing compilation error in xref.rs where
forward_scan_memory() called parse_obj_header_at_memory() which didn't
exist. Added the missing function as a byte-slice variant of
parse_obj_header_at() for efficient memory-based scanning.
Verification: notes/pdftract-5upi.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement TH-07 password ingress channels for CLI:
- --password-stdin flag (reads one line from stdin)
- PDFTRACT_PASSWORD env var
- --password VALUE (rejected unless PDFTRACT_INSECURE_CLI_PASSWORD=1)
Exit code 64 for insecure password usage with stderr hint.
Stderr warning emitted when --password VALUE accepted via opt-in.
Priority order: stdin > env var > value (opt-in) > none.
Empty password (bare newline) treated as no password.
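
The priority order can be sketched as a pure function over the already-collected inputs. Everything here is hypothetical (type names, field names, `resolve_password`, `exit_code`), and two behaviours are assumptions not stated above: the `--password VALUE` rejection is checked before any channel is chosen, and an empty value from the winning channel yields no password rather than falling through to the next channel.

```rust
/// Hypothetical error for the insecure `--password VALUE` path.
#[derive(Debug, PartialEq)]
enum PasswordError {
    InsecureCliPassword,
}

/// Maps the error to the process exit code described above.
fn exit_code(err: &PasswordError) -> i32 {
    match err {
        PasswordError::InsecureCliPassword => 64,
    }
}

/// Inputs gathered by the CLI layer before resolution.
struct PasswordInputs<'a> {
    /// Line read from stdin; Some only when --password-stdin was given.
    stdin_line: Option<&'a str>,
    /// Value of PDFTRACT_PASSWORD, if set.
    env_value: Option<&'a str>,
    /// Value passed as --password VALUE, if any.
    cli_value: Option<&'a str>,
    /// True when PDFTRACT_INSECURE_CLI_PASSWORD=1.
    insecure_opt_in: bool,
}

fn resolve_password(inputs: &PasswordInputs<'_>) -> Result<Option<String>, PasswordError> {
    // Assumption: reject up front, even if a higher-priority channel is also present.
    if inputs.cli_value.is_some() && !inputs.insecure_opt_in {
        return Err(PasswordError::InsecureCliPassword);
    }
    // Priority: stdin > env var > --password VALUE (opt-in) > none.
    let chosen = inputs
        .stdin_line
        .map(|line| line.trim_end_matches(&['\r', '\n'][..]))
        .or(inputs.env_value)
        .or(inputs.cli_value);
    // An empty password (bare newline on stdin) is treated as no password.
    Ok(chosen.filter(|p| !p.is_empty()).map(str::to_owned))
}
```

Keeping the resolution free of I/O makes the priority rules testable without touching real stdin or the process environment; the stderr warning on opt-in would live in the caller.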
Acceptance criteria:
- --password-stdin: PASS
- PDFTRACT_PASSWORD: PASS
- --password VALUE rejection (exit 64): PASS
- Stderr warning on opt-in: PASS
- Exit codes: PASS
- Python/MCP/Serve: N/A (crates don't exist yet)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix Token::Keyword to use b"...".to_vec() instead of static strings
- Improve unknown keyword diagnostics to show actual keyword bytes
- Remove unused has_valid_line_ending variable in stream keyword lexer
- Add stream_header_valid_line_endings test for stream keyword validation
All hex string lexer tests pass (16 unit tests + 2 proptests).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-2hm4
Add two proptests for the PDF hex string lexer to verify robustness
and correctness:
1. proptest_hex_string_never_panics_on_random_bytes: Random byte
   sequences starting with '<' (not '<<') never cause panics.
2. proptest_hex_string_roundtrip_via_reencode: Hex decode + re-encode
   roundtrip property validates that encoding and decoding are
   inverse operations.
The hex string lexer implementation was already present and correct,
with proper handling of odd-length zero padding (<4> -> \x40, not \x04).
All acceptance criteria pass:
- Empty hex string: <> -> b""
- Odd-length single nibble: <4> -> b"\x40" (critical test)
- Standard decoding: <48656C6C6F> -> b"Hello"
- Mixed case: <aBcD> -> b"\xAB\xCD"
- Whitespace ignored: <48 65> -> b"\x48\x65"
- Unterminated with diagnostic: <48 -> b"\x48" + STRUCT_UNTERMINATED_STRING
- Proptests pass: random bytes never panic, roundtrip property holds
- INV-8 maintained: all error paths use diagnostics, no panics
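
The decoding rules behind these criteria can be sketched as follows. This is a simplified stand-in for the real lexer: the function name and return shape are hypothetical, the input is the body after the opening `<`, and invalid bytes are silently skipped here where the real lexer reports a diagnostic.

```rust
/// Decodes the body of a PDF hex string (the bytes after '<').
/// Returns the decoded bytes and whether the closing '>' was seen.
fn decode_hex_string(body: &[u8]) -> (Vec<u8>, bool) {
    let mut out = Vec::new();
    let mut high: Option<u8> = None; // pending high nibble, if any
    let mut terminated = false;
    for &b in body {
        if b == b'>' {
            terminated = true;
            break;
        }
        let nibble = match b {
            b'0'..=b'9' => b - b'0',
            b'a'..=b'f' => b - b'a' + 10,
            b'A'..=b'F' => b - b'A' + 10,
            // Whitespace is ignored; other bytes are skipped in this sketch.
            _ => continue,
        };
        match high.take() {
            Some(h) => out.push((h << 4) | nibble),
            None => high = Some(nibble),
        }
    }
    // Odd digit count: the dangling nibble is the HIGH half, padded with a zero low half.
    if let Some(h) = high {
        out.push(h << 4);
    }
    (out, terminated)
}
```

A caller would emit STRUCT_UNTERMINATED_STRING when the returned flag is false, which keeps the error path diagnostic-based rather than panicking.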
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The PDF name object lexer was already fully implemented with
all acceptance criteria passing. Added verification note documenting
test results.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Bead-Id: pdftract-5dng
Rename all DiagCode enum variants in the lexer to use the STRUCT_ prefix
to match the specification. This clarifies that these diagnostics relate
to structural/lexical issues in PDF documents.
Changes:
- InvalidName -> StructInvalidName
- InvalidHex -> StructInvalidHex
- InvalidOctal -> StructInvalidOctal
- InvalidStreamHeader -> StructInvalidStreamHeader
- UnexpectedEof -> StructUnexpectedEof
- UnterminatedString -> StructUnterminatedString
The hex string lexer implementation was already correct, with proper
handling of:
- Hex digit pair decoding
- Embedded whitespace (PDF spec 7.2.2)
- Odd-length zero padding: <4> -> \x40 (dangling nibble is HIGH)
- Invalid character diagnostics
- Unterminated string diagnostics
All 16 hex string tests pass, including critical tests for odd-length
padding and error handling.
See: notes/pdftract-2hm4.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add verify_receipt method support to Go templates:
- client.go.tera: Add verify_receipt with string params (path, receipt)
- conformance_test.go.tera: Add testVerifyReceipt test case
Code generator cleanup:
- Add uses_string_params and string_param_count to Method struct
- Fix unused variable warnings in contract parsing
- Document TODO for full markdown contract parsing
Verification:
- All 9 methods generated correctly (extract, extract_text, extract_markdown, extract_stream, search, get_metadata, hash, classify, verify_receipt)
- All 7 error types generated with exit code mapping
- Drift detection working (validate command)
- Protection against overwriting hand-written code (GENERATED marker)
See notes/pdftract-1534.md for full acceptance criteria status.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-1534