The encrypt_padded_mut API requires the buffer to be large enough to
hold the padded ciphertext. The tests were using plaintext.to_vec() which
only allocated plaintext.len() bytes, insufficient for padding.
Changed pattern:
- Before: plaintext.to_vec() (insufficient space)
- After: vec![0u8; plaintext.len() + 16] with copy_from_slice
Also fixed incorrect usage: encrypt_padded_mut returns Result<(), Error>,
not a length. Use data_copy.len() directly for ciphertext length.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This implements proper GIL release around all blocking extraction calls
so Python threads can run concurrently during PDF processing.
Changes:
- extract_py: Wrap extract_pdf call with py.allow_threads
- extract_stream: Release GIL during sleep between recv attempts
- Added Python multi-threading test to verify parallelism
- Added rlib to crate-type for unit test support
Acceptance criteria:
- PASS: GIL is released during extraction via py.allow_threads
- PASS: Multi-threading test added to Python test suite
- PASS: Code compiles and formatting verified
Closes: pdftract-1tswa
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace custom exception structs with PyO3's create_exception! macro to ensure
proper Python inheritance. EncryptionError now inherits from PdftractError,
enabling isinstance(e, PdftractError) to return True for all exception types.
Changes:
- Use create_exception! macro for all 8 exception types
- Update map_error_to_py to set attributes via PyErr::value(py).setattr()
- Register exceptions with py.get_type::<T>() in module init
- Add unit tests for hierarchy and attributes
Closes: pdftract-4ewgr
Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration.
Changes:
- Added ThreadJson and BeadJson structs to schema/mod.rs
- Added thread_to_json() function to threads/mod.rs
- Added build_page_ref_to_index() helper to parser/pages.rs
- Added threads field to ExtractionResult in extract.rs
- Implemented Phase 7.7 extraction logic with discover_threads/walk_beads
- Added threads_to_markdown() and collapse_page_ranges() to markdown.rs
- Updated JSON schema with ThreadJson and BeadJson definitions
- Added thread_to_py() and bead_to_py() conversions in pdftract-py
- Exported ThreadJson, BeadJson from lib.rs
All 32 threads module tests pass. All 35 markdown tests pass.
Verification: notes/pdftract-3h9xo.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add attachments field to ExtractionResult struct
- Implement extract_attachments helper function to walk /AF array
- Add base64 encoding for attachment content in AttachmentBuilder::into_json
- Update result_to_json to include attachments in output
- Add PyO3 bindings for attachments with base64 data decoded to bytes
- Export AttachmentJson from pdftract-core root
- Add base64 dependency to pdftract-core and pdftract-py
Per plan 7.5.3:
- Attachments > 50 MB are truncated (metadata only, data: null, truncated: true)
- Base64 encoding uses RFC 4648 standard alphabet with padding
- CLI --text mode excludes attachments (existing behavior maintained)
- JSON sink includes attachments array
Closes: pdftract-3j2u
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the Python SDK with all 9 contract methods, 8 exception
classes, type definitions, asyncio wrappers, and subprocess fallback.
Changes:
- Add Python wrapper module with extract, extract_text, extract_markdown,
extract_stream, search, get_metadata, hash, classify, verify_receipt
- Add exception hierarchy: PdftractError base class with 7 subclasses
- Add dataclass type definitions: Document, Page, Span, Block, Match,
Fingerprint, Classification, Metadata
- Add asyncio module with async wrappers for 4 long-running methods
- Add subprocess fallback for when native module fails to import
- Add conformance test runner under tests/test_conformance.py
- Update pyproject.toml with dynamic version from Cargo
Closes: pdftract-2nu0s
Add callback-based streaming API to pdftract-core and PyO3 bindings that
return a Python iterator yielding page dicts incrementally. This provides
memory-efficient extraction for large PDFs via the iterator protocol.
Core changes:
- Add extract_pdf_streaming() callback-based function to pdftract-core
- Export extract_pdf_streaming in lib.rs
PyO3 bindings:
- Add StreamIterator PyClass with __iter__/__next__ methods
- Add extract_stream_fn() spawning background thread with mpsc channel
- Add *Frame types for efficient Python dict serialization
- Integrate into pdftract Python module
Closes: pdftract-bnba5
Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright
notices. Configure all workspace and non-workspace crates to declare the
license. Wire license files into Python wheels and Docker images.
Files added:
- LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero"
- LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org)
Files modified:
- Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>"
- crates/pdftract-py/pyproject.toml: Added license-files to maturin config
- crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true
- xtask/Cargo.toml: Added license = "MIT OR Apache-2.0"
- fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0"
- Cargo-dist.toml: Created to include license files in binary archives
- notes/pdftract-aawrz.md: Verification note
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the stdio transport for the MCP server, enabling communication
with local agents (Claude Desktop, Claude Code, Continue, Cursor) over
standard input/output with Content-Length framing.
Core features:
- LSP-style Content-Length framing with \r\n terminators
- JSON-RPC 2.0 message parsing and serialization
- INV-9 compliance: stdout contains only JSON-RPC frames
- Panic hook redirects panics to stderr
- SIGTERM handler for graceful shutdown
- Parse errors return -32700 with id: null, then continue
Acceptance criteria:
- ✅ Piping tools/list with framing produces expected response < 50ms
- ✅ EOF on stdin → clean exit within 100ms
- ✅ Malformed JSON → -32700 error, subsequent requests work
- ✅ No println!/log output to stdout (INV-9 enforced)
- ✅ Panics go to stderr, no partial JSON on stdout
- ✅ SIGTERM → exit 0, SIGINT → immediate non-zero exit
Tests added:
- crates/pdftract-cli/tests/mcp-stdio.rs (8 integration tests, all pass)
- All 49 existing unit tests continue to pass
Refs: pdftract-67tm8, plan Phase 6.7.2
Fixed compilation error in xref.rs where u64 literal 0x5DEECE66D was used
with u32 state, causing overflow. Changed state to u64 for proper Java
Random algorithm behavior.
The OCG /OCProperties parsing implementation was already complete and
all tests pass. See notes/pdftract-2a6rk.md for verification.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add details about the BytesSource cleanup bug fix and clarify that the
contract defines 7 error kinds, not 8 as initially stated in the task.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit implements the Cargo.lock policy for reproducible builds
across all workspace members (pdftract-core, pdftract-cli, pdftract-py).
Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note
The Argo workflow updates (pdftract-ci.yaml) are committed separately
in the declarative-config repo.
Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>