jedarden/pdftract

Author	SHA1	Message	Date
jedarden	7ffb1a729f	fix(pdftract-63ka2): AES-128 test buffer allocation for PKCS#7 padding. The encrypt_padded_mut API requires the buffer to be large enough to hold the padded ciphertext. The tests were using plaintext.to_vec() which only allocated plaintext.len() bytes, insufficient for padding. Changed pattern: before: plaintext.to_vec() (insufficient space); after: vec![0u8; plaintext.len() + 16] with copy_from_slice. Also fixed incorrect usage: encrypt_padded_mut returns Result<(), Error>, not a length. Use data_copy.len() directly for ciphertext length. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 01:30:33 -04:00
jedarden	870d7073f0	feat(pdftract-1tswa): implement GIL release with py.allow_threads on extraction entry points. This implements proper GIL release around all blocking extraction calls so Python threads can run concurrently during PDF processing. Changes: extract_py: Wrap extract_pdf call with py.allow_threads; extract_stream: Release GIL during sleep between recv attempts; Added Python multi-threading test to verify parallelism; Added rlib to crate-type for unit test support. Acceptance criteria: PASS: GIL is released during extraction via py.allow_threads; PASS: Multi-threading test added to Python test suite; PASS: Code compiles and formatting verified. Closes: pdftract-1tswa. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 21:23:00 -04:00
jedarden	728c923237	feat(pdftract-4ewgr): implement Python exception hierarchy with proper inheritance. Replace custom exception structs with PyO3's create_exception! macro to ensure proper Python inheritance. EncryptionError now inherits from PdftractError, enabling isinstance(e, PdftractError) to return True for all exception types. Changes: Use create_exception! macro for all 8 exception types; Update map_error_to_py to set attributes via PyErr::value(py).setattr(); Register exceptions with py.get_type::<T>() in module init; Add unit tests for hierarchy and attributes. Closes: pdftract-4ewgr	2026-05-26 21:17:38 -04:00
jedarden	9abc386cce	feat(pdftract-3h9xo): implement threads JSON output + schema integration. Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration. Changes: Added ThreadJson and BeadJson structs to schema/mod.rs; Added thread_to_json() function to threads/mod.rs; Added build_page_ref_to_index() helper to parser/pages.rs; Added threads field to ExtractionResult in extract.rs; Implemented Phase 7.7 extraction logic with discover_threads/walk_beads; Added threads_to_markdown() and collapse_page_ranges() to markdown.rs; Updated JSON schema with ThreadJson and BeadJson definitions; Added thread_to_py() and bead_to_py() conversions in pdftract-py; Exported ThreadJson, BeadJson from lib.rs. All 32 threads module tests pass. All 35 markdown tests pass. Verification: notes/pdftract-3h9xo.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 13:40:15 -04:00
jedarden	bf9a19f652	feat(pdftract-3j2u): implement 50 MB size limit + base64 encoding for attachments. Changes: Add attachments field to ExtractionResult struct; Implement extract_attachments helper function to walk /AF array; Add base64 encoding for attachment content in AttachmentBuilder::into_json; Update result_to_json to include attachments in output; Add PyO3 bindings for attachments with base64 data decoded to bytes; Export AttachmentJson from pdftract-core root; Add base64 dependency to pdftract-core and pdftract-py. Per plan 7.5.3: Attachments > 50 MB are truncated (metadata only, data: null, truncated: true); Base64 encoding uses RFC 4648 standard alphabet with padding; CLI --text mode excludes attachments (existing behavior maintained); JSON sink includes attachments array. Closes: pdftract-3j2u. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:42:28 -04:00
jedarden	fca8966f45	feat(pdftract-2nu0s): implement Python SDK contract conformance. Implements the Python SDK with all 9 contract methods, 8 exception classes, type definitions, asyncio wrappers, and subprocess fallback. Changes: Add Python wrapper module with extract, extract_text, extract_markdown, extract_stream, search, get_metadata, hash, classify, verify_receipt; Add exception hierarchy: PdftractError base class with 7 subclasses; Add dataclass type definitions: Document, Page, Span, Block, Match, Fingerprint, Classification, Metadata; Add asyncio module with async wrappers for 4 long-running methods; Add subprocess fallback for when native module fails to import; Add conformance test runner under tests/test_conformance.py; Update pyproject.toml with dynamic version from Cargo. Closes: pdftract-2nu0s	2026-05-24 08:55:11 -04:00
jedarden	9d662aec25	feat(pdftract-bnba5): implement PyO3 extract_stream entry point with StreamIterator. Add callback-based streaming API to pdftract-core and PyO3 bindings that return a Python iterator yielding page dicts incrementally. This provides memory-efficient extraction for large PDFs via the iterator protocol. Core changes: Add extract_pdf_streaming() callback-based function to pdftract-core; Export extract_pdf_streaming in lib.rs. PyO3 bindings: Add StreamIterator PyClass with __iter__/__next__ methods; Add extract_stream_fn() spawning background thread with mpsc channel; Add *Frame types for efficient Python dict serialization; Integrate into pdftract Python module. Closes: pdftract-bnba5	2026-05-24 07:35:03 -04:00
jedarden	58a177d3b4	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files. Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright notices. Configure all workspace and non-workspace crates to declare the license. Wire license files into Python wheels and Docker images. Files added: LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero"; LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org). Files modified: Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>"; crates/pdftract-py/pyproject.toml: Added license-files to maturin config; crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true; xtask/Cargo.toml: Added license = "MIT OR Apache-2.0"; fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0"; Cargo-dist.toml: Created to include license files in binary archives; notes/pdftract-aawrz.md: Verification note. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:36:28 -04:00
jedarden	c4ff5194dd	feat(pdftract-67tm8): implement MCP stdio transport with integration tests. Implements the stdio transport for the MCP server, enabling communication with local agents (Claude Desktop, Claude Code, Continue, Cursor) over standard input/output with Content-Length framing. Core features: LSP-style Content-Length framing with \r\n terminators; JSON-RPC 2.0 message parsing and serialization; INV-9 compliance: stdout contains only JSON-RPC frames; Panic hook redirects panics to stderr; SIGTERM handler for graceful shutdown; Parse errors return -32700 with id: null, then continue. Acceptance criteria: ✅ Piping tools/list with framing produces expected response < 50ms; ✅ EOF on stdin → clean exit within 100ms; ✅ Malformed JSON → -32700 error, subsequent requests work; ✅ No println!/log output to stdout (INV-9 enforced); ✅ Panics go to stderr, no partial JSON on stdout; ✅ SIGTERM → exit 0, SIGINT → immediate non-zero exit. Tests added: crates/pdftract-cli/tests/mcp-stdio.rs (8 integration tests, all pass); All 49 existing unit tests continue to pass. Refs: pdftract-67tm8, plan Phase 6.7.2	2026-05-23 00:16:42 -04:00
jedarden	e0b293c3d6	fix(pdftract-2a6rk): fix xref.rs u64 literal overflow in proptest. Fixed compilation error in xref.rs where u64 literal 0x5DEECE66D was used with u32 state, causing overflow. Changed state to u64 for proper Java Random algorithm behavior. The OCG /OCProperties parsing implementation was already complete and all tests pass. See notes/pdftract-2a6rk.md for verification. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 17:26:27 -04:00
jedarden	6bdc2b5278	docs(pdftract-2pyln): update verification note with bug fix details. Add details about the BytesSource cleanup bug fix and clarify that the contract defines 7 error kinds, not 8 as initially stated in the task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 19:09:49 -04:00
jedarden	9aa26a449e	docs(pdftract-49f8): establish Cargo.lock policy and documentation. This commit implements the Cargo.lock policy for reproducible builds across all workspace members (pdftract-core, pdftract-cli, pdftract-py). Changes: Add CONTRIBUTING.md with lockfile-update workflow documentation; Add .renovaterc.json for weekly lockfile-only PRs (human-gated); Add crates/pdftract-core/README.md with rationale for checked-in lockfiles; Add notes/pdftract-49f8.md with verification note. The Argo workflow updates (pdftract-ci.yaml) are committed separately in the declarative-config repo. Acceptance criteria: PASS: Cargo.lock tracked by git, not in .gitignore; PASS: Argo workflow templates document --locked/--frozen requirements; WARN: Enforcement to be completed when placeholder templates are implemented; WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 18:13:14 -04:00
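
Note on 7ffb1a729f: the buffer-sizing rule that commit describes can be illustrated with a standard-library-only sketch. This is not the repository's test code and does not call the real encrypt_padded_mut API; `padded_len` and `pad_in_place` are hypothetical helpers that only mimic the "buffer must already have room for the padding" contract.

```rust
/// AES block size in bytes.
const BLOCK: usize = 16;

/// Length after PKCS#7 padding: the input always grows by 1..=16 bytes,
/// so an exact multiple of the block size gains one full extra block.
fn padded_len(len: usize) -> usize {
    (len / BLOCK + 1) * BLOCK
}

/// Hypothetical stand-in for an in-place padding API. Writes PKCS#7 padding
/// after the first `msg_len` bytes of `buf` and returns the padded slice,
/// or `None` when `buf` is too small to hold the padded message.
fn pad_in_place(buf: &mut [u8], msg_len: usize) -> Option<&[u8]> {
    let total = padded_len(msg_len);
    if total > buf.len() {
        return None;
    }
    // Every padding byte holds the number of padding bytes added.
    let pad = (total - msg_len) as u8;
    buf[msg_len..total].fill(pad);
    Some(&buf[..total])
}

fn main() {
    let plaintext = b"sixteen byte msg"; // exactly one block

    // "Before" pattern: the buffer is only as long as the plaintext.
    let mut too_small = plaintext.to_vec();
    assert!(pad_in_place(&mut too_small, plaintext.len()).is_none());

    // "After" pattern: reserve one extra block, then copy the plaintext in.
    let mut buf = vec![0u8; plaintext.len() + 16];
    buf[..plaintext.len()].copy_from_slice(plaintext);
    let padded = pad_in_place(&mut buf, plaintext.len()).expect("buffer has room");
    println!("padded length: {}", padded.len());
}
```

Reserving `plaintext.len() + 16` bytes covers the worst case, where the plaintext is already block-aligned and padding adds a whole block.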

12 commits
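
Note on c4ff5194dd: the LSP-style Content-Length framing named in that commit can be sketched with the Rust standard library alone. This is a minimal illustration, not the code in crates/pdftract-cli; `write_frame` and `read_frame` are hypothetical names, and JSON parsing, error replies such as -32700, and signal handling are left out.

```rust
use std::io::{self, BufRead, BufReader, Write};

/// Write one frame: a Content-Length header, a blank line, then the body.
/// The length is counted in bytes, not characters.
fn write_frame<W: Write>(out: &mut W, body: &str) -> io::Result<()> {
    write!(out, "Content-Length: {}\r\n\r\n{}", body.len(), body)?;
    out.flush()
}

/// Read one frame. Returns Ok(None) if the input ends while headers are
/// still being read, which a caller can treat as a request to shut down.
fn read_frame<R: BufRead>(input: &mut R) -> io::Result<Option<String>> {
    let mut content_length: Option<usize> = None;
    let mut line = String::new();
    loop {
        line.clear();
        if input.read_line(&mut line)? == 0 {
            return Ok(None); // EOF
        }
        let header = line.trim_end_matches(&['\r', '\n'][..]);
        if header.is_empty() {
            break; // the blank line ends the header section
        }
        if let Some((name, value)) = header.split_once(':') {
            if name.eq_ignore_ascii_case("Content-Length") {
                content_length = value.trim().parse().ok();
            }
        }
    }
    let len = content_length
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "missing Content-Length"))?;
    // Read exactly `len` body bytes; the body has no terminator of its own.
    let mut body = vec![0u8; len];
    io::Read::read_exact(input, &mut body)?;
    String::from_utf8(body)
        .map(Some)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    let request = r#"{"jsonrpc":"2.0","id":1,"method":"tools/list"}"#;

    // Round-trip through an in-memory buffer instead of real stdin/stdout.
    let mut wire = Vec::new();
    write_frame(&mut wire, request)?;

    let mut reader = BufReader::new(&wire[..]);
    assert_eq!(read_frame(&mut reader)?.as_deref(), Some(request));
    assert!(read_frame(&mut reader)?.is_none()); // nothing left: EOF
    Ok(())
}
```

In a real stdio server the reader would wrap stdin and the writer stdout, with all logging sent to stderr so that stdout carries only frames.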