Commit graph

12 commits

Author SHA1 Message Date
jedarden
7ffb1a729f fix(pdftract-63ka2): AES-128 test buffer allocation for PKCS#7 padding
The encrypt_padded_mut API requires the buffer to be large enough to
hold the padded ciphertext. The tests were using plaintext.to_vec() which
only allocated plaintext.len() bytes, insufficient for padding.

Changed pattern:
- Before: plaintext.to_vec() (insufficient space)
- After: vec![0u8; plaintext.len() + 16] with copy_from_slice

Also fixed incorrect usage: encrypt_padded_mut returns Result<(), Error>,
not a length. Use data_copy.len() directly for ciphertext length.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:30:33 -04:00
jedarden
870d7073f0 feat(pdftract-1tswa): implement GIL release with py.allow_threads on extraction entry points
This implements proper GIL release around all blocking extraction calls
so Python threads can run concurrently during PDF processing.

Changes:
- extract_py: Wrap extract_pdf call with py.allow_threads
- extract_stream: Release GIL during sleep between recv attempts
- Added Python multi-threading test to verify parallelism
- Added rlib to crate-type for unit test support

Acceptance criteria:
- PASS: GIL is released during extraction via py.allow_threads
- PASS: Multi-threading test added to Python test suite
- PASS: Code compiles and formatting verified

Closes: pdftract-1tswa

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 21:23:00 -04:00
jedarden
728c923237 feat(pdftract-4ewgr): implement Python exception hierarchy with proper inheritance
Replace custom exception structs with PyO3's create_exception! macro to ensure
proper Python inheritance. EncryptionError now inherits from PdftractError,
enabling isinstance(e, PdftractError) to return True for all exception types.

Changes:
- Use create_exception! macro for all 8 exception types
- Update map_error_to_py to set attributes via PyErr::value(py).setattr()
- Register exceptions with py.get_type::<T>() in module init
- Add unit tests for hierarchy and attributes

Closes: pdftract-4ewgr
2026-05-26 21:17:38 -04:00
jedarden
9abc386cce feat(pdftract-3h9xo): implement threads JSON output + schema integration
Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration.

Changes:
- Added ThreadJson and BeadJson structs to schema/mod.rs
- Added thread_to_json() function to threads/mod.rs
- Added build_page_ref_to_index() helper to parser/pages.rs
- Added threads field to ExtractionResult in extract.rs
- Implemented Phase 7.7 extraction logic with discover_threads/walk_beads
- Added threads_to_markdown() and collapse_page_ranges() to markdown.rs
- Updated JSON schema with ThreadJson and BeadJson definitions
- Added thread_to_py() and bead_to_py() conversions in pdftract-py
- Exported ThreadJson, BeadJson from lib.rs

All 32 threads module tests pass. All 35 markdown tests pass.

Verification: notes/pdftract-3h9xo.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 13:40:15 -04:00
jedarden
bf9a19f652 feat(pdftract-3j2u): implement 50 MB size limit + base64 encoding for attachments
- Add attachments field to ExtractionResult struct
- Implement extract_attachments helper function to walk /AF array
- Add base64 encoding for attachment content in AttachmentBuilder::into_json
- Update result_to_json to include attachments in output
- Add PyO3 bindings for attachments with base64 data decoded to bytes
- Export AttachmentJson from pdftract-core root
- Add base64 dependency to pdftract-core and pdftract-py

Per plan 7.5.3:
- Attachments > 50 MB are truncated (metadata only, data: null, truncated: true)
- Base64 encoding uses RFC 4648 standard alphabet with padding
- CLI --text mode excludes attachments (existing behavior maintained)
- JSON sink includes attachments array

Closes: pdftract-3j2u

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:42:28 -04:00
jedarden
fca8966f45 feat(pdftract-2nu0s): implement Python SDK contract conformance
Implements the Python SDK with all 9 contract methods, 8 exception
classes, type definitions, asyncio wrappers, and subprocess fallback.

Changes:
- Add Python wrapper module with extract, extract_text, extract_markdown,
  extract_stream, search, get_metadata, hash, classify, verify_receipt
- Add exception hierarchy: PdftractError base class with 7 subclasses
- Add dataclass type definitions: Document, Page, Span, Block, Match,
  Fingerprint, Classification, Metadata
- Add asyncio module with async wrappers for 4 long-running methods
- Add subprocess fallback for when native module fails to import
- Add conformance test runner under tests/test_conformance.py
- Update pyproject.toml with dynamic version from Cargo

Closes: pdftract-2nu0s
2026-05-24 08:55:11 -04:00
jedarden
9d662aec25 feat(pdftract-bnba5): implement PyO3 extract_stream entry point with StreamIterator
Add callback-based streaming API to pdftract-core and PyO3 bindings that
return a Python iterator yielding page dicts incrementally. This provides
memory-efficient extraction for large PDFs via the iterator protocol.

Core changes:
- Add extract_pdf_streaming() callback-based function to pdftract-core
- Export extract_pdf_streaming in lib.rs

PyO3 bindings:
- Add StreamIterator PyClass with __iter__/__next__ methods
- Add extract_stream_fn() spawning background thread with mpsc channel
- Add *Frame types for efficient Python dict serialization
- Integrate into pdftract Python module

Closes: pdftract-bnba5
2026-05-24 07:35:03 -04:00
jedarden
58a177d3b4 docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files
Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright
notices. Configure all workspace and non-workspace crates to declare the
license. Wire license files into Python wheels and Docker images.

Files added:
- LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero"
- LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org)

Files modified:
- Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>"
- crates/pdftract-py/pyproject.toml: Added license-files to maturin config
- crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true
- xtask/Cargo.toml: Added license = "MIT OR Apache-2.0"
- fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0"
- Cargo-dist.toml: Created to include license files in binary archives
- notes/pdftract-aawrz.md: Verification note

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:36:28 -04:00
jedarden
c4ff5194dd feat(pdftract-67tm8): implement MCP stdio transport with integration tests
Implements the stdio transport for the MCP server, enabling communication
with local agents (Claude Desktop, Claude Code, Continue, Cursor) over
standard input/output with Content-Length framing.

Core features:
- LSP-style Content-Length framing with \r\n terminators
- JSON-RPC 2.0 message parsing and serialization
- INV-9 compliance: stdout contains only JSON-RPC frames
- Panic hook redirects panics to stderr
- SIGTERM handler for graceful shutdown
- Parse errors return -32700 with id: null, then continue

Acceptance criteria:
-  Piping tools/list with framing produces expected response < 50ms
-  EOF on stdin → clean exit within 100ms
-  Malformed JSON → -32700 error, subsequent requests work
-  No println!/log output to stdout (INV-9 enforced)
-  Panics go to stderr, no partial JSON on stdout
-  SIGTERM → exit 0, SIGINT → immediate non-zero exit

Tests added:
- crates/pdftract-cli/tests/mcp-stdio.rs (8 integration tests, all pass)
- All 49 existing unit tests continue to pass

Refs: pdftract-67tm8, plan Phase 6.7.2
2026-05-23 00:16:42 -04:00
jedarden
e0b293c3d6 fix(pdftract-2a6rk): fix xref.rs u64 literal overflow in proptest
Fixed compilation error in xref.rs where u64 literal 0x5DEECE66D was used
with u32 state, causing overflow. Changed state to u64 for proper Java
Random algorithm behavior.

The OCG /OCProperties parsing implementation was already complete and
all tests pass. See notes/pdftract-2a6rk.md for verification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 17:26:27 -04:00
jedarden
6bdc2b5278 docs(pdftract-2pyln): update verification note with bug fix details
Add details about the BytesSource cleanup bug fix and clarify that the
contract defines 7 error kinds, not 8 as initially stated in the task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 19:09:49 -04:00
jedarden
9aa26a449e docs(pdftract-49f8): establish Cargo.lock policy and documentation
This commit implements the Cargo.lock policy for reproducible builds
across all workspace members (pdftract-core, pdftract-cli, pdftract-py).

Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note

The Argo workflow updates (pdftract-ci.yaml) are committed separately
in the declarative-config repo.

Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 18:13:14 -04:00