- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()
This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add Homebrew formula template (homebrew-formula.rb.erb)
- Add vcpkg port template with submission instructions
- Add C conformance test (conformance.c) with thread safety verification
- Add simple link test (simple_test.c) to verify library linkage
- Add hash test (test_hash.c) for hash API verification
- Add parse debug test (test_parse.rs) for development
- Add test fixtures (test-minimal.pdf, valid-minimal.pdf)
- Add PROVENANCE.md entry for valid-minimal.pdf
All tests pass: version, abi_version, free(NULL), hash, extract methods.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Implement the libpdftract native FFI library as a cdylib + staticlib
with cbindgen-generated headers and full extern "C" API.
Components:
- crates/pdftract-libpdftract/ with cdylib + staticlib targets
- All 9 contract methods + utility functions as extern "C"
- cbindgen config and generated pdftract.h header
- pkg-config template (pdftract.pc.in)
- Homebrew formula template (distribution/homebrew/)
- vcpkg port template (distribution/vcpkg/)
- C conformance test (tests/conformance.c)
API features:
- Owned JSON strings returned via CString::into_raw()
- Caller frees with pdftract_free() (not libc free())
- Thread-local error storage (pdftract_last_error)
- Thread-safe and reentrant (no global mutable state)
- ABI version function for compatibility checking
Verification:
- cargo build produces libpdftract.so and libpdftract.a
- Conformance test compiles and runs successfully
- Thread safety verified with 4 concurrent threads
References:
- Plan line 3477: SDK Architecture / The Ten SDKs
- Bead: pdftract-1eaxm
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 12 synthetic malformed fixtures (generate_test_corpus.py output, tracked in
PROVENANCE.md) existed only as untracked files and were swept by a cleanup stash,
breaking the provenance pre-commit hook for all commits. Restore from stash and
commit them as tracked files so they cannot be lost again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Crypt filter was already implemented in the codebase. This note
documents the verification of acceptance criteria and test coverage.
Acceptance criteria verified:
- /Identity crypt passes through unchanged
- Custom crypt returns ENCRYPTION_UNSUPPORTED
- Missing /DecodeParms defaults to /Identity
- Works correctly with FlateDecode
- Comprehensive test coverage including proptests
- INV-8 maintained (no panics)
Also add missing malformed fixture entries to PROVENANCE.md.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Add pre-commit hook that runs check-provenance.sh before each commit
to ensure fixture files always have valid provenance entries. Update
PROVENANCE.md with validation section documenting the hook usage.
Acceptance criteria:
- PROVENANCE.md exists with one row per fixture file ✓
- Every fixture file enumerated; no orphans ✓
- License column populated; only approved licenses ✓
- SHA256 column populated; matches actual content ✓
- check-provenance.sh validates manifest; CI gate green ✓
- Synthetic fixtures point at generation scripts ✓
Refs: pdftract-5z5d8
Co-Authored-By: Claude Code <noreply@anthropic.com>
Fixed scripts/check-provenance.sh to properly validate PROVENANCE.md
against actual fixture files. The script was failing silently due to
subshell EXIT trap removing temp files before parent could read them,
and arithmetic expansion returning exit code 1 on zero value.
Changes:
- Replaced subshell pipes with process substitution
- Moved temp file cleanup to after reading
- Added validated variable initialization
- Added || true to prevent exit on zero arithmetic
All 200 classifier corpus fixtures have valid provenance entries
with matching SHA256 hashes. PROVENANCE.md already existed with
complete documentation.
Refs: pdftract-5z5d8
Co-Authored-By: Claude Code <noreply@anthropic.com>
- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
- 50 invoices with bill-to/ship-to, item tables, totals
- 50 scientific papers with abstracts, sections, references
- 50 contracts with clauses, legal terminology, signatures
- 50 misc documents (8 receipts, 8 forms, 7 bank statements,
7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)
- Add MANIFEST.tsv mapping each document to its expected type
with source URL and license (all MIT-0 synthetic data)
- Add scripts/generate_test_corpus.py to regenerate the corpus
using reportlab for PDF generation
- Add tests/test_classifier_corpus.rs with validation harness:
- test_corpus_manifest_validity: verifies manifest structure
and file existence (PASSES)
- test_classifier_corpus_accuracy: will validate precision/
recall/F1 when classifier is implemented (SKIP for now)
- test_classifier_reproducibility: will verify deterministic
classification (SKIP for now)
- Add tests/fixtures/classifier/README.md documenting corpus
structure, generation process, and acceptance criteria
Total corpus size: ~0.4 MB (each PDF < 5 KB)
Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>