Add property-based testing infrastructure for the lexer module with 6+
property tests covering INV-8 (no panic), string/hex roundtrips, name
length bounds, and position monotonicity. Create 8 curated fixture files
with golden token outputs for critical edge cases including EC-01 empty
file test and whitespace-only inputs.
Changes:
- Add prop_string_roundtrip to tests/proptest/lexer.rs
- Create tests/lexer/fixtures/ with 8 fixtures + .tokens.txt golden files
- Add gen_lexer_golden.rs binary for regenerating golden outputs
- Fix missing ObjRef import in marked_content_operators.rs
Acceptance criteria:
- cargo test --features proptest -p pdftract-core: 105 lexer tests pass
- tests/lexer/fixtures/ contains 8 fixtures with .tokens.txt outputs
- EC-01 empty file test: 0-byte input -> Token::Eof, no panic
- Whitespace-only file test passes
- INV-8 verified by prop_lexer_never_panics
Closes: pdftract-sy8x
Implement step 5 (white-border padding: 10 px on all sides), wire all
preprocessing steps into the final preprocess(input, ImageSource) ->
GrayImage entry point, and curate fixtures for the three image-source
paths (PhysicalScan / DigitalOrigin / Jbig2).
Changes:
- Add add_border_padding() function: creates (width+20) x (height+20)
image with 10px white border on all sides
- Add preprocess() pipeline orchestrator: applies deskew, contrast
normalization, binarization, denoising, and padding in correct order
- Skip contrast, binarization, and denoising for JBIG2 images
- Generate test fixtures for skewed_2deg, uneven_lighting, clean_digital,
and jbig2_scan scenarios
- Add integration tests for all critical test scenarios
- Add A4-page benchmarks targeting < 500ms for physical/digital, < 200ms
for JBIG2
Refs:
- Plan section: Phase 5.3 step 5 (line 1878) + critical tests (lines 1882-1885)
- Bead: pdftract-27n3
- Note: notes/pdftract-27n3.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()
This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add Homebrew formula template (homebrew-formula.rb.erb)
- Add vcpkg port template with submission instructions
- Add C conformance test (conformance.c) with thread safety verification
- Add simple link test (simple_test.c) to verify library linkage
- Add hash test (test_hash.c) for hash API verification
- Add parse debug test (test_parse.rs) for development
- Add test fixtures (test-minimal.pdf, valid-minimal.pdf)
- Add PROVENANCE.md entry for valid-minimal.pdf
All tests pass: version, abi_version, free(NULL), hash, extract methods.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Implement the libpdftract native FFI library as a cdylib + staticlib
with cbindgen-generated headers and full extern "C" API.
Components:
- crates/pdftract-libpdftract/ with cdylib + staticlib targets
- All 9 contract methods + utility functions as extern "C"
- cbindgen config and generated pdftract.h header
- pkg-config template (pdftract.pc.in)
- Homebrew formula template (distribution/homebrew/)
- vcpkg port template (distribution/vcpkg/)
- C conformance test (tests/conformance.c)
API features:
- Owned JSON strings returned via CString::into_raw()
- Caller frees with pdftract_free() (not libc free())
- Thread-local error storage (pdftract_last_error)
- Thread-safe and reentrant (no global mutable state)
- ABI version function for compatibility checking
Verification:
- cargo build produces libpdftract.so and libpdftract.a
- Conformance test compiles and runs successfully
- Thread safety verified with 4 concurrent threads
References:
- Plan line 3477: SDK Architecture / The Ten SDKs
- Bead: pdftract-1eaxm
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Updated test_api_null.c to run 10,000 alloc/free cycles (was 100)
- Updated verification note to mark memory roundtrip as PASS
- Improved stream_next implementation to use reference-based approach
instead of Box::from_raw/leak dance for cleaner memory handling
All acceptance criteria for pdftract-5ya9x now PASS:
- 12 exported symbols verified via nm -D
- C client tests (test_api.c, test_api_null.c)
- C++ client test (test_extract.cpp)
- Null pointer safety
- Panic safety (catch_unwind on all entry points)
- Memory roundtrip (10,000 iterations)
- Thread safety (8 pthreads)
Co-Authored-By: Claude Code <noreply@anthropic.com>
Changes from Phase 6.7 child beads that were not committed earlier:
- Add subtle dependency for constant-time token comparison
- Add root directory for path-traversal protection in HTTP+SSE transport
- Update MCP server state to support --root flag
- Minor fixes and improvements across MCP modules
These changes support the 7 closed child beads:
- pdftract-5xq16: JSON-RPC 2.0 framing layer
- pdftract-67tm8: stdio transport
- pdftract-g0ro2: HTTP+SSE transport
- pdftract-24kut: transport mutual exclusion enforcement
- pdftract-1rami: tool catalog (10 tools)
- pdftract-6696g: path-traversal protection
- pdftract-zltqd: bearer-token auth
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixed compilation error in xref.rs where u64 literal 0x5DEECE66D was used
with u32 state, causing overflow. Changed state to u64 for proper Java
Random algorithm behavior.
The OCG /OCProperties parsing implementation was already complete and
all tests pass. See notes/pdftract-2a6rk.md for verification.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 12 synthetic malformed fixtures (generate_test_corpus.py output, tracked in
PROVENANCE.md) existed only as untracked files and were swept by a cleanup stash,
breaking the provenance pre-commit hook for all commits. Restore from stash and
commit them as tracked files so they cannot be lost again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add per-PR property tests and nightly fuzz job infrastructure:
CI Changes (declarative-config):
- pdftract-ci.yaml: Add proptest step to test-matrix
- New test-proptest template with configurable case count
- Sets PROPTEST_SEED for reproducibility
- Runs 10,000 cases per module within 1 CPU-hour budget
- pdftract-nightly-fuzz.yaml: Sync fuzz workflow
- CronWorkflow runs daily at 0400 UTC
- 5 fuzz targets with address sanitizer
- Seed corpus from malformed fixtures
Existing Infrastructure (Already in Place):
- Proptest suites for lexer, object_parser, xref, stream, cmap_parser
- Fuzz targets for all 5 modules
- proptest-regressions/ with README
- Seed corpus in fuzz/corpus/
Verification:
- Added tests/proptest-panic-verification.rs
- Proptest infrastructure correctly structured
- Will catch deliberate panics within budget
Closes: pdftract-33v
The Crypt filter was already implemented in the codebase. This note
documents the verification of acceptance criteria and test coverage.
Acceptance criteria verified:
- /Identity crypt passes through unchanged
- Custom crypt returns ENCRYPTION_UNSUPPORTED
- Missing /DecodeParms defaults to /Identity
- Works correctly with FlateDecode
- Comprehensive test coverage including proptests
- INV-8 maintained (no panics)
Also add missing malformed fixture entries to PROVENANCE.md.
Co-Authored-By: Claude Code <noreply@anthropic.com>
This commit implements the Cargo.lock policy for reproducible builds
across all workspace members (pdftract-core, pdftract-cli, pdftract-py).
Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note
The Argo workflow updates (pdftract-ci.yaml) are committed separately
in the declarative-config repo.
Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the conformance test runner pattern for all 10 SDKs as specified
in the plan (line 3547). Each SDK now has a dedicated conformance test runner.
Created:
- tests/sdk-conformance/report-schema.json: JSON schema for conformance reports
- docs/notes/sdk-conformance-runner.md: Pattern documentation and reference
- crates/pdftract-cli/tests/conformance.rs: Rust cargo test target
- tests/conformance/test_conformance.py: Python pytest harness
- tests/conformance/conformance.test.ts: Node.js vitest runner
- tests/conformance/conformance_test.go: Go go test runner
- tests/conformance/ConformanceTest.java: Java JUnit 5 runner
- tests/conformance/ConformanceTests.cs: .NET xUnit runner
- tests/conformance/conformance.c: C standalone binary
- tests/conformance/conformance_test.rb: Ruby minitest runner
- tests/conformance/ConformanceTest.php: PHP PHPUnit runner
- tests/conformance/ConformanceTests.swift: Swift XCTest runner
All runners implement:
- Loading of tests/sdk-conformance/cases.json
- Execution of test cases with language-native method invocations
- Comparison of results against expected values with numeric tolerances
- Emission of machine-readable conformance-report.json
- Non-zero exit on failures/errors for CI gating
Acceptance criteria:
- PASS: All 10 SDKs have language-specific runners
- PASS: Runners consume shared cases.json
- PASS: Runners emit JSON reports matching schema
- PASS: Runners exit non-zero on failure
- WARN: README integration pending SDK repo creation
- WARN: Stub implementations return placeholder results
References:
- Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner"
- Plan line 3589: "Conformance suite results published as Argo artifact"
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5omc
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.
- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
* Full test suite loader and executor
* Comparison engine with min/max, string constraints, tolerances
* Skip logic for unsupported features and schema versions
* Report generation in JSON format
- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
* pdftract compare - Compare actual vs expected with tolerances
* Cross-language comparison tool to avoid reimplementations
- Documentation (docs/conformance/sdk-contract.md)
* Complete pattern specification with pseudocode
* Per-language runner locations
* CI integration requirements
- Python reference stub (tests/python-conformance/test_conformance.py)
* Full pytest-based implementation following the pattern
Closes: pdftract-5omc
Add tests/sdk-conformance/ containing the shared, language-neutral test
specification for all pdftract SDKs. The suite includes 32 cases covering
all 9 contract methods (extract, extract_text, extract_markdown,
extract_stream, search, get_metadata, hash, classify, verify_receipt)
across vector, scanned, encrypted, fillable-form, mixed, large, broken,
and remote PDFs.
- cases.json: 32 test cases with id, fixture, method, options, expected,
tolerances, feature tags, and min_schema_version
- schema.json: JSON Schema v7 draft for validating test case structure
- validate_suite.py: Validation script that checks structure and fixture
existence
- fixtures/: Test PDFs organized by category (symlinks to classifier
fixtures for shared files)
See notes/pdftract-1527.md for verification details.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add pre-commit hook that runs check-provenance.sh before each commit
to ensure fixture files always have valid provenance entries. Update
PROVENANCE.md with validation section documenting the hook usage.
Acceptance criteria:
- PROVENANCE.md exists with one row per fixture file ✓
- Every fixture file enumerated; no orphans ✓
- License column populated; only approved licenses ✓
- SHA256 column populated; matches actual content ✓
- check-provenance.sh validates manifest; CI gate green ✓
- Synthetic fixtures point at generation scripts ✓
Refs: pdftract-5z5d8
Co-Authored-By: Claude Code <noreply@anthropic.com>
Fixed scripts/check-provenance.sh to properly validate PROVENANCE.md
against actual fixture files. The script was failing silently due to
subshell EXIT trap removing temp files before parent could read them,
and arithmetic expansion returning exit code 1 on zero value.
Changes:
- Replaced subshell pipes with process substitution
- Moved temp file cleanup to after reading
- Added validated variable initialization
- Added || true to prevent exit on zero arithmetic
All 200 classifier corpus fixtures have valid provenance entries
with matching SHA256 hashes. PROVENANCE.md already existed with
complete documentation.
Refs: pdftract-5z5d8
Co-Authored-By: Claude Code <noreply@anthropic.com>
- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
- 50 invoices with bill-to/ship-to, item tables, totals
- 50 scientific papers with abstracts, sections, references
- 50 contracts with clauses, legal terminology, signatures
- 50 misc documents (8 receipts, 8 forms, 7 bank statements,
7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)
- Add MANIFEST.tsv mapping each document to its expected type
with source URL and license (all MIT-0 synthetic data)
- Add scripts/generate_test_corpus.py to regenerate the corpus
using reportlab for PDF generation
- Add tests/test_classifier_corpus.rs with validation harness:
- test_corpus_manifest_validity: verifies manifest structure
and file existence (PASSES)
- test_classifier_corpus_accuracy: will validate precision/
recall/F1 when classifier is implemented (SKIP for now)
- test_classifier_reproducibility: will verify deterministic
classification (SKIP for now)
- Add tests/fixtures/classifier/README.md documenting corpus
structure, generation process, and acceptance criteria
Total corpus size: ~0.4 MB (each PDF < 5 KB)
Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>