The native PyO3 module returns raw dicts via pythonize, but the Python SDK API expects typed dataclass objects (Document, Page, Metadata, etc.) to be consistent with the subprocess fallback and test expectations.
Updated wrapper functions in __init__.py to convert native results:
- extract(): wraps dict in Document.from_dict()
- extract_stream(): wraps yielded page dicts in Page.from_dict()
- get_metadata(): wraps dict in Metadata()
- hash(): wraps string in Fingerprint.from_string()
- classify(): wraps dict in Classification()
- search(): wraps yielded match dicts in Match
The native PyO3 entry points (extract, extract_text, extract_stream) were already implemented with:
- extract: uses extract_pdf + pythonize for PyDict conversion
- extract_text: uses extract_text for plain String return
- extract_stream: uses extract_pdf_streaming with custom StreamIterator
All kwargs parsing with strict validation (unknown kwargs raise TypeError) was already in place.
Acceptance criteria:
- pdftract.extract() returns Document object with pages/metadata
- pdftract.extract_text() returns plain text string
- pdftract.extract_stream() yields Page objects
- Unknown kwarg raises TypeError
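The wrapping pattern described in this commit message can be sketched as follows. This is a minimal illustration, not the SDK's actual code: the field names (number, text, pages, metadata), the accepted "password" kwarg, and the stubbed _native_extract function are all hypothetical. Only the use of Document.from_dict(), Page.from_dict(), and a TypeError for unknown kwargs comes from the message itself.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int  # hypothetical field
    text: str    # hypothetical field

    @classmethod
    def from_dict(cls, raw):
        return cls(number=raw["number"], text=raw["text"])


@dataclass
class Document:
    pages: list     # list of Page objects
    metadata: dict  # kept as a plain dict in this sketch

    @classmethod
    def from_dict(cls, raw):
        pages = [Page.from_dict(p) for p in raw.get("pages", [])]
        return cls(pages=pages, metadata=dict(raw.get("metadata", {})))


def _native_extract(path, **kwargs):
    """Hypothetical stand-in for the PyO3 entry point, which returns a raw dict."""
    allowed = {"password"}  # assumed kwarg name, for illustration only
    unknown = set(kwargs) - allowed
    if unknown:
        # Strict validation: unknown kwargs raise TypeError.
        raise TypeError(f"unexpected keyword argument(s): {sorted(unknown)}")
    return {
        "pages": [{"number": 1, "text": "hello"}],
        "metadata": {"title": "stub"},
    }


def extract(path, **kwargs):
    """Wrapper: convert the native dict into a typed Document."""
    return Document.from_dict(_native_extract(path, **kwargs))
```

Callers then work with attributes such as doc.pages[0].text regardless of whether the native module or the subprocess fallback produced the result.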
Scripts
This directory contains utility scripts for pdftract development and testing.
Memory Ceiling Enforcement
Fuzz Tests (run-fuzz-with-limits.sh)
Runs cargo-fuzz targets with memory limits to ensure pathological inputs fail fast:
scripts/run-fuzz-with-limits.sh [target]
Memory limits:
- Cgroup MemoryMax: 1536 MB (hard ceiling)
- libFuzzer RSS limit: 1024 MB (per execution)
- libFuzzer malloc limit: 1024 MB (per single allocation)
Environment:
- FUZZ_TIME_SECONDS: Fuzzing time per target, in seconds (default: 60)
- MEMORY_MAX_MB: Cgroup limit in MB (default: 1536)
- RSS_LIMIT_MB: libFuzzer RSS limit in MB (default: 1024)
Implementation: Uses cgroup v2 MemoryMax (preferred) or cgroup v1 memory.limit_in_bytes with OOM killer disabled for clean failure mode.
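As a rough sketch of how the environment variables and limits above might combine into a single invocation, the Python below resolves the documented defaults and builds a command line. It assumes the cgroup ceiling is applied through systemd-run with the MemoryMax property, which the script may or may not do. The fuzz_command helper and the target name are hypothetical; the libFuzzer flags (-max_total_time, -rss_limit_mb, -malloc_limit_mb) are standard.

```python
import os

MALLOC_LIMIT_MB = 1024  # documented above; no environment override is listed


def fuzz_command(target, env=None):
    """Build a memory-limited cargo-fuzz command (illustrative only)."""
    env = os.environ if env is None else env
    fuzz_time = int(env.get("FUZZ_TIME_SECONDS", "60"))
    memory_max_mb = int(env.get("MEMORY_MAX_MB", "1536"))
    rss_limit_mb = int(env.get("RSS_LIMIT_MB", "1024"))
    return [
        # Assumption: systemd-run places the process in a scope with a hard ceiling.
        "systemd-run", "--user", "--scope", "--quiet",
        "-p", f"MemoryMax={memory_max_mb}M",
        # Everything after "--" is passed through to libFuzzer.
        "cargo", "fuzz", "run", target, "--",
        f"-max_total_time={fuzz_time}",
        f"-rss_limit_mb={rss_limit_mb}",
        f"-malloc_limit_mb={MALLOC_LIMIT_MB}",
    ]


if __name__ == "__main__":
    # "my_target" is a placeholder, not a real target name from this repository.
    print(" ".join(fuzz_command("my_target")))
```

Keeping the cgroup ceiling above the libFuzzer RSS limit, as the defaults do, gives libFuzzer a chance to report the offending input before the kernel steps in.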
Property Tests (run-proptest-with-limits.sh)
Runs proptest modules with memory limits:
scripts/run-proptest-with-limits.sh [test_name]
Memory limits:
- Cgroup MemoryMax: 2048 MB (hard ceiling)
Environment:
- PROPTEST_CASES: Test cases per module (default: 1000)
- MEMORY_MAX_MB: Cgroup limit in MB (default: 2048)
- PROPTEST_SEED: Proptest seed (default: random)
Proptest modules: lexer, object_parser, xref, stream, cmap_parser
Input size caps: All proptest strategies are bounded:
- Lexer/object parser: up to 10 KB inputs
- Xref/stream parsers: up to 100 KB inputs
- Nested structures: depth-limited (e.g., 500 for parser depth checks)
These bounds ensure tests complete quickly while still exercising edge cases.
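To make the idea of bounded strategies concrete, here is a small Python stand-in using only the standard library. The real strategies are written with proptest in Rust; the generator names below are hypothetical and only the size and depth caps are taken from the list above.

```python
import random

LEXER_MAX_BYTES = 10 * 1024   # lexer/object parser cap from the list above
XREF_MAX_BYTES = 100 * 1024   # xref/stream parser cap from the list above
MAX_DEPTH = 500               # nesting cap from the list above


def bounded_bytes(rng, max_len):
    """Return random bytes whose length is drawn from 0..max_len inclusive."""
    length = rng.randint(0, max_len)
    return bytes(rng.getrandbits(8) for _ in range(length))


def nested_array(rng, max_depth):
    """Return PDF-style nested array text such as '[[[]]]', at most max_depth deep."""
    depth = rng.randint(0, max_depth)
    return "[" * depth + "]" * depth


def nesting_depth(text):
    """Measure the deepest bracket nesting in text."""
    depth = deepest = 0
    for ch in text:
        if ch == "[":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":
            depth -= 1
    return deepest


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, analogous to pinning PROPTEST_SEED
    for _ in range(100):
        assert len(bounded_bytes(rng, LEXER_MAX_BYTES)) <= LEXER_MAX_BYTES
        assert nesting_depth(nested_array(rng, MAX_DEPTH)) <= MAX_DEPTH
```

Because every generator takes its cap as an argument, the worst-case input size is known up front, which is what lets a fixed memory ceiling be chosen for the whole run.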
Why Memory Ceilings?
Per bf-1g1fd and the Quality Targets (plan.md Phase 0.4), adversarial inputs must not OOM the host. Memory ceilings enforce:
- Clean failure mode - Allocation errors instead of host OOM
- Fast failure - Pathological cases abort immediately at the limit
- Regressions as test failures - Memory growth is caught in CI
CI enforces these limits via cgroup MemoryMax in .ci/argo-workflows/pdftract-ci.yaml (proptests) and .ci/argo-workflows/pdftract-nightly-fuzz.yaml (fuzz).
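The "clean failure mode" idea can be demonstrated locally without cgroups. The sketch below swaps in a different mechanism, a per-process address-space limit set with resource.setrlimit(RLIMIT_AS), so it is not what the scripts or CI do. It assumes a Linux host where RLIMIT_AS is enforced, and the run_with_ceiling helper is hypothetical.

```python
import subprocess
import sys

# Child program: cap its own address space, then attempt one large allocation.
CHILD = """
import resource, sys
limit = int(sys.argv[1]) * 1024 * 1024
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)  # never request more than the existing hard limit
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    bytearray(int(sys.argv[2]) * 1024 * 1024)
except MemoryError:
    print("MemoryError")
else:
    print("allocated")
"""


def run_with_ceiling(limit_mb, alloc_mb):
    """Run the child under limit_mb of address space and report what happened."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_mb), str(alloc_mb)],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return proc.stdout.strip()


if __name__ == "__main__":
    # Requesting 4 GiB under a 1 GiB ceiling should fail inside the child only.
    print(run_with_ceiling(limit_mb=1024, alloc_mb=4096))
```

The oversized allocation fails inside the child as an ordinary error and the parent keeps running, which is the property the ceilings are meant to guarantee for the host.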
Other Scripts
generate-minimal-pdf.sh
Generates minimal valid PDF documents for testing.
check-provenance.sh
Verifies binary provenance and SBOM signatures.
check-secrets.sh
Scans for accidental secrets in committed code.
generate_test_corpus.py
Generates synthetic PDF test corpus.