
jedarden bb7146cffe fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00

The native PyO3 module returns raw dicts via pythonize, but the Python SDK API expects typed dataclass objects (Document, Page, Metadata, etc.) to be consistent with the subprocess fallback and test expectations.

Updated wrapper functions in __init__.py to convert native results:
- extract(): wraps dict in Document.from_dict()
- extract_stream(): wraps yielded page dicts in Page.from_dict()
- get_metadata(): wraps dict in Metadata()
- hash(): wraps string in Fingerprint.from_string()
- classify(): wraps dict in Classification()
- search(): wraps yielded match dicts in Match

The native PyO3 entry points (extract, extract_text, extract_stream) were already implemented with:
- extract: uses extract_pdf + pythonize for PyDict conversion
- extract_text: uses extract_text for plain String return
- extract_stream: uses extract_pdf_streaming with custom StreamIterator

All kwargs parsing with strict validation (unknown kwargs raise TypeError) was already in place.

Acceptance criteria:
- pdftract.extract() returns Document object with pages/metadata
- pdftract.extract_text() returns plain text string
- pdftract.extract_stream() yields Page objects
- Unknown kwarg raises TypeError
ci	docs(pdftract-5l9m): add CI validation script and verification note	2026-05-18 01:05:33 -04:00
check-provenance.sh	fix(pdftract-5z5d8): fix provenance validation script	2026-05-17 23:43:37 -04:00
check-secrets.sh	feat(pdftract-59zz): implement MCP bearer token ingress channels and TH-03 enforcement	2026-05-18 02:47:54 -04:00
check_doc_coverage.sh	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
debug_stream_fixtures.py	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
doc_coverage.py	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
doc_coverage.rs	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
doc_coverage.sh	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
fetch-shape-corpus.sh	feat(glyph-shape): implement font corpus fetch script and shape DB generation	2026-05-24 09:48:29 -04:00
generate-minimal-pdf.sh	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement	2026-05-23 13:22:55 -04:00
generate_document_model_fixtures.sh	fix(pdftract-2uk9z): wrap native module results in typed Python objects	2026-05-28 21:18:38 -04:00
generate_test_corpus.py	test(classifier): add 200-document labeled corpus for Phase 5.6	2026-05-17 07:16:02 -04:00
README.md	test(bf-5dnh1): add memory ceiling enforcement for proptests	2026-05-23 13:39:04 -04:00
run-fuzz-with-limits.sh	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement	2026-05-23 13:22:55 -04:00
run-proptest-with-limits.sh	test(bf-5dnh1): add memory ceiling enforcement for proptests	2026-05-23 13:39:04 -04:00
rustdoc_coverage.py	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00

README.md

Scripts

This directory contains utility scripts for pdftract development and testing.

Memory Ceiling Enforcement

Fuzz Tests (`run-fuzz-with-limits.sh`)

Runs cargo-fuzz targets with memory limits to ensure pathological inputs fail fast:

scripts/run-fuzz-with-limits.sh [target]

Memory limits:

Cgroup MemoryMax: 1536 MB (hard ceiling)
libFuzzer RSS limit: 1024 MB (per-execution)
libFuzzer malloc limit: 1024 MB (largest single allocation)

Environment:

FUZZ_TIME_SECONDS: Time per target in seconds (default: 60)
MEMORY_MAX_MB: Cgroup limit in MB (default: 1536)
RSS_LIMIT_MB: libFuzzer RSS limit in MB (default: 1024)
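
As a rough illustration of how these variables could map onto libFuzzer flags, here is a minimal Python sketch. It is not the contents of `run-fuzz-with-limits.sh`: the names `FuzzLimits`, `load_fuzz_limits`, and `libfuzzer_flags` are hypothetical, and the check that the RSS limit stays at or below the cgroup ceiling is an assumption about what a sensible configuration looks like. The flags `-max_total_time`, `-rss_limit_mb`, and `-malloc_limit_mb` are standard libFuzzer options.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional

# The README lists a fixed 1024 MB malloc limit with no override variable.
MALLOC_LIMIT_MB = 1024


@dataclass(frozen=True)
class FuzzLimits:
    """Hypothetical container for the documented settings and defaults."""
    fuzz_time_seconds: int = 60
    memory_max_mb: int = 1536
    rss_limit_mb: int = 1024


def load_fuzz_limits(env: Optional[Mapping[str, str]] = None) -> FuzzLimits:
    """Read the three documented variables, falling back to the defaults."""
    source = os.environ if env is None else env
    limits = FuzzLimits(
        fuzz_time_seconds=int(source.get("FUZZ_TIME_SECONDS", 60)),
        memory_max_mb=int(source.get("MEMORY_MAX_MB", 1536)),
        rss_limit_mb=int(source.get("RSS_LIMIT_MB", 1024)),
    )
    # Assumption: libFuzzer should report the failure before the cgroup
    # ceiling is reached, so the RSS limit must not exceed it.
    if limits.rss_limit_mb > limits.memory_max_mb:
        raise ValueError("RSS_LIMIT_MB must not exceed MEMORY_MAX_MB")
    return limits


def libfuzzer_flags(limits: FuzzLimits) -> List[str]:
    """Translate the settings into libFuzzer command-line flags."""
    return [
        f"-max_total_time={limits.fuzz_time_seconds}",
        f"-rss_limit_mb={limits.rss_limit_mb}",
        f"-malloc_limit_mb={MALLOC_LIMIT_MB}",
    ]
```

With cargo-fuzz, flags like these are typically passed after a `--` separator, for example `cargo fuzz run <target> -- -rss_limit_mb=1024`.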

Implementation: Uses cgroup v2 MemoryMax (preferred) or cgroup v1 memory.limit_in_bytes, with the OOM killer disabled for a clean failure mode.
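
The v2-first, v1-fallback choice can be sketched as follows. This is an assumption-laden illustration, not the script's actual logic: `detect_cgroup_version` and `limit_setting` are hypothetical helpers. It relies on two kernel conventions: a cgroup v2 unified hierarchy exposes `cgroup.controllers` at its root, and the v2 interface file behind systemd's `MemoryMax` property is `memory.max`, while v1 uses `memory.limit_in_bytes` under the `memory` controller.

```python
from pathlib import Path
from typing import Tuple


def detect_cgroup_version(root: Path = Path("/sys/fs/cgroup")) -> str:
    """Return "v2", "v1", or "none" for the hierarchy mounted at root."""
    # cgroup v2 (unified hierarchy) has cgroup.controllers at the mount root.
    if (root / "cgroup.controllers").is_file():
        return "v2"
    # cgroup v1 mounts each controller separately, e.g. <root>/memory.
    if (root / "memory").is_dir():
        return "v1"
    return "none"


def limit_setting(version: str, memory_max_mb: int) -> Tuple[str, str]:
    """Return (interface file name, value in bytes) for a child cgroup."""
    value = str(memory_max_mb * 1024 * 1024)
    if version == "v2":
        return ("memory.max", value)
    if version == "v1":
        return ("memory.limit_in_bytes", value)
    raise ValueError(f"no memory cgroup available: {version!r}")
```

Preferring v2 first matters on hybrid systems where both hierarchies are mounted, since the memory controller can only be active in one of them.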

Property Tests (`run-proptest-with-limits.sh`)

Runs proptest modules with memory limits:

scripts/run-proptest-with-limits.sh [test_name]

Memory limits:

Cgroup MemoryMax: 2048 MB (hard ceiling)

Environment:

PROPTEST_CASES: Test cases per module (default: 1000)
MEMORY_MAX_MB: Cgroup limit in MB (default: 2048)
PROPTEST_SEED: Proptest seed (default: random)

Proptest modules: lexer, object_parser, xref, stream, cmap_parser
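
A minimal sketch of how the optional `[test_name]` argument and the environment defaults might be resolved. It assumes that omitting the argument runs every listed module and that `test_name` must be one of the module names; the script itself may behave differently. `select_modules` and `proptest_env` are hypothetical names.

```python
import os
from typing import Dict, List, Mapping, Optional

# Module names as listed in this README.
PROPTEST_MODULES = ("lexer", "object_parser", "xref", "stream", "cmap_parser")


def select_modules(test_name: Optional[str] = None) -> List[str]:
    """Return the modules to run: all of them, or the single one requested."""
    if test_name is None:
        return list(PROPTEST_MODULES)
    if test_name not in PROPTEST_MODULES:
        raise ValueError(f"unknown proptest module: {test_name}")
    return [test_name]


def proptest_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Apply the documented defaults to the documented variables."""
    source = os.environ if env is None else env
    settings = {
        "PROPTEST_CASES": source.get("PROPTEST_CASES", "1000"),
        "MEMORY_MAX_MB": source.get("MEMORY_MAX_MB", "2048"),
    }
    # The seed default is "random", so it is only forwarded when set.
    if source.get("PROPTEST_SEED"):
        settings["PROPTEST_SEED"] = source["PROPTEST_SEED"]
    return settings
```

Leaving the seed unset by default keeps runs randomized, while setting `PROPTEST_SEED` explicitly lets a failing run be repeated.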

Input size caps: All proptest strategies are bounded:

Lexer/object parser: up to 10 KB inputs
Xref/stream parsers: up to 100 KB inputs
Nested structures: depth-limited (e.g., 500 for parser depth checks)

These bounds ensure tests complete quickly while still exercising edge cases.
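
To make the idea of bounded strategies concrete, here is a small Python analogue using only the standard library. It is not the project's proptest code (which is Rust); the constants mirror the caps listed above, and `bounded_bytes`, `nested_arrays`, and `nesting_depth` are hypothetical helpers. The nested input uses PDF array brackets as an illustrative choice.

```python
import random

MAX_LEXER_INPUT = 10 * 1024    # 10 KB cap for lexer/object parser inputs
MAX_XREF_INPUT = 100 * 1024    # 100 KB cap for xref/stream parser inputs
MAX_DEPTH = 500                # nesting cap for parser depth checks


def bounded_bytes(rng: random.Random, max_len: int) -> bytes:
    """Generate random bytes whose length never exceeds max_len."""
    length = rng.randint(0, max_len)
    return bytes(rng.getrandbits(8) for _ in range(length))


def nested_arrays(rng: random.Random, max_depth: int) -> bytes:
    """Generate balanced, nested PDF-style arrays up to max_depth deep."""
    depth = rng.randint(0, max_depth)
    return b"[" * depth + b"]" * depth


def nesting_depth(data: bytes) -> int:
    """Return the deepest bracket nesting level found in data."""
    depth = deepest = 0
    for byte in data:
        if byte == ord("["):
            depth += 1
            deepest = max(deepest, depth)
        elif byte == ord("]"):
            depth -= 1
    return deepest
```

Because every generator takes an explicit upper bound, the worst-case input size is known before a test runs, which is what keeps memory use predictable under the cgroup ceiling.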

Why Memory Ceilings?

Per bf-1g1fd and the Quality Targets (plan.md Phase 0.4), adversarial inputs must not OOM the host. Memory ceilings enforce:

Clean failure mode - Allocation errors instead of host OOM
Fast failure - Pathological cases abort immediately at the limit
Regressions as test failures - Memory growth is caught in CI

CI enforces these limits via cgroup MemoryMax in .ci/argo-workflows/pdftract-ci.yaml (proptests) and .ci/argo-workflows/pdftract-nightly-fuzz.yaml (fuzz).

Other Scripts

`generate-minimal-pdf.sh`

Generates minimal valid PDF documents for testing.

`check-provenance.sh`

Verifies binary provenance and SBOM signatures.

`check-secrets.sh`

Scans for accidental secrets in committed code.

`generate_test_corpus.py`

Generates synthetic PDF test corpus.
