jedarden f85e5149dd feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00

Implement orchestration layer connecting HttpRangeSource to Phase 1.3 xref resolver and Phase 1.4 document model for remote PDF access:

- Document::open_remote() public API for remote PDF loading
- Progressive tail fetch (16 KB → 1 MB) for startxref location
- Xref forward-scan disabled for remote sources (via is_remote check)
- Page-by-page on-demand fetch via HttpRangeSource caching
- Resource lazy load through XrefResolver cache
- HEAD probe with 405 fallback, no Content-Length handling

Acceptance criteria:

✅ open_remote(url) returns Document with correct page count
✅ HEAD failure modes (405, no Content-Length, 401) handled
✅ xref forward-scan disabled for remote (is_remote check)
✅ Page-by-page on-demand fetch (HttpRangeSource LRU cache)
✅ INV-8 maintained (all errors return Result)

Files modified:

- crates/pdftract-core/src/document.rs (Document::open_remote, from_source)
- crates/pdftract-core/src/remote.rs (progressive tail fetch)
- crates/pdftract-core/src/lib.rs (re-exports)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ci	docs(pdftract-5l9m): add CI validation script and verification note	2026-05-18 01:05:33 -04:00
check-provenance.sh	fix(pdftract-5z5d8): fix provenance validation script	2026-05-17 23:43:37 -04:00
check-secrets.sh	feat(pdftract-59zz): implement MCP bearer token ingress channels and TH-03 enforcement	2026-05-18 02:47:54 -04:00
debug_stream_fixtures.py	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
doc_coverage.py	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
doc_coverage.rs	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
doc_coverage.sh	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
fetch-shape-corpus.sh	feat(glyph-shape): implement font corpus fetch script and shape DB generation	2026-05-24 09:48:29 -04:00
generate-minimal-pdf.sh	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement	2026-05-23 13:22:55 -04:00
generate_document_model_fixtures.sh	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00
generate_test_corpus.py	test(classifier): add 200-document labeled corpus for Phase 5.6	2026-05-17 07:16:02 -04:00
README.md	test(bf-5dnh1): add memory ceiling enforcement for proptests	2026-05-23 13:39:04 -04:00
run-fuzz-with-limits.sh	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement	2026-05-23 13:22:55 -04:00
run-proptest-with-limits.sh	test(bf-5dnh1): add memory ceiling enforcement for proptests	2026-05-23 13:39:04 -04:00
rustdoc_coverage.py	feat(pdftract-91e1i): HTTP fetch sequence implementation	2026-05-28 13:17:00 -04:00

README.md

Scripts

This directory contains utility scripts for pdftract development and testing.

Memory Ceiling Enforcement

Fuzz Tests (`run-fuzz-with-limits.sh`)

Runs cargo-fuzz targets with memory limits to ensure pathological inputs fail fast:

scripts/run-fuzz-with-limits.sh [target]

Memory limits:

Cgroup MemoryMax: 1536 MB (hard ceiling)
libFuzzer RSS limit: 1024 MB (per-execution)
libFuzzer malloc limit: 1024 MB (per single allocation)

Environment:

FUZZ_TIME_SECONDS: Time per target (default: 60)
MEMORY_MAX_MB: Cgroup limit in MB (default: 1536)
RSS_LIMIT_MB: libFuzzer RSS limit in MB (default: 1024)

Implementation: Uses cgroup v2 MemoryMax (preferred) or cgroup v1 memory.limit_in_bytes with OOM killer disabled for clean failure mode.
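
The limit handling described above can be sketched roughly as follows. This is an illustrative sketch, not the contents of run-fuzz-with-limits.sh: the function names and the MALLOC_LIMIT_MB variable are hypothetical, and the cgroup detection assumes the usual /sys/fs/cgroup mount point. The libFuzzer flags shown (-rss_limit_mb, -malloc_limit_mb, -max_total_time) are standard libFuzzer options, but whether the script passes exactly these is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real script may be organized differently.
set -euo pipefail

# Defaults mirror the values documented above; override via the environment.
MEMORY_MAX_MB="${MEMORY_MAX_MB:-1536}"
RSS_LIMIT_MB="${RSS_LIMIT_MB:-1024}"
FUZZ_TIME_SECONDS="${FUZZ_TIME_SECONDS:-60}"
# MALLOC_LIMIT_MB is an assumed name; the README documents only the 1024 MB value.
MALLOC_LIMIT_MB="${MALLOC_LIMIT_MB:-1024}"

# Convert megabytes to bytes, the unit used by the cgroup interface files.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Report which cgroup hierarchy is present under the given root.
# cgroup v2 exposes cgroup.controllers at the root of the unified hierarchy;
# cgroup v1 mounts a separate "memory" controller directory.
detect_cgroup_version() {
  local root="${1:-/sys/fs/cgroup}"
  if [ -f "$root/cgroup.controllers" ]; then
    echo v2
  elif [ -d "$root/memory" ]; then
    echo v1
  else
    echo none
  fi
}

# Build the libFuzzer flags that correspond to the documented limits.
libfuzzer_flags() {
  echo "-rss_limit_mb=${RSS_LIMIT_MB} -malloc_limit_mb=${MALLOC_LIMIT_MB} -max_total_time=${FUZZ_TIME_SECONDS}"
}

echo "cgroup: $(detect_cgroup_version)"
echo "limit bytes: $(mb_to_bytes "$MEMORY_MAX_MB")"
echo "libFuzzer flags: $(libfuzzer_flags)"
```

Keeping the cgroup ceiling (1536 MB) above the libFuzzer RSS limit (1024 MB) gives libFuzzer room to report the offending input itself before the kernel-level limit is reached.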

Property Tests (`run-proptest-with-limits.sh`)

Runs proptest modules with memory limits:

scripts/run-proptest-with-limits.sh [test_name]

Memory limits:

Cgroup MemoryMax: 2048 MB (hard ceiling)

Environment:

PROPTEST_CASES: Test cases per module (default: 1000)
MEMORY_MAX_MB: Cgroup limit in MB (default: 2048)
PROPTEST_SEED: Proptest seed (default: random)

Proptest modules: lexer, object_parser, xref, stream, cmap_parser
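
As a rough illustration of how the environment defaults and module names above fit together, the sketch below resolves the three variables and validates a module name. It is not taken from run-proptest-with-limits.sh: is_known_module is a hypothetical helper, and deriving the random seed from the clock is an assumption about what "random" means here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of default resolution; the real script may differ.

# Defaults mirror the values documented above.
PROPTEST_CASES="${PROPTEST_CASES:-1000}"
MEMORY_MAX_MB="${MEMORY_MAX_MB:-2048}"
# With no seed supplied, derive one from the clock (an assumed strategy)
# and print it below so a failing run can be repeated with the same seed.
PROPTEST_SEED="${PROPTEST_SEED:-$(date +%s)}"

# Return success only for the proptest modules listed in this README.
is_known_module() {
  case "$1" in
    lexer|object_parser|xref|stream|cmap_parser) return 0 ;;
    *) return 1 ;;
  esac
}

echo "cases=${PROPTEST_CASES} memory_max_mb=${MEMORY_MAX_MB} seed=${PROPTEST_SEED}"
```

Assuming the optional [test_name] argument accepts one of the module names, a reproducible run with a tighter ceiling might look like: PROPTEST_CASES=5000 PROPTEST_SEED=42 MEMORY_MAX_MB=1024 scripts/run-proptest-with-limits.sh xref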

Input size caps: All proptest strategies are bounded:

Lexer/object parser: up to 10 KB inputs
Xref/stream parsers: up to 100 KB inputs
Nested structures: depth-limited (e.g., 500 for parser depth checks)

These bounds ensure tests complete quickly while still exercising edge cases.

Why Memory Ceilings?

Per bf-1g1fd and the Quality Targets (plan.md Phase 0.4), adversarial inputs must not OOM the host. Memory ceilings provide:

Clean failure mode - Allocation errors instead of host OOM
Fast failure - Pathological cases abort immediately at the limit
Regressions as test failures - Memory growth is caught in CI

CI enforces these limits via cgroup MemoryMax in .ci/argo-workflows/pdftract-ci.yaml (proptests) and .ci/argo-workflows/pdftract-nightly-fuzz.yaml (fuzz).

Other Scripts

`generate-minimal-pdf.sh`

Generates minimal valid PDF documents for testing.

`check-provenance.sh`

Verifies binary provenance and SBOM signatures.

`check-secrets.sh`

Scans for accidental secrets in committed code.

`generate_test_corpus.py`

Generates synthetic PDF test corpus.
