pdftract/scripts
jedarden 1c6f26ecaa fix(bf-4mkhv): clean up unused imports in hash.rs
The bead description mentioned compile errors in hash.rs from API drift,
but those errors were either already fixed or misattributed. The API usage
was already correct:
- compute_fingerprint already takes 3 arguments with source
- len() already propagates Result with ?
- read_at method already used correctly
- Catalog fields accessed via trailer correctly

Only cleanup: removed unused std::fs::File and std::io imports.

Verification: notes/bf-4mkhv.md
2026-06-01 09:43:48 -04:00
ci docs(pdftract-5l9m): add CI validation script and verification note 2026-05-18 01:05:33 -04:00
analyze-docs.sh fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
analyze_doc_coverage.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
audit_doc_coverage.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
check-provenance.sh fix(pdftract-5z5d8): fix provenance validation script 2026-05-17 23:43:37 -04:00
check-secrets.sh feat(pdftract-59zz): implement MCP bearer token ingress channels and TH-03 enforcement 2026-05-18 02:47:54 -04:00
check_doc_coverage.sh fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
count_doc_coverage.sh fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
count_rustdoc_coverage.rs fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
debug_stream_fixtures.py feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
doc_coverage.py fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
doc_coverage.rs feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
doc_coverage.sh fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
fetch-shape-corpus.sh feat(glyph-shape): implement font corpus fetch script and shape DB generation 2026-05-24 09:48:29 -04:00
generate-minimal-pdf.sh feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement 2026-05-23 13:22:55 -04:00
generate_document_model_fixtures.sh fix(pdftract-2uk9z): wrap native module results in typed Python objects 2026-05-28 21:18:38 -04:00
generate_test_corpus.py test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
measure-doc-coverage.sh fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
measure-public-api-coverage.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
README.md test(bf-5dnh1): add memory ceiling enforcement for proptests 2026-05-23 13:39:04 -04:00
run-fuzz-with-limits.sh feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement 2026-05-23 13:22:55 -04:00
run-proptest-with-limits.sh test(bf-5dnh1): add memory ceiling enforcement for proptests 2026-05-23 13:39:04 -04:00
rustdoc_coverage.py feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
rustdoc_coverage.sh wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00

Scripts

This directory contains utility scripts for pdftract development and testing.

Memory Ceiling Enforcement

Fuzz Tests (run-fuzz-with-limits.sh)

Runs cargo-fuzz targets with memory limits to ensure pathological inputs fail fast:

scripts/run-fuzz-with-limits.sh [target]

Memory limits:

  • Cgroup MemoryMax: 1536 MB (hard ceiling)
  • libFuzzer RSS limit: 1024 MB (per-execution)
  • libFuzzer malloc limit: 1024 MB (per single allocation)

Environment:

  • FUZZ_TIME_SECONDS: Time per target (default: 60)
  • MEMORY_MAX_MB: Cgroup limit in MB (default: 1536)
  • RSS_LIMIT_MB: libFuzzer RSS limit in MB (default: 1024)
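
As a rough illustration of how these variables could map onto a fuzz run, the sketch below resolves the documented defaults and builds a cargo-fuzz command line. It is not the script's actual code: the function names, the `example_target` name, and the exact flag layout are assumptions. The libFuzzer flags `-max_total_time`, `-rss_limit_mb`, and `-malloc_limit_mb` are real and go after the `--` separator.

```python
import os

# Defaults as documented above; everything else in this sketch is hypothetical.
DEFAULTS = {"FUZZ_TIME_SECONDS": 60, "MEMORY_MAX_MB": 1536, "RSS_LIMIT_MB": 1024}
MALLOC_LIMIT_MB = 1024  # documented malloc limit; no override variable is listed


def fuzz_settings(env=None):
    """Return effective settings, letting environment values override defaults."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}


def fuzz_command(target, settings):
    """Build a cargo-fuzz invocation; arguments after `--` go to libFuzzer."""
    return [
        "cargo", "fuzz", "run", target, "--",
        f"-max_total_time={settings['FUZZ_TIME_SECONDS']}",
        f"-rss_limit_mb={settings['RSS_LIMIT_MB']}",
        f"-malloc_limit_mb={MALLOC_LIMIT_MB}",
    ]


if __name__ == "__main__":
    # "example_target" is a placeholder, not a real pdftract fuzz target.
    print(" ".join(fuzz_command("example_target", fuzz_settings())))
```

MEMORY_MAX_MB is resolved here but not passed to libFuzzer, since the cgroup ceiling is applied outside the fuzzer process.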

Implementation: Uses cgroup v2 MemoryMax when available (preferred), otherwise cgroup v1 memory.limit_in_bytes with the OOM killer disabled, so that the fuzzer fails cleanly at the limit.
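
The selection between cgroup versions might look something like the following sketch. It assumes the conventional mount point `/sys/fs/cgroup`, detects v2 by the presence of `cgroup.controllers`, and wraps the command with `systemd-run --scope -p MemoryMax=...`. Whether the script uses systemd-run or writes to the cgroup files directly is not stated here, so treat both helper names and the approach as hypothetical.

```python
from pathlib import Path


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 for a unified (v2) hierarchy, 1 for a v1 memory controller, else None."""
    base = Path(root)
    if (base / "cgroup.controllers").is_file():
        return 2
    if (base / "memory" / "memory.limit_in_bytes").is_file():
        return 1
    return None


def wrap_with_memory_max(command, memory_max_mb):
    """Prefix a command so it runs in a transient systemd scope with a memory cap."""
    prefix = ["systemd-run", "--user", "--scope", "-p", f"MemoryMax={memory_max_mb}M"]
    return prefix + list(command)


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
    print(" ".join(wrap_with_memory_max(["cargo", "fuzz", "list"], 1536)))
```

Taking the root path as a parameter keeps the detection testable against a temporary directory instead of the live `/sys` tree.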

Property Tests (run-proptest-with-limits.sh)

Runs proptest modules with memory limits:

scripts/run-proptest-with-limits.sh [test_name]

Memory limits:

  • Cgroup MemoryMax: 2048 MB (hard ceiling)

Environment:

  • PROPTEST_CASES: Test cases per module (default: 1000)
  • MEMORY_MAX_MB: Cgroup limit in MB (default: 2048)
  • PROPTEST_SEED: Proptest seed (default: random)

Proptest modules: lexer, object_parser, xref, stream, cmap_parser

Input size caps: All proptest strategies are bounded:

  • Lexer/object parser: up to 10 KB inputs
  • Xref/stream parsers: up to 100 KB inputs
  • Nested structures: depth-limited (e.g., 500 for parser depth checks)

These bounds ensure tests complete quickly while still exercising edge cases.
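
To make the idea of bounded inputs concrete, here is a small stand-alone sketch of size-capped and depth-capped generation. It uses Python's standard `random` module only to illustrate the caps; the real strategies are proptest strategies in Rust, and the generator names and the choice of nested PDF arrays are assumptions.

```python
import random

MAX_LEXER_BYTES = 10 * 1024  # "up to 10 KB" cap for lexer/object parser inputs
MAX_DEPTH = 500              # example depth limit for nested structures


def bounded_bytes(rng, max_len=MAX_LEXER_BYTES):
    """Generate arbitrary bytes whose length never exceeds max_len."""
    length = rng.randint(0, max_len)
    return bytes(rng.getrandbits(8) for _ in range(length))


def nested_arrays(rng, max_depth=MAX_DEPTH):
    """Generate balanced nested PDF array brackets, at most max_depth deep."""
    depth = rng.randint(0, max_depth)
    return b"[" * depth + b"]" * depth


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a failing case can be reproduced
    print(len(bounded_bytes(rng)), len(nested_arrays(rng)) // 2)
```

Capping both length and depth at generation time keeps the worst-case allocation per test case predictable, which is what lets the cgroup ceiling act as a regression signal and not as a routine failure.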

Why Memory Ceilings?

Per bf-1g1fd and the Quality Targets (plan.md Phase 0.4), adversarial inputs must not OOM the host. Memory ceilings enforce:

  1. Clean failure mode - Allocation errors instead of host OOM
  2. Fast failure - Pathological cases abort immediately at the limit
  3. Regressions as test failures - Memory growth is caught in CI

CI enforces these limits via cgroup MemoryMax in .ci/argo-workflows/pdftract-ci.yaml (proptests) and .ci/argo-workflows/pdftract-nightly-fuzz.yaml (fuzz).

Other Scripts

generate-minimal-pdf.sh

Generates minimal valid PDF documents for testing.

check-provenance.sh

Verifies binary provenance and SBOM signatures.

check-secrets.sh

Scans for accidental secrets in committed code.

generate_test_corpus.py

Generates synthetic PDF test corpus.