pdftract/scripts
jedarden d0f52751ce fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs
The indent trigger was using .abs() which fired on both increased indent
(non-indented → indented) AND decreased indent (indented → non-indented).
This caused drop-cap style paragraphs (indented first line, flush-left
continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, indent change should only trigger when the
current line is MORE indented (to the right, larger x0) than the block
average - i.e., a new paragraph starting after non-indented text. It should
NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
- test_indented_first_line_new_block: PASS (non-indented → indented splits)
- test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
- All 179 line module tests: PASS
2026-06-07 13:43:19 -04:00
ci docs(pdftract-5l9m): add CI validation script and verification note 2026-05-18 01:05:33 -04:00
add_rustdoc.py fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
analyze-docs.sh fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
analyze_doc_coverage.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
analyze_doc_coverage.sh feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing 2026-06-01 10:27:03 -04:00
audit_doc_coverage.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
check-provenance.sh fix(pdftract-5z5d8): fix provenance validation script 2026-05-17 23:43:37 -04:00
check-secrets.sh feat(pdftract-59zz): implement MCP bearer token ingress channels and TH-03 enforcement 2026-05-18 02:47:54 -04:00
check_doc_coverage.sh fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
check_lib_exports.py fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
count_doc_coverage.sh fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
count_rustdoc_coverage.rs fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
debug_stream_fixtures.py feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
doc_analysis.py feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing 2026-06-01 10:27:03 -04:00
doc_coverage.py fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
doc_coverage.rs feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
doc_coverage.sh fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
doc_coverage_check.py fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
doc_coverage_check.rs fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
doc_coverage_check.sh fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
doc_coverage_refined.sh fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
doc_coverage_v2.sh fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
doc_example_coverage.py fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
fetch-shape-corpus.sh feat(glyph-shape): implement font corpus fetch script and shape DB generation 2026-05-24 09:48:29 -04:00
find_pub_items_without_examples.sh fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
generate-minimal-pdf.sh feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement 2026-05-23 13:22:55 -04:00
generate_document_model_fixtures.sh fix(pdftract-2uk9z): wrap native module results in typed Python objects 2026-05-28 21:18:38 -04:00
generate_test_corpus.py test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
list_api_items.py fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
measure-doc-coverage.py fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
measure-doc-coverage.sh fix(bf-4mkhv): clean up unused imports in hash.rs 2026-06-01 09:43:48 -04:00
measure-public-api-coverage.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
measure_doc_coverage.py feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing 2026-06-01 10:27:03 -04:00
measure_doc_coverage.rs fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
measure_doc_coverage.sh feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing 2026-06-01 10:27:03 -04:00
measure_doc_coverage_v2.sh fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
README.md test(bf-5dnh1): add memory ceiling enforcement for proptests 2026-05-23 13:39:04 -04:00
run-fuzz-with-limits.sh feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement 2026-05-23 13:22:55 -04:00
run-proptest-with-limits.sh test(bf-5dnh1): add memory ceiling enforcement for proptests 2026-05-23 13:39:04 -04:00
rustdoc_coverage.py feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
rustdoc_coverage.rs feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing 2026-06-01 10:27:03 -04:00
rustdoc_coverage.sh wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00

Scripts

This directory contains utility scripts for pdftract development and testing.

Memory Ceiling Enforcement

Fuzz Tests (run-fuzz-with-limits.sh)

Runs cargo-fuzz targets with memory limits to ensure pathological inputs fail fast:

scripts/run-fuzz-with-limits.sh [target]

Memory limits:

  • Cgroup MemoryMax: 1536 MB (hard ceiling)
  • libFuzzer RSS limit: 1024 MB (per-execution)
  • libFuzzer malloc limit: 1024 MB (per single allocation)

Environment:

  • FUZZ_TIME_SECONDS: Time per target (default: 60)
  • MEMORY_MAX_MB: Cgroup limit in MB (default: 1536)
  • RSS_LIMIT_MB: libFuzzer RSS limit in MB (default: 1024)

Implementation: Uses cgroup v2 MemoryMax (preferred) or, as a fallback, cgroup v1 memory.limit_in_bytes with the OOM killer disabled, so that hitting the limit fails cleanly.
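
As a rough illustration of how the variables above could map onto a command line, here is a minimal Python sketch. It is not the script itself. It assumes the cgroup ceiling is applied through a transient systemd scope (systemd-run with the MemoryMax property), that the malloc limit reuses the RSS_LIMIT_MB value, and that a fuzz target named lexer exists; all three are assumptions made for the example. The libFuzzer options -rss_limit_mb, -malloc_limit_mb, and -max_total_time are standard.

```python
import os
import shlex


def build_fuzz_command(target, env=None):
    """Build a systemd-run + cargo-fuzz command line from the documented variables."""
    env = os.environ if env is None else env
    fuzz_seconds = int(env.get("FUZZ_TIME_SECONDS", "60"))
    memory_max_mb = int(env.get("MEMORY_MAX_MB", "1536"))
    rss_limit_mb = int(env.get("RSS_LIMIT_MB", "1024"))

    return [
        # Assumption: a transient systemd scope applies the cgroup ceiling.
        "systemd-run", "--user", "--scope", "--quiet",
        "-p", f"MemoryMax={memory_max_mb}M",
        "cargo", "fuzz", "run", target, "--",
        # libFuzzer options follow the "--" separator.
        f"-rss_limit_mb={rss_limit_mb}",
        # Assumption: the malloc limit reuses the RSS limit value.
        f"-malloc_limit_mb={rss_limit_mb}",
        f"-max_total_time={fuzz_seconds}",
    ]


if __name__ == "__main__":
    # "lexer" is a hypothetical target name used only for illustration.
    print(shlex.join(build_fuzz_command("lexer", env={})))
```

Keeping the cgroup ceiling (1536 MB) above the libFuzzer limits (1024 MB) leaves libFuzzer room to report the offending input itself before the hard ceiling is reached.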

Property Tests (run-proptest-with-limits.sh)

Runs proptest modules with memory limits:

scripts/run-proptest-with-limits.sh [test_name]

Memory limits:

  • Cgroup MemoryMax: 2048 MB (hard ceiling)

Environment:

  • PROPTEST_CASES: Test cases per module (default: 1000)
  • MEMORY_MAX_MB: Cgroup limit in MB (default: 2048)
  • PROPTEST_SEED: Proptest seed (default: random)

Proptest modules: lexer, object_parser, xref, stream, cmap_parser

Input size caps: All proptest strategies are bounded:

  • Lexer/object parser: up to 10 KB inputs
  • Xref/stream parsers: up to 100 KB inputs
  • Nested structures: depth-limited (e.g., 500 for parser depth checks)

These bounds ensure tests complete quickly while still exercising edge cases.
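
The idea of bounded strategies can be sketched independently of proptest. The Python below is an illustration only, not the Rust strategies used in the test modules; the function names are hypothetical, and the constants simply mirror the caps listed above.

```python
import random

MAX_LEXER_INPUT = 10 * 1024   # mirrors the "up to 10 KB" cap listed above
MAX_NESTING_DEPTH = 500       # mirrors the depth limit quoted above


def bounded_bytes(rng, max_len=MAX_LEXER_INPUT):
    """Return random bytes whose length never exceeds max_len."""
    length = rng.randint(0, max_len)
    return bytes(rng.getrandbits(8) for _ in range(length))


def nested_array(rng, max_depth=MAX_NESTING_DEPTH):
    """Return a PDF-style nested array such as b'[[[]]]' with bounded depth."""
    depth = rng.randint(1, max_depth)
    return b"[" * depth + b"]" * depth


def nesting_depth(data):
    """Measure the deepest bracket nesting reached in data."""
    depth = deepest = 0
    for byte in data:
        if byte == ord("["):
            depth += 1
            deepest = max(deepest, depth)
        elif byte == ord("]"):
            depth -= 1
    return deepest
```

Because both generators take an explicit upper bound, no generated case can exceed the cap regardless of the seed, which is what keeps worst-case memory use predictable.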

Why Memory Ceilings?

Per bf-1g1fd and the Quality Targets (plan.md Phase 0.4), adversarial inputs must not OOM the host. Memory ceilings enforce:

  1. Clean failure mode - Allocation errors instead of host OOM
  2. Fast failure - Pathological cases abort immediately at the limit
  3. Regressions as test failures - Memory growth is caught in CI

CI enforces these limits via cgroup MemoryMax in .ci/argo-workflows/pdftract-ci.yaml (proptests) and .ci/argo-workflows/pdftract-nightly-fuzz.yaml (fuzz).
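
To see what a clean failure mode looks like in miniature, the sketch below caps a child process and lets an oversized allocation fail inside it. Note the substitution: it uses a POSIX address-space rlimit (RLIMIT_AS) rather than a cgroup, because an rlimit needs no special privileges; the scripts and CI described here use cgroup MemoryMax. The helper name and the 2 GiB allocation are made up for the example, and RLIMIT_AS is only reliably enforced on Linux.

```python
import resource
import subprocess
import sys

# Child program: attempt a 2 GiB allocation and report what happened.
CHILD = (
    "try:\n"
    "    bytearray(2 * 1024**3)\n"
    "    print('allocated')\n"
    "except MemoryError:\n"
    "    print('MemoryError')\n"
)


def run_with_address_space_limit(limit_mb):
    """Run CHILD in a subprocess whose address space is capped at limit_mb."""
    def apply_limit():
        # Runs in the child between fork and exec.
        limit = limit_mb * 1024 * 1024
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard != resource.RLIM_INFINITY:
            limit = min(limit, hard)  # never request more than the hard limit
        resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        preexec_fn=apply_limit,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_with_address_space_limit(512))
```

With a 512 MB cap on Linux, the child is expected to report a MemoryError instead of consuming 2 GiB of host memory, which is the same property the cgroup ceilings are meant to guarantee at the process-group level.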

Other Scripts

generate-minimal-pdf.sh

Generates minimal valid PDF documents for testing.

check-provenance.sh

Verifies binary provenance and SBOM signatures.

check-secrets.sh

Scans for accidental secrets in committed code.

generate_test_corpus.py

Generates a synthetic PDF test corpus.