The indent trigger was using .abs(), which fired on both increased indent (non-indented → indented) AND decreased indent (indented → non-indented). This caused drop-cap style paragraphs (indented first line, flush-left continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, indent change should only trigger when the current line is MORE indented (to the right, larger x0) than the block average, i.e., a new paragraph starting after non-indented text. It should NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
- test_indented_first_line_new_block: PASS (non-indented → indented splits)
- test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
- All 179 line module tests: PASS
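The fix described in the commit message at the top can be illustrated with a minimal sketch. The function names and the f32 coordinate type are assumptions made for illustration; only line_x0, block_avg_x0, threshold, and the removal of .abs() come from the message itself.

```rust
/// Hypothetical sketch of the one-directional indent trigger.
/// Returns true only when the line sits further right than the block
/// average by more than `threshold` (a new indented paragraph).
fn indent_triggers_split(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    // No .abs(): a negative delta (line is further LEFT) never triggers.
    line_x0 - block_avg_x0 > threshold
}

/// The previous behaviour, kept here only for comparison:
/// fires on a large delta in either direction.
fn indent_triggers_split_abs(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    (line_x0 - block_avg_x0).abs() > threshold
}
```

With a block average of 90.0 (indented first line) and a continuation line at 72.0, the old check splits the block while the new one keeps it together. A line at 90.0 following a block averaging 72.0 still splits under both.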
Scripts
This directory contains utility scripts for pdftract development and testing.
Memory Ceiling Enforcement
Fuzz Tests (run-fuzz-with-limits.sh)
Runs cargo-fuzz targets with memory limits to ensure pathological inputs fail fast:
scripts/run-fuzz-with-limits.sh [target]
Memory limits:
- Cgroup MemoryMax: 1536 MB (hard ceiling)
- libFuzzer RSS limit: 1024 MB (per-execution)
- libFuzzer malloc limit: 1024 MB (per single allocation)
Environment:
- FUZZ_TIME_SECONDS: Time per target (default: 60)
- MEMORY_MAX_MB: Cgroup limit in MB (default: 1536)
- RSS_LIMIT_MB: libFuzzer RSS limit (default: 1024)
Implementation: Uses cgroup v2 MemoryMax (preferred) or cgroup v1 memory.limit_in_bytes with OOM killer disabled for clean failure mode.
Property Tests (run-proptest-with-limits.sh)
Runs proptest modules with memory limits:
scripts/run-proptest-with-limits.sh [test_name]
Memory limits:
- Cgroup MemoryMax: 2048 MB (hard ceiling)
Environment:
- PROPTEST_CASES: Test cases per module (default: 1000)
- MEMORY_MAX_MB: Cgroup limit in MB (default: 2048)
- PROPTEST_SEED: Proptest seed (default: random)
Proptest modules: lexer, object_parser, xref, stream, cmap_parser
Input size caps: All proptest strategies are bounded:
- Lexer/object parser: up to 10 KB inputs
- Xref/stream parsers: up to 100 KB inputs
- Nested structures: depth-limited (e.g., 500 for parser depth checks)
These bounds ensure tests complete quickly while still exercising edge cases.
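To make the size and depth caps concrete, here is a rough sketch of a bounded input generator and a depth-checked scanner. It uses only the standard library rather than proptest, and every name in it (MAX_DEPTH, MAX_INPUT_BYTES, BoundError, nested_arrays, max_nesting) is hypothetical; it is not taken from the pdftract test suite.

```rust
/// Hypothetical caps mirroring the figures quoted above.
const MAX_DEPTH: usize = 500;
const MAX_INPUT_BYTES: usize = 10 * 1024;

#[derive(Debug, PartialEq)]
enum BoundError {
    InputTooLarge,
    TooDeep,
    Unbalanced,
}

/// Builds `depth` opening brackets followed by `depth` closing brackets.
fn nested_arrays(depth: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(depth * 2);
    out.extend(std::iter::repeat(b'[').take(depth));
    out.extend(std::iter::repeat(b']').take(depth));
    out
}

/// Returns the deepest bracket nesting, or an error as soon as a cap is hit.
fn max_nesting(input: &[u8]) -> Result<usize, BoundError> {
    if input.len() > MAX_INPUT_BYTES {
        return Err(BoundError::InputTooLarge);
    }
    let mut depth = 0usize;
    let mut deepest = 0usize;
    for &byte in input {
        match byte {
            b'[' => {
                depth += 1;
                if depth > MAX_DEPTH {
                    // Fail fast instead of tracking unbounded nesting.
                    return Err(BoundError::TooDeep);
                }
                deepest = deepest.max(depth);
            }
            b']' => {
                depth = depth.checked_sub(1).ok_or(BoundError::Unbalanced)?;
            }
            _ => {}
        }
    }
    if depth != 0 {
        return Err(BoundError::Unbalanced);
    }
    Ok(deepest)
}
```

Calling max_nesting(&nested_arrays(500)) stays inside both caps, while a depth of 501 is rejected before the scanner finishes reading the input.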
Why Memory Ceilings?
Per bf-1g1fd and the Quality Targets (plan.md Phase 0.4), adversarial inputs must not OOM the host. Memory ceilings enforce:
- Clean failure mode - Allocation errors instead of host OOM
- Fast failure - Pathological cases abort immediately at the limit
- Regressions as test failures - Memory growth is caught in CI
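As an illustration of the "allocation errors instead of host OOM" idea, the sketch below uses the standard library's fallible reservation API. The helper name alloc_untrusted is hypothetical and this is not necessarily how pdftract allocates. Whether an allocation made under a cgroup limit actually returns an error, rather than the process being killed, depends on the allocator and the kernel's overcommit settings.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: allocate a zero-filled buffer whose length comes
/// from untrusted input, reporting failure instead of aborting.
fn alloc_untrusted(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf: Vec<u8> = Vec::new();
    // try_reserve_exact returns Err on capacity overflow or allocator failure.
    buf.try_reserve_exact(len)?;
    // Capacity is already reserved, so this resize does not reallocate.
    buf.resize(len, 0);
    Ok(buf)
}
```

A caller can then map the error into an ordinary parse failure, so a hostile length field shows up as a failed test case instead of a crashed runner.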
CI enforces these limits via cgroup MemoryMax in .ci/argo-workflows/pdftract-ci.yaml (proptests) and .ci/argo-workflows/pdftract-nightly-fuzz.yaml (fuzz).
Other Scripts
generate-minimal-pdf.sh
Generates minimal valid PDF documents for testing.
check-provenance.sh
Verifies binary provenance and SBOM signatures.
check-secrets.sh
Scans for accidental secrets in committed code.
generate_test_corpus.py
Generates synthetic PDF test corpus.