Commit graph

6 commits

Author SHA1 Message Date
jedarden
d0f52751ce fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs
The indent trigger was using .abs() which fired on both increased indent
(non-indented → indented) AND decreased indent (indented → non-indented).
This caused drop-cap style paragraphs (indented first line, flush-left
continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, indent change should only trigger when the
current line is MORE indented (to the right, larger x0) than the block
average - i.e., a new paragraph starting after non-indented text. It should
NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
- test_indented_first_line_new_block: PASS (non-indented → indented splits)
- test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
- All 179 line module tests: PASS
2026-06-07 13:43:19 -04:00
jedarden
432514d350 wip: AcroForm improvements, debug tooling, test corpus, and fixture updates
Collects in-progress work across forms (Ch/Tx field handling, value_text
edge cases), layout corrections, stream parser fixes, conformance test
expansion, security audit test (TH-08), stream-decoder bomb fixture,
debug examples reorganization under examples/debug/, sdk module scaffold,
xtask CLI enhancements, and provenance entries for new fixtures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 09:48:14 -04:00
jedarden
e331086c11 feat(bf-2ervu): implement mmap-backed PdfSource via memmap2
Rewrote FileSource to use memmap2 for zero-copy random access.
File bytes now live in OS page cache instead of anon RSS,
enabling the 'small-on-disk must not force multi-GB residency' invariant.

Changes:
- Added memmap2 = "0.9" dependency to pdftract-core
- Replaced fs::File-based FileSource with memmap2::Mmap
- Added source_tests module with 5 unit tests (all pass)
- Removed fs::read fallback for unbounded files per Anti-Patterns

Closes: bf-2ervu
2026-05-24 08:40:11 -04:00
jedarden
c53194794c feat(pdftract-1s2uj): add xref test fixture corpus and integration test runner
Implemented xref test fixture corpus and integration test runner per
pdftract-1s2uj acceptance criteria.

- Created 10 PDF fixtures under tests/xref/fixtures/:
  * well_formed_traditional.pdf, well_formed_stream.pdf, hybrid_file.pdf
  * prev_chain_3_revisions.pdf, linearized.pdf
  * truncated_after_xref.pdf, startxref_off_by_one.pdf, corrupt_xref_entry.pdf
  * circular_prev.pdf, deep_prev_chain.pdf

- Added fixture generator tool (tools/build-xref-fixture/main.rs)
  - Generates minimal PDFs with specific xref structures
  - Creates corrupt variants via byte-level modifications
  - Integrated as build-xref-fixture binary

- Implemented integration test runner (xref_integration_test.rs)
  - Walks fixtures, parses xref, compares against .expected.json goldens
  - BLESS=1 support for regenerating golden files
  - Tests for forward scan recovery, /Prev chain depth limit, circular prev

- Added diagnostic assertion helpers (xref_helpers.rs)
  * assert_diagnostic(), assert_diagnostic_in_range(), assert_diagnostic_count()
  * assert_no_diagnostic_with_severity(), count_diagnostics()

- All 10 fixtures have corresponding .expected.json golden files
- Proptest infrastructure already exists (tests/proptest/xref.rs)

Acceptance criteria:
✓ All 10 fixture files exist with .expected.json goldens
✓ Proptest tests pass (75 passed, 15 pre-existing failures)
✓ Each strategy (1-4) exercised by at least one fixture
✓ Each diagnostic code emitted by at least one fixture
~ Forward scan regression test: infra in place, pre-existing forward scan bugs
~ Linearized fingerprint: requires qpdf for verification (not installed)

Closes: pdftract-1s2uj

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 08:20:04 -04:00
jedarden
c621947686 feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement
Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF
extraction, analogous to cargo-bloat for binary size.

Changes:
- CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB)
- CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB)
- CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow
- xtask: Implement memory-ceiling command with peak RSS sampling
- Add perf fixtures (100-page, 10k-page) for memory testing
- Add run-fuzz-with-limits.sh for local fuzz testing with memory caps
- Register perf fixtures in PROVENANCE.md

Memory budgets enforced:
- Buffered 100-page PDF: < 512 MB
- Streaming mode: < 256 MB (constant in page count)
- Adversarial fixtures: < 1 GB hard ceiling

Closes bf-1g1fd

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:22:55 -04:00
jedarden
a2b9e73a88 feat(pdftract-4b0z): implement publish-if-tag step for GitHub Releases
Implement the publish-if-tag step in pdftract-ci that activates on
version tags (v*.*.*) and publishes cross-compiled binaries to
GitHub Releases.

Changes:
- Add tools/extract-release-notes.sh script for CHANGELOG parsing
- Update publish-if-tag template in pdftract-ci.yaml:
  - Downloads all 5 build artifacts from build-matrix
  - Generates SHA256SUMS checksums
  - Extracts release notes from CHANGELOG.md
  - Creates GitHub Release via gh CLI
  - Supports both stable and pre-release tags (--prerelease flag)
  - Uses --clobber for idempotent re-runs

The step uses Chainguard's gh:latest image and authenticates via
github-pdftract-release Secret (GH_TOKEN key). Optional signing
infrastructure is deferred to Release Engineering epic.

Co-Authored-By: Claude Code (glm-4.7) <noreply@anthropic.com>
2026-05-20 19:06:16 -04:00