jedarden 09428e76f3 feat(pdftract-5w6i): implement AcroForm field walker with recursive walk and dot-joined names
Implements Phase 7.4.1: AcroForm field walker (recursive /Fields + dot-joined names).

## Changes

- Create `crates/pdftract-core/src/forms/mod.rs` module with:
  - `AcroFieldType` enum (Tx, Btn, Ch, Sig, Other)
  - `AcroFormField` struct with full field metadata
  - `walk_acroform_fields()` public API function
  - Recursive DFS traversal with /FT, /V, /DV, /Ff inheritance
  - Widget annotation to page index resolution
  - Cycle detection via visited set
  - Name collision handling (keep last, emit diagnostic)
  - Choice field option extraction for Ch fields

- Update `lib.rs` to export forms module and types

## Implementation Details

- Entry point: `/Catalog /AcroForm /Fields` array
- Dot-joined names: Concatenate `/T` values with "." separator
- Inheritance: `/FT`, `/V`, `/DV`, `/Ff` from parent to child
- Page resolution: Search page `/Annots` arrays for widget annotations
- Cycle detection: `visited` HashSet prevents infinite loops on malformed PDFs
- Name collisions: Track emitted names, keep last on duplicate
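
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the code from `forms/mod.rs`: `FieldDict`, `Field`, `Walker`, and `walk_fields` are hypothetical stand-ins, objects are looked up by a plain integer id, and only `/FT` and `/Ff` inheritance is shown. It also assumes that an empty `/T` contributes no name segment and that any node without `/Kids` is a terminal field, which simplifies away the widget-only kids of real PDFs.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for a parsed field dictionary (not pdftract's real type).
#[derive(Default, Clone)]
struct FieldDict {
    t: Option<String>,  // /T partial name
    ft: Option<String>, // /FT field type, inheritable
    ff: Option<u32>,    // /Ff flags, inheritable
    kids: Vec<usize>,   // /Kids, as object ids
}

// Hypothetical output record for one terminal field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ft: Option<String>,
    ff: u32,
}

#[derive(Default)]
struct Walker {
    visited: HashSet<usize>,
    fields: Vec<Field>,
    index_by_name: HashMap<String, usize>,
    diagnostics: Vec<String>,
}

impl Walker {
    fn walk(
        &mut self,
        objects: &HashMap<usize, FieldDict>,
        id: usize,
        parent_name: &str,
        parent_ft: Option<&str>,
        parent_ff: Option<u32>,
    ) {
        // Cycle detection: an object id seen before is reported and skipped.
        if !self.visited.insert(id) {
            self.diagnostics.push(format!("object {id} already visited"));
            return;
        }
        let Some(dict) = objects.get(&id) else { return };

        // Dot-joined name. Assumption: a missing or empty /T adds no segment.
        let name = match (parent_name.is_empty(), dict.t.as_deref()) {
            (_, None) | (_, Some("")) => parent_name.to_string(),
            (true, Some(t)) => t.to_string(),
            (false, Some(t)) => format!("{parent_name}.{t}"),
        };

        // Inheritance: the child's own value wins, otherwise use the parent's.
        let ft = dict.ft.as_deref().or(parent_ft);
        let ff = dict.ff.or(parent_ff);

        if dict.kids.is_empty() {
            let field = Field { name: name.clone(), ft: ft.map(str::to_string), ff: ff.unwrap_or(0) };
            if let Some(&slot) = self.index_by_name.get(&name) {
                // Name collision: keep the last field and record a diagnostic.
                self.diagnostics.push(format!("duplicate field name {name}; keeping last"));
                self.fields[slot] = field;
            } else {
                self.index_by_name.insert(name, self.fields.len());
                self.fields.push(field);
            }
        } else {
            for &kid in &dict.kids {
                self.walk(objects, kid, &name, ft, ff);
            }
        }
    }
}

// Walks every root listed in /Fields; returns the fields and any diagnostics.
fn walk_fields(objects: &HashMap<usize, FieldDict>, roots: &[usize]) -> (Vec<Field>, Vec<String>) {
    let mut walker = Walker::default();
    for &root in roots {
        walker.walk(objects, root, "", None, None);
    }
    (walker.fields, walker.diagnostics)
}
```

Passing the inherited values down as arguments keeps each call self-contained, so no parent pointers are needed and a malformed `/Parent` entry cannot affect the result.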

## Tests

All 15 unit tests pass, covering:
- Flat 3 fields extraction
- Nested 2-level hierarchy with dot-joined names
- /FT inheritance from parent to child
- /FT override by child
- /Ff (flags) inheritance
- Empty /T segment handling
- Choice field /Opt array parsing
- All field types (Tx, Btn, Ch, Sig)
- Flag accessor methods (is_read_only, is_required, etc.)
- Button field is_checked() method
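
For context on the flag accessors, here is a hedged sketch of how `/Ff` bits could be decoded. `FieldFlags` and the free function `is_checked` are hypothetical names chosen for illustration, and the crate's actual accessors on `AcroFormField` may be shaped differently. The bit positions follow the PDF specification's field-flag tables, which number bits from 1. Treating any value other than `Off` as "checked" is an assumption about how the button test is implemented.

```rust
// Hypothetical wrapper around the raw /Ff integer (not pdftract's real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FieldFlags(u32);

impl FieldFlags {
    // PDF flag bit positions are 1-based: bit n maps to mask 1 << (n - 1).
    const fn bit(n: u32) -> u32 {
        1 << (n - 1)
    }

    const READ_ONLY: u32 = Self::bit(1); // common to all field types
    const REQUIRED: u32 = Self::bit(2); // common to all field types
    const NO_EXPORT: u32 = Self::bit(3); // common to all field types
    const RADIO: u32 = Self::bit(16); // button fields
    const PUSHBUTTON: u32 = Self::bit(17); // button fields

    fn is_read_only(self) -> bool {
        (self.0 & Self::READ_ONLY) != 0
    }

    fn is_required(self) -> bool {
        (self.0 & Self::REQUIRED) != 0
    }

    fn is_no_export(self) -> bool {
        (self.0 & Self::NO_EXPORT) != 0
    }

    fn is_radio(self) -> bool {
        (self.0 & Self::RADIO) != 0
    }

    fn is_pushbutton(self) -> bool {
        (self.0 & Self::PUSHBUTTON) != 0
    }
}

// Assumption: a checkbox counts as checked when its value is a name other than "Off".
fn is_checked(value: Option<&str>) -> bool {
    matches!(value, Some(v) if v != "Off")
}
```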

Closes: pdftract-5w6i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 05:31:51 -04:00
benches feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
build feat(pdftract-43ry): implement predefined CMap registry 2026-05-23 23:00:59 -04:00
examples feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
proptest-regressions/parser/lexer feat(pdftract-1jjn): implement PDF numeric literal lexer with full edge case support 2026-05-23 23:17:04 -04:00
src feat(pdftract-5w6i): implement AcroForm field walker with recursive walk and dot-joined names 2026-05-24 05:31:51 -04:00
tests feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
__test__.pdf feat(pdftract-15pz8): implement multi-process safe cache operations 2026-05-23 05:31:11 -04:00
build.rs feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
Cargo.toml feat(pdftract-kdp6): implement profile loader secret key hardening 2026-05-24 04:41:04 -04:00
pdftract-core.cdx.json feat(pdftract-67tm8): implement MCP stdio transport with integration tests 2026-05-23 00:16:42 -04:00
README.md docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00

pdftract-core

The core Rust library for PDF text extraction. This crate provides the parsing, layout analysis, font encoding recovery, and text extraction primitives used by the CLI (pdftract-cli) and Python bindings (pdftract-py).

Cargo.lock Policy

This workspace checks in Cargo.lock at the repository root. This is unconventional for library crates—the Cargo Book historically suggested that only binary crates should check in lockfiles, allowing library consumers to resolve their own dependency versions.

pdftract departs from this convention for release reproducibility:

  1. SLSA Level 3 provenance requires that every milestone tag produces byte-identical artifacts across builds. Without a checked-in lockfile, two runs of cargo build on the same commit can resolve different transitive dependency versions whenever a new semver-compatible release is published in between, producing different binary hashes.

  2. Multi-output artifacts—this workspace produces Rust crates (pdftract-core, pdftract-cli), Python wheels (pdftract-py), and Docker images. All must be built from the same dependency tree.

  3. Supply-chain security: the lockfile pins the exact version and checksum of each registry dependency, so cargo audit can check precisely what is built against the RustSec advisory database and flag vulnerable or yanked crates.

  4. Downstream consumers are unaffected. Cargo ignores the Cargo.lock of any crate pulled in as a dependency and resolves versions against the consumer's own lockfile, so the versions pinned here apply only to builds run from this workspace.

The tradeoff—occasional merge conflicts when PRs update overlapping dependencies—is worth the guarantee of reproducible releases. See CONTRIBUTING.md for the lockfile-update workflow.

Modules

  • parser: PDF spec parsing (xref, trailer, object streams, indirect references)
  • font: Font encoding recovery, glyph name lookup, fingerprinting
  • layout: Page layout analysis, region segmentation, reading order
  • extract: Text extraction with provenance (bounding boxes, confidence scores)
  • ocr: Tesseract integration for raster pages

Usage

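use pdftract_core::{extract_text, ExtractOptions};

// Assumes the error type returned by extract_text implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Start from the default extraction settings.
    let options = ExtractOptions::default();
    let result = extract_text("document.pdf", &options)?;
    println!("{}", result.text);
    Ok(())
}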