Implement coordinate transform from HOCR pixel space to PDF user-space points, accounting for the 10px white border added in preprocessing (Phase 5.3.4) and the DPI used at render time (Phase 5.2). Changes: - Add HOCR_BORDER_PADDING constant (10px) to match preprocessing padding - Add HocrWord::to_pdf_bbox() method for coordinate conversion - Add apply_rotation_to_bbox() helper for page rotation handling Coordinate transform steps: 1. Subtract padding (pixel space): hocr_px - 10 2. Scale to points: px * 72.0 / dpi 3. Flip Y-axis: pdf_y = page_height_pt - hocr_y_pt 4. Apply rotation (if specified): 0°, 90°, 180°, 270° 5. Add cell origin (if hybrid): offset by cell's PDF origin Tests added: - test_to_pdf_bbox_basic_conversion: Critical test from plan line 1908 - test_to_pdf_bbox_y_flip_sanity: Top-of-page word has highest PDF Y - test_to_pdf_bbox_padding_subtraction: Padding edge case - test_to_pdf_bbox_different_dpi: 200/300/400 DPI verification - test_to_pdf_bbox_hybrid_cell_offset: Cell-local to global coords - test_to_pdf_bbox_clamps_negative_coords: Bbox within padding - Rotation tests: 0°, 90°, 180°, 270°, and invalid angles Acceptance criteria: ✓ Critical test (line 1908): HOCR bbox at (10,10,100,30) at 300 DPI ✓ Y-flip sanity: top-of-page has highest PDF Y ✓ Hybrid cell test: cell offset applied correctly ○ 100-page OCR output: requires OCR infrastructure (deferred) Refs: pdftract-2gto, plan lines 1899-1927 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| benches | ||
| build | ||
| examples | ||
| proptest-regressions/parser/lexer | ||
| src | ||
| tests | ||
| __test__.pdf | ||
| build.rs | ||
| Cargo.toml | ||
| pdftract-core.cdx.json | ||
| README.md | ||
pdftract-core
The core Rust library for PDF text extraction. This crate provides the parsing, layout analysis, font encoding recovery, and text extraction primitives used by the CLI (pdftract-cli) and Python bindings (pdftract-py).
Cargo.lock Policy
This workspace checks in Cargo.lock at the repository root. This is unconventional for library crates—the Cargo Book historically suggested that only binary crates should check in lockfiles, allowing library consumers to resolve their own dependency versions.
pdftract departs from this convention for release reproducibility:
-
SLSA Level 3 provenance requires that every milestone tag produces byte-identical artifacts across builds. Without a checked-in lockfile, two runs of
cargo buildon the same commit can resolve different transitive dependency versions, producing different binary hashes. -
Multi-output artifacts—this workspace produces Rust crates (
pdftract-core,pdftract-cli), Python wheels (pdftract-py), and Docker images. All must be built from the same dependency tree. -
Supply-chain security—the lockfile pins checksums for all transitive dependencies, enabling
cargo auditto detect yanked or compromised crates. -
Downstream consumers can still ignore the lockfile if needed. Cargo allows
cargo build --frozenwith a local lockfile override, or consumers can vendor the crate with their own dependency resolution.
The tradeoff—occasional merge conflicts when PRs update overlapping dependencies—is worth the guarantee of reproducible releases. See CONTRIBUTING.md for the lockfile-update workflow.
Modules
parser: PDF spec parsing (xref, trailer, object streams, indirect references)font: Font encoding recovery, glyph name lookup, fingerprintinglayout: Page layout analysis, region segmentation, reading orderextract: Text extraction with provenance (bounding boxes, confidence scores)ocr: Tesseract integration for raster pages
Usage
use pdftract_core::{extract_text, ExtractOptions};
let options = ExtractOptions::default();
let result = extract_text("document.pdf", &options)?;
println!("{}", result.text);