jedarden d0f52751ce fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00

The indent trigger was using .abs(), which fired on both increased indent (non-indented → indented) AND decreased indent (indented → non-indented). This caused drop-cap-style paragraphs (indented first line, flush-left continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, an indent change should only trigger when the current line is MORE indented (to the right, larger x0) than the block average, i.e., when a new paragraph starts after non-indented text. It should NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
test_indented_first_line_new_block: PASS (non-indented → indented splits)
test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
All 179 line module tests: PASS
benches	feat(pdftract-1z0qt): add encryption verification note	2026-05-28 08:09:53 -04:00
bin	fix(pyo3): correct extract_text_fn call in extract_markdown stub	2026-05-28 20:28:25 -04:00
build	feat(pdftract-1xf4d): implement TH-06 supply-chain gate	2026-05-26 17:31:13 -04:00
examples	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
proptest-regressions/parser/lexer	feat(pdftract-1jjn): implement PDF numeric literal lexer with full edge case support	2026-05-23 23:17:04 -04:00
scripts	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
src	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
tests	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
__test__.pdf	feat(pdftract-15pz8): implement multi-process safe cache operations	2026-05-23 05:31:11 -04:00
build.rs	fix(pdftract-4pnmd): build.rs doc comment format string parsing	2026-05-28 14:36:45 -04:00
Cargo.toml	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
check_doc_coverage.sh	fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs	2026-06-01 04:14:05 -04:00
doc_coverage.py	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
pdftract-core.cdx.json	feat(pdftract-67tm8): implement MCP stdio transport with integration tests	2026-05-23 00:16:42 -04:00
README.md	docs(pdftract-1mp49): Add OCR example and docs.rs badge to pdftract-core	2026-06-02 18:31:35 -04:00
test_simple_extract.rs	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs	2026-06-07 13:43:19 -04:00
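
The indent-trigger fix described in the latest commit message can be illustrated with a minimal sketch. The names line_x0, block_avg_x0, and threshold come from the commit message; the function names and the sample coordinates are hypothetical and are not taken from the crate's source.

```rust
/// Hypothetical helper mirroring the fixed check: a line starts a new block
/// only when it begins further to the RIGHT than the block's average left edge.
fn indent_triggers_split(line_x0: f64, block_avg_x0: f64, threshold: f64) -> bool {
    // Signed difference: positive only for increased indent.
    line_x0 - block_avg_x0 > threshold
}

/// The earlier behavior, kept here for comparison only: the absolute value
/// makes the check fire for a change of indent in either direction.
fn indent_triggers_split_abs(line_x0: f64, block_avg_x0: f64, threshold: f64) -> bool {
    (line_x0 - block_avg_x0).abs() > threshold
}
```

With an illustrative threshold of 5.0, a line at x0 = 90.0 following a block averaging 72.0 splits under both versions. A flush-left continuation line at 72.0 following an indented first line at 90.0 splits only under the .abs() version, which is the drop-cap case the commit fixes.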

README.md

pdftract-core

The core Rust library for PDF text extraction. This crate provides the parsing, layout analysis, font encoding recovery, and text extraction primitives used by the CLI (pdftract-cli) and Python bindings (pdftract-py).

Cargo.lock Policy

This workspace checks in Cargo.lock at the repository root. This is unconventional for library crates—the Cargo Book historically suggested that only binary crates should check in lockfiles, allowing library consumers to resolve their own dependency versions.

pdftract departs from this convention for release reproducibility:

Reproducible provenance: pdftract's SLSA Level 3 release process expects every milestone tag to produce byte-identical artifacts across builds. Without a checked-in lockfile, two runs of cargo build on the same commit can resolve different transitive dependency versions if new releases are published in between, producing different binary hashes.
Multi-output artifacts—this workspace produces Rust crates (pdftract-core, pdftract-cli), Python wheels (pdftract-py), and Docker images. All must be built from the same dependency tree.
Supply-chain security—the lockfile pins checksums for all transitive dependencies, enabling cargo audit to detect yanked or compromised crates.
Downstream consumers are not bound by this lockfile. Cargo ignores the Cargo.lock of any crate pulled in as a dependency and resolves versions against the consumer's own lockfile, so applications that depend on pdftract-core keep full control over dependency resolution.

The tradeoff—occasional merge conflicts when PRs update overlapping dependencies—is worth the guarantee of reproducible releases. See CONTRIBUTING.md for the lockfile-update workflow.

Modules

parser: PDF spec parsing (xref, trailer, object streams, indirect references)
font: Font encoding recovery, glyph name lookup, fingerprinting
layout: Page layout analysis, region segmentation, reading order
extract: Text extraction with provenance (bounding boxes, confidence scores)
ocr: Tesseract integration for raster pages

Testing

SDK Conformance Tests

The conformance integration test validates that pdftract-core's public API satisfies the SDK contract shared across all language implementations. The test rig runs shared conformance cases from tests/sdk-conformance/cases.json and verifies correct behavior for all 9 SDK contract methods.

# Run the conformance suite
cargo test --test conformance

# Run with specific features
cargo test --test conformance --features ocr,profiles,remote,receipts

The conformance suite covers:

extract — Full extraction with structured Document output
extract_text — Plain text extraction
extract_markdown — Markdown-formatted extraction with tables and headings
extract_stream — Streaming NDJSON extraction for large documents
search — Pattern search with regex and case-insensitive options
get_metadata — PDF metadata (page count, title, author, creator)
hash — Content fingerprinting (SHA256) with fast hash variant
classify — Document classification with category and confidence
verify_receipt — Receipt verification against signed metadata

Each test case validates expected results, applying numeric tolerances to bounding boxes and confidence scores. Feature-gated tests (OCR, decryption, classification, receipts, remote) skip automatically when the corresponding Cargo feature is not enabled at compile time.
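
As a rough illustration of what a tolerance check on bounding boxes and confidence scores might look like, here is a minimal sketch. The BBox struct, the helper names, and the tolerance values are all assumptions made for this example; the actual conformance rig and the crate's types may differ.

```rust
/// Hypothetical bounding box type, used only for this sketch.
#[derive(Debug, Clone, Copy)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

/// True when `actual` is within `tol` of `expected` (absolute tolerance).
fn approx_eq(actual: f64, expected: f64, tol: f64) -> bool {
    (actual - expected).abs() <= tol
}

/// True when every coordinate of `actual` is within `tol` of `expected`.
fn bbox_matches(actual: &BBox, expected: &BBox, tol: f64) -> bool {
    approx_eq(actual.x0, expected.x0, tol)
        && approx_eq(actual.y0, expected.y0, tol)
        && approx_eq(actual.x1, expected.x1, tol)
        && approx_eq(actual.y1, expected.y1, tol)
}
```

An absolute tolerance is the simplest choice for page coordinates, which share one unit and scale. A relative tolerance could be a better fit if cases mix values of very different magnitude.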

See CONTRIBUTING.md for more details on the conformance suite and adding new test cases.

Usage

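use pdftract_core::{extract_text, ExtractOptions};

// Assumes the crate's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Extract plain text from a PDF using the default options.
    let options = ExtractOptions::default();
    let result = extract_text("document.pdf", &options)?;
    println!("{}", result.text);
    Ok(())
}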