jedarden 768b858c36	2026-05-22 19:50:57 -04:00

feat(pdftract-1w22d): implement .NET SDK subprocess wrapper

Complete implementation of the Pdftract NuGet package as a subprocess-based SDK with async-first design using System.Diagnostics.Process and System.Text.Json.

Implementation:
- All 9 contract methods (ExtractAsync, ExtractTextAsync, etc.) with sync wrappers in Pdftract.Sync.cs
- 8 exception types inheriting from PdftractException base class
- Source discriminated union (PathSource, UrlSource, BytesSource) with FromPath, FromUrl, FromUri, FromBytes factory methods
- C# record types for all models (Document, Page, Metadata, etc.)
- ExtractOptions, SearchOptions, HashOptions with PascalCase properties
- Source-generated JSON serialization via JsonContext for Native AOT
- IAsyncEnumerable streaming for NDJSON outputs
- CancellationToken propagation to Process.Kill(entireProcessTree: true)

Bug fixes:
- Fixed ArgumentList handling (was adding List as single element)
- Added source.Dispose() cleanup for BytesSource temporary files
- Added cleanup for VerifyReceiptAsync temporary receipt file
- Added process.EnableRaisingEvents for proper event handling
- Fixed output capture to include newlines between lines
- Changed to source-generated JSON (JsonContext) instead of reflection

Acceptance criteria:
- All 9 methods exposed as both async and sync variants
- All 8 exception classes inherit from PdftractException
- Models as C# records
- Supports net8.0 and net9.0
- CancellationToken terminates subprocess

Files modified:
- pdftract-dotnet/src/Pdftract/Pdftract.cs
- pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs
- pdftract-dotnet/src/Pdftract/Source/Source.cs
- pdftract-dotnet/src/Pdftract/Models/Document.cs
- pdftract-dotnet/src/Pdftract/Models/JsonContext.cs
- pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs
- pdftract-dotnet/README.md
- pdftract-dotnet/notes/pdftract-1w22d.md

Co-Authored-By: Claude Code <noreply@anthropic.com>
Name	Last commit	Date
Codegen	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
Models	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
Source	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
Options.cs	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
Pdftract.cs	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
Pdftract.csproj	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00
Pdftract.Sync.cs	feat(pdftract-1w22d): implement .NET SDK subprocess wrapper	2026-05-22 19:50:57 -04:00
README.md	feat(sdks): vendor dotnet/java/node SDKs into the monorepo	2026-05-22 07:20:19 -04:00

README.md

pdftract

A PDF text extraction library that gets the hard parts right.

Platform Support

Platform	Status
Linux x86_64	Fully CI-tested (gating CI on every PR)
Linux aarch64	Fully CI-tested
macOS x86_64	Build-tested; manually smoke-tested per release
macOS aarch64	Build-tested; manually smoke-tested per release
Windows x86_64	Build-tested; manually smoke-tested per release

Note: See docs/operations/manual-platform-smoke.md for the per-release smoke procedure used on macOS and Windows.

Installation

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

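use pdftract_core::{extract_pdf, ExtractionOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Start from the default extraction settings.
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}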

Python

import pdftract

doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice
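
The extract command above can also be driven from a script. The following is a minimal sketch, not part of pdftract itself: it assumes the pdftract binary is on PATH and that --json writes the result to the given path, as the CLI example suggests. The helper names build_extract_command and extract_to_json are hypothetical.

```python
import json
import shutil
import subprocess
from pathlib import Path


def build_extract_command(pdf_path, json_path):
    """Build the argument list for `pdftract extract <pdf> --json <out>`."""
    # Only the subcommand and flag shown in the CLI examples are used here.
    return ["pdftract", "extract", str(pdf_path), "--json", str(json_path)]


def extract_to_json(pdf_path, json_path):
    """Run the CLI and return the parsed JSON result (hypothetical helper)."""
    if shutil.which("pdftract") is None:
        raise FileNotFoundError("pdftract binary not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_extract_command(pdf_path, json_path), check=True)
    return json.loads(Path(json_path).read_text(encoding="utf-8"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.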

What it does

Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
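
To illustrate how per-span provenance might be used downstream, here is a small sketch that filters spans by confidence. Only metadata.page_count appears in the Quickstart. The "spans" key and the field names "bbox", "font", "size", and "confidence" are assumptions modelled on the description above, and the real JSON schema may differ.

```python
import json

# Hypothetical sample shaped after the feature description; not real pdftract output.
SAMPLE = json.loads("""
{
  "metadata": {"page_count": 1},
  "spans": [
    {"text": "Introduction", "bbox": [72.0, 700.0, 180.0, 716.0],
     "font": "Times-Bold", "size": 14.0, "confidence": 0.99},
    {"text": "lnt roduct1on", "bbox": [72.0, 650.0, 170.0, 662.0],
     "font": "Unknown", "size": 10.0, "confidence": 0.41}
  ]
}
""")


def confident_spans(doc, threshold=0.9):
    """Return spans whose confidence is at or above the threshold."""
    return [s for s in doc["spans"] if s["confidence"] >= threshold]


def low_confidence_boxes(doc, threshold=0.9):
    """Return bounding boxes of spans below the threshold, e.g. for manual review."""
    return [tuple(s["bbox"]) for s in doc["spans"] if s["confidence"] < threshold]


if __name__ == "__main__":
    for span in confident_spans(SAMPLE):
        print(span["text"], span["font"], span["size"])
```

Keeping the bounding box next to each low-confidence span lets a reviewer jump to the exact region of the page that needs checking.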

Documentation

User docs: docs/user-docs (mdBook)
API reference: docs.rs/pdftract
Contributing guide: CONTRIBUTING.md
Security policy: SECURITY.md
Changelog: CHANGELOG.md
License: LICENSE-MIT or LICENSE-APACHE

License

Licensed under either of:

MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)

at your option.