pdftract/pdftract-dotnet/src/Pdftract
jedarden 768b858c36 feat(pdftract-1w22d): implement .NET SDK subprocess wrapper
Complete implementation of the Pdftract NuGet package as a subprocess-
based SDK with async-first design using System.Diagnostics.Process and
System.Text.Json.

Implementation:
- All 9 contract methods (ExtractAsync, ExtractTextAsync, etc.) with sync
  wrappers in Pdftract.Sync.cs
- 8 exception types inheriting from PdftractException base class
- Source discriminated union (PathSource, UrlSource, BytesSource) with
  FromPath, FromUrl, FromUri, FromBytes factory methods
- C# record types for all models (Document, Page, Metadata, etc.)
- ExtractOptions, SearchOptions, HashOptions with PascalCase properties
- Source-generated JSON serialization via JsonContext for Native AOT
- IAsyncEnumerable streaming for NDJSON outputs
- CancellationToken propagation to Process.Kill(entireProcessTree: true)
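
Illustration of the streaming and cancellation items above. This is a minimal sketch in Python, not the C# code from this commit. `stream_ndjson` is a hypothetical name, and the child process is whatever argv the caller passes. Note that `proc.kill()` here terminates only the direct child, whereas Process.Kill(entireProcessTree: true) also terminates descendants.

```python
import asyncio
import json
import sys
from typing import AsyncIterator


async def stream_ndjson(argv: list[str]) -> AsyncIterator[dict]:
    """Yield one parsed JSON object per non-empty stdout line of a child process.

    Hypothetical helper: it mirrors the IAsyncEnumerable-over-NDJSON idea,
    not the SDK's real API.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE
    )
    try:
        assert proc.stdout is not None
        # StreamReader is async-iterable and yields one bytes line at a time.
        async for raw in proc.stdout:
            line = raw.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
        await proc.wait()
    finally:
        # Reached on normal completion, on a parse error, and when the
        # consumer is cancelled or closes the generator early.
        if proc.returncode is None:
            proc.kill()  # direct child only, not the whole process tree
            await proc.wait()
```

A consumer would iterate with `async for record in stream_ndjson(argv)`. Cancelling the consuming task raises inside the generator, so the `finally` block stops the child instead of leaving it running.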

Bug fixes:
- Fixed ArgumentList handling (the whole List was being added as a
  single element instead of one element per argument)
- Added source.Dispose() cleanup for BytesSource temporary files
- Added cleanup for the VerifyReceiptAsync temporary receipt file
- Enabled process.EnableRaisingEvents so the Exited event is raised
- Fixed output capture to preserve newlines between captured lines
- Switched JSON serialization to the source-generated JsonContext
  instead of reflection

Acceptance criteria:
- All 9 methods exposed as both async and sync variants
- All 8 exception classes inherit from PdftractException
- Models as C# records
- Supports net8.0 and net9.0
- CancellationToken terminates subprocess

Files modified:
- pdftract-dotnet/src/Pdftract/Pdftract.cs
- pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs
- pdftract-dotnet/src/Pdftract/Source/Source.cs
- pdftract-dotnet/src/Pdftract/Models/Document.cs
- pdftract-dotnet/src/Pdftract/Models/JsonContext.cs
- pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs
- pdftract-dotnet/README.md
- pdftract-dotnet/notes/pdftract-1w22d.md

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 19:50:57 -04:00
Codegen feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
Models feat(pdftract-1w22d): implement .NET SDK subprocess wrapper 2026-05-22 19:50:57 -04:00
Source feat(pdftract-1w22d): implement .NET SDK subprocess wrapper 2026-05-22 19:50:57 -04:00
Options.cs feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
Pdftract.cs feat(pdftract-1w22d): implement .NET SDK subprocess wrapper 2026-05-22 19:50:57 -04:00
Pdftract.csproj feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00
Pdftract.Sync.cs feat(pdftract-1w22d): implement .NET SDK subprocess wrapper 2026-05-22 19:50:57 -04:00
README.md feat(sdks): vendor dotnet/java/node SDKs into the monorepo 2026-05-22 07:20:19 -04:00

pdftract


A PDF text extraction library that gets the hard parts right.

Platform Support

Platform          Status
Linux x86_64      Fully CI-tested (gating CI on every PR)
Linux aarch64     Fully CI-tested
macOS x86_64      Build-tested; manually smoke-tested per release
macOS aarch64     Build-tested; manually smoke-tested per release
Windows x86_64    Build-tested; manually smoke-tested per release

Note: Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release. See docs/operations/manual-platform-smoke.md for the per-release smoke procedure.

Installation

cargo

cargo install pdftract

pip

pip install pdftract

Docker

docker pull ronaldraygun/pdftract:latest

Homebrew

brew install pdftract

Quickstart

Rust

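use pdftract_core::{extract_pdf, ExtractionOptions};

// Wrapped in main so the `?` operator has a Result to return into.
// Assumes the error type returned by extract_pdf implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ExtractionOptions::default();
    let doc = extract_pdf("file.pdf", &opts)?;
    println!("Extracted {} pages", doc.metadata.page_count);
    Ok(())
}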

Python

import pdftract

doc = pdftract.extract("file.pdf")
print(f"Extracted {doc['metadata']['page_count']} pages")

CLI

pdftract extract file.pdf --json result.json   # JSON output
pdftract extract file.pdf --text -             # Plain text to stdout
pdftract serve --port 8080                     # HTTP microservice
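
The CLI commands above can also be driven from a script. The following is a minimal sketch, assuming the `pdftract` binary is on PATH. `build_extract_argv` and `extract_text` are hypothetical helpers, not part of any pdftract package. Only the `extract` subcommand and the `--json` and `--text` flags shown above are used; whether the two flags can be combined in one call is not stated here, so the sketch passes at most one.

```python
import subprocess
from typing import Optional


def build_extract_argv(
    pdf_path: str,
    json_out: Optional[str] = None,
    text_out: Optional[str] = None,
) -> list[str]:
    """Build the argv for `pdftract extract` as a list, one element per argument."""
    if json_out is not None and text_out is not None:
        # Combining both flags is untested territory for this sketch.
        raise ValueError("pass either json_out or text_out, not both")
    argv = ["pdftract", "extract", pdf_path]
    if json_out is not None:
        argv += ["--json", json_out]
    if text_out is not None:
        argv += ["--text", text_out]
    return argv


def extract_text(pdf_path: str) -> str:
    """Run the CLI with `--text -` and return what it wrote to stdout."""
    result = subprocess.run(
        build_extract_argv(pdf_path, text_out="-"),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.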

What it does

  • Correct reading order — layout regions are segmented and sequenced before text is emitted, handling multi-column pages, sidebars, footnotes, and mixed-layout documents
  • Font encoding recovery — when ToUnicode CMaps are absent, wrong, or incomplete, pdftract works through a layered recovery pipeline: glyph name lookup, font fingerprinting, and glyph outline shape matching
  • Structure tree extraction — PDF/UA and PDF/A documents encode their logical structure; pdftract reads this directly when present
  • Per-page hybrid routing — each page is independently classified and routed to the appropriate pipeline: vector text extraction, full OCR, or assisted OCR
  • Structured output with provenance — the primary output is JSON carrying per-span bounding boxes, font name, size, and confidence score
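
Per-span provenance makes it possible to audit an extraction after the fact. The sketch below flags spans whose confidence falls under a threshold. The field names (`pages`, `spans`, `text`, `bbox`, `font`, `size`, `confidence`) are assumptions made for illustration, as is the sample document. Only `metadata.page_count` appears in the quickstart, so check the real schema before relying on any of them.

```python
from typing import Any, Iterator


def low_confidence_spans(
    doc: dict[str, Any], threshold: float = 0.8
) -> Iterator[dict[str, Any]]:
    """Yield spans with confidence below `threshold`, tagged with their page index.

    The layout doc["pages"][i]["spans"][j]["confidence"] is hypothetical.
    """
    for page_index, page in enumerate(doc.get("pages", [])):
        for span in page.get("spans", []):
            # Spans without a score are treated as fully confident.
            if span.get("confidence", 1.0) < threshold:
                yield {"page": page_index, **span}


# Hand-written sample, not real pdftract output.
sample = {
    "metadata": {"page_count": 1},
    "pages": [
        {
            "spans": [
                {"text": "Invoice", "bbox": [72.0, 700.0, 130.0, 714.0],
                 "font": "Helvetica", "size": 12.0, "confidence": 0.99},
                {"text": "T0tal", "bbox": [72.0, 120.0, 110.0, 132.0],
                 "font": "Helvetica", "size": 10.0, "confidence": 0.41},
            ]
        }
    ],
}

flagged = list(low_confidence_spans(sample))
for span in flagged:
    print(span["page"], span["text"], span["bbox"], span["confidence"])
```

Because each flagged span keeps its bounding box, a reviewer could map it back to the exact region of the page for manual checking.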

Documentation

License

Licensed under either of:

at your option.