jedarden 768b858c36 feat(pdftract-1w22d): implement .NET SDK subprocess wrapper

Complete implementation of the Pdftract NuGet package as a subprocess-
based SDK with async-first design using System.Diagnostics.Process and
System.Text.Json.

Implementation:
- All 9 contract methods (ExtractAsync, ExtractTextAsync, etc.) with sync
  wrappers in Pdftract.Sync.cs
- 8 exception types inheriting from PdftractException base class
- Source discriminated union (PathSource, UrlSource, BytesSource) with
  FromPath, FromUrl, FromUri, FromBytes factory methods
- C# record types for all models (Document, Page, Metadata, etc.)
- ExtractOptions, SearchOptions, HashOptions with PascalCase properties
- Source-generated JSON serialization via JsonContext for Native AOT
- IAsyncEnumerable streaming for NDJSON outputs
- CancellationToken propagation to Process.Kill(entireProcessTree: true)

Bug fixes:
- Fixed ArgumentList handling (was adding List as single element)
- Added source.Dispose() cleanup for BytesSource temporary files
- Added cleanup for VerifyReceiptAsync temporary receipt file
- Added process.EnableRaisingEvents for proper event handling
- Fixed output capture to include newlines between lines
- Changed to source-generated JSON (JsonContext) instead of reflection

Acceptance criteria:
- All 9 methods exposed as both async and sync variants
- All 8 exception classes inherit from PdftractException
- Models as C# records
- Supports net8.0 and net9.0
- CancellationToken terminates subprocess

Files modified:
- pdftract-dotnet/src/Pdftract/Pdftract.cs
- pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs
- pdftract-dotnet/src/Pdftract/Source/Source.cs
- pdftract-dotnet/src/Pdftract/Models/Document.cs
- pdftract-dotnet/src/Pdftract/Models/JsonContext.cs
- pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs
- pdftract-dotnet/README.md
- pdftract-dotnet/notes/pdftract-1w22d.md

Co-Authored-By: Claude Code <noreply@anthropic.com>

2026-05-22 19:50:57 -04:00

Pdftract .NET SDK

The .NET SDK for pdftract — a subprocess wrapper around the pdftract binary for PDF text extraction, OCR, search, and metadata.

Installation

dotnet add package Pdftract

Quick Start

using Pdftract;
using Pdftract.Models;

var client = new Pdftract();

// Extract structured data
var doc = await client.ExtractAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Pages: {doc.Pages.Count}");

// Extract plain text
var text = await client.ExtractTextAsync(Source.FromPath("document.pdf"));

// Extract markdown
var md = await client.ExtractMarkdownAsync(Source.FromPath("document.pdf"));

// Get metadata
var metadata = await client.GetMetadataAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Title: {metadata.Title}");

Features

Extract: Structured data, plain text, or markdown from PDFs
Search: Full-text search with regex and whole-word options
Metadata: Extract document metadata (title, author, page count, etc.)
Hash: Compute document fingerprints for deduplication
Classify: Automatic document classification
OCR: Built-in OCR support for scanned documents
Async-first: Async methods return Task<T> or IAsyncEnumerable<T>; synchronous wrappers are also provided
AOT-compatible: Works with Native AOT compilation

Supported Platforms

.NET 9.0 (recommended)
.NET 8.0

.NET Framework 4.x is not supported.

API Reference

Source Types

// From file path
var pathSource = Source.FromPath("document.pdf");

// From URL string
var urlSource = Source.FromUrl("https://example.com/document.pdf");

// From URI
var uri = new Uri("https://example.com/document.pdf");
var uriSource = Source.FromUri(uri);

// From bytes (held in memory; the SDK writes them to a temporary file)
var data = await File.ReadAllBytesAsync("document.pdf");
var bytesSource = Source.FromBytes(data);

Extraction Methods

// Structured data with pages, spans, and blocks
var doc = await client.ExtractAsync(source, new ExtractOptions
{
    OcrLanguage = "eng",
    PreserveLayout = true
});

// Plain text
var text = await client.ExtractTextAsync(source);

// Markdown
var md = await client.ExtractMarkdownAsync(source);

// Streaming pages
await foreach (var page in client.ExtractStreamAsync(source))
{
    Console.WriteLine($"Page {page.PageIndex}: {page.Blocks.Count} blocks");
}
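
The commit notes describe streaming as IAsyncEnumerable over NDJSON output (one JSON object per line). The general pattern can be sketched as below. This is an illustration only, not the SDK's actual code: `PageSketch`, `NdjsonSketch`, and the JSON field names are hypothetical, and reflection-based deserialization is used for brevity where the SDK itself reportedly uses a source-generated context.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical page shape for illustration; the SDK's real Page record may differ.
public sealed record PageSketch(int PageIndex, string Text);

public static class NdjsonSketch
{
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNameCaseInsensitive = true
    };

    // Reads one JSON object per line and yields each item as soon as it is parsed,
    // so the caller can process early pages before the stream has ended.
    public static async IAsyncEnumerable<PageSketch> ReadPagesAsync(
        TextReader reader,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();

            string? line = await reader.ReadLineAsync();
            if (line is null) yield break;      // end of stream
            if (line.Length == 0) continue;     // skip blank lines

            var page = JsonSerializer.Deserialize<PageSketch>(line, Options);
            if (page is not null) yield return page;
        }
    }
}
```

In a subprocess wrapper the `TextReader` would typically be the child process's redirected standard output, which is why items can be consumed with `await foreach` while the process is still running.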

Search

await foreach (var match in client.SearchAsync(source, "pattern", new SearchOptions
{
    CaseInsensitive = true,
    Regex = true,
    WholeWord = false,
    MaxResults = 100
}))
{
    Console.WriteLine($"{match.Page}: {match.Text}");
    Console.WriteLine($"  Context: {match.Context.Before}[MATCH]{match.Context.After}");
}

Metadata

var metadata = await client.GetMetadataAsync(source);
Console.WriteLine($"Title: {metadata.Title}");
Console.WriteLine($"Author: {metadata.Author}");
Console.WriteLine($"Page Count: {metadata.PageCount}");
Console.WriteLine($"Created: {metadata.Created}");

Hash

var fingerprint = await client.HashAsync(source);
Console.WriteLine($"Hash: {fingerprint.Hash}");
Console.WriteLine($"Fast Hash: {fingerprint.FastHash}");
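
Since fingerprints are intended for deduplication, a rough sketch of grouping files by hash follows. `DedupSketch` and `FindDuplicatesAsync` are hypothetical helpers, not part of the package; the hashing step is passed in as a delegate so the snippet compiles on its own, and it assumes the hash can be treated as a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DedupSketch
{
    // Groups paths by fingerprint and returns only the groups with more than one
    // member, i.e. sets of files whose hashes are identical.
    public static async Task<List<List<string>>> FindDuplicatesAsync(
        IEnumerable<string> paths,
        Func<string, Task<string>> hashOf)   // hypothetical hook; wrap the SDK call here
    {
        var byHash = new Dictionary<string, List<string>>(StringComparer.Ordinal);

        foreach (var path in paths)
        {
            var hash = await hashOf(path);
            if (!byHash.TryGetValue(hash, out var group))
            {
                group = new List<string>();
                byHash[hash] = group;
            }
            group.Add(path);
        }

        return byHash.Values.Where(g => g.Count > 1).ToList();
    }
}
```

With the SDK, the delegate might look like `async p => (await client.HashAsync(Source.FromPath(p))).Hash`, assuming `Hash` is a string as the example above suggests.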

Classification

var classification = await client.ClassifyAsync(source);
Console.WriteLine($"Category: {classification.Category}");
Console.WriteLine($"Confidence: {classification.Confidence}");
Console.WriteLine($"Tags: {string.Join(", ", classification.Tags)}");

Options

ExtractOptions

Option	Type	Description
`Password`	`string?`	Password for encrypted PDFs
`OcrLanguage`	`string?`	ISO 639-3 language code for OCR
`OcrThreshold`	`double?`	Confidence threshold for OCR (0-1)
`PreserveLayout`	`bool?`	Preserve original reading order and layout
`ExtractImages`	`bool?`	Extract embedded images
`ImageFormat`	`string?`	Format for extracted images (png, jpg, webp)
`MinImageSize`	`int?`	Minimum dimension for image extraction
`Timeout`	`int?`	Maximum seconds to wait for the operation

SearchOptions

Option	Type	Description
`CaseInsensitive`	`bool?`	Ignore case when matching
`Regex`	`bool?`	Treat pattern as regular expression
`WholeWord`	`bool?`	Match only whole words
`MaxResults`	`int?`	Maximum matches to return

HashOptions

Option	Type	Description
`Password`	`string?`	Password for encrypted PDFs

Error Handling

The SDK provides specific exception types for different error conditions:

try
{
    var doc = await client.ExtractAsync(source);
}
catch (CorruptPdfException ex)
{
    Console.WriteLine($"PDF is corrupt: {ex.Message}");
}
catch (EncryptionException ex)
{
    Console.WriteLine($"PDF is encrypted: {ex.Message}");
}
catch (SourceUnreachableException ex)
{
    Console.WriteLine($"Cannot read source: {ex.Message}");
}
catch (RemoteFetchInterruptedException ex)
{
    Console.WriteLine($"Network error: {ex.Message}");
}
catch (TlsException ex)
{
    Console.WriteLine($"TLS error: {ex.Message}");
}
catch (ReceiptVerifyException ex)
{
    Console.WriteLine($"Receipt verification failed: {ex.Message}");
}
catch (PdftractException ex)
{
    Console.WriteLine($"pdftract error (exit {ex.ExitCode}): {ex.Message}");
}

Conformance

The SDK ships a conformance test suite that verifies compliance with the pdftract contract. See the conformance documentation for details.

Native AOT

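This SDK is designed to work with Native AOT compilation and uses source-generated JSON serialization internally, so it does not rely on reflection-based serialization. To publish your application with Native AOT, enable it in your project file: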

<PropertyGroup>
  <PublishAot>true</PublishAot>
</PropertyGroup>
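
If your own application code also serializes JSON, it generally needs a source-generated context as well to stay AOT-safe. A minimal sketch using System.Text.Json's source generator is shown below; `IndexEntry`, `AppJsonContext`, and `IndexEntryJson` are hypothetical application types, not part of the Pdftract package.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical application type; not part of the Pdftract package.
public sealed record IndexEntry(string FileName, int PageCount);

// The System.Text.Json source generator fills in this partial class at build time,
// so IndexEntry can be serialized without runtime reflection.
[JsonSerializable(typeof(IndexEntry))]
public partial class AppJsonContext : JsonSerializerContext
{
}

public static class IndexEntryJson
{
    // Both calls pass the generated type info instead of relying on reflection.
    public static string Serialize(IndexEntry entry) =>
        JsonSerializer.Serialize(entry, AppJsonContext.Default.IndexEntry);

    public static IndexEntry? Deserialize(string json) =>
        JsonSerializer.Deserialize(json, AppJsonContext.Default.IndexEntry);
}
```

Passing `AppJsonContext.Default.IndexEntry` explicitly is the design point here: the overloads that take only a type parameter fall back to reflection, which produces trimming and AOT warnings.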

License

MIT
