pdftract/pdftract-dotnet/README.md
jedarden 768b858c36 feat(pdftract-1w22d): implement .NET SDK subprocess wrapper
Complete implementation of the Pdftract NuGet package as a subprocess-
based SDK with async-first design using System.Diagnostics.Process and
System.Text.Json.

Implementation:
- All 9 contract methods (ExtractAsync, ExtractTextAsync, etc.) with sync
  wrappers in Pdftract.Sync.cs
- 8 exception types inheriting from PdftractException base class
- Source discriminated union (PathSource, UrlSource, BytesSource) with
  FromPath, FromUrl, FromUri, FromBytes factory methods
- C# record types for all models (Document, Page, Metadata, etc.)
- ExtractOptions, SearchOptions, HashOptions with PascalCase properties
- Source-generated JSON serialization via JsonContext for Native AOT
- IAsyncEnumerable streaming for NDJSON outputs
- CancellationToken propagation to Process.Kill(entireProcessTree: true)

Bug fixes:
- Fixed ArgumentList handling (was adding List as single element)
- Added source.Dispose() cleanup for BytesSource temporary files
- Added cleanup for VerifyReceiptAsync temporary receipt file
- Added process.EnableRaisingEvents for proper event handling
- Fixed output capture to include newlines between lines
- Changed to source-generated JSON (JsonContext) instead of reflection

Acceptance criteria:
- All 9 methods exposed as both async and sync variants
- All 8 exception classes inherit from PdftractException
- Models as C# records
- Supports net8.0 and net9.0
- CancellationToken terminates subprocess

Files modified:
- pdftract-dotnet/src/Pdftract/Pdftract.cs
- pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs
- pdftract-dotnet/src/Pdftract/Source/Source.cs
- pdftract-dotnet/src/Pdftract/Models/Document.cs
- pdftract-dotnet/src/Pdftract/Models/JsonContext.cs
- pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs
- pdftract-dotnet/README.md
- pdftract-dotnet/notes/pdftract-1w22d.md

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 19:50:57 -04:00


# Pdftract .NET SDK
The .NET SDK for [pdftract](https://github.com/jedarden/pdftract). It is a subprocess wrapper around the `pdftract` binary for PDF text extraction, OCR, search, and metadata.
## Installation
```bash
dotnet add package Pdftract
```
## Quick Start
```csharp
using Pdftract;
using Pdftract.Models;
var client = new Pdftract();
// Extract structured data
var doc = await client.ExtractAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Pages: {doc.Pages.Count}");
// Extract plain text
var text = await client.ExtractTextAsync(Source.FromPath("document.pdf"));
// Extract markdown
var md = await client.ExtractMarkdownAsync(Source.FromPath("document.pdf"));
// Get metadata
var metadata = await client.GetMetadataAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Title: {metadata.Title}");
```
## Features
- **Extract**: Structured data, plain text, or markdown from PDFs
- **Search**: Full-text search with regex and whole-word options
- **Metadata**: Extract document metadata (title, author, page count, etc.)
- **Hash**: Compute document fingerprints for deduplication
- **Classify**: Automatic document classification
- **OCR**: Built-in OCR support for scanned documents
- **Async-first**: All methods return `Task<T>` or `IAsyncEnumerable<T>`; synchronous wrappers are also provided
- **AOT-compatible**: Works with Native AOT compilation
## Supported Platforms
- .NET 9.0 (recommended)
- .NET 8.0

.NET Framework 4.x is **not supported**.
## API Reference
### Source Types
```csharp
// From file path
var fromPath = Source.FromPath("document.pdf");
// From URL string
var fromUrl = Source.FromUrl("https://example.com/document.pdf");
// From URI
var uri = new Uri("https://example.com/document.pdf");
var fromUri = Source.FromUri(uri);
// From bytes
var data = await File.ReadAllBytesAsync("document.pdf");
var fromBytes = Source.FromBytes(data);
```
### Extraction Methods
```csharp
// Structured data with pages, spans, and blocks
var doc = await client.ExtractAsync(source, new ExtractOptions
{
    OcrLanguage = "eng",
    PreserveLayout = true
});
// Plain text
var text = await client.ExtractTextAsync(source);
// Markdown
var md = await client.ExtractMarkdownAsync(source);
// Streaming pages
await foreach (var page in client.ExtractStreamAsync(source))
{
    Console.WriteLine($"Page {page.PageIndex}: {page.Blocks.Count} blocks");
}
```
### Search
```csharp
await foreach (var match in client.SearchAsync(source, "pattern", new SearchOptions
{
    CaseInsensitive = true,
    Regex = true,
    WholeWord = false,
    MaxResults = 100
}))
{
    Console.WriteLine($"{match.Page}: {match.Text}");
    Console.WriteLine($" Context: {match.Context.Before}[MATCH]{match.Context.After}");
}
```
### Metadata
```csharp
var metadata = await client.GetMetadataAsync(source);
Console.WriteLine($"Title: {metadata.Title}");
Console.WriteLine($"Author: {metadata.Author}");
Console.WriteLine($"Page Count: {metadata.PageCount}");
Console.WriteLine($"Created: {metadata.Created}");
```
### Hash
```csharp
var fingerprint = await client.HashAsync(source);
Console.WriteLine($"Hash: {fingerprint.Hash}");
Console.WriteLine($"Fast Hash: {fingerprint.FastHash}");
```
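The deduplication use case mentioned under Features can be sketched as grouping files by fingerprint. This is a minimal illustration, not part of the SDK: `Dedup`, `FindDuplicatesAsync`, and the `hashOf` delegate are hypothetical names, and it assumes `fingerprint.Hash` is a `string`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Dedup
{
    // Groups paths by fingerprint and returns only the groups that
    // contain more than one path (that is, the duplicates).
    // `hashOf` is a hypothetical stand-in for the SDK call; with the SDK
    // it could be written as:
    //   async path => (await client.HashAsync(Source.FromPath(path))).Hash
    // assuming Hash is a string.
    public static async Task<Dictionary<string, List<string>>> FindDuplicatesAsync(
        IEnumerable<string> paths,
        Func<string, Task<string>> hashOf)
    {
        var groups = new Dictionary<string, List<string>>();
        foreach (var path in paths)
        {
            var hash = await hashOf(path);
            if (!groups.TryGetValue(hash, out var members))
            {
                members = new List<string>();
                groups[hash] = members;
            }
            members.Add(path);
        }

        // Keep only fingerprints shared by at least two files.
        return groups
            .Where(g => g.Value.Count > 1)
            .ToDictionary(g => g.Key, g => g.Value);
    }
}
```

Files are hashed one at a time here to keep the sketch simple. Each call starts a subprocess, so hashing in parallel may be worth considering for large batches.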
### Classification
```csharp
var classification = await client.ClassifyAsync(source);
Console.WriteLine($"Category: {classification.Category}");
Console.WriteLine($"Confidence: {classification.Confidence}");
Console.WriteLine($"Tags: {string.Join(", ", classification.Tags)}");
```
## Options
### ExtractOptions
| Option | Type | Description |
|--------|------|-------------|
| `Password` | `string?` | Password for encrypted PDFs |
| `OcrLanguage` | `string?` | ISO 639-3 language code for OCR |
| `OcrThreshold` | `double?` | Confidence threshold for OCR (0-1) |
| `PreserveLayout` | `bool?` | Preserve original reading order and layout |
| `ExtractImages` | `bool?` | Extract embedded images |
| `ImageFormat` | `string?` | Format for extracted images (png, jpg, webp) |
| `MinImageSize` | `int?` | Minimum dimension for image extraction |
| `Timeout` | `int?` | Maximum seconds to wait for the operation |
### SearchOptions
| Option | Type | Description |
|--------|------|-------------|
| `CaseInsensitive` | `bool?` | Ignore case when matching |
| `Regex` | `bool?` | Treat pattern as regular expression |
| `WholeWord` | `bool?` | Match only whole words |
| `MaxResults` | `int?` | Maximum matches to return |
### HashOptions
| Option | Type | Description |
|--------|------|-------------|
| `Password` | `string?` | Password for encrypted PDFs |
## Error Handling
The SDK provides specific exception types for different error conditions:
```csharp
try
{
    var doc = await client.ExtractAsync(source);
}
catch (CorruptPdfException ex)
{
    Console.WriteLine($"PDF is corrupt: {ex.Message}");
}
catch (EncryptionException ex)
{
    Console.WriteLine($"PDF is encrypted: {ex.Message}");
}
catch (SourceUnreachableException ex)
{
    Console.WriteLine($"Cannot read source: {ex.Message}");
}
catch (RemoteFetchInterruptedException ex)
{
    Console.WriteLine($"Network error: {ex.Message}");
}
catch (TlsException ex)
{
    Console.WriteLine($"TLS error: {ex.Message}");
}
catch (ReceiptVerifyException ex)
{
    Console.WriteLine($"Receipt verification failed: {ex.Message}");
}
catch (PdftractException ex)
{
    Console.WriteLine($"pdftract error (exit {ex.ExitCode}): {ex.Message}");
}
```
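Some of these failures, such as an interrupted remote fetch, may be transient and worth retrying. The helper below is a minimal sketch of retry with exponential backoff. `Retry`, `WithBackoffAsync`, and its parameters are hypothetical names and not part of the SDK, and whether a given exception is safe to retry is an assumption the caller makes through `isTransient`.

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Runs `operation` up to `maxAttempts` times. An exception is retried
    // only when `isTransient` returns true for it and attempts remain;
    // otherwise it propagates to the caller unchanged.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? initialDelay = null)
    {
        var delay = initialDelay ?? TimeSpan.FromMilliseconds(200);
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(delay);
                // Double the wait before the next attempt.
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

With the SDK, a call might look like `await Retry.WithBackoffAsync(() => client.ExtractAsync(source), ex => ex is RemoteFetchInterruptedException)`. Errors such as `CorruptPdfException` are unlikely to succeed on a second attempt, so they are best left out of the predicate.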
## Conformance
The SDK ships a conformance test suite that verifies compliance with the pdftract contract. See the [conformance documentation](https://github.com/jedarden/pdftract/blob/main/docs/conformance/sdk-contract.md) for details.
## Native AOT
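This SDK is designed to work with Native AOT compilation: it serializes JSON through a source-generated context (`JsonContext`) rather than reflection. To publish your application with Native AOT, enable it in your project file:
```xml
<PropertyGroup>
  <PublishAot>true</PublishAot>
</PropertyGroup>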
```
## License
MIT
## Links
- [pdftract](https://github.com/jedarden/pdftract)
- [Documentation](https://github.com/jedarden/pdftract/tree/main/docs)
- [Conformance](https://github.com/jedarden/pdftract/blob/main/docs/conformance/sdk-contract.md)