Complete implementation of the Pdftract NuGet package as a subprocess-based SDK with async-first design using System.Diagnostics.Process and System.Text.Json.

Implementation:
- All 9 contract methods (ExtractAsync, ExtractTextAsync, etc.) with sync wrappers in Pdftract.Sync.cs
- 8 exception types inheriting from PdftractException base class
- Source discriminated union (PathSource, UrlSource, BytesSource) with FromPath, FromUrl, FromUri, FromBytes factory methods
- C# record types for all models (Document, Page, Metadata, etc.)
- ExtractOptions, SearchOptions, HashOptions with PascalCase properties
- Source-generated JSON serialization via JsonContext for Native AOT
- IAsyncEnumerable streaming for NDJSON outputs
- CancellationToken propagation to Process.Kill(entireProcessTree: true)

Bug fixes:
- Fixed ArgumentList handling (was adding List as single element)
- Added source.Dispose() cleanup for BytesSource temporary files
- Added cleanup for VerifyReceiptAsync temporary receipt file
- Added process.EnableRaisingEvents for proper event handling
- Fixed output capture to include newlines between lines
- Changed to source-generated JSON (JsonContext) instead of reflection

Acceptance criteria:
- All 9 methods exposed as both async and sync variants
- All 8 exception classes inherit from PdftractException
- Models as C# records
- Supports net8.0 and net9.0
- CancellationToken terminates subprocess

Files modified:
- pdftract-dotnet/src/Pdftract/Pdftract.cs
- pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs
- pdftract-dotnet/src/Pdftract/Source/Source.cs
- pdftract-dotnet/src/Pdftract/Models/Document.cs
- pdftract-dotnet/src/Pdftract/Models/JsonContext.cs
- pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs
- pdftract-dotnet/README.md
- pdftract-dotnet/notes/pdftract-1w22d.md

Co-Authored-By: Claude Code <noreply@anthropic.com>

# Pdftract .NET SDK

The .NET SDK for [pdftract](https://github.com/jedarden/pdftract), a subprocess wrapper around the `pdftract` binary for PDF text extraction, OCR, search, and metadata.

## Installation

```bash
dotnet add package Pdftract
```

## Quick Start

```csharp
using Pdftract;
using Pdftract.Models;

var client = new Pdftract();

// Extract structured data
var doc = await client.ExtractAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Pages: {doc.Pages.Count}");

// Extract plain text
var text = await client.ExtractTextAsync(Source.FromPath("document.pdf"));

// Extract markdown
var md = await client.ExtractMarkdownAsync(Source.FromPath("document.pdf"));

// Get metadata
var metadata = await client.GetMetadataAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Title: {metadata.Title}");
```

## Features

- **Extract**: Structured data, plain text, or markdown from PDFs
- **Search**: Full-text search with regex and whole-word options
- **Metadata**: Extract document metadata (title, author, page count, etc.)
- **Hash**: Compute document fingerprints for deduplication
- **Classify**: Automatic document classification
- **OCR**: Built-in OCR support for scanned documents
- **Async-first**: All methods return `Task<T>` or `IAsyncEnumerable<T>`
- **AOT-compatible**: Works with Native AOT compilation

## Supported Platforms

- .NET 9.0 (recommended)
- .NET 8.0

.NET Framework 4.x is **not supported**.

## API Reference

### Source Types

```csharp
// From file path
var source = Source.FromPath("document.pdf");

// From URL string
var source = Source.FromUrl("https://example.com/document.pdf");

// From URI
var uri = new Uri("https://example.com/document.pdf");
var source = Source.FromUri(uri);

// From bytes
var data = await File.ReadAllBytesAsync("document.pdf");
var source = Source.FromBytes(data);
```

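
Each factory method above produces one of a few source kinds (path, URL, or bytes). As a rough illustration of how such a closed set of cases can be modeled in C#, the sketch below uses a record hierarchy and a `switch` expression. It is not the SDK's actual code: `DemoSource`, `PathCase`, `UrlCase`, `BytesCase`, and `Describe` are hypothetical names invented for this example.

```csharp
using System;

// Usage: build each case through a factory and inspect it.
Console.WriteLine(DemoSource.FromPath("document.pdf").Describe());
Console.WriteLine(DemoSource.FromUrl("https://example.com/document.pdf").Describe());
Console.WriteLine(DemoSource.FromBytes(new byte[] { 1, 2, 3 }).Describe());

// Hypothetical stand-in for a source union; not the SDK's real types.
public abstract record DemoSource
{
    // Private constructor: only the nested records below can derive from it.
    private DemoSource() { }

    public sealed record PathCase(string Path) : DemoSource;
    public sealed record UrlCase(Uri Url) : DemoSource;
    public sealed record BytesCase(byte[] Data) : DemoSource;

    public static DemoSource FromPath(string path) => new PathCase(path);
    public static DemoSource FromUrl(string url) => new UrlCase(new Uri(url));
    public static DemoSource FromBytes(byte[] data) => new BytesCase(data);
}

public static class DemoSourceExtensions
{
    // Pattern matching picks the branch for the concrete case.
    public static string Describe(this DemoSource source) => source switch
    {
        DemoSource.PathCase p => $"path:{p.Path}",
        DemoSource.UrlCase u => $"url:{u.Url}",
        DemoSource.BytesCase b => $"bytes:{b.Data.Length}",
        _ => throw new ArgumentOutOfRangeException(nameof(source))
    };
}
```

The private constructor keeps the set of cases closed, so a consumer that handles all three nested records has covered every possibility.
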
### Extraction Methods

```csharp
// Structured data with pages, spans, and blocks
var doc = await client.ExtractAsync(source, new ExtractOptions
{
    OcrLanguage = "eng",
    PreserveLayout = true
});

// Plain text
var text = await client.ExtractTextAsync(source);

// Markdown
var md = await client.ExtractMarkdownAsync(source);

// Streaming pages
await foreach (var page in client.ExtractStreamAsync(source))
{
    Console.WriteLine($"Page {page.PageIndex}: {page.Blocks.Count} blocks");
}
```

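
Streaming yields pages one at a time instead of materializing the whole document. As a minimal sketch of how a line-delimited JSON stream can be surfaced as `IAsyncEnumerable<T>`, assuming the underlying output is newline-delimited JSON (one object per line): `DemoPage` and `Ndjson.ReadAsync` are hypothetical names, and the sketch uses reflection-based deserialization for brevity, which is not AOT-friendly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Demo input: two NDJSON lines held in memory instead of a process pipe.
var ndjson = "{\"PageIndex\":0,\"Text\":\"first\"}\n{\"PageIndex\":1,\"Text\":\"second\"}\n";

await foreach (var page in Ndjson.ReadAsync<DemoPage>(new StringReader(ndjson)))
{
    Console.WriteLine($"Page {page.PageIndex}: {page.Text}");
}

// Hypothetical page shape used only for this sketch.
public sealed record DemoPage(int PageIndex, string Text);

public static class Ndjson
{
    // Reads one line at a time and yields each parsed object as soon as it is available.
    public static async IAsyncEnumerable<T> ReadAsync<T>(
        TextReader reader,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();

            var line = await reader.ReadLineAsync();
            if (line is null) yield break;                 // end of stream
            if (string.IsNullOrWhiteSpace(line)) continue; // skip blank lines

            var item = JsonSerializer.Deserialize<T>(line);
            if (item is not null) yield return item;
        }
    }
}
```

Because each item is yielded as its line arrives, a consumer can `break` out of the `await foreach` early without waiting for the rest of the stream.
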
### Search

```csharp
await foreach (var match in client.SearchAsync(source, "pattern", new SearchOptions
{
    CaseInsensitive = true,
    Regex = true,
    WholeWord = false,
    MaxResults = 100
}))
{
    Console.WriteLine($"{match.Page}: {match.Text}");
    Console.WriteLine($"  Context: {match.Context.Before}[MATCH]{match.Context.After}");
}
```

### Metadata

```csharp
var metadata = await client.GetMetadataAsync(source);
Console.WriteLine($"Title: {metadata.Title}");
Console.WriteLine($"Author: {metadata.Author}");
Console.WriteLine($"Page Count: {metadata.PageCount}");
Console.WriteLine($"Created: {metadata.Created}");
```

### Hash

```csharp
var fingerprint = await client.HashAsync(source);
Console.WriteLine($"Hash: {fingerprint.Hash}");
Console.WriteLine($"Fast Hash: {fingerprint.FastHash}");
```

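
Once fingerprints have been computed, deduplication reduces to grouping files by hash. The sketch below assumes the hashes were already collected as `(Path, Hash)` pairs, for example from the `Hash` property shown above. `Dedup.FindDuplicates` and the sample values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data standing in for fingerprints gathered from real documents.
var fingerprints = new (string Path, string Hash)[]
{
    ("a.pdf", "h1"),
    ("b.pdf", "h2"),
    ("copy-of-a.pdf", "h1"),
};

foreach (var group in Dedup.FindDuplicates(fingerprints))
{
    Console.WriteLine($"{group.Key}: {string.Join(", ", group.Value)}");
}

public static class Dedup
{
    // Returns only the hashes shared by more than one path.
    public static Dictionary<string, List<string>> FindDuplicates(
        IEnumerable<(string Path, string Hash)> items) =>
        items.GroupBy(i => i.Hash)
             .Where(g => g.Count() > 1)
             .ToDictionary(g => g.Key, g => g.Select(i => i.Path).ToList());
}
```
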
### Classification

```csharp
var classification = await client.ClassifyAsync(source);
Console.WriteLine($"Category: {classification.Category}");
Console.WriteLine($"Confidence: {classification.Confidence}");
Console.WriteLine($"Tags: {string.Join(", ", classification.Tags)}");
```

## Options

### ExtractOptions

| Option | Type | Description |
|--------|------|-------------|
| `Password` | `string?` | Password for encrypted PDFs |
| `OcrLanguage` | `string?` | ISO 639-3 language code for OCR |
| `OcrThreshold` | `double?` | Confidence threshold for OCR (0-1) |
| `PreserveLayout` | `bool?` | Preserve original reading order and layout |
| `ExtractImages` | `bool?` | Extract embedded images |
| `ImageFormat` | `string?` | Format for extracted images (png, jpg, webp) |
| `MinImageSize` | `int?` | Minimum dimension for image extraction |
| `Timeout` | `int?` | Maximum seconds to wait for the operation |

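
The `Timeout` option bounds how long an operation may run. How the SDK enforces it is not shown in this README. As a general sketch of the technique for a subprocess-based wrapper, a `CancellationTokenSource` with `CancelAfter` can be combined with `Process.Kill(entireProcessTree: true)`. The helper `SubprocessDemo.RunWithTimeoutAsync` is hypothetical, and the demo launches `dotnet --version` only because that executable is assumed to be on the `PATH`.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var exitCode = await SubprocessDemo.RunWithTimeoutAsync(
    "dotnet", new[] { "--version" }, TimeSpan.FromSeconds(30));
Console.WriteLine($"Exit code: {exitCode}");

public static class SubprocessDemo
{
    public static async Task<int> RunWithTimeoutAsync(
        string fileName,
        string[] args,
        TimeSpan timeout,
        CancellationToken cancellationToken = default)
    {
        var startInfo = new ProcessStartInfo(fileName)
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
        };
        // Add arguments one by one so each is passed as a separate argv entry.
        foreach (var arg in args) startInfo.ArgumentList.Add(arg);

        // Cancels when either the caller cancels or the timeout elapses.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(timeout);

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException($"Could not start {fileName}");
        try
        {
            // Drain stdout so the child cannot block on a full pipe.
            var stdout = process.StandardOutput.ReadToEndAsync();
            await process.WaitForExitAsync(timeoutCts.Token);
            await stdout;
            return process.ExitCode;
        }
        catch (OperationCanceledException)
        {
            // Terminate the child and anything it spawned, then surface the cancellation.
            if (!process.HasExited) process.Kill(entireProcessTree: true);
            throw;
        }
    }
}
```
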
### SearchOptions

| Option | Type | Description |
|--------|------|-------------|
| `CaseInsensitive` | `bool?` | Ignore case when matching |
| `Regex` | `bool?` | Treat pattern as regular expression |
| `WholeWord` | `bool?` | Match only whole words |
| `MaxResults` | `int?` | Maximum matches to return |

### HashOptions

| Option | Type | Description |
|--------|------|-------------|
| `Password` | `string?` | Password for encrypted PDFs |

## Error Handling

The SDK provides specific exception types for different error conditions:

```csharp
try
{
    var doc = await client.ExtractAsync(source);
}
catch (CorruptPdfException ex)
{
    Console.WriteLine($"PDF is corrupt: {ex.Message}");
}
catch (EncryptionException ex)
{
    Console.WriteLine($"PDF is encrypted: {ex.Message}");
}
catch (SourceUnreachableException ex)
{
    Console.WriteLine($"Cannot read source: {ex.Message}");
}
catch (RemoteFetchInterruptedException ex)
{
    Console.WriteLine($"Network error: {ex.Message}");
}
catch (TlsException ex)
{
    Console.WriteLine($"TLS error: {ex.Message}");
}
catch (ReceiptVerifyException ex)
{
    Console.WriteLine($"Receipt verification failed: {ex.Message}");
}
catch (PdftractException ex)
{
    Console.WriteLine($"pdftract error (exit {ex.ExitCode}): {ex.Message}");
}
```

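
Some of these failures, such as an interrupted remote fetch, may be worth retrying, while others, such as a corrupt PDF, are not. A minimal retry sketch with exponential backoff follows. `Retry.RunAsync`, its parameters, and `DemoTransientException` are hypothetical helpers written for this example and are not part of the SDK; in real code the predicate would presumably test for `RemoteFetchInterruptedException`.

```csharp
using System;
using System.Threading.Tasks;

// Demo: an operation that fails twice with a transient error, then succeeds.
var attempts = 0;
var result = await Retry.RunAsync(
    async () =>
    {
        attempts++;
        await Task.Yield();
        if (attempts < 3) throw new DemoTransientException("connection reset");
        return "ok";
    },
    isTransient: ex => ex is DemoTransientException,
    maxAttempts: 5,
    baseDelay: TimeSpan.FromMilliseconds(10));

Console.WriteLine($"{result} after {attempts} attempts");

// Hypothetical exception standing in for a retryable failure.
public sealed class DemoTransientException : Exception
{
    public DemoTransientException(string message) : base(message) { }
}

public static class Retry
{
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan baseDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            // The filter lets non-transient errors and the final attempt propagate unchanged.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Exponential backoff: baseDelay, 2x, 4x, ...
                await Task.Delay(baseDelay * Math.Pow(2, attempt - 1));
            }
        }
    }
}
```
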
## Conformance

The SDK ships a conformance test suite that verifies compliance with the pdftract contract. See the [conformance documentation](https://github.com/jedarden/pdftract/blob/main/docs/conformance/sdk-contract.md) for details.

## Native AOT

This SDK is designed to work with Native AOT compilation and serializes its models through source-generated JSON. Enable AOT publishing in your project file, and make sure any JSON serialization in your own code is source-generated as well:

```xml
<PropertyGroup>
  <PublishAot>true</PublishAot>
</PropertyGroup>
```

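
For types of your own that you serialize alongside SDK results, source generation looks roughly like the sketch below. `ReportEntry` and `AppJsonContext` are hypothetical names for illustration; the `[JsonSerializable]` attribute and the `JsonSerializerContext` base class are the standard System.Text.Json source-generation mechanism.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

var entry = new ReportEntry("document.pdf", 12);

// Passing the generated type info avoids reflection-based serialization.
var json = JsonSerializer.Serialize(entry, AppJsonContext.Default.ReportEntry);
Console.WriteLine(json);

var roundTripped = JsonSerializer.Deserialize(json, AppJsonContext.Default.ReportEntry);
Console.WriteLine(roundTripped?.PageCount);

// Hypothetical application type, not part of the SDK.
public sealed record ReportEntry(string FileName, int PageCount);

// The source generator fills in this partial class at build time.
[JsonSerializable(typeof(ReportEntry))]
internal partial class AppJsonContext : JsonSerializerContext
{
}
```
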
## License

MIT

## Links

- [pdftract](https://github.com/jedarden/pdftract)
- [Documentation](https://github.com/jedarden/pdftract/tree/main/docs)
- [Conformance](https://github.com/jedarden/pdftract/blob/main/docs/conformance/sdk-contract.md)