pdftract/pdftract-dotnet
jedarden 0932cf1fdc feat(sdks): vendor dotnet/java/node SDKs into the monorepo
Consolidate the .NET, Java, and Node SDKs into root-level pdftract-<lang>/
directories (matching the already-tracked pdftract-go/), per the decision to
make the generated SDKs first-class monorepo members rather than separate repos.
Content imported from the standalone ~/pdftract-<lang> repos (build artifacts
excluded). Removes the broken empty-git nested clones that were polluting the
working tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:20:19 -04:00
notes/
src/Pdftract/
tests/Pdftract.Tests/
.gitignore
LICENSE
Pdftract.csproj
Pdftract.sln
README.md

Pdftract .NET SDK

The .NET SDK for pdftract: a subprocess wrapper around the pdftract binary for PDF text extraction, OCR, search, and metadata.

Installation

dotnet add package Pdftract

Quick Start

using Pdftract;
using Pdftract.Models;

var client = new Pdftract();

// Extract structured data
var doc = await client.ExtractAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Pages: {doc.Pages.Count}");

// Extract plain text
var text = await client.ExtractTextAsync(Source.FromPath("document.pdf"));

// Extract markdown
var md = await client.ExtractMarkdownAsync(Source.FromPath("document.pdf"));

// Get metadata
var metadata = await client.GetMetadataAsync(Source.FromPath("document.pdf"));
Console.WriteLine($"Title: {metadata.Title}");

Features

  • Extract: Structured data, plain text, or markdown from PDFs
  • Search: Full-text search with regex and whole-word options
  • Metadata: Extract document metadata (title, author, page count, etc.)
  • Hash: Compute document fingerprints for deduplication
  • Classify: Automatic document classification
  • OCR: Built-in OCR support for scanned documents
  • Async-first: All methods return Task<T> or IAsyncEnumerable<T>
  • AOT-compatible: Works with Native AOT compilation

Supported Platforms

  • .NET 9.0 (recommended)
  • .NET 8.0

.NET Framework 4.x is not supported.

API Reference

Source Types

// Each factory returns a Source; any of them can be passed wherever the examples use `source`.

// From file path
var fromPath = Source.FromPath("document.pdf");

// From URL
var fromUrl = Source.FromUrl("https://example.com/document.pdf");

// From bytes
var data = await File.ReadAllBytesAsync("document.pdf");
var fromBytes = Source.FromBytes(data);

Extraction Methods

// Structured data with pages, spans, and blocks
var doc = await client.ExtractAsync(source, new ExtractOptions
{
    OcrLanguage = "eng",
    PreserveLayout = true
});

// Plain text
var text = await client.ExtractTextAsync(source);

// Markdown
var md = await client.ExtractMarkdownAsync(source);

// Streaming pages
await foreach (var page in client.ExtractStreamAsync(source))
{
    Console.WriteLine($"Page {page.PageIndex}: {page.Blocks.Count} blocks");
}

Search

await foreach (var match in client.SearchAsync(source, "pattern", new SearchOptions
{
    CaseInsensitive = true,
    Regex = true,
    WholeWord = false,
    MaxResults = 100
}))
{
    Console.WriteLine($"{match.Page}: {match.Text}");
    Console.WriteLine($"  Context: {match.Context.Before}[MATCH]{match.Context.After}");
}
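
Because the streaming methods return IAsyncEnumerable<T>, a caller can stop consuming early with break, or attach a CancellationToken through the standard WithCancellation extension. The sketch below illustrates the pattern with a hypothetical stand-in iterator (FakeStream.PagesAsync) rather than the SDK itself. Whether the SDK's own streams observe the token is an assumption to verify.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Give up after 30 seconds, or after three pages, whichever comes first.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var seen = 0;

await foreach (var pageIndex in FakeStream.PagesAsync(10).WithCancellation(cts.Token))
{
    Console.WriteLine($"Page {pageIndex}");
    if (++seen == 3)
    {
        break; // leaving the loop disposes the enumerator; later pages are never produced
    }
}

Console.WriteLine($"Consumed {seen} pages");

// Hypothetical stand-in for client.ExtractStreamAsync(source); not part of the SDK.
static class FakeStream
{
    public static async IAsyncEnumerable<int> PagesAsync(
        int pageCount,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        for (var i = 0; i < pageCount; i++)
        {
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Yield(); // simulate asynchronous work per page
            yield return i;
        }
    }
}
```

With a real client, the same loop shape would apply to ExtractStreamAsync or SearchAsync in place of the stand-in.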

Metadata

var metadata = await client.GetMetadataAsync(source);
Console.WriteLine($"Title: {metadata.Title}");
Console.WriteLine($"Author: {metadata.Author}");
Console.WriteLine($"Page Count: {metadata.PageCount}");
Console.WriteLine($"Created: {metadata.Created}");

Hash

var fingerprint = await client.HashAsync(source);
Console.WriteLine($"Hash: {fingerprint.Hash}");
Console.WriteLine($"Fast Hash: {fingerprint.FastHash}");
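
One way to use fingerprints for deduplication is to group files by their hash and treat any group with more than one member as duplicates. The following is a minimal sketch, not part of the SDK: Deduplicator and GroupByFingerprintAsync are hypothetical names, the hash function is injected as a delegate so the example runs without the pdftract binary, and it assumes the fingerprint's Hash can be used as a string key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Demo hash: the file name length, so "a.pdf" and "b.pdf" land in the same group.
// With the SDK this could be something like (assuming Hash is a string):
//   async path => (await client.HashAsync(Source.FromPath(path))).Hash
Func<string, Task<string>> fakeHash = path => Task.FromResult(path.Length.ToString());

var groups = await Deduplicator.GroupByFingerprintAsync(
    new[] { "a.pdf", "b.pdf", "report.pdf" }, fakeHash);

foreach (var (hash, paths) in groups.Where(g => g.Value.Count > 1))
{
    Console.WriteLine($"{hash}: {string.Join(", ", paths)}");
}

// Hypothetical helper; not part of the SDK.
static class Deduplicator
{
    public static async Task<Dictionary<string, List<string>>> GroupByFingerprintAsync(
        IEnumerable<string> paths,
        Func<string, Task<string>> hashAsync)
    {
        var byHash = new Dictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var path in paths)
        {
            var hash = await hashAsync(path);
            if (!byHash.TryGetValue(hash, out var bucket))
            {
                bucket = new List<string>();
                byHash[hash] = bucket;
            }
            bucket.Add(path);
        }
        return byHash;
    }
}
```

Hashing sequentially keeps the sketch simple. Since each call presumably launches a subprocess, bounded parallelism may be worth considering for large batches.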

Classification

var classification = await client.ClassifyAsync(source);
Console.WriteLine($"Category: {classification.Category}");
Console.WriteLine($"Confidence: {classification.Confidence}");
Console.WriteLine($"Tags: {string.Join(", ", classification.Tags)}");

Options

ExtractOptions

Option          Type     Description
Password        string?  Password for encrypted PDFs
OcrLanguage     string?  ISO 639-3 language code for OCR
OcrThreshold    double?  Confidence threshold for OCR (0-1)
PreserveLayout  bool?    Preserve original reading order and layout
ExtractImages   bool?    Extract embedded images
ImageFormat     string?  Format for extracted images (png, jpg, webp)
MinImageSize    int?     Minimum dimension for image extraction
Timeout         int?     Maximum seconds to wait for the operation

SearchOptions

Option           Type   Description
CaseInsensitive  bool?  Ignore case when matching
Regex            bool?  Treat pattern as regular expression
WholeWord        bool?  Match only whole words
MaxResults       int?   Maximum matches to return

HashOptions

Option    Type     Description
Password  string?  Password for encrypted PDFs

Error Handling

The SDK provides specific exception types for different error conditions:

try
{
    var doc = await client.ExtractAsync(source);
}
catch (CorruptPdfException ex)
{
    Console.WriteLine($"PDF is corrupt: {ex.Message}");
}
catch (EncryptionException ex)
{
    Console.WriteLine($"PDF is encrypted: {ex.Message}");
}
catch (SourceUnreachableException ex)
{
    Console.WriteLine($"Cannot read source: {ex.Message}");
}
catch (RemoteFetchInterruptedException ex)
{
    Console.WriteLine($"Network error: {ex.Message}");
}
catch (TlsException ex)
{
    Console.WriteLine($"TLS error: {ex.Message}");
}
catch (ReceiptVerifyException ex)
{
    Console.WriteLine($"Receipt verification failed: {ex.Message}");
}
catch (PdftractException ex)
{
    Console.WriteLine($"pdftract error (exit {ex.ExitCode}): {ex.Message}");
}
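
Some of these failures, such as an interrupted remote fetch, may be transient and worth retrying, while a corrupt or encrypted PDF is not. The sketch below shows a generic retry-with-backoff helper. Retry.WithBackoffAsync is a hypothetical name, not an SDK API, and the demo uses TimeoutException as a stand-in so it runs without the SDK. Treating RemoteFetchInterruptedException as retryable is an assumption.

```csharp
using System;
using System.Threading.Tasks;

var attempts = 0;

// Demo operation: fails twice with a simulated transient error, then succeeds.
var result = await Retry.WithBackoffAsync<string, TimeoutException>(
    () =>
    {
        attempts++;
        if (attempts < 3)
        {
            throw new TimeoutException("simulated network drop");
        }
        return Task.FromResult("ok");
    },
    maxAttempts: 5,
    initialDelay: TimeSpan.FromMilliseconds(10));

Console.WriteLine($"{result} after {attempts} attempts");

// Hypothetical helper; not part of the SDK.
static class Retry
{
    public static async Task<T> WithBackoffAsync<T, TException>(
        Func<Task<T>> operation, int maxAttempts, TimeSpan initialDelay)
        where TException : Exception
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (TException) when (attempt < maxAttempts)
            {
                // Only TException is retried; on the last attempt the filter is false
                // and the exception propagates to the caller.
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // exponential backoff
            }
        }
    }
}
```

With the SDK, the call might look like Retry.WithBackoffAsync<Document, RemoteFetchInterruptedException>(() => client.ExtractAsync(source), 3, TimeSpan.FromSeconds(1)), where Document stands in for whatever type ExtractAsync returns.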

Conformance

The SDK ships a conformance test suite that verifies compliance with the pdftract contract. See the conformance documentation for details.

Native AOT

This SDK is designed to work with Native AOT compilation. Enable AOT publishing in your project file, and make sure any JSON serialization in your own code is source-generated rather than reflection-based:

<PropertyGroup>
  <PublishAot>true</PublishAot>
</PropertyGroup>
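
For application code that serializes its own types, source-generated serialization with System.Text.Json means declaring a JsonSerializerContext and passing its generated type info to the serializer. The sketch below uses a hypothetical application record (ExtractionSummary) and context (AppJsonContext). Neither is part of the SDK, and how the SDK serializes its own models internally is not shown here.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

var summary = new ExtractionSummary("document.pdf", 12);

// Passing the generated JsonTypeInfo avoids the reflection-based code path,
// which is not reliably available under trimming and Native AOT.
var json = JsonSerializer.Serialize(summary, AppJsonContext.Default.ExtractionSummary);
Console.WriteLine(json);

var roundTripped = JsonSerializer.Deserialize(json, AppJsonContext.Default.ExtractionSummary);
Console.WriteLine(roundTripped?.PageCount);

// Hypothetical application type; not part of the SDK.
public record ExtractionSummary(string FileName, int PageCount);

// The source generator fills in this partial class at compile time.
[JsonSerializable(typeof(ExtractionSummary))]
internal partial class AppJsonContext : JsonSerializerContext
{
}
```

Every type serialized at runtime needs its own JsonSerializable attribute on the context. Types that are not listed have no generated metadata.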

License

MIT