
# pdftract-go

Go SDK for pdftract: a subprocess-based client for extracting structured data from PDFs.

## Installation

```bash
go get github.com/jedarden/pdftract-go
```

The SDK requires the `pdftract` binary to be installed and available in your PATH. See [jedarden/pdftract](https://github.com/jedarden/pdftract) for installation instructions.
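
Because the client depends on an external binary, it can help to confirm the binary is reachable before constructing a client. The following is a minimal sketch using only the standard library; `locateBinary` is a hypothetical helper and not part of the pdftract-go API.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// locateBinary reports where name resolves on PATH, or an error if it
// cannot be found. Hypothetical helper, not part of pdftract-go.
func locateBinary(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found on PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := locateBinary("pdftract")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using", path)
}
```

A check like this gives a clearer failure message at startup than discovering the missing binary on the first extraction call.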

## Quick Start

```go
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/jedarden/pdftract-go"
)

func main() {
    // Create a client (searches PATH for pdftract binary)
    client, err := pdftract.NewClient("")
    if err != nil {
        log.Fatal(err)
    }

    // Extract structured data from a PDF
    doc, err := client.Extract(context.Background(), pdftract.FileSource("document.pdf"), nil)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Pages: %d\n", len(doc.Pages))
    fmt.Printf("Title: %s\n", doc.Metadata.Title)
}
```

## Sources

The SDK accepts three types of PDF sources:

```go
// Local file path
client.Extract(ctx, pdftract.FileSource("path/to/file.pdf"), opts)

// Remote URL
client.Extract(ctx, pdftract.RemoteSource("https://example.com/doc.pdf"), opts)

// In-memory bytes
data, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}
client.Extract(ctx, pdftract.MemorySource(data), opts)
```

## API Methods

### Extract

Extract structured data from a PDF:

```go
opts := &pdftract.ExtractOptions{
    OCRLanguage:    "eng",
    OCRThreshold:   0.7,
    PreserveLayout: false,
    ExtractImages:  false,
}

doc, err := client.Extract(ctx, pdftract.FileSource("doc.pdf"), opts)
```

### ExtractText

Extract plain text:

```go
text, err := client.ExtractText(ctx, source, opts)
```

### ExtractMarkdown

Extract Markdown-formatted text:

```go
md, err := client.ExtractMarkdown(ctx, source, opts)
```

### ExtractStream

Stream pages one at a time:

```go
resultChan, err := client.ExtractStream(ctx, source, opts)
if err != nil {
    log.Fatal(err)
}
for result := range resultChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    page := result.Page
    fmt.Printf("Page %d: %d spans\n", page.Page, len(page.Spans))
}
```

### Search

Search for a pattern in a PDF:

```go
opts := &pdftract.SearchOptions{
    CaseInsensitive: true,
    Regex:           false,
    WholeWord:       false,
}

resultChan, err := client.Search(ctx, source, "invoice", opts)
if err != nil {
    log.Fatal(err)
}
for result := range resultChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    match := result.Match
    fmt.Printf("Match on page %d: %s\n", match.Page, match.Text)
}
```

### GetMetadata

Extract document metadata:

```go
meta, err := client.GetMetadata(ctx, source, nil)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Title: %s\n", meta.Title)
fmt.Printf("Author: %s\n", meta.Author)
fmt.Printf("Page count: %d\n", meta.PageCount)
```

### Hash

Compute document fingerprint:

```go
fp, err := client.Hash(ctx, source, nil)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("SHA-256: %s\n", fp.Hash)
fmt.Printf("BLAKE3: %s\n", fp.FastHash)
```

### Classify

Classify document type:

```go
cls, err := client.Classify(ctx, source)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Category: %s\n", cls.Category)
fmt.Printf("Confidence: %.2f\n", cls.Confidence)
fmt.Printf("Tags: %v\n", cls.Tags)
```

### VerifyReceipt

Verify a cryptographic receipt:

```go
valid, err := client.VerifyReceipt(ctx, "document.pdf", receipt)
```

## Context Cancellation

Every method takes a `context.Context` as its first argument, so any call can be cancelled or given a deadline:

```go
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

doc, err := client.Extract(ctx, source, opts)
if errors.Is(err, context.DeadlineExceeded) {
    // Handle timeout
}
```
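
Cancellation matters most for the channel-based methods (`ExtractStream`, `Search`) when the consumer stops reading early. The sketch below shows the general pattern with a stand-in producer instead of the real SDK: `pageResult`, `streamPages`, and `firstPages` are all hypothetical names, and whether pdftract-go's own goroutines stop in the same way is an assumption.

```go
package main

import (
	"context"
	"fmt"
)

// pageResult is a local stand-in for a streamed result; it is not the SDK type.
type pageResult struct {
	Page int
	Err  error
}

// streamPages is a hypothetical producer. It emits pages 1..total and
// returns as soon as ctx is cancelled, closing the channel on exit.
func streamPages(ctx context.Context, total int) <-chan pageResult {
	out := make(chan pageResult)
	go func() {
		defer close(out)
		for i := 1; i <= total; i++ {
			select {
			case out <- pageResult{Page: i}:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// firstPages reads up to limit pages (limit must be positive) and then
// stops. The deferred cancel lets the producer exit instead of blocking
// forever on a send that nobody will receive.
func firstPages(total, limit int) []int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var pages []int
	for res := range streamPages(ctx, total) {
		if res.Err != nil {
			continue
		}
		pages = append(pages, res.Page)
		if len(pages) >= limit {
			break
		}
	}
	return pages
}

func main() {
	fmt.Println(firstPages(100, 3)) // prints [1 2 3]
}
```

The same shape applies when breaking out of a `for result := range resultChan` loop: derive a cancellable context, pass it to the call, and `defer cancel()` so that leaving the loop early releases whatever is producing the results.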

## Error Handling

The SDK maps CLI exit codes to specific error types:

```go
doc, err := client.Extract(ctx, source, opts)
if err != nil {
    if corruptErr, ok := pdftract.AsCorruptPdfError(err); ok {
        // Handle corrupt PDF
        fmt.Printf("Corrupt PDF: %s\n", corruptErr.Message)
    } else if encErr, ok := pdftract.AsEncryptionError(err); ok {
        // Handle encrypted PDF
        fmt.Printf("Encrypted PDF: %s\n", encErr.Message)
    } else {
        // Handle other errors
        log.Fatal(err)
    }
}
```

Available error types:
- `CorruptPdfError` — Exit code 2
- `EncryptionError` — Exit code 3
- `SourceUnreachableError` — Exit code 4
- `RemoteFetchInterruptedError` — Exit code 5
- `TlsError` — Exit code 6
- `ReceiptVerifyError` — Exit code 10
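
To illustrate how an exit-code mapping like the one above can work, here is a self-contained sketch. The two error types are local stand-ins named after the documented ones, and `errorForExitCode` and `describe` are hypothetical helpers; the SDK's real types and internals may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins named after two of the documented error types.
// The real SDK types may carry additional fields.
type CorruptPdfError struct{ Message string }
type EncryptionError struct{ Message string }

func (e *CorruptPdfError) Error() string { return "corrupt PDF: " + e.Message }
func (e *EncryptionError) Error() string { return "encrypted PDF: " + e.Message }

// errorForExitCode turns a CLI exit code into a typed error, using the
// codes listed above (2 = corrupt, 3 = encrypted). Other non-zero codes
// fall back to a generic error in this sketch.
func errorForExitCode(code int, stderr string) error {
	switch code {
	case 0:
		return nil
	case 2:
		return &CorruptPdfError{Message: stderr}
	case 3:
		return &EncryptionError{Message: stderr}
	default:
		return fmt.Errorf("pdftract exited with code %d: %s", code, stderr)
	}
}

// describe classifies an error with errors.As, which also matches
// typed errors that have been wrapped with %w.
func describe(err error) string {
	var corrupt *CorruptPdfError
	var enc *EncryptionError
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &corrupt):
		return "corrupt"
	case errors.As(err, &enc):
		return "encrypted"
	default:
		return "other"
	}
}

func main() {
	wrapped := fmt.Errorf("extract failed: %w", errorForExitCode(2, "bad xref table"))
	fmt.Println(describe(wrapped))                 // prints corrupt
	fmt.Println(describe(errorForExitCode(3, ""))) // prints encrypted
	fmt.Println(describe(errorForExitCode(4, ""))) // prints other
}
```

Matching on types instead of comparing error strings keeps callers working even when the underlying CLI changes its messages.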

## Options

### ExtractOptions

```go
type ExtractOptions struct {
    Password       string  // PDF password
    OCRLanguage    string  // OCR language code (default: "eng")
    OCRThreshold   float64 // OCR confidence threshold (default: 0.7)
    PreserveLayout bool    // Preserve original reading order
    ExtractImages  bool    // Extract embedded images
    ImageFormat    string  // Image format: "png", "jpg", "webp"
    MinImageSize   int     // Minimum image dimension in pixels
}
```

### SearchOptions

```go
type SearchOptions struct {
    CaseInsensitive bool // Ignore case when matching
    Regex           bool // Treat pattern as regex
    WholeWord       bool // Match only whole words
    MaxResults      *int // Maximum matches (nil = unlimited)
}
```
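
Because `MaxResults` is a `*int`, a literal cannot be assigned to it directly. One common approach is a small generic helper, sketched below. `ptr` and `limitLabel` are hypothetical and not part of the SDK, and `searchOptions` is a local copy of the documented fields so that the example compiles on its own.

```go
package main

import "fmt"

// searchOptions mirrors the documented SearchOptions fields locally.
type searchOptions struct {
	CaseInsensitive bool
	Regex           bool
	WholeWord       bool
	MaxResults      *int
}

// ptr returns a pointer to a copy of v. Hypothetical helper.
func ptr[T any](v T) *T { return &v }

// limitLabel shows how the nil case (unlimited) differs from a set limit.
func limitLabel(o searchOptions) string {
	if o.MaxResults == nil {
		return "unlimited"
	}
	return fmt.Sprintf("at most %d", *o.MaxResults)
}

func main() {
	fmt.Println(limitLabel(searchOptions{}))                    // prints unlimited
	fmt.Println(limitLabel(searchOptions{MaxResults: ptr(10)})) // prints at most 10
}
```

With the real type, the equivalent would be setting `MaxResults: ptr(10)` on a `pdftract.SearchOptions` value.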

### HashOptions

```go
type HashOptions struct {
    Password string // PDF password
}
```

## Go Version

This module requires Go 1.22 or later.

## Conformance

This SDK passes the official pdftract conformance suite. Run tests with:

```bash
go test ./...
```

## License

MIT

## Links

- [pdftract CLI](https://github.com/jedarden/pdftract)
- [SDK Contract](https://github.com/jedarden/pdftract/blob/main/docs/notes/sdk-contract.md)
- [pkg.go.dev](https://pkg.go.dev/github.com/jedarden/pdftract-go)