pdftract/pdftract-go
jedarden 5781d67d5c fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup
- Add source Source parameter to invoke, invokeJSON, invokeString, invokeStream
- Change BytesSource from []byte type to struct with data and tmpPath fields
- Add proper cleanup of temporary files after subprocess execution
- Fix source parameter pass-through in Extract, ExtractText, ExtractMarkdown, GetMetadata, Hash, Classify

This ensures BytesSource temporary files are removed after use instead of
accumulating on disk. BytesSource now creates its temp file on demand, and the
invoke methods clean it up automatically via defer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 19:08:14 -04:00
examples/basic feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00
conformance_test.go feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00
errors.go feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00
go.mod feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00
LICENSE feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00
pdftract.go fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup 2026-05-20 19:08:14 -04:00
README.md feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00
source.go fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup 2026-05-20 19:08:14 -04:00
stream.go feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00
subprocess.go fix(pdftract-2pyln): add source parameter to invoke methods for BytesSource cleanup 2026-05-20 19:08:14 -04:00
types.go feat(pdftract-2pyln): implement Go SDK 2026-05-20 18:47:45 -04:00

pdftract-go

Go SDK for pdftract — subprocess-based client for extracting structured data from PDFs.

Installation

go get github.com/jedarden/pdftract-go

The SDK requires the pdftract binary to be installed and available in your PATH. See jedarden/pdftract for installation instructions.
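
Because the client depends on an external binary, it can help to check for it before doing any work. The following is a minimal sketch that uses only the standard library; `findBinary` is a hypothetical helper for illustration, not part of the SDK.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findBinary is a hypothetical helper (not part of pdftract-go). It returns
// the path that exec.LookPath resolves for name, or a wrapped error if the
// executable cannot be found on PATH.
func findBinary(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found in PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := findBinary("pdftract")
	if err != nil {
		fmt.Println("install pdftract first:", err)
		return
	}
	fmt.Println("using pdftract at", path)
}
```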

Quick Start

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/jedarden/pdftract-go"
)

func main() {
    // Create a client (searches PATH for pdftract binary)
    client, err := pdftract.NewClient("")
    if err != nil {
        log.Fatal(err)
    }

    // Extract structured data from a PDF
    doc, err := client.Extract(context.Background(), pdftract.FileSource("document.pdf"), nil)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Pages: %d\n", len(doc.Pages))
    fmt.Printf("Title: %s\n", doc.Metadata.Title)
}

Sources

The SDK accepts three types of PDF sources:

// Local file path
client.Extract(ctx, pdftract.FileSource("path/to/file.pdf"), opts)

// Remote URL
client.Extract(ctx, pdftract.RemoteSource("https://example.com/doc.pdf"), opts)

// In-memory bytes
data, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}
client.Extract(ctx, pdftract.BytesSource(data), opts)

API Methods

Extract

Extract structured data from a PDF:

opts := &pdftract.ExtractOptions{
    OCRLanguage:    "eng",
    OCRThreshold:   0.7,
    PreserveLayout: false,
    ExtractImages:  false,
}

doc, err := client.Extract(ctx, pdftract.FileSource("doc.pdf"), opts)

ExtractText

Extract plain text:

text, err := client.ExtractText(ctx, source, opts)

ExtractMarkdown

Extract Markdown-formatted text:

md, err := client.ExtractMarkdown(ctx, source, opts)

ExtractStream

Stream pages one at a time:

resultChan, err := client.ExtractStream(ctx, source, opts)
if err != nil {
    // Ranging over a nil channel would block forever, so stop here
    log.Fatal(err)
}
for result := range resultChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    page := result.Page
    fmt.Printf("Page %d: %d spans\n", page.Page, len(page.Spans))
}

Search

Search for a pattern in a PDF:

opts := &pdftract.SearchOptions{
    CaseInsensitive: true,
    Regex:           false,
    WholeWord:       false,
}

resultChan, err := client.Search(ctx, source, "invoice", opts)
if err != nil {
    // Ranging over a nil channel would block forever, so stop here
    log.Fatal(err)
}
for result := range resultChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    match := result.Match
    fmt.Printf("Match on page %d: %s\n", match.Page, match.Text)
}
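
To make the three boolean search options concrete, here is one way they could combine into a single Go regular expression. This is only an illustration of the semantics using the standard `regexp` package. How the pdftract binary actually matches text is not documented here, and `compilePattern` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// searchOptions mirrors three fields of SearchOptions for illustration only.
type searchOptions struct {
	CaseInsensitive bool
	Regex           bool
	WholeWord       bool
}

// compilePattern is a hypothetical helper that turns a pattern plus options
// into a compiled regular expression.
func compilePattern(pattern string, o searchOptions) (*regexp.Regexp, error) {
	if !o.Regex {
		pattern = regexp.QuoteMeta(pattern) // treat the pattern as a literal
	}
	if o.WholeWord {
		pattern = `\b(?:` + pattern + `)\b` // require word boundaries on both sides
	}
	if o.CaseInsensitive {
		pattern = `(?i)` + pattern
	}
	return regexp.Compile(pattern)
}

func main() {
	re, err := compilePattern("invoice", searchOptions{CaseInsensitive: true, WholeWord: true})
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(re.MatchString("Invoice #42"), re.MatchString("invoices"))
}
```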

GetMetadata

Extract document metadata:

meta, err := client.GetMetadata(ctx, source, nil)
fmt.Printf("Title: %s\n", meta.Title)
fmt.Printf("Author: %s\n", meta.Author)
fmt.Printf("Page count: %d\n", meta.PageCount)

Hash

Compute document fingerprint:

fp, err := client.Hash(ctx, source, nil)
fmt.Printf("SHA-256: %s\n", fp.Hash)
fmt.Printf("BLAKE3: %s\n", fp.FastHash)

Classify

Classify document type:

cls, err := client.Classify(ctx, source)
fmt.Printf("Category: %s\n", cls.Category)
fmt.Printf("Confidence: %.2f\n", cls.Confidence)
fmt.Printf("Tags: %v\n", cls.Tags)

VerifyReceipt

Verify a cryptographic receipt:

valid, err := client.VerifyReceipt(ctx, "document.pdf", receipt)

Context Cancellation

Every method takes a context.Context as its first argument, so any call can be cancelled or given a deadline:

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

doc, err := client.Extract(ctx, source, opts)
if errors.Is(err, context.DeadlineExceeded) {
    // Handle timeout
}

Error Handling

The SDK maps CLI exit codes to specific error types:

doc, err := client.Extract(ctx, source, opts)
if err != nil {
    if corruptErr, ok := pdftract.AsCorruptPdfError(err); ok {
        // Handle corrupt PDF
        fmt.Printf("Corrupt PDF: %s\n", corruptErr.Message)
    } else if encErr, ok := pdftract.AsEncryptionError(err); ok {
        // Handle encrypted PDF
        fmt.Printf("Encrypted PDF: %s\n", encErr.Message)
    } else {
        // Handle other errors
        log.Fatal(err)
    }
}

Available error types:

  • CorruptPdfError — Exit code 2
  • EncryptionError — Exit code 3
  • SourceUnreachableError — Exit code 4
  • RemoteFetchInterruptedError — Exit code 5
  • TlsError — Exit code 6
  • ReceiptVerifyError — Exit code 10

Options

ExtractOptions

type ExtractOptions struct {
    Password       string  // PDF password
    OCRLanguage    string  // OCR language code (default: "eng")
    OCRThreshold   float64 // OCR confidence threshold (default: 0.7)
    PreserveLayout bool    // Preserve original reading order
    ExtractImages  bool    // Extract embedded images
    ImageFormat    string  // Image format: "png", "jpg", "webp"
    MinImageSize   int     // Minimum image dimension in pixels
}
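
The comments above name defaults for two fields. One common way to apply defaults in Go is to treat zero values as "unset", sketched below with a local copy of those two fields. Whether the SDK resolves defaults this way, or leaves them to the binary, is an assumption. `withDefaults` is a hypothetical helper.

```go
package main

import "fmt"

// extractOptions copies two fields of ExtractOptions for a standalone example.
type extractOptions struct {
	OCRLanguage  string
	OCRThreshold float64
}

// withDefaults is a hypothetical helper. It returns a copy of o in which
// zero-valued fields are replaced by the documented defaults. A nil pointer
// yields all defaults.
func withDefaults(o *extractOptions) extractOptions {
	out := extractOptions{OCRLanguage: "eng", OCRThreshold: 0.7}
	if o == nil {
		return out
	}
	if o.OCRLanguage != "" {
		out.OCRLanguage = o.OCRLanguage
	}
	if o.OCRThreshold != 0 {
		out.OCRThreshold = o.OCRThreshold
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", withDefaults(nil))
	fmt.Printf("%+v\n", withDefaults(&extractOptions{OCRLanguage: "deu"}))
}
```

A limitation of this approach is that an explicit threshold of 0 cannot be told apart from an unset one.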

SearchOptions

type SearchOptions struct {
    CaseInsensitive bool // Ignore case when matching
    Regex           bool // Treat pattern as regex
    WholeWord       bool // Match only whole words
    MaxResults      *int // Maximum matches (nil = unlimited)
}
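
Because MaxResults is a pointer, it cannot be set from an integer literal directly. A small helper is a common workaround, shown below against a local copy of the struct. `intPtr` and `limit` are hypothetical helpers, not part of the SDK.

```go
package main

import "fmt"

// searchOptions copies the shape of SearchOptions for a standalone example.
type searchOptions struct {
	CaseInsensitive bool
	Regex           bool
	WholeWord       bool
	MaxResults      *int
}

// intPtr returns a pointer to a copy of n.
func intPtr(n int) *int { return &n }

// limit describes the effective cap. A nil MaxResults means unlimited.
func limit(o searchOptions) string {
	if o.MaxResults == nil {
		return "unlimited"
	}
	return fmt.Sprintf("%d", *o.MaxResults)
}

func main() {
	fmt.Println(limit(searchOptions{}))
	fmt.Println(limit(searchOptions{MaxResults: intPtr(10)}))
}
```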

HashOptions

type HashOptions struct {
    Password string // PDF password
}

Go Version

This module requires Go 1.22 or later.

Conformance

This SDK passes the official pdftract conformance suite. Run tests with:

go test ./...

License

MIT