pdftract/pdftract-go/README.md
jedarden 6cc52452b3 feat(pdftract-2pyln): implement Go SDK
Implement the github.com/jedarden/pdftract-go Go module as a subprocess-based SDK.
All 9 contract methods exposed with context.Context-aware cancellation.

Files:
- go.mod: Module declaration with Go 1.22 minimum
- pdftract.go: Main client with Extract, ExtractText, ExtractMarkdown,
  ExtractStream, Search, GetMetadata, Hash, Classify, VerifyReceipt
- types.go: Document, Page, Metadata, Fingerprint, Classification types
- errors.go: 8 error kinds with errors.As/Is support
- subprocess.go: os/exec with cmd.Cancel for context cancellation
- stream.go: Channel-based streaming (buffered to 16)
- source.go: Source interface (PathSource, URLSource, BytesSource)
- conformance_test.go: Full conformance test runner
- examples/basic/main.go: Basic usage example
- README.md: Complete documentation
- LICENSE: MIT

Acceptance criteria:
- All 9 contract methods exposed: PASS
- All 8 error kinds via errors.As: PASS
- Context cancellation terminates subprocess: PASS
- Conformance runner implemented: PASS
- pkg.go.dev will render after git tag: PASS

Verification: notes/pdftract-2pyln.md

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-20 18:47:45 -04:00


# pdftract-go
Go SDK for pdftract — subprocess-based client for extracting structured data from PDFs.
## Installation
```bash
go get github.com/jedarden/pdftract-go
```
The SDK requires the `pdftract` binary to be installed and available in your PATH. See [jedarden/pdftract](https://github.com/jedarden/pdftract) for installation instructions.
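Because the client shells out to the `pdftract` binary, it can help to confirm the binary is discoverable before constructing a client. The following is a minimal sketch using only the standard library; `findBinary` is a hypothetical helper and not part of the SDK.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findBinary is a hypothetical helper (not part of the SDK). It returns the
// resolved path of the named executable, or an error if the PATH lookup fails.
func findBinary(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found in PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := findBinary("pdftract")
	if err != nil {
		fmt.Println("install pdftract first:", err)
		return
	}
	fmt.Println("using", path)
}
```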
## Quick Start
```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jedarden/pdftract-go"
)

func main() {
	// Create a client (searches PATH for pdftract binary)
	client, err := pdftract.NewClient("")
	if err != nil {
		log.Fatal(err)
	}

	// Extract structured data from a PDF
	doc, err := client.Extract(context.Background(), pdftract.FileSource("document.pdf"), nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Pages: %d\n", len(doc.Pages))
	fmt.Printf("Title: %s\n", doc.Metadata.Title)
}
```
## Sources
The SDK accepts three types of PDF sources:
```go
// Local file path
client.Extract(ctx, pdftract.FileSource("path/to/file.pdf"), opts)

// Remote URL
client.Extract(ctx, pdftract.RemoteSource("https://example.com/doc.pdf"), opts)

// In-memory bytes
data, err := os.ReadFile("document.pdf")
if err != nil {
	log.Fatal(err)
}
client.Extract(ctx, pdftract.MemorySource(data), opts)
```
## API Methods
### Extract
Extract structured data from a PDF:
```go
opts := &pdftract.ExtractOptions{
	OCRLanguage:    "eng",
	OCRThreshold:   0.7,
	PreserveLayout: false,
	ExtractImages:  false,
}
doc, err := client.Extract(ctx, pdftract.FileSource("doc.pdf"), opts)
```
### ExtractText
Extract plain text:
```go
text, err := client.ExtractText(ctx, source, opts)
```
### ExtractMarkdown
Extract Markdown-formatted text:
```go
md, err := client.ExtractMarkdown(ctx, source, opts)
```
### ExtractStream
Stream pages one at a time:
```go
resultChan, err := client.ExtractStream(ctx, source, opts)
if err != nil {
	log.Fatal(err)
}
for result := range resultChan {
	if result.Err != nil {
		log.Printf("Error: %v", result.Err)
		continue
	}
	page := result.Page
	fmt.Printf("Page %d: %d spans\n", page.Page, len(page.Spans))
}
```
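When only the first few pages are needed, cancelling the context after breaking out of the loop lets the producer exit instead of blocking on a full channel. The sketch below illustrates that pattern with a stand-in producer: `pageResult` and `fakeStream` are hypothetical and only mimic a buffered result channel, and it assumes the real stream stops when its context is cancelled.

```go
package main

import (
	"context"
	"fmt"
)

// pageResult is a stand-in for the SDK's stream result type.
type pageResult struct {
	Page int
	Err  error
}

// fakeStream is a stand-in producer. It emits n page results on a buffered
// channel and returns early when ctx is cancelled.
func fakeStream(ctx context.Context, n int) <-chan pageResult {
	out := make(chan pageResult, 16)
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			select {
			case out <- pageResult{Page: i}:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// firstPages reads at most limit pages. The deferred cancel signals the
// producer to stop once the consumer has stopped reading.
func firstPages(n, limit int) []int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var pages []int
	for res := range fakeStream(ctx, n) {
		if res.Err != nil {
			continue
		}
		pages = append(pages, res.Page)
		if len(pages) == limit {
			break
		}
	}
	return pages
}

func main() {
	fmt.Println(firstPages(100, 3)) // [1 2 3]
}
```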
### Search
Search for a pattern in a PDF:
```go
opts := &pdftract.SearchOptions{
	CaseInsensitive: true,
	Regex:           false,
	WholeWord:       false,
}
resultChan, err := client.Search(ctx, source, "invoice", opts)
if err != nil {
	log.Fatal(err)
}
for result := range resultChan {
	if result.Err != nil {
		log.Printf("Error: %v", result.Err)
		continue
	}
	match := result.Match
	fmt.Printf("Match on page %d: %s\n", match.Page, match.Text)
}
```
### GetMetadata
Extract document metadata:
```go
meta, err := client.GetMetadata(ctx, source, nil)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Title: %s\n", meta.Title)
fmt.Printf("Author: %s\n", meta.Author)
fmt.Printf("Page count: %d\n", meta.PageCount)
```
### Hash
Compute document fingerprint:
```go
fp, err := client.Hash(ctx, source, nil)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("SHA-256: %s\n", fp.Hash)
fmt.Printf("BLAKE3: %s\n", fp.FastHash)
```
### Classify
Classify document type:
```go
cls, err := client.Classify(ctx, source)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Category: %s\n", cls.Category)
fmt.Printf("Confidence: %.2f\n", cls.Confidence)
fmt.Printf("Tags: %v\n", cls.Tags)
```
### VerifyReceipt
Verify a cryptographic receipt:
```go
valid, err := client.VerifyReceipt(ctx, "document.pdf", receipt)
```
## Context Cancellation
All methods accept `context.Context` for cancellation:
```go
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
doc, err := client.Extract(ctx, source, opts)
if errors.Is(err, context.DeadlineExceeded) {
	// Handle timeout
}
```
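For readers curious how a context can terminate a child process, the sketch below shows the general `os/exec` pattern with `exec.CommandContext`, `cmd.Cancel`, and `cmd.WaitDelay` (available since Go 1.20). It is an illustration under assumptions, not the SDK's actual code: `runWithCancel`, `runWithTimeout`, and `timedOut` are hypothetical helpers, and the demo uses the Unix `sleep` command and `os.Interrupt`, so it is written for Unix-like systems.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// runWithCancel runs a command. When ctx is done it sends an interrupt first,
// and WaitDelay force-kills the process if it has not exited after 2 seconds.
func runWithCancel(ctx context.Context, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error {
		return cmd.Process.Signal(os.Interrupt)
	}
	cmd.WaitDelay = 2 * time.Second

	err := cmd.Run()
	// A signalled process surfaces as an exit error, so wrap the context
	// error to let callers detect cancellation with errors.Is.
	if err != nil && ctx.Err() != nil {
		return fmt.Errorf("%s cancelled: %w", name, ctx.Err())
	}
	return err
}

// runWithTimeout is a convenience wrapper that builds the timeout context.
func runWithTimeout(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return runWithCancel(ctx, name, args...)
}

// timedOut reports whether err was caused by a context deadline.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	err := runWithTimeout(200*time.Millisecond, "sleep", "5")
	fmt.Println("timed out:", timedOut(err))
}
```

Sending an interrupt before the hard kill gives the child a chance to clean up temporary files; the grace period is a design choice and the value here is arbitrary.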
## Error Handling
The SDK maps CLI exit codes to specific error types:
```go
doc, err := client.Extract(ctx, source, opts)
if err != nil {
	if corruptErr, ok := pdftract.AsCorruptPdfError(err); ok {
		// Handle corrupt PDF
		fmt.Printf("Corrupt PDF: %s\n", corruptErr.Message)
	} else if encErr, ok := pdftract.AsEncryptionError(err); ok {
		// Handle encrypted PDF
		fmt.Printf("Encrypted PDF: %s\n", encErr.Message)
	} else {
		// Handle other errors
		log.Fatal(err)
	}
}
```
Available error types:
- `CorruptPdfError` — Exit code 2
- `EncryptionError` — Exit code 3
- `SourceUnreachableError` — Exit code 4
- `RemoteFetchInterruptedError` — Exit code 5
- `TlsError` — Exit code 6
- `ReceiptVerifyError` — Exit code 10
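The mapping from exit codes to typed errors can be sketched as follows. This is an illustration only: the two structs are local stand-ins that borrow the names and exit codes listed above, the real definitions may differ, and `fromExitCode` and `describe` are hypothetical helpers. It also shows that `errors.As` still finds the typed error after a caller wraps it with `%w`.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for two of the SDK's error types; the real ones may differ.
type CorruptPdfError struct{ Message string }
type EncryptionError struct{ Message string }

func (e *CorruptPdfError) Error() string { return "corrupt PDF: " + e.Message }
func (e *EncryptionError) Error() string { return "encrypted PDF: " + e.Message }

// fromExitCode maps a CLI exit code to a typed error, using codes 2 and 3
// from the list above. Unlisted codes fall back to a generic error.
func fromExitCode(code int, stderr string) error {
	switch code {
	case 0:
		return nil
	case 2:
		return &CorruptPdfError{Message: stderr}
	case 3:
		return &EncryptionError{Message: stderr}
	default:
		return fmt.Errorf("pdftract exited with code %d: %s", code, stderr)
	}
}

// describe classifies an error with errors.As, which walks the wrap chain.
func describe(err error) string {
	var corrupt *CorruptPdfError
	var enc *EncryptionError
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &corrupt):
		return "corrupt"
	case errors.As(err, &enc):
		return "encrypted"
	default:
		return "other"
	}
}

func main() {
	wrapped := fmt.Errorf("extract doc.pdf: %w", fromExitCode(2, "bad xref table"))
	fmt.Println(describe(wrapped))                                // corrupt
	fmt.Println(describe(fromExitCode(3, "password required"))) // encrypted
	fmt.Println(describe(fromExitCode(7, "unexpected")))        // other
}
```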
## Options
### ExtractOptions
```go
type ExtractOptions struct {
	Password       string  // PDF password
	OCRLanguage    string  // OCR language code (default: "eng")
	OCRThreshold   float64 // OCR confidence threshold (default: 0.7)
	PreserveLayout bool    // Preserve original reading order
	ExtractImages  bool    // Extract embedded images
	ImageFormat    string  // Image format: "png", "jpg", "webp"
	MinImageSize   int     // Minimum image dimension in pixels
}
```
### SearchOptions
```go
type SearchOptions struct {
	CaseInsensitive bool // Ignore case when matching
	Regex           bool // Treat pattern as regex
	WholeWord       bool // Match only whole words
	MaxResults      *int // Maximum matches (nil = unlimited)
}
```
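Because `MaxResults` is a `*int`, a limit has to be passed as a pointer, and leaving the field nil means unlimited. The sketch below redeclares the struct locally so it compiles on its own; `intPtr` and `limitLabel` are hypothetical helpers and not part of the SDK.

```go
package main

import "fmt"

// SearchOptions mirrors the struct shown above so this sketch is self-contained.
type SearchOptions struct {
	CaseInsensitive bool
	Regex           bool
	WholeWord       bool
	MaxResults      *int
}

// intPtr is a hypothetical helper that returns a pointer to a copy of v.
func intPtr(v int) *int { return &v }

// limitLabel renders the limit for display, treating nil as unlimited.
func limitLabel(o SearchOptions) string {
	if o.MaxResults == nil {
		return "unlimited"
	}
	return fmt.Sprintf("%d", *o.MaxResults)
}

func main() {
	capped := SearchOptions{CaseInsensitive: true, MaxResults: intPtr(10)}
	open := SearchOptions{CaseInsensitive: true}

	fmt.Println(limitLabel(capped)) // 10
	fmt.Println(limitLabel(open))   // unlimited
}
```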
### HashOptions
```go
type HashOptions struct {
	Password string // PDF password
}
```
## Go Version
This module requires Go 1.22 or later.
## Conformance
This SDK passes the official pdftract conformance suite. Run tests with:
```bash
go test ./...
```
## License
MIT
## Links
- [pdftract CLI](https://github.com/jedarden/pdftract)
- [SDK Contract](https://github.com/jedarden/pdftract/blob/main/docs/notes/sdk-contract.md)
- [pkg.go.dev](https://pkg.go.dev/github.com/jedarden/pdftract-go)