pdftract/pdftract-swift
jedarden d0f52751ce fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs
The indent trigger was using .abs() which fired on both increased indent
(non-indented → indented) AND decreased indent (indented → non-indented).
This caused drop-cap style paragraphs (indented first line, flush-left
continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, indent change should only trigger when the
current line is MORE indented (to the right, larger x0) than the block
average - i.e., a new paragraph starting after non-indented text. It should
NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
- test_indented_first_line_new_block: PASS (non-indented → indented splits)
- test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
- All 179 line module tests: PASS
2026-06-07 13:43:19 -04:00

pdftract-swift

Swift SDK for pdftract - PDF extraction and analysis for server-side Swift.

Platform Support

Supported: macOS 13+, Linux (server-side use only)

Unsupported: iOS (Apple does not allow spawning subprocesses in App Store apps)

Note for iOS users: Use pdftract serve over HTTP from your iOS client. Run the server with the Swift SDK on a macOS/Linux backend and make HTTP requests from your iOS app.
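As a rough sketch of the client side of that setup, an iOS app could build a plain HTTP upload request with Foundation. The route name, port, and `application/pdf` body format below are placeholders, not the documented `pdftract serve` interface; check the server's own documentation for the real endpoints.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

/// Builds a POST request that uploads raw PDF bytes to a backend.
/// `route` is supplied by the caller because the actual `pdftract serve`
/// routes are not described in this README (assumption).
func makeUploadRequest(baseURL: URL, route: String, pdf: Data) -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent(route))
    request.httpMethod = "POST"
    request.setValue("application/pdf", forHTTPHeaderField: "Content-Type")
    request.httpBody = pdf
    return request
}

// Hypothetical host, port, and route, for illustration only.
let backend = URL(string: "http://localhost:8080")!
let pdfBytes = Data([0x25, 0x50, 0x44, 0x46])  // "%PDF" header bytes as a stand-in payload
let request = makeUploadRequest(baseURL: backend, route: "extract", pdf: pdfBytes)
print(request.httpMethod ?? "", request.url?.path ?? "")
```

The resulting request can be sent with `URLSession` from the app; the subprocess itself only ever runs on the macOS/Linux backend.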

Installation

Add to your Package.swift:

dependencies: [
    .package(url: "https://github.com/jedarden/pdftract-swift", from: "1.0.0")
]
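
For context, a complete manifest using this dependency might look like the sketch below. The target name `MyServer` is hypothetical, and the product name `Pdftract` is assumed from the `import Pdftract` statement used in the examples; adjust both if the package declares something different.

```swift
// swift-tools-version:5.9
import PackageDescription

// "MyServer" is a hypothetical package/target name.
// The product name "Pdftract" is an assumption based on `import Pdftract`.
let package = Package(
    name: "MyServer",
    platforms: [.macOS(.v13)],  // matches the macOS 13+ requirement above
    dependencies: [
        .package(url: "https://github.com/jedarden/pdftract-swift", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyServer",
            dependencies: [
                .product(name: "Pdftract", package: "pdftract-swift")
            ]
        )
    ]
)
```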

Usage

Basic extract

import Pdftract

let client = Pdftract()
let doc = try await client.extract(.path("document.pdf"))
print("Pages: \(doc.pages.count)")
print("Title: \(doc.metadata.title ?? "Untitled")")

Extract from URL

let doc = try await client.extract(.url(URL(string: "https://example.com/doc.pdf")!))

Extract with OCR

let options = ExtractOptions(
    ocrLanguage: "eng",
    ocrThreshold: 0.7
)
let doc = try await client.extract(.path("scanned.pdf"), options: options)

Extract text

let text = try await client.extractText(.path("document.pdf"))
print(text)

Extract Markdown

let md = try await client.extractMarkdown(.path("document.pdf"))

Stream extraction (for large PDFs)

for await page in client.extractStream(.path("large.pdf")) {
    print("Page \(page.pageIndex + 1): \(page.blocks.count) blocks")
}

Search

for await match in client.search(.path("document.pdf"), "invoice") {
    print("Found on page \(match.page): \(match.text)")
    print("  Context: ...\(match.context.before)[\(match.text)]\(match.context.after)...")
}

Get metadata

let metadata = try await client.getMetadata(.path("document.pdf"))
print("Pages: \(metadata.pageCount)")
print("Author: \(metadata.author ?? "Unknown")")

Hash fingerprint

let fingerprint = try await client.hash(.path("document.pdf"))
print("SHA-256: \(fingerprint.hash)")
print("BLAKE3: \(fingerprint.fastHash)")

Classify document

let classification = try await client.classify(.path("document.pdf"))
print("Category: \(classification.category)")
print("Confidence: \(classification.confidence)")

Verify receipt

let receipt = Receipt(data: "...")
let valid = try await client.verifyReceipt("/path/to/receipt.pdf", receipt: receipt)
print("Valid: \(valid)")

Binary version compatibility

This SDK requires pdftract 1.0.0. Download from: https://github.com/jedarden/pdftract/releases/tag/v1.0.0

The SDK will search for pdftract on your PATH. To specify a custom binary path:

let client = Pdftract(binaryPath: "/custom/path/to/pdftract")

Error handling

All methods are async throws and can throw the following errors:

Error                        Exit Code  Description
CorruptPdfError              2          The PDF file is corrupt or invalid
EncryptionError              3          The PDF is encrypted and password is missing/wrong
SourceUnreachableError       4          The source (file or URL) is unreadable
RemoteFetchInterruptedError  5          Network interrupted during remote fetch
TlsError                     6          TLS certificate validation failed
ReceiptVerifyError           10         Receipt verification failed
PdftractError                other      Internal error

Example:

do {
    let doc = try await client.extract(.path("document.pdf"))
} catch let error as PdftractError {
    print("Error (code \(error.exitCode)): \(error.localizedDescription)")
}
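
To branch on a specific failure, the exit code can be mapped back to the error names listed above. The enum below is a self-contained stand-in written for illustration; `PdftractFailure` is a hypothetical name and not a type exported by the SDK.

```swift
/// Stand-in enum (hypothetical, not part of the SDK) that mirrors the
/// documented exit codes and the error names they correspond to.
enum PdftractFailure: String {
    case corruptPdf = "CorruptPdfError"
    case encryption = "EncryptionError"
    case sourceUnreachable = "SourceUnreachableError"
    case remoteFetchInterrupted = "RemoteFetchInterruptedError"
    case tls = "TlsError"
    case receiptVerify = "ReceiptVerifyError"
    case other = "PdftractError"

    /// Maps a process exit code to its documented error category.
    init(exitCode: Int32) {
        switch exitCode {
        case 2: self = .corruptPdf
        case 3: self = .encryption
        case 4: self = .sourceUnreachable
        case 5: self = .remoteFetchInterrupted
        case 6: self = .tls
        case 10: self = .receiptVerify
        default: self = .other  // any other code is reported as an internal error
        }
    }
}

let sampleCodes: [Int32] = [2, 5, 42]
for code in sampleCodes {
    print("exit code \(code) -> \(PdftractFailure(exitCode: code).rawValue)")
}
```

In real code, `error.exitCode` from the `catch` block above could be passed to such a mapping to decide, for example, whether to prompt for a password or report a corrupt upload.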

Options

ExtractOptions

let options = ExtractOptions(
    ocrLanguage: "eng",           // ISO 639-3 language code
    ocrThreshold: 0.7,            // OCR confidence threshold (0-1)
    preserveLayout: false,        // Preserve original reading order
    extractImages: false,         // Extract embedded images
    imageFormat: "png",           // Format for images: png, jpg, webp
    minImageSize: 64              // Minimum image dimension
)

SearchOptions

let options = SearchOptions(
    caseInsensitive: true,        // Ignore case
    regex: false,                 // Treat pattern as regex
    wholeWord: false,             // Match whole words only
    maxResults: 100               // Maximum matches
)

BaseOptions / HashOptions

let options = BaseOptions(
    timeout: 60                   // Maximum seconds
)

Troubleshooting

Binary not found

Ensure pdftract is on your PATH. The SDK searches PATH for the executable.

# Verify pdftract is available
pdftract --version
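
The same check can be made from Swift. The sketch below shows a generic PATH lookup with Foundation; it is not the SDK's own lookup code, and `findExecutable(named:inPath:)` is a helper written for this example.

```swift
import Foundation

/// Returns the first executable called `name` in a colon-separated,
/// PATH-style list of directories, or nil if none is found.
/// Illustrative helper only; the SDK's internal search may differ.
func findExecutable(named name: String, inPath path: String) -> String? {
    for directory in path.split(separator: ":") {
        let candidate = URL(fileURLWithPath: String(directory))
            .appendingPathComponent(name)
            .path
        if FileManager.default.isExecutableFile(atPath: candidate) {
            return candidate
        }
    }
    return nil
}

let searchPath = ProcessInfo.processInfo.environment["PATH"] ?? ""
if let found = findExecutable(named: "pdftract", inPath: searchPath) {
    print("pdftract found at \(found)")
} else {
    print("pdftract is not on PATH")
}
```

If the binary lives outside PATH, the resolved location can be passed to `Pdftract(binaryPath:)` instead.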

Version mismatch

The SDK refuses to invoke a pdftract binary whose version does not match the version it requires (1.0.0 for this release). Install the matching version from the releases page.
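
A pre-flight check along these lines can catch a mismatch before deployment. This is a sketch under two assumptions: that `pdftract --version` prints a banner containing a `MAJOR.MINOR.PATCH` token (the exact format is not documented here), and that "mismatch" means anything other than an exact match. Both helper functions are hypothetical.

```swift
import Foundation

/// Extracts the first MAJOR.MINOR.PATCH token from a version banner.
/// The banner format is an assumption; a leading "v" is tolerated.
func parseVersion(from banner: String) -> [Int]? {
    for token in banner.split(whereSeparator: { $0.isWhitespace }) {
        var text = token
        if text.hasPrefix("v") { text = text.dropFirst() }
        let parts = text.split(separator: ".", omittingEmptySubsequences: false)
            .map { Int($0) }
        if parts.count == 3, !parts.contains(where: { $0 == nil }) {
            return parts.compactMap { $0 }
        }
    }
    return nil
}

/// Exact-match comparison (assumed policy; the SDK's real rule may be looser).
func isCompatible(banner: String, required: String) -> Bool {
    guard let found = parseVersion(from: banner),
          let wanted = parseVersion(from: required) else { return false }
    return found == wanted
}

print(isCompatible(banner: "pdftract 1.0.0", required: "1.0.0"))
print(isCompatible(banner: "pdftract 1.1.0", required: "1.0.0"))
```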

Network failure

For remote URLs, check your network connection and TLS certificate chain.

Conformance

This SDK passes 100% of the pdftract conformance suite. The conformance report for this release is linked in the GitHub Release.

License

MIT License - see LICENSE file for details.