# pdftract-swift
Swift SDK for pdftract - PDF extraction and analysis for server-side Swift.
## Platform Support
**Supported**: macOS 13+, Linux (server-side use only)
**Unsupported**: iOS (iOS apps cannot spawn subprocesses, which this SDK relies on to run the `pdftract` binary)
> **Note for iOS users**: Use `pdftract serve` over HTTP from your iOS client. Run the server with the Swift SDK on a macOS/Linux backend and make HTTP requests from your iOS app.
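As a rough illustration of the HTTP approach, an iOS client could build its upload request along these lines. This is a minimal sketch only: the function name `makeUploadRequest`, the endpoint URL, and the choice of sending raw PDF bytes as `application/pdf` are all assumptions made for illustration, not the documented `pdftract serve` contract.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLRequest lives in this module on Linux
#endif

/// Builds a POST request that uploads raw PDF bytes to a backend endpoint.
/// Assumption: both the endpoint and the "application/pdf" body format are
/// placeholders; check your server for the routes and payloads it accepts.
func makeUploadRequest(endpoint: URL, pdfData: Data) -> URLRequest {
    var request = URLRequest(url: endpoint)
    request.httpMethod = "POST"
    request.setValue("application/pdf", forHTTPHeaderField: "Content-Type")
    request.httpBody = pdfData
    return request
}

// Hypothetical backend address and route, for illustration only.
let endpoint = URL(string: "https://backend.example.com/extract")!
let request = makeUploadRequest(endpoint: endpoint, pdfData: Data("%PDF-1.7".utf8))
print(request.httpMethod ?? "", request.url?.absoluteString ?? "")
```

The request can then be sent with `URLSession.shared.data(for:)` from an async context in the app.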
## Installation
Add to your `Package.swift`:
```swift
dependencies: [
    .package(url: "https://github.com/jedarden/pdftract-swift", from: "1.0.0")
]
```
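For context, a complete manifest might look like the sketch below. The package and target name `MyServer` is hypothetical, and the product name `Pdftract` is an assumption inferred from the `import Pdftract` statement used in this README; adjust both to match your project and the SDK's actual manifest.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyServer", // hypothetical package name
    platforms: [.macOS(.v13)], // matches the macOS 13+ requirement
    dependencies: [
        .package(url: "https://github.com/jedarden/pdftract-swift", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyServer", // hypothetical target name
            dependencies: [
                // Assumption: the library product is called "Pdftract".
                .product(name: "Pdftract", package: "pdftract-swift")
            ]
        )
    ]
)
```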
## Usage
### Basic extract
```swift
import Pdftract
let client = Pdftract()
let doc = try await client.extract(.path("document.pdf"))
print("Pages: \(doc.pages.count)")
print("Title: \(doc.metadata.title ?? "Untitled")")
```
### Extract from URL
```swift
let doc = try await client.extract(.url(URL(string: "https://example.com/doc.pdf")!))
```
### Extract with OCR
```swift
let options = ExtractOptions(
    ocrLanguage: "eng",
    ocrThreshold: 0.7
)
let doc = try await client.extract(.path("scanned.pdf"), options: options)
```
### Extract text
```swift
let text = try await client.extractText(.path("document.pdf"))
print(text)
```
### Extract Markdown
```swift
let md = try await client.extractMarkdown(.path("document.pdf"))
```
### Stream extraction (for large PDFs)
```swift
for await page in client.extractStream(.path("large.pdf")) {
    print("Page \(page.pageIndex + 1): \(page.blocks.count) blocks")
}
```
### Search
```swift
for await match in client.search(.path("document.pdf"), "invoice") {
    print("Found on page \(match.page): \(match.text)")
    print("  Context: ...\(match.context.before)[\(match.text)]\(match.context.after)...")
}
```
### Get metadata
```swift
let metadata = try await client.getMetadata(.path("document.pdf"))
print("Pages: \(metadata.pageCount)")
print("Author: \(metadata.author ?? "Unknown")")
```
### Hash fingerprint
```swift
let fingerprint = try await client.hash(.path("document.pdf"))
print("SHA-256: \(fingerprint.hash)")
print("BLAKE3: \(fingerprint.fastHash)")
```
### Classify document
```swift
let classification = try await client.classify(.path("document.pdf"))
print("Category: \(classification.category)")
print("Confidence: \(classification.confidence)")
```
### Verify receipt
```swift
let receipt = Receipt(data: "...")
let valid = try await client.verifyReceipt("/path/to/receipt.pdf", receipt: receipt)
print("Valid: \(valid)")
```
## Binary version compatibility
This SDK requires pdftract 1.0.0. Download it from the [v1.0.0 release page](https://github.com/jedarden/pdftract/releases/tag/v1.0.0).

By default, the SDK searches for `pdftract` on your PATH. To use a binary at a custom path:
```swift
let client = Pdftract(binaryPath: "/custom/path/to/pdftract")
```
## Error handling
Methods marked `async throws` can throw the following errors. Each error corresponds to an exit code of the `pdftract` binary:

| Error | Exit Code | Description |
|-------|-----------|-------------|
| `CorruptPdfError` | 2 | The PDF file is corrupt or invalid |
| `EncryptionError` | 3 | The PDF is encrypted and the password is missing or wrong |
| `SourceUnreachableError` | 4 | The source (file or URL) cannot be read |
| `RemoteFetchInterruptedError` | 5 | The network connection was interrupted during a remote fetch |
| `TlsError` | 6 | TLS certificate validation failed |
| `ReceiptVerifyError` | 10 | Receipt verification failed |
| `PdftractError` | other | Internal error |

Example:
```swift
do {
    let doc = try await client.extract(.path("document.pdf"))
} catch let error as PdftractError {
    print("Error (code \(error.exitCode)): \(error.localizedDescription)")
}
```
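To make the table concrete, the exit-code mapping can be sketched as a plain Swift enum. The type `ExitCodeCategory` below is hypothetical and is not one of the SDK's error types; it only mirrors the table. Treating exit code 0 as success is also an assumption, based on the usual process convention.

```swift
/// Hypothetical mirror of the error table; not part of the SDK.
enum ExitCodeCategory: Equatable {
    case corruptPdf
    case encryption
    case sourceUnreachable
    case remoteFetchInterrupted
    case tls
    case receiptVerify
    case other(Int32) // any code the table does not list

    /// Returns nil for exit code 0 (assumed to mean success).
    init?(exitCode: Int32) {
        switch exitCode {
        case 0: return nil
        case 2: self = .corruptPdf
        case 3: self = .encryption
        case 4: self = .sourceUnreachable
        case 5: self = .remoteFetchInterrupted
        case 6: self = .tls
        case 10: self = .receiptVerify
        default: self = .other(exitCode)
        }
    }
}

// Example: classify an exit code reported by a failed run.
if let category = ExitCodeCategory(exitCode: 3) {
    print("Failure category: \(category)")
}
```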
## Options
### ExtractOptions
```swift
let options = ExtractOptions(
    ocrLanguage: "eng",      // ISO 639-3 language code
    ocrThreshold: 0.7,       // OCR confidence threshold (0-1)
    preserveLayout: false,   // Preserve original reading order
    extractImages: false,    // Extract embedded images
    imageFormat: "png",      // Format for images: png, jpg, webp
    minImageSize: 64         // Minimum image dimension
)
```
### SearchOptions
```swift
let options = SearchOptions(
    caseInsensitive: true,   // Ignore case
    regex: false,            // Treat pattern as regex
    wholeWord: false,        // Match whole words only
    maxResults: 100          // Maximum matches
)
```
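One way to picture how these four flags interact is to express them with Foundation's `NSRegularExpression`. The sketch below is an illustration of the semantics the option names suggest, not the matcher pdftract actually uses; `makeSearchRegex` and `findMatches` are hypothetical helpers.

```swift
import Foundation

/// Builds a regular expression from SearchOptions-style flags (hypothetical helper).
func makeSearchRegex(pattern: String, caseInsensitive: Bool, regex: Bool, wholeWord: Bool) throws -> NSRegularExpression {
    // When `regex` is false, escape the pattern so it matches literally.
    var source = regex ? pattern : NSRegularExpression.escapedPattern(for: pattern)
    // Whole-word matching wraps the pattern in word-boundary assertions.
    if wholeWord {
        source = "\\b(?:\(source))\\b"
    }
    let options: NSRegularExpression.Options = caseInsensitive ? [.caseInsensitive] : []
    return try NSRegularExpression(pattern: source, options: options)
}

/// Returns at most `maxResults` matched substrings, in document order.
func findMatches(in text: String, using regex: NSRegularExpression, maxResults: Int) -> [String] {
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    let results = regex.matches(in: text, options: [], range: fullRange)
    return results.prefix(maxResults).compactMap { result in
        guard let range = Range(result.range, in: text) else { return nil }
        return String(text[range])
    }
}
```

With `caseInsensitive: true` and `wholeWord: true`, the pattern `invoice` would match `Invoice` and `INVOICE` but not `invoices` under this interpretation.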
### BaseOptions / HashOptions
```swift
let options = BaseOptions(
    timeout: 60   // Maximum seconds
)
```
## Troubleshooting
### Binary not found
Ensure `pdftract` is on your PATH, or pass `binaryPath:` when creating the client. Without an explicit path, the SDK searches PATH for the executable.
```bash
# Verify pdftract is available
pdftract --version
```
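The same check can be done from Swift when diagnosing a deployment. The sketch below shows a conventional PATH lookup; `findExecutable` is a hypothetical helper and may not match how the SDK resolves the binary internally.

```swift
import Foundation

/// Returns the first executable file called `name` in a colon-separated
/// PATH-style string, or nil if none is found (hypothetical helper).
func findExecutable(named name: String, inPath path: String) -> String? {
    let fileManager = FileManager.default
    for directory in path.split(separator: ":") {
        let candidate = URL(fileURLWithPath: String(directory))
            .appendingPathComponent(name).path
        var isDirectory: ObjCBool = false
        // Skip directories that happen to share the name.
        if fileManager.fileExists(atPath: candidate, isDirectory: &isDirectory),
           !isDirectory.boolValue,
           fileManager.isExecutableFile(atPath: candidate) {
            return candidate
        }
    }
    return nil
}

let searchPath = ProcessInfo.processInfo.environment["PATH"] ?? ""
if let found = findExecutable(named: "pdftract", inPath: searchPath) {
    print("pdftract found at \(found)")
} else {
    print("pdftract is not on PATH; install it or pass binaryPath: explicitly")
}
```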
### Version mismatch
The SDK refuses to invoke a `pdftract` binary whose version does not match the one it requires (1.0.0). Install the matching version from the releases page.
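A version check of this kind can be sketched as follows. Both the exact-match rule and the assumed shape of the `pdftract --version` output (some text containing a `major.minor.patch` triple) are assumptions for illustration; the SDK's actual comparison may differ.

```swift
import Foundation

/// Extracts the first "major.minor.patch" triple found in `text`.
func firstSemanticVersion(in text: String) -> String? {
    guard let range = text.range(of: #"\d+\.\d+\.\d+"#, options: .regularExpression) else {
        return nil
    }
    return String(text[range])
}

/// Version this README says the SDK requires.
let requiredVersion = "1.0.0"

/// Returns true when the reported version equals the required one exactly
/// (assumed rule; hypothetical helper).
func isCompatible(versionOutput: String) -> Bool {
    return firstSemanticVersion(in: versionOutput) == requiredVersion
}

// Hypothetical output string; the real format is not specified here.
print(isCompatible(versionOutput: "pdftract 1.0.0"))
```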
### Network failure
For remote URLs, check your network connection and TLS certificate chain.
## Conformance
This SDK passes 100% of the [pdftract conformance suite](https://github.com/jedarden/pdftract/tree/main/tests/sdk-conformance). The conformance report for this release is linked in the GitHub Release.
## License
MIT License - see LICENSE file for details.