# pdftract-swift

Swift SDK for pdftract - PDF extraction and analysis for server-side Swift.

## Platform Support

**Supported**: macOS 13+, Linux (server-side use only)
**Unsupported**: iOS (Apple does not allow spawning subprocesses in App Store apps)

> **Note for iOS users**: Use `pdftract serve` over HTTP from your iOS client. Run the server with the Swift SDK on a macOS/Linux backend and make HTTP requests from your iOS app.

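For the iOS route described in the note above, the client side needs nothing beyond Foundation. The sketch below only builds the request and does not send it. The `/extract` path, the `url` query parameter, and the helper name `makeExtractRequest` are placeholders invented for illustration; they are not the documented `pdftract serve` API, so check the routes your server actually exposes.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

/// Builds a GET request asking a backend to extract a remote PDF.
/// NOTE: the "/extract" path and the "url" query item are hypothetical.
func makeExtractRequest(server: URL, pdfURL: URL) -> URLRequest? {
    guard var components = URLComponents(url: server, resolvingAgainstBaseURL: false) else {
        return nil
    }
    components.path = "/extract"
    components.queryItems = [URLQueryItem(name: "url", value: pdfURL.absoluteString)]
    guard let endpoint = components.url else { return nil }

    var request = URLRequest(url: endpoint)
    request.httpMethod = "GET"
    request.setValue("application/json", forHTTPHeaderField: "Accept")
    request.timeoutInterval = 60  // seconds
    return request
}
```

On iOS the resulting request would typically be passed to `URLSession.shared.data(for:)`.
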
## Installation

Add to your `Package.swift`:

```swift
dependencies: [
    .package(url: "https://github.com/jedarden/pdftract-swift", from: "1.0.0")
]
```

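For context, a complete manifest might look like the sketch below. The package and target name `MyServer` are hypothetical, and the product name `Pdftract` is an assumption inferred from the `import Pdftract` statement used in this README; adjust it if the package exports a different product.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyServer",  // hypothetical package name
    platforms: [.macOS(.v13)],  // matches the macOS 13+ requirement
    dependencies: [
        .package(url: "https://github.com/jedarden/pdftract-swift", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyServer",
            dependencies: [
                // Product name assumed from `import Pdftract`.
                .product(name: "Pdftract", package: "pdftract-swift")
            ]
        )
    ]
)
```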
## Usage

### Basic extract

```swift
import Pdftract

let client = Pdftract()
let doc = try await client.extract(.path("document.pdf"))
print("Pages: \(doc.pages.count)")
print("Title: \(doc.metadata.title ?? "Untitled")")
```

### Extract from URL

```swift
let doc = try await client.extract(.url(URL(string: "https://example.com/doc.pdf")!))
```

### Extract with OCR

```swift
let options = ExtractOptions(
    ocrLanguage: "eng",
    ocrThreshold: 0.7
)
let doc = try await client.extract(.path("scanned.pdf"), options: options)
```

### Extract text

```swift
let text = try await client.extractText(.path("document.pdf"))
print(text)
```

### Extract Markdown

```swift
let md = try await client.extractMarkdown(.path("document.pdf"))
```

### Stream extraction (for large PDFs)

```swift
for await page in client.extractStream(.path("large.pdf")) {
    print("Page \(page.pageIndex + 1): \(page.blocks.count) blocks")
}
```

### Search

```swift
for await match in client.search(.path("document.pdf"), "invoice") {
    print("Found on page \(match.page): \(match.text)")
    print(" Context: ...\(match.context.before)[\(match.text)]\(match.context.after)...")
}
```

### Get metadata

```swift
let metadata = try await client.getMetadata(.path("document.pdf"))
print("Pages: \(metadata.pageCount)")
print("Author: \(metadata.author ?? "Unknown")")
```

### Hash fingerprint

```swift
let fingerprint = try await client.hash(.path("document.pdf"))
print("SHA-256: \(fingerprint.hash)")
print("BLAKE3: \(fingerprint.fastHash)")
```

### Classify document

```swift
let classification = try await client.classify(.path("document.pdf"))
print("Category: \(classification.category)")
print("Confidence: \(classification.confidence)")
```

### Verify receipt

```swift
let receipt = Receipt(data: "...")
let valid = try await client.verifyReceipt("/path/to/receipt.pdf", receipt: receipt)
print("Valid: \(valid)")
```

## Binary version compatibility

This SDK requires pdftract 1.0.0. Download from:
https://github.com/jedarden/pdftract/releases/tag/v1.0.0

The SDK will search for `pdftract` on your PATH. To specify a custom binary path:

```swift
let client = Pdftract(binaryPath: "/custom/path/to/pdftract")
```

## Error handling

All methods are `async throws` and can throw the following errors:

| Error | Exit Code | Description |
|-------|-----------|-------------|
| `CorruptPdfError` | 2 | The PDF file is corrupt or invalid |
| `EncryptionError` | 3 | The PDF is encrypted and password is missing/wrong |
| `SourceUnreachableError` | 4 | The source (file or URL) is unreadable |
| `RemoteFetchInterruptedError` | 5 | Network interrupted during remote fetch |
| `TlsError` | 6 | TLS certificate validation failed |
| `ReceiptVerifyError` | 10 | Receipt verification failed |
| `PdftractError` | other | Internal error |

Example:

```swift
do {
    let doc = try await client.extract(.path("document.pdf"))
} catch let error as PdftractError {
    print("Error (code \(error.exitCode)): \(error.localizedDescription)")
}
```

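If you want to branch on the exit code rather than on the error type, the table above can be mirrored in a small enum. This is a self-contained sketch: `ExitCategory` and `isRetryable` are hypothetical names, not part of the SDK, and treating only code 5 as retryable is a policy assumption rather than documented behavior.

```swift
/// Illustrative stand-in for the exit-code table; not the SDK's own type.
enum ExitCategory: Equatable {
    case corruptPdf, encryption, sourceUnreachable
    case remoteFetchInterrupted, tls, receiptVerify
    case other(Int32)

    /// Returns nil for exit code 0, which is assumed to mean success.
    init?(exitCode: Int32) {
        switch exitCode {
        case 0: return nil
        case 2: self = .corruptPdf
        case 3: self = .encryption
        case 4: self = .sourceUnreachable
        case 5: self = .remoteFetchInterrupted
        case 6: self = .tls
        case 10: self = .receiptVerify
        default: self = .other(exitCode)
        }
    }

    /// Assumption: only an interrupted network fetch is worth retrying as-is.
    var isRetryable: Bool { self == .remoteFetchInterrupted }
}
```

Inside the `catch` block above, `ExitCategory(exitCode: error.exitCode)` would then give a value to `switch` over, provided `exitCode` is an `Int32`.
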
## Options

### ExtractOptions

```swift
let options = ExtractOptions(
    ocrLanguage: "eng",     // ISO 639-3 language code
    ocrThreshold: 0.7,      // OCR confidence threshold (0-1)
    preserveLayout: false,  // Preserve original reading order
    extractImages: false,   // Extract embedded images
    imageFormat: "png",     // Format for images: png, jpg, webp
    minImageSize: 64        // Minimum image dimension
)
```

### SearchOptions

```swift
let options = SearchOptions(
    caseInsensitive: true,  // Ignore case
    regex: false,           // Treat pattern as regex
    wholeWord: false,       // Match whole words only
    maxResults: 100         // Maximum matches
)
```

### BaseOptions / HashOptions

```swift
let options = BaseOptions(
    timeout: 60  // Maximum seconds
)
```

## Troubleshooting

### Binary not found

Ensure `pdftract` is on your PATH. The SDK searches PATH for the executable.

```bash
# Verify pdftract is available
pdftract --version
```

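To check from Swift what a PATH lookup would find, a conventional search can be sketched as follows. This illustrates the general technique only; the SDK's own lookup may differ, and `findExecutable(named:inPath:)` is a hypothetical helper, not an SDK function.

```swift
import Foundation

/// Returns the first regular executable file called `name` found in a
/// colon-separated PATH-style string, or nil if there is none.
func findExecutable(named name: String, inPath path: String) -> String? {
    let fm = FileManager.default
    for directory in path.split(separator: ":") {
        let candidate = URL(fileURLWithPath: String(directory))
            .appendingPathComponent(name).path
        var isDirectory: ObjCBool = false
        // Skip directories: they can carry the executable bit too.
        if fm.fileExists(atPath: candidate, isDirectory: &isDirectory),
           !isDirectory.boolValue,
           fm.isExecutableFile(atPath: candidate) {
            return candidate
        }
    }
    return nil
}

let searchPath = ProcessInfo.processInfo.environment["PATH"] ?? ""
if let found = findExecutable(named: "pdftract", inPath: searchPath) {
    print("pdftract found at \(found)")
} else {
    print("pdftract is not on PATH")
}
```

If the lookup fails in a server process but works in your shell, the two are probably running with different PATH values; passing `binaryPath:` explicitly avoids the dependency on the environment.
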
### Version mismatch

The SDK will refuse to invoke mismatched binary versions. Install the correct version from the releases page.

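A pre-flight check along these lines can surface a mismatch before the SDK does. This sketch assumes `pdftract --version` prints something like `pdftract 1.0.0` and that compatibility means an exact match with 1.0.0; both are assumptions, and `parseVersion` and `isCompatible` are hypothetical helpers rather than SDK functions.

```swift
/// Parses "1.0.0", "v1.0.0" or "pdftract 1.0.0" into [1, 0, 0].
/// Returns nil if the last whitespace-separated token is not MAJOR.MINOR.PATCH.
func parseVersion(_ text: String) -> [Int]? {
    guard let token = text.split(whereSeparator: { $0.isWhitespace }).last else {
        return nil
    }
    let digits = token.hasPrefix("v") ? token.dropFirst() : token
    let fields = digits.split(separator: ".", omittingEmptySubsequences: false)
    let numbers = fields.compactMap { Int($0) }
    guard fields.count == 3, numbers.count == 3,
          numbers.allSatisfy({ $0 >= 0 }) else {
        return nil
    }
    return numbers
}

/// Exact-match comparison; the assumed output format is described above.
func isCompatible(reported: String, required: String = "1.0.0") -> Bool {
    guard let have = parseVersion(reported),
          let want = parseVersion(required) else {
        return false
    }
    return have == want
}
```
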
### Network failure

For remote URLs, check your network connection and TLS certificate chain.

## Conformance

This SDK passes 100% of the [pdftract conformance suite](https://github.com/jedarden/pdftract/tree/main/tests/sdk-conformance). The conformance report for this release is linked in the GitHub Release.

## License

MIT License - see LICENSE file for details.