# pdftract Java SDK
Java/Kotlin SDK for [pdftract](https://github.com/jedarden/pdftract) — PDF extraction and analysis library.
## Features
- **9 contract methods**: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- **AutoCloseable client**: Use with try-with-resources for automatic cleanup
- **8 typed exceptions**: CorruptPdfException, EncryptionException, SourceUnreachableException, etc.
- **Kotlin extensions**: Idiomatic Kotlin syntax in the same artifact
- **Java 17+**: Modern Java with records and pattern matching
## Installation
Add to your `pom.xml`:
```xml
<dependency>
    <groupId>com.jedarden</groupId>
    <artifactId>pdftract</artifactId>
    <version>0.1.0</version>
</dependency>
```
Or for Gradle:
```groovy
implementation 'com.jedarden:pdftract:0.1.0'
```
## Requirements
- Java 17 or higher
- The `pdftract` binary, either on your `PATH` or at a custom path passed to the `Pdftract` constructor
- Download the binary from [GitHub Releases](https://github.com/jedarden/pdftract/releases)
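
To check the `PATH` requirement before creating a client, a lookup along the following lines may help. This is a minimal sketch using only the JDK; `BinaryLocator` and `find` are hypothetical names, not part of the SDK, and on Windows the file name would presumably need an `.exe` suffix.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of the SDK: looks for an executable on a PATH-style string.
final class BinaryLocator {
    private BinaryLocator() {}

    /** Returns the first executable regular file called `name` in the given PATH value. */
    static Optional<Path> find(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate;
            try {
                candidate = Path.of(dir, name);
            } catch (InvalidPathException e) {
                continue; // skip malformed PATH entries
            }
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> binary = find("pdftract", System.getenv("PATH"));
        System.out.println(binary
            .map(p -> "Found: " + p)
            .orElse("pdftract not found on PATH; pass an explicit path to the client instead"));
    }
}
```

If the lookup comes back empty, the custom binary path constructor shown under "Custom binary path" is the fallback.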
## Java Usage
### Basic extraction
```java
import com.jedarden.pdftract.*;
import com.jedarden.pdftract.codegen.*;
import java.nio.file.Path;
try (Pdftract client = new Pdftract()) {
    // Extract structured data
    Document doc = client.extract(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println("Pages: " + doc.pages().size());
    System.out.println("Title: " + doc.metadata().title());

    // Access pages, blocks, and spans
    for (Page page : doc.pages()) {
        System.out.println("Page " + page.pageIndex() + ": " + page.width() + "x" + page.height());
        for (Block block : page.blocks()) {
            System.out.println(" " + block.kind() + ": " + block.text());
        }
    }
}
```
### Extract plain text
```java
try (Pdftract client = new Pdftract()) {
    String text = client.extractText(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println(text);
}
```
### Extract Markdown
```java
try (Pdftract client = new Pdftract()) {
    String markdown = client.extractMarkdown(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println(markdown);
}
```
### OCR options
```java
try (Pdftract client = new Pdftract()) {
    ExtractOptions options = new ExtractOptions()
        .setOcrLanguage("eng")
        .setOcrThreshold(0.7);
    Document doc = client.extract(Source.fromPath("scanned.pdf"), options);
}
```
### Password-protected PDFs
```java
try (Pdftract client = new Pdftract()) {
    BaseOptions options = new BaseOptions()
        .setPassword("secret");
    Document doc = client.extract(Source.fromPath("protected.pdf"), options);
}
```
### Stream pages (for large PDFs)
```java
try (Pdftract client = new Pdftract()) {
    client.extractStream(Source.fromPath("large.pdf"), null)
        .forEach(page -> {
            System.out.println("Page " + page.pageIndex());
            // Process each page as it arrives
        });
}
```
### Search for text
```java
try (Pdftract client = new Pdftract()) {
    SearchOptions options = new SearchOptions()
        .setMaxResults(100)
        .setWholeWord(true);
    client.search(Source.fromPath("document.pdf"), "invoice", options)
        .forEach(match -> {
            System.out.println("Found at page " + match.page() + ": " + match.text());
        });
}
```
### Get metadata
```java
try (Pdftract client = new Pdftract()) {
    Metadata metadata = client.getMetadata(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println("Pages: " + metadata.pageCount());
    System.out.println("Title: " + metadata.title());
    System.out.println("Author: " + metadata.author());
}
```
### Compute fingerprint
```java
try (Pdftract client = new Pdftract()) {
    Fingerprint fp = client.hash(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println("SHA-256: " + fp.hash());
    System.out.println("Fast hash: " + fp.fastHash());
}
```
### Classify document
```java
try (Pdftract client = new Pdftract()) {
    Classification cls = client.classify(
        Source.fromPath("unknown.pdf")
    );
    System.out.println("Category: " + cls.category());
    System.out.println("Confidence: " + cls.confidence());
}
```
### Verify receipt
```java
try (Pdftract client = new Pdftract()) {
    Receipt receipt = new Receipt(
        "abc123def456", // fingerprint
        "sig789xyz012"  // signature
    );
    boolean valid = client.verifyReceipt(
        Path.of("receipt.pdf"),
        receipt
    );
    System.out.println("Valid: " + valid);
}
```
### URL sources
```java
try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(
        Source.fromUrl("https://example.com/document.pdf"),
        null
    );
}
```
### Byte sources
```java
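import java.nio.file.Files;
import java.nio.file.Path;

// Read the whole PDF into memory, then hand the bytes to the client
byte[] pdfBytes = Files.readAllBytes(Path.of("document.pdf"));
try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(
        Source.fromBytes(pdfBytes),
        null
    );
}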
```
### Custom binary path
```java
try (Pdftract client = new Pdftract("/path/to/pdftract")) {
    Document doc = client.extract(Source.fromPath("doc.pdf"), null);
}
```
## Kotlin Usage
The Kotlin extensions provide idiomatic syntax with lambda-based options:
```kotlin
import com.jedarden.pdftract.*
import com.jedarden.pdftract.codegen.*
import java.nio.file.Path

// Use with invoke operator (use-with-resources pattern)
pdftract {
    val doc = extract(Path.of("document.pdf")) {
        ocrLanguage = "eng"
        ocrThreshold = 0.7
    }
    println("Pages: ${doc.pages.size}")
}

// Or call use explicitly (Kotlin's equivalent of try-with-resources)
Pdftract().use { client ->
    val doc = client.extract(Path.of("document.pdf"))
    println(doc.metadata.title)
}

// Extract text
Pdftract().use { client ->
    val text = client.extractText(Path.of("document.pdf")) {
        ocrLanguage = "eng"
    }
    println(text)
}

// Search with options
Pdftract().use { client ->
    client.search(Path.of("document.pdf"), "invoice") {
        maxResults = 100
        wholeWord = true
    }.forEach { match ->
        println("Found at page ${match.page}: ${match.text}")
    }
}

// Stream pages (converts to Sequence)
Pdftract().use { client ->
    client.extractStream(Path.of("large.pdf")) {
        ocrLanguage = "eng"
    }.forEach { page ->
        println("Page ${page.pageIndex}")
    }
}
```
## Exception Handling
All methods throw `PdftractException` or its subclasses:
```java
try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(Source.fromPath("doc.pdf"), null);
} catch (CorruptPdfException e) {
    System.err.println("PDF is corrupt: " + e.getMessage());
} catch (EncryptionException e) {
    System.err.println("PDF is encrypted: " + e.getMessage());
} catch (SourceUnreachableException e) {
    System.err.println("Cannot read source: " + e.getMessage());
} catch (TlsException e) {
    System.err.println("TLS error: " + e.getMessage());
} catch (PdftractException e) {
    System.err.println("Error (exit code " + e.getExitCode() + "): " + e.getMessage());
}
```
Exception types:
- `PdftractException` — Base exception
- `CorruptPdfException` — PDF is corrupt (exit code 2)
- `EncryptionException` — PDF is encrypted (exit code 3)
- `SourceUnreachableException` — Cannot read source (exit code 4)
- `RemoteFetchInterruptedException` — Network interrupted (exit code 5)
- `TlsException` — TLS certificate error (exit code 6)
- `ReceiptVerifyException` — Receipt verification failed (exit code 10)
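
For logging or for wrapping the CLI directly, the exit codes listed above can be kept in a small lookup table. The sketch below is illustrative only: `ExitCodes` and `exceptionNameFor` are hypothetical names, and falling back to the base `PdftractException` for unlisted codes is an assumption, not documented SDK behavior.

```java
import java.util.Map;

// Hypothetical lookup table, not part of the SDK: mirrors the exit codes listed above.
final class ExitCodes {
    private static final Map<Integer, String> NAMES = Map.of(
        2, "CorruptPdfException",
        3, "EncryptionException",
        4, "SourceUnreachableException",
        5, "RemoteFetchInterruptedException",
        6, "TlsException",
        10, "ReceiptVerifyException"
    );

    private ExitCodes() {}

    /** Returns the exception name for an exit code; unlisted codes map to the base type (assumption). */
    static String exceptionNameFor(int exitCode) {
        return NAMES.getOrDefault(exitCode, "PdftractException");
    }

    public static void main(String[] args) {
        for (int code : new int[] {2, 6, 10, 42}) {
            System.out.println(code + " -> " + exceptionNameFor(code));
        }
    }
}
```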
## Data Types
### Source
Sealed interface for PDF input sources:
- `Source.fromPath(Path)` — Local file path
- `Source.fromUrl(String)` — Remote URL
- `Source.fromBytes(byte[])` — Raw bytes
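
To show what a sealed input type with these three factories can look like, here is a self-contained sketch. It is not the SDK's source: `SourceSketch`, its nested record names, and `describe` are all hypothetical, and the real `Source` may be structured differently. It uses `instanceof` patterns so it compiles on Java 17 without preview features.

```java
import java.nio.file.Path;

// Illustrative stand-in for the SDK's Source type; the real variant names may differ.
sealed interface SourceSketch
        permits SourceSketch.FromPath, SourceSketch.FromUrl, SourceSketch.FromBytes {

    record FromPath(Path path) implements SourceSketch {}
    record FromUrl(String url) implements SourceSketch {}
    record FromBytes(byte[] bytes) implements SourceSketch {}

    static SourceSketch fromPath(Path path) { return new FromPath(path); }
    static SourceSketch fromUrl(String url) { return new FromUrl(url); }
    static SourceSketch fromBytes(byte[] bytes) { return new FromBytes(bytes); }

    /** Dispatches on the concrete variant using instanceof pattern matching. */
    static String describe(SourceSketch source) {
        if (source instanceof FromPath p) {
            return "file " + p.path();
        }
        if (source instanceof FromUrl u) {
            return "url " + u.url();
        }
        if (source instanceof FromBytes b) {
            return b.bytes().length + " bytes in memory";
        }
        // Only reachable for null, since the interface is sealed to the three records above.
        throw new IllegalStateException("unexpected source: " + source);
    }
}
```

Because the interface is sealed, the compiler rejects any implementation outside the three permitted records, which is what lets a client treat the set of input kinds as closed.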
### Document
```java
public record Document(
    String schemaVersion,
    DocumentMetadata metadata,
    List<Page> pages,
    List errors
)
```
### Page
```java
public record Page(
    int pageIndex,
    double width,
    double height,
    int rotation,
    String pageType, // "vector" or "scanned"
    List spans,
    List<Block> blocks
)
```
### Block
```java
public record Block(
    String kind, // "paragraph", "heading", "table", "figure", "list"
    List bbox,   // [x1, y1, x2, y2]
    List lines
)
```
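
The `[x1, y1, x2, y2]` layout of `bbox` makes block dimensions a matter of subtraction. The helper below is a sketch under two assumptions that this README does not confirm: that the list elements are `Double`, and that `x2 >= x1` and `y2 >= y1`. `BBoxMath` is a hypothetical name, not an SDK class.

```java
import java.util.List;

// Hypothetical helper, not part of the SDK: derives sizes from an [x1, y1, x2, y2] box.
final class BBoxMath {
    private BBoxMath() {}

    private static void requireFour(List<Double> bbox) {
        if (bbox == null || bbox.size() != 4) {
            throw new IllegalArgumentException("bbox must have exactly 4 elements: [x1, y1, x2, y2]");
        }
    }

    /** Horizontal extent: x2 - x1. */
    static double width(List<Double> bbox) {
        requireFour(bbox);
        return bbox.get(2) - bbox.get(0);
    }

    /** Vertical extent: y2 - y1. */
    static double height(List<Double> bbox) {
        requireFour(bbox);
        return bbox.get(3) - bbox.get(1);
    }

    /** Area covered by the box, in squared page units. */
    static double area(List<Double> bbox) {
        return width(bbox) * height(bbox);
    }

    public static void main(String[] args) {
        List<Double> bbox = List.of(10.0, 20.0, 110.0, 70.0);
        System.out.println(width(bbox) + " x " + height(bbox) + " = " + area(bbox));
    }
}
```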
### Options
- `ExtractOptions` — Extends `BaseOptions`, adds OCR settings
- `SearchOptions` — Extends `BaseOptions`, adds search settings
- `BaseOptions` — Password and common settings
## Conformance
This SDK passes the [pdftract conformance suite](https://github.com/jedarden/pdftract/tree/main/tests/sdk-conformance).
Run tests:
```bash
mvn test
```
## License
MIT License — see [LICENSE](LICENSE) for details.
## Links
- [GitHub](https://github.com/jedarden/pdftract-java)
- [pdftract CLI](https://github.com/jedarden/pdftract)
- [Conformance Report](https://github.com/jedarden/pdftract/releases/latest)