pdftract Java SDK
Java/Kotlin SDK for pdftract — PDF extraction and analysis library.
Features
- 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- AutoCloseable client: Use with try-with-resources for automatic cleanup
- 8 typed exceptions: CorruptPdfException, EncryptionException, SourceUnreachableException, etc.
- Kotlin extensions: Idiomatic Kotlin syntax in the same artifact
- Java 17+: Modern Java with records and pattern matching
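To illustrate what the AutoCloseable feature implies for callers, here is a minimal sketch of try-with-resources behavior. `ClosingDemo` is a hypothetical stand-in, not the SDK's `Pdftract` class; it only records lifecycle events to show that `close()` runs before a `catch` block, even when the body throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in client that records lifecycle events; not the SDK's Pdftract class.
final class ClosingDemo implements AutoCloseable {
    private final List<String> log;

    ClosingDemo(List<String> log) {
        this.log = log;
        log.add("open");
    }

    // Simulates a unit of work that may fail.
    void work(boolean fail) {
        log.add("work");
        if (fail) {
            throw new IllegalStateException("simulated failure");
        }
    }

    @Override
    public void close() {
        log.add("close");
    }

    // Runs one open/work/close cycle and returns the recorded events in order.
    static List<String> run(boolean fail) {
        List<String> log = new ArrayList<>();
        try (ClosingDemo client = new ClosingDemo(log)) {
            client.work(fail);
        } catch (IllegalStateException e) {
            // The resource has already been closed by the time this block runs.
            log.add("caught");
        }
        return log;
    }
}
```

In this sketch the resource is released on both the success and the failure path, which is the behavior a caller can expect from any AutoCloseable used this way.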
Installation
Add to your pom.xml:
<dependency>
    <groupId>com.jedarden</groupId>
    <artifactId>pdftract</artifactId>
    <version>0.1.0</version>
</dependency>
Or for Gradle:
implementation 'com.jedarden:pdftract:0.1.0'
Requirements
- Java 17 or higher
- The pdftract binary must be available on your PATH (or specify a custom path)
  - Download from GitHub Releases
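The PATH requirement can be checked up front with the JDK alone. The sketch below is an assumption-laden illustration: `BinaryLocator` and `findOnPath` are hypothetical helpers, not part of the SDK, and this README does not describe how the SDK itself resolves the binary.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper (not part of the SDK): searches a PATH-style string for an executable.
final class BinaryLocator {
    private BinaryLocator() {}

    // Returns the first executable regular file called `name` in the given PATH string.
    // On Windows the file would typically carry an .exe suffix, which this sketch does not add.
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Path.of(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> binary = findOnPath("pdftract", System.getenv("PATH"));
        System.out.println(binary.map(p -> "Found: " + p).orElse("pdftract not found on PATH"));
    }
}
```

If the lookup comes back empty, the custom binary path constructor shown later in this README is the documented alternative.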
Java Usage
Basic extraction
import com.jedarden.pdftract.*;
import com.jedarden.pdftract.codegen.*;
import java.nio.file.Path;

try (Pdftract client = new Pdftract()) {
    // Extract structured data
    Document doc = client.extract(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println("Pages: " + doc.pages().size());
    System.out.println("Title: " + doc.metadata().title());

    // Access pages, blocks, and spans
    for (Page page : doc.pages()) {
        System.out.println("Page " + page.pageIndex() + ": " + page.width() + "x" + page.height());
        for (Block block : page.blocks()) {
            System.out.println("  " + block.kind() + ": " + block.text());
        }
    }
}
Extract plain text
try (Pdftract client = new Pdftract()) {
    String text = client.extractText(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println(text);
}
Extract Markdown
try (Pdftract client = new Pdftract()) {
    String markdown = client.extractMarkdown(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println(markdown);
}
OCR options
try (Pdftract client = new Pdftract()) {
    ExtractOptions options = new ExtractOptions()
        .setOcrLanguage("eng")
        .setOcrThreshold(0.7);
    Document doc = client.extract(Source.fromPath("scanned.pdf"), options);
}
Password-protected PDFs
try (Pdftract client = new Pdftract()) {
    BaseOptions options = new BaseOptions()
        .setPassword("secret");
    Document doc = client.extract(Source.fromPath("protected.pdf"), options);
}
Stream pages (for large PDFs)
try (Pdftract client = new Pdftract()) {
    client.extractStream(Source.fromPath("large.pdf"), null)
        .forEach(page -> {
            System.out.println("Page " + page.pageIndex());
            // Process each page as it arrives
        });
}
Search for text
try (Pdftract client = new Pdftract()) {
    SearchOptions options = new SearchOptions()
        .setMaxResults(100)
        .setWholeWord(true);
    client.search(Source.fromPath("document.pdf"), "invoice", options)
        .forEach(match -> {
            System.out.println("Found at page " + match.page() + ": " + match.text());
        });
}
Get metadata
try (Pdftract client = new Pdftract()) {
    Metadata metadata = client.getMetadata(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println("Pages: " + metadata.pageCount());
    System.out.println("Title: " + metadata.title());
    System.out.println("Author: " + metadata.author());
}
Compute fingerprint
try (Pdftract client = new Pdftract()) {
    Fingerprint fp = client.hash(
        Source.fromPath("document.pdf"),
        null
    );
    System.out.println("SHA-256: " + fp.hash());
    System.out.println("Fast hash: " + fp.fastHash());
}
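For readers unfamiliar with the format of a SHA-256 digest, the sketch below computes one over raw bytes using only the JDK. `Sha256Sketch` is a hypothetical helper. This README does not say which bytes the SDK's fingerprint covers, so the result should not be assumed to equal `fp.hash()`.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper (not part of the SDK): hex-encoded SHA-256 of a byte array.
final class Sha256Sketch {
    private Sha256Sketch() {}

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // HexFormat is available from Java 17, matching the SDK's stated minimum.
            return HexFormat.of().formatHex(digest.digest(data));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

The output is a 64-character lowercase hex string, which is the conventional text form of a SHA-256 digest.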
Classify document
try (Pdftract client = new Pdftract()) {
    Classification cls = client.classify(
        Source.fromPath("unknown.pdf")
    );
    System.out.println("Category: " + cls.category());
    System.out.println("Confidence: " + cls.confidence());
}
Verify receipt
try (Pdftract client = new Pdftract()) {
    Receipt receipt = new Receipt(
        "abc123def456",  // fingerprint
        "sig789xyz012"   // signature
    );
    boolean valid = client.verifyReceipt(
        Path.of("receipt.pdf"),
        receipt
    );
    System.out.println("Valid: " + valid);
}
URL sources
try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(
        Source.fromUrl("https://example.com/document.pdf"),
        null
    );
}
Byte sources
import java.nio.file.Files;
import java.nio.file.Path;

byte[] pdfBytes = Files.readAllBytes(Path.of("document.pdf"));
try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(
        Source.fromBytes(pdfBytes),
        null
    );
}
Custom binary path
try (Pdftract client = new Pdftract("/path/to/pdftract")) {
    Document doc = client.extract(Source.fromPath("doc.pdf"), null);
}
Kotlin Usage
The Kotlin extensions provide idiomatic syntax with lambda-based options:
import com.jedarden.pdftract.*
import com.jedarden.pdftract.codegen.*
import java.nio.file.Path

// Use with invoke operator (use-with-resources pattern)
pdftract {
    val doc = extract(Path.of("document.pdf")) {
        ocrLanguage = "eng"
        ocrThreshold = 0.7
    }
    println("Pages: ${doc.pages.size}")
}

// Or call use explicitly (Kotlin's counterpart to try-with-resources)
Pdftract().use { client ->
    val doc = client.extract(Path.of("document.pdf"))
    println(doc.metadata.title)
}

// Extract text
Pdftract().use { client ->
    val text = client.extractText(Path.of("document.pdf")) {
        ocrLanguage = "eng"
    }
    println(text)
}

// Search with options
Pdftract().use { client ->
    client.search(Path.of("document.pdf"), "invoice") {
        maxResults = 100
        wholeWord = true
    }.forEach { match ->
        println("Found at page ${match.page}: ${match.text}")
    }
}

// Stream pages (converts to Sequence)
Pdftract().use { client ->
    client.extractStream(Path.of("large.pdf")) {
        ocrLanguage = "eng"
    }.forEach { page ->
        println("Page ${page.pageIndex}")
    }
}
Exception Handling
All methods throw PdftractException or its subclasses:
try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(Source.fromPath("doc.pdf"), null);
} catch (CorruptPdfException e) {
    System.err.println("PDF is corrupt: " + e.getMessage());
} catch (EncryptionException e) {
    System.err.println("PDF is encrypted: " + e.getMessage());
} catch (SourceUnreachableException e) {
    System.err.println("Cannot read source: " + e.getMessage());
} catch (TlsException e) {
    System.err.println("TLS error: " + e.getMessage());
} catch (PdftractException e) {
    System.err.println("Error (exit code " + e.getExitCode() + "): " + e.getMessage());
}
Exception types:
- PdftractException: Base exception
- CorruptPdfException: PDF is corrupt (exit code 2)
- EncryptionException: PDF is encrypted (exit code 3)
- SourceUnreachableException: Cannot read source (exit code 4)
- RemoteFetchInterruptedException: Network interrupted (exit code 5)
- TlsException: TLS certificate error (exit code 6)
- ReceiptVerifyException: Receipt verification failed (exit code 10)
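The exit codes listed above suggest how a wrapper around the binary might translate a process exit status into a typed error. The following is a minimal, self-contained sketch of that idea. The `ErrorKind` enum and its `fromExitCode` method are hypothetical stand-ins for the exception classes, not the SDK's actual implementation, which this README does not show.

```java
import java.util.Optional;

// Hypothetical stand-in for the SDK's exception hierarchy, keyed by the documented exit codes.
enum ErrorKind {
    CORRUPT_PDF(2),
    ENCRYPTION(3),
    SOURCE_UNREACHABLE(4),
    REMOTE_FETCH_INTERRUPTED(5),
    TLS(6),
    RECEIPT_VERIFY(10);

    private final int exitCode;

    ErrorKind(int exitCode) {
        this.exitCode = exitCode;
    }

    int exitCode() {
        return exitCode;
    }

    // Returns the kind for a documented exit code, or empty for any other code.
    static Optional<ErrorKind> fromExitCode(int code) {
        for (ErrorKind kind : values()) {
            if (kind.exitCode == code) {
                return Optional.of(kind);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch, codes outside the documented set map to an empty result. With the real SDK, the analogous fallback would presumably be catching the base PdftractException and reading `getExitCode()`, as the handler above does.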
Data Types
Source
Sealed interface for PDF input sources:
- Source.fromPath(Path): Local file path
- Source.fromUrl(String): Remote URL
- Source.fromBytes(byte[]): Raw bytes
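As a rough illustration of what a sealed source type with three variants can look like in Java 17, consider the sketch below. `SourceSketch` and its nested records are hypothetical names chosen for this example; the SDK's real `Source` type may be structured differently.

```java
import java.nio.file.Path;

// Hypothetical re-creation of a sealed source type; not the SDK's code.
// With no permits clause, the permitted subtypes are those declared in this compilation unit.
sealed interface SourceSketch {
    record FromPath(Path path) implements SourceSketch {}
    record FromUrl(String url) implements SourceSketch {}
    record FromBytes(byte[] bytes) implements SourceSketch {}

    // Describes a source using instanceof pattern matching (final since Java 16).
    static String describe(SourceSketch source) {
        if (source instanceof FromPath p) {
            return "file:" + p.path();
        } else if (source instanceof FromUrl u) {
            return "url:" + u.url();
        } else if (source instanceof FromBytes b) {
            return "bytes:" + b.bytes().length;
        }
        // The compiler does not check exhaustiveness for instanceof chains.
        throw new IllegalStateException("unhandled source type");
    }
}
```

Sealing the interface keeps the set of variants closed, so code that dispatches on the source kind only has to handle the three cases listed above.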
Document
public record Document(
    String schemaVersion,
    DocumentMetadata metadata,
    List<Page> pages,
    List<ProcessingError> errors
)
Page
public record Page(
    int pageIndex,
    double width,
    double height,
    int rotation,
    String pageType,   // "vector" or "scanned"
    List<Span> spans,
    List<Block> blocks
)
Block
public record Block(
    String kind,        // "paragraph", "heading", "table", "figure", "list"
    List<Double> bbox,  // [x1, y1, x2, y2]
    List<Line> lines
)
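Because `bbox` is documented as `[x1, y1, x2, y2]`, a block's dimensions follow from simple subtraction. The sketch below assumes that ordering, and that `x2 >= x1` and `y2 >= y1`. `BboxSketch` is a hypothetical helper rather than an SDK class, and the coordinate units are not stated in this README.

```java
import java.util.List;

// Hypothetical helper: interprets a bbox list in the documented [x1, y1, x2, y2] order.
final class BboxSketch {
    private BboxSketch() {}

    static double width(List<Double> bbox) {
        check(bbox);
        return bbox.get(2) - bbox.get(0);
    }

    static double height(List<Double> bbox) {
        check(bbox);
        return bbox.get(3) - bbox.get(1);
    }

    static double area(List<Double> bbox) {
        return width(bbox) * height(bbox);
    }

    // Rejects lists that do not have exactly four coordinates.
    private static void check(List<Double> bbox) {
        if (bbox == null || bbox.size() != 4) {
            throw new IllegalArgumentException("bbox must contain exactly four values");
        }
    }
}
```

For example, a bbox of `[10, 20, 110, 70]` yields a width of 100 and a height of 50 in whatever units the page uses.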
Options
- ExtractOptions: Extends BaseOptions, adds OCR settings
- SearchOptions: Extends BaseOptions, adds search settings
- BaseOptions: Password and common settings
Conformance
This SDK passes the pdftract conformance suite.
Run tests:
mvn test
License
MIT License — see LICENSE for details.