pdftract/templates/sdk-skeleton/java/README.md.tera
# pdftract-java
Java SDK for pdftract - PDF extraction and conformance testing.
## Installation
```xml
<dependency>
  <groupId>com.jedarden</groupId>
  <artifactId>pdftract</artifactId>
  <version>{{ version }}</version>
</dependency>
```
## Requirements
- **Java 17 or higher** - The SDK uses records, sealed interfaces, and switch expressions
- **pdftract binary** - Install from [releases](https://github.com/jedarden/pdftract/releases/tag/v{{ version }})
## Usage
### Java - Basic extract
```java
import com.jedarden.pdftract.Pdftract;
import com.jedarden.pdftract.codegen.Source;
import com.jedarden.pdftract.codegen.Document;

try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(Source.fromPath("document.pdf"), null);
    System.out.println("Pages: " + doc.pages().size());
}
```
### Java - Extract with options
```java
import com.jedarden.pdftract.codegen.ExtractOptions;

// `client` is an open Pdftract instance, as in the basic example
ExtractOptions options = new ExtractOptions()
    .setOcrLanguage("eng")
    .setOcrThreshold(0.7)
    .setPassword("secret");

Document doc = client.extract(Source.fromPath("scanned.pdf"), options);
```
### Java - Search
```java
import java.util.stream.Stream;
import com.jedarden.pdftract.codegen.Match;

try (Stream<Match> matches = client.search(
        Source.fromPath("document.pdf"),
        "invoice",
        null)) {
    matches.forEach(match -> {
        System.out.println("Found on page " + match.page() + ": " + match.text());
    });
}
```
### Java - Stream extraction
```java
import java.util.stream.Stream;
import com.jedarden.pdftract.codegen.Page;

try (Stream<Page> pages = client.extractStream(
        Source.fromPath("large.pdf"),
        null)) {
    pages.forEach(page -> {
        System.out.println("Page " + page.pageIndex() + ": " + page.blocks().size() + " blocks");
    });
}
```
### Kotlin - Idiomatic syntax
The same JAR includes Kotlin extension functions for idiomatic usage:
```kotlin
import java.nio.file.Paths
import com.jedarden.pdftract.*
import com.jedarden.pdftract.codegen.extractOptions

pdftract {
    val doc = extract(Paths.get("document.pdf")) {
        ocrLanguage = "eng"
        ocrThreshold = 0.7
    }
    println("Pages: ${doc.pages.size}")
}
```
### Kotlin - Search with Sequence
```kotlin
pdftract {
    search(Paths.get("document.pdf"), "invoice") {
        maxResults = 10
        wholeWord = true
    }.forEach { match ->
        println("Found on page ${match.page}: ${match.text}")
    }
}
```
## Error handling
SDK methods report failures by throwing `PdftractException` or one of its subclasses. Catch the specific subclasses first and fall back to the base type:
```java
try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(source, null);
} catch (CorruptPdfException e) {
    // PDF is corrupt (exit code 2)
    System.err.println("Corrupt PDF: " + e.getMessage());
} catch (EncryptionException e) {
    // PDF is encrypted (exit code 3)
    System.err.println("Encryption error: " + e.getMessage());
} catch (SourceUnreachableException e) {
    // File or URL unreadable (exit code 4)
    System.err.println("Source unreachable: " + e.getMessage());
} catch (PdftractException e) {
    // Other errors
    System.err.println("Error (exit code " + e.getExitCode() + "): " + e.getMessage());
}
```
## Exception mapping
| Exit code | Exception | Description |
|-----------|-----------|-------------|
| 0 | (none) | Success, no error |
| 2 | CorruptPdfException | PDF is corrupt or invalid |
| 3 | EncryptionException | PDF encrypted, password missing/wrong |
| 4 | SourceUnreachableException | File or URL unreadable |
| 5 | RemoteFetchInterruptedException | Network interrupted during fetch |
| 6 | TlsException | TLS certificate validation failed |
| 10 | ReceiptVerifyException | Receipt verification failed |
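
The table above can be read as a simple lookup from exit code to exception type. The following is a minimal sketch of that lookup, not the SDK's actual source: `ExitCodes` and `exceptionNameFor` are hypothetical names, and the fallback to `PdftractException` for unlisted codes is an assumption based on the "Other errors" branch in the error handling example.

```java
// Illustrative only: mirrors the exit-code table; not taken from the SDK.
final class ExitCodes {
    private ExitCodes() {}

    /** Returns the exception name documented for a pdftract exit code. */
    static String exceptionNameFor(int exitCode) {
        return switch (exitCode) {
            case 0 -> "(none)"; // success, nothing is thrown
            case 2 -> "CorruptPdfException";
            case 3 -> "EncryptionException";
            case 4 -> "SourceUnreachableException";
            case 5 -> "RemoteFetchInterruptedException";
            case 6 -> "TlsException";
            case 10 -> "ReceiptVerifyException";
            // Assumption: codes not in the table surface as the base type.
            default -> "PdftractException";
        };
    }
}
```

A switch expression keeps the mapping exhaustive in one place, which is one reason the SDK's Java 17 baseline is convenient for this kind of table.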
## Source types
```java
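import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.jedarden.pdftract.codegen.Source;

// From file path
Source.fromPath(Paths.get("document.pdf"));
Source.fromPath("document.pdf");

// From URL
Source.fromUrl(URI.create("https://example.com/doc.pdf"));
Source.fromUrl("https://example.com/doc.pdf");

// From bytes (Files.readAllBytes throws IOException)
Source.fromBytes(Files.readAllBytes(Paths.get("document.pdf")));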
```
## Binary discovery
The SDK looks for the `pdftract` binary on your PATH. To use a custom path:
```java
try (Pdftract client = new Pdftract("/custom/path/to/pdftract")) {
    // ...
}
```
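If you want to check up front whether a binary is reachable, a PATH lookup can be sketched in plain Java. This is an illustration of the general technique, not the SDK's own discovery logic: `BinaryLocator` and `findOnPath` are hypothetical names, and on Windows you would likely need to pass `pdftract.exe` as the name.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative only: a generic PATH search, not the SDK's implementation.
final class BinaryLocator {
    private BinaryLocator() {}

    /** Returns the first executable regular file called name in pathValue. */
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // PATH entries are separated by ':' on Unix and ';' on Windows.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as "a::b"
            }
            Path candidate = Path.of(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Calling `BinaryLocator.findOnPath("pdftract", System.getenv("PATH"))` yields an `Optional<Path>`; when it is present, its string form can be passed to the `Pdftract` constructor shown above.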
## Troubleshooting
### Binary not found
Ensure `pdftract` is on your PATH. Verify with:
```bash
pdftract --version
```
### Version mismatch
The SDK expects pdftract {{ version }}. Install the matching version from releases.
### Network failure
For remote URLs, check your network connection and the server's TLS certificate chain.
### AutoCloseable
Always use try-with-resources or call `close()` to ensure clean subprocess termination:
```java
try (Pdftract client = new Pdftract()) {
    // work with client
} // automatically calls close()
```
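To see why try-with-resources is the safer choice, the sketch below uses a stand-in class to show that `close()` runs even when the body throws. `FakeClient` and `CloseDemo` are hypothetical and exist only for this illustration; they say nothing about how `Pdftract` manages its subprocess internally.

```java
// Stand-in for illustration; FakeClient is not part of the SDK.
final class FakeClient implements AutoCloseable {
    private boolean closed = false;

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class CloseDemo {
    private CloseDemo() {}

    /** Returns true if the client was closed even though the body threw. */
    static boolean closedAfterFailure() {
        FakeClient client = new FakeClient();
        try (client) {
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            // close() has already run by the time this catch block executes
        }
        return client.isClosed();
    }
}
```

With a manual `close()` call placed at the end of the body, the same exception would skip the cleanup unless you add a `finally` block yourself.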
## License
MIT