pdftract

jedarden 9aa26a449e, 2026-05-20 18:13:14 -04:00

docs(pdftract-49f8): establish Cargo.lock policy and documentation

This commit implements the Cargo.lock policy for reproducible builds across all workspace members (pdftract-core, pdftract-cli, pdftract-py).

Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note

The Argo workflow updates (pdftract-ci.yaml) are committed separately in the declarative-config repo.

Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

README.md.tera

# pdftract-java

Java SDK for pdftract - PDF extraction and conformance testing.

## Installation

```xml
<dependency>
    <groupId>com.jedarden</groupId>
    <artifactId>pdftract</artifactId>
    <version>{{ version }}</version>
</dependency>
```

## Requirements

- **Java 17 or higher** - The SDK uses records, sealed interfaces, and switch expressions
- **pdftract binary** - Install from [releases](https://github.com/jedarden/pdftract/releases/tag/v{{ version }})

## Usage

### Java - Basic extract

```java
import com.jedarden.pdftract.Pdftract;
import com.jedarden.pdftract.codegen.Source;
import com.jedarden.pdftract.codegen.Document;

try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(Source.fromPath("document.pdf"), null);
    System.out.println("Pages: " + doc.pages().size());
}
```

### Java - Extract with options

```java
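import com.jedarden.pdftract.Pdftract;
import com.jedarden.pdftract.codegen.Document;
import com.jedarden.pdftract.codegen.ExtractOptions;
import com.jedarden.pdftract.codegen.Source;

// Fluent setters configure OCR and the password for encrypted PDFs.
ExtractOptions options = new ExtractOptions()
    .setOcrLanguage("eng")
    .setOcrThreshold(0.7)
    .setPassword("secret");

try (Pdftract client = new Pdftract()) {
    Document doc = client.extract(Source.fromPath("scanned.pdf"), options);
}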
```

### Java - Search

```java
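import java.util.stream.Stream;
import com.jedarden.pdftract.codegen.Match;
import com.jedarden.pdftract.codegen.Source;

// `client` is an open Pdftract instance, as in the basic extract example.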

try (Stream<Match> matches = client.search(
        Source.fromPath("document.pdf"),
        "invoice",
        null)) {
    matches.forEach(match -> {
        System.out.println("Found on page " + match.page() + ": " + match.text());
    });
}
```

### Java - Stream extraction

```java
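import java.util.stream.Stream;
import com.jedarden.pdftract.codegen.Page;
import com.jedarden.pdftract.codegen.Source;

// `client` is an open Pdftract instance, as in the basic extract example.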

try (Stream<Page> pages = client.extractStream(
        Source.fromPath("large.pdf"),
        null)) {
    pages.forEach(page -> {
        System.out.println("Page " + page.pageIndex() + ": " + page.blocks().size() + " blocks");
    });
}
```
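
The search and stream examples both open the returned `Stream` in try-with-resources. This README does not say what the stream holds internally, but a stream fed by an external process would plausibly rely on its close handler to release that process. The sketch below uses only the JDK, not the pdftract classes, to show that a handler registered with `Stream.onClose` runs when the try block exits, even if the stream was only partly consumed. `StreamCloseSketch`, `linesWithCleanup`, and `firstLine` are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

// Illustrative only: stands in for a stream backed by an external resource.
final class StreamCloseSketch {

    // Returns a stream whose close handler flips the given flag.
    static Stream<String> linesWithCleanup(List<String> lines, AtomicBoolean closed) {
        return lines.stream().onClose(() -> closed.set(true));
    }

    // Reads only the first element, then lets try-with-resources close the stream.
    static String firstLine(List<String> lines, AtomicBoolean closed) {
        try (Stream<String> stream = linesWithCleanup(lines, closed)) {
            return stream.findFirst().orElse("");
        }
    }

    public static void main(String[] args) {
        AtomicBoolean closed = new AtomicBoolean(false);
        String first = firstLine(List.of("page 0", "page 1"), closed);
        System.out.println(first + ", closed=" + closed.get());
    }
}
```

Without the try-with-resources block, a terminal operation such as `findFirst` or `forEach` does not call the close handler on its own.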

### Kotlin - Idiomatic syntax

The same JAR includes Kotlin extension functions for idiomatic usage:

```kotlin
import java.nio.file.Paths
import com.jedarden.pdftract.*
import com.jedarden.pdftract.codegen.extractOptions

pdftract {
    val doc = extract(Paths.get("document.pdf")) {
        ocrLanguage = "eng"
        ocrThreshold = 0.7
    }
    println("Pages: ${doc.pages.size}")
}
```

### Kotlin - Search with Sequence

```kotlin
import java.nio.file.Paths
import com.jedarden.pdftract.*

pdftract {
    search(Paths.get("document.pdf"), "invoice") {
        maxResults = 10
        wholeWord = true
    }.forEach { match ->
        println("Found on page ${match.page}: ${match.text}")
    }
}
```

## Error handling

All SDK methods throw `PdftractException` or its subclasses:

```java
try (Pdftract client = new Pdftract()) {
    // `source` is any Source, e.g. Source.fromPath("document.pdf")
    Document doc = client.extract(source, null);
} catch (CorruptPdfException e) {
    // PDF is corrupt (exit code 2)
    System.err.println("Corrupt PDF: " + e.getMessage());
} catch (EncryptionException e) {
    // PDF is encrypted (exit code 3)
    System.err.println("Encryption error: " + e.getMessage());
} catch (SourceUnreachableException e) {
    // File or URL unreadable (exit code 4)
    System.err.println("Source unreachable: " + e.getMessage());
} catch (PdftractException e) {
    // Other errors
    System.err.println("Error (exit code " + e.getExitCode() + "): " + e.getMessage());
}
```

## Exception mapping

| Exit code | Exception | Description |
|-----------|-----------|-------------|
| 0 | (none) | Success, no exception thrown |
| 2 | CorruptPdfException | PDF is corrupt or invalid |
| 3 | EncryptionException | PDF encrypted, password missing/wrong |
| 4 | SourceUnreachableException | File or URL unreadable |
| 5 | RemoteFetchInterruptedException | Network interrupted during fetch |
| 6 | TlsException | TLS certificate validation failed |
| 10 | ReceiptVerifyException | Receipt verification failed |
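
The table can be read as a lookup from exit code to exception type. The following is a minimal sketch of that lookup, not the SDK's actual dispatch code: `ExitCodeSketch` and `exceptionNameFor` are hypothetical, and the method returns class names as strings so that it compiles without the SDK. Falling back to the base `PdftractException` for unlisted codes is an assumption based on the catch-all branch in the error handling example.

```java
import java.util.Optional;

// Illustrative only: mirrors the exit-code table as a switch expression.
final class ExitCodeSketch {

    // Maps an exit code to the exception name from the table; empty means success.
    static Optional<String> exceptionNameFor(int exitCode) {
        return switch (exitCode) {
            case 0 -> Optional.empty();
            case 2 -> Optional.of("CorruptPdfException");
            case 3 -> Optional.of("EncryptionException");
            case 4 -> Optional.of("SourceUnreachableException");
            case 5 -> Optional.of("RemoteFetchInterruptedException");
            case 6 -> Optional.of("TlsException");
            case 10 -> Optional.of("ReceiptVerifyException");
            // Assumption: codes not listed in the table surface as the base type.
            default -> Optional.of("PdftractException");
        };
    }

    public static void main(String[] args) {
        for (int code : new int[] {0, 2, 3, 4, 5, 6, 10, 42}) {
            System.out.println(code + " -> " + exceptionNameFor(code).orElse("(no exception)"));
        }
    }
}
```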

## Source types

```java
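import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.jedarden.pdftract.codegen.Source;

// From file path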
Source.fromPath(Paths.get("document.pdf"));
Source.fromPath("document.pdf");

// From URL
Source.fromUrl(URI.create("https://example.com/doc.pdf"));
Source.fromUrl("https://example.com/doc.pdf");

// From bytes
Source.fromBytes(Files.readAllBytes(Paths.get("document.pdf")));
```

## Binary discovery

The SDK looks for the `pdftract` binary on your PATH. To use a custom path:

```java
try (Pdftract client = new Pdftract("/custom/path/to/pdftract")) {
    // ...
}
```
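
To see what a PATH lookup involves, or to check for the binary before constructing a client, something like the following can help. It is a sketch under stated assumptions, not the SDK's discovery logic: `PathLookupSketch` and `findOnPath` are hypothetical names, only JDK calls are used, and Windows executable suffixes such as `.exe` are not handled.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative only: a plain PATH scan, independent of the pdftract SDK.
final class PathLookupSketch {

    // Returns the first executable regular file called `name` in the PATH-style string.
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            Path candidate = Path.of(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> found = findOnPath("pdftract", System.getenv("PATH"));
        System.out.println(found.map(p -> "found: " + p).orElse("pdftract not found on PATH"));
    }
}
```

The resolved path could then be passed to the `Pdftract(String)` constructor shown above.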

## Troubleshooting

### Binary not found

Ensure `pdftract` is on your PATH. Verify with:

```bash
pdftract --version
```

### Version mismatch

The SDK expects pdftract {{ version }}. Install the matching binary from [releases](https://github.com/jedarden/pdftract/releases/tag/v{{ version }}).
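
A quick way to confirm the installed version from Java is to run `pdftract --version` and look for the expected number. The sketch below assumes the command prints the version somewhere in its output; the exact output format is not documented here, so the helper only searches for the expected version as a standalone token. `VersionCheckSketch`, `readVersionOutput`, and `mentionsVersion` are hypothetical names, and a pre-release suffix such as `-beta` would still count as a match.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Illustrative only: compares `pdftract --version` output with an expected version.
final class VersionCheckSketch {

    // Runs `<binary> --version` and returns its combined stdout and stderr.
    static String readVersionOutput(String binary) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(binary, "--version")
                .redirectErrorStream(true)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return output;
    }

    // True if `expected` appears as a whole version token, so 1.2.3 does not
    // match inside 1.2.30, 11.2.3, or 1.2.3.4.
    static boolean mentionsVersion(String versionOutput, String expected) {
        Pattern token = Pattern.compile("(?<![0-9.])" + Pattern.quote(expected) + "(?!\\.?[0-9])");
        return token.matcher(versionOutput).find();
    }

    public static void main(String[] args) throws Exception {
        // Pass the version the SDK expects as the first argument.
        String expected = args.length > 0 ? args[0] : "0.0.0";
        String output = readVersionOutput("pdftract");
        System.out.println(mentionsVersion(output, expected)
                ? "version matches"
                : "version mismatch: " + output.trim());
    }
}
```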

### Network failure

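For remote URLs, check your network connection and TLS certificate chain. An interrupted fetch surfaces as `RemoteFetchInterruptedException` (exit code 5), and a failed certificate validation as `TlsException` (exit code 6).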

### AutoCloseable

Always use try-with-resources or call `close()` to ensure clean subprocess termination:

```java
try (Pdftract client = new Pdftract()) {
    // work with client
} // automatically calls close()
```
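
The guarantee matters most on the failure path. The sketch below uses a stand-in resource rather than the real `Pdftract` class (`CloseSketch` and `FakeClient` are hypothetical) to show that try-with-resources calls `close()` before an exception reaches the `catch` block, and what the equivalent explicit `close()` in a `finally` block looks like.

```java
// Illustrative only: FakeClient stands in for a client that owns a subprocess.
final class CloseSketch {

    // Records whether close() ran.
    static final class FakeClient implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    // try-with-resources: close() runs before the catch block sees the exception.
    static boolean closedAfterFailure() {
        FakeClient client = new FakeClient();
        try (client) {
            throw new IllegalStateException("simulated extraction failure");
        } catch (IllegalStateException e) {
            return client.closed;
        }
    }

    // Manual form for code that cannot use try-with-resources.
    static boolean closedWithFinally() {
        FakeClient client = new FakeClient();
        try {
            // work with client
        } finally {
            client.close();
        }
        return client.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed after failure: " + closedAfterFailure());
        System.out.println("closed with finally: " + closedWithFinally());
    }
}
```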

## License

MIT