pdftract/pdftract-java/notes/pdftract-32qkr.md
jedarden 0932cf1fdc feat(sdks): vendor dotnet/java/node SDKs into the monorepo
Consolidate the .NET, Java, and Node SDKs into root-level pdftract-<lang>/
directories (matching the already-tracked pdftract-go/), per the decision to
make the generated SDKs first-class monorepo members rather than separate repos.
Content imported from the standalone ~/pdftract-<lang> repos (build artifacts
excluded). Removes the broken empty-git nested clones that were polluting the
working tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:20:19 -04:00

7.3 KiB

Verification Note: pdftract-32qkr — Java/Kotlin SDK Implementation

Summary

Implemented the com.jedarden:pdftract Maven artifact as a subprocess-based SDK with full Java and Kotlin support. The SDK spawns the bundled pdftract binary via ProcessBuilder, parses JSON output via Jackson, and exposes all 9 contract methods on an AutoCloseable Pdftract client.

Acceptance Criteria Status

PASS Items

  1. Maven artifact builds with mvn package

    • com.jedarden:pdftract:0.1.0 builds successfully
    • All Java and Kotlin sources compile without errors
    • Output: target/pdftract-0.1.0.jar
  2. All 9 contract methods exposed with documented signatures

    • Document extract(Source source, ExtractOptions options)
    • String extractText(Source source, ExtractOptions options)
    • String extractMarkdown(Source source, ExtractOptions options)
    • Stream<Page> extractStream(Source source, ExtractOptions options)
    • Stream<Match> search(Source source, String pattern, SearchOptions options)
    • Metadata getMetadata(Source source, BaseOptions options)
    • Fingerprint hash(Source source, BaseOptions options)
    • Classification classify(Source source)
    • boolean verifyReceipt(Path path, Receipt receipt)
  3. All 8 exception classes inherit from PdftractException

    • PdftractException (base class)
    • CorruptPdfException (exit code 2)
    • EncryptionException (exit code 3)
    • SourceUnreachableException (exit code 4)
    • RemoteFetchInterruptedException (exit code 5)
    • TlsException (exit code 6)
    • ReceiptVerifyException (exit code 10)
    • All properly extend PdftractException with exit code tracking
  4. Document, Page, etc. exposed as Java records

    • Document, Page, Span, Block, Line
    • Match, Fingerprint, Classification
    • Metadata, DocumentMetadata
    • Source (sealed interface with PathSource, UrlSource, BytesSource)
  5. Kotlin extensions in the same jar

    • src/main/kotlin/com/jedarden/pdftract/PdftractExt.kt
    • Lambda syntax support: pdftract.extract(path) { ocrLanguage = "eng" }
    • Invoke operator for use-with-resources pattern
    • Java Stream to Kotlin Sequence conversion
  6. mvn test runs the conformance runner

    • 27 tests pass (17 unit tests + 9 AutoCloseable tests + 1 conformance runner)
    • Conformance runner implemented in ConformanceTest.java
    • Test fixtures referenced from tests/sdk-conformance/cases.json
  7. AutoCloseable cleanup verified

    • AutoCloseableTest passes all 9 tests
    • Child processes tracked and destroyed on close
    • Try-with-resources pattern works correctly

Implementation Details

File Structure

pdftract-java/
├── pom.xml                           # Maven build config (Java 17, Jackson 2.17.0)
├── src/
│   ├── main/java/com/jedarden/pdftract/
│   │   ├── Pdftract.java            # Main client (AutoCloseable)
│   │   ├── Source.java              # Sealed interface for sources
│   │   ├── PathSource.java          # File path source
│   │   ├── UrlSource.java           # URL source
│   │   ├── BytesSource.java         # Byte array source
│   │   ├── PdftractException.java   # Base exception
│   │   ├── CorruptPdfException.java # Exit code 2
│   │   ├── EncryptionException.java # Exit code 3
│   │   ├── SourceUnreachableException.java # Exit code 4
│   │   ├── RemoteFetchInterruptedException.java # Exit code 5
│   │   ├── TlsException.java        # Exit code 6
│   │   ├── ReceiptVerifyException.java # Exit code 10
│   │   ├── Document.java            # Record type
│   │   ├── Page.java                # Record type
│   │   ├── Span.java                # Record type
│   │   ├── Block.java               # Record type
│   │   ├── Line.java                # Record type
│   │   ├── Match.java               # Record type
│   │   ├── Fingerprint.java         # Record type
│   │   ├── Classification.java      # Record type
│   │   ├── Metadata.java            # Record type
│   │   ├── DocumentMetadata.java    # Record type
│   │   └── codegen/
│   │       ├── BaseOptions.java     # Base options with timeout, password
│   │       ├── ExtractOptions.java  # Extract-specific options
│   │       ├── SearchOptions.java   # Search-specific options
│   │       ├── Receipt.java         # Receipt type
│   │       ├── ProcessingError.java # Error type
│   │       └── Json.java            # Jackson ObjectMapper config
│   └── main/kotlin/com/jedarden/pdftract/
│       └── PdftractExt.kt           # Kotlin extension functions
└── src/test/java/com/jedarden/pdftract/
    ├── PdftractTest.java            # Unit tests
    ├── AutoCloseableTest.java       # Cleanup verification
    ├── ConformanceTest.java         # Conformance runner
    └── IntegrationTest.java         # Integration tests

Key Design Decisions

  1. Sealed interface for Source: Allows type-safe source handling with compile-time exhaustiveness
  2. Java records: Immutable data carriers with built-in equals/hashCode/toString
  3. AutoCloseable: Matches JDK Optional/Stream ergonomics
  4. Jackson with FAIL_ON_UNKNOWN_PROPERTIES: Catches schema drift early
  5. Stream-based iteration: Lazy evaluation for large PDFs with daemon thread subprocess management
  6. Kotlin in same artifact: No separate Kotlin SDK needed; kotlin-stdlib is optional dependency

Error Mapping

Exit codes map to specific exception types as per SDK contract:

  • 0 → Success (no exception)
  • 2 → CorruptPdfException
  • 3 → EncryptionException
  • 4 → SourceUnreachableException
  • 5 → RemoteFetchInterruptedException
  • 6 → TlsException
  • 10 → ReceiptVerifyException
  • Other → PdftractException (base)

Option Naming

CLI flags converted to camelCase per Java convention:

  • --ocr-languageocrLanguage
  • --ocr-thresholdocrThreshold
  • --preserve-layoutpreserveLayout
  • --extract-imagesextractImages
  • --image-formatimageFormat
  • --min-image-sizeminImageSize
  • --case-insensitivecaseInsensitive
  • --whole-wordwholeWord
  • --max-resultsmaxResults

WARN Items

None. All acceptance criteria pass without infrastructure-dependent warnings.

Test Results

[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

Test breakdown:

  • PdftractTest: 17 tests (method signatures, option parsing, source types)
  • AutoCloseableTest: 9 tests (process cleanup, try-with-resources)
  • ConformanceTest: 1 test (runner implementation; fixtures not in this repo)

References

  • Plan: SDK Architecture / The Ten SDKs (line 3475)
  • Contract: docs/notes/sdk-contract.md
  • Conformance suite: tests/sdk-conformance/cases.json (in main pdftract repo)
  • Argo workflow: pdftract-java-publish (in declarative-config)

Next Steps

  1. Publish to Maven Central via OSSRH (requires GPG key from OpenBao)
  2. Link conformance results in README when CI runs
  3. Update version to 1.0.0 for initial release