# Verification Note: pdftract-32qkr — Java/Kotlin SDK Implementation

## Summary

Implemented the `com.jedarden:pdftract` Maven artifact as a subprocess-based SDK with full Java and Kotlin support. The SDK spawns the bundled `pdftract` binary via `ProcessBuilder`, parses its JSON output with Jackson, and exposes all 9 contract methods on an `AutoCloseable` `Pdftract` client.

## Acceptance Criteria Status

### PASS Items

1. ✅ **Maven artifact builds with `mvn package`**
   - `com.jedarden:pdftract:0.1.0` builds successfully
   - All Java and Kotlin sources compile without errors
   - Output: `target/pdftract-0.1.0.jar`

2. ✅ **All 9 contract methods exposed with documented signatures**
   - `Document extract(Source source, ExtractOptions options)`
   - `String extractText(Source source, ExtractOptions options)`
   - `String extractMarkdown(Source source, ExtractOptions options)`
   - `Stream<Page> extractStream(Source source, ExtractOptions options)`
   - `Stream<Match> search(Source source, String pattern, SearchOptions options)`
   - `Metadata getMetadata(Source source, BaseOptions options)`
   - `Fingerprint hash(Source source, BaseOptions options)`
   - `Classification classify(Source source)`
   - `boolean verifyReceipt(Path path, Receipt receipt)`

3. ✅ **All 8 exception classes inherit from `PdftractException`**
   - `PdftractException` (base class)
   - `CorruptPdfException` (exit code 2)
   - `EncryptionException` (exit code 3)
   - `SourceUnreachableException` (exit code 4)
   - `RemoteFetchInterruptedException` (exit code 5)
   - `TlsException` (exit code 6)
   - `ReceiptVerifyException` (exit code 10)
   - Each subclass extends `PdftractException` and tracks the exit code that produced it

4. ✅ **`Document`, `Page`, etc. exposed as Java records**
   - `Document`, `Page`, `Span`, `Block`, `Line`
   - `Match`, `Fingerprint`, `Classification`
   - `Metadata`, `DocumentMetadata`
   - `Source` (sealed interface with `PathSource`, `UrlSource`, `BytesSource`)

5. ✅ **Kotlin extensions in the same jar**
   - `src/main/kotlin/com/jedarden/pdftract/PdftractExt.kt`
   - Lambda syntax support: `pdftract.extract(path) { ocrLanguage = "eng" }`
   - Invoke operator for use-with-resources pattern
   - Java `Stream` to Kotlin `Sequence` conversion

6. ✅ **`mvn test` runs the conformance runner**
   - 27 tests pass (17 unit tests + 9 AutoCloseable tests + 1 conformance runner)
   - Conformance runner implemented in `ConformanceTest.java`
   - Test fixtures referenced from `tests/sdk-conformance/cases.json`

7. ✅ **`AutoCloseable` cleanup verified**
   - `AutoCloseableTest` passes all 9 tests
   - Child processes are tracked and destroyed on close
   - Try-with-resources pattern works correctly

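The cleanup behaviour in item 7 can be illustrated with a minimal sketch. This is not the SDK's actual `Pdftract` class: `TrackingClient`, `track`, `liveChildren`, and `isClosed` are hypothetical names chosen for illustration, and only the bookkeeping (track each child, destroy whatever is still running on `close()`) is shown.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the SDK client; only the cleanup bookkeeping is shown. */
final class TrackingClient implements AutoCloseable {
    // Concurrent set, so exit callbacks and close() can touch it from different threads.
    private final Set<Process> children = ConcurrentHashMap.newKeySet();
    private volatile boolean closed;

    /** Registers a spawned child process so close() can reach it later. */
    Process track(Process child) {
        if (closed) {
            // Assumption: a closed client refuses new work and kills the stray child.
            child.destroyForcibly();
            throw new IllegalStateException("client is closed");
        }
        children.add(child);
        // Forget the child once it exits on its own, so the set does not grow.
        child.onExit().thenRun(() -> children.remove(child));
        return child;
    }

    int liveChildren() {
        return children.size();
    }

    boolean isClosed() {
        return closed;
    }

    /** Destroys every child still being tracked; safe to call more than once. */
    @Override
    public void close() {
        closed = true;
        for (Process child : children) {
            child.destroy();
        }
        children.clear();
    }
}
```

With a class shaped like this, `try (TrackingClient client = new TrackingClient()) { ... }` guarantees that `close()` runs even when the body throws, which is the property the `AutoCloseableTest` suite is described as checking.
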
## Implementation Details

### File Structure

```
pdftract-java/
├── pom.xml  # Maven build config (Java 17, Jackson 2.17.0)
├── src/
│   ├── main/java/com/jedarden/pdftract/
│   │   ├── Pdftract.java  # Main client (AutoCloseable)
│   │   ├── Source.java  # Sealed interface for sources
│   │   ├── PathSource.java  # File path source
│   │   ├── UrlSource.java  # URL source
│   │   ├── BytesSource.java  # Byte array source
│   │   ├── PdftractException.java  # Base exception
│   │   ├── CorruptPdfException.java  # Exit code 2
│   │   ├── EncryptionException.java  # Exit code 3
│   │   ├── SourceUnreachableException.java  # Exit code 4
│   │   ├── RemoteFetchInterruptedException.java  # Exit code 5
│   │   ├── TlsException.java  # Exit code 6
│   │   ├── ReceiptVerifyException.java  # Exit code 10
│   │   ├── Document.java  # Record type
│   │   ├── Page.java  # Record type
│   │   ├── Span.java  # Record type
│   │   ├── Block.java  # Record type
│   │   ├── Line.java  # Record type
│   │   ├── Match.java  # Record type
│   │   ├── Fingerprint.java  # Record type
│   │   ├── Classification.java  # Record type
│   │   ├── Metadata.java  # Record type
│   │   ├── DocumentMetadata.java  # Record type
│   │   └── codegen/
│   │       ├── BaseOptions.java  # Base options with timeout, password
│   │       ├── ExtractOptions.java  # Extract-specific options
│   │       ├── SearchOptions.java  # Search-specific options
│   │       ├── Receipt.java  # Receipt type
│   │       ├── ProcessingError.java  # Error type
│   │       └── Json.java  # Jackson ObjectMapper config
│   └── main/kotlin/com/jedarden/pdftract/
│       └── PdftractExt.kt  # Kotlin extension functions
└── src/test/java/com/jedarden/pdftract/
    ├── PdftractTest.java  # Unit tests
    ├── AutoCloseableTest.java  # Cleanup verification
    ├── ConformanceTest.java  # Conformance runner
    └── IntegrationTest.java  # Integration tests
```

### Key Design Decisions

1. **Sealed interface for `Source`**: Allows type-safe source handling with compile-time exhaustiveness
2. **Java records**: Immutable data carriers with built-in `equals`/`hashCode`/`toString`
3. **`AutoCloseable`**: Enables try-with-resources cleanup and matches JDK `Optional<T>`/`Stream<T>` ergonomics
4. **Jackson with `FAIL_ON_UNKNOWN_PROPERTIES` enabled**: Catches schema drift early
5. **Stream-based iteration**: Lazy evaluation for large PDFs, with the subprocess managed on a daemon thread
6. **Kotlin in same artifact**: No separate Kotlin SDK needed; `kotlin-stdlib` is an optional dependency

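The first two decisions can be sketched together. The type names `Source`, `PathSource`, `UrlSource`, and `BytesSource` come from this note; the record components and the `SourceArgs.describe` helper are assumptions made for illustration, not the SDK's real definitions. Because `pom.xml` targets Java 17, the sketch dispatches with `instanceof` patterns. A pattern-matching `switch`, which lets the compiler check exhaustiveness over the sealed hierarchy, is only final from Java 21.

```java
import java.net.URI;
import java.nio.file.Path;

// Sealed hierarchy: no implementation can exist outside the permitted list.
sealed interface Source permits PathSource, UrlSource, BytesSource {}

// Record components below are assumed; the real records may carry different fields.
record PathSource(Path path) implements Source {}
record UrlSource(URI uri) implements Source {}
record BytesSource(byte[] bytes) implements Source {}

final class SourceArgs {
    private SourceArgs() {}

    /** Hypothetical helper: describes how each source kind might reach the subprocess. */
    static String describe(Source source) {
        if (source instanceof PathSource p) {
            return "path:" + p.path();
        }
        if (source instanceof UrlSource u) {
            return "url:" + u.uri();
        }
        if (source instanceof BytesSource b) {
            return "stdin:" + b.bytes().length + " bytes";
        }
        // Cannot happen while the permits list above stays in sync with these branches.
        throw new AssertionError("unreachable: Source is sealed");
    }
}
```

Adding a fourth source type forces a change to the `permits` clause, which makes it easy to find every dispatch site that needs a new branch.
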
### Error Mapping

Exit codes map to specific exception types, per the SDK contract:

- 0 → Success (no exception)
- 2 → `CorruptPdfException`
- 3 → `EncryptionException`
- 4 → `SourceUnreachableException`
- 5 → `RemoteFetchInterruptedException`
- 6 → `TlsException`
- 10 → `ReceiptVerifyException`
- Other → `PdftractException` (base)

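The mapping above could be implemented roughly as follows. The exception names and exit codes are taken from the list; the constructor shapes, the choice of unchecked exceptions, and the `ExitCodes` helper are assumptions for the sake of a compact example.

```java
/** Base exception; carries the exit code of the failed subprocess. */
class PdftractException extends RuntimeException {
    private final int exitCode;

    PdftractException(String message, int exitCode) {
        super(message);
        this.exitCode = exitCode;
    }

    int exitCode() {
        return exitCode;
    }
}

final class CorruptPdfException extends PdftractException {
    CorruptPdfException(String message) { super(message, 2); }
}

final class EncryptionException extends PdftractException {
    EncryptionException(String message) { super(message, 3); }
}

final class SourceUnreachableException extends PdftractException {
    SourceUnreachableException(String message) { super(message, 4); }
}

final class RemoteFetchInterruptedException extends PdftractException {
    RemoteFetchInterruptedException(String message) { super(message, 5); }
}

final class TlsException extends PdftractException {
    TlsException(String message) { super(message, 6); }
}

final class ReceiptVerifyException extends PdftractException {
    ReceiptVerifyException(String message) { super(message, 10); }
}

/** Hypothetical helper that turns a subprocess exit code into the matching exception. */
final class ExitCodes {
    private ExitCodes() {}

    static PdftractException toException(int exitCode, String stderr) {
        return switch (exitCode) {
            case 2 -> new CorruptPdfException(stderr);
            case 3 -> new EncryptionException(stderr);
            case 4 -> new SourceUnreachableException(stderr);
            case 5 -> new RemoteFetchInterruptedException(stderr);
            case 6 -> new TlsException(stderr);
            case 10 -> new ReceiptVerifyException(stderr);
            default -> new PdftractException(stderr, exitCode);
        };
    }

    /** Returns normally on exit code 0 and throws the mapped exception otherwise. */
    static void check(int exitCode, String stderr) {
        if (exitCode != 0) {
            throw toException(exitCode, stderr);
        }
    }
}
```

A caller would typically run `ExitCodes.check(process.waitFor(), stderrText)` after the subprocess finishes, so unrecognized codes still surface as the base `PdftractException` with the raw code attached.
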
### Option Naming

CLI flags are converted to camelCase, per Java convention:

- `--ocr-language` → `ocrLanguage`
- `--ocr-threshold` → `ocrThreshold`
- `--preserve-layout` → `preserveLayout`
- `--extract-images` → `extractImages`
- `--image-format` → `imageFormat`
- `--min-image-size` → `minImageSize`
- `--case-insensitive` → `caseInsensitive`
- `--whole-word` → `wholeWord`
- `--max-results` → `maxResults`

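The naming rule is mechanical, so it can be expressed as a pair of small conversions. The sketch below is illustrative only: `OptionNames`, `toCamelCase`, and `toFlag` are hypothetical names, and the generated option classes may well hard-code the names instead of deriving them at runtime.

```java
/** Hypothetical helper converting between CLI flag names and Java property names. */
final class OptionNames {
    private OptionNames() {}

    /** "--ocr-language" becomes "ocrLanguage". */
    static String toCamelCase(String flag) {
        String name = flag.startsWith("--") ? flag.substring(2) : flag;
        StringBuilder out = new StringBuilder(name.length());
        boolean upperNext = false;
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '-') {
                // Drop the hyphen and capitalize the character that follows it.
                upperNext = true;
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** "ocrLanguage" becomes "--ocr-language". */
    static String toFlag(String property) {
        StringBuilder out = new StringBuilder("--");
        for (int i = 0; i < property.length(); i++) {
            char c = property.charAt(i);
            if (Character.isUpperCase(c)) {
                // Each uppercase letter starts a new hyphen-separated word.
                out.append('-').append(Character.toLowerCase(c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Round-tripping every flag in the list through both functions is a cheap way to catch an option whose Java name drifts from its CLI spelling.
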
## WARN Items

None. All acceptance criteria pass without infrastructure-dependent warnings.

## Test Results

```
[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

Test breakdown:

- `PdftractTest`: 17 tests (method signatures, option parsing, source types)
- `AutoCloseableTest`: 9 tests (process cleanup, try-with-resources)
- `ConformanceTest`: 1 test (runner implementation; fixtures not in this repo)

## References

- Plan: SDK Architecture / The Ten SDKs (line 3475)
- Contract: `docs/notes/sdk-contract.md`
- Conformance suite: `tests/sdk-conformance/cases.json` (in main pdftract repo)
- Argo workflow: `pdftract-java-publish` (in declarative-config)

## Next Steps

1. Publish to Maven Central via OSSRH (requires GPG key from OpenBao)
2. Link conformance results in README when CI runs
3. Update version to 1.0.0 for initial release