pdftract SDK Contract
Scope
This document is the constitutional specification for all pdftract SDK implementations. Every SDK in every language MUST implement the contract exactly as written; deviations are bugs in the SDK, not in this document. The contract defines the method surface, error mapping, versioning compatibility, option-naming conventions, native type requirements, and conformance enforcement that make it possible for a single maintainer to maintain ten SDKs.
Method surface
All SDKs expose nine methods mirroring CLI subcommands and MCP tools.
| Method | Maps to CLI | Maps to MCP tool |
|---|---|---|
| extract(source, options) -> Document | pdftract extract --json | extract |
| extract_text(source, options) -> string | pdftract extract --text | extract_text |
| extract_markdown(source, options) -> string | pdftract extract --md | extract_markdown |
| extract_stream(source, options) -> Iterator<Page> | pdftract extract --ndjson | (n/a) |
| search(source, pattern, options) -> Iterator<Match> | pdftract grep | search |
| get_metadata(source, options) -> Metadata | pdftract extract --metadata-only | get_metadata |
| hash(source, options) -> Fingerprint | pdftract hash | hash |
| classify(source) -> Classification | pdftract classify | classify |
| verify_receipt(path, receipt) -> bool | pdftract verify-receipt | (n/a) |
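One way an SDK might encode this table is a static lookup from method name to CLI arguments. The sketch below is illustrative only: `CLI_ARGS` and `build_argv` are hypothetical names, and how the source and options are appended to the command line is not specified here, so the helper stops at the subcommand prefix.

```python
# Hypothetical lookup derived from the method-surface table.
CLI_ARGS = {
    "extract": ["extract", "--json"],
    "extract_text": ["extract", "--text"],
    "extract_markdown": ["extract", "--md"],
    "extract_stream": ["extract", "--ndjson"],
    "search": ["grep"],
    "get_metadata": ["extract", "--metadata-only"],
    "hash": ["hash"],
    "classify": ["classify"],
    "verify_receipt": ["verify-receipt"],
}


def build_argv(method: str, binary: str = "pdftract") -> list[str]:
    """Return the argv prefix for a contract method (positional args not included)."""
    if method not in CLI_ARGS:
        raise ValueError(f"not a contract method: {method}")
    return [binary, *CLI_ARGS[method]]
```

Keeping the mapping in one table makes it easy to check that an SDK exposes exactly the nine contract methods and nothing else.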
Method signatures
extract
- Signature: extract(source: Source, options?: ExtractOptions) -> Document
- Honored options: ocr_language, ocr_threshold, preserve_layout, extract_images, image_format, min_image_size
- Return: Document struct (see Return types)
- Errors: All 8 error mappings from Error mapping section
extract_text
- Signature: extract_text(source: Source, options?: ExtractOptions) -> string
- Honored options: ocr_language, ocr_threshold, preserve_layout
- Return: Plain text string (all pages concatenated with \n\n separators)
- Errors: All 8 error mappings from Error mapping section
extract_markdown
- Signature: extract_markdown(source: Source, options?: ExtractOptions) -> string
- Honored options: ocr_language, ocr_threshold, preserve_layout
- Return: Markdown formatted string
- Errors: All 8 error mappings from Error mapping section
extract_stream
- Signature: extract_stream(source: Source, options?: ExtractOptions) -> Iterator<Page>
- Honored options: ocr_language, ocr_threshold, preserve_layout
- Return: Lazy iterator yielding Page structs one at a time
- Errors: All 8 error mappings from Error mapping section; may raise during iteration
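A lazy iterator of this kind can be sketched with a Python generator. The example assumes that `--ndjson` emits one page object per line, which this contract implies but does not spell out; `iter_pages` is a hypothetical helper, and a real SDK would convert each dictionary into a native Page type instead of yielding it raw.

```python
import json
from typing import Iterable, Iterator


def iter_pages(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily yield one parsed page per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            # Parsing happens only when the caller advances the iterator,
            # so a malformed line raises during iteration, not up front.
            yield json.loads(line)
```

Because nothing is parsed until the caller asks for the next page, errors surface mid-iteration, which is why the contract notes that this method may raise during iteration.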
search
- Signature: search(source: Source, pattern: string, options?: SearchOptions) -> Iterator<Match>
- Honored options: case_insensitive, regex, whole_word, max_results
- Return: Lazy iterator yielding Match structs
- Errors: All 8 error mappings from Error mapping section; may raise during iteration
get_metadata
- Signature: get_metadata(source: Source, options?: BaseOptions) -> Metadata
- Honored options: timeout
- Return: Metadata struct
- Errors: Exit codes 0, 4, 5, 6 only
hash
- Signature: hash(source: Source, options?: BaseOptions) -> Fingerprint
- Honored options: timeout
- Return: Fingerprint struct
- Errors: Exit codes 0, 2, 4, 5, 6
classify
- Signature: classify(source: Source) -> Classification
- Honored options: None
- Return: Classification struct
- Errors: Exit codes 0, 2, 4, 5, 6
verify_receipt
- Signature: verify_receipt(path: string, receipt: string) -> bool
- Honored options: None
- Return: true if receipt valid, false otherwise (does NOT raise for exit code 10)
- Errors: Exit codes 0, 2, 4 only
Source argument types
The source parameter for all methods (except verify_receipt) accepts three types:
| Type | Format | Example |
|---|---|---|
| Path | Local filesystem path (absolute or relative) | "/path/to/file.pdf" or "./doc.pdf" |
| URL | Remote URL with http:// or https:// scheme | "https://example.com/doc.pdf" |
| Bytes | Language-native byte buffer | bytes (Python), Buffer (Node), byte[] (Java/C#), []byte (Go) |
SDKs MAY add language-specific overloads for path-like types (e.g. java.io.File, System.IO.FileInfo, pathlib.Path) without violating the contract. The contract requires the three core types; additional conveniences are permitted.
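As an illustration, a Python SDK might dispatch on the source argument roughly as follows. `source_kind` is a hypothetical helper, not part of the contract; treating any string without an http or https scheme as a path is an assumption of this sketch.

```python
from urllib.parse import urlparse


def source_kind(source) -> str:
    """Classify a source argument as 'bytes', 'url', or 'path'."""
    if isinstance(source, (bytes, bytearray, memoryview)):
        return "bytes"
    if isinstance(source, str):
        scheme = urlparse(source).scheme.lower()
        # Only http:// and https:// count as URLs; everything else is a path.
        return "url" if scheme in ("http", "https") else "path"
    if hasattr(source, "__fspath__"):
        # Path-like convenience overload, e.g. pathlib.Path.
        return "path"
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

The path-like branch corresponds to the optional language-specific overloads; the three core branches are the ones the contract requires.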
Options object
The options object is the union of all CLI flags. Individual methods document which options they honor. All options are optional; default values match the CLI defaults.
BaseOptions (all methods)
| Option | Type | Default | Description |
|---|---|---|---|
| timeout | number | 30 | Maximum seconds to wait for the operation |
ExtractOptions (extract, extract_text, extract_markdown, extract_stream)
| Option | Type | Default | Description |
|---|---|---|---|
| ocr_language | string | "eng" | ISO 639-3 language code for OCR |
| ocr_threshold | number | 0.7 | Confidence threshold (0-1) for accepting OCR text |
| preserve_layout | boolean | false | Preserve original reading order and layout |
| extract_images | boolean | false | Extract embedded images |
| image_format | string | "png" | Format for extracted images: png, jpg, webp |
| min_image_size | number | 64 | Minimum dimension (pixels) for image extraction |
SearchOptions (search)
| Option | Type | Default | Description |
|---|---|---|---|
| case_insensitive | boolean | false | Ignore case when matching |
| regex | boolean | false | Treat pattern as regular expression |
| whole_word | boolean | false | Match only whole words |
| max_results | number | null | Maximum matches to return (null = unlimited) |
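A typed options object with CLI-matching defaults might look like the sketch below, shown for SearchOptions. The `to_cli_flags` helper is hypothetical, and the way options are serialized (bare switches for booleans, a separate argument for values, defaults omitted) is an assumption rather than something this contract specifies.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SearchOptions:
    # Defaults mirror the SearchOptions table.
    case_insensitive: bool = False
    regex: bool = False
    whole_word: bool = False
    max_results: Optional[int] = None  # None means unlimited


def to_cli_flags(options) -> list[str]:
    """Emit kebab-case flags only for options that differ from their defaults."""
    flags: list[str] = []
    for f in fields(options):
        value = getattr(options, f.name)
        if value == f.default:
            continue
        flag = "--" + f.name.replace("_", "-")
        if isinstance(value, bool):
            flags.append(flag)  # assumed: booleans are bare switches
        else:
            flags.extend([flag, str(value)])
    return flags
```

Omitting defaulted options keeps the SDK from overriding CLI defaults if they ever change within a MAJOR version.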
Return types
All SDKs MUST expose language-native types, NOT raw JSON dictionaries. The type definitions below derive from the JSON schema in docs/schema/v1.0/. SDKs SHOULD generate these types from the schema to guarantee alignment.
Document
{
  schema_version: string; // e.g. "1.0"
  pages: Page[];
  metadata: Metadata;
}
Page
{
  page: number; // 1-indexed page number
  width: number; // Points (1/72 inch)
  height: number; // Points
  rotation: number; // Degrees (0, 90, 180, 270)
  spans: Span[];
  blocks: Block[];
}
Span
{
  text: string;
  bbox: [number, number, number, number]; // [x0, y0, x1, y1] in points
  font: string;
  size: number; // Font size in points
  confidence: number | null; // 0-1, null for non-OCR text
}
Block
{
  kind: 'paragraph' | 'heading' | 'table' | 'figure' | 'list';
  text: string;
  bbox: [number, number, number, number];
  level?: number; // For headings (1-6) and lists (nested depth)
}
Match
{
  text: string; // The matched text
  page: number; // Page number where match occurred
  bbox: [number, number, number, number]; // Location of the match
  context: { // 50 chars before/after
    before: string;
    after: string;
  }
}
Fingerprint
{
  hash: string; // SHA-256 hex of document content
  page_count: number;
  fast_hash: string; // BLAKE3 hex of first 10KB
  metadata: Metadata;
}
Classification
{
  category: string; // Primary category
  confidence: number; // 0-1
  tags: string[];
  heuristics: Record<string, boolean>; // Individual feature detections
}
Metadata
{
  title?: string;
  author?: string;
  subject?: string;
  keywords?: string[];
  creator?: string;
  producer?: string;
  created?: string; // ISO 8601 date
  modified?: string; // ISO 8601 date
  page_count: number;
}
Error mapping
All SDKs map CLI exit codes to language-native exception classes. The mapping is exhaustive; new error kinds require a contract bump and coordinated SDK release.
| Exit code | Meaning | Native exception |
|---|---|---|
| 0 | Success | (no exception) |
| 2 | Corrupt PDF | CorruptPdfError |
| 3 | Encrypted / password missing/wrong | EncryptionError |
| 4 | Source unreadable | SourceUnreachableError |
| 5 | Network interrupted | RemoteFetchInterruptedError |
| 6 | TLS / cert failure | TlsError |
| 10 | Receipt verify failed | ReceiptVerifyError |
| any other non-zero | Internal | PdftractError (base) |
Per-language base exception types
Every language-specific exception inherits from a single base type following language conventions:
| Language | Base type |
|---|---|
| Python | class PdftractError(Exception) |
| Node.js | class PdftractError extends Error |
| Java | class PdftractException extends Exception |
| C# | class PdftractException : Exception |
| Go | Single error type with errors.As-compatible kind |
| Ruby | class PdftractError < StandardError |
| PHP | class PdftractException extends Exception |
| Rust | enum PdftractError (all variants in one enum) |
| C++ | class PdftractException : std::runtime_error |
Exception properties
All exceptions MUST expose:
| Property | Type | Description |
|---|---|---|
| exit_code | number | The CLI exit code |
| message | string | Human-readable description |
| stderr | string? | Raw stderr output from CLI (if available) |
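Taken together, the exit-code table and the required properties could be modelled in Python roughly as follows. The class names come from the error-mapping table; `error_for_exit`, the lookup dictionary, and the fallback message text are hypothetical choices made for this sketch.

```python
from typing import Optional


class PdftractError(Exception):
    """Base type; also used directly for any unlisted non-zero exit code."""

    def __init__(self, exit_code: int, message: str, stderr: Optional[str] = None):
        super().__init__(message)
        self.exit_code = exit_code
        self.message = message
        self.stderr = stderr


class CorruptPdfError(PdftractError): pass
class EncryptionError(PdftractError): pass
class SourceUnreachableError(PdftractError): pass
class RemoteFetchInterruptedError(PdftractError): pass
class TlsError(PdftractError): pass
class ReceiptVerifyError(PdftractError): pass


_ERROR_BY_EXIT_CODE = {
    2: CorruptPdfError,
    3: EncryptionError,
    4: SourceUnreachableError,
    5: RemoteFetchInterruptedError,
    6: TlsError,
    10: ReceiptVerifyError,
}


def error_for_exit(exit_code: int, stderr: Optional[str] = None) -> Optional[PdftractError]:
    """Return the exception for a CLI exit code, or None on success."""
    if exit_code == 0:
        return None
    cls = _ERROR_BY_EXIT_CODE.get(exit_code, PdftractError)
    # Assumed fallback: use stderr as the message when the CLI printed one.
    message = (stderr or "").strip() or f"pdftract exited with code {exit_code}"
    return cls(exit_code, message, stderr)
```

Callers can catch `PdftractError` to handle every failure, or a specific subclass when only one failure mode is recoverable.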
Versioning compatibility
SDK version is pinned to binary version via semantic versioning.
MAJOR version lock
- SDK MAJOR MUST match binary MAJOR exactly
- @pdftract/sdk@1.x.y works with pdftract@1.0.0 through pdftract@1.x.x
- @pdftract/sdk@1.x.y MUST refuse to invoke pdftract@2.0.0 with a clear startup error
- The error message MUST indicate version mismatch and suggest installing the correct binary
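The MAJOR lock amounts to a small startup check. The following is a minimal sketch, assuming both versions are available as plain semantic-version strings; `VersionMismatchError` and `check_major` are hypothetical names, since the contract only requires a clear startup error.

```python
class VersionMismatchError(RuntimeError):
    """Hypothetical error type for an SDK/binary MAJOR mismatch."""


def check_major(sdk_version: str, binary_version: str) -> None:
    """Raise if the SDK and binary MAJOR versions differ."""
    sdk_major = int(sdk_version.lstrip("v").split(".")[0])
    binary_major = int(binary_version.lstrip("v").split(".")[0])
    if sdk_major != binary_major:
        raise VersionMismatchError(
            f"pdftract SDK {sdk_version} requires a {sdk_major}.x binary, "
            f"but found pdftract {binary_version}; install a matching binary"
        )
```

Running the check once at client construction, before any method is invoked, gives the "clear startup error" behaviour described above.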
MINOR version flexibility
- SDK MINOR MAY add wrappers for new binary features behind feature flags
- Calling a method whose underlying CLI subcommand the binary doesn't recognize raises UnsupportedOperationError
- Example: SDK 1.3 calling hash() when binary 1.0 lacks the hash subcommand raises UnsupportedOperationError
Binary resolution
The SDK constructor or client initialization follows this resolution order:
- Explicit binary path (if provided via constructor/client config)
- Probe PATH for the pdftract executable
- Download matching binary version into per-user cache (opt-in via auto_install=true)
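The resolution order can be sketched as a single function. This is an illustration under stated assumptions: `resolve_binary` and its `download` callback are hypothetical, the explicit path is returned without validation, and raising `FileNotFoundError` when nothing is found is a choice made for this sketch. The `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil
from typing import Callable, Optional


def resolve_binary(
    explicit_path: Optional[str] = None,
    *,
    auto_install: bool = False,
    download: Optional[Callable[[], str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Resolve the pdftract binary: explicit path, then PATH, then opt-in download."""
    # 1. An explicit path from the constructor or client config always wins.
    if explicit_path is not None:
        return explicit_path
    # 2. Probe PATH.
    found = which("pdftract")
    if found:
        return found
    # 3. Download into the per-user cache, only when the caller opted in.
    if auto_install and download is not None:
        return download()
    raise FileNotFoundError("pdftract binary not found; pass a path or enable auto_install")
```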
Download URL format:
https://github.com/jedarden/pdftract/releases/download/v{VERSION}/pdftract-{TARGET}.tar.gz
Where {TARGET} is one of:
- x86_64-unknown-linux-gnu
- aarch64-unknown-linux-gnu
- x86_64-apple-darwin
- aarch64-apple-darwin
- x86_64-pc-windows-msvc
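Building the download URL is a direct substitution into the format above. In this sketch, `download_url` and `SUPPORTED_TARGETS` are hypothetical names; rejecting unknown targets with `ValueError` is an assumption.

```python
SUPPORTED_TARGETS = (
    "x86_64-unknown-linux-gnu",
    "aarch64-unknown-linux-gnu",
    "x86_64-apple-darwin",
    "aarch64-apple-darwin",
    "x86_64-pc-windows-msvc",
)


def download_url(version: str, target: str) -> str:
    """Fill {VERSION} and {TARGET} into the release URL format."""
    if target not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported target: {target}")
    return (
        "https://github.com/jedarden/pdftract/releases/download/"
        f"v{version}/pdftract-{target}.tar.gz"
    )
```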
Option-naming conventions
CLI flags use kebab-case. Each SDK converts to its language's conventional case.
| CLI flag | Python | Node.js | Go | Java | C# | C |
|---|---|---|---|---|---|---|
| --ocr-language | ocr_language | ocrLanguage | OCRLanguage | ocrLanguage | OcrLanguage | ocr_language |
| --ocr-threshold | ocr_threshold | ocrThreshold | OCRThreshold | ocrThreshold | OcrThreshold | ocr_threshold |
| --preserve-layout | preserve_layout | preserveLayout | PreserveLayout | preserveLayout | PreserveLayout | preserve_layout |
| --extract-images | extract_images | extractImages | ExtractImages | extractImages | ExtractImages | extract_images |
| --image-format | image_format | imageFormat | ImageFormat | imageFormat | ImageFormat | image_format |
| --min-image-size | min_image_size | minImageSize | MinImageSize | minImageSize | MinImageSize | min_image_size |
| --case-insensitive | case_insensitive | caseInsensitive | CaseInsensitive | caseInsensitive | CaseInsensitive | case_insensitive |
| --whole-word | whole_word | wholeWord | WholeWord | wholeWord | WholeWord | whole_word |
| --max-results | max_results | maxResults | MaxResults | maxResults | MaxResults | max_results |
C bindings keep snake_case for FFI ABI stability.
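The conversions in the table are mechanical and could be generated instead of written by hand. The helpers below are a hypothetical sketch; treating "ocr" as the only Go initialism is an assumption based solely on the rows shown in the table.

```python
# Assumed: the only initialism that Go upper-cases in the table above.
GO_INITIALISMS = {"ocr": "OCR"}


def flag_words(flag: str) -> list[str]:
    """Split a kebab-case CLI flag such as --ocr-language into its words."""
    return flag.lstrip("-").split("-")


def snake_case(flag: str) -> str:  # Python, C
    return "_".join(flag_words(flag))


def camel_case(flag: str) -> str:  # Node.js, Java
    first, *rest = flag_words(flag)
    return first + "".join(word.capitalize() for word in rest)


def pascal_case(flag: str) -> str:  # C#
    return "".join(word.capitalize() for word in flag_words(flag))


def go_case(flag: str) -> str:  # Go: PascalCase with upper-cased initialisms
    return "".join(GO_INITIALISMS.get(word, word.capitalize()) for word in flag_words(flag))
```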
Native type mapping
SDKs MUST expose return types as language-native structs/classes. An SDK that returns Map<String, Object> or Dict[str, Any] fails the contract even if the data is correct.
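For example, a Python SDK could satisfy this requirement with a dataclass and an explicit conversion step instead of handing back the parsed JSON. The sketch below covers only Span; the `from_json` constructor is a hypothetical name, and a generated type from the schema would normally replace this hand-written one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    text: str
    bbox: tuple[float, float, float, float]  # [x0, y0, x1, y1] in points
    font: str
    size: float
    confidence: Optional[float]  # None for non-OCR text

    @classmethod
    def from_json(cls, data: dict) -> "Span":
        """Convert one parsed JSON object into a native Span."""
        return cls(
            text=data["text"],
            bbox=tuple(data["bbox"]),
            font=data["font"],
            size=data["size"],
            confidence=data.get("confidence"),
        )
```

Callers then get attribute access and editor completion (`span.bbox`, `span.font`) instead of string-keyed lookups.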
Type generation
The canonical source of truth for type definitions is the JSON schema in docs/schema/v1.0/. SDKs SHOULD generate types from this schema using code generation tools:
| Language | Recommended tool |
|---|---|
| Python | datamodel-code-generator + Pydantic |
| Node.js/TypeScript | json-schema-to-typescript or quicktype |
| Java | jsonschema2pojo |
| C# | NJsonSchema |
| Go | jsonschema |
| Rust | serde_json + manual structs |
| Ruby | json_schemer + dry-struct |
Async conventions
Each SDK follows its language's idiomatic async pattern.
| Language | Async pattern |
|---|---|
| Node.js | All methods return Promise<T> |
| Python | Optional asyncio via asyncio.to_thread; sync methods by default |
| Go | context.Context passed for cancellation |
| C# | All methods return Task<T> |
| Java | Optional CompletableFuture<T>; sync methods by default |
| Rust | async fn with tokio/async-std runtime |
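For the Python row, the optional async surface can be a thin wrapper over the sync method. In this sketch `extract_text` is a stand-in stub, not the real SDK call, and the `_async` suffix is a hypothetical naming choice.

```python
import asyncio


def extract_text(source: str) -> str:
    """Stand-in for the blocking SDK call; a real one would invoke the CLI."""
    return f"text of {source}"


async def extract_text_async(source: str) -> str:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(extract_text, source)
```

Usage from async code would be `text = await extract_text_async("doc.pdf")`; the sync method remains the default entry point.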
Conformance
All SDKs MUST pass the conformance suite before release. The suite lives at tests/sdk-conformance/cases.json and defines language-neutral test cases.
Conformance format
Each test case specifies:
{
  "id": "extract-vector-academic-paper",
  "fixture": "fixtures/vector/academic-paper-2col.pdf",
  "method": "extract",
  "options": { "ocr_language": "eng" },
  "assertions": {
    "page_count": 2,
    "has_title": true,
    "has_blocks": true
  }
}
Gating rule
100% of conformance tests MUST pass before an SDK is published to its package registry. A failing test is a release blocker.
Running conformance
Each SDK repo includes a test runner that:
- Parses cases.json
- Executes each case against the SDK
- Compares assertions to actual results
- Reports pass/fail per case
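The runner steps above can be sketched as follows. How assertion keys such as has_title are evaluated is not defined in this document, so the sketch assumes a hypothetical `execute` callback that runs the SDK method and returns a dictionary of observed values keyed like the case's assertions. Loading cases.json with `json.load` is left to the caller.

```python
from typing import Callable


def run_case(case: dict, execute: Callable[[str, str, dict], dict]) -> list[str]:
    """Return failure messages for one case; an empty list means it passed."""
    observed = execute(case["method"], case["fixture"], case.get("options", {}))
    failures = []
    for key, expected in case["assertions"].items():
        actual = observed.get(key)
        if actual != expected:
            failures.append(f"{case['id']}: {key} expected {expected!r}, got {actual!r}")
    return failures


def run_suite(cases: list[dict], execute: Callable[[str, str, dict], dict]) -> dict[str, bool]:
    """Report pass/fail per case id."""
    return {case["id"]: not run_case(case, execute) for case in cases}
```

Under the gating rule, a release script would publish only when every value in the `run_suite` result is true.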
Change policy
The contract is versioned with the schema_version field. Changing the contract requires:
- An Architecture Decision Record (ADR) describing the change
- A coordinated PR wave against all SDK repos
- A schema_version bump in docs/schema/v1.0/
- Updated conformance tests for new behavior
- A milestone release of all SDKs before the next binary release
Minor additions (e.g., a new optional field in Metadata) MAY be backward-compatible. Breaking changes (e.g., renaming a field) trigger a MAJOR version bump across all SDKs.