pdftract/docs/notes/sdk-contract.md
jedarden 5e66846288 docs(pdftract-147a): author SDK contract specification
Add comprehensive SDK contract specification at docs/notes/sdk-contract.md.
This document serves as the constitutional specification for all pdftract
SDK implementations across all languages.

The contract defines:
- Method surface (9 methods mirroring CLI/MCP tools)
- Error mapping (CLI exit codes → native exceptions)
- Versioning compatibility rules (MAJOR lock, MINOR flexibility)
- Option-naming conventions (CLI flag → language-native case)
- Native type-mapping requirements (Document, Page, Span, Block, Match, Fingerprint, Classification)
- Async conventions per language
- Conformance enforcement (100% pass required)
- Change policy (ADR required for contract changes)

Verification note: notes/pdftract-147a.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:13:55 -04:00


pdftract SDK Contract

Scope

This document is the constitutional specification for all pdftract SDK implementations. Every SDK in every language MUST implement the contract exactly as written; deviations are bugs in the SDK, not in this document. The contract defines the method surface, error mapping, versioning compatibility, option-naming conventions, native type requirements, and conformance enforcement that make it possible for a single maintainer to maintain ten SDKs.

Method surface

All SDKs expose nine methods mirroring CLI subcommands and MCP tools.

| Method | Maps to CLI | Maps to MCP tool |
| --- | --- | --- |
| extract(source, options) -> Document | pdftract extract --json | extract |
| extract_text(source, options) -> string | pdftract extract --text | extract_text |
| extract_markdown(source, options) -> string | pdftract extract --md | extract_markdown |
| extract_stream(source, options) -> Iterator<Page> | pdftract extract --ndjson | (n/a) |
| search(source, pattern, options) -> Iterator<Match> | pdftract grep | search |
| get_metadata(source, options) -> Metadata | pdftract extract --metadata-only | get_metadata |
| hash(source, options) -> Fingerprint | pdftract hash | hash |
| classify(source) -> Classification | pdftract classify | classify |
| verify_receipt(path, receipt) -> bool | pdftract verify-receipt | (n/a) |
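As a rough illustration of the table above, an SDK that shells out to the binary could keep the method-to-subcommand mapping in one lookup table. The names `CLI_ARGS` and `build_argv` are hypothetical and not part of the contract; only the subcommands and flags come from the table.

```python
# Hypothetical lookup table: SDK method name -> CLI arguments from the table above.
CLI_ARGS = {
    "extract": ["extract", "--json"],
    "extract_text": ["extract", "--text"],
    "extract_markdown": ["extract", "--md"],
    "extract_stream": ["extract", "--ndjson"],
    "search": ["grep"],
    "get_metadata": ["extract", "--metadata-only"],
    "hash": ["hash"],
    "classify": ["classify"],
    "verify_receipt": ["verify-receipt"],
}


def build_argv(method, binary="pdftract"):
    """Return the command prefix for a method.

    Positional arguments and option flags would be appended by the caller;
    their exact order on the command line is not specified here.
    """
    if method not in CLI_ARGS:
        raise ValueError(f"unknown SDK method: {method}")
    return [binary, *CLI_ARGS[method]]
```

Keeping the mapping in data rather than in nine separate functions makes it easy to check that an SDK exposes exactly the nine contract methods.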

Method signatures

extract

  • Signature: extract(source: Source, options?: ExtractOptions) -> Document
  • Honored options: ocr_language, ocr_threshold, preserve_layout, extract_images, image_format, min_image_size
  • Return: Document struct (see Return types)
  • Errors: All 8 error mappings from Error mapping section

extract_text

  • Signature: extract_text(source: Source, options?: ExtractOptions) -> string
  • Honored options: ocr_language, ocr_threshold, preserve_layout
  • Return: Plain text string (all pages concatenated with \n\n separators)
  • Errors: All 8 error mappings from Error mapping section

extract_markdown

  • Signature: extract_markdown(source: Source, options?: ExtractOptions) -> string
  • Honored options: ocr_language, ocr_threshold, preserve_layout
  • Return: Markdown formatted string
  • Errors: All 8 error mappings from Error mapping section

extract_stream

  • Signature: extract_stream(source: Source, options?: ExtractOptions) -> Iterator<Page>
  • Honored options: ocr_language, ocr_threshold, preserve_layout
  • Return: Lazy iterator yielding Page structs one at a time
  • Errors: All 8 error mappings from Error mapping section; may raise during iteration
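A minimal sketch of the lazy behaviour described for extract_stream, assuming that `--ndjson` emits one JSON object per page per line (the contract does not spell out the wire format). The helper `iter_pages` is hypothetical, and it yields plain dicts where a real SDK would yield native Page objects.

```python
import io
import json
from typing import Iterator, TextIO


def iter_pages(stream: TextIO) -> Iterator[dict]:
    """Yield one decoded page per NDJSON line without buffering the whole output."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        # A malformed line raises here, i.e. during iteration, not at call time.
        yield json.loads(line)


# Usage with an in-memory stream standing in for the CLI's stdout.
pages = list(iter_pages(io.StringIO('{"page": 1}\n\n{"page": 2}\n')))
```

Because the function is a generator, nothing is read until the caller starts iterating, which is why errors can surface mid-iteration.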

search

  • Signature: search(source: Source, pattern: string, options?: SearchOptions) -> Iterator<Match>
  • Honored options: case_insensitive, regex, whole_word, max_results
  • Return: Lazy iterator yielding Match structs
  • Errors: All 8 error mappings from Error mapping section; may raise during iteration

get_metadata

  • Signature: get_metadata(source: Source, options?: BaseOptions) -> Metadata
  • Honored options: timeout
  • Return: Metadata struct
  • Errors: Exit codes 0, 4, 5, 6 only

hash

  • Signature: hash(source: Source, options?: BaseOptions) -> Fingerprint
  • Honored options: timeout
  • Return: Fingerprint struct
  • Errors: Exit codes 0, 2, 4, 5, 6

classify

  • Signature: classify(source: Source) -> Classification
  • Honored options: None
  • Return: Classification struct
  • Errors: Exit codes 0, 2, 4, 5, 6

verify_receipt

  • Signature: verify_receipt(path: string, receipt: string) -> bool
  • Honored options: None
  • Return: true if receipt valid, false otherwise (does NOT raise for exit code 10)
  • Errors: Exit codes 0, 2, 4 only
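The special case for verify_receipt (exit code 10 becomes `false` instead of an exception) can be sketched as below. `receipt_result` is a hypothetical helper, and `RuntimeError` stands in for whatever mapped exception the SDK would raise for the other codes.

```python
def receipt_result(exit_code: int) -> bool:
    """Translate a verify-receipt exit code into the boolean the contract requires."""
    if exit_code == 0:
        return True   # receipt valid
    if exit_code == 10:
        return False  # receipt invalid: reported as a value, not raised
    # Codes 2 and 4 would go through the normal exception mapping;
    # RuntimeError is only a placeholder in this sketch.
    raise RuntimeError(f"verify-receipt failed with exit code {exit_code}")
```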

Source argument types

The source parameter for all methods (except verify_receipt) accepts three types:

| Type | Format | Example |
| --- | --- | --- |
| Path | Local filesystem path (absolute or relative) | "/path/to/file.pdf" or "./doc.pdf" |
| URL | Remote URL with http:// or https:// scheme | "https://example.com/doc.pdf" |
| Bytes | Language-native byte buffer | bytes (Python), Buffer (Node), byte[] (Java/C#), []byte (Go) |

SDKs MAY add language-specific overloads for path-like types (e.g. java.io.File, System.IO.FileInfo, pathlib.Path) without violating the contract. The contract requires the three core types; additional conveniences are permitted.
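One way a Python SDK might tell the three source types apart is sketched here. The function name `source_kind` is hypothetical; treating any string without an http(s) scheme as a path is an assumption of this sketch, not something the contract states.

```python
from os import PathLike
from pathlib import Path
from typing import Union

Source = Union[str, bytes, bytearray, memoryview, PathLike]


def source_kind(source: Source) -> str:
    """Classify a source argument as 'bytes', 'url', or 'path'."""
    if isinstance(source, (bytes, bytearray, memoryview)):
        return "bytes"
    if isinstance(source, PathLike):
        return "path"  # covers pathlib.Path, the permitted convenience overload
    if isinstance(source, str):
        if source.lower().startswith(("http://", "https://")):
            return "url"
        return "path"
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```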

Options object

The options object is the union of all CLI flags. Individual methods document which options they honor. All options are optional; default values match the CLI defaults.

BaseOptions (all methods that accept options)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| timeout | number | 30 | Maximum seconds to wait for the operation |

ExtractOptions (extract, extract_text, extract_markdown, extract_stream)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| ocr_language | string | "eng" | ISO 639-3 language code for OCR |
| ocr_threshold | number | 0.7 | Confidence threshold (0-1) for accepting OCR text |
| preserve_layout | boolean | false | Preserve original reading order and layout |
| extract_images | boolean | false | Extract embedded images |
| image_format | string | "png" | Format for extracted images: png, jpg, webp |
| min_image_size | number | 64 | Minimum dimension (pixels) for image extraction |

SearchOptions (search)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| case_insensitive | boolean | false | Ignore case when matching |
| regex | boolean | false | Treat pattern as regular expression |
| whole_word | boolean | false | Match only whole words |
| max_results | number | null | Maximum matches to return (null = unlimited) |
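The option tables translate naturally into typed option objects with the documented defaults. The sketch below assumes that ExtractOptions and SearchOptions extend BaseOptions (the contract only says the options object is a union of CLI flags), and the validation in `__post_init__` is an illustrative addition based on the stated ranges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseOptions:
    timeout: float = 30  # seconds


@dataclass(frozen=True)
class ExtractOptions(BaseOptions):
    ocr_language: str = "eng"
    ocr_threshold: float = 0.7
    preserve_layout: bool = False
    extract_images: bool = False
    image_format: str = "png"
    min_image_size: int = 64

    def __post_init__(self):
        # Assumed client-side checks derived from the table's descriptions.
        if not 0 <= self.ocr_threshold <= 1:
            raise ValueError("ocr_threshold must be between 0 and 1")
        if self.image_format not in ("png", "jpg", "webp"):
            raise ValueError("image_format must be png, jpg, or webp")


@dataclass(frozen=True)
class SearchOptions(BaseOptions):
    case_insensitive: bool = False
    regex: bool = False
    whole_word: bool = False
    max_results: Optional[int] = None  # None means unlimited
```

Constructing `ExtractOptions()` with no arguments then yields exactly the CLI defaults, which is the behaviour the contract asks for.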

Return types

All SDKs MUST expose language-native types, NOT raw JSON dictionaries. The type definitions below derive from the JSON schema in docs/schema/v1.0/. SDKs SHOULD generate these types from the schema to guarantee alignment.

Document

{
  schema_version: string;           // e.g. "1.0"
  pages: Page[];
  metadata: Metadata;
}

Page

{
  page: number;                     // 1-indexed page number
  width: number;                    // Points (1/72 inch)
  height: number;                   // Points
  rotation: number;                 // Degrees (0, 90, 180, 270)
  spans: Span[];
  blocks: Block[];
}

Span

{
  text: string;
  bbox: [number, number, number, number];  // [x0, y0, x1, y1] in points
  font: string;
  size: number;                     // Font size in points
  confidence: number | null;        // 0-1, null for non-OCR text
}

Block

{
  kind: 'paragraph' | 'heading' | 'table' | 'figure' | 'list';
  text: string;
  bbox: [number, number, number, number];
  level?: number;                   // For headings (1-6) and lists (nested depth)
}

Match

{
  text: string;                     // The matched text
  page: number;                     // Page number where match occurred
  bbox: [number, number, number, number];  // Location of the match
  context: {                       // 50 chars before/after
    before: string;
    after: string;
  }
}

Fingerprint

{
  hash: string;                     // SHA-256 hex of document content
  page_count: number;
  fast_hash: string;                // BLAKE3 hex of first 10KB
  metadata: Metadata;
}

Classification

{
  category: string;                 // Primary category
  confidence: number;               // 0-1
  tags: string[];
  heuristics: Record<string, boolean>;  // Individual feature detections
}

Metadata

{
  title?: string;
  author?: string;
  subject?: string;
  keywords?: string[];
  creator?: string;
  producer?: string;
  created?: string;                 // ISO 8601 date
  modified?: string;                // ISO 8601 date
  page_count: number;
}
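To illustrate what "language-native types, not raw JSON dictionaries" could look like in Python, here is a hand-written sketch for Span and Metadata. The contract recommends generating such types from the schema; the `from_json` constructors and the choice to ignore unknown keys are assumptions of this sketch.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # [x0, y0, x1, y1] in points


@dataclass(frozen=True)
class Span:
    text: str
    bbox: BBox
    font: str
    size: float
    confidence: Optional[float] = None  # None for non-OCR text

    @classmethod
    def from_json(cls, raw: dict) -> "Span":
        x0, y0, x1, y1 = raw["bbox"]  # fails loudly if bbox is not 4 numbers
        return cls(raw["text"], (x0, y0, x1, y1), raw["font"], raw["size"],
                   raw.get("confidence"))


@dataclass(frozen=True)
class Metadata:
    page_count: int
    title: Optional[str] = None
    author: Optional[str] = None
    subject: Optional[str] = None
    keywords: Optional[List[str]] = None
    creator: Optional[str] = None
    producer: Optional[str] = None
    created: Optional[str] = None   # ISO 8601 date
    modified: Optional[str] = None  # ISO 8601 date

    @classmethod
    def from_json(cls, raw: dict) -> "Metadata":
        known = {f.name for f in fields(cls)}
        # Unknown keys are dropped so a newer binary's extra fields do not break parsing.
        return cls(**{k: v for k, v in raw.items() if k in known})
```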

Error mapping

All SDKs map CLI exit codes to language-native exception classes. The mapping is exhaustive; new error kinds require a contract bump and coordinated SDK release.

| Exit code | Meaning | Native exception |
| --- | --- | --- |
| 0 | Success | (no exception) |
| 2 | Corrupt PDF | CorruptPdfError |
| 3 | Encrypted / password missing/wrong | EncryptionError |
| 4 | Source unreadable | SourceUnreachableError |
| 5 | Network interrupted | RemoteFetchInterruptedError |
| 6 | TLS / cert failure | TlsError |
| 10 | Receipt verify failed | ReceiptVerifyError |
| any other non-zero | Internal | PdftractError (base) |

Per-language base exception types

Every language-specific exception inherits from a single base type following language conventions:

| Language | Base type |
| --- | --- |
| Python | class PdftractError(Exception) |
| Node.js | class PdftractError extends Error |
| Java | class PdftractException extends Exception |
| C# | class PdftractException : Exception |
| Go | Single error type with errors.As-compatible kind |
| Ruby | class PdftractError < StandardError |
| PHP | class PdftractException extends Exception |
| Rust | enum PdftractError (all variants in one enum) |
| C++ | class PdftractException : std::runtime_error |

Exception properties

All exceptions MUST expose:

| Property | Type | Description |
| --- | --- | --- |
| exit_code | number | The CLI exit code |
| message | string | Human-readable description |
| stderr | string? | Raw stderr output from CLI (if available) |
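A Python rendering of the error mapping and the required exception properties might look like the following sketch. The exception class names come from the table above; the helper `error_for_exit` and the fallback message text are hypothetical.

```python
from typing import Optional


class PdftractError(Exception):
    """Base type; also used directly for unmapped non-zero exit codes."""

    def __init__(self, exit_code: int, message: str, stderr: Optional[str] = None):
        super().__init__(message)
        self.exit_code = exit_code
        self.message = message
        self.stderr = stderr


class CorruptPdfError(PdftractError): pass
class EncryptionError(PdftractError): pass
class SourceUnreachableError(PdftractError): pass
class RemoteFetchInterruptedError(PdftractError): pass
class TlsError(PdftractError): pass
class ReceiptVerifyError(PdftractError): pass


EXIT_CODE_ERRORS = {
    2: CorruptPdfError,
    3: EncryptionError,
    4: SourceUnreachableError,
    5: RemoteFetchInterruptedError,
    6: TlsError,
    10: ReceiptVerifyError,
}


def error_for_exit(exit_code: int, stderr: Optional[str] = None) -> Optional[PdftractError]:
    """Return the exception for an exit code, or None on success (exit code 0)."""
    if exit_code == 0:
        return None
    cls = EXIT_CODE_ERRORS.get(exit_code, PdftractError)
    message = (stderr or "").strip() or f"pdftract exited with code {exit_code}"
    return cls(exit_code, message, stderr)
```

Because every class inherits from `PdftractError`, callers can catch the base type to handle all failures or a subclass to handle one kind.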

Versioning compatibility

SDK version is pinned to binary version via semantic versioning.

MAJOR version lock

  • SDK MAJOR MUST match binary MAJOR exactly
  • @pdftract/sdk@1.x.y works with pdftract@1.0.0 through pdftract@1.x.x
  • @pdftract/sdk@1.x.y MUST refuse to invoke pdftract@2.0.0 with a clear startup error
  • The error message MUST indicate version mismatch and suggest installing the correct binary
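The MAJOR lock can be checked at startup with a few lines. In this sketch `VersionMismatchError`, `parse_version`, and `check_major_lock` are hypothetical names; the contract only requires a clear startup error that names the mismatch and suggests installing the correct binary.

```python
import re
from typing import Tuple


class VersionMismatchError(Exception):
    """Hypothetical error type for an SDK/binary MAJOR mismatch."""


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into integers."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def check_major_lock(sdk_version: str, binary_version: str) -> None:
    """Raise if the binary's MAJOR differs from the SDK's MAJOR."""
    sdk_major = parse_version(sdk_version)[0]
    binary_major = parse_version(binary_version)[0]
    if sdk_major != binary_major:
        raise VersionMismatchError(
            f"SDK {sdk_version} requires a pdftract {sdk_major}.x binary, "
            f"but found {binary_version}; install a matching binary"
        )
```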

MINOR version flexibility

  • SDK MINOR MAY add wrappers for new binary features behind feature flags
  • Calling a method whose underlying CLI subcommand the binary doesn't recognize raises UnsupportedOperationError
  • Example: SDK 1.3 calling hash() when binary 1.0 lacks the hash subcommand raises UnsupportedOperationError

Binary resolution

The SDK constructor or client initialization follows this resolution order:

  1. Explicit binary path (if provided via constructor/client config)
  2. Probe PATH for pdftract executable
  3. Download matching binary version into per-user cache (opt-in via auto_install=true)

Download URL format:

https://github.com/jedarden/pdftract/releases/download/v{VERSION}/pdftract-{TARGET}.tar.gz

Where {TARGET} is one of:

  • x86_64-unknown-linux-gnu
  • aarch64-unknown-linux-gnu
  • x86_64-apple-darwin
  • aarch64-apple-darwin
  • x86_64-pc-windows-msvc
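The resolution order and the download URL format above can be sketched as follows. The function names and the `(kind, path)` return shape are hypothetical; the download step itself is left out, and the `which` parameter exists only so the PATH probe can be replaced in tests.

```python
import shutil
from typing import Callable, Optional, Tuple

RELEASE_URL = (
    "https://github.com/jedarden/pdftract/releases/download/"
    "v{version}/pdftract-{target}.tar.gz"
)
TARGETS = (
    "x86_64-unknown-linux-gnu",
    "aarch64-unknown-linux-gnu",
    "x86_64-apple-darwin",
    "aarch64-apple-darwin",
    "x86_64-pc-windows-msvc",
)


def download_url(version: str, target: str) -> str:
    """Build the release URL for a version and one of the listed targets."""
    if target not in TARGETS:
        raise ValueError(f"unsupported target: {target}")
    return RELEASE_URL.format(version=version, target=target)


def resolve_binary(
    explicit_path: Optional[str] = None,
    auto_install: bool = False,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[str, Optional[str]]:
    """Follow the three-step order: explicit path, then PATH, then opt-in download."""
    if explicit_path is not None:
        return ("explicit", explicit_path)
    found = which("pdftract")
    if found is not None:
        return ("path", found)
    if auto_install:
        return ("download", None)  # caller would fetch download_url(...) into the cache
    raise FileNotFoundError(
        "pdftract binary not found; pass a path or enable auto_install"
    )
```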

Option-naming conventions

CLI flags use kebab-case. Each SDK converts to its language's conventional case.

| CLI flag | Python | Node.js | Go | Java | C# | C |
| --- | --- | --- | --- | --- | --- | --- |
| --ocr-language | ocr_language | ocrLanguage | OCRLanguage | ocrLanguage | OcrLanguage | ocr_language |
| --ocr-threshold | ocr_threshold | ocrThreshold | OCRThreshold | ocrThreshold | OcrThreshold | ocr_threshold |
| --preserve-layout | preserve_layout | preserveLayout | PreserveLayout | preserveLayout | PreserveLayout | preserve_layout |
| --extract-images | extract_images | extractImages | ExtractImages | extractImages | ExtractImages | extract_images |
| --image-format | image_format | imageFormat | ImageFormat | imageFormat | ImageFormat | image_format |
| --min-image-size | min_image_size | minImageSize | MinImageSize | minImageSize | MinImageSize | min_image_size |
| --case-insensitive | case_insensitive | caseInsensitive | CaseInsensitive | caseInsensitive | CaseInsensitive | case_insensitive |
| --whole-word | whole_word | wholeWord | WholeWord | wholeWord | WholeWord | whole_word |
| --max-results | max_results | maxResults | MaxResults | maxResults | MaxResults | max_results |

C bindings keep snake_case for FFI ABI stability.
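The conversions in the table are mechanical, apart from Go's upper-cased initialism for OCR. A small sketch, with hypothetical helper names and an initialism set limited to what the table shows:

```python
def _words(flag: str) -> list:
    """Split '--ocr-language' into ['ocr', 'language']."""
    return flag.lstrip("-").split("-")


def flag_to_snake(flag: str) -> str:
    """Python and C style: ocr_language."""
    return "_".join(_words(flag))


def flag_to_camel(flag: str) -> str:
    """Node.js and Java style: ocrLanguage."""
    head, *rest = _words(flag)
    return head + "".join(word.capitalize() for word in rest)


def flag_to_pascal(flag: str) -> str:
    """C# style: OcrLanguage."""
    return "".join(word.capitalize() for word in _words(flag))


# Only 'ocr' appears as an initialism in the table; other entries would be guesses.
GO_INITIALISMS = {"ocr"}


def flag_to_go(flag: str) -> str:
    """Go style: OCRLanguage (initialisms fully upper-cased)."""
    return "".join(
        word.upper() if word in GO_INITIALISMS else word.capitalize()
        for word in _words(flag)
    )
```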

Native type mapping

SDKs MUST expose return types as language-native structs/classes. An SDK that returns Map<String, Object> or Dict[str, Any] fails the contract even if the data is correct.

Type generation

The canonical source of truth for type definitions is the JSON schema in docs/schema/v1.0/. SDKs SHOULD generate types from this schema using code generation tools:

| Language | Recommended tool |
| --- | --- |
| Python | datamodel-code-generator + Pydantic |
| Node.js/TypeScript | json-schema-to-typescript or quicktype |
| Java | jsonschema2pojo |
| C# | NJsonSchema |
| Go | jsonschema |
| Rust | serde_json + manual structs |
| Ruby | json_schemer + dry-struct |

Async conventions

Each SDK follows its language's idiomatic async pattern.

| Language | Async pattern |
| --- | --- |
| Node.js | All methods return Promise<T> |
| Python | Optional asyncio via asyncio.to_thread; sync methods by default |
| Go | context.Context passed for cancellation |
| C# | All methods return Task<T> |
| Java | Optional CompletableFuture<T>; sync methods by default |
| Rust | async fn with tokio/async-std runtime |
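For the Python row, the pattern of sync methods by default with an optional asyncio wrapper might look like this sketch. The `extract_text` function here is a stand-in that sleeps instead of invoking the binary, and `extract_text_async` is a hypothetical name. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def extract_text(source: str) -> str:
    """Stand-in for the blocking SDK call (a real one would run the CLI)."""
    time.sleep(0.01)
    return f"text of {source}"


async def extract_text_async(source: str) -> str:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(extract_text, source)


async def main() -> list:
    # Both extractions run concurrently; gather preserves argument order.
    return await asyncio.gather(
        extract_text_async("a.pdf"),
        extract_text_async("b.pdf"),
    )


results = asyncio.run(main())
```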

Conformance

All SDKs MUST pass the conformance suite before release. The suite lives at tests/sdk-conformance/cases.json and defines language-neutral test cases.

Conformance format

Each test case specifies:

{
  "id": "extract-vector-academic-paper",
  "fixture": "fixtures/vector/academic-paper-2col.pdf",
  "method": "extract",
  "options": { "ocr_language": "eng" },
  "assertions": {
    "page_count": 2,
    "has_title": true,
    "has_blocks": true
  }
}

Gating rule

100% of conformance tests MUST pass before an SDK is published to its package registry. A failing test is a release blocker.

Running conformance

Each SDK repo includes a test runner that:

  1. Parses cases.json
  2. Executes each case against the SDK
  3. Compares assertions to actual results
  4. Reports pass/fail per case
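The four runner steps can be sketched in a few lines. This assumes cases.json holds a JSON array of cases in the format shown earlier. The meaning of each assertion key (`page_count`, `has_title`, `has_blocks`) is a guess made for illustration, and the fake SDK returns plain dicts only to keep the example short.

```python
import json

CASES_JSON = """[
  {"id": "extract-vector-academic-paper",
   "fixture": "fixtures/vector/academic-paper-2col.pdf",
   "method": "extract",
   "options": {"ocr_language": "eng"},
   "assertions": {"page_count": 2, "has_title": true, "has_blocks": true}}
]"""

# Assumed semantics for each assertion key; the contract does not define them here.
CHECKS = {
    "page_count": lambda doc: len(doc["pages"]),
    "has_title": lambda doc: bool(doc["metadata"].get("title")),
    "has_blocks": lambda doc: any(page["blocks"] for page in doc["pages"]),
}


def run_cases(cases, methods):
    """Execute each case and report pass/fail per case id."""
    results = {}
    for case in cases:
        call = methods[case["method"]]
        actual = call(case["fixture"], case.get("options", {}))
        results[case["id"]] = all(
            CHECKS[name](actual) == expected
            for name, expected in case["assertions"].items()
        )
    return results


def fake_extract(fixture, options):
    """Stub SDK method used only to exercise the runner."""
    return {"pages": [{"blocks": ["p"]}, {"blocks": []}],
            "metadata": {"title": "A Paper"}}


results = run_cases(json.loads(CASES_JSON), {"extract": fake_extract})
release_allowed = all(results.values())  # gating rule: every case must pass
```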

Change policy

The contract is versioned with the schema_version field. Changing the contract requires:

  1. An Architecture Decision Record (ADR) describing the change
  2. A coordinated PR wave against all SDK repos
  3. A schema_version bump in docs/schema/v1.0/
  4. Updated conformance tests for new behavior
  5. A milestone release of all SDKs before the next binary release

Minor additions (e.g., a new optional field in Metadata) MAY be backward-compatible. Breaking changes (e.g., renaming a field) trigger a MAJOR version bump across all SDKs.