pdftract/docs/notes/sdk-architecture.md
jedarden 1791bb6d80 docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment
- Add workspace layout section documenting pdftract-core as the only direct dependency,
  with pdftract-cli, pdftract-py, and pdftract-inspector-ui as siblings
- Update binary distribution table with correct target triples (musl not gnu for Linux)
- Add KU-12 cross-platform test limitation section with verbatim wording from plan:
  "Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release"
- Add Argo CI templates section (pdftract-cargo-build, pdftract-maturin-build)
- Add feature flag composition section with tiers, dependencies, and binary size budgets
- Add cross-references to sdk-invocation.md, sdk-contract.md, ocr-language-packs.md
- Fix clippy warnings in build.rs files (expect_fun_call, get_first, manual_strip, unused imports)

Closes: pdftract-32y9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 06:38:23 -04:00

# SDK Architecture and Language Coverage
## Top 10 language coverage status
Based on Stack Overflow 2024 survey rankings. The existing `sdk-invocation.md` covers Python,
JavaScript, Go, Ruby, Java, Rust, and Bash. Gaps: TypeScript, C#, C++, PHP, and Kotlin.
---
## Workspace layout
The workspace is organized so that `pdftract-core` is the only crate that other consumers depend on directly. The CLI, Python bindings, and inspector UI are siblings that compose `pdftract-core` behind their respective surfaces.
```
pdftract/
├── Cargo.toml                  (workspace root)
├── crates/
│   ├── pdftract-core/          (library; only direct dependency for downstream consumers)
│   ├── pdftract-cli/           (binary)
│   ├── pdftract-py/            (PyO3 bindings, optional feature)
│   └── pdftract-inspector-ui/  (HTML/CSS/JS bundled via include_bytes!, Phase 7.9)
└── docs/
    ├── plan/plan.md
    ├── research/               (per-feature deep dives)
    └── notes/                  (this file, sdk-invocation.md, sdk-contract.md, ocr-language-packs.md)
```
See `docs/plan/plan.md` lines 141-268 for the full file and module layout specification.
---
## Common infrastructure (required before any SDK ships)
### Binary distribution
Every SDK approach — subprocess or native — depends on platform binaries published to GitHub Releases:
| Target triple | Platform |
|---|---|
| `x86_64-unknown-linux-musl` | Linux x86_64 (production binary) |
| `aarch64-unknown-linux-musl` | Linux ARM64 |
| `x86_64-apple-darwin` | macOS Intel |
| `aarch64-apple-darwin` | macOS Apple Silicon |
| `x86_64-pc-windows-gnu` | Windows x86_64 |
The CI workflow must cross-compile for all five targets and attach the binaries to a versioned
GitHub Release tag on every release. SDKs pin to a binary version and download the appropriate
artifact at install time.
### Cross-platform test limitation (KU-12)
Per ADR-009, `iad-ci` is Linux-only. **Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release.** macOS and Windows binaries are *built* via `cross` on Linux but never *executed* in CI. This is acknowledged as Known Unknown KU-12 with the following mitigation:
- A manual smoke-test runbook in `docs/operations/manual-platform-smoke.md` is executed by the release lead before each milestone tag against at least one physical macOS machine and one Windows VM
- User bug reports for platform-specific issues are acknowledged within 48 hours and addressed in the next patch release
See `docs/plan/plan.md` lines 3431-3436 and lines 608-609 for the full KU-12 specification.
### Argo CI templates
Binary and wheel builds are orchestrated by Argo WorkflowTemplates on the `iad-ci` Rackspace Spot cluster:
- `pdftract-cargo-build` — builds the Rust binary for all five target triples using `cross` (Docker-based cross-compilation)
- `pdftract-maturin-build` — builds the PyO3 wheel for all five target triples (uses `ghcr.io/rust-cross/manylinux` for Linux, `osxcross` for macOS, `cross` for Windows)
GitHub Actions is **FORBIDDEN** per ADR-009. All CI runs on `iad-ci`; secrets live in OpenBao and reach workflows via ESO-synced Kubernetes Secrets.
See `docs/plan/plan.md` ADR-009 (lines 495-502) and Phase 0.2 (lines 1015-1029) for the full CI specification.
### Release format
```
https://github.com/jedarden/pdftract/releases/download/v{VERSION}/pdftract-{TARGET}.tar.gz
```
Semantic versioning is required before any package is published to a package registry.
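
As a rough illustration of how an SDK might pick the right artifact at install time, the sketch below maps a platform/architecture pair to one of the five target triples and fills in the release URL template. The key format (`linux-x64`, `win32-x64`, and so on) follows Node's `process.platform` and `process.arch` naming, and the function names `resolveTarget` and `releaseUrl` are hypothetical; only the triples and the URL template come from this note.

```typescript
// Target triples copied from the binary distribution table.
type Target =
  | "x86_64-unknown-linux-musl"
  | "aarch64-unknown-linux-musl"
  | "x86_64-apple-darwin"
  | "aarch64-apple-darwin"
  | "x86_64-pc-windows-gnu";

// Keys use Node's `${process.platform}-${process.arch}` naming (an assumption
// about the host SDK; other runtimes would need their own mapping).
const TARGETS: Record<string, Target> = {
  "linux-x64": "x86_64-unknown-linux-musl",
  "linux-arm64": "aarch64-unknown-linux-musl",
  "darwin-x64": "x86_64-apple-darwin",
  "darwin-arm64": "aarch64-apple-darwin",
  "win32-x64": "x86_64-pc-windows-gnu",
};

// Hypothetical helper: fail loudly when no binary is published for the host.
function resolveTarget(platform: string, arch: string): Target {
  const target = TARGETS[`${platform}-${arch}`];
  if (target === undefined) {
    throw new Error(`no pdftract binary published for ${platform}-${arch}`);
  }
  return target;
}

// Fill in the release URL template; `version` is given without the leading "v".
function releaseUrl(version: string, target: Target): string {
  return `https://github.com/jedarden/pdftract/releases/download/v${version}/pdftract-${target}.tar.gz`;
}
```

An SDK pinned to a given binary version would call `releaseUrl(pinnedVersion, resolveTarget(process.platform, process.arch))` and download the result.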
---
## Feature flag composition
Feature flags control the binary footprint. The default build (`cargo build`) includes only the core extraction path. Heavy optional capabilities are behind named features.
### Feature tiers
| Tier | Features | Binary size (stripped) | Use case |
|---|---|---|---|
| **slim** | `["cli", "decrypt"]` | < 3 MB | Minimal CLI without Markdown |
| **default** | `["cli", "decrypt", "markdown"]` | < 4 MB | Standard CLI with Markdown output |
| **serve** | `default + ["serve"]` | < 12 MB | HTTP server mode |
| **ocr** | `default + ["ocr"]` | < 12 MB | OCR with Tesseract |
| **full** | `default + ["serve", "ocr", "mcp", "inspect", "grep", "profiles", "cache", "receipts", "remote"]` | < 14 MB | All features except `full-render` |
### Feature dependencies
Some features implicitly enable others:
- `serve` enables `cache` (the HTTP server is the primary cache consumer)
- `mcp` depends on `serve` (both transports share the HTTP infrastructure)
- `inspect` depends on `serve` (bundles a ~80 KB static HTML/CSS/JS frontend via `include_bytes!`)
- `grep` requires `regex` crate
- `profiles` requires `regex` crate
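
The implied-feature rules above amount to a transitive closure over a small graph. The sketch below shows that resolution under the assumption that the three feature-to-feature rules listed here are the complete set; in the real workspace this is expressed in `Cargo.toml` and resolved by Cargo, so `IMPLIES` and `resolveFeatures` are illustrative names only.

```typescript
// Implied features, transcribed from the list above. The `regex` entries for
// `grep` and `profiles` are crate dependencies, not features, so they are omitted.
const IMPLIES: Record<string, string[]> = {
  serve: ["cache"],
  mcp: ["serve"],
  inspect: ["serve"],
};

// Expand a requested feature list to everything it transitively enables.
function resolveFeatures(requested: string[]): string[] {
  const enabled = new Set<string>();
  const pending = requested.slice();
  while (pending.length > 0) {
    const feature = pending.pop() as string;
    if (enabled.has(feature)) continue; // already expanded
    enabled.add(feature);
    for (const implied of IMPLIES[feature] ?? []) pending.push(implied);
  }
  return Array.from(enabled).sort();
}
```

For example, requesting only `mcp` also enables `serve`, which in turn enables `cache`.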
### Binary size budgets
Per the Primary Objectives (Weight Targets):
| Metric | Target |
|---|---|
| Binary size, default features (no OCR, no serve) | < 4 MB stripped |
| Binary size, `--features ocr,serve` | < 12 MB stripped |
| Binary size, `--features full` (everything except `full-render`) | < 14 MB stripped |
| Docker image, CLI only | < 20 MB (distroless base) |
| Docker image, with OCR (`tesseract-ocr` system pkg) | < 120 MB |
| Docker image, `pdftract:full` | < 140 MB |
See `docs/plan/plan.md` lines 46-62 for the full weight targets specification.
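
A release check could enforce the binary budgets mechanically. The sketch below is one way to express that, assuming 1 MB means 1,000,000 bytes (the plan does not say whether MB or MiB is intended) and using hypothetical names `BUDGET_MB` and `withinBudget`.

```typescript
// Stripped-binary budgets in MB, copied from the table above.
// Assumption: 1 MB = 1,000,000 bytes; switch to 1,048,576 if MiB is intended.
const BYTES_PER_MB = 1_000_000;
const BUDGET_MB: Record<string, number> = {
  default: 4,
  "ocr,serve": 12,
  full: 14,
};

// Returns true when the stripped binary is strictly under its budget.
function withinBudget(featureSet: string, strippedBytes: number): boolean {
  const limit = BUDGET_MB[featureSet];
  if (limit === undefined) {
    throw new Error(`no budget defined for "${featureSet}"`);
  }
  return strippedBytes < limit * BYTES_PER_MB;
}
```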
---
## Two SDK tracks
### Track A — Subprocess / HTTP wrappers
Each SDK ships a thin wrapper that:
1. Downloads and caches the platform binary on first use (or at install time)
2. Invokes it via subprocess for one-off extractions
3. Optionally connects to a `pdftract serve` instance over HTTP for high-throughput use
**Tradeoffs:** Fast to implement for any language, no FFI complexity, slight per-call overhead
from process spawn. Acceptable for batch and interactive workloads.
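
Step 2 of the wrapper can be sketched in a few lines of TypeScript on Node.js. This assumes the `pdftract extract <path>` invocation used in the examples later in this note and that the binary writes JSON to stdout; `runJson` and `extractJson` are hypothetical names, and download/caching (step 1) and the HTTP path (step 3) are left out.

```typescript
import { execFileSync } from "node:child_process";

// Run a binary and parse its stdout as JSON. Arguments are passed as an array,
// so paths containing spaces or quotes need no shell escaping. execFileSync
// throws if the process cannot be started or exits with a non-zero status.
function runJson(binaryPath: string, args: string[]): unknown {
  const stdout = execFileSync(binaryPath, args, {
    encoding: "utf8",
    maxBuffer: 64 * 1024 * 1024, // allow large JSON output from big PDFs
  });
  return JSON.parse(stdout);
}

// One-off extraction via subprocess. The "extract" subcommand is taken from
// the other examples in this note; the default binary name is an assumption.
function extractJson(pdfPath: string, binaryPath = "pdftract"): unknown {
  return runJson(binaryPath, ["extract", pdfPath]);
}
```

The synchronous call keeps the sketch short; a published SDK would more likely use the asynchronous `execFile` so that extraction does not block the event loop.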
### Track B — Native bindings
The Rust core exposes a C ABI via `cbindgen`. Each language calls into the compiled shared
library directly, bypassing subprocess entirely.
**Requires:**
- `cdylib` crate type in `Cargo.toml`
- `cbindgen` generating a `pdftract.h` C header
- A `#[no_mangle] extern "C"` public API surface in the Rust core
- Per-language FFI glue
**Tradeoffs:** Zero process-spawn overhead, suitable for embedding in long-running services,
but requires per-language binding work and platform-specific shared library distribution.
**Starting recommendation:** Track A for all languages, Track B for Python first (PyO3 is
mature, Python is the highest-volume use case for RAG and LLM preprocessing pipelines).
---
## Per-language breakdown
| Language | Package Manager | Track A | Track B | Status |
|---|---|---|---|---|
| Python | PyPI | `subprocess` | PyO3 + maturin | Covered |
| JavaScript | npm | `child_process` | napi-rs | Covered |
| TypeScript | npm | Same as JS | Same + `.d.ts` types | **Gap — types only** |
| Java | Maven Central | `ProcessBuilder` | JNI via `jni` crate | Covered |
| C# | NuGet | `System.Diagnostics.Process` | P/Invoke via cbindgen | **Gap** |
| C++ | vcpkg / conan | `popen` | cbindgen `.h` + shared lib | **Gap** |
| Go | Go modules | `os/exec` | cgo + cbindgen | Covered |
| PHP | Packagist | `proc_open` | ext-php-rs or PHP FFI | **Gap** |
| Kotlin | Maven Central | `ProcessBuilder` (JVM) | JNI (same as Java) | **Gap** |
| Rust | crates.io | `std::process::Command` | native library crate | Covered |
---
## Gap detail
### TypeScript
Minimal work on top of the existing JavaScript notes. The implementation is identical: add a
`pdftract.d.ts` type definition file and publish to npm as `@pdftract/sdk` or alongside the JS
package. Types to define:
```typescript
export interface Span {
  text: string;
  bbox: [number, number, number, number];
  font: string;
  size: number;
  confidence: number;
}

export interface Block {
  kind: 'paragraph' | 'heading' | 'table' | 'figure' | 'list';
  text: string;
  bbox: [number, number, number, number];
}

export interface Page {
  page: number;
  spans: Span[];
  blocks: Block[];
}

export interface ExtractionResult {
  pages: Page[];
  metadata: {
    title?: string;
    author?: string;
    page_count: number;
  };
}

export function extract(filePath: string): Promise<ExtractionResult>;
export function extractText(filePath: string): Promise<string>;
export function extractPage(filePath: string, page: number): Promise<Page>;
export function createClient(baseUrl: string): PdftractClient;

export class PdftractClient {
  extract(filePath: string): Promise<ExtractionResult>;
  extractText(filePath: string): Promise<string>;
}
```
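
To show how these types are meant to be consumed, here is a small sketch of the flattening that an `extractText` implementation could perform. It follows the C# `ExtractTextAsync` example in this note (blocks joined by a newline, pages by a blank line); the name `toPlainText` and the reduced `Text*` interfaces are illustrative, declared locally so the snippet stands alone.

```typescript
// Minimal structural subset of the ExtractionResult shape declared above;
// a full ExtractionResult value is assignable to TextResult.
interface TextBlock { text: string }
interface TextPage { blocks: TextBlock[] }
interface TextResult { pages: TextPage[] }

// Join block text with newlines within a page and blank lines between pages.
function toPlainText(result: TextResult): string {
  return result.pages
    .map((page) => page.blocks.map((block) => block.text).join("\n"))
    .join("\n\n");
}
```

Because TypeScript typing is structural, the function accepts any value that has at least `pages[].blocks[].text`, so it works unchanged with the richer interfaces.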
---
### C# / .NET
**Track A — subprocess:**
```csharp
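using System.Diagnostics;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;

// ExtractionResult and PdftractException are assumed to be defined elsewhere in the SDK.
public class PdftractClient
{
    // Map snake_case JSON keys ("pages", "page_count") onto PascalCase properties (.NET 8+).
    private static readonly JsonSerializerOptions JsonOptions =
        new() { PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower };

    private readonly string _binaryPath;

    public PdftractClient(string binaryPath = "pdftract")
    {
        _binaryPath = binaryPath;
    }

    public async Task<ExtractionResult> ExtractAsync(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = _binaryPath,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
        };
        // ArgumentList escapes each argument, so paths containing quotes are safe.
        startInfo.ArgumentList.Add("extract");
        startInfo.ArgumentList.Add(pdfPath);

        using var process = new Process { StartInfo = startInfo };
        process.Start();

        // Drain both pipes concurrently; reading them one after the other can
        // deadlock when the child fills the buffer of the pipe not being read.
        Task<string> stdoutTask = process.StandardOutput.ReadToEndAsync();
        Task<string> stderrTask = process.StandardError.ReadToEndAsync();
        await process.WaitForExitAsync();
        string stdout = await stdoutTask;
        string stderr = await stderrTask;

        if (process.ExitCode != 0)
        {
            throw new PdftractException($"pdftract exited {process.ExitCode}: {stderr}");
        }
        return JsonSerializer.Deserialize<ExtractionResult>(stdout, JsonOptions)
            ?? throw new PdftractException("Empty response");
    }

    public async Task<string> ExtractTextAsync(string pdfPath)
    {
        var result = await ExtractAsync(pdfPath);
        return string.Join("\n\n", result.Pages.Select(p =>
            string.Join("\n", p.Blocks.Select(b => b.Text))));
    }
}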
```
**Track A — HTTP:**
```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

public class PdftractHttpClient : IDisposable
{
    // Map snake_case JSON keys onto PascalCase properties (.NET 8+).
    private static readonly JsonSerializerOptions JsonOptions =
        new() { PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower };

    private readonly HttpClient _http;
    private readonly string _baseUrl;

    public PdftractHttpClient(string baseUrl = "http://localhost:8080")
    {
        _http = new HttpClient();
        _baseUrl = baseUrl;
    }

    public async Task<ExtractionResult> ExtractAsync(string pdfPath)
    {
        // Upload the PDF as a multipart form field named "file".
        using var form = new MultipartFormDataContent();
        var fileBytes = await File.ReadAllBytesAsync(pdfPath);
        var fileContent = new ByteArrayContent(fileBytes);
        fileContent.Headers.ContentType = MediaTypeHeaderValue.Parse("application/pdf");
        form.Add(fileContent, "file", Path.GetFileName(pdfPath));

        using var response = await _http.PostAsync($"{_baseUrl}/extract", form);
        response.EnsureSuccessStatusCode();
        var json = await response.Content.ReadAsStringAsync();
        return JsonSerializer.Deserialize<ExtractionResult>(json, JsonOptions)
            ?? throw new PdftractException("Empty response");
    }

    public void Dispose() => _http.Dispose();
}
```
**Track B — native (P/Invoke):**
Requires `cbindgen` to generate `pdftract.h`, then:
```csharp
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    private const string LibName = "pdftract";

    [DllImport(LibName, EntryPoint = "pdftract_extract_file")]
    internal static extern IntPtr ExtractFile(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string path);

    [DllImport(LibName, EntryPoint = "pdftract_free_result")]
    internal static extern void FreeResult(IntPtr result);
}
```
**NuGet packaging:** the `.nupkg` must embed the shared library per Runtime Identifier:
```
lib/
  net8.0/
    Pdftract.dll
runtimes/
  linux-x64/native/libpdftract.so
  linux-arm64/native/libpdftract.so
  osx-x64/native/libpdftract.dylib
  osx-arm64/native/libpdftract.dylib
  win-x64/native/pdftract.dll
```
The `.csproj` sets `RuntimeIdentifiers` and uses `<PackagePath>` to map each binary into the
correct runtime folder. This is the primary complexity in C# packaging.
---
### C++
**Track A — subprocess:**
```cpp
#include <array>
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Quote a string for a POSIX shell: wrap it in single quotes and escape any
// embedded single quote, so the path cannot break out of the command line.
static std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Run a command via popen (POSIX) and return its stdout.
// Throws if the process cannot be started or exits with a non-zero status.
static std::string run_command(const std::string& cmd) {
    std::array<char, 4096> buf{};
    std::string result;
    std::unique_ptr<FILE, int (*)(FILE*)> pipe(popen(cmd.c_str(), "r"), pclose);
    if (!pipe) throw std::runtime_error("popen failed");
    while (fgets(buf.data(), static_cast<int>(buf.size()), pipe.get()) != nullptr)
        result += buf.data();
    // Release the handle so pclose runs exactly once and its status can be checked.
    if (pclose(pipe.release()) != 0)
        throw std::runtime_error("pdftract exited with an error");
    return result;
}

std::string pdftract_extract_json(const std::string& pdf_path) {
    return run_command("pdftract extract " + shell_quote(pdf_path));
}

std::string pdftract_extract_text(const std::string& pdf_path) {
    return run_command("pdftract extract --text " + shell_quote(pdf_path));
}
```
**Track A — HTTP (using libcurl):**
```cpp
#include <curl/curl.h>
#include <stdexcept>
#include <string>

// libcurl write callback: append each received chunk to the std::string
// registered via CURLOPT_WRITEDATA.
static size_t write_cb(char* ptr, size_t size, size_t nmemb, std::string* data) {
    data->append(ptr, size * nmemb);
    return size * nmemb;
}

std::string pdftract_http_extract(const std::string& pdf_path,
                                  const std::string& base_url = "http://localhost:8080") {
    CURL* curl = curl_easy_init();
    if (!curl) throw std::runtime_error("curl_easy_init failed");
    std::string response;

    // Build a multipart form with the PDF in a field named "file".
    curl_mime* mime = curl_mime_init(curl);
    curl_mimepart* part = curl_mime_addpart(mime);
    curl_mime_name(part, "file");
    curl_mime_filedata(part, pdf_path.c_str());

    // libcurl copies the URL string, so passing a temporary is safe here.
    curl_easy_setopt(curl, CURLOPT_URL, (base_url + "/extract").c_str());
    curl_easy_setopt(curl, CURLOPT_MIMEPOST, mime);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode res = curl_easy_perform(curl);
    curl_mime_free(mime);
    curl_easy_cleanup(curl);
    if (res != CURLE_OK)
        throw std::runtime_error(curl_easy_strerror(res));
    return response;
}
```
**Track B — native:** `cbindgen` generates `pdftract.h`; link against `libpdftract.so` /
`pdftract.dll`. Distribute as a vcpkg port or conan recipe with the header and shared library.
C++ has no standard package manager, so provide both options.
---
### PHP
**Track A — subprocess:**
```php
<?php

class Pdftract
{
    public function __construct(
        private string $binaryPath = 'pdftract'
    ) {}

    public function extract(string $pdfPath): array
    {
        $cmd = escapeshellcmd($this->binaryPath)
            . ' extract '
            . escapeshellarg($pdfPath);

        // Capture stdout (1) and stderr (2) through pipes.
        $descriptors = [
            1 => ['pipe', 'w'],
            2 => ['pipe', 'w'],
        ];

        $proc = proc_open($cmd, $descriptors, $pipes);
        if (!is_resource($proc)) {
            throw new RuntimeException('Failed to start pdftract');
        }

        $stdout = stream_get_contents($pipes[1]);
        $stderr = stream_get_contents($pipes[2]);
        fclose($pipes[1]);
        fclose($pipes[2]);
        $exit = proc_close($proc);

        if ($exit !== 0) {
            throw new RuntimeException("pdftract exited $exit: $stderr");
        }

        return json_decode($stdout, true, 512, JSON_THROW_ON_ERROR);
    }

    public function extractText(string $pdfPath): string
    {
        $result = $this->extract($pdfPath);
        $lines = [];
        foreach ($result['pages'] as $page) {
            foreach ($page['blocks'] as $block) {
                $lines[] = $block['text'];
            }
        }
        return implode("\n\n", $lines);
    }

    public function extractPage(string $pdfPath, int $page): array
    {
        $result = $this->extract($pdfPath);
        foreach ($result['pages'] as $p) {
            if ($p['page'] === $page) return $p;
        }
        throw new OutOfRangeException("Page $page not found");
    }
}
```
**Track A — HTTP:**
```php
<?php

class PdftractHttpClient
{
    public function __construct(
        private string $baseUrl = 'http://localhost:8080'
    ) {}

    public function extract(string $pdfPath): array
    {
        $ch = curl_init($this->baseUrl . '/extract');
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST => true,
            CURLOPT_POSTFIELDS => ['file' => new CURLFile($pdfPath, 'application/pdf')],
        ]);

        $response = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($status !== 200) {
            throw new RuntimeException("HTTP $status from pdftract serve");
        }

        return json_decode($response, true, 512, JSON_THROW_ON_ERROR);
    }

    public function extractText(string $pdfPath): string
    {
        $result = $this->extract($pdfPath);
        $lines = array_map(
            fn($page) => implode("\n", array_column($page['blocks'], 'text')),
            $result['pages']
        );
        return implode("\n\n", $lines);
    }
}
```
**Track B — native:** `ext-php-rs` compiles a PHP extension in Rust directly. Alternatively,
PHP 8+ FFI (`FFI::load`) can call into a C ABI shared library without writing a C extension.
The FFI approach is easier to distribute but has higher per-call overhead than a compiled
extension.
**Distribution:** Composer package (`packagist.org`). The package downloads the platform binary
in a post-install script. PHP extension distribution requires `pecl` and per-version compilation,
which is significant maintenance overhead. Subprocess Track A is the right starting point.
---
### Kotlin
The JVM is shared with Java, so the implementation is the same: `ProcessBuilder` and
`java.net.http.HttpClient` approach. The Kotlin wrapper adds idiomatic sugar: coroutines for
async, extension functions, and data classes for the JSON model.
**Subprocess:**
```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.withContext
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json

@Serializable
data class Span(val text: String, val bbox: List<Double>, val font: String,
                val size: Double, val confidence: Double)

@Serializable
data class Block(val kind: String, val text: String, val bbox: List<Double>)

@Serializable
data class Page(val page: Int, val spans: List<Span>, val blocks: List<Block>)

@Serializable
data class Metadata(val title: String? = null, val author: String? = null,
                    @SerialName("page_count") val pageCount: Int)

@Serializable
data class ExtractionResult(val pages: List<Page>, val metadata: Metadata)

class Pdftract(private val binaryPath: String = "pdftract") {
    private val json = Json { ignoreUnknownKeys = true }

    suspend fun extract(pdfPath: String): ExtractionResult = withContext(Dispatchers.IO) {
        val process = ProcessBuilder(binaryPath, "extract", pdfPath)
            .redirectErrorStream(false)
            .start()
        // Drain stderr concurrently; reading the two streams one after the other
        // can deadlock when the child fills the buffer of the unread pipe.
        val stderr = async { process.errorStream.bufferedReader().readText() }
        val stdout = process.inputStream.bufferedReader().readText()
        val exit = process.waitFor()
        if (exit != 0) throw RuntimeException("pdftract exited $exit: ${stderr.await()}")
        json.decodeFromString<ExtractionResult>(stdout)
    }

    suspend fun extractText(pdfPath: String): String =
        extract(pdfPath).pages
            .flatMap { it.blocks }
            .joinToString("\n\n") { it.text }

    // Throws NoSuchElementException when the requested page is absent.
    suspend fun extractPage(pdfPath: String, page: Int): Page =
        extract(pdfPath).pages.first { it.page == page }
}
```
**HTTP:**
```kotlin
import io.ktor.client.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.serialization.json.Json
import java.io.File

// ExtractionResult is the data class from the subprocess example.
class PdftractHttpClient(
    private val baseUrl: String = "http://localhost:8080",
    private val client: HttpClient = HttpClient()
) {
    private val json = Json { ignoreUnknownKeys = true }

    suspend fun extract(pdfPath: String): ExtractionResult {
        val file = File(pdfPath)
        // Upload the PDF as a multipart form field named "file".
        val response: HttpResponse = client.submitFormWithBinaryData(
            url = "$baseUrl/extract",
            formData = formData {
                append("file", file.readBytes(), Headers.build {
                    append(HttpHeaders.ContentType, "application/pdf")
                    append(HttpHeaders.ContentDisposition, "filename=\"${file.name}\"")
                })
            }
        )
        return json.decodeFromString(response.bodyAsText())
    }

    suspend fun extractText(pdfPath: String): String =
        extract(pdfPath).pages
            .flatMap { it.blocks }
            .joinToString("\n\n") { it.text }
}
```
**Distribution:** Maven Central, same artifact group as the Java package (`com.pdftract`).
Separate artifact ID (`pdftract-kotlin`) so Java users don't pull in Kotlin stdlib.
---
## Implementation sequencing
| Priority | Language | Effort | Rationale |
|---|---|---|---|
| 1 | TypeScript | Half a day | Type definitions on top of existing JS code |
| 2 | Kotlin | Half a day | JVM wrapper on top of existing Java code |
| 3 | C# | 1-2 days | Subprocess is straightforward; NuGet RID packaging is the complexity |
| 4 | PHP | 1 day | Composer subprocess wrapper; avoid extension track initially |
| 5 | C++ | 1-2 days | `popen` + libcurl; no package manager standard, distribute as vcpkg port |
All five are blocked on the GitHub Releases binary distribution infrastructure being in place first.
---
## Cross-references
Related documentation:
- **[`sdk-invocation.md`](sdk-invocation.md)**: Subprocess and HTTP invocation patterns for all supported languages
- **[`sdk-contract.md`](sdk-contract.md)**: The constitutional SDK specification (method surface, error mapping, versioning, conformance)
- **[`ocr-language-packs.md`](ocr-language-packs.md)**: Tesseract language pack distribution and installation
- **[`docs/plan/plan.md`](../plan/plan.md)**: The source of truth for all architectural decisions (workspace layout, cross-compile matrix, ADR-009 CI policy, KU-12 platform testing)
See also:
- **Phase 6.3** (PyO3 bindings): Python wheel build matrix via `pdftract-maturin-build`
- **Phase 7.9** (Inspector UI): Web debug viewer bundled via `include_bytes!`
- **ADR-009** (Argo Workflows on iad-ci): CI/CD architecture and cross-compilation strategy