Add SDK architecture notes covering top 10 languages

Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples
for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI
options, and implementation sequencing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jedarden 2026-05-16 14:51:25 -04:00
parent f87579b100
commit ef9c03095d


@ -0,0 +1,579 @@
# SDK Architecture and Language Coverage
## Top 10 language coverage status
Based on Stack Overflow 2024 survey rankings. The existing `sdk-invocation.md` covers Python,
JavaScript, Go, Ruby, Java, Rust, and Bash. Gaps: TypeScript, C#, C++, PHP, and Kotlin.
---
## Common infrastructure (required before any SDK ships)
### Binary distribution
Every SDK approach — subprocess or native — depends on platform binaries published to GitHub Releases:
| Target triple | Platform |
|---|---|
| `x86_64-unknown-linux-gnu` | Linux x86_64 |
| `aarch64-unknown-linux-gnu` | Linux ARM64 |
| `x86_64-apple-darwin` | macOS Intel |
| `aarch64-apple-darwin` | macOS Apple Silicon |
| `x86_64-pc-windows-msvc` | Windows x86_64 |
The CI workflow must cross-compile for all five targets and attach the binaries to a versioned
GitHub Release tag on every release. SDKs pin to a binary version and download the appropriate
artifact at install time.
### Release format
```
https://github.com/jedarden/pdftract/releases/download/v{VERSION}/pdftract-{TARGET}.tar.gz
```
Semantic versioning is required before any package is published to a package registry.
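As a rough illustration of how an SDK might pick the right artifact, the sketch below maps a platform/architecture pair to one of the five target triples and fills in the release URL template. The helper names `resolveTarget` and `releaseUrl` are hypothetical, the platform and arch strings are assumed to follow Node's `process.platform` / `process.arch` values, and the strict `MAJOR.MINOR.PATCH` check is an assumption about how versions will be tagged.

```typescript
// Sketch only: `resolveTarget` and `releaseUrl` are hypothetical helper names.
type Target =
  | "x86_64-unknown-linux-gnu"
  | "aarch64-unknown-linux-gnu"
  | "x86_64-apple-darwin"
  | "aarch64-apple-darwin"
  | "x86_64-pc-windows-msvc";

// Keys are "<platform>/<arch>" using Node's process.platform / process.arch values.
const TARGETS: Record<string, Target> = {
  "linux/x64": "x86_64-unknown-linux-gnu",
  "linux/arm64": "aarch64-unknown-linux-gnu",
  "darwin/x64": "x86_64-apple-darwin",
  "darwin/arm64": "aarch64-apple-darwin",
  "win32/x64": "x86_64-pc-windows-msvc",
};

function resolveTarget(platform: string, arch: string): Target {
  const target = TARGETS[`${platform}/${arch}`];
  if (target === undefined) {
    throw new Error(`No pdftract binary published for ${platform}/${arch}`);
  }
  return target;
}

function releaseUrl(version: string, target: Target): string {
  // Assumption: releases are tagged with plain MAJOR.MINOR.PATCH versions.
  if (!/^\d+\.\d+\.\d+$/.test(version)) {
    throw new Error(`Not a MAJOR.MINOR.PATCH version: ${version}`);
  }
  return `https://github.com/jedarden/pdftract/releases/download/v${version}/pdftract-${target}.tar.gz`;
}
```

In a Node-based SDK this would typically be called as `releaseUrl(pinnedVersion, resolveTarget(process.platform, process.arch))`, with `pinnedVersion` coming from the SDK's own package metadata.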
---
## Two SDK tracks
### Track A — Subprocess / HTTP wrappers
Each SDK ships a thin wrapper that:
1. Downloads and caches the platform binary on first use (or at install time)
2. Invokes it via subprocess for one-off extractions
3. Optionally connects to a `pdftract serve` instance over HTTP for high-throughput use
**Tradeoffs:** Fast to implement for any language, no FFI complexity, slight per-call overhead
from process spawn. Acceptable for batch and interactive workloads.
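The subprocess step of a Track A wrapper can be sketched independently of any particular process API. The example below assumes, as the per-language examples in this document do, that `pdftract extract <path>` prints JSON to stdout and signals failure through a non-zero exit code. `Runner`, `RunResult`, and `extractJson` are hypothetical names; `RunResult` mirrors the subset of fields returned by Node's `spawnSync` that the wrapper needs.

```typescript
// Sketch only: `Runner`, `RunResult`, and `extractJson` are hypothetical names.
interface RunResult {
  status: number | null; // exit code, or null if the process was killed by a signal
  stdout: string;
  stderr: string;
}

// Abstracts process spawning so the wrapper logic can be exercised without the binary.
type Runner = (command: string, args: string[]) => RunResult;

function extractJson(run: Runner, binaryPath: string, pdfPath: string): unknown {
  // Arguments are passed as an array, so the path needs no shell quoting.
  const result = run(binaryPath, ["extract", pdfPath]);
  if (result.status !== 0) {
    throw new Error(`pdftract exited ${result.status}: ${result.stderr.trim()}`);
  }
  return JSON.parse(result.stdout);
}
```

In Node, a suitable runner would be `(cmd, args) => spawnSync(cmd, args, { encoding: "utf8" })` from `node:child_process`. Injecting the runner keeps the exit-code and parsing logic testable with a fake process.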
### Track B — Native bindings
The Rust core exposes a C ABI via `cbindgen`. Each language calls into the compiled shared
library directly, bypassing subprocess entirely.
**Requires:**
- `cdylib` crate type in `Cargo.toml`
- `cbindgen` generating a `pdftract.h` C header
- A `#[no_mangle] extern "C"` public API surface in the Rust core
- Per-language FFI glue
**Tradeoffs:** Zero process-spawn overhead, suitable for embedding in long-running services,
but requires per-language binding work and platform-specific shared library distribution.
**Starting recommendation:** Track A for all languages, Track B for Python first (PyO3 is
mature, Python is the highest-volume use case for RAG and LLM preprocessing pipelines).
---
## Per-language breakdown
| Language | Package Manager | Track A | Track B | Status |
|---|---|---|---|---|
| Python | PyPI | `subprocess` | PyO3 + maturin | Covered |
| JavaScript | npm | `child_process` | napi-rs | Covered |
| TypeScript | npm | Same as JS | Same + `.d.ts` types | **Gap — types only** |
| Java | Maven Central | `ProcessBuilder` | JNI via `jni` crate | Covered |
| C# | NuGet | `System.Diagnostics.Process` | P/Invoke via cbindgen | **Gap** |
| C++ | vcpkg / conan | `popen` | cbindgen → `.h` + shared lib | **Gap** |
| Go | Go modules | `os/exec` | cgo + cbindgen | Covered |
| PHP | Packagist | `proc_open` | ext-php-rs or PHP FFI | **Gap** |
| Kotlin | Maven Central | `ProcessBuilder` (JVM) | JNI (same as Java) | **Gap** |
| Rust | crates.io | `std::process::Command` | native library crate | Covered |
---
## Gap detail
### TypeScript
Minimal work on top of the existing JavaScript notes. The implementation is identical — add a
`pdftract.d.ts` type definition file and publish to npm as `@pdftract/sdk` or alongside the JS
package. Types to define:
```typescript
export interface Span {
  text: string;
  bbox: [number, number, number, number];
  font: string;
  size: number;
  confidence: number;
}

export interface Block {
  kind: 'paragraph' | 'heading' | 'table' | 'figure' | 'list';
  text: string;
  bbox: [number, number, number, number];
}

export interface Page {
  page: number;
  spans: Span[];
  blocks: Block[];
}

export interface ExtractionResult {
  pages: Page[];
  metadata: {
    title?: string;
    author?: string;
    page_count: number;
  };
}

export function extract(filePath: string): Promise<ExtractionResult>;
export function extractText(filePath: string): Promise<string>;
export function extractPage(filePath: string, page: number): Promise<Page>;
export function createClient(baseUrl: string): PdftractClient;

export class PdftractClient {
  extract(filePath: string): Promise<ExtractionResult>;
  extractText(filePath: string): Promise<string>;
}
```
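To show how these types might be consumed, here is a minimal sketch of flattening a result to plain text and looking up a single page. It mirrors the joining convention used by the C# `ExtractTextAsync` example in this document (blocks joined by a newline, pages by a blank line). The names `blocksToText` and `findPage` are hypothetical, and the local interfaces are structural subsets of the declarations above so that the block stands alone.

```typescript
// Sketch only: `blocksToText` and `findPage` are hypothetical helper names.
// Structural subsets of Block, Page, and ExtractionResult, redeclared so the block is self-contained.
interface TextBlock {
  text: string;
}
interface TextPage {
  page: number;
  blocks: TextBlock[];
}
interface TextResult {
  pages: TextPage[];
}

// Joins blocks within a page by "\n" and pages by "\n\n".
function blocksToText(result: TextResult): string {
  return result.pages
    .map((page) => page.blocks.map((block) => block.text).join("\n"))
    .join("\n\n");
}

// Returns the page whose `page` field matches, or throws if it is absent.
function findPage(result: TextResult, pageNumber: number): TextPage {
  for (const page of result.pages) {
    if (page.page === pageNumber) return page;
  }
  throw new RangeError(`Page ${pageNumber} not found`);
}
```

Because the helpers only require the `text`, `page`, and `blocks` fields, a full `ExtractionResult` can be passed to them directly.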
---
### C# / .NET
**Track A — subprocess:**
```csharp
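using System.Diagnostics;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;

// ExtractionResult and PdftractException are assumed to be defined elsewhere in the SDK.
public class PdftractClient
{
    private readonly string _binaryPath;

    public PdftractClient(string binaryPath = "pdftract")
    {
        _binaryPath = binaryPath;
    }

    public async Task<ExtractionResult> ExtractAsync(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = _binaryPath,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
        };
        // ArgumentList escapes each argument, so paths with spaces or quotes are passed intact.
        startInfo.ArgumentList.Add("extract");
        startInfo.ArgumentList.Add(pdfPath);

        using var process = new Process { StartInfo = startInfo };
        process.Start();

        // Drain both pipes concurrently: reading them one after the other can
        // deadlock if the child fills the pipe that is not being read.
        Task<string> stdoutTask = process.StandardOutput.ReadToEndAsync();
        Task<string> stderrTask = process.StandardError.ReadToEndAsync();
        await process.WaitForExitAsync();
        string stdout = await stdoutTask;
        string stderr = await stderrTask;

        if (process.ExitCode != 0)
        {
            throw new PdftractException($"pdftract exited {process.ExitCode}: {stderr}");
        }
        return JsonSerializer.Deserialize<ExtractionResult>(stdout)
            ?? throw new PdftractException("Empty response");
    }

    public async Task<string> ExtractTextAsync(string pdfPath)
    {
        var result = await ExtractAsync(pdfPath);
        // Blocks are joined by a newline, pages by a blank line.
        return string.Join("\n\n", result.Pages.Select(p =>
            string.Join("\n", p.Blocks.Select(b => b.Text))));
    }
}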
```
**Track A — HTTP:**
```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

public class PdftractHttpClient : IDisposable
{
    private readonly HttpClient _http;
    private readonly string _baseUrl;

    public PdftractHttpClient(string baseUrl = "http://localhost:8080")
    {
        _http = new HttpClient();
        _baseUrl = baseUrl;
    }

    public async Task<ExtractionResult> ExtractAsync(string pdfPath)
    {
        // Upload the PDF as multipart form data in the "file" field.
        using var form = new MultipartFormDataContent();
        var fileBytes = await File.ReadAllBytesAsync(pdfPath);
        var fileContent = new ByteArrayContent(fileBytes);
        fileContent.Headers.ContentType = MediaTypeHeaderValue.Parse("application/pdf");
        form.Add(fileContent, "file", Path.GetFileName(pdfPath));

        using var response = await _http.PostAsync($"{_baseUrl}/extract", form);
        response.EnsureSuccessStatusCode();
        var json = await response.Content.ReadAsStringAsync();
        return JsonSerializer.Deserialize<ExtractionResult>(json)
            ?? throw new PdftractException("Empty response");
    }

    public void Dispose() => _http.Dispose();
}
```
**Track B — native (P/Invoke):**
Requires `cbindgen` to generate `pdftract.h`, then:
```csharp
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    private const string LibName = "pdftract";

    [DllImport(LibName, EntryPoint = "pdftract_extract_file")]
    internal static extern IntPtr ExtractFile(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string path);

    [DllImport(LibName, EntryPoint = "pdftract_free_result")]
    internal static extern void FreeResult(IntPtr result);
}
```
**NuGet packaging** — the `.nupkg` must embed the shared library per Runtime Identifier:
```
lib/
  net8.0/
    Pdftract.dll
runtimes/
  linux-x64/native/libpdftract.so
  linux-arm64/native/libpdftract.so
  osx-x64/native/libpdftract.dylib
  osx-arm64/native/libpdftract.dylib
  win-x64/native/pdftract.dll
```
The `.csproj` sets `RuntimeIdentifiers` and uses `<PackagePath>` to map each binary into the
correct runtime folder. This is the primary complexity in C# packaging.
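A packaging script has to translate each Rust target triple into the matching Runtime Identifier folder. The sketch below shows that mapping using the five targets from the binary distribution table and the layout shown above. `NATIVE_ASSETS` and `packagePath` are hypothetical names for a build-time helper, not part of NuGet or MSBuild.

```typescript
// Sketch only: `NATIVE_ASSETS` and `packagePath` are hypothetical names for a
// build-script helper; they are not part of any NuGet or MSBuild API.
interface NativeAsset {
  rid: string; // .NET Runtime Identifier
  fileName: string; // shared library file name on that platform
}

const NATIVE_ASSETS: Record<string, NativeAsset> = {
  "x86_64-unknown-linux-gnu": { rid: "linux-x64", fileName: "libpdftract.so" },
  "aarch64-unknown-linux-gnu": { rid: "linux-arm64", fileName: "libpdftract.so" },
  "x86_64-apple-darwin": { rid: "osx-x64", fileName: "libpdftract.dylib" },
  "aarch64-apple-darwin": { rid: "osx-arm64", fileName: "libpdftract.dylib" },
  "x86_64-pc-windows-msvc": { rid: "win-x64", fileName: "pdftract.dll" },
};

// Returns the path inside the .nupkg where the native library for `target` belongs.
function packagePath(target: string): string {
  const asset = NATIVE_ASSETS[target];
  if (asset === undefined) {
    throw new Error(`Unknown target triple: ${target}`);
  }
  return `runtimes/${asset.rid}/native/${asset.fileName}`;
}
```

Keeping the table in one place means the CI job that builds the five binaries and the step that assembles the package cannot drift apart on folder names.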
---
### C++
**Track A — subprocess:**
```cpp
#include <array>
#include <cstdio>
#include <stdexcept>
#include <string>

// POSIX sketch: on Windows, use _popen/_pclose and cmd.exe quoting rules instead.

// Wraps an argument in single quotes so the shell passes it through literally.
static std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen quote
        else out += c;
    }
    out += "'";
    return out;
}

// Runs a command through the shell, returns its stdout, and throws on a non-zero status.
static std::string run_command(const std::string& cmd) {
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed");
    std::array<char, 4096> buf{};
    std::string result;
    size_t n;
    while ((n = fread(buf.data(), 1, buf.size(), pipe)) > 0)
        result.append(buf.data(), n);
    int status = pclose(pipe);
    if (status != 0)
        throw std::runtime_error("pdftract failed with wait status " + std::to_string(status));
    return result;
}

std::string pdftract_extract_json(const std::string& pdf_path) {
    return run_command("pdftract extract " + shell_quote(pdf_path));
}

std::string pdftract_extract_text(const std::string& pdf_path) {
    return run_command("pdftract extract --text " + shell_quote(pdf_path));
}
```
**Track A — HTTP (using libcurl):**
```cpp
#include <curl/curl.h>
#include <stdexcept>
#include <string>

// libcurl write callback: appends each received chunk to the std::string passed as userdata.
static size_t write_cb(char* ptr, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

std::string pdftract_http_extract(const std::string& pdf_path,
                                  const std::string& base_url = "http://localhost:8080") {
    CURL* curl = curl_easy_init();
    if (!curl) throw std::runtime_error("curl_easy_init failed");

    std::string response;
    const std::string url = base_url + "/extract";

    // Multipart form with the PDF in the "file" field (the curl_mime API needs libcurl 7.56+).
    curl_mime* mime = curl_mime_init(curl);
    curl_mimepart* part = curl_mime_addpart(mime);
    curl_mime_name(part, "file");
    curl_mime_filedata(part, pdf_path.c_str());

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_MIMEPOST, mime);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    // Treat HTTP 4xx/5xx responses as errors instead of returning the error body.
    curl_easy_setopt(curl, CURLOPT_FAILONERROR, 1L);

    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_mime_free(mime);
    if (res != CURLE_OK)
        throw std::runtime_error(curl_easy_strerror(res));
    return response;
}
```
**Track B — native:** `cbindgen` generates `pdftract.h`; link against `libpdftract.so` /
`pdftract.dll`. Distribute as a vcpkg port or conan recipe with the header and shared library.
C++ has no single standard package manager, so provide both options.
---
### PHP
**Track A — subprocess:**
```php
<?php

class Pdftract
{
    public function __construct(
        private string $binaryPath = 'pdftract'
    ) {}

    public function extract(string $pdfPath): array
    {
        $cmd = escapeshellcmd($this->binaryPath)
            . ' extract '
            . escapeshellarg($pdfPath);

        $descriptors = [
            1 => ['pipe', 'w'],
            2 => ['pipe', 'w'],
        ];

        $proc = proc_open($cmd, $descriptors, $pipes);
        if (!is_resource($proc)) {
            throw new RuntimeException('Failed to start pdftract');
        }

        $stdout = stream_get_contents($pipes[1]);
        $stderr = stream_get_contents($pipes[2]);
        fclose($pipes[1]);
        fclose($pipes[2]);

        $exit = proc_close($proc);
        if ($exit !== 0) {
            throw new RuntimeException("pdftract exited $exit: $stderr");
        }

        return json_decode($stdout, true, 512, JSON_THROW_ON_ERROR);
    }

    public function extractText(string $pdfPath): string
    {
        $result = $this->extract($pdfPath);
        $lines = [];
        foreach ($result['pages'] as $page) {
            foreach ($page['blocks'] as $block) {
                $lines[] = $block['text'];
            }
        }
        return implode("\n\n", $lines);
    }

    public function extractPage(string $pdfPath, int $page): array
    {
        $result = $this->extract($pdfPath);
        foreach ($result['pages'] as $p) {
            if ($p['page'] === $page) return $p;
        }
        throw new OutOfRangeException("Page $page not found");
    }
}
```
**Track A — HTTP:**
```php
<?php

class PdftractHttpClient
{
    public function __construct(
        private string $baseUrl = 'http://localhost:8080'
    ) {}

    public function extract(string $pdfPath): array
    {
        $ch = curl_init($this->baseUrl . '/extract');
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST => true,
            CURLOPT_POSTFIELDS => ['file' => new CURLFile($pdfPath, 'application/pdf')],
        ]);

        $response = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($status !== 200) {
            throw new RuntimeException("HTTP $status from pdftract serve");
        }

        return json_decode($response, true, 512, JSON_THROW_ON_ERROR);
    }

    public function extractText(string $pdfPath): string
    {
        $result = $this->extract($pdfPath);
        $lines = array_map(
            fn($page) => implode("\n", array_column($page['blocks'], 'text')),
            $result['pages']
        );
        return implode("\n\n", $lines);
    }
}
```
**Track B — native:** `ext-php-rs` compiles a PHP extension in Rust directly. Alternatively,
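the FFI extension bundled since PHP 7.4 (`FFI::cdef` or `FFI::load`) can call into a C ABI
shared library without writing a C extension. The FFI approach is easier to distribute but has
higher per-call overhead than a compiled extension, and by default (`ffi.enable=preload`) FFI is
only usable from the CLI and from preloaded scripts, so web deployments may need a config change.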
**Distribution:** Composer package (`packagist.org`). The package downloads the platform binary
in a post-install script. PHP extension distribution requires `pecl` and per-version compilation,
which is significant maintenance overhead — subprocess Track A is the right starting point.
---
### Kotlin
Kotlin runs on the same JVM as Java, so the implementation can reuse the same `ProcessBuilder`
and `java.net.http.HttpClient` approach; the HTTP example below uses Ktor instead, which fits
coroutine-based code. The Kotlin wrapper adds idiomatic sugar: coroutines for async, extension
functions, and data classes for the JSON model.
**Subprocess:**
```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.withContext
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString // extension import, needed on older kotlinx.serialization releases
import kotlinx.serialization.json.Json

@Serializable
data class Span(val text: String, val bbox: List<Double>, val font: String,
                val size: Double, val confidence: Double)

@Serializable
data class Block(val kind: String, val text: String, val bbox: List<Double>)

@Serializable
data class Page(val page: Int, val spans: List<Span>, val blocks: List<Block>)

@Serializable
data class Metadata(val title: String? = null, val author: String? = null,
                    val page_count: Int)

@Serializable
data class ExtractionResult(val pages: List<Page>, val metadata: Metadata)

class Pdftract(private val binaryPath: String = "pdftract") {
    private val json = Json { ignoreUnknownKeys = true }

    suspend fun extract(pdfPath: String): ExtractionResult = withContext(Dispatchers.IO) {
        val process = ProcessBuilder(binaryPath, "extract", pdfPath)
            .redirectErrorStream(false)
            .start()
        // Drain stderr concurrently: reading the two pipes one after the other can
        // deadlock if the child fills the pipe that is not being read.
        val stderr = async { process.errorStream.bufferedReader().readText() }
        val stdout = process.inputStream.bufferedReader().readText()
        val exit = process.waitFor()
        if (exit != 0) throw RuntimeException("pdftract exited $exit: ${stderr.await()}")
        json.decodeFromString<ExtractionResult>(stdout)
    }

    suspend fun extractText(pdfPath: String): String =
        extract(pdfPath).pages
            .flatMap { it.blocks }
            .joinToString("\n\n") { it.text }

    // Throws NoSuchElementException if the page is not present.
    suspend fun extractPage(pdfPath: String, page: Int): Page =
        extract(pdfPath).pages.first { it.page == page }
}
```
**HTTP:**
```kotlin
import io.ktor.client.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.serialization.decodeFromString // extension import, needed on older kotlinx.serialization releases
import kotlinx.serialization.json.Json
import java.io.File

// Reuses the ExtractionResult model from the subprocess example.
class PdftractHttpClient(
    private val baseUrl: String = "http://localhost:8080",
    private val client: HttpClient = HttpClient()
) {
    private val json = Json { ignoreUnknownKeys = true }

    suspend fun extract(pdfPath: String): ExtractionResult {
        val file = File(pdfPath)
        val response: HttpResponse = client.submitFormWithBinaryData(
            url = "$baseUrl/extract",
            formData = formData {
                append("file", file.readBytes(), Headers.build {
                    append(HttpHeaders.ContentType, "application/pdf")
                    append(HttpHeaders.ContentDisposition, "filename=\"${file.name}\"")
                })
            }
        )
        // Ktor does not throw on non-2xx responses unless expectSuccess is enabled.
        if (!response.status.isSuccess()) {
            throw RuntimeException("HTTP ${response.status.value} from pdftract serve")
        }
        return json.decodeFromString<ExtractionResult>(response.bodyAsText())
    }

    suspend fun extractText(pdfPath: String): String =
        extract(pdfPath).pages
            .flatMap { it.blocks }
            .joinToString("\n\n") { it.text }
}
```
**Distribution:** Maven Central, same artifact group as the Java package (`com.pdftract`).
Separate artifact ID (`pdftract-kotlin`) so Java users don't pull in Kotlin stdlib.
---
## Implementation sequencing
| Priority | Language | Effort | Rationale |
|---|---|---|---|
| 1 | TypeScript | Half a day | Type definitions on top of existing JS code |
| 2 | Kotlin | Half a day | JVM wrapper on top of existing Java code |
| 3 | C# | 1-2 days | Subprocess is straightforward; NuGet RID packaging is the complexity |
| 4 | PHP | 1 day | Composer subprocess wrapper; avoid extension track initially |
| 5 | C++ | 1-2 days | `popen` + libcurl; no package manager standard, distribute as vcpkg port |
All five are blocked on the GitHub Releases binary distribution infrastructure being in place first.