jedarden ef9c03095d Add SDK architecture notes covering top 10 languages

Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples
for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI
options, and implementation sequencing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 14:51:25 -04:00

17 KiB

Raw Blame History

SDK Architecture and Language Coverage

Top 10 language coverage status

Based on Stack Overflow 2024 survey rankings. The existing sdk-invocation.md covers Python, JavaScript, Go, Ruby, Java, Rust, and Bash. Gaps: TypeScript, C#, C++, PHP, and Kotlin.

Common infrastructure (required before any SDK ships)

Binary distribution

Every SDK approach — subprocess or native — depends on platform binaries published to GitHub Releases:

Target triple	Platform
`x86_64-unknown-linux-gnu`	Linux x86_64
`aarch64-unknown-linux-gnu`	Linux ARM64
`x86_64-apple-darwin`	macOS Intel
`aarch64-apple-darwin`	macOS Apple Silicon
`x86_64-pc-windows-msvc`	Windows x86_64

The CI workflow must cross-compile for all five targets and attach the binaries to a versioned GitHub Release tag on every release. SDKs pin to a binary version and download the appropriate artifact at install time.

Release format

https://github.com/jedarden/pdftract/releases/download/v{VERSION}/pdftract-{TARGET}.tar.gz

Semantic versioning is required before any package is published to a package registry.

Two SDK tracks

Track A — Subprocess / HTTP wrappers

Each SDK ships a thin wrapper that:

Downloads and caches the platform binary on first use (or at install time)
Invokes it via subprocess for one-off extractions
Optionally connects to a pdftract serve instance over HTTP for high-throughput use

Tradeoffs: Fast to implement for any language, no FFI complexity, slight per-call overhead from process spawn. Acceptable for batch and interactive workloads.

Track B — Native bindings

The Rust core exposes a C ABI via cbindgen. Each language calls into the compiled shared library directly, bypassing subprocess entirely.

Requires:

cdylib crate type in Cargo.toml
cbindgen generating a pdftract.h C header
A #[no_mangle] extern "C" public API surface in the Rust core
Per-language FFI glue

Tradeoffs: Zero process-spawn overhead, suitable for embedding in long-running services, but requires per-language binding work and platform-specific shared library distribution.

Starting recommendation: Track A for all languages, Track B for Python first (PyO3 is mature, Python is the highest-volume use case for RAG and LLM preprocessing pipelines).

Per-language breakdown

Language	Package Manager	Track A	Track B	Status
Python	PyPI	`subprocess`	PyO3 + maturin	Covered
JavaScript	npm	`child_process`	napi-rs	Covered
TypeScript	npm	Same as JS	Same + `.d.ts` types	Gap — types only
Java	Maven Central	`ProcessBuilder`	JNI via `jni` crate	Covered
C#	NuGet	`System.Diagnostics.Process`	P/Invoke via cbindgen	Gap
C++	vcpkg / conan	`popen`	cbindgen → `.h` + shared lib	Gap
Go	Go modules	`os/exec`	cgo + cbindgen	Covered
PHP	Packagist	`proc_open`	ext-php-rs or PHP FFI	Gap
Kotlin	Maven Central	`ProcessBuilder` (JVM)	JNI (same as Java)	Gap
Rust	crates.io	`std::process::Command`	native library crate	Covered

Gap detail

TypeScript

Minimal work on top of the existing JavaScript notes. The implementation is identical — add a pdftract.d.ts type definition file and publish to npm as @pdftract/sdk or alongside the JS package. Types to define:

export interface Span {
  text: string;
  bbox: [number, number, number, number];
  font: string;
  size: number;
  confidence: number;
}

export interface Block {
  kind: 'paragraph' | 'heading' | 'table' | 'figure' | 'list';
  text: string;
  bbox: [number, number, number, number];
}

export interface Page {
  page: number;
  spans: Span[];
  blocks: Block[];
}

export interface ExtractionResult {
  pages: Page[];
  metadata: {
    title?: string;
    author?: string;
    page_count: number;
  };
}

export function extract(filePath: string): Promise<ExtractionResult>;
export function extractText(filePath: string): Promise<string>;
export function extractPage(filePath: string, page: number): Promise<Page>;
export function createClient(baseUrl: string): PdftractClient;

export class PdftractClient {
  extract(filePath: string): Promise<ExtractionResult>;
  extractText(filePath: string): Promise<string>;
}

C# / .NET

Track A — subprocess:

using System.Diagnostics;
using System.Text.Json;

public class PdftractClient
{
    private readonly string _binaryPath;

    public PdftractClient(string binaryPath = "pdftract")
    {
        _binaryPath = binaryPath;
    }

    public async Task<ExtractionResult> ExtractAsync(string pdfPath)
    {
        using var process = new Process
        {
            StartInfo = new ProcessStartInfo
            {
                FileName = _binaryPath,
                Arguments = $"extract \"{pdfPath}\"",
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                UseShellExecute = false,
            }
        };

        process.Start();
        string stdout = await process.StandardOutput.ReadToEndAsync();
        await process.WaitForExitAsync();

        if (process.ExitCode != 0)
        {
            string stderr = await process.StandardError.ReadToEndAsync();
            throw new PdftractException($"pdftract exited {process.ExitCode}: {stderr}");
        }

        return JsonSerializer.Deserialize<ExtractionResult>(stdout)
            ?? throw new PdftractException("Empty response");
    }

    public async Task<string> ExtractTextAsync(string pdfPath)
    {
        var result = await ExtractAsync(pdfPath);
        return string.Join("\n\n", result.Pages.Select(p =>
            string.Join("\n", p.Blocks.Select(b => b.Text))));
    }
}

Track A — HTTP:

using System.Net.Http.Headers;

public class PdftractHttpClient : IDisposable
{
    private readonly HttpClient _http;
    private readonly string _baseUrl;

    public PdftractHttpClient(string baseUrl = "http://localhost:8080")
    {
        _http = new HttpClient();
        _baseUrl = baseUrl;
    }

    public async Task<ExtractionResult> ExtractAsync(string pdfPath)
    {
        using var form = new MultipartFormDataContent();
        var fileBytes = await File.ReadAllBytesAsync(pdfPath);
        var fileContent = new ByteArrayContent(fileBytes);
        fileContent.Headers.ContentType = MediaTypeHeaderValue.Parse("application/pdf");
        form.Add(fileContent, "file", Path.GetFileName(pdfPath));

        var response = await _http.PostAsync($"{_baseUrl}/extract", form);
        response.EnsureSuccessStatusCode();

        var json = await response.Content.ReadAsStringAsync();
        return JsonSerializer.Deserialize<ExtractionResult>(json)
            ?? throw new PdftractException("Empty response");
    }

    public void Dispose() => _http.Dispose();
}

Track B — native (P/Invoke):

Requires cbindgen to generate pdftract.h, then:

using System.Runtime.InteropServices;

internal static class NativeMethods
{
    private const string LibName = "pdftract";

    [DllImport(LibName, EntryPoint = "pdftract_extract_file")]
    internal static extern IntPtr ExtractFile(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string path);

    [DllImport(LibName, EntryPoint = "pdftract_free_result")]
    internal static extern void FreeResult(IntPtr result);
}

NuGet packaging — the .nupkg must embed the shared library per Runtime Identifier:

lib/
  net8.0/
    Pdftract.dll
runtimes/
  linux-x64/native/libpdftract.so
  linux-arm64/native/libpdftract.so
  osx-x64/native/libpdftract.dylib
  osx-arm64/native/libpdftract.dylib
  win-x64/native/pdftract.dll

The .csproj sets RuntimeIdentifiers and uses <PackagePath> to map each binary into the correct runtime folder. This is the primary complexity in C# packaging.

C++

Track A — subprocess:

#include <array>
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

std::string pdftract_extract_json(const std::string& pdf_path) {
    std::string cmd = "pdftract extract \"" + pdf_path + "\"";
    std::array<char, 4096> buf{};
    std::string result;

    std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(cmd.c_str(), "r"), pclose);
    if (!pipe) throw std::runtime_error("popen failed");

    while (fgets(buf.data(), buf.size(), pipe.get()) != nullptr)
        result += buf.data();

    return result;
}

std::string pdftract_extract_text(const std::string& pdf_path) {
    std::string cmd = "pdftract extract --text \"" + pdf_path + "\"";
    std::array<char, 4096> buf{};
    std::string result;

    std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(cmd.c_str(), "r"), pclose);
    if (!pipe) throw std::runtime_error("popen failed");

    while (fgets(buf.data(), buf.size(), pipe.get()) != nullptr)
        result += buf.data();

    return result;
}

Track A — HTTP (using libcurl):

#include <curl/curl.h>
#include <string>

static size_t write_cb(char* ptr, size_t size, size_t nmemb, std::string* data) {
    data->append(ptr, size * nmemb);
    return size * nmemb;
}

std::string pdftract_http_extract(const std::string& pdf_path,
                                   const std::string& base_url = "http://localhost:8080") {
    CURL* curl = curl_easy_init();
    if (!curl) throw std::runtime_error("curl_easy_init failed");

    std::string response;
    curl_mime* mime = curl_mime_init(curl);
    curl_mimepart* part = curl_mime_addpart(mime);
    curl_mime_name(part, "file");
    curl_mime_filedata(part, pdf_path.c_str());

    curl_easy_setopt(curl, CURLOPT_URL, (base_url + "/extract").c_str());
    curl_easy_setopt(curl, CURLOPT_MIMEPOST, mime);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode res = curl_easy_perform(curl);
    curl_mime_free(mime);
    curl_easy_cleanup(curl);

    if (res != CURLE_OK)
        throw std::runtime_error(curl_easy_strerror(res));

    return response;
}

Track B — native: cbindgen generates pdftract.h; link against libpdftract.so / pdftract.dll. Distribute as a vcpkg port or conan recipe with the header and shared library. No standard package manager — provide both options.

PHP

Track A — subprocess:

<?php

class Pdftract
{
    public function __construct(
        private string $binaryPath = 'pdftract'
    ) {}

    public function extract(string $pdfPath): array
    {
        $cmd = escapeshellcmd($this->binaryPath)
             . ' extract '
             . escapeshellarg($pdfPath);

        $descriptors = [
            1 => ['pipe', 'w'],
            2 => ['pipe', 'w'],
        ];

        $proc = proc_open($cmd, $descriptors, $pipes);
        if (!is_resource($proc)) {
            throw new RuntimeException('Failed to start pdftract');
        }

        $stdout = stream_get_contents($pipes[1]);
        $stderr = stream_get_contents($pipes[2]);
        fclose($pipes[1]);
        fclose($pipes[2]);
        $exit = proc_close($proc);

        if ($exit !== 0) {
            throw new RuntimeException("pdftract exited $exit: $stderr");
        }

        return json_decode($stdout, true, 512, JSON_THROW_ON_ERROR);
    }

    public function extractText(string $pdfPath): string
    {
        $result = $this->extract($pdfPath);
        $lines = [];
        foreach ($result['pages'] as $page) {
            foreach ($page['blocks'] as $block) {
                $lines[] = $block['text'];
            }
        }
        return implode("\n\n", $lines);
    }

    public function extractPage(string $pdfPath, int $page): array
    {
        $result = $this->extract($pdfPath);
        foreach ($result['pages'] as $p) {
            if ($p['page'] === $page) return $p;
        }
        throw new OutOfRangeException("Page $page not found");
    }
}

Track A — HTTP:

<?php

class PdftractHttpClient
{
    public function __construct(
        private string $baseUrl = 'http://localhost:8080'
    ) {}

    public function extract(string $pdfPath): array
    {
        $ch = curl_init($this->baseUrl . '/extract');
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => ['file' => new CURLFile($pdfPath, 'application/pdf')],
        ]);

        $response = curl_exec($ch);
        $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($status !== 200) {
            throw new RuntimeException("HTTP $status from pdftract serve");
        }

        return json_decode($response, true, 512, JSON_THROW_ON_ERROR);
    }

    public function extractText(string $pdfPath): string
    {
        $result = $this->extract($pdfPath);
        $lines = array_map(
            fn($page) => implode("\n", array_column($page['blocks'], 'text')),
            $result['pages']
        );
        return implode("\n\n", $lines);
    }
}

Track B — native: ext-php-rs compiles a PHP extension in Rust directly. Alternatively, PHP 8+ FFI (FFI::load) can call into a C ABI shared library without writing a C extension. The FFI approach is easier to distribute but has higher per-call overhead than a compiled extension.

Distribution: Composer package (packagist.org). The package downloads the platform binary in a post-install script. PHP extension distribution requires pecl and per-version compilation, which is significant maintenance overhead — subprocess Track A is the right starting point.

Kotlin

The JVM is shared with Java, so the implementation is the same ProcessBuilder and java.net.http.HttpClient approach. The Kotlin wrapper adds idiomatic sugar: coroutines for async, extension functions, and data classes for the JSON model.

Subprocess:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import java.io.File

@Serializable
data class Span(val text: String, val bbox: List<Double>, val font: String,
                val size: Double, val confidence: Double)

@Serializable
data class Block(val kind: String, val text: String, val bbox: List<Double>)

@Serializable
data class Page(val page: Int, val spans: List<Span>, val blocks: List<Block>)

@Serializable
data class Metadata(val title: String? = null, val author: String? = null,
                    val page_count: Int)

@Serializable
data class ExtractionResult(val pages: List<Page>, val metadata: Metadata)

class Pdftract(private val binaryPath: String = "pdftract") {

    private val json = Json { ignoreUnknownKeys = true }

    suspend fun extract(pdfPath: String): ExtractionResult = withContext(Dispatchers.IO) {
        val process = ProcessBuilder(binaryPath, "extract", pdfPath)
            .redirectErrorStream(false)
            .start()

        val stdout = process.inputStream.bufferedReader().readText()
        val stderr = process.errorStream.bufferedReader().readText()
        val exit = process.waitFor()

        if (exit != 0) throw RuntimeException("pdftract exited $exit: $stderr")
        json.decodeFromString(stdout)
    }

    suspend fun extractText(pdfPath: String): String =
        extract(pdfPath).pages
            .flatMap { it.blocks }
            .joinToString("\n\n") { it.text }

    suspend fun extractPage(pdfPath: String, page: Int): Page =
        extract(pdfPath).pages.first { it.page == page }
}

HTTP:

import io.ktor.client.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import java.io.File

class PdftractHttpClient(
    private val baseUrl: String = "http://localhost:8080",
    private val client: HttpClient = HttpClient()
) {
    private val json = Json { ignoreUnknownKeys = true }

    suspend fun extract(pdfPath: String): ExtractionResult {
        val file = File(pdfPath)
        val response: HttpResponse = client.submitFormWithBinaryData(
            url = "$baseUrl/extract",
            formData = formData {
                append("file", file.readBytes(), Headers.build {
                    append(HttpHeaders.ContentType, "application/pdf")
                    append(HttpHeaders.ContentDisposition, "filename=\"${file.name}\"")
                })
            }
        )
        return json.decodeFromString(response.bodyAsText())
    }

    suspend fun extractText(pdfPath: String): String =
        extract(pdfPath).pages
            .flatMap { it.blocks }
            .joinToString("\n\n") { it.text }
}

Distribution: Maven Central, same artifact group as the Java package (com.pdftract). Separate artifact ID (pdftract-kotlin) so Java users don't pull in Kotlin stdlib.

Implementation sequencing

Priority	Language	Effort	Rationale
1	TypeScript	Half a day	Type definitions on top of existing JS code
2	Kotlin	Half a day	JVM wrapper on top of existing Java code
3	C#	1–2 days	Subprocess is straightforward; NuGet RID packaging is the complexity
4	PHP	1 day	Composer subprocess wrapper; avoid extension track initially
5	C++	1–2 days	`popen` + libcurl; no package manager standard, distribute as vcpkg port

All five are blocked on the GitHub Releases binary distribution infrastructure being in place first.

17 KiB Raw Blame History Unescape Escape

SDK Architecture and Language Coverage

Top 10 language coverage status

Common infrastructure (required before any SDK ships)

Binary distribution

Release format

Two SDK tracks

Track A — Subprocess / HTTP wrappers

Track B — Native bindings

Per-language breakdown

Gap detail

TypeScript

C# / .NET

C++

PHP

Kotlin

Implementation sequencing

17 KiB

Raw Blame History