jedarden 246befd8d1 feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing

- Add jedarden/pdftract Composer package (sdk/php/)
- Implement Client.php with proc_open subprocess execution
- Add PSR-3 LoggerInterface integration (defaults to NullLogger)
- Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt
- Add exception classes: PdftractException base + 8 subclasses
- Add PHPUnit conformance test suite
- Add phpunit.xml configuration
- Add composer.json with jedarden/pdftract package name
- Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags)

Also includes Ruby SDK scaffold from parallel workflow.

Closes pdftract-2m3gl

2026-06-01 10:27:03 -04:00


pdftract PHP SDK

PHP SDK for pdftract, a PDF text extraction tool with structured output. The SDK runs the pdftract binary as a subprocess.

Installation

composer require jedarden/pdftract

Usage

<?php

use Jedarden\Pdftract\Client;
use Jedarden\Pdftract\Source;

// Create client
$client = new Client('pdftract');

// Extract structured data
$result = $client->extract(Source::file('/path/to/document.pdf'), [
    'ocrLanguage' => 'eng'
]);

print_r($result);

// Extract plain text
$text = $client->extractText(Source::file('/path/to/document.pdf'));

// Extract markdown
$markdown = $client->extractMarkdown(Source::file('/path/to/document.pdf'));

// Stream extraction
foreach ($client->extractStream(Source::file('/path/to/document.pdf')) as $page) {
    echo "Page {$page['page_index']}: " . $page['content'] . "\n";
}

// Search in PDF
foreach ($client->search(Source::file('/path/to/document.pdf'), 'pattern') as $match) {
    echo "Found at page {$match['page_index']}\n";
}

// Get metadata
$metadata = $client->getMetadata(Source::file('/path/to/document.pdf'));

// Compute hash
$hash = $client->hash(Source::file('/path/to/document.pdf'));

// Classify document
$classification = $client->classify(Source::file('/path/to/document.pdf'));

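// Verify receipt ($receipt must already hold a receipt string; note that
// verifyReceipt takes a plain path string, not a Source)
$isValid = $client->verifyReceipt('/path/to/document.pdf', $receipt);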

Requirements

PHP >= 8.1
psr/log ^3.0
pdftract binary in PATH
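
Since the SDK depends on an external binary, it can be useful to check for it before constructing a client. The following is a minimal sketch of such a check; `findOnPath` is a hypothetical helper and not part of the SDK, and it does not handle Windows executable extensions such as `.exe`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the SDK: return the full path of an
 * executable found in the given PATH string, or null if it is absent.
 */
function findOnPath(string $binary, ?string $path = null): ?string
{
    // Fall back to the process environment when no PATH string is given.
    $path ??= (string) getenv('PATH');

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        $candidate = $dir . DIRECTORY_SEPARATOR . $binary;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

$found = findOnPath('pdftract');
echo $found === null
    ? "pdftract not found in PATH\n"
    : "pdftract found at {$found}\n";
```

If the check fails, installing the binary or adding its directory to PATH should resolve it.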

Methods

extract(Source|string $source, array $options = []): array

Extract structured data from a PDF.

extractText(Source|string $source, array $options = []): string

Extract plain text from a PDF.

extractMarkdown(Source|string $source, array $options = []): string

Extract markdown from a PDF.

extractStream(Source|string $source, array $options = []): \Generator

Extract structured data as a stream (yields one page at a time).

search(Source|string $source, string $pattern, array $options = []): \Generator

Search a PDF for a text pattern (yields one match at a time).

getMetadata(Source|string $source, array $options = []): array

Get metadata from a PDF.

hash(Source|string $source, array $options = []): array

Compute a hash of a PDF.

classify(Source|string $source, array $options = []): array

Classify a PDF document.

verifyReceipt(string $path, string $receipt): bool

Verify a processing receipt.
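
The two generator-returning methods, extractStream and search, hand results to the caller one item at a time instead of building the full result in memory. The sketch below illustrates that pattern with `proc_open` and a generator. It is not the SDK's actual implementation: `streamJsonLines` is a hypothetical function, and the line-delimited JSON format is an assumption made for illustration. A PHP one-liner stands in for the real binary so the example runs on its own.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: run a command and yield each JSON line of its stdout.
 * Assumes line-delimited JSON output, which this README does not confirm.
 */
function streamJsonLines(array $command): \Generator
{
    // Capture stdout only; stdin and stderr are inherited from the parent.
    $process = proc_open($command, [1 => ['pipe', 'w']], $pipes);
    if (!is_resource($process)) {
        throw new \RuntimeException('Failed to start subprocess');
    }

    try {
        while (($line = fgets($pipes[1])) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // ignore blank lines
            }
            // Decode and yield immediately, so only one item is held at a time.
            yield json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        }
    } finally {
        // Runs even if the consumer stops iterating early.
        fclose($pipes[1]);
        proc_close($process);
    }
}

// Stand-in for the real binary: prints two JSON objects, one per line.
$fake = [
    PHP_BINARY,
    '-r',
    'echo json_encode(["page_index" => 0, "content" => "first page"]), "\n", '
        . 'json_encode(["page_index" => 1, "content" => "second page"]), "\n";',
];

foreach (streamJsonLines($fake) as $page) {
    echo "Page {$page['page_index']}: {$page['content']}\n";
}
```

Because the cleanup sits in a `finally` block, breaking out of the `foreach` early still closes the pipe and reaps the subprocess once the generator is destroyed.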

Options

Option keys are the camelCase form of the corresponding CLI flags (a kebab-case flag of the form --some-flag becomes someFlag):

ocrLanguage - OCR language code (e.g., 'eng', 'fra')
caseInsensitive - Case-insensitive search (boolean)
fast - Use fast hash algorithm (boolean)
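
To make the naming rule concrete, here is a minimal sketch of how camelCase option keys could be turned back into CLI arguments. `optionsToArgs` is a hypothetical helper, not the SDK's code, and the flag names it produces (such as `--ocr-language`) are assumptions about the CLI rather than documented flags.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the SDK: convert camelCase option keys
 * into kebab-case CLI arguments. The resulting flag names are assumptions.
 */
function optionsToArgs(array $options): array
{
    $args = [];

    foreach ($options as $key => $value) {
        // Insert a hyphen before each interior capital, then lowercase:
        // ocrLanguage -> ocr-Language -> ocr-language
        $flag = '--' . strtolower(preg_replace('/(?<!^)[A-Z]/', '-$0', (string) $key));

        if ($value === true) {
            $args[] = $flag;            // boolean true becomes a bare flag
        } elseif ($value !== false && $value !== null) {
            $args[] = $flag;
            $args[] = (string) $value;  // valued option: flag followed by value
        }
        // false and null options are omitted entirely
    }

    return $args;
}

$args = optionsToArgs([
    'ocrLanguage'     => 'eng',
    'caseInsensitive' => true,
    'fast'            => false,
]);

echo implode(' ', $args), "\n"; // prints: --ocr-language eng --case-insensitive
```

Returning the arguments as a list rather than a single string means they can be passed to `proc_open` as an array, which avoids shell quoting issues.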

Logging

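The client accepts any PSR-3 logger as its second constructor argument. When none is given, it defaults to NullLogger and logs nothing. The example below uses Monolog, which is installed separately:

use Jedarden\Pdftract\Client;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// Send log records to standard output
$logger = new Logger('pdftract');
$logger->pushHandler(new StreamHandler('php://stdout'));

$client = new Client('pdftract', $logger);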

License

MIT
