- Add jedarden/pdftract Composer package (sdk/php/)
- Implement Client.php with proc_open subprocess execution
- Add PSR-3 LoggerInterface integration (defaults to NullLogger)
- Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt
- Add exception classes: PdftractException base + 8 subclasses
- Add PHPUnit conformance test suite
- Add phpunit.xml configuration
- Add composer.json with jedarden/pdftract package name
- Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags)

Also includes Ruby SDK scaffold from parallel workflow.

Closes pdftract-2m3gl

# jedarden/pdftract

PHP subprocess SDK for pdftract document extraction.

## Installation

```bash
composer require jedarden/pdftract
```

## Requirements

- PHP 8.2 or higher
- The `pdftract` binary, either on your `PATH` or at a location passed to the `Client` constructor

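To check the second requirement before constructing a client, a script can look for the binary on `PATH` itself. The following is a minimal sketch: `findOnPath` is a hypothetical helper written for this example and is not part of the SDK, and it does not handle Windows executable extensions such as `.exe`.

```php
<?php
declare(strict_types=1);

/**
 * Return the full path of $name if it is an executable file in one of the
 * directories listed in $path, or null if it is not found.
 * Hypothetical helper for illustration; not part of the SDK.
 */
function findOnPath(string $name, ?string $path = null): ?string
{
    // Fall back to the PATH environment variable when no list is given.
    $path ??= (string) getenv('PATH');

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

$binary = findOnPath('pdftract');
if ($binary === null) {
    fwrite(STDERR, "pdftract not found on PATH; pass its location to the Client constructor instead\n");
} else {
    echo "pdftract found at {$binary}\n";
}
```

Running a check like this at startup turns a missing binary into a clear message instead of a failure on the first extraction call.
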
## Usage

```php
// Load Composer's autoloader
require __DIR__ . '/vendor/autoload.php';

use Jedarden\Pdftract\Client;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// With optional PSR-3 logger (Monolog is used here only as an example)
$logger = new Logger('pdftract');
$logger->pushHandler(new StreamHandler('php://stdout', Logger::DEBUG));

$client = new Client(logger: $logger);

// Extract document
$document = $client->extract('document.pdf');
echo "Pages: {$document->pageCount}\n";

// Extract text
$text = $client->extractText('document.pdf');

// Extract Markdown
$markdown = $client->extractMarkdown('document.pdf');

// Stream pages
foreach ($client->extractStream('document.pdf') as $page) {
    echo "Page {$page->number}: {$page->text}\n";
}

// Search
foreach ($client->search('document.pdf', 'invoice') as $match) {
    echo "Found at page {$match->page}\n";
}

// Get metadata
$metadata = $client->getMetadata('document.pdf');

// Hash for fingerprinting
$fingerprint = $client->hash('document.pdf');

// Classify document
$classification = $client->classify('document.pdf');

// Verify receipt
// $receipt must already hold a receipt obtained earlier; creating one is not shown here
$valid = $client->verifyReceipt('document.pdf', $receipt);
```

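The package describes itself as a subprocess SDK, so the calls above presumably hand the work to the `pdftract` binary. The general pattern can be sketched with PHP's `proc_open`. This is an illustration only and not the SDK's `Client` code: `runCommand` is a hypothetical helper, and the PHP interpreter stands in for `pdftract` because its command-line arguments are not documented here.

```php
<?php
declare(strict_types=1);

/**
 * Run a command without a shell and return its exit code and output.
 * Hypothetical helper for illustration; not the SDK's Client code.
 *
 * @param list<string> $argv Program followed by its arguments.
 * @return array{exitCode: int, stdout: string, stderr: string}
 */
function runCommand(array $argv): array
{
    // Send stderr to a temporary file so that reading stdout to the end
    // cannot deadlock on a full stderr pipe.
    $errFile = tmpfile();
    if ($errFile === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    $descriptors = [
        0 => ['pipe', 'r'], // child's stdin
        1 => ['pipe', 'w'], // child's stdout
        2 => $errFile,      // child's stderr
    ];

    // Passing the command as an array (PHP 7.4+) bypasses the shell, so
    // arguments such as file names need no escaping.
    $process = proc_open($argv, $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start ' . $argv[0]);
    }

    fclose($pipes[0]); // nothing to send on stdin
    $stdout = (string) stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $exitCode = proc_close($process);

    rewind($errFile);
    $stderr = (string) stream_get_contents($errFile);
    fclose($errFile);

    return ['exitCode' => $exitCode, 'stdout' => $stdout, 'stderr' => $stderr];
}

// Stand-in command: the running PHP interpreter, since pdftract's own
// command-line arguments are not documented in this README.
$result = runCommand([PHP_BINARY, '-r', 'echo "hello";']);
echo $result['stdout'], "\n";
```

A wrapper built this way would typically check the exit code and turn a non-zero value, together with the captured stderr, into an exception.
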
## Options

Pass options as an associative array in the second argument, as with `extract` here:

```php
$document = $client->extract('document.pdf', [
    'ocrLanguage' => 'eng',
    'structure' => true,
]);
```

## Logging

The `Client` accepts any PSR-3 `LoggerInterface` through its `logger` argument. When none is given, it defaults to a `NullLogger`, so nothing is logged:

```php
$client = new Client(logger: $myLogger);
```

## License

MIT

## Support

- Issues: https://github.com/jedarden/pdftract-php/issues
- Upstream: https://github.com/jedarden/pdftract