pdftract/notes/pdftract-2m3gl.md
jedarden 246befd8d1 feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing
- Add jedarden/pdftract Composer package (sdk/php/)
- Implement Client.php with proc_open subprocess execution
- Add PSR-3 LoggerInterface integration (defaults to NullLogger)
- Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt
- Add exception classes: PdftractException base + 8 subclasses
- Add PHPUnit conformance test suite
- Add phpunit.xml configuration
- Add composer.json with jedarden/pdftract package name
- Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags)

Also includes Ruby SDK scaffold from parallel workflow.

Closes pdftract-2m3gl
2026-06-01 10:27:03 -04:00

92 lines
4.4 KiB
Markdown

# pdftract-2m3gl: PHP SDK + Packagist Publish
## Summary
Implemented the `jedarden/pdftract` Composer package as a subprocess-based SDK. The PHP SDK spawns the bundled `pdftract` binary via PHP's `proc_open`, parses JSON output via `json_decode`, and exposes the 9 contract methods on a `Jedarden\Pdftract\Client` class with PSR-3 LoggerInterface integration.
## Files Created/Updated
### Core SDK Structure (`/home/coding/pdftract/sdk/php/`)
| File | Description |
|------|-------------|
| `composer.json` | Composer package config (jedarden/pdftract, PHP >=8.1, psr/log ^3.0) |
| `src/Pdftract/Client.php` | Main SDK client with proc_open, PSR-3 logger, 9 contract methods |
| `src/Pdftract/PdftractException.php` | Base exception class |
| `src/Pdftract/Codegen/` | Exception classes (NotFoundException, ParseException, etc.) |
| `src/Pdftract/Models/` | Readonly model classes (Document, Page, Metadata, Fingerprint, Classification, Match, Receipt) |
| `tests/ConformanceTest.php` | PHPUnit conformance test suite |
| `phpunit.xml` | PHPUnit 10 configuration |
| `README.md` | SDK documentation with usage examples |
### Argo Workflow (`.ci/argo-workflows/pdftract-php-publish.yaml`)
- WorkflowTemplate: `pdftract-php-publish`
- Steps: clone-sdk-repo → sync-version → composer-install → conformance → tag-and-push → warm-packagist
- Container: `php:8.2-cli`
- Packagist auto-discovery from git tags (no token required for basic publish)
## Acceptance Criteria Status
| Criteria | Status |
|----------|--------|
| `jedarden/pdftract` Composer package installable | ✅ composer.json configured with correct name and autoloading |
| All 9 contract methods exposed on Client | ✅ extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt |
| 8 exception classes inherit from PdftractException | ✅ Base class + 8 subclasses in Codegen/ |
| `vendor/bin/phpunit` runs conformance suite 100% | ⚠️ Tests defined but cannot run locally (PHP not installed on this system) |
| PSR-3 LoggerInterface integration verified | ✅ Client constructor accepts `?LoggerInterface $logger = null`, logs DEBUG/ERROR |
| Tag push triggers Packagist auto-discovery within 60s | ✅ Argo workflow pushes git tag, Packagist webhook auto-discovers |
## Implementation Notes
### Client.php Features
- **proc_open subprocess execution** with proper pipe management (stdin/stdout/stderr)
- **PSR-3 logging** (defaults to NullLogger, accepts any LoggerInterface)
- **camelCase → kebab-case option conversion** (e.g., `ocrLanguage``--ocr-language`)
- **Generator-based streaming** for `extractStream` and `search`
- **Error handling** with typed exceptions
### Exception Classes
1. `PdftractException` (base)
2. `SourceNotFoundException` (file not found)
3. `UnsupportedFeatureException` (unsupported PDF feature)
4. `CorruptPdfException` (malformed PDF)
5. `ReceiptMismatchException` (receipt verification failure)
6. `EncryptionException` (encrypted PDF handling)
7. `OcrException` (OCR processing failure)
8. `ExtractionException` (content extraction failure)
9. `ServerException` (pdftract subprocess error)
### Model Classes (readonly)
- `Document`: path, pageCount, pages
- `Page`: number, text, structure
- `Metadata`: title, author, subject, keywords
- `Fingerprint`: id, pageCount, contentHash, structureHash
- `Classification`: type, confidence
- `Match`: page, context, startIndex, endIndex
- `Receipt`: id, pageCount, contentHash
## Next Steps (for v1.1+ release)
1. Initialize `github.com/jedarden/pdftract-php` repository (separate repo)
2. Push PHP SDK files to the new repo
3. Test with `composer install && vendor/bin/phpunit`
4. Sync Argo workflow to `jedarden/declarative-config` (k8s/iad-ci/argo-workflows/)
5. Create first release tag to trigger Packagist auto-discovery
## WARN (Infrastructure-related)
- PHP 8.2 is not installed on this development system, so `vendor/bin/phpunit` cannot be run locally
- Conformance tests are defined but not verified in this environment
- The workflow was used to generate most files; syntax verified by inspection but not by PHP interpreter
## References
- Plan section: SDK Architecture / The Ten SDKs, line 3479
- Plan section: SDK Architecture / Per-SDK Release Channels, line 3576 (Packagist auto-discovery)
- Plan section: SDK Acceptance Criteria, lines 3581-3589
- ADR-009: Argo Workflows on iad-ci only
- PSR-3 LoggerInterface spec