pdftract/pdftract-ruby/README.md
jedarden 246befd8d1 feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing
- Add jedarden/pdftract Composer package (sdk/php/)
- Implement Client.php with proc_open subprocess execution
- Add PSR-3 LoggerInterface integration (defaults to NullLogger)
- Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt
- Add exception classes: PdftractException base + 8 subclasses
- Add PHPUnit conformance test suite
- Add phpunit.xml configuration
- Add composer.json with jedarden/pdftract package name
- Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags)

Also includes Ruby SDK scaffold from parallel workflow.

Closes pdftract-2m3gl
2026-06-01 10:27:03 -04:00


# pdftract-ruby
Ruby SDK for pdftract - PDF extraction and conformance testing.
## Installation
```bash
gem install pdftract
```
Or in your Gemfile:
```ruby
gem 'pdftract', '~> 1.0.0'
```
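The `~>` operator is RubyGems' pessimistic version constraint. As a small illustration using only RubyGems' own classes (nothing here is part of the pdftract gem), the snippet below checks which gem versions `'~> 1.0.0'` would accept: patch releases in the 1.0.x series are allowed, while 1.1.0 and later are not.

```ruby
require 'rubygems'

# The same constraint string as in the Gemfile line above.
requirement = Gem::Requirement.new('~> 1.0.0')

# Sample version numbers chosen for illustration only.
%w[1.0.0 1.0.9 1.1.0 2.0.0].each do |version|
  allowed = requirement.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

If you want to pick up minor releases as well, a looser constraint such as `'~> 1.0'` accepts anything from 1.0 up to, but not including, 2.0.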
## Usage
### Basic extract
```ruby
require 'pdftract'
client = Pdftract.client
doc = client.extract('document.pdf')
puts "Pages: #{doc.pages.length}"
```
### Extract with OCR
```ruby
doc = client.extract('scanned.pdf', { ocr_language: 'eng', ocr_threshold: 0.7 })
```
### Extract text
```ruby
text = client.extract_text('document.pdf')
puts text
```
### Extract Markdown
```ruby
markdown = client.extract_markdown('document.pdf')
puts markdown
```
### Stream extraction
```ruby
client.extract_stream('large.pdf').each do |page|
  puts "Page #{page.page}: #{page.blocks&.length || 0} blocks"
end
```
### Search
```ruby
client.search('document.pdf', 'invoice').each do |match|
  puts "Found on page #{match.page}: #{match.text}"
end
```
### Get metadata
```ruby
metadata = client.get_metadata('document.pdf')
puts "Title: #{metadata.title}"
puts "Pages: #{metadata.page_count}"
```
### Hash
```ruby
fingerprint = client.hash('document.pdf')
puts "SHA-256: #{fingerprint.hash}"
puts "Fast hash: #{fingerprint.fast_hash}"
```
### Classify
```ruby
classification = client.classify('document.pdf')
puts "Category: #{classification.category}"
puts "Confidence: #{classification.confidence}"
```
### Verify receipt
```ruby
valid = client.verify_receipt('document.pdf', 'receipt-data')
puts "Valid: #{valid}"
```
## Binary version compatibility
This SDK requires the `pdftract` binary, version 1.0.0 or later. Download it from:
https://github.com/jedarden/pdftract/releases
## Troubleshooting
### Binary not found
Ensure `pdftract` is on your PATH. The SDK probes PATH for the executable.
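To confirm what a PATH probe would see from your own Ruby process, a minimal sketch like the one below can help. It is not the SDK's actual lookup code: `find_executable` is a hypothetical helper written for this example, and it simply walks the directories in `PATH` looking for an executable file with the given name.

```ruby
# Hypothetical helper (not part of the pdftract gem): returns the full path
# of the first executable called `name` found on PATH, or nil if none exists.
def find_executable(name, path_env = ENV.fetch('PATH', ''))
  # On Windows, executables carry one of the extensions listed in PATHEXT.
  extensions = ENV['PATHEXT'] ? ENV['PATHEXT'].split(';') : ['']

  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    extensions.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end

  nil
end

location = find_executable('pdftract')
if location
  puts "pdftract found at #{location}"
else
  puts 'pdftract is not on PATH'
end
```

If the script reports that the binary is missing while your shell can run it, the Ruby process is probably started with a different `PATH` (for example under a service manager or an IDE).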
### Version mismatch
The SDK refuses to invoke a binary whose version does not match what it expects. Install a compatible pdftract release (1.0.0 or later; see Binary version compatibility above).
### Network failure
For remote URLs, check your network connection and TLS certificate chain.
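To tell a network problem apart from a certificate problem, you can probe the URL directly from Ruby. The sketch below uses only the standard library; `diagnose_url` is a hypothetical helper for this example, not an SDK method, and it assumes the server answers `HEAD` requests.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Hypothetical helper (not part of the pdftract gem): returns :ok,
# :tls_error, or :network_error for the given http(s) URL.
def diagnose_url(url, timeout: 5)
  uri = URI.parse(url)

  # With use_ssl enabled, Net::HTTP verifies the peer certificate by default.
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: timeout,
                  read_timeout: timeout) do |http|
    http.head(uri.request_uri)
  end

  :ok
rescue OpenSSL::SSL::SSLError
  # Handshake or certificate verification failed.
  :tls_error
rescue SocketError, SystemCallError, Timeout::Error
  # DNS failure, refused or unreachable connection, or a timeout.
  :network_error
end
```

Calling `diagnose_url` with the URL you pass to the SDK returns `:tls_error` when the certificate chain cannot be verified and `:network_error` when the host cannot be reached at all.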