- Add jedarden/pdftract Composer package (sdk/php/)
- Implement Client.php with proc_open subprocess execution
- Add PSR-3 LoggerInterface integration (defaults to NullLogger)
- Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt
- Add exception classes: PdftractException base + 8 subclasses
- Add PHPUnit conformance test suite
- Add phpunit.xml configuration
- Add composer.json with jedarden/pdftract package name
- Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags)

Also includes Ruby SDK scaffold from parallel workflow.

Closes pdftract-2m3gl

# pdftract-ruby

Ruby SDK for pdftract - PDF extraction and conformance testing.

## Installation

```bash
gem install pdftract
```

Or in your Gemfile:

```ruby
gem 'pdftract', '~> 1.0.0'
```

## Usage

### Basic extract

```ruby
require 'pdftract'

client = Pdftract.client
doc = client.extract('document.pdf')
puts "Pages: #{doc.pages.length}"
```

### Extract with OCR

```ruby
doc = client.extract('scanned.pdf', { ocr_language: 'eng', ocr_threshold: 0.7 })
```

### Extract text

```ruby
text = client.extract_text('document.pdf')
puts text
```

### Extract Markdown

```ruby
markdown = client.extract_markdown('document.pdf')
puts markdown
```

### Stream extraction

```ruby
client.extract_stream('large.pdf').each do |page|
  puts "Page #{page.page}: #{page.blocks&.length || 0} blocks"
end
```

### Search

```ruby
client.search('document.pdf', 'invoice').each do |match|
  puts "Found on page #{match.page}: #{match.text}"
end
```

### Get metadata

```ruby
metadata = client.get_metadata('document.pdf')
puts "Title: #{metadata.title}"
puts "Pages: #{metadata.page_count}"
```

### Hash

```ruby
fingerprint = client.hash('document.pdf')
puts "SHA-256: #{fingerprint.hash}"
puts "Fast hash: #{fingerprint.fast_hash}"
```

### Classify

```ruby
classification = client.classify('document.pdf')
puts "Category: #{classification.category}"
puts "Confidence: #{classification.confidence}"
```

### Verify receipt

```ruby
valid = client.verify_receipt('document.pdf', 'receipt-data')
puts "Valid: #{valid}"
```

## Binary version compatibility

This SDK requires pdftract 1.0.0 or later. Download from:
https://github.com/jedarden/pdftract/releases

## Troubleshooting

### Binary not found
Ensure the `pdftract` binary is on your `PATH`. The SDK locates the executable by probing `PATH`.

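To check what a PATH probe would find on your machine, a minimal sketch along these lines may help. The `find_executable` helper is hypothetical and is not part of the pdftract gem; the SDK's actual lookup logic may differ.

```ruby
# Hypothetical helper (not part of the pdftract gem): walk each directory
# listed in PATH and return the first executable file with the given name.
def find_executable(name, path_env = ENV.fetch('PATH', ''))
  # On Windows, PATHEXT lists extensions such as .EXE; elsewhere it is unset.
  extensions = [''] + ENV.fetch('PATHEXT', '').split(';')

  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    extensions.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end

  nil # nothing on PATH matched
end

location = find_executable('pdftract')
puts(location ? "pdftract found at #{location}" : 'pdftract is not on PATH')
```

If this prints that the binary is not on PATH, add the directory containing `pdftract` to `PATH` and start a new shell before retrying.
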
### Version mismatch
The SDK refuses to invoke a `pdftract` binary whose version does not match what it supports. Install a compatible release (1.0.0 or later).

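The version requirement can be illustrated with a short sketch built on `Gem::Version`. The `supported_version?` helper and the `pdftract 1.2.0` output format are assumptions made for illustration; the SDK's real check and the binary's real version output may look different.

```ruby
require 'rubygems' # provides Gem::Version (loaded by default in most setups)

# Minimum binary version stated in this README.
MINIMUM_VERSION = Gem::Version.new('1.0.0')

# Hypothetical helper (not part of the pdftract gem): pull the first
# MAJOR.MINOR.PATCH triple out of a version string and compare it
# against the minimum supported version.
def supported_version?(version_output)
  found = version_output[/\d+\.\d+\.\d+/]
  return false unless found # no recognisable version in the string

  Gem::Version.new(found) >= MINIMUM_VERSION
end

# The "pdftract X.Y.Z" format below is assumed, not confirmed.
puts supported_version?('pdftract 1.2.0') # true
puts supported_version?('pdftract 0.9.4') # false
```

`Gem::Version` compares versions segment by segment, which avoids the pitfalls of comparing version strings lexically.
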
### Network failure
For remote URLs, check your network connection and TLS certificate chain.