pdftract/pdftract-ruby
jedarden 246befd8d1 feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing
- Add jedarden/pdftract Composer package (sdk/php/)
- Implement Client.php with proc_open subprocess execution
- Add PSR-3 LoggerInterface integration (defaults to NullLogger)
- Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt
- Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt
- Add exception classes: PdftractException base + 8 subclasses
- Add PHPUnit conformance test suite
- Add phpunit.xml configuration
- Add composer.json with jedarden/pdftract package name
- Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags)

Also includes Ruby SDK scaffold from parallel workflow.

Closes pdftract-2m3gl
2026-06-01 10:27:03 -04:00
lib
test
.gitignore
GENERATED
LICENSE
pdftract.gemspec
Rakefile
README.md

pdftract-ruby

Ruby SDK for pdftract - PDF extraction and conformance testing.

Installation

gem install pdftract

Or in your Gemfile:

gem 'pdftract', '~> 1.0.0'
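
The `~> 1.0.0` constraint is RubyGems' pessimistic operator: it accepts any 1.0.x release and rejects 1.1.0 and later. The sketch below checks this with the RubyGems classes that ship with Ruby. The helper name `allowed?` is made up for illustration and is not part of the SDK.

```ruby
require 'rubygems'

# Hypothetical helper (not part of pdftract): reports whether a version
# string satisfies a RubyGems constraint such as the one in the Gemfile.
def allowed?(version, constraint = '~> 1.0.0')
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

puts allowed?('1.0.0') # true: the lower bound itself is accepted
puts allowed?('1.0.9') # true: patch releases within 1.0.x are accepted
puts allowed?('1.1.0') # false: a new minor release is rejected
```

If you want minor releases as well, `'~> 1.0'` accepts everything from 1.0 up to, but not including, 2.0.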

Usage

Basic extract

require 'pdftract'

client = Pdftract.client
doc = client.extract('document.pdf')
puts "Pages: #{doc.pages.length}"

Extract with OCR

doc = client.extract('scanned.pdf', { ocr_language: 'eng', ocr_threshold: 0.7 })

Extract text

text = client.extract_text('document.pdf')
puts text

Extract Markdown

markdown = client.extract_markdown('document.pdf')
puts markdown

Stream extraction

client.extract_stream('large.pdf').each do |page|
  puts "Page #{page.page}: #{page.blocks&.length || 0} blocks"
end
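
Streaming is useful for large files because pages can be handled one at a time instead of being held in memory together. The README does not describe how `extract_stream` is implemented, so the following is only a sketch of the general pattern. It assumes the binary writes one JSON object per line, and the `PageStream` class is hypothetical.

```ruby
require 'json'
require 'stringio'

# Hypothetical class (not the SDK's): wraps an IO that carries one JSON
# object per line and yields each parsed page lazily, reading line by line.
class PageStream
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    return enum_for(:each) unless block_given?

    @io.each_line do |line|
      line = line.strip
      next if line.empty? # tolerate blank lines between records

      yield JSON.parse(line)
    end
  end
end

# Stand-in for the binary's output; a real client would read from a pipe.
output = StringIO.new(<<~NDJSON)
  {"page": 1, "blocks": [{"text": "Invoice"}]}
  {"page": 2, "blocks": []}
NDJSON

PageStream.new(output).each do |page|
  puts "Page #{page['page']}: #{page['blocks'].length} blocks"
end
```

Because the class includes `Enumerable`, callers can also chain `lazy`, `first(n)`, or `select` without reading the whole stream.
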

Search

client.search('document.pdf', 'invoice').each do |match|
  puts "Found on page #{match.page}: #{match.text}"
end

Get metadata

metadata = client.get_metadata('document.pdf')
puts "Title: #{metadata.title}"
puts "Pages: #{metadata.page_count}"

Hash

fingerprint = client.hash('document.pdf')
puts "SHA-256: #{fingerprint.hash}"
puts "Fast hash: #{fingerprint.fast_hash}"

Classify

classification = client.classify('document.pdf')
puts "Category: #{classification.category}"
puts "Confidence: #{classification.confidence}"

Verify receipt

valid = client.verify_receipt('document.pdf', 'receipt-data')
puts "Valid: #{valid}"

Binary version compatibility

This SDK requires pdftract 1.0.0 or later. Download from: https://github.com/jedarden/pdftract/releases

Troubleshooting

Binary not found

The SDK locates the pdftract executable by searching PATH. Make sure the directory that contains the binary is listed there.
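
To check what a PATH search would find, a small script like the one below can help. It is a sketch of the general technique, not the SDK's own lookup code. The `find_executable` helper is hypothetical, and it ignores Windows-specific extensions such as `.exe`.

```ruby
# Hypothetical helper (not part of pdftract): returns the full path of the
# first executable called `name` in the PATH-style string `path`, or nil.
def find_executable(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

location = find_executable('pdftract')
if location
  puts "pdftract found at #{location}"
else
  puts 'pdftract is not on PATH'
end
```

If the script reports that the binary is missing, add its directory to PATH in the same shell or service environment that runs your Ruby process.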

Version mismatch

The SDK refuses to invoke a pdftract binary whose version does not match what it expects. Install pdftract 1.0.0 or later.
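
A version gate of this kind can be sketched with `Gem::Version`, which compares version numbers segment by segment rather than as strings. This is an illustration only. It assumes that "mismatched" means older than the 1.0.0 minimum stated under "Binary version compatibility" and that the binary reports its version as text such as "pdftract 1.2.3". Both helper names are hypothetical.

```ruby
require 'rubygems'

# Minimum binary version named in this README.
MINIMUM_VERSION = Gem::Version.new('1.0.0')

# Hypothetical helper: extracts the first x.y.z number from a version banner.
# The banner format is an assumption, not documented behaviour.
def parse_version(banner)
  number = banner[/\d+\.\d+\.\d+/]
  number && Gem::Version.new(number)
end

# Hypothetical helper: true when the banner names a supported version.
def compatible?(banner)
  version = parse_version(banner)
  !version.nil? && version >= MINIMUM_VERSION
end

puts compatible?('pdftract 1.2.3')  # true
puts compatible?('pdftract 0.9.0')  # false
puts compatible?('pdftract 1.10.0') # true: 1.10.0 is newer than 1.9.x
```

The last line shows why a plain string comparison is not enough: as strings, "1.10.0" sorts before "1.9.0".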

Network failure

For remote URLs, check your network connection and TLS certificate chain.