# pdftract-ruby

Ruby SDK for pdftract: PDF extraction and conformance testing.

## Installation

```bash
gem install pdftract
```

Or in your Gemfile:

```ruby
gem 'pdftract', '~> 1.0.0'
```
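
The `~> 1.0.0` constraint uses RubyGems' pessimistic operator: it accepts any 1.0.x release but rejects 1.1.0 and later. The sketch below checks this with the standard `Gem::Requirement` class. The helper name `allowed_by_gemfile?` and the version numbers are illustrative assumptions, not part of the SDK or a list of real pdftract releases.

```ruby
require 'rubygems' # Gem::Requirement and Gem::Version ship with Ruby

# Hypothetical helper (not part of the pdftract SDK): reports whether a
# version string satisfies a Gemfile-style constraint.
def allowed_by_gemfile?(version, constraint = '~> 1.0.0')
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Illustrative version numbers only.
%w[1.0.0 1.0.7 1.1.0 2.0.0].each do |v|
  puts "#{v}: #{allowed_by_gemfile?(v) ? 'allowed' : 'rejected'}"
end
```

If you want to pick up minor releases as well, a looser constraint such as `'~> 1.0'` accepts any 1.x version.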

## Usage

### Basic extract

```ruby
require 'pdftract'

client = Pdftract.client
doc = client.extract('document.pdf')
puts "Pages: #{doc.pages.length}"
```

### Extract with OCR

```ruby
doc = client.extract('scanned.pdf', { ocr_language: 'eng', ocr_threshold: 0.7 })
```

### Extract text

```ruby
text = client.extract_text('document.pdf')
puts text
```

### Extract Markdown

```ruby
markdown = client.extract_markdown('document.pdf')
puts markdown
```

### Stream extraction

```ruby
client.extract_stream('large.pdf').each do |page|
  puts "Page #{page.page}: #{page.blocks&.length || 0} blocks"
end
```

### Search

```ruby
client.search('document.pdf', 'invoice').each do |match|
  puts "Found on page #{match.page}: #{match.text}"
end
```

### Get metadata

```ruby
metadata = client.get_metadata('document.pdf')
puts "Title: #{metadata.title}"
puts "Pages: #{metadata.page_count}"
```

### Hash

```ruby
fingerprint = client.hash('document.pdf')
puts "SHA-256: #{fingerprint.hash}"
puts "Fast hash: #{fingerprint.fast_hash}"
```

### Classify

```ruby
classification = client.classify('document.pdf')
puts "Category: #{classification.category}"
puts "Confidence: #{classification.confidence}"
```

### Verify receipt

```ruby
valid = client.verify_receipt('document.pdf', 'receipt-data')
puts "Valid: #{valid}"
```

## Binary version compatibility

This SDK requires the `pdftract` binary, version 1.0.0 or later. Download it from:
https://github.com/jedarden/pdftract/releases

## Troubleshooting

### Binary not found

Ensure the `pdftract` executable is on your `PATH`. The SDK locates the binary by searching the directories listed in `PATH`.
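
To check what a `PATH` search would find from Ruby, a minimal sketch follows. The helper `find_executable` is hypothetical and only approximates a shell-style lookup; the SDK's actual probing logic may differ.

```ruby
# Hypothetical helper (not part of the pdftract SDK): returns the full path
# of the first matching executable on PATH, or nil if none is found.
def find_executable(name, path_env = ENV.fetch('PATH', ''))
  # On Windows, PATHEXT lists the extensions that mark a file as executable.
  exts = ENV['PATHEXT'] ? ENV['PATHEXT'].split(';') : ['']
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    exts.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

location = find_executable('pdftract')
if location
  puts "pdftract found at #{location}"
else
  puts 'pdftract is not on PATH'
end
```

If the helper returns `nil`, add the directory containing the binary to `PATH` and restart your shell or application.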

### Version mismatch

The SDK refuses to invoke a `pdftract` binary whose version it does not support. Install pdftract 1.0.0 or later (see "Binary version compatibility").
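
A version check of this kind can be sketched with the standard `Gem::Version` class. The helper `supported_version?` is hypothetical, and the sketch does not show how to get the installed binary's version string, since this README does not document that. The SDK's own check may be stricter than a simple minimum.

```ruby
require 'rubygems' # Gem::Version ships with Ruby

# Hypothetical helper (not part of the pdftract SDK): true when the given
# version string meets the minimum stated in this README (1.0.0).
def supported_version?(version_string, minimum = '1.0.0')
  Gem::Version.new(version_string) >= Gem::Version.new(minimum)
end

puts supported_version?('1.2.0') # => true
puts supported_version?('0.9.4') # => false
```

`Gem::Version` compares segments numerically, so `1.10.0` correctly sorts after `1.9.0`, which a plain string comparison would get wrong.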

### Network failure

For remote URLs, check your network connection and TLS certificate chain.