diff --git a/docs/user-docs/src/SUMMARY.md b/docs/user-docs/src/SUMMARY.md index 881c447..3c38fec 100644 --- a/docs/user-docs/src/SUMMARY.md +++ b/docs/user-docs/src/SUMMARY.md @@ -50,6 +50,8 @@ - [Hybrid Routing](./advanced/hybrid-routing.md) - [Provenance and Confidence](./advanced/provenance.md) +- [Troubleshooting Guide](./troubleshooting.md) + - [Troubleshooting](./troubleshooting/README.md) - [Common Issues](./troubleshooting/common-issues.md) - [Diagnostics](./troubleshooting/diagnostics.md) diff --git a/docs/user-docs/src/sdk/python.md b/docs/user-docs/src/sdk/python.md index e3182c8..4352e38 100644 --- a/docs/user-docs/src/sdk/python.md +++ b/docs/user-docs/src/sdk/python.md @@ -1,5 +1,250 @@ # Python SDK -> **Draft** — This page is a placeholder for future content. +The Python SDK (`pdftract`) provides native Python bindings with idiomatic ergonomics including an exception hierarchy, dataclass types, and optional asyncio wrappers. -Using pdftract from Python. +## Installation + +```bash +pip install pdftract +``` + +The package includes a precompiled native module for your platform. If the native module fails to import, a subprocess fallback is automatically used (with significantly degraded performance). 
+ +## Basic Extraction + +```python +import pdftract + +doc = pdftract.extract("document.pdf") +print(f"Extracted {len(doc.pages)} pages") + +for page in doc.pages: + for span in page.spans: + print(span.text) +``` + +## Text-Only Extraction + +For RAG pipelines that just need the text body: + +```python +import pdftract + +text = pdftract.extract_text("document.pdf") +print(text) +``` + +## Streaming + +For large PDFs, stream pages one at a time to keep memory usage bounded: + +```python +import pdftract + +for page in pdftract.extract_stream("large_document.pdf"): + print(f"Page {page.page_index}: {len(page.spans)} spans") + # Process page while only one page is resident in memory +``` + +## Markdown Extraction + +Extract Markdown with optional anchor links for mapping back to PDF locations: + +```python +import pdftract + +# Basic Markdown +markdown = pdftract.extract_markdown("document.pdf") + +# With anchor links (HTML comments) +markdown = pdftract.extract_markdown("document.pdf", anchors=True) +``` + +## Options + +Pass extraction options as keyword arguments: + +```python +import pdftract + +doc = pdftract.extract( + "document.pdf", + pages="1-5,7", # Page range + password="secret123", # PDF password + receipts="lite" # Receipt generation mode +) +``` + +### Available Options + +| Option | Type | Default | Use Case | +|--------|------|---------|----------| +| `pages` | `str \| None` | `None` | Page range (e.g., `"1-5,7,12-"`) | +| `password` | `str \| None` | `None` | PDF password for encrypted documents | +| `receipts` | `str \| None` | `None` | Receipt mode: `"off"`, `"lite"`, or `"full"` | +| `ocr` | `bool` | `False` | Enable OCR for scanned documents | +| `ocr_language` | `list[str]` | `["eng"]` | OCR language codes | +| `include_invisible` | `bool` | `False` | Include invisible text in output | +| `extract_forms` | `bool` | `True` | Extract AcroForm fields | +| `extract_attachments` | `bool` | `True` | Extract embedded attachments | +| 
`readability_threshold` | `float` | `0.0` | Minimum readability score | +| `max_decompress_gb` | `int` | `512` | Max decompressed GB per stream | +| `full_render` | `bool` | `False` | Enable full rendering | + +## Error Handling + +The SDK provides a structured exception hierarchy: + +```python +import pdftract + +try: + doc = pdftract.extract("encrypted.pdf", password="wrong") +except pdftract.EncryptionError as e: + print(f"Encryption error: {e.code} - {e.hint}") +except pdftract.CorruptPdfError as e: + print(f"Corrupt PDF: {e}") +except pdftract.SourceUnreachableError as e: + print(f"File not found: {e}") +except pdftract.PdftractError as e: + print(f"Extraction failed: {e}") +``` + +### Exception Hierarchy + +All exceptions inherit from `PdftractError`: + +- `PdftractError` — Base exception for all extraction errors +- `EncryptionError` — PDF encryption/password errors +- `CorruptPdfError` — Malformed or corrupted PDF +- `SourceUnreachableError` — File or URL unreachable +- `RemoteFetchInterruptedError` — Network interruption during fetch +- `TlsError` — TLS/certificate errors +- `ReceiptVerifyError` — Receipt verification failed +- `UnsupportedOperationError` — Requested operation not available + +### Exception Attributes + +All exceptions have the following attributes: + +- `code` — Diagnostic code (e.g., `"ENCRYPTION_WRONG_PASSWORD"`) +- `page_index` — Page number where error occurred (if applicable) +- `hint` — Suggested action for resolution + +## Metadata + +Get document metadata without full extraction: + +```python +import pdftract + +metadata = pdftract.get_metadata("document.pdf") +print(f"Pages: {metadata.page_count}") +print(f"Title: {metadata.title}") +print(f"Author: {metadata.author}") +print(f"Fingerprint: {metadata.fingerprint}") +``` + +## Search + +Search for a regex pattern in the PDF: + +```python +import pdftract + +for match in pdftract.search("document.pdf", r"\b\d{3}-\d{2}-\d{4}\b"): + print(f"Found SSN at page {match.page_index}: 
{match.text}") +``` + +## Fingerprint + +Compute the structural fingerprint of a PDF: + +```python +import pdftract + +fingerprint = pdftract.hash("document.pdf") +print(f"Fingerprint: {fingerprint.value}") +``` + +## Classify + +Classify a PDF page type: + +```python +import pdftract + +classification = pdftract.classify("document.pdf") +print(f"Type: {classification.class_name}") +print(f"Confidence: {classification.confidence}") +``` + +## Verify Receipt + +Verify a cryptographic receipt: + +```python +import pdftract + +# Extract with receipts enabled +doc = pdftract.extract("document.pdf", receipts="lite") +receipt = doc.pages[0].receipt + +# Verify later +verified = pdftract.verify_receipt("document.pdf", receipt) +print(f"Verified: {verified}") +``` + +## Remote PDFs + +Extract from HTTP/HTTPS URLs: + +```python +import pdftract + +doc = pdftract.extract("https://example.com/document.pdf") +``` + +## MCP Integration + +For AI-assisted PDF extraction, pdftract provides an [MCP (Model Context Protocol) server](../integrations/mcp-clients.md). The Python SDK can be used alongside MCP clients like Claude Desktop: + +```bash +pdftract mcp --stdio +``` + +See [MCP Client Configuration Guide](../integrations/mcp-clients.md) for setup instructions. 
+ +## Types + +The SDK provides typed wrappers for all output structures: + +```python +import pdftract +from pdftract.types import Document, Page, Span, Block, Metadata + +# All extraction functions return typed objects +doc: Document = pdftract.extract("document.pdf") +page: Page = doc.pages[0] +span: Span = page.spans[0] +block: Block = page.blocks[0] +metadata: Metadata = pdftract.get_metadata("document.pdf") +``` + +## Async API + +For asyncio-based applications, use the async API: + +```python +import pdftract.asyncio as pdftract_async + +async def extract_async(): + doc = await pdftract_async.extract("document.pdf") + print(f"Extracted {len(doc.pages)} pages") +``` + +## See Also + +- [MCP Client Configuration Guide](../integrations/mcp-clients.md) +- [JSON Schema Reference](../json-schema-reference.md) +- [CLI Reference](../cli/README.md) +- [Advanced: OCR Configuration](../advanced/ocr.md) diff --git a/docs/user-docs/src/sdk/rust.md b/docs/user-docs/src/sdk/rust.md index 83f256c..6accaaa 100644 --- a/docs/user-docs/src/sdk/rust.md +++ b/docs/user-docs/src/sdk/rust.md @@ -1,5 +1,190 @@ # Rust SDK -> **Draft** — This page is a placeholder for future content. +The Rust SDK is the `pdftract-core` crate. It provides native PDF text extraction with zero-copy memory mapping and streaming support. -Using pdftract from Rust. 
+## Installation + +Add to your `Cargo.toml`: + +```toml +[dependencies] +pdftract-core = "1.0" +``` + +For OCR support, enable the `ocr` feature: + +```toml +[dependencies] +pdftract-core = { version = "1.0", features = ["ocr"] } +``` + +## Basic Extraction + +```rust +use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions}; + +fn main() -> anyhow::Result<()> { + let opts = ExtractionOptions::default(); + let output = OutputOptions::default(); + + let result = extract_pdf("document.pdf", &opts, &output)?; + + for (i, page) in result.pages.iter().enumerate() { + println!("Page {}: {} chars", i + 1, page.text.len()); + for span in &page.spans { + println!(" {}", span.text); + } + } + Ok(()) +} +``` + +## Streaming Extraction + +For large PDFs, stream pages one at a time to keep memory usage bounded: + +```rust +use pdftract_core::{extract_pdf_streaming, ExtractionOptions, OutputOptions}; +use std::fs::File; + +fn main() -> anyhow::Result<()> { + let mut output = File::create("output.ndjson")?; + extract_pdf_streaming( + "large_document.pdf", + &ExtractionOptions::default(), + &OutputOptions::default(), + &mut output, + )?; + Ok(()) +} +``` + +## Options + +### ExtractionOptions + +| Field | Type | Default | Use Case | +|-------|------|---------|----------| +| `receipts` | `ReceiptsMode` | `Off` | Generate cryptographic receipts | +| `max_parallel_pages` | `usize` | `4` | Control memory for concurrent page processing | +| `memory_budget_mb` | `usize` | `512` | Target peak RSS in MB | +| `full_render` | `bool` | `false` | Enable PDFium rendering (requires `full-render` feature) | +| `ocr_dpi_override` | `Option` | `None` | Override automatic DPI selection | +| `ocr_language` | `Vec` | `vec!["eng"]` | Tesseract language codes | +| `markdown_anchors` | `bool` | `false` | Emit HTML comment anchors in Markdown | +| `max_decompress_bytes` | `u64` | `512 MiB` | Bomb limit for decompressed streams | +| `output` | `OutputOptions` | `default()` | Output filtering 
options | +| `pages` | `Option` | `None` | Page range (e.g., `"1-5,7,12-"`) | +| `password` | `Option` | `None` | PDF password for encrypted documents | + +### OutputOptions + +| Field | Type | Default | Use Case | +|-------|------|---------|----------| +| `include_invisible` | `bool` | `false` | Include invisible text in output | +| `extract_forms` | `bool` | `true` | Extract AcroForm fields | +| `extract_attachments` | `bool` | `true` | Extract embedded attachments | + +## Receipts + +Generate cryptographic receipts for verification: + +```rust +use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions}; +use pdftract_core::options::ReceiptsMode; + +fn main() -> anyhow::Result<()> { + let opts = ExtractionOptions { + receipts: ReceiptsMode::Lite, + ..Default::default() + }; + let output = OutputOptions::default(); + let result = extract_pdf("document.pdf", &opts, &output)?; + + // Receipts are embedded in page metadata + if let Some(receipt) = &result.pages[0].receipt { + println!("Receipt: {}", receipt); + } + Ok(()) +} +``` + +## Remote PDFs + +With the `remote` feature, fetch PDFs via HTTP: + +```rust +use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions}; + +fn main() -> anyhow::Result<()> { + let opts = ExtractionOptions::default(); + let output = OutputOptions::default(); + let result = extract_pdf("https://example.com/document.pdf", &opts, &output)?; + Ok(()) +} +``` + +## Error Handling + +Most functions return `anyhow::Result` which wraps various error types: + +```rust +use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions}; + +fn main() { + let opts = ExtractionOptions::default(); + let output = OutputOptions::default(); + + match extract_pdf("document.pdf", &opts, &output) { + Ok(result) => { + println!("Extracted {} pages", result.pages.len()); + } + Err(e) => { + eprintln!("Extraction failed: {}", e); + // Inspect error chain + for cause in e.chain() { + eprintln!(" caused by: {}", cause); + } + } + } +} +``` + +## 
Feature Flags + +| Feature | Adds | Default | +|---------|------|---------| +| `serde` | JSON serialization support | ✓ | +| `decrypt` | Decryption of encrypted PDFs | ✓ | +| `quick-xml` | Conformance detection via XML metadata | ✓ | +| `ocr` | Tesseract OCR for scanned documents | - | +| `full-render` | PDFium-based rendering (requires `ocr`) | - | +| `remote` | HTTP range fetching for remote PDFs | - | +| `profiles` | Extraction profiles | - | +| `receipts` | Cryptographic receipt generation | - | +| `cjk` | CJK text extraction via predefined CMap registry | - | +| `schemars` | JSON Schema generation | - | + +## Source Types + +The SDK supports multiple source types via the `PdfSource` trait: + +```rust +use pdftract_core::source::{FileSource, MmapSource, MemorySource}; + +// Memory-mapped source (zero-copy for large files) +let source = MmapSource::open("document.pdf")?; + +// In-memory source (for byte buffers) +let data = std::fs::read("document.pdf")?; +let source = MemorySource::new(data); + +// Standard file source +let source = FileSource::open("document.pdf")?; +``` + +## See Also + +- [JSON Schema Reference](../json-schema-reference.md) +- [CLI Reference](../cli/README.md) +- [Advanced: OCR Configuration](../advanced/ocr.md) diff --git a/docs/user-docs/src/troubleshooting.md b/docs/user-docs/src/troubleshooting.md new file mode 100644 index 0000000..9837c86 --- /dev/null +++ b/docs/user-docs/src/troubleshooting.md @@ -0,0 +1,303 @@ +# Troubleshooting + +This guide maps common pdftract failures to their causes and fixes. Each error is associated with a **diagnostic code** that appears in extraction output (see `diagnostics` in the JSON response or CLI stderr). + +> **For the authoritative diagnostic code catalog**, see the [Diagnostics Reference](./troubleshooting/diagnostics.md). 
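As a rough illustration of working with diagnostic codes, the sketch below groups the entries of a `diagnostics` array by severity. The field names (`code`, `severity`, `page_index`) and the sample payload are assumptions made for this example, not the documented schema; the JSON Schema Reference is the place to confirm the real shape.

```python
import json
from collections import defaultdict

# Hypothetical sample payload; the real output schema may differ.
SAMPLE_OUTPUT = """
{
  "diagnostics": [
    {"code": "XREF_REPAIRED", "severity": "info", "page_index": null},
    {"code": "GLYPH_UNMAPPED", "severity": "warn", "page_index": 3},
    {"code": "STREAM_BOMB", "severity": "error", "page_index": 7}
  ]
}
"""


def group_by_severity(raw_json: str) -> dict:
    """Return a mapping of severity -> list of diagnostic codes."""
    payload = json.loads(raw_json)
    grouped = defaultdict(list)
    for entry in payload.get("diagnostics", []):
        # Entries without a severity are collected under "unknown".
        grouped[entry.get("severity", "unknown")].append(entry["code"])
    return dict(grouped)


if __name__ == "__main__":
    for severity, codes in group_by_severity(SAMPLE_OUTPUT).items():
        print(f"{severity}: {', '.join(codes)}")
```

Grouping by severity first makes it easier to decide which sections of this guide to read: errors and fatals usually need action, while info entries rarely do.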
+ +## Symptom → Diagnostic Lookup + +| Symptom | Likely Diagnostic Code | +|---------|----------------------| +| PDF won't open, "encrypted" error | `ENCRYPTION_UNSUPPORTED` | +| Text extraction incomplete or missing | `XREF_REPAIRED`, `OCR_*_UNSUPPORTED` | +| Process hangs or runs very long | `STREAM_BOMB` | +| "Path outside root" (MCP mode) | `MCP_PATH_TRAVERSAL` | +| Cache errors / corrupted entries | `CACHE_ENTRY_CORRUPT`, `CACHE_INTEGRITY_FAIL` | +| Profile fails to load | `PROFILE_INVALID`, `PROFILE_SECRETS_FORBIDDEN` | +| Remote URL fetch blocked | `URL_PRIVATE_NETWORK` | +| Requested page doesn't exist | `PAGE_OUT_OF_RANGE` | +| Text contains placeholder characters (⍰) | `GLYPH_UNMAPPED` | +| Broken vector graphics not recovered | `BROKENVECTOR_OCR_UNAVAILABLE` | +| JavaScript warning in output | `JAVASCRIPT_PRESENT` | +| Circular reference warnings | `STRUCT_CIRCULAR_REF`, `STRUCT_XOBJECT_CYCLE` | +| Stack overflow warnings | `GSTATE_STACK_OVERFLOW` | + +--- + +## XREF_REPAIRED warning + +**What it means**: pdftract found the PDF's cross-reference table was corrupt and ran the forward-scan fallback (Phase 1.3) to recover. + +**Cause**: PDF created or transmitted with truncation or corruption. The `startxref` offset points outside the file, or the xref table is malformed. + +**Fix**: Usually no action needed; extraction succeeds with the recovered xref. Output may be incomplete on truncated files. If extraction fails, the PDF is unsalvageable. + +**Severity**: info (extraction continues) + +--- + +## STREAM_BOMB error + +**What it means**: A compressed stream exceeded the decompression size limit (default: 512 MB). + +**Cause**: A hostile PDF with a "compression bomb" — a small stream that expands to multi-GB size (e.g., 10 KB → 2 GB). This is a common security exploit pattern. 
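The idea behind a decompression limit can be sketched with Python's standard `zlib` module. This illustrates the general technique only and is not pdftract's implementation; the function name `inflate_bounded`, the exception type, and the limits used in the demo are assumptions.

```python
import zlib


class StreamBombError(Exception):
    """Raised when a stream inflates past the configured limit (hypothetical name)."""


def inflate_bounded(data: bytes, max_bytes: int) -> bytes:
    """Inflate zlib data, refusing to produce more than max_bytes of output."""
    decomp = zlib.decompressobj()
    # Ask for at most one byte beyond the limit, so an overflow is detectable
    # without ever materialising the fully expanded stream.
    out = decomp.decompress(data, max_bytes + 1)
    if len(out) > max_bytes:
        raise StreamBombError(f"stream exceeds {max_bytes} bytes when inflated")
    return out


if __name__ == "__main__":
    bomb = zlib.compress(b"\0" * 10_000_000)  # highly compressible input
    print(f"compressed size: {len(bomb)} bytes")
    try:
        inflate_bounded(bomb, max_bytes=1_000_000)
    except StreamBombError as exc:
        print(f"rejected: {exc}")
```

Capping the output size rather than the input size is the key design choice: the compressed stream of a bomb is small by construction, so only the inflated size reveals it.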
+ +**Fix**: +- If the PDF is **trusted**: Increase the limit with `--max-decompress-gb 2` (or higher) +- If the PDF is **untrusted**: Treat as a hostile file; do not process + +**Severity**: error (stream aborted; partial extraction returned) + +--- + +## ENCRYPTION_UNSUPPORTED fatal + +**What it means**: The PDF is encrypted with an unsupported handler, or it is password-protected and no valid password was supplied. + +**Cause**: +- PDF encrypted with an unknown handler (e.g., Adobe LiveCycle policy server) +- PDF password-protected but no password (or wrong password) supplied + +**Fix**: +```bash +# Supply password via environment variable +export PDFTRACT_PASSWORD="your-password" +pdftract extract document.pdf + +# Or via stdin +echo "your-password" | pdftract extract --password-stdin document.pdf +``` + +If the handler is unsupported (e.g., Adobe LiveCycle), decrypt the file with an Adobe-side tool first. If the password is lost, a dedicated password recovery tool such as `pdfcrack` or `john` may help. + +**Severity**: fatal (process exits with code 3) + +--- + +## OCR_JBIG2_UNSUPPORTED / OCR_JPX_UNSUPPORTED / OCR_CCITT_UNSUPPORTED warning + +**What it means**: A page contains an image that requires a decoder not available in the current build. + +**Cause**: +- `OCR_JBIG2_UNSUPPORTED`: JBIG2-encoded image (rare) +- `OCR_JPX_UNSUPPORTED`: JPEG 2000-encoded image +- `OCR_CCITT_UNSUPPORTED`: CCITT fax-encoded image + +**Fix**: +```bash +# Build with full-render feature (enables all decoders via PDFium) +cargo build --release --features full-render + +# Or install system libraries: +# - JPX: install libopenjp2 +# - CCITT: install libtiff +``` + +**Severity**: warn (page skipped from OCR; extraction continues) + +--- + +## BROKENVECTOR_OCR_UNAVAILABLE warning + +**What it means**: A page contains broken vector graphics that could be recovered via OCR, but the OCR feature is disabled. + +**Cause**: Build was compiled without the `ocr` feature. 
+ +**Fix**: Rebuild with OCR enabled: +```bash +cargo build --release --features ocr +``` + +**Severity**: warn (broken vector graphics not recovered; extraction continues) + +--- + +## MCP_PATH_TRAVERSAL / PATH_OUTSIDE_ROOT error + +**What it means**: (MCP mode) The requested path escapes the `--root` directory boundary. + +**Cause**: A tool call attempted path traversal (e.g., `../../etc/passwd`). + +**Fix**: +- Adjust the requested path to stay within `--root` +- Or restart the MCP server without `--root` restriction (not recommended for multi-tenant deployments) + +**Severity**: error (request rejected) + +--- + +## URL_PRIVATE_NETWORK error + +**What it means**: Remote fetch blocked because the URL targets a private network address. + +**Cause**: URL targets localhost, private IP ranges (RFC 1918), or link-local addresses. This is an SSRF (Server-Side Request Forgery) protection. + +**Fix**: +```bash +# If you trust the URL, allow private networks: +pdftract extract --allow-private-networks https://internal-server/docs.pdf +``` + +**Severity**: error (request rejected with HTTP 400 in serve mode) + +--- + +## CACHE_ENTRY_CORRUPT warning + +**What it means**: A cache entry failed integrity verification. + +**Cause**: Cache file corruption (disk error, concurrent write, etc.). + +**Fix**: None needed — the entry is automatically deleted and extraction re-runs. If this recurs frequently, check your disk filesystem. + +**Severity**: warn (entry deleted; extraction re-runs) + +--- + +## CACHE_INTEGRITY_FAIL diagnostic + +**What it means**: A cache entry's HMAC verification failed, indicating potential cache poisoning. + +**Cause**: Malicious co-tenant wrote a forged cache entry (multi-user cache scenarios), or disk corruption. + +**Fix**: The entry is treated as a cache miss and extraction re-runs. In multi-user environments, ensure per-user cache directories or verify cache permissions. 
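Integrity checks of this kind are commonly built on an HMAC over the entry payload. The sketch below shows that general pattern with Python's standard library; the key handling, the tag-then-payload layout, and the function names are assumptions for illustration and do not describe pdftract's actual cache format.

```python
import hashlib
import hmac
from typing import Optional


def seal_entry(key: bytes, payload: bytes) -> bytes:
    """Prefix the payload with its HMAC-SHA256 tag (hypothetical entry layout)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def open_entry(key: bytes, entry: bytes) -> Optional[bytes]:
    """Return the payload if the tag verifies, else None (treated as a cache miss)."""
    tag_len = hashlib.sha256().digest_size
    if len(entry) < tag_len:
        return None
    tag, payload = entry[:tag_len], entry[tag_len:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest runs in constant time, so it does not leak
    # how many leading bytes of the tag matched.
    if not hmac.compare_digest(tag, expected):
        return None
    return payload
```

An entry written by a party without the key cannot carry a valid tag, which is why a failed check can safely be handled as a miss followed by re-extraction.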
+ +**Severity**: warn (entry rejected; extraction re-runs) + +--- + +## PROFILE_INVALID / PROFILE_SECRETS_FORBIDDEN error + +**What it means**: Profile YAML failed validation. + +**Cause**: +- `PROFILE_INVALID`: YAML syntax error or schema violation +- `PROFILE_SECRETS_FORBIDDEN`: Profile contains secret-keyword keys (`password:`, `token:`, `secret:`, `api_key:`) + +**Fix**: +```bash +# For schema errors, check the YAML syntax: +pdftract profile show --profile-path your-profile.yaml + +# For secrets errors, remove secret keys from the profile. +# Secrets should be passed via environment variables, not profiles. +``` + +**Severity**: error (profile rejected) + +--- + +## PAGE_OUT_OF_RANGE warning + +**What it means**: The `--pages` argument exceeds the document's actual page count. + +**Cause**: Page range specified (e.g., `--pages 1-100`) on a document with fewer pages (e.g., 10 pages). + +**Fix**: Adjust the `--pages` argument to the actual page count: +```bash +# First, get the page count: +pdftract inspect document.json | jq '.page_count' + +# Then extract with a valid range: +pdftract extract --pages 1-10 document.pdf +``` + +**Severity**: warn (pages clamped to available range) + +--- + +## GLYPH_UNMAPPED warning + +**What it means**: A glyph could not be resolved by any of the four encoding levels. + +**Cause**: Font encoding corruption, missing font embedding, or non-standard encoding. + +**Fix**: Output contains the Unicode replacement character (⍰). No direct fix; consider re-saving the PDF through a normalizing tool (e.g., Adobe Acrobat, qpdf). + +**Severity**: warn (character replaced with U+FFFD; extraction continues) + +--- + +## JAVASCRIPT_PRESENT info + +**What it means**: PDF contains embedded JavaScript (in `/AA`, `/OpenAction`, or `/JS` entries). + +**Cause**: PDF includes JavaScript actions (common in forms, interactive documents). + +**Fix**: None needed for extraction — pdftract NEVER executes embedded JavaScript. 
JavaScript actions are surfaced in `metadata.javascript_actions[]` for downstream review. + +**Severity**: info (JavaScript is not executed) + +--- + +## STRUCT_CIRCULAR_REF / STRUCT_XOBJECT_CYCLE / GSTATE_STACK_OVERFLOW warning + +**What it means**: PDF contains circular references or malformed content streams. + +**Cause**: +- `STRUCT_CIRCULAR_REF`: Indirect object reference cycle +- `STRUCT_XOBJECT_CYCLE`: XObject (image/form) reference cycle +- `GSTATE_STACK_OVERFLOW`: Graphics state stack exceeds depth limit + +**Fix**: Usually no action needed; pdftract breaks cycles at the second visit (or depth 20 for XObjects). If output is incomplete, investigate the source PDF for a producer bug. + +**Severity**: warn (cycle broken; extraction continues) + +--- + +## REMOTE_FETCH_INTERRUPTED error + +**What it means**: Remote fetch was interrupted (network timeout, connection reset, etc.). + +**Cause**: Network connectivity issues, server timeout, or premature connection close. + +**Fix**: Retry the request; check network connectivity: +```bash +# Retry with increased timeout: +pdftract extract --timeout-seconds 120 https://example.com/document.pdf +``` + +**Severity**: error (request aborted) + +--- + +## REMOTE_NO_RANGE_SUPPORT warning + +**What it means**: Remote server does not support HTTP Range requests. + +**Cause**: Server omits the `Accept-Ranges: bytes` header, or answers a Range request with `200 OK` and the full body instead of `206 Partial Content`. + +**Fix**: None needed; pdftract falls back to whole-file download. For large files, consider hosting on a Range-supporting server. + +**Severity**: warn (fallback to whole-file download) + +--- + +## TAGGED_PDF_STRUCT_TREE_DEFERRED info + +**What it means**: Tagged PDF structure tree extraction is deferred in this version. + +**Cause**: Phase 7.1 (full structure tree extraction) is not yet implemented. + +**Fix**: None needed; this is a temporary fallback. Structure tree extraction will be added in v1.0.0. 
+ +**Severity**: info (structure tree not extracted) + +--- + +## Getting Help + +If you encounter a diagnostic code not listed here, or the suggested fix doesn't resolve your issue: + +1. **Check the [Diagnostics Reference](./troubleshooting/diagnostics.md)** for the full catalog +2. **Search existing issues** on [GitHub](https://github.com/jedarden/pdftract/issues) +3. **Open a new issue** with: + - The diagnostic code(s) + - A minimal reproducible example (PDF or command) + - The `--debug` output if safe to share + +## Related Documentation + +- [Diagnostics Reference](./troubleshooting/diagnostics.md) — Full diagnostic code catalog +- [FAQ](./faq.md) — Common questions and answers +- [Advanced: OCR Configuration](./advanced/ocr.md) — OCR troubleshooting details