Python SDK

The Python SDK (pdftract) provides native Python bindings with idiomatic ergonomics including an exception hierarchy, dataclass types, and optional asyncio wrappers.

Installation

pip install pdftract

The package includes a precompiled native module for your platform. If the native module fails to import, the SDK automatically falls back to a subprocess-based implementation, which is significantly slower.
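To find out ahead of time whether a compiled module imports cleanly in a given environment, a generic check along these lines may help. This is a minimal sketch: the name of pdftract's native module is not documented here, so the hypothetical helper `can_import` takes the module name as a parameter and is demonstrated with standard-library names only.

```python
import importlib


def can_import(name: str) -> bool:
    """Return True if importing `name` succeeds (hypothetical helper)."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Covers missing modules as well as extension modules that fail to load
        return False
    return True


# Demonstrated with a module that exists and one that does not
print(can_import("json"))
print(can_import("no_such_module_zzz_12345"))
```

Running a check like this at startup makes it easier to notice a slow fallback before it shows up as a performance problem.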

Basic Extraction

import pdftract

doc = pdftract.extract("document.pdf")
print(f"Extracted {len(doc.pages)} pages")

for page in doc.pages:
    for span in page.spans:
        print(span.text)

Text-Only Extraction

For RAG pipelines that just need the text body:

import pdftract

text = pdftract.extract_text("document.pdf")
print(text)
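RAG pipelines usually split the extracted text into overlapping chunks before embedding. The sketch below shows one simple way to do that with fixed-size character windows. `chunk_text` and its defaults are illustrative and not part of pdftract; the input is any string, such as the one returned by `extract_text`.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            # The current window already reaches the end of the text
            break
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```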

Streaming

For large PDFs, stream pages one at a time to keep memory usage bounded:

import pdftract

for page in pdftract.extract_stream("large_document.pdf"):
    print(f"Page {page.page_index}: {len(page.spans)} spans")
    # Process page while only one page is resident in memory
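The memory benefit only holds if downstream code also consumes pages lazily. The sketch below illustrates the pattern with a generator that aggregates per-page statistics without collecting the pages into a list. `StubPage` and `stub_stream` are stand-ins invented for this example so it runs without a PDF; they mirror only the `page_index` and `spans` attributes used above.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class StubPage:
    """Stand-in for a streamed page (hypothetical, for illustration only)."""
    page_index: int
    spans: list = field(default_factory=list)


def span_counts(pages: Iterable[StubPage]) -> Iterator[tuple[int, int]]:
    """Yield (page_index, span_count) lazily, one page at a time."""
    for page in pages:
        yield page.page_index, len(page.spans)


def stub_stream(n: int) -> Iterator[StubPage]:
    """Simulate a page stream: page i carries i + 1 spans."""
    for i in range(n):
        yield StubPage(i, ["span"] * (i + 1))


# Only one page object is alive at a time while summing
total = sum(count for _, count in span_counts(stub_stream(3)))
print(total)  # 6
```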

Markdown Extraction

Extract Markdown with optional anchor links for mapping back to PDF locations:

import pdftract

# Basic Markdown
markdown = pdftract.extract_markdown("document.pdf")

# With anchor links (HTML comments)
markdown = pdftract.extract_markdown("document.pdf", anchors=True)
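Because the anchors are emitted as HTML comments, they can be removed again when a clean copy of the Markdown is needed, for example before showing it to end users. The following is a minimal sketch using the standard `re` module. The exact anchor format is not specified here, so the comment in `sample` is a made-up placeholder, and the function strips every HTML comment, not only anchors.

```python
import re

# Non-greedy match so that each comment is removed separately
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(markdown: str) -> str:
    """Remove all HTML comments from a Markdown string."""
    return HTML_COMMENT.sub("", markdown)


# Placeholder comment; the real anchor syntax may look different
sample = "# Title <!-- anchor: hypothetical -->\n\nBody text.\n"
print(strip_html_comments(sample))
```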

Options

Pass extraction options as keyword arguments:

import pdftract

doc = pdftract.extract(
    "document.pdf",
    pages="1-5,7",           # Page range
    password="secret123",    # PDF password
    receipts="lite"          # Receipt generation mode
)

Available Options

| Option | Type | Default | Use Case |
|---|---|---|---|
| pages | str \| None | None | Page range (e.g., "1-5,7,12-") |
| password | str \| None | None | PDF password for encrypted documents |
| receipts | str \| None | None | Receipt mode: "off", "lite", or "full" |
| ocr | bool | False | Enable OCR for scanned documents |
| ocr_language | list[str] | ["eng"] | OCR language codes |
| include_invisible | bool | False | Include invisible text in output |
| extract_forms | bool | True | Extract AcroForm fields |
| extract_attachments | bool | True | Extract embedded attachments |
| readability_threshold | float | 0.0 | Minimum readability score |
| max_decompress_gb | int | 512 | Max decompressed GB per stream |
| full_render | bool | False | Enable full rendering |
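The `pages` syntax combines single pages, closed ranges, and an open-ended range such as "12-". The sketch below shows one plausible reading of that syntax. It assumes 1-based, inclusive page numbers and that an open-ended range runs to the last page; pdftract's own parser may differ, and `expand_page_range` is a hypothetical helper, not part of the SDK.

```python
def expand_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec like "1-5,7,12-" into page numbers (assumed 1-based, inclusive)."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # A missing bound defaults to the first or last page
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else page_count
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.extend(range(start, end + 1))
    return pages


print(expand_page_range("1-5,7,12-", page_count=14))
# [1, 2, 3, 4, 5, 7, 12, 13, 14]
```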

Error Handling

The SDK provides a structured exception hierarchy:

import pdftract

try:
    doc = pdftract.extract("encrypted.pdf", password="wrong")
except pdftract.EncryptionError as e:
    print(f"Encryption error: {e.code} - {e.hint}")
except pdftract.CorruptPdfError as e:
    print(f"Corrupt PDF: {e}")
except pdftract.SourceUnreachableError as e:
    print(f"Source unreachable: {e}")
except pdftract.PdftractError as e:
    print(f"Extraction failed: {e}")

Exception Hierarchy

All exceptions inherit from PdftractError:

  • PdftractError — Base exception for all extraction errors
  • EncryptionError — PDF encryption/password errors
  • CorruptPdfError — Malformed or corrupted PDF
  • SourceUnreachableError — File or URL unreachable
  • RemoteFetchInterruptedError — Network interruption during fetch
  • TlsError — TLS/certificate errors
  • ReceiptVerifyError — Receipt verification failed
  • UnsupportedOperationError — Requested operation not available

Exception Attributes

All exceptions have the following attributes:

  • code — Diagnostic code (e.g., "ENCRYPTION_WRONG_PASSWORD")
  • page_index — Page number where error occurred (if applicable)
  • hint — Suggested action for resolution
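These attributes lend themselves to structured logging. The sketch below turns an exception into a plain dictionary, reading `code`, `page_index`, and `hint` defensively so that it also works for exceptions that lack them. `StubError` is a stand-in defined only so the example runs without pdftract, and `error_record` is a hypothetical helper.

```python
class StubError(Exception):
    """Stand-in exception carrying the three attributes listed above."""

    def __init__(self, message, code=None, page_index=None, hint=None):
        super().__init__(message)
        self.code = code
        self.page_index = page_index
        self.hint = hint


def error_record(exc: Exception) -> dict:
    """Build a log-friendly dict; attributes that are absent become None."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "code": getattr(exc, "code", None),
        "page_index": getattr(exc, "page_index", None),
        "hint": getattr(exc, "hint", None),
    }


record = error_record(
    StubError("bad password", code="ENCRYPTION_WRONG_PASSWORD", hint="Check the password")
)
print(record)
```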

Metadata

Get document metadata without full extraction:

import pdftract

metadata = pdftract.get_metadata("document.pdf")
print(f"Pages: {metadata.page_count}")
print(f"Title: {metadata.title}")
print(f"Author: {metadata.author}")
print(f"Fingerprint: {metadata.fingerprint}")

Search

Search for a regex pattern in the PDF:

import pdftract

for match in pdftract.search("document.pdf", r"\b\d{3}-\d{2}-\d{4}\b"):
    print(f"Found SSN at page {match.page_index}: {match.text}")
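Before running a pattern against a whole document, it can be worth checking it against a small string. The sketch below tests the same pattern with the standard `re` module on made-up sample text. It assumes the search uses regex syntax compatible with Python's `re`, which is not confirmed here.

```python
import re

# Same pattern as above: three digits, two digits, four digits
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

sample = "ID 123-45-6789 on file; phone 555-123-4567 is not a match."
matches = SSN_PATTERN.findall(sample)
print(matches)  # ['123-45-6789']
```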

Fingerprint

Compute the structural fingerprint of a PDF:

import pdftract

fingerprint = pdftract.hash("document.pdf")
print(f"Fingerprint: {fingerprint.value}")

Classify

Classify a PDF page type:

import pdftract

classification = pdftract.classify("document.pdf")
print(f"Type: {classification.class_name}")
print(f"Confidence: {classification.confidence}")

Verify Receipt

Verify a cryptographic receipt:

import pdftract

# Extract with receipts enabled
doc = pdftract.extract("document.pdf", receipts="lite")
receipt = doc.pages[0].receipt

# Verify later
verified = pdftract.verify_receipt("document.pdf", receipt)
print(f"Verified: {verified}")

Remote PDFs

Extract from HTTP/HTTPS URLs:

import pdftract

doc = pdftract.extract("https://example.com/document.pdf")
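Remote fetches can fail transiently, so callers may want to retry. The following is a minimal sketch of retrying with exponential backoff. `retry` is a hypothetical helper, not part of pdftract; it takes the operation and the exception type as parameters, and the demonstration uses a stub exception and a fake flaky function so it runs without network access.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    retry_on: Type[Exception],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on `retry_on` with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


class StubInterrupted(Exception):
    """Stand-in for a transient network error."""


calls = []


def flaky() -> str:
    """Fail twice, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise StubInterrupted("connection dropped")
    return "ok"


delays = []
result = retry(flaky, StubInterrupted, attempts=3, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

With the SDK, `operation` could be a lambda wrapping `pdftract.extract(url)` and `retry_on` could be `pdftract.RemoteFetchInterruptedError`.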

MCP Integration

For AI-assisted PDF extraction, pdftract provides an MCP (Model Context Protocol) server. The Python SDK can be used alongside MCP clients such as Claude Desktop. Start the server with:

pdftract mcp --stdio

See the MCP Client Configuration Guide for setup instructions.

Types

The SDK provides typed wrappers for all output structures:

import pdftract
from pdftract.types import Document, Page, Span, Block, Metadata

# All extraction functions return typed objects
doc: Document = pdftract.extract("document.pdf")
page: Page = doc.pages[0]
span: Span = page.spans[0]
block: Block = page.blocks[0]
metadata: Metadata = pdftract.get_metadata("document.pdf")

Async API

For asyncio-based applications, use the async API:

import asyncio

import pdftract.asyncio as pdftract_async

async def extract_async():
    doc = await pdftract_async.extract("document.pdf")
    print(f"Extracted {len(doc.pages)} pages")

# Run the coroutine from synchronous code
asyncio.run(extract_async())
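When extracting many documents concurrently, it is usually wise to cap how many extractions run at once. The sketch below does this with `asyncio.Semaphore` and `asyncio.gather`. `gather_bounded` is a hypothetical helper, and `stub_extract` stands in for an async extraction call so the example runs without pdftract.

```python
import asyncio


async def gather_bounded(func, items, limit: int = 4) -> list:
    """Run `func(item)` for every item, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await func(item)

    # gather preserves the order of the inputs in its result list
    return await asyncio.gather(*(run_one(item) for item in items))


async def stub_extract(path: str) -> str:
    """Stand-in for an async extraction call."""
    await asyncio.sleep(0)
    return f"extracted:{path}"


results = asyncio.run(gather_bounded(stub_extract, ["a.pdf", "b.pdf", "c.pdf"], limit=2))
print(results)  # ['extracted:a.pdf', 'extracted:b.pdf', 'extracted:c.pdf']
```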

See Also