# Python SDK

The Python SDK (`pdftract`) provides native Python bindings with idiomatic ergonomics, including an exception hierarchy, dataclass types, and optional asyncio wrappers.

## Installation

```bash
pip install pdftract
```

The package includes a precompiled native module for your platform. If the native module fails to import, the SDK automatically falls back to a subprocess-based implementation, which is significantly slower.

## Basic Extraction

```python
import pdftract

doc = pdftract.extract("document.pdf")
print(f"Extracted {len(doc.pages)} pages")

for page in doc.pages:
    for span in page.spans:
        print(span.text)
```

## Text-Only Extraction

For RAG pipelines that only need the text body:

```python
import pdftract

text = pdftract.extract_text("document.pdf")
print(text)
```

## Streaming

For large PDFs, stream pages one at a time to keep memory usage bounded:

```python
import pdftract

for page in pdftract.extract_stream("large_document.pdf"):
    print(f"Page {page.page_index}: {len(page.spans)} spans")
    # Process page while only one page is resident in memory
```

## Markdown Extraction

Extract Markdown with optional anchor links for mapping back to PDF locations:

```python
import pdftract

# Basic Markdown
markdown = pdftract.extract_markdown("document.pdf")

# With anchor links (HTML comments)
markdown = pdftract.extract_markdown("document.pdf", anchors=True)
```

## Options

Pass extraction options as keyword arguments:

```python
import pdftract

doc = pdftract.extract(
    "document.pdf",
    pages="1-5,7",          # Page range
    password="secret123",   # PDF password
    receipts="lite",        # Receipt generation mode
)
```

### Available Options

| Option | Type | Default | Use Case |
|--------|------|---------|----------|
| `pages` | `str \| None` | `None` | Page range (e.g., `"1-5,7,12-"`) |
| `password` | `str \| None` | `None` | PDF password for encrypted documents |
| `receipts` | `str \| None` | `None` | Receipt mode: `"off"`, `"lite"`, or `"full"` |
| `ocr` | `bool` | `False` | Enable OCR for scanned documents |
| `ocr_language` | `list[str]` | `["eng"]` | OCR language codes |
| `include_invisible` | `bool` | `False` | Include invisible text in output |
| `extract_forms` | `bool` | `True` | Extract AcroForm fields |
| `extract_attachments` | `bool` | `True` | Extract embedded attachments |
| `readability_threshold` | `float` | `0.0` | Minimum readability score |
| `max_decompress_gb` | `int` | `512` | Max decompressed GB per stream |
| `full_render` | `bool` | `False` | Enable full rendering |

## Error Handling

The SDK provides a structured exception hierarchy. Catch the specific exceptions first and the `PdftractError` base class last:

```python
import pdftract

try:
    doc = pdftract.extract("encrypted.pdf", password="wrong")
except pdftract.EncryptionError as e:
    print(f"Encryption error: {e.code} - {e.hint}")
except pdftract.CorruptPdfError as e:
    print(f"Corrupt PDF: {e}")
except pdftract.SourceUnreachableError as e:
    print(f"Source unreachable: {e}")
except pdftract.PdftractError as e:
    print(f"Extraction failed: {e}")
```

### Exception Hierarchy

All exceptions inherit from `PdftractError`:

- `PdftractError`: Base exception for all extraction errors
- `EncryptionError`: PDF encryption/password errors
- `CorruptPdfError`: Malformed or corrupted PDF
- `SourceUnreachableError`: File or URL unreachable
- `RemoteFetchInterruptedError`: Network interruption during fetch
- `TlsError`: TLS/certificate errors
- `ReceiptVerifyError`: Receipt verification failed
- `UnsupportedOperationError`: Requested operation not available

### Exception Attributes

All exceptions have the following attributes:

- `code`: Diagnostic code (e.g., `"ENCRYPTION_WRONG_PASSWORD"`)
- `page_index`: Index of the page where the error occurred (if applicable)
- `hint`: Suggested action for resolution

## Metadata

Get document metadata without full extraction:

```python
import pdftract

metadata = pdftract.get_metadata("document.pdf")
print(f"Pages: {metadata.page_count}")
print(f"Title: {metadata.title}")
print(f"Author: {metadata.author}")
print(f"Fingerprint: {metadata.fingerprint}")
```

## Search

Search for a regex pattern in the PDF:

```python
import pdftract

for match in pdftract.search("document.pdf", r"\b\d{3}-\d{2}-\d{4}\b"):
    print(f"Found SSN at page {match.page_index}: {match.text}")
```

## Fingerprint

Compute the structural fingerprint of a PDF:

```python
import pdftract

fingerprint = pdftract.hash("document.pdf")
print(f"Fingerprint: {fingerprint.value}")
```

## Classify

Classify the page type of a PDF:

```python
import pdftract

classification = pdftract.classify("document.pdf")
print(f"Type: {classification.class_name}")
print(f"Confidence: {classification.confidence}")
```

## Verify Receipt

Verify a cryptographic receipt:

```python
import pdftract

# Extract with receipts enabled
doc = pdftract.extract("document.pdf", receipts="lite")
receipt = doc.pages[0].receipt

# Verify later
verified = pdftract.verify_receipt("document.pdf", receipt)
print(f"Verified: {verified}")
```

## Remote PDFs

Extract from HTTP/HTTPS URLs:

```python
import pdftract

doc = pdftract.extract("https://example.com/document.pdf")
```

## MCP Integration

For AI-assisted PDF extraction, pdftract provides an [MCP (Model Context Protocol) server](../integrations/mcp-clients.md). The Python SDK can be used alongside MCP clients like Claude Desktop:

```bash
pdftract mcp --stdio
```

See [MCP Client Configuration Guide](../integrations/mcp-clients.md) for setup instructions.
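
## Retrying Remote Fetches

Extraction from a URL can fail partway through with a network interruption, which the exception hierarchy above reports as `RemoteFetchInterruptedError`. The following is a minimal sketch of one way to retry such failures with exponential backoff. The `with_retries` helper, its parameters, and the `flaky_fetch` demo are hypothetical names introduced here for illustration and are not part of the SDK. The exception class is a local stand-in so the block runs without `pdftract` installed.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class RemoteFetchInterruptedError(Exception):
    """Local stand-in for the SDK exception of the same name (assumption)."""


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (RemoteFetchInterruptedError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            # Re-raise on the final attempt so the caller sees the real error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Demo: a fake fetch that is interrupted twice before succeeding.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RemoteFetchInterruptedError("connection dropped")
    return "document"


delays = []  # record the requested delays instead of actually sleeping
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, calls["n"], delays)
```

With the SDK installed, the same helper could presumably wrap a real call, for example `with_retries(lambda: pdftract.extract(url), retry_on=(pdftract.RemoteFetchInterruptedError,))`. Whether retrying is appropriate for other exceptions, such as `TlsError`, depends on the failure and is left to the caller.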

## Types

The SDK provides typed wrappers for all output structures:

```python
import pdftract
from pdftract.types import Document, Page, Span, Block, Metadata

# All extraction functions return typed objects
doc: Document = pdftract.extract("document.pdf")
page: Page = doc.pages[0]
span: Span = page.spans[0]
block: Block = page.blocks[0]
metadata: Metadata = pdftract.get_metadata("document.pdf")
```

## Async API

For asyncio-based applications, use the async API:

```python
import asyncio

import pdftract.asyncio as pdftract_async


async def extract_async():
    doc = await pdftract_async.extract("document.pdf")
    print(f"Extracted {len(doc.pages)} pages")


# Run the coroutine from synchronous code
asyncio.run(extract_async())
```

## See Also

- [MCP Client Configuration Guide](../integrations/mcp-clients.md)
- [JSON Schema Reference](../json-schema-reference.md)
- [CLI Reference](../cli/README.md)
- [Advanced: OCR Configuration](../advanced/ocr.md)