Python SDK
The Python SDK (pdftract) provides native Python bindings with idiomatic ergonomics including an exception hierarchy, dataclass types, and optional asyncio wrappers.
Installation
pip install pdftract
The package includes a precompiled native module for your platform. If the native module fails to import, the SDK automatically uses a subprocess fallback, which is significantly slower.
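The fallback described above follows a common "try the fast backend first" import pattern. A minimal sketch of that pattern, assuming placeholder module names (`_hypothetical_native_backend` and the stdlib stand-in `json` are illustrative only and say nothing about pdftract's real internals):

```python
import importlib
from types import ModuleType


def load_first_available(candidates: list[str]) -> ModuleType:
    """Return the first importable module from `candidates`, in order."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("no backend could be imported: " + "; ".join(errors))


# Hypothetical names: a missing native module first, then a stdlib module
# standing in for the slower fallback backend.
backend = load_first_available(["_hypothetical_native_backend", "json"])
print(backend.__name__)
```

Collecting the failure reasons keeps the final error message useful when no backend is importable.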
Basic Extraction
import pdftract
doc = pdftract.extract("document.pdf")
print(f"Extracted {len(doc.pages)} pages")
for page in doc.pages:
    for span in page.spans:
        print(span.text)
Text-Only Extraction
For RAG pipelines that just need the text body:
import pdftract
text = pdftract.extract_text("document.pdf")
print(text)
Streaming
For large PDFs, stream pages one at a time to keep memory usage bounded:
import pdftract
for page in pdftract.extract_stream("large_document.pdf"):
    print(f"Page {page.page_index}: {len(page.spans)} spans")
    # Process page while only one page is resident in memory
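Because pages arrive one at a time, aggregate statistics can be computed without holding the whole document in memory. A sketch using stand-in objects (`FakePage` and `fake_stream` are hypothetical substitutes for the SDK's page type and `extract_stream`; only the `page_index` and `spans` attributes shown above are assumed):

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class FakePage:
    """Hypothetical stand-in exposing the attributes used above."""
    page_index: int
    spans: list[str] = field(default_factory=list)


def fake_stream(n_pages: int) -> Iterator[FakePage]:
    """Yield pages lazily, the way a streaming extractor would."""
    for i in range(n_pages):
        yield FakePage(page_index=i, spans=[f"span {i}-{j}" for j in range(i + 1)])


def summarize(pages: Iterable[FakePage]) -> tuple[int, int]:
    """Return (page_count, span_count) while holding one page at a time."""
    page_count = span_count = 0
    for page in pages:
        page_count += 1
        span_count += len(page.spans)
    return page_count, span_count


print(summarize(fake_stream(4)))
```

Since `summarize` accepts any iterable of page-like objects, the same loop should work unchanged on a real page stream.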
Markdown Extraction
Extract Markdown with optional anchor links for mapping back to PDF locations:
import pdftract
# Basic Markdown
markdown = pdftract.extract_markdown("document.pdf")
# With anchor links (HTML comments)
markdown = pdftract.extract_markdown("document.pdf", anchors=True)
Options
Pass extraction options as keyword arguments:
import pdftract
doc = pdftract.extract(
    "document.pdf",
    pages="1-5,7",          # Page range
    password="secret123",   # PDF password
    receipts="lite"         # Receipt generation mode
)
Available Options
| Option | Type | Default | Use Case |
|---|---|---|---|
| pages | str \| None | None | Page range (e.g., "1-5,7,12-") |
| password | str \| None | None | PDF password for encrypted documents |
| receipts | str \| None | None | Receipt mode: "off", "lite", or "full" |
| ocr | bool | False | Enable OCR for scanned documents |
| ocr_language | list[str] | ["eng"] | OCR language codes |
| include_invisible | bool | False | Include invisible text in output |
| extract_forms | bool | True | Extract AcroForm fields |
| extract_attachments | bool | True | Extract embedded attachments |
| readability_threshold | float | 0.0 | Minimum readability score |
| max_decompress_gb | int | 512 | Max decompressed GB per stream |
| full_render | bool | False | Enable full rendering |
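The `pages` syntax combines single pages, closed ranges, and open-ended ranges. One way to read such a specification is sketched below, assuming 1-based inclusive page numbers and that a trailing `-` means "through the last page". These semantics and the `parse_page_range` helper are assumptions for illustration, not the SDK's documented parser:

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec like "1-5,7,12-" into sorted 1-based page numbers."""
    selected: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # An empty side means "from the first page" or "to the last page".
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else page_count
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range {part!r} for {page_count} pages")
        selected.update(range(start, end + 1))
    return sorted(selected)


print(parse_page_range("1-5,7,12-", page_count=14))
```

Under these assumptions, a range that starts beyond the last page is rejected rather than silently ignored; the SDK may instead report something like the `PAGE_OUT_OF_RANGE` diagnostic.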
Error Handling
The SDK provides a structured exception hierarchy:
import pdftract
try:
    doc = pdftract.extract("encrypted.pdf", password="wrong")
except pdftract.EncryptionError as e:
    print(f"Encryption error: {e.code} - {e.hint}")
except pdftract.CorruptPdfError as e:
    print(f"Corrupt PDF: {e}")
except pdftract.SourceUnreachableError as e:
    print(f"Source unreachable: {e}")
except pdftract.PdftractError as e:
    print(f"Extraction failed: {e}")
Exception Hierarchy
All exceptions inherit from PdftractError:
- PdftractError: Base exception for all extraction errors
- EncryptionError: PDF encryption/password errors
- CorruptPdfError: Malformed or corrupted PDF
- SourceUnreachableError: File or URL unreachable
- RemoteFetchInterruptedError: Network interruption during fetch
- TlsError: TLS/certificate errors
- ReceiptVerifyError: Receipt verification failed
- UnsupportedOperationError: Requested operation not available
Exception Attributes
All exceptions have the following attributes:
- code: Diagnostic code (e.g., "ENCRYPTION_WRONG_PASSWORD")
- page_index: Page number where the error occurred (if applicable)
- hint: Suggested action for resolution
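The hierarchy and attributes above could be modelled roughly as follows. This is an illustrative sketch of the pattern with locally defined stand-in classes, not the SDK's source; the constructor signature and the `describe` helper are assumptions, and only two subclasses are shown:

```python
from typing import Optional


class PdftractError(Exception):
    """Base class carrying a diagnostic code, optional page, and hint."""

    def __init__(self, message: str, *, code: str,
                 page_index: Optional[int] = None,
                 hint: Optional[str] = None) -> None:
        super().__init__(message)
        self.code = code
        self.page_index = page_index
        self.hint = hint


class EncryptionError(PdftractError):
    """Encryption or password problems."""


class CorruptPdfError(PdftractError):
    """Malformed or corrupted input."""


def describe(exc: PdftractError) -> str:
    """Format an error for logs, including the page when one is known."""
    where = f" (page {exc.page_index})" if exc.page_index is not None else ""
    hint = f"; hint: {exc.hint}" if exc.hint else ""
    return f"[{exc.code}]{where} {exc}{hint}"


try:
    raise EncryptionError("password rejected",
                          code="ENCRYPTION_WRONG_PASSWORD",
                          hint="retry with the correct password")
except PdftractError as exc:  # the base class catches every subclass
    print(describe(exc))
```

Catching the base class last, as in the Error Handling example, lets specific handlers run first while still covering every other extraction error.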
Metadata
Get document metadata without full extraction:
import pdftract
metadata = pdftract.get_metadata("document.pdf")
print(f"Pages: {metadata.page_count}")
print(f"Title: {metadata.title}")
print(f"Author: {metadata.author}")
print(f"Fingerprint: {metadata.fingerprint}")
Search
Search for a regex pattern in the PDF:
import pdftract
for match in pdftract.search("document.pdf", r"\b\d{3}-\d{2}-\d{4}\b"):
    print(f"Found SSN at page {match.page_index}: {match.text}")
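Conceptually, a search like this applies the pattern to each page's text and reports where it matched. A rough sketch over plain strings follows; the `Match` dataclass and `search_pages` helper are hypothetical, zero-based page indices are an assumption, and the SDK's own matching may differ (for example, across span boundaries):

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Match:
    """Hypothetical result record: which page matched, and the matched text."""
    page_index: int
    text: str


def search_pages(page_texts: Iterable[str], pattern: str) -> Iterator[Match]:
    """Yield one Match per regex hit, tagged with its zero-based page."""
    compiled = re.compile(pattern)
    for page_index, text in enumerate(page_texts):
        for hit in compiled.finditer(text):
            yield Match(page_index=page_index, text=hit.group(0))


pages = ["No identifiers here.", "SSN on file: 123-45-6789."]
for match in search_pages(pages, r"\b\d{3}-\d{2}-\d{4}\b"):
    print(match.page_index, match.text)
```

Compiling the pattern once outside the loop avoids recompiling it for every page.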
Fingerprint
Compute the structural fingerprint of a PDF:
import pdftract
fingerprint = pdftract.hash("document.pdf")
print(f"Fingerprint: {fingerprint.value}")
Classify
Classify the type of a PDF:
import pdftract
classification = pdftract.classify("document.pdf")
print(f"Type: {classification.class_name}")
print(f"Confidence: {classification.confidence}")
Verify Receipt
Verify a cryptographic receipt:
import pdftract
# Extract with receipts enabled
doc = pdftract.extract("document.pdf", receipts="lite")
receipt = doc.pages[0].receipt
# Verify later
verified = pdftract.verify_receipt("document.pdf", receipt)
print(f"Verified: {verified}")
Remote PDFs
Extract from HTTP/HTTPS URLs:
import pdftract
doc = pdftract.extract("https://example.com/document.pdf")
MCP Integration
For AI-assisted PDF extraction, pdftract provides an MCP (Model Context Protocol) server. The Python SDK can be used alongside MCP clients like Claude Desktop:
pdftract mcp --stdio
See MCP Client Configuration Guide for setup instructions.
Types
The SDK provides typed wrappers for all output structures:
import pdftract
from pdftract.types import Document, Page, Span, Block, Metadata
# All extraction functions return typed objects
doc: Document = pdftract.extract("document.pdf")
page: Page = doc.pages[0]
span: Span = page.spans[0]
block: Block = page.blocks[0]
metadata: Metadata = pdftract.get_metadata("document.pdf")
Async API
For asyncio-based applications, use the async API:
import asyncio
import pdftract.asyncio as pdftract_async
async def extract_async():
    doc = await pdftract_async.extract("document.pdf")
    print(f"Extracted {len(doc.pages)} pages")
# Run the coroutine from synchronous code
asyncio.run(extract_async())
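An async API is most useful when several documents are processed concurrently. A sketch of that pattern using a stand-in coroutine (`fake_extract` is hypothetical and simply sleeps; in practice it would be replaced by the awaited `extract` call shown above, and the `limit` parameter is an illustrative choice):

```python
import asyncio
from typing import Awaitable, Callable


async def fake_extract(path: str) -> str:
    """Hypothetical stand-in for an awaitable extraction call."""
    await asyncio.sleep(0.01)  # simulate I/O-bound work
    return f"extracted:{path}"


async def extract_many(paths: list[str],
                       extract: Callable[[str], Awaitable[str]],
                       limit: int = 4) -> list[str]:
    """Run `extract` over all paths with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: str) -> str:
        async with semaphore:
            return await extract(path)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in paths))


results = asyncio.run(extract_many(["a.pdf", "b.pdf", "c.pdf"], fake_extract))
print(results)
```

The semaphore caps how many extractions run at once, which helps keep memory use predictable when the input list is long.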