jedarden b93bb53ac2 docs(pdftract-46tdo): add comprehensive troubleshooting guide with diagnostic code mappings

- Created troubleshooting.md mapping 22+ user-visible diagnostic codes
- Added symptom-to-diagnostic lookup table for quick navigation
- Each diagnostic code includes: what it means, cause, fix, severity
- Cross-references the Diagnostics Reference for full catalog
- Updated SUMMARY.md to include new troubleshooting guide
- Verified mdBook builds successfully

Acceptance criteria:
- Covers 15+ diagnostic codes (actual: 22+)
- Top-level TOC for navigation
- Cross-links to Diagnostic Code Catalog
- mdBook renders cleanly

Diagnostic codes covered:
XREF_REPAIRED, STREAM_BOMB, ENCRYPTION_UNSUPPORTED,
OCR_JBIG2_UNSUPPORTED, OCR_JPX_UNSUPPORTED, OCR_CCITT_UNSUPPORTED,
BROKENVECTOR_OCR_UNAVAILABLE, MCP_PATH_TRAVERSAL, PATH_OUTSIDE_ROOT,
URL_PRIVATE_NETWORK, CACHE_ENTRY_CORRUPT, CACHE_INTEGRITY_FAIL,
PROFILE_INVALID, PROFILE_SECRETS_FORBIDDEN, PAGE_OUT_OF_RANGE,
GLYPH_UNMAPPED, JAVASCRIPT_PRESENT, STRUCT_CIRCULAR_REF,
STRUCT_XOBJECT_CYCLE, GSTATE_STACK_OVERFLOW, REMOTE_FETCH_INTERRUPTED,
REMOTE_NO_RANGE_SUPPORT, TAGGED_PDF_STRUCT_TREE_DEFERRED

2026-05-31 23:24:42 -04:00

6.3 KiB


Python SDK

The Python SDK (pdftract) provides native Python bindings with idiomatic ergonomics, including an exception hierarchy, dataclass types, and optional asyncio wrappers.

Installation

pip install pdftract

The package includes a precompiled native module for your platform. If the native module fails to import, the SDK automatically uses a subprocess fallback, which is significantly slower.
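A minimal sketch of a startup check, using only the standard library, that reports whether a package can be located at all. How pdftract signals that it is running on the subprocess fallback is not described here, so this only tests importability; `is_importable` is a hypothetical helper name, not part of the SDK.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises for a missing parent package or a module
        # whose __spec__ is unset; treat both as "not importable".
        return False


# Hypothetical usage: warn early if the package is missing entirely.
pdftract_present = is_importable("pdftract")
```

A check like this can run before any extraction call so that a missing installation fails fast with a clear message.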

Basic Extraction

import pdftract

doc = pdftract.extract("document.pdf")
print(f"Extracted {len(doc.pages)} pages")

for page in doc.pages:
    for span in page.spans:
        print(span.text)

Text-Only Extraction

For RAG pipelines that just need the text body:

import pdftract

text = pdftract.extract_text("document.pdf")
print(text)

Streaming

For large PDFs, stream pages one at a time to keep memory usage bounded:

import pdftract

for page in pdftract.extract_stream("large_document.pdf"):
    print(f"Page {page.page_index}: {len(page.spans)} spans")
    # Process page while only one page is resident in memory
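The streaming pattern above works because the consumer never materializes the full page list. A small sketch of a lazy consumer follows; `span_counts` and `fake_stream` are hypothetical names, and the stand-in pages only mimic the `page_index` and `spans` attributes used above.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Tuple


def span_counts(pages: Iterable) -> Iterator[Tuple[int, int]]:
    """Yield (page_index, span_count) lazily, one page at a time."""
    for page in pages:
        yield page.page_index, len(page.spans)


def fake_stream():
    """Stand-in for pdftract.extract_stream(...); illustration only."""
    for index, n_spans in enumerate([3, 0, 5]):
        yield SimpleNamespace(page_index=index, spans=["x"] * n_spans)


# Summing over the generator keeps at most one page alive at a time.
total_spans = sum(count for _, count in span_counts(fake_stream()))
```

With the SDK installed, `fake_stream()` would presumably be replaced by `pdftract.extract_stream("large_document.pdf")`.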

Markdown Extraction

Extract Markdown with optional anchor links for mapping back to PDF locations:

import pdftract

# Basic Markdown
markdown = pdftract.extract_markdown("document.pdf")

# With anchor links (HTML comments)
markdown = pdftract.extract_markdown("document.pdf", anchors=True)

Options

Pass extraction options as keyword arguments:

import pdftract

doc = pdftract.extract(
    "document.pdf",
    pages="1-5,7",           # Page range
    password="secret123",    # PDF password
    receipts="lite"          # Receipt generation mode
)

Available Options

| Option | Type | Default | Description |
|---|---|---|---|
| `pages` | `str \| None` | `None` | Page range (e.g., `"1-5,7,12-"`) |
| `password` | `str \| None` | `None` | PDF password for encrypted documents |
| `receipts` | `str \| None` | `None` | Receipt mode: `"off"`, `"lite"`, or `"full"` |
| `ocr` | `bool` | `False` | Enable OCR for scanned documents |
| `ocr_language` | `list[str]` | `["eng"]` | OCR language codes |
| `include_invisible` | `bool` | `False` | Include invisible text in output |
| `extract_forms` | `bool` | `True` | Extract AcroForm fields |
| `extract_attachments` | `bool` | `True` | Extract embedded attachments |
| `readability_threshold` | `float` | `0.0` | Minimum readability score |
| `max_decompress_gb` | `int` | `512` | Max decompressed GB per stream |
| `full_render` | `bool` | `False` | Enable full rendering |
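To make the `pages` syntax concrete, here is a sketch of how a range string such as `"1-5,7,12-"` could be expanded. It assumes page numbers are 1-based and inclusive and that a trailing `-` means "through the last page"; the SDK's actual parsing rules are not specified here, and `expand_page_range` is a hypothetical helper.

```python
from typing import List, Set


def expand_page_range(spec: str, page_count: int) -> List[int]:
    """Expand a spec like "1-5,7,12-" into sorted page numbers.

    Assumes 1-based, inclusive ranges; an open end ("12-") runs to
    `page_count`, and an open start ("-3") begins at page 1.
    """
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start = int(start_s) if start_s else 1
            end = int(end_s) if end_s else page_count
        else:
            start = end = int(part)
        if start < 1:
            raise ValueError(f"page numbers start at 1: {part!r}")
        # Pages beyond the document are ignored rather than rejected.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


selected = expand_page_range("1-5,7,12-", page_count=14)
```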

Error Handling

The SDK provides a structured exception hierarchy:

import pdftract

try:
    doc = pdftract.extract("encrypted.pdf", password="wrong")
except pdftract.EncryptionError as e:
    print(f"Encryption error: {e.code} - {e.hint}")
except pdftract.CorruptPdfError as e:
    print(f"Corrupt PDF: {e}")
except pdftract.SourceUnreachableError as e:
    print(f"Source unreachable: {e}")
except pdftract.PdftractError as e:
    print(f"Extraction failed: {e}")

Exception Hierarchy

All exceptions inherit from PdftractError:

PdftractError — Base exception for all extraction errors
EncryptionError — PDF encryption/password errors
CorruptPdfError — Malformed or corrupted PDF
SourceUnreachableError — File or URL unreachable
RemoteFetchInterruptedError — Network interruption during fetch
TlsError — TLS/certificate errors
ReceiptVerifyError — Receipt verification failed
UnsupportedOperationError — Requested operation not available

Exception Attributes

All exceptions have the following attributes:

code — Diagnostic code (e.g., "ENCRYPTION_WRONG_PASSWORD")
page_index — Index of the page where the error occurred (if applicable)
hint — Suggested action for resolution
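The three attributes above can be folded into a single log line. The sketch below uses a stand-in exception class so that it runs without the SDK; `FakeSdkError` and `format_error` are hypothetical names, and only the attribute names `code`, `page_index`, and `hint` come from this guide.

```python
class FakeSdkError(Exception):
    """Stand-in for pdftract.PdftractError; not part of the SDK."""

    def __init__(self, message, code=None, page_index=None, hint=None):
        super().__init__(message)
        self.code = code
        self.page_index = page_index
        self.hint = hint


def format_error(exc: Exception) -> str:
    """Build a one-line summary from whichever attributes are present."""
    code = getattr(exc, "code", None)
    page_index = getattr(exc, "page_index", None)
    hint = getattr(exc, "hint", None)

    text = f"[{code}] {exc}" if code else str(exc)
    if page_index is not None:  # 0 is a valid index, so compare to None
        text += f" (page index {page_index})"
    if hint:
        text += f"; hint: {hint}"
    return text


example = format_error(
    FakeSdkError(
        "bad password",
        code="ENCRYPTION_WRONG_PASSWORD",
        hint="retry with the correct password",
    )
)
```

Using `getattr` with a default keeps the helper safe for exceptions that do not come from the SDK.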

Metadata

Get document metadata without full extraction:

import pdftract

metadata = pdftract.get_metadata("document.pdf")
print(f"Pages: {metadata.page_count}")
print(f"Title: {metadata.title}")
print(f"Author: {metadata.author}")
print(f"Fingerprint: {metadata.fingerprint}")

Search

Search for a regex pattern in the PDF:

import pdftract

for match in pdftract.search("document.pdf", r"\b\d{3}-\d{2}-\d{4}\b"):
    print(f"Found SSN at page {match.page_index}: {match.text}")
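Before running a pattern across a whole PDF, it can help to check it against a short sample string. The sketch below uses Python's `re` module; whether pdftract's search uses the same regex dialect is an assumption, so treat this as a sanity check of the pattern rather than of the SDK.

```python
import re

# Same pattern as in the search example above.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

sample = "Valid: 123-45-6789. Too long: 1234-56-7890 and 123-45-67890."

# The \b anchors reject digit groups embedded in longer digit runs.
matches = SSN_PATTERN.findall(sample)
```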

Fingerprint

Compute the structural fingerprint of a PDF:

import pdftract

fingerprint = pdftract.hash("document.pdf")
print(f"Fingerprint: {fingerprint.value}")

Classify

Classify a PDF by type:

import pdftract

classification = pdftract.classify("document.pdf")
print(f"Type: {classification.class_name}")
print(f"Confidence: {classification.confidence}")

Verify Receipt

Verify a cryptographic receipt:

import pdftract

# Extract with receipts enabled
doc = pdftract.extract("document.pdf", receipts="lite")
receipt = doc.pages[0].receipt

# Verify later
verified = pdftract.verify_receipt("document.pdf", receipt)
print(f"Verified: {verified}")

Remote PDFs

Extract from HTTP/HTTPS URLs:

import pdftract

doc = pdftract.extract("https://example.com/document.pdf")
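Remote fetches can fail mid-transfer (see `RemoteFetchInterruptedError` in the exception hierarchy). One way to handle this is a small retry wrapper with exponential backoff. The sketch below is generic and runs without the SDK; `retry` and `FakeInterrupted` are hypothetical names, and whether the SDK already retries internally is not stated here.

```python
import time


def retry(fn, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on `retry_on`, wait and try again up to `attempts` times.

    The delay doubles after each failure: base_delay, 2*base_delay, ...
    The final failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


class FakeInterrupted(Exception):
    """Stand-in for pdftract.RemoteFetchInterruptedError."""


# With the SDK, a call might look like (assumption, not verified):
#   retry(lambda: pdftract.extract(url), pdftract.RemoteFetchInterruptedError)
```

Passing the exception type as an argument keeps the helper independent of the SDK and makes it easy to test with a stand-in.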

MCP Integration

For AI-assisted PDF extraction, pdftract provides an MCP (Model Context Protocol) server. The Python SDK can be used alongside MCP clients like Claude Desktop:

pdftract mcp --stdio

See the MCP Client Configuration Guide for setup instructions.

Types

The SDK provides typed wrappers for all output structures:

import pdftract
from pdftract.types import Document, Page, Span, Block, Metadata

# All extraction functions return typed objects
doc: Document = pdftract.extract("document.pdf")
page: Page = doc.pages[0]
span: Span = page.spans[0]
block: Block = page.blocks[0]
metadata: Metadata = pdftract.get_metadata("document.pdf")
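Because the output types expose plain attributes, helper functions can be written against their shape and tested without a PDF. The sketch below uses stand-in dataclasses that mirror only the `pages`, `spans`, `page_index`, and `text` attributes used in this guide; the `*Stub` classes and the join rules (spaces between spans, blank lines between pages) are assumptions, not the SDK's definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpanStub:
    text: str


@dataclass
class PageStub:
    page_index: int
    spans: List[SpanStub] = field(default_factory=list)


@dataclass
class DocumentStub:
    pages: List[PageStub] = field(default_factory=list)


def page_text(page) -> str:
    """Join a page's span texts with single spaces."""
    return " ".join(span.text for span in page.spans)


def document_text(doc) -> str:
    """Join page texts with a blank line between pages."""
    return "\n\n".join(page_text(page) for page in doc.pages)


doc_stub = DocumentStub(
    pages=[
        PageStub(0, [SpanStub("Hello"), SpanStub("world")]),
        PageStub(1, [SpanStub("Bye")]),
    ]
)
joined = document_text(doc_stub)
```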

Async API

For asyncio-based applications, use the async API:

import asyncio

import pdftract.asyncio as pdftract_async

async def extract_async():
    doc = await pdftract_async.extract("document.pdf")
    print(f"Extracted {len(doc.pages)} pages")

asyncio.run(extract_async())
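An async API is most useful when several documents are processed concurrently. The following sketch bounds concurrency with a semaphore and uses a stand-in coroutine so that it runs without the SDK; `extract_many` and `fake_extract` are hypothetical names, and the right concurrency limit for real workloads is not something this guide specifies.

```python
import asyncio


async def extract_many(paths, extract, max_concurrent=4):
    """Run `extract(path)` for each path, at most `max_concurrent` at once.

    Results are returned in the same order as `paths`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path):
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(bounded(path) for path in paths))


async def fake_extract(path):
    """Stand-in for an async extract call; illustration only."""
    await asyncio.sleep(0)
    return f"extracted:{path}"


results = asyncio.run(extract_many(["a.pdf", "b.pdf", "c.pdf"], fake_extract))
```

With the SDK, `fake_extract` would presumably be replaced by the `extract` coroutine from `pdftract.asyncio`.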
