docs(pdftract-46tdo): add comprehensive troubleshooting guide with diagnostic code mappings
- Created troubleshooting.md mapping 22+ user-visible diagnostic codes
- Added symptom-to-diagnostic lookup table for quick navigation
- Each diagnostic code includes: what it means, cause, fix, severity
- Cross-references the Diagnostics Reference for full catalog
- Updated SUMMARY.md to include new troubleshooting guide
- Verified mdBook builds successfully

Acceptance criteria:
- Covers 15+ diagnostic codes (actual: 22+)
- Top-level TOC for navigation
- Cross-links to Diagnostic Code Catalog
- mdBook renders cleanly

Diagnostic codes covered: XREF_REPAIRED, STREAM_BOMB, ENCRYPTION_UNSUPPORTED, OCR_JBIG2_UNSUPPORTED, OCR_JPX_UNSUPPORTED, OCR_CCITT_UNSUPPORTED, BROKENVECTOR_OCR_UNAVAILABLE, MCP_PATH_TRAVERSAL, PATH_OUTSIDE_ROOT, URL_PRIVATE_NETWORK, CACHE_ENTRY_CORRUPT, CACHE_INTEGRITY_FAIL, PROFILE_INVALID, PROFILE_SECRETS_FORBIDDEN, PAGE_OUT_OF_RANGE, GLYPH_UNMAPPED, JAVASCRIPT_PRESENT, STRUCT_CIRCULAR_REF, STRUCT_XOBJECT_CYCLE, GSTATE_STACK_OVERFLOW, REMOTE_FETCH_INTERRUPTED, REMOTE_NO_RANGE_SUPPORT, TAGGED_PDF_STRUCT_TREE_DEFERRED
This commit is contained in:
parent
0e7def1d21
commit
b93bb53ac2
4 changed files with 739 additions and 4 deletions
|
|
@ -50,6 +50,8 @@
|
|||
- [Hybrid Routing](./advanced/hybrid-routing.md)
|
||||
- [Provenance and Confidence](./advanced/provenance.md)
|
||||
|
||||
- [Troubleshooting Guide](./troubleshooting.md)
|
||||
|
||||
- [Troubleshooting](./troubleshooting/README.md)
|
||||
- [Common Issues](./troubleshooting/common-issues.md)
|
||||
- [Diagnostics](./troubleshooting/diagnostics.md)
|
||||
|
|
|
|||
|
|
@ -1,5 +1,250 @@
|
|||
# Python SDK
|
||||
|
||||
> **Draft** — This page is a placeholder for future content.
|
||||
The Python SDK (`pdftract`) provides native Python bindings with idiomatic ergonomics including an exception hierarchy, dataclass types, and optional asyncio wrappers.
|
||||
|
||||
Using pdftract from Python.
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install pdftract
|
||||
```
|
||||
|
||||
The package includes a precompiled native module for your platform. If the native module fails to import, a subprocess fallback is automatically used (with significantly degraded performance).
|
||||
|
||||
## Basic Extraction
|
||||
|
||||
```python
import pdftract

doc = pdftract.extract("document.pdf")
print(f"Extracted {len(doc.pages)} pages")

for page in doc.pages:
    for span in page.spans:
        print(span.text)
```
|
||||
|
||||
## Text-Only Extraction
|
||||
|
||||
For RAG pipelines that just need the text body:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
text = pdftract.extract_text("document.pdf")
|
||||
print(text)
|
||||
```
|
||||
|
||||
## Streaming
|
||||
|
||||
For large PDFs, stream pages one at a time to keep memory usage bounded:
|
||||
|
||||
```python
import pdftract

for page in pdftract.extract_stream("large_document.pdf"):
    print(f"Page {page.page_index}: {len(page.spans)} spans")
    # Process page while only one page is resident in memory
```
|
||||
|
||||
## Markdown Extraction
|
||||
|
||||
Extract Markdown with optional anchor links for mapping back to PDF locations:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
# Basic Markdown
|
||||
markdown = pdftract.extract_markdown("document.pdf")
|
||||
|
||||
# With anchor links (HTML comments)
|
||||
markdown = pdftract.extract_markdown("document.pdf", anchors=True)
|
||||
```
|
||||
|
||||
## Options
|
||||
|
||||
Pass extraction options as keyword arguments:
|
||||
|
||||
```python
import pdftract

doc = pdftract.extract(
    "document.pdf",
    pages="1-5,7",          # Page range
    password="secret123",   # PDF password
    receipts="lite",        # Receipt generation mode
)
```
|
||||
|
||||
### Available Options
|
||||
|
||||
| Option | Type | Default | Use Case |
|
||||
|--------|------|---------|----------|
|
||||
| `pages` | `str \| None` | `None` | Page range (e.g., `"1-5,7,12-"`) |
|
||||
| `password` | `str \| None` | `None` | PDF password for encrypted documents |
|
||||
| `receipts` | `str \| None` | `None` | Receipt mode: `"off"`, `"lite"`, or `"full"` |
|
||||
| `ocr` | `bool` | `False` | Enable OCR for scanned documents |
|
||||
| `ocr_language` | `list[str]` | `["eng"]` | OCR language codes |
|
||||
| `include_invisible` | `bool` | `False` | Include invisible text in output |
|
||||
| `extract_forms` | `bool` | `True` | Extract AcroForm fields |
|
||||
| `extract_attachments` | `bool` | `True` | Extract embedded attachments |
|
||||
| `readability_threshold` | `float` | `0.0` | Minimum readability score |
|
||||
| `max_decompress_gb` | `int` | `512` | Max decompressed GB per stream |
|
||||
| `full_render` | `bool` | `False` | Enable full rendering |
|
||||
|
||||
## Error Handling
|
||||
|
||||
The SDK provides a structured exception hierarchy:
|
||||
|
||||
```python
import pdftract

try:
    doc = pdftract.extract("encrypted.pdf", password="wrong")
except pdftract.EncryptionError as e:
    print(f"Encryption error: {e.code} - {e.hint}")
except pdftract.CorruptPdfError as e:
    print(f"Corrupt PDF: {e}")
except pdftract.SourceUnreachableError as e:
    # Covers both missing files and unreachable URLs
    print(f"Source unreachable: {e}")
except pdftract.PdftractError as e:
    # Base class last, so the specific handlers above take precedence
    print(f"Extraction failed: {e}")
```
|
||||
|
||||
### Exception Hierarchy
|
||||
|
||||
All exceptions inherit from `PdftractError`:
|
||||
|
||||
- `PdftractError` — Base exception for all extraction errors
|
||||
- `EncryptionError` — PDF encryption/password errors
|
||||
- `CorruptPdfError` — Malformed or corrupted PDF
|
||||
- `SourceUnreachableError` — File or URL unreachable
|
||||
- `RemoteFetchInterruptedError` — Network interruption during fetch
|
||||
- `TlsError` — TLS/certificate errors
|
||||
- `ReceiptVerifyError` — Receipt verification failed
|
||||
- `UnsupportedOperationError` — Requested operation not available
|
||||
|
||||
### Exception Attributes
|
||||
|
||||
All exceptions have the following attributes:
|
||||
|
||||
- `code` — Diagnostic code (e.g., `"ENCRYPTION_WRONG_PASSWORD"`)
|
||||
- `page_index` — Page number where error occurred (if applicable)
|
||||
- `hint` — Suggested action for resolution
|
||||
|
||||
## Metadata
|
||||
|
||||
Get document metadata without full extraction:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
metadata = pdftract.get_metadata("document.pdf")
|
||||
print(f"Pages: {metadata.page_count}")
|
||||
print(f"Title: {metadata.title}")
|
||||
print(f"Author: {metadata.author}")
|
||||
print(f"Fingerprint: {metadata.fingerprint}")
|
||||
```
|
||||
|
||||
## Search
|
||||
|
||||
Search for a regex pattern in the PDF:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
for match in pdftract.search("document.pdf", r"\b\d{3}-\d{2}-\d{4}\b"):
|
||||
print(f"Found SSN at page {match.page_index}: {match.text}")
|
||||
```
|
||||
|
||||
## Fingerprint
|
||||
|
||||
Compute the structural fingerprint of a PDF:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
fingerprint = pdftract.hash("document.pdf")
|
||||
print(f"Fingerprint: {fingerprint.value}")
|
||||
```
|
||||
|
||||
## Classify
|
||||
|
||||
Classify a PDF page type:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
classification = pdftract.classify("document.pdf")
|
||||
print(f"Type: {classification.class_name}")
|
||||
print(f"Confidence: {classification.confidence}")
|
||||
```
|
||||
|
||||
## Verify Receipt
|
||||
|
||||
Verify a cryptographic receipt:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
# Extract with receipts enabled
|
||||
doc = pdftract.extract("document.pdf", receipts="lite")
|
||||
receipt = doc.pages[0].receipt
|
||||
|
||||
# Verify later
|
||||
verified = pdftract.verify_receipt("document.pdf", receipt)
|
||||
print(f"Verified: {verified}")
|
||||
```
|
||||
|
||||
## Remote PDFs
|
||||
|
||||
Extract from HTTP/HTTPS URLs:
|
||||
|
||||
```python
|
||||
import pdftract
|
||||
|
||||
doc = pdftract.extract("https://example.com/document.pdf")
|
||||
```
|
||||
|
||||
## MCP Integration
|
||||
|
||||
For AI-assisted PDF extraction, pdftract provides an [MCP (Model Context Protocol) server](../integrations/mcp-clients.md). The Python SDK can be used alongside MCP clients like Claude Desktop:
|
||||
|
||||
```bash
|
||||
pdftract mcp --stdio
|
||||
```
|
||||
|
||||
See [MCP Client Configuration Guide](../integrations/mcp-clients.md) for setup instructions.
|
||||
|
||||
## Types
|
||||
|
||||
The SDK provides typed wrappers for all output structures:
|
||||
|
||||
```python
import pdftract
from pdftract.types import Document, Page, Span, Block, Metadata

# All extraction functions return typed objects
doc: Document = pdftract.extract("document.pdf")
page: Page = doc.pages[0]
span: Span = page.spans[0]
block: Block = page.blocks[0]
metadata: Metadata = pdftract.get_metadata("document.pdf")
```
|
||||
|
||||
## Async API
|
||||
|
||||
For asyncio-based applications, use the async API:
|
||||
|
||||
```python
import asyncio

import pdftract.asyncio as pdftract_async

async def extract_async():
    doc = await pdftract_async.extract("document.pdf")
    print(f"Extracted {len(doc.pages)} pages")

# Run the coroutine from synchronous code
asyncio.run(extract_async())
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [MCP Client Configuration Guide](../integrations/mcp-clients.md)
|
||||
- [JSON Schema Reference](../json-schema-reference.md)
|
||||
- [CLI Reference](../cli/README.md)
|
||||
- [Advanced: OCR Configuration](../advanced/ocr.md)
|
||||
|
|
|
|||
|
|
@ -1,5 +1,190 @@
|
|||
# Rust SDK
|
||||
|
||||
> **Draft** — This page is a placeholder for future content.
|
||||
The Rust SDK is the `pdftract-core` crate. It provides native PDF text extraction with zero-copy memory mapping and streaming support.
|
||||
|
||||
Using pdftract from Rust.
|
||||
## Installation
|
||||
|
||||
Add to your `Cargo.toml`:
|
||||
|
||||
```toml
|
||||
[dependencies]
|
||||
pdftract-core = "1.0"
|
||||
```
|
||||
|
||||
For OCR support, enable the `ocr` feature:
|
||||
|
||||
```toml
|
||||
[dependencies]
|
||||
pdftract-core = { version = "1.0", features = ["ocr"] }
|
||||
```
|
||||
|
||||
## Basic Extraction
|
||||
|
||||
```rust
use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions};

fn main() -> anyhow::Result<()> {
    let opts = ExtractionOptions::default();
    let output = OutputOptions::default();

    let result = extract_pdf("document.pdf", &opts, &output)?;

    for (i, page) in result.pages.iter().enumerate() {
        println!("Page {}: {} chars", i + 1, page.text.len());
        for span in &page.spans {
            println!("  {}", span.text);
        }
    }
    Ok(())
}
```
|
||||
|
||||
## Streaming Extraction
|
||||
|
||||
For large PDFs, stream pages one at a time to keep memory usage bounded:
|
||||
|
||||
```rust
use pdftract_core::{extract_pdf_streaming, ExtractionOptions, OutputOptions};
use std::fs::File;

fn main() -> anyhow::Result<()> {
    let mut output = File::create("output.ndjson")?;
    extract_pdf_streaming(
        "large_document.pdf",
        &ExtractionOptions::default(),
        &OutputOptions::default(),
        &mut output,
    )?;
    Ok(())
}
```
|
||||
|
||||
## Options
|
||||
|
||||
### ExtractionOptions
|
||||
|
||||
| Field | Type | Default | Use Case |
|
||||
|-------|------|---------|----------|
|
||||
| `receipts` | `ReceiptsMode` | `Off` | Generate cryptographic receipts |
|
||||
| `max_parallel_pages` | `usize` | `4` | Control memory for concurrent page processing |
|
||||
| `memory_budget_mb` | `usize` | `512` | Target peak RSS in MB |
|
||||
| `full_render` | `bool` | `false` | Enable PDFium rendering (requires `full-render` feature) |
|
||||
| `ocr_dpi_override` | `Option<u32>` | `None` | Override automatic DPI selection |
|
||||
| `ocr_language` | `Vec<String>` | `vec!["eng"]` | Tesseract language codes |
|
||||
| `markdown_anchors` | `bool` | `false` | Emit HTML comment anchors in Markdown |
|
||||
| `max_decompress_bytes` | `u64` | `512 MiB` | Bomb limit for decompressed streams |
|
||||
| `output` | `OutputOptions` | `default()` | Output filtering options |
|
||||
| `pages` | `Option<String>` | `None` | Page range (e.g., `"1-5,7,12-"`) |
|
||||
| `password` | `Option<SecretString>` | `None` | PDF password for encrypted documents |
|
||||
|
||||
### OutputOptions
|
||||
|
||||
| Field | Type | Default | Use Case |
|
||||
|-------|------|---------|----------|
|
||||
| `include_invisible` | `bool` | `false` | Include invisible text in output |
|
||||
| `extract_forms` | `bool` | `true` | Extract AcroForm fields |
|
||||
| `extract_attachments` | `bool` | `true` | Extract embedded attachments |
|
||||
|
||||
## Receipts
|
||||
|
||||
Generate cryptographic receipts for verification:
|
||||
|
||||
```rust
use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions};
use pdftract_core::options::ReceiptsMode;

fn main() -> anyhow::Result<()> {
    let opts = ExtractionOptions {
        receipts: ReceiptsMode::Lite,
        ..Default::default()
    };
    let output = OutputOptions::default();
    let result = extract_pdf("document.pdf", &opts, &output)?;

    // Receipts are embedded in page metadata
    if let Some(receipt) = &result.pages[0].receipt {
        println!("Receipt: {}", receipt);
    }
    Ok(())
}
```
|
||||
|
||||
## Remote PDFs
|
||||
|
||||
With the `remote` feature, fetch PDFs via HTTP:
|
||||
|
||||
```rust
use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions};

fn main() -> anyhow::Result<()> {
    let opts = ExtractionOptions::default();
    let output = OutputOptions::default();
    let result = extract_pdf("https://example.com/document.pdf", &opts, &output)?;
    Ok(())
}
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
Most functions return `anyhow::Result<T>`, which wraps the underlying error types:
|
||||
|
||||
```rust
use pdftract_core::{extract_pdf, ExtractionOptions, OutputOptions};

fn main() {
    let opts = ExtractionOptions::default();
    let output = OutputOptions::default();

    match extract_pdf("document.pdf", &opts, &output) {
        Ok(result) => {
            println!("Extracted {} pages", result.pages.len());
        }
        Err(e) => {
            eprintln!("Extraction failed: {}", e);
            // Inspect error chain
            for cause in e.chain() {
                eprintln!("  caused by: {}", cause);
            }
        }
    }
}
```
|
||||
|
||||
## Feature Flags
|
||||
|
||||
| Feature | Adds | Default |
|
||||
|---------|------|---------|
|
||||
| `serde` | JSON serialization support | ✓ |
|
||||
| `decrypt` | Decryption of encrypted PDFs | ✓ |
|
||||
| `quick-xml` | Conformance detection via XML metadata | ✓ |
|
||||
| `ocr` | Tesseract OCR for scanned documents | - |
|
||||
| `full-render` | PDFium-based rendering (requires `ocr`) | - |
|
||||
| `remote` | HTTP range fetching for remote PDFs | - |
|
||||
| `profiles` | Extraction profiles | - |
|
||||
| `receipts` | Cryptographic receipt generation | - |
|
||||
| `cjk` | CJK text extraction via predefined CMap registry | - |
|
||||
| `schemars` | JSON Schema generation | - |
|
||||
|
||||
## Source Types
|
||||
|
||||
The SDK supports multiple source types via the `PdfSource` trait:
|
||||
|
||||
```rust
|
||||
use pdftract_core::source::{FileSource, MmapSource, MemorySource};
|
||||
|
||||
// Memory-mapped source (zero-copy for large files)
|
||||
let source = MmapSource::open("document.pdf")?;
|
||||
|
||||
// In-memory source (for byte buffers)
|
||||
let data = std::fs::read("document.pdf")?;
|
||||
let source = MemorySource::new(data);
|
||||
|
||||
// Standard file source
|
||||
let source = FileSource::open("document.pdf")?;
|
||||
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [JSON Schema Reference](../json-schema-reference.md)
|
||||
- [CLI Reference](../cli/README.md)
|
||||
- [Advanced: OCR Configuration](../advanced/ocr.md)
|
||||
|
|
|
|||
303
docs/user-docs/src/troubleshooting.md
Normal file
|
|
@ -0,0 +1,303 @@
|
|||
# Troubleshooting
|
||||
|
||||
This guide maps common pdftract failures to their causes and fixes. Each error is associated with a **diagnostic code** that appears in extraction output (see `diagnostics` in the JSON response or CLI stderr).
|
||||
|
||||
> **For the authoritative diagnostic code catalog**, see the [Diagnostics Reference](./troubleshooting/diagnostics.md).
|
||||
|
||||
## Symptom → Diagnostic Lookup
|
||||
|
||||
| Symptom | Likely Diagnostic Code |
|
||||
|---------|----------------------|
|
||||
| PDF won't open, "encrypted" error | `ENCRYPTION_UNSUPPORTED` |
|
||||
| Text extraction incomplete or missing | `XREF_REPAIRED`, `OCR_*_UNSUPPORTED` |
|
||||
| Process hangs or runs very long | `STREAM_BOMB` |
|
||||
| "Path outside root" (MCP mode) | `MCP_PATH_TRAVERSAL`, `PATH_OUTSIDE_ROOT` |
|
||||
| Cache errors / corrupted entries | `CACHE_ENTRY_CORRUPT`, `CACHE_INTEGRITY_FAIL` |
|
||||
| Profile fails to load | `PROFILE_INVALID`, `PROFILE_SECRETS_FORBIDDEN` |
|
||||
| Remote URL fetch blocked | `URL_PRIVATE_NETWORK` |
|
||||
| Requested page doesn't exist | `PAGE_OUT_OF_RANGE` |
|
||||
| Text contains placeholder characters (⍰) | `GLYPH_UNMAPPED` |
|
||||
| Broken vector graphics not recovered | `BROKENVECTOR_OCR_UNAVAILABLE` |
|
||||
| JavaScript warning in output | `JAVASCRIPT_PRESENT` |
|
||||
| Circular reference warnings | `STRUCT_CIRCULAR_REF`, `STRUCT_XOBJECT_CYCLE` |
|
||||
| Stack overflow warnings | `GSTATE_STACK_OVERFLOW` |
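
When triaging a batch of extractions, it can help to group the reported codes by severity before consulting the table above. The following is a minimal sketch only: the JSON shape it assumes (a top-level `diagnostics` list whose entries carry `code` and `severity` keys) is an illustration, not a documented schema, so check the Diagnostics Reference for the real field names.

```python
import json
from collections import defaultdict


def group_by_severity(extraction_json: str) -> dict[str, list[str]]:
    """Group diagnostic codes by severity (field names are assumed)."""
    payload = json.loads(extraction_json)
    grouped: dict[str, list[str]] = defaultdict(list)
    for diag in payload.get("diagnostics", []):
        severity = diag.get("severity", "unknown")
        grouped[severity].append(diag.get("code", "UNKNOWN"))
    return dict(grouped)


# Hypothetical sample payload, for illustration only
sample = json.dumps({
    "diagnostics": [
        {"code": "XREF_REPAIRED", "severity": "info"},
        {"code": "GLYPH_UNMAPPED", "severity": "warn"},
        {"code": "STREAM_BOMB", "severity": "error"},
    ]
})
print(group_by_severity(sample))
```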
|
||||
|
||||
---
|
||||
|
||||
## XREF_REPAIRED warning
|
||||
|
||||
**What it means**: pdftract found the PDF's cross-reference table was corrupt and ran the forward-scan fallback (Phase 1.3) to recover.
|
||||
|
||||
**Cause**: PDF created or transmitted with truncation or corruption. The `startxref` offset points outside the file, or the xref table is malformed.
|
||||
|
||||
**Fix**: Usually no action needed; extraction succeeds with the recovered xref. Output may be incomplete on truncated files. If extraction fails, the PDF is unsalvageable.
|
||||
|
||||
**Severity**: info (extraction continues)
|
||||
|
||||
---
|
||||
|
||||
## STREAM_BOMB error
|
||||
|
||||
**What it means**: A compressed stream exceeded the decompression size limit (default: 512 MB).
|
||||
|
||||
**Cause**: A hostile PDF with a "compression bomb" — a small stream that expands to multi-GB size (e.g., 10 KB → 2 GB). This is a common security exploit pattern.
|
||||
|
||||
**Fix**:
|
||||
- If the PDF is **trusted**: Increase the limit with `--max-decompress-gb 2` (or higher)
|
||||
- If the PDF is **untrusted**: Treat as a hostile file; do not process
|
||||
|
||||
**Severity**: error (stream aborted; partial extraction returned)
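
To make the idea of a decompression limit concrete, here is a small sketch of bounded inflation using Python's standard `zlib` module. It is not pdftract's implementation; the `StreamBombError` and `bounded_inflate` names are hypothetical and exist only for this example.

```python
import zlib


class StreamBombError(Exception):
    """Raised when decompressed output would exceed the configured limit."""


def bounded_inflate(data: bytes, limit: int) -> bytes:
    """Inflate zlib data, refusing to produce more than `limit` bytes."""
    decomp = zlib.decompressobj()
    # Ask for one byte more than the limit so an overrun is detectable
    # without ever materializing the full expanded stream.
    out = decomp.decompress(data, limit + 1)
    if len(out) > limit:
        raise StreamBombError(f"stream expands past {limit} bytes")
    return out


# A tiny compressed payload that expands to 10 MB of zero bytes
bomb = zlib.compress(b"\0" * 10_000_000)
print(f"compressed size: {len(bomb)} bytes")
```

The point of the design is that the limit is enforced while decompressing, so memory use stays bounded even when the input is hostile.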
|
||||
|
||||
---
|
||||
|
||||
## ENCRYPTION_UNSUPPORTED fatal
|
||||
|
||||
**What it means**: The PDF is encrypted with an unsupported handler, or the required password is missing or wrong.
|
||||
|
||||
**Cause**:
|
||||
- PDF encrypted with an unknown handler (e.g., Adobe LiveCycle policy server)
|
||||
- PDF password-protected but no password (or wrong password) supplied
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Supply password via environment variable
|
||||
export PDFTRACT_PASSWORD="your-password"
|
||||
pdftract extract document.pdf
|
||||
|
||||
# Or via stdin
|
||||
echo "your-password" | pdftract extract --password-stdin document.pdf
|
||||
```
|
||||
|
||||
If the handler is unsupported (e.g., Adobe LiveCycle), decrypt the file with an Adobe-side tool first. If the password has been lost, a dedicated password recovery tool such as `pdfcrack` or `john` may help.
|
||||
|
||||
**Severity**: fatal (process exits with code 3)
|
||||
|
||||
---
|
||||
|
||||
## OCR_JBIG2_UNSUPPORTED / OCR_JPX_UNSUPPORTED / OCR_CCITT_UNSUPPORTED warning
|
||||
|
||||
**What it means**: A page contains an image that requires a decoder not available in the current build.
|
||||
|
||||
**Cause**:
|
||||
- `OCR_JBIG2_UNSUPPORTED`: JBIG2-encoded image (rare)
|
||||
- `OCR_JPX_UNSUPPORTED`: JPEG 2000-encoded image
|
||||
- `OCR_CCITT_UNSUPPORTED`: CCITT fax-encoded image
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Build with full-render feature (enables all decoders via PDFium)
|
||||
cargo build --release --features full-render
|
||||
|
||||
# Or install system libraries:
|
||||
# - JPX: install libopenjp2
|
||||
# - CCITT: install libtiff
|
||||
```
|
||||
|
||||
**Severity**: warn (page skipped from OCR; extraction continues)
|
||||
|
||||
---
|
||||
|
||||
## BROKENVECTOR_OCR_UNAVAILABLE warning
|
||||
|
||||
**What it means**: A page contains broken vector graphics that could be recovered via OCR, but the OCR feature is disabled.
|
||||
|
||||
**Cause**: Build was compiled without the `ocr` feature.
|
||||
|
||||
**Fix**: Rebuild with OCR enabled:
|
||||
```bash
|
||||
cargo build --release --features ocr
|
||||
```
|
||||
|
||||
**Severity**: warn (broken vector graphics not recovered; extraction continues)
|
||||
|
||||
---
|
||||
|
||||
## MCP_PATH_TRAVERSAL / PATH_OUTSIDE_ROOT error
|
||||
|
||||
**What it means**: (MCP mode) The requested path escapes the `--root` directory boundary.
|
||||
|
||||
**Cause**: A tool call attempted path traversal (e.g., `../../etc/passwd`).
|
||||
|
||||
**Fix**:
|
||||
- Adjust the requested path to stay within `--root`
|
||||
- Or restart the MCP server without `--root` restriction (not recommended for multi-tenant deployments)
|
||||
|
||||
**Severity**: error (request rejected)
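
The containment rule can be illustrated with a short sketch: resolve the requested path against the root, then check that the result is still inside it. This is an assumption about the general technique, not pdftract's actual check, and `is_within_root` is a hypothetical helper name.

```python
from pathlib import Path


def is_within_root(root: str, requested: str) -> bool:
    """Return True if `requested`, resolved against `root`, stays inside it."""
    root_path = Path(root).resolve()
    # Joining with an absolute path replaces the root entirely, and
    # resolve() collapses any ".." segments, so both escapes are caught.
    candidate = (root_path / requested).resolve()
    return candidate == root_path or root_path in candidate.parents


print(is_within_root("/srv/docs", "reports/q1.pdf"))
print(is_within_root("/srv/docs", "../../etc/passwd"))
```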
|
||||
|
||||
---
|
||||
|
||||
## URL_PRIVATE_NETWORK error
|
||||
|
||||
**What it means**: Remote fetch blocked because the URL targets a private network address.
|
||||
|
||||
**Cause**: URL targets localhost, private IP ranges (RFC 1918), or link-local addresses. This is an SSRF (Server-Side Request Forgery) protection.
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# If you trust the URL, allow private networks:
|
||||
pdftract extract --allow-private-networks https://internal-server/docs.pdf
|
||||
```
|
||||
|
||||
**Severity**: error (request rejected with HTTP 400 in serve mode)
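
For readers wondering which addresses count as private, the sketch below classifies a host with Python's standard `ipaddress` module. It only approximates the categories named above and is not pdftract's rule set; `is_private_target` is a hypothetical name, and a real SSRF check would also resolve hostnames before deciding.

```python
import ipaddress


def is_private_target(host: str) -> bool:
    """True for 'localhost' and for loopback, private, or link-local IP literals."""
    if host.lower() == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; this sketch does not perform DNS resolution.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local


for host in ("127.0.0.1", "10.0.0.5", "169.254.169.254", "8.8.8.8"):
    print(host, is_private_target(host))
```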
|
||||
|
||||
---
|
||||
|
||||
## CACHE_ENTRY_CORRUPT warning
|
||||
|
||||
**What it means**: A cache entry failed integrity verification.
|
||||
|
||||
**Cause**: Cache file corruption (disk error, concurrent write, etc.).
|
||||
|
||||
**Fix**: None needed; the entry is automatically deleted and extraction re-runs. If this recurs frequently, check the disk and filesystem for errors.
|
||||
|
||||
**Severity**: warn (entry deleted; extraction re-runs)
|
||||
|
||||
---
|
||||
|
||||
## CACHE_INTEGRITY_FAIL warning
|
||||
|
||||
**What it means**: A cache entry's HMAC verification failed, indicating potential cache poisoning.
|
||||
|
||||
**Cause**: Malicious co-tenant wrote a forged cache entry (multi-user cache scenarios), or disk corruption.
|
||||
|
||||
**Fix**: The entry is treated as a cache miss and extraction re-runs. In multi-user environments, ensure per-user cache directories or verify cache permissions.
|
||||
|
||||
**Severity**: warn (entry rejected; extraction re-runs)
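
As background on what an HMAC check protects against, here is a minimal sketch using Python's standard `hmac` module. The key handling, hash choice (SHA-256), and the `seal` and `verify` names are assumptions for illustration; the actual cache format is not described here.

```python
import hashlib
import hmac


def seal(key: bytes, payload: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over a cache payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Constant-time comparison; False means treat the entry as a cache miss."""
    return hmac.compare_digest(seal(key, payload), tag)


key = b"per-user-secret"            # hypothetical key
entry = b'{"pages": 3}'             # hypothetical cached payload
tag = seal(key, entry)

print(verify(key, entry, tag))               # untouched entry
print(verify(key, b'{"pages": 99}', tag))    # forged entry
```

Without the key, another user who can write to the cache directory cannot produce a matching tag, which is why a forged entry is rejected rather than trusted.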
|
||||
|
||||
---
|
||||
|
||||
## PROFILE_INVALID / PROFILE_SECRETS_FORBIDDEN error
|
||||
|
||||
**What it means**: Profile YAML failed validation.
|
||||
|
||||
**Cause**:
|
||||
- `PROFILE_INVALID`: YAML syntax error or schema violation
|
||||
- `PROFILE_SECRETS_FORBIDDEN`: Profile contains secret-keyword keys (`password:`, `token:`, `secret:`, `api_key:`)
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# For schema errors, check the YAML syntax:
|
||||
pdftract profile show --profile-path your-profile.yaml
|
||||
|
||||
# For secrets errors, remove secret keys from the profile.
|
||||
# Secrets should be passed via environment variables, not profiles.
|
||||
```
|
||||
|
||||
**Severity**: error (profile rejected)
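
Before loading a profile, you can scan it for the forbidden keys yourself. The sketch below walks an already-parsed profile (a nested dict or list) and reports offending key paths. It assumes exact, case-insensitive key matching, which may differ from pdftract's own rule; `find_secret_keys` is a hypothetical helper.

```python
FORBIDDEN_KEYS = {"password", "token", "secret", "api_key"}


def find_secret_keys(profile, path: str = "") -> list[str]:
    """Return dotted paths of keys whose names match a forbidden keyword."""
    hits: list[str] = []
    if isinstance(profile, dict):
        for key, value in profile.items():
            here = f"{path}.{key}" if path else str(key)
            if str(key).lower() in FORBIDDEN_KEYS:
                hits.append(here)
            hits.extend(find_secret_keys(value, here))
    elif isinstance(profile, list):
        for index, item in enumerate(profile):
            hits.extend(find_secret_keys(item, f"{path}[{index}]"))
    return hits


# Hypothetical profile contents, for illustration only
profile = {
    "ocr": {"language": ["eng"]},
    "remote": {"token": "abc123"},
    "password": "hunter2",
}
print(find_secret_keys(profile))
```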
|
||||
|
||||
---
|
||||
|
||||
## PAGE_OUT_OF_RANGE warning
|
||||
|
||||
**What it means**: The `--pages` argument exceeds the document's actual page count.
|
||||
|
||||
**Cause**: Page range specified (e.g., `--pages 1-100`) on a document with fewer pages (e.g., 10 pages).
|
||||
|
||||
**Fix**: Adjust the `--pages` argument to the actual page count:
|
||||
```bash
|
||||
# First, get the page count:
|
||||
pdftract inspect document.json | jq '.page_count'
|
||||
|
||||
# Then extract with a valid range:
|
||||
pdftract extract --pages 1-10 document.pdf
|
||||
```
|
||||
|
||||
**Severity**: warn (pages clamped to available range)
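
The clamping behaviour can be pictured with a small sketch that expands a range specification such as `"1-5,7,12-"` against a known page count. The parsing rules here (1-based pages, open-ended ranges, silent clamping) are assumptions inferred from the description above, and `resolve_pages` is a hypothetical function.

```python
def resolve_pages(spec: str, page_count: int) -> tuple[list[int], bool]:
    """Expand a 1-based range spec; also report whether anything was clamped."""
    pages: set[int] = set()
    clamped = False
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start = int(start_s) if start_s else 1
            end = int(end_s) if end_s else page_count  # open-ended, e.g. "12-"
        else:
            start = end = int(part)
        if start > page_count or end > page_count:
            clamped = True
        pages.update(range(max(start, 1), min(end, page_count) + 1))
    return sorted(pages), clamped


print(resolve_pages("1-100", 10))
print(resolve_pages("1-5,7,12-", 20))
```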
|
||||
|
||||
---
|
||||
|
||||
## GLYPH_UNMAPPED warning
|
||||
|
||||
**What it means**: A glyph could not be resolved by any of the four encoding levels.
|
||||
|
||||
**Cause**: Font encoding corruption, missing font embedding, or non-standard encoding.
|
||||
|
||||
**Fix**: Output contains the Unicode replacement character (⍰). No direct fix; consider re-saving the PDF through a normalizing tool (e.g., Adobe Acrobat, qpdf).
|
||||
|
||||
**Severity**: warn (character replaced with U+FFFD; extraction continues)
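
To judge how badly a document is affected, you can measure the share of U+FFFD characters in the extracted text. This is a generic sketch that works on any string; `unmapped_ratio` is a hypothetical helper, not part of the SDK.

```python
REPLACEMENT_CHAR = "\ufffd"


def unmapped_ratio(text: str) -> float:
    """Fraction of characters that are the Unicode replacement character."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


sample_text = "Inv\ufffdice t\ufffdtal"
print(f"{unmapped_ratio(sample_text):.2%} of characters are unmapped")
```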
|
||||
|
||||
---
|
||||
|
||||
## JAVASCRIPT_PRESENT info
|
||||
|
||||
**What it means**: PDF contains embedded JavaScript (in `/AA`, `/OpenAction`, or `/JS` entries).
|
||||
|
||||
**Cause**: PDF includes JavaScript actions (common in forms, interactive documents).
|
||||
|
||||
**Fix**: None needed for extraction — pdftract NEVER executes embedded JavaScript. JavaScript actions are surfaced in `metadata.javascript_actions[]` for downstream review.
|
||||
|
||||
**Severity**: info (JavaScript is not executed)
|
||||
|
||||
---
|
||||
|
||||
## STRUCT_CIRCULAR_REF / STRUCT_XOBJECT_CYCLE / GSTATE_STACK_OVERFLOW warning
|
||||
|
||||
**What it means**: PDF contains circular references or malformed content streams.
|
||||
|
||||
**Cause**:
|
||||
- `STRUCT_CIRCULAR_REF`: Indirect object reference cycle
|
||||
- `STRUCT_XOBJECT_CYCLE`: XObject (image/form) reference cycle
|
||||
- `GSTATE_STACK_OVERFLOW`: Graphics state stack exceeds depth limit
|
||||
|
||||
**Fix**: Usually no action needed — pdftract breaks cycles at the second visit (or depth 20 for XObjects). If output is incomplete, investigate the source PDF for a producer bug.
|
||||
|
||||
**Severity**: warn (cycle broken; extraction continues)
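
The cycle-breaking idea can be sketched as a depth-first walk that refuses to revisit a node already on the current path and stops at a depth limit. This illustrates the general technique only; the graph representation, the warning labels, and the `walk` function are assumptions, not pdftract internals.

```python
def walk(graph: dict, start, max_depth: int = 20):
    """Depth-first walk that breaks reference cycles and caps nesting depth."""
    visited_order = []
    warnings = []

    def visit(node, path: frozenset, depth: int) -> None:
        if node in path:
            warnings.append(("CYCLE", node))  # second visit on this path
            return
        if depth > max_depth:
            warnings.append(("DEPTH", node))  # nesting too deep
            return
        visited_order.append(node)
        for child in graph.get(node, []):
            visit(child, path | {node}, depth + 1)

    visit(start, frozenset(), 0)
    return visited_order, warnings


# Object A references B, B references C, and C points back to A
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(walk(cyclic, "A"))
```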
|
||||
|
||||
---
|
||||
|
||||
## REMOTE_FETCH_INTERRUPTED error
|
||||
|
||||
**What it means**: Remote fetch was interrupted (network timeout, connection reset, etc.).
|
||||
|
||||
**Cause**: Network connectivity issues, server timeout, or premature connection close.
|
||||
|
||||
**Fix**: Retry the request; check network connectivity:
|
||||
```bash
|
||||
# Retry with increased timeout:
|
||||
pdftract extract --timeout-seconds 120 https://example.com/document.pdf
|
||||
```
|
||||
|
||||
**Severity**: error (request aborted)
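
If interruptions are intermittent, wrapping the fetch in a retry loop with exponential backoff is a common remedy. The sketch below is generic Python and makes no assumptions about pdftract's API: `fetch` is any zero-argument callable you supply, and `with_retries` is a hypothetical helper that expects `attempts` to be at least 1.

```python
import time


def with_retries(fetch, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call `fetch`, retrying on connection errors with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < attempts - 1:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter is injectable so the backoff schedule can be tested without actually waiting.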
|
||||
|
||||
---
|
||||
|
||||
## REMOTE_NO_RANGE_SUPPORT warning
|
||||
|
||||
**What it means**: Remote server does not support HTTP Range requests.
|
||||
|
||||
**Cause**: The server omits the `Accept-Ranges: bytes` header, or answers a Range request with `200 OK` and the full body instead of `206 Partial Content`.
|
||||
|
||||
**Fix**: None needed — pdftract falls back to whole-file download. For large files, consider hosting on a Range-supporting server.
|
||||
|
||||
**Severity**: warn (fallback to whole-file download)
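
In HTTP, a server that honors a Range request answers with status 206 (Partial Content) and a `Content-Range` header, while one that ignores it answers 200 with the whole body. The sketch below applies that rule to a response's status and headers; `range_honored` is a hypothetical helper and says nothing about how pdftract itself probes servers.

```python
def range_honored(status: int, headers: dict[str, str]) -> bool:
    """Decide whether a response to a ranged GET actually honored the Range header."""
    # Header names are case-insensitive, so normalize before lookup.
    names = {name.lower() for name in headers}
    return status == 206 and "content-range" in names


print(range_honored(206, {"Content-Range": "bytes 0-1023/5000"}))
print(range_honored(200, {"Accept-Ranges": "bytes"}))
```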
|
||||
|
||||
---
|
||||
|
||||
## TAGGED_PDF_STRUCT_TREE_DEFERRED info
|
||||
|
||||
**What it means**: Tagged PDF structure tree extraction is deferred in this version.
|
||||
|
||||
**Cause**: Phase 7.1 (full structure tree extraction) is not yet implemented.
|
||||
|
||||
**Fix**: None needed — this is a temporary fallback. Structure tree extraction will be added in v1.0.0.
|
||||
|
||||
**Severity**: info (structure tree not extracted)
|
||||
|
||||
---
|
||||
|
||||
## Getting Help
|
||||
|
||||
If you encounter a diagnostic code not listed here, or the suggested fix doesn't resolve your issue:
|
||||
|
||||
1. **Check the [Diagnostics Reference](./troubleshooting/diagnostics.md)** for the full catalog
|
||||
2. **Search existing issues** on [GitHub](https://github.com/jedarden/pdftract/issues)
|
||||
3. **Open a new issue** with:
|
||||
- The diagnostic code(s)
|
||||
- A minimal reproducible example (PDF or command)
|
||||
- The `--debug` output if safe to share
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Diagnostics Reference](./troubleshooting/diagnostics.md) — Full diagnostic code catalog
|
||||
- [FAQ](./faq.md) — Common questions and answers
|
||||
- [Advanced: OCR Configuration](./advanced/ocr.md) — OCR troubleshooting details
|
||||