jedarden 2ccdaecda1 docs(pdftract-5nare): add comprehensive FAQ with 24 questions

Added docs/user-docs/src/faq.md with 24 FAQ entries covering:
- General questions (what is pdftract, extract vs extract_text, JS execution)
- Installation and setup (proxy, system requirements)
- Usage (broken_vector, OCR speed, page ranges, images, batch processing)
- Configuration (custom profiles, OCR accuracy, confidence scores)
- Output formats (Markdown, tables, metadata, passwords)
- Troubleshooting (errors, empty output, debugging, memory usage)

Each answer is 1-3 paragraphs with cross-links to fuller docs.
mdBook builds successfully.

Acceptance criteria:
- PASS: docs/user-docs/src/faq.md exists
- PASS: 24 questions covered (target: 15-25)
- PASS: Each answer is 1-3 paragraphs
- PASS: Cross-links work
- PASS: mdBook renders cleanly

Closes: pdftract-5nare

2026-05-25 00:22:48 -04:00

FAQ

Frequently asked questions about pdftract.

General
Installation and Setup
Usage
Configuration
Output and Formats
Troubleshooting

General

What is pdftract?

pdftract is a command-line tool and library for extracting text, structure, and content from PDF files. It combines vector text extraction with OCR fallback to handle both well-formed and problematic PDFs. pdftract is written in Rust and provides Python bindings for programmatic use.

See the Introduction for a complete overview.

What's the difference between extract and extract_text?

extract: The primary command that produces structured JSON output with blocks, spans, metadata, and provenance information. Use this when you need the full extraction with layout, reading order, and confidence scores.
extract_text: A simplified command that outputs plain text only. Use this for quick text extraction when you don't need the structured JSON output.

Example:

# Full structured extraction
pdftract extract document.pdf -o output.json

# Plain text only
pdftract extract_text document.pdf -o output.txt

Does pdftract execute JavaScript embedded in PDFs?

No. pdftract detects embedded JavaScript during parsing so that it can be reported for security analysis, but it never executes it. This design prevents malicious PDFs from exploiting JavaScript vulnerabilities.

If you need to analyze JavaScript in PDFs, pdftract can detect and report its presence, but execution must be done separately with appropriate sandboxing.

How do I cite an extracted snippet?

The JSON output from pdftract extract includes provenance information for each text block:

{
  "blocks": [{
    "spans": [{
      "text": "Example snippet",
      "bbox": [100.0, 200.0, 250.0, 215.0],
      "page": 3,
      "confidence": 0.98
    }]
  }],
  "metadata": {
    "path": "/path/to/document.pdf",
    "fingerprint": "sha256:abc123...",
    "extracted_at": "2026-05-25T12:00:00Z"
  }
}

For academic citations, include:

Document path and fingerprint
Page number (from the page field)
Extraction timestamp
The pdftract version used
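
As a minimal sketch, the fields listed above can be assembled into a citation string with a few lines of Python. The helper name `format_citation`, the citation layout, and the `X.Y.Z` version placeholder are illustrative assumptions; only the JSON field names come from the sample output shown earlier.

```python
import json


def format_citation(doc, block_index, span_index, tool_version):
    """Build a citation string for one span of the extract JSON output.

    tool_version is supplied by the caller; the sample JSON above does not
    show a version field, so none is assumed here.
    """
    span = doc["blocks"][block_index]["spans"][span_index]
    meta = doc["metadata"]
    return (
        f'"{span["text"]}", {meta["path"]}, p. {span["page"]} '
        f'({meta["fingerprint"]}; extracted {meta["extracted_at"]}; '
        f'pdftract {tool_version})'
    )


# Sample document copied from the JSON shown in this answer.
SAMPLE = json.loads("""
{
  "blocks": [{"spans": [{"text": "Example snippet",
                         "bbox": [100.0, 200.0, 250.0, 215.0],
                         "page": 3, "confidence": 0.98}]}],
  "metadata": {"path": "/path/to/document.pdf",
               "fingerprint": "sha256:abc123...",
               "extracted_at": "2026-05-25T12:00:00Z"}
}
""")

citation = format_citation(SAMPLE, 0, 0, "X.Y.Z")  # "X.Y.Z" is a placeholder
print(citation)
```

Replace the placeholder with the version of pdftract that produced the output.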

Installation and Setup

How do I install pdftract?

See the Installation guide for complete instructions. Quick summary:

With cargo (Rust toolchain):

cargo install pdftract

With pip (Python bindings):

pip install pdftract

Pre-built binaries: Download from the releases page.

How do I run pdftract behind a corporate proxy?

pdftract doesn't have built-in proxy support. To make it available inside a corporate network, run the HTTP serve mode behind a reverse proxy:

Start pdftract in serve mode:

pdftract serve --port 8080

Configure your reverse proxy (nginx, Apache, etc.) to handle authentication and SSL termination.
Access pdftract through your proxy endpoint.

See Advanced Topics: HTTP Serve for deployment guidance.

What are the system requirements?

OS: Linux, macOS, or Windows
Rust: 1.70+ (if building from source)
Python: 3.8+ (for Python bindings)
OCR (optional): Tesseract 4.0+ for OCR fallback
Memory: 512 MB minimum for typical PDFs; more for large documents

Usage

Why is my PDF returning broken_vector?

The broken_vector classification means the PDF's text layer is unreliable or missing. Common causes:

Invisible text overlay: Text with rendering mode 3 (invisible) overlaid on a raster image
Missing ToUnicode CMap: Font lacks character-to-Unicode mapping
Encoding corruption: Character encodings don't match the actual glyphs

Solution: pdftract automatically routes broken_vector pages to the OCR pipeline (Phase 5.5). If you see broken_vector without OCR output, check that OCR is enabled:

# Verify OCR is available
pdftract doctor tesseract-langs

# Enable OCR explicitly if needed
pdftract extract document.pdf --enable-ocr

See Troubleshooting: Broken Vector for more details.

Why is OCR slow?

OCR performance depends on several factors:

Image resolution: Higher DPI images take longer to process
Tesseract version: Version 4.0+ is significantly faster than 3.x
Language data: Additional language packs increase processing time
Hardware: CPU-bound; more cores help with batch processing

To speed up OCR:

# Reduce DPI (trade-off: accuracy)
pdftract extract document.pdf --ocr-dpi 200

# Use fewer languages
pdftract extract document.pdf --ocr-lang eng

# Disable OCR for vector-only PDFs
pdftract extract document.pdf --disable-ocr

How do I extract text from a specific page range?

Use the --pages flag:

# Single page
pdftract extract document.pdf --pages 5

# Range
pdftract extract document.pdf --pages 1-10

# Multiple ranges
pdftract extract document.pdf --pages 1-5,10,15-20

# All pages from page 5 onward
pdftract extract document.pdf --pages 5-
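
To see how a page specification like the ones above expands, here is a small Python sketch. It is not pdftract's parser: the function name `expand_pages` and the assumptions that pages are 1-based, that ranges are inclusive, and that an open-ended range such as `5-` runs to the last page are illustrative readings of the examples.

```python
from typing import List


def expand_pages(spec: str, page_count: int) -> List[int]:
    """Expand a spec such as "1-5,10,15-20" or "5-" into sorted page numbers.

    Assumes 1-based, inclusive ranges; an open end means "through the last page".
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            end = int(end_text) if end_text else page_count
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length so "5-" and oversized ranges stay valid.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


print(expand_pages("1-5,10,15-20", 30))
print(expand_pages("5-", 8))
```

A helper like this can be handy for checking a `--pages` argument before starting a long batch run.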

How do I extract images from a PDF?

pdftract automatically detects and records image XObjects during content stream processing. The output JSON includes image metadata:

{
  "images": [{
    "bbox": [100.0, 200.0, 400.0, 500.0],
    "xobject_ref": "5 0 R",
    "name": "Im1"
  }]
}

For actual image extraction, use the serve mode with the /images endpoint or write a custom script using the Python SDK.
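
The image metadata can be inspected without extracting the images themselves. The sketch below computes each image's width and height from its `bbox`. It assumes the box is ordered `[x0, y0, x1, y1]`, which the sample suggests but this FAQ does not state; the helper name `image_sizes` is hypothetical.

```python
import json


def image_sizes(doc):
    """Map each image name to (width, height), assuming bbox = [x0, y0, x1, y1]."""
    sizes = {}
    for image in doc.get("images", []):
        x0, y0, x1, y1 = image["bbox"]
        sizes[image["name"]] = (abs(x1 - x0), abs(y1 - y0))
    return sizes


# Sample copied from the JSON shown in this answer.
SAMPLE_IMAGES = json.loads("""
{
  "images": [{"bbox": [100.0, 200.0, 400.0, 500.0],
              "xobject_ref": "5 0 R",
              "name": "Im1"}]
}
""")

print(image_sizes(SAMPLE_IMAGES))
```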

Can I process multiple PDFs at once?

Yes, use shell wildcards or write a batch script:

# Process all PDFs in a directory
for file in *.pdf; do
    pdftract extract "$file" -o "output/$(basename "$file" .pdf).json"
done

# With parallel processing (GNU parallel)
parallel -j 4 pdftract extract {} -o output/{/.}.json ::: *.pdf
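
If GNU parallel is not available, a comparable batch run can be scripted in Python. This sketch calls only the `pdftract extract <file> -o <output>` form shown above through `subprocess`; it does not use the Python bindings, and the helper names `build_command` and `run_batch` are illustrative.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(pdf: Path, out_dir: Path) -> list:
    """Return the CLI invocation for one PDF, writing <stem>.json into out_dir."""
    return ["pdftract", "extract", str(pdf), "-o", str(out_dir / (pdf.stem + ".json"))]


def run_batch(pdf_dir: Path, out_dir: Path, workers: int = 4) -> dict:
    """Run pdftract on every *.pdf in pdf_dir; return {file name: exit code}."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(pdf_dir.glob("*.pdf"))

    def run(pdf: Path) -> int:
        # Threads are enough here: the work happens in the child processes.
        return subprocess.run(build_command(pdf, out_dir)).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run, pdfs))
    return {pdf.name: code for pdf, code in zip(pdfs, codes)}
```

Calling `run_batch(Path("."), Path("output"))` processes the current directory; any non-zero exit code in the result marks a file worth re-running with diagnostics.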

Configuration

How do I add a custom profile?

Create a YAML file defining your profile:

# custom-profile.yaml
name: my_custom
description: "Custom extraction profile"

extraction:
  preserve_tables: true
  preserve_columns: true
  ocr_fallback: true

output:
  format: json
  include_provenance: true
  confidence_threshold: 0.7

Then use it:

pdftract extract document.pdf --profile custom-profile.yaml

See Custom Profiles for complete documentation.

How do I adjust OCR accuracy?

Adjust Tesseract parameters via environment variables or the OCR configuration:

# Set the OCR engine mode and page segmentation mode via environment variables
export TESSERACT_OEM=1  # LSTM only
export TESSERACT_PSM=6  # Assume a single uniform block of text

# Or set the page segmentation mode for a single run
pdftract extract document.pdf --tesseract-psm 6

Higher accuracy settings may slow down processing. See OCR Configuration for details.

How do I disable OCR for faster processing?

If you know your PDFs have reliable text layers:

pdftract extract document.pdf --disable-ocr

Or set a confidence threshold to skip low-confidence text:

pdftract extract document.pdf --min-confidence 0.9

What are confidence scores and how do I use them?

Each text span has a confidence score (0.0 to 1.0):

1.0: High confidence (ToUnicode CMap lookup succeeded)
0.3: Medium confidence (encoding + AGL fallback)
0.0: No confidence (PositionHint mode or failed resolution)

Filter by confidence:

pdftract extract document.pdf --min-confidence 0.5

Or filter in post-processing using jq:

pdftract extract document.pdf | jq '.blocks[].spans[] | select(.confidence >= 0.5)'
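
A comparable filter can be written in Python against a saved JSON output. This is a sketch that relies only on the `blocks`, `spans`, and `confidence` fields shown in this FAQ; the helper name `spans_at_least` is hypothetical, and treating the threshold as inclusive is an assumption.

```python
def spans_at_least(doc, threshold):
    """Return every span whose confidence is at or above threshold."""
    return [
        span
        for block in doc.get("blocks", [])
        for span in block.get("spans", [])
        # Spans without a confidence field are treated as 0.0 (an assumption).
        if span.get("confidence", 0.0) >= threshold
    ]


# Illustrative document using the three confidence levels described above.
SAMPLE_SPANS = {
    "blocks": [
        {"spans": [{"text": "mapped", "confidence": 1.0},
                   {"text": "fallback", "confidence": 0.3}]},
        {"spans": [{"text": "unresolved", "confidence": 0.0}]},
    ]
}

kept = [span["text"] for span in spans_at_least(SAMPLE_SPANS, 0.5)]
print(kept)
```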

Output and Formats

How do I get output in Markdown format?

Use the --format flag:

pdftract extract document.pdf --format markdown -o output.md

The Markdown output preserves headings, lists, tables, and code blocks where detected.

How do I preserve table structure?

pdftract includes table detection (Phase 4.2). Ensure table preservation is enabled:

pdftract extract document.pdf --preserve-tables

Tables are output with structured cell information:

{
  "type": "table",
  "rows": 3,
  "columns": 4,
  "cells": [...]
}

Can I extract metadata from PDFs?

Yes, metadata is automatically extracted and included in the output:

{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "subject": "Subject",
    "keywords": ["keyword1", "keyword2"],
    "creator": "Application",
    "producer": "PDF Producer",
    "creation_date": "2026-01-01T00:00:00Z",
    "modified_date": "2026-05-25T12:00:00Z"
  }
}

How do I handle password-protected PDFs?

Provide the password via the --password flag:

pdftract extract document.pdf --password secret123

For security, avoid passing passwords on the command line in production. Use environment variables or a config file:

export PDFTRACT_PASSWORD=secret123
pdftract extract document.pdf
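
When scripting, the password can be prompted for and passed through the environment so that it never appears in shell history or in the argument list. The sketch below uses the `PDFTRACT_PASSWORD` variable shown above; the helper names are hypothetical, and it calls the CLI through `subprocess` without assuming anything about the Python bindings.

```python
import getpass
import os
import subprocess


def env_with_password(password, base_env=None):
    """Return a copy of the environment with PDFTRACT_PASSWORD set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PDFTRACT_PASSWORD"] = password
    return env


def extract_protected(pdf_path):
    """Prompt for the password, then run pdftract with it in the child's environment."""
    password = getpass.getpass("PDF password: ")
    result = subprocess.run(
        ["pdftract", "extract", pdf_path],
        env=env_with_password(password),
    )
    return result.returncode
```

The variable is set only for the child process, so it does not linger in the parent shell the way an `export` does.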

Troubleshooting

Why is extraction failing with an error?

Check the error message and consult the Troubleshooting Guide. Common issues:

Encrypted PDFs: Use --password to decrypt
Corrupted PDFs: pdftract attempts recovery; check diagnostics
Missing dependencies: Verify Tesseract and language packs are installed

Run diagnostics:

pdftract doctor

Why is my output empty or incomplete?

Possible causes:

No text layer: PDF may be image-only. Enable OCR.
Encoding issues: Check diagnostics for FONT_GLYPH_UNMAPPED warnings
Page range issue: Verify your --pages argument
Confidence filter: Lower --min-confidence if set too high

Check diagnostics output:

pdftract extract document.pdf --verbose

How do I debug extraction issues?

Enable verbose output and diagnostics:

# Full diagnostic output
pdftract extract document.pdf --verbose --diagnostics

# Save diagnostics for analysis
pdftract extract document.pdf --diagnostics -o diagnostics.json

Common diagnostic codes:

FONT_GLYPH_UNMAPPED: Glyph couldn't be mapped to Unicode
STREAM_DECODE_ERROR: Stream decompression failed
STRUCT_INVALID_TYPE: Unexpected object type

See Diagnostics Reference for a complete list.

Why does extraction use so much memory?

Memory usage depends on:

PDF size: Larger PDFs with many images use more memory
OCR: Tesseract loads image data into memory
Output buffering: Large JSON outputs are buffered in memory

To reduce memory usage:

# Process page-by-page (adjust 100 to the document's page count)
for page in {1..100}; do
    pdftract extract document.pdf --pages "$page" -o "page-$page.json"
done

# Disable OCR if not needed
pdftract extract document.pdf --disable-ocr

# Stream output (if supported)
pdftract extract document.pdf --stream-output

Still have questions?

Check the Troubleshooting Guide
Review the CLI Reference
Open an issue on GitHub
