docs(pdftract-5nare): add comprehensive FAQ with 24 questions

Added docs/user-docs/src/faq.md with 24 FAQ entries covering:

- General questions (what is pdftract, extract vs extract_text, JS execution)
- Installation and setup (proxy, system requirements)
- Usage (broken_vector, OCR speed, page ranges, images, batch processing)
- Configuration (custom profiles, OCR accuracy, confidence scores)
- Output formats (Markdown, tables, metadata, passwords)
- Troubleshooting (errors, empty output, debugging, memory usage)

Each answer is 1-3 paragraphs with cross-links to fuller docs. mdBook builds successfully.

Acceptance criteria:

- PASS: docs/user-docs/src/faq.md exists
- PASS: 24 questions covered (target: 15-25)
- PASS: Each answer is 1-3 paragraphs
- PASS: Cross-links work
- PASS: mdBook renders cleanly

Closes: pdftract-5nare
2026-05-25 00:22:48 -04:00 · 2026-05-25 00:22:48 -04:00 · 2ccdaecda1
commit 2ccdaecda1
parent 3fa783f628
1 changed files with 452 additions and 2 deletions
--- a/docs/user-docs/src/faq.md
+++ b/docs/user-docs/src/faq.md
@@ -1,5 +1,455 @@
 # FAQ

-> **Draft** — This page is a placeholder for future content.
-
 Frequently asked questions about pdftract.
+
+## Table of Contents
+
+- [General](#general)
+  - [What is pdftract?](#what-is-pdftract)
+  - [What's the difference between extract and extract_text?](#whats-the-difference-between-extract-and-extract_text)
+  - [Does pdftract execute JavaScript embedded in PDFs?](#does-pdftract-execute-javascript-embedded-in-pdfs)
+  - [How do I cite an extracted snippet?](#how-do-i-cite-an-extracted-snippet)
+- [Installation and Setup](#installation-and-setup)
+  - [How do I install pdftract?](#how-do-i-install-pdftract)
+  - [How do I run pdftract behind a corporate proxy?](#how-do-i-run-pdftract-behind-a-corporate-proxy)
+  - [What are the system requirements?](#what-are-the-system-requirements)
+- [Usage](#usage)
+  - [Why is my PDF returning broken_vector?](#why-is-my-pdf-returning-broken_vector)
+  - [Why is OCR slow?](#why-is-ocr-slow)
+  - [How do I extract text from a specific page range?](#how-do-i-extract-text-from-a-specific-page-range)
+  - [How do I extract images from a PDF?](#how-do-i-extract-images-from-a-pdf)
+  - [Can I process multiple PDFs at once?](#can-i-process-multiple-pdfs-at-once)
+- [Configuration](#configuration)
+  - [How do I add a custom profile?](#how-do-i-add-a-custom-profile)
+  - [How do I adjust OCR accuracy?](#how-do-i-adjust-ocr-accuracy)
+  - [How do I disable OCR for faster processing?](#how-do-i-disable-ocr-for-faster-processing)
+  - [What are confidence scores and how do I use them?](#what-are-confidence-scores-and-how-do-i-use-them)
+- [Output and Formats](#output-and-formats)
+  - [How do I get output in Markdown format?](#how-do-i-get-output-in-markdown-format)
+  - [How do I preserve table structure?](#how-do-i-preserve-table-structure)
+  - [Can I extract metadata from PDFs?](#can-i-extract-metadata-from-pdfs)
+  - [How do I handle password-protected PDFs?](#how-do-i-handle-password-protected-pdfs)
+- [Troubleshooting](#troubleshooting)
+  - [Why is extraction failing with an error?](#why-is-extraction-failing-with-an-error)
+  - [Why is my output empty or incomplete?](#why-is-my-output-empty-or-incomplete)
+  - [How do I debug extraction issues?](#how-do-i-debug-extraction-issues)
+  - [Why does extraction use so much memory?](#why-does-extraction-use-so-much-memory)
+
+---
+
+## General
+
+### What is pdftract?
+
+pdftract is a command-line tool and library for extracting text, structure, and content from PDF files. It combines vector text extraction with OCR fallback to handle both well-formed and problematic PDFs. pdftract is written in Rust and provides Python bindings for programmatic use.
+
+See the [Introduction](introduction.md) for a complete overview.
+
+### What's the difference between extract and extract_text?
+
+- **`extract`**: The primary command that produces structured JSON output with blocks, spans, metadata, and provenance information. Use this when you need the full extraction with layout, reading order, and confidence scores.
+
+- **`extract_text`**: A simplified command that outputs plain text only. Use this for quick text extraction when you don't need the structured JSON output.
+
+Example:
+```bash
+# Full structured extraction
+pdftract extract document.pdf -o output.json
+
+# Plain text only
+pdftract extract_text document.pdf -o output.txt
+```
+
+### Does pdftract execute JavaScript embedded in PDFs?
+
+**No.** pdftract never executes JavaScript embedded in PDFs. Scripts are detected during parsing for security analysis but are never run, so a malicious PDF cannot exploit JavaScript vulnerabilities through pdftract.
+
+If you need to analyze JavaScript in PDFs, pdftract can detect and report its presence, but execution must be done separately with appropriate sandboxing.
+
+### How do I cite an extracted snippet?
+
+The JSON output from `pdftract extract` includes provenance information for each text block:
+
+```json
+{
+  "blocks": [{
+    "spans": [{
+      "text": "Example snippet",
+      "bbox": [100.0, 200.0, 250.0, 215.0],
+      "page": 3,
+      "confidence": 0.98
+    }]
+  }],
+  "metadata": {
+    "path": "/path/to/document.pdf",
+    "fingerprint": "sha256:abc123...",
+    "extracted_at": "2026-05-25T12:00:00Z"
+  }
+}
+```
+
+For academic citations, include:
+- Document path and fingerprint
+- Page number (from the `page` field)
+- Extraction timestamp
+- The pdftract version used
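The fields listed above can be assembled into a citation string in post-processing. The following is a minimal Python sketch that assumes the JSON layout shown in this answer. `format_citation` is a hypothetical helper, not part of pdftract, and the tool version is passed in by the caller because the sample output does not include it.

```python
import json


def format_citation(doc, block_idx, span_idx, tool_version):
    """Build a one-line citation for a single span of extract output."""
    span = doc["blocks"][block_idx]["spans"][span_idx]
    meta = doc["metadata"]
    return (
        f'"{span["text"]}" ({meta["path"]}, {meta["fingerprint"]}), '
        f'p. {span["page"]}; extracted {meta["extracted_at"]} '
        f'with pdftract {tool_version}'
    )


# Sample data mirroring the JSON shown above.
sample = json.loads("""
{
  "blocks": [{
    "spans": [{
      "text": "Example snippet",
      "bbox": [100.0, 200.0, 250.0, 215.0],
      "page": 3,
      "confidence": 0.98
    }]
  }],
  "metadata": {
    "path": "/path/to/document.pdf",
    "fingerprint": "sha256:abc123...",
    "extracted_at": "2026-05-25T12:00:00Z"
  }
}
""")

# "0.0.0" is a placeholder version string supplied by the caller.
print(format_citation(sample, 0, 0, "0.0.0"))
```

The citation layout itself is only one possible convention; adapt the ordering and punctuation to the style guide you follow.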
+
+---
+
+## Installation and Setup
+
+### How do I install pdftract?
+
+See the [Installation](installation.md) guide for complete instructions. Quick summary:
+
+**With cargo (Rust toolchain):**
+```bash
+cargo install pdftract
+```
+
+**With pip (Python bindings):**
+```bash
+pip install pdftract
+```
+
+**Pre-built binaries:** Download from the [releases page](https://github.com/your-org/pdftract/releases).
+
+### How do I run pdftract behind a corporate proxy?
+
+pdftract doesn't have built-in proxy support, but you can use the HTTP serve mode with a reverse proxy:
+
+1. Start pdftract in serve mode:
+```bash
+pdftract serve --port 8080
+```
+
+2. Configure your reverse proxy (nginx, Apache, etc.) to handle authentication and SSL termination.
+
+3. Access pdftract through your proxy endpoint.
+
+See [Advanced Topics: HTTP Serve](../operations/serve-deployment.md) for deployment guidance.
+
+### What are the system requirements?
+
+- **OS**: Linux, macOS, or Windows
+- **Rust**: 1.70+ (if building from source)
+- **Python**: 3.8+ (for Python bindings)
+- **OCR (optional)**: Tesseract 4.0+ for OCR fallback
+- **Memory**: 512 MB minimum for typical PDFs; more for large documents
+
+---
+
+## Usage
+
+### Why is my PDF returning broken_vector?
+
+The `broken_vector` classification means the PDF's text layer is unreliable or missing. Common causes:
+
+- **Invisible text overlay**: Text with rendering mode 3 (invisible) overlaid on a raster image
+- **Missing ToUnicode CMap**: Font lacks character-to-Unicode mapping
+- **Encoding corruption**: Character encodings don't match the actual glyphs
+
+**Solution**: pdftract automatically routes `broken_vector` pages to the OCR pipeline (Phase 5.5). If you see `broken_vector` without OCR output, check that OCR is enabled:
+
+```bash
+# Verify OCR is available
+pdftract doctor tesseract-langs
+
+# Enable OCR explicitly if needed
+pdftract extract document.pdf --enable-ocr
+```
+
+See [Troubleshooting: Broken Vector](troubleshooting/common-issues.md) for more details.
+
+### Why is OCR slow?
+
+OCR performance depends on several factors:
+
+- **Image resolution**: Higher DPI images take longer to process
+- **Tesseract version**: Version 4.0+ is significantly faster than 3.x
+- **Language data**: Additional language packs increase processing time
+- **Hardware**: CPU-bound; more cores help with batch processing
+
+**To speed up OCR:**
+```bash
+# Reduce DPI (trade-off: accuracy)
+pdftract extract document.pdf --ocr-dpi 200
+
+# Use fewer languages
+pdftract extract document.pdf --ocr-lang eng
+
+# Disable OCR for vector-only PDFs
+pdftract extract document.pdf --disable-ocr
+```
+
+### How do I extract text from a specific page range?
+
+Use the `--pages` flag:
+
+```bash
+# Single page
+pdftract extract document.pdf --pages 5
+
+# Range
+pdftract extract document.pdf --pages 1-10
+
+# Multiple ranges
+pdftract extract document.pdf --pages 1-5,10,15-20
+
+# All pages from page 5 onward
+pdftract extract document.pdf --pages 5-
+```
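To make the range syntax concrete, here is a small Python sketch that expands a `--pages` style spec into page numbers. It illustrates the semantics under stated assumptions (1-based, inclusive ranges, with an open-ended `5-` running to the last page, and ranges clamped to the page count). `parse_pages` is a hypothetical function and is not pdftract's actual parser.

```python
from typing import List


def parse_pages(spec: str, page_count: int) -> List[int]:
    """Expand a spec such as "1-5,10,15-" into sorted, unique page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start = int(start_s)
            # An empty end ("5-") means "through the last page".
            end = int(end_s) if end_s else page_count
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length (an assumption of this sketch).
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


print(parse_pages("1-3,10,15-", page_count=17))
```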
+
+### How do I extract images from a PDF?
+
+pdftract automatically detects and records image XObjects during content stream processing. The output JSON includes image metadata:
+
+```json
+{
+  "images": [{
+    "bbox": [100.0, 200.0, 400.0, 500.0],
+    "xobject_ref": "5 0 R",
+    "name": "Im1"
+  }]
+}
+```
+
+For actual image extraction, use the `serve` mode with the `/images` endpoint or write a custom script using the Python SDK.
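The image metadata can be inspected without extracting pixel data. The Python sketch below reads the `images` array and derives each image's width and height on the page. It assumes `bbox` is ordered `[x0, y0, x1, y1]`, which matches the sample values but is not confirmed here, and `image_sizes` is a hypothetical helper name.

```python
import json


def image_sizes(doc):
    """Return (name, width, height) for each entry in the "images" array."""
    sizes = []
    for image in doc.get("images", []):
        x0, y0, x1, y1 = image["bbox"]  # assumed corner order
        sizes.append((image["name"], x1 - x0, y1 - y0))
    return sizes


# Sample data mirroring the JSON shown above.
sample = json.loads("""
{
  "images": [{
    "bbox": [100.0, 200.0, 400.0, 500.0],
    "xobject_ref": "5 0 R",
    "name": "Im1"
  }]
}
""")

for name, width, height in image_sizes(sample):
    print(name, width, height)
```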
+
+### Can I process multiple PDFs at once?
+
+Yes. Loop over the files in a shell script, or use GNU parallel to run several extractions concurrently:
+
+```bash
+# Process all PDFs in a directory
+for file in *.pdf; do
+    pdftract extract "$file" -o "output/$(basename "$file" .pdf).json"
+done
+
+# With parallel processing (GNU parallel)
+parallel -j 4 pdftract extract {} -o output/{/.}.json ::: *.pdf
+```
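The same batch job can be driven from Python when you want per-file exit codes. This is a sketch that shells out to the CLI with only the `extract` and `-o` arguments shown above; it does not use the Python bindings, and `build_command` and `run_batch` are hypothetical helper names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def build_command(pdf: Path, out_dir: Path) -> List[str]:
    """Command line for one file: report.pdf -> <out_dir>/report.json."""
    return ["pdftract", "extract", str(pdf), "-o", str(out_dir / f"{pdf.stem}.json")]


def run_batch(in_dir: Path, out_dir: Path, workers: int = 4) -> Dict[str, int]:
    """Run pdftract on every PDF in in_dir; map file name to exit code."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(in_dir.glob("*.pdf"))

    def run_one(pdf: Path) -> int:
        return subprocess.run(build_command(pdf, out_dir)).returncode

    # Threads are sufficient here: the work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip((p.name for p in pdfs), pool.map(run_one, pdfs)))


# Show the command that would run for one file (does not invoke pdftract).
print(build_command(Path("report.pdf"), Path("output")))

# To process a directory: run_batch(Path("."), Path("output"))
```

A non-zero exit code in the returned mapping identifies which files need a closer look.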
+
+---
+
+## Configuration
+
+### How do I add a custom profile?
+
+Create a YAML file defining your profile:
+
+```yaml
+# custom-profile.yaml
+name: my_custom
+description: "Custom extraction profile"
+
+extraction:
+  preserve_tables: true
+  preserve_columns: true
+  ocr_fallback: true
+
+output:
+  format: json
+  include_provenance: true
+  confidence_threshold: 0.7
+```
+
+Then use it:
+```bash
+pdftract extract document.pdf --profile custom-profile.yaml
+```
+
+See [Custom Profiles](profiles/custom.md) for complete documentation.
+
+### How do I adjust OCR accuracy?
+
+Adjust Tesseract parameters via environment variables or the OCR configuration:
+
+```bash
+# Set the OCR engine mode and page segmentation mode via environment variables
+export TESSERACT_OEM=1  # LSTM only
+export TESSERACT_PSM=6  # Assume a single uniform block of text
+
+# Or set the page segmentation mode for a single invocation
+pdftract extract document.pdf --tesseract-psm 6
+```
+
+Higher accuracy settings may slow down processing. See [OCR Configuration](advanced/ocr.md) for details.
+
+### How do I disable OCR for faster processing?
+
+If you know your PDFs have reliable text layers:
+
+```bash
+pdftract extract document.pdf --disable-ocr
+```
+
+Or set a confidence threshold to skip low-confidence text:
+
+```bash
+pdftract extract document.pdf --min-confidence 0.9
+```
+
+### What are confidence scores and how do I use them?
+
+Each text span has a `confidence` score (0.0 to 1.0):
+
+- **1.0**: High confidence (ToUnicode CMap lookup succeeded)
+- **0.3**: Medium confidence (encoding + AGL fallback)
+- **0.0**: No confidence (PositionHint mode or failed resolution)
+
+Filter by confidence:
+```bash
+pdftract extract document.pdf --min-confidence 0.5
+```
+
+Or filter in post-processing using jq:
+```bash
+pdftract extract document.pdf | jq '.blocks[].spans[] | select(.confidence > 0.5)'
+```
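The jq filter above has a direct Python equivalent for scripts that already load the JSON. This sketch assumes the `blocks` and `spans` layout shown elsewhere in this FAQ; `spans_above` is a hypothetical helper, and the sample spans use the three confidence levels listed above.

```python
def spans_above(doc, threshold):
    """Return every span whose confidence is strictly above the threshold."""
    return [
        span
        for block in doc.get("blocks", [])
        for span in block.get("spans", [])
        if span.get("confidence", 0.0) > threshold
    ]


# Illustrative data, not real pdftract output.
doc = {
    "blocks": [{
        "spans": [
            {"text": "mapped", "confidence": 1.0},
            {"text": "fallback", "confidence": 0.3},
            {"text": "unresolved", "confidence": 0.0},
        ]
    }]
}

for span in spans_above(doc, 0.5):
    print(span["text"], span["confidence"])
```

Spans without a `confidence` field are treated as 0.0 here, which is a choice made for this sketch.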
+
+---
+
+## Output and Formats
+
+### How do I get output in Markdown format?
+
+Use the `--format` flag:
+
+```bash
+pdftract extract document.pdf --format markdown -o output.md
+```
+
+The Markdown output preserves headings, lists, tables, and code blocks where detected.
+
+### How do I preserve table structure?
+
+pdftract includes table detection (Phase 4.2). Ensure table preservation is enabled:
+
+```bash
+pdftract extract document.pdf --preserve-tables
+```
+
+Tables are output with structured cell information:
+```json
+{
+  "type": "table",
+  "rows": 3,
+  "columns": 4,
+  "cells": [...]
+}
+```
+
+### Can I extract metadata from PDFs?
+
+Yes, metadata is automatically extracted and included in the output:
+
+```json
+{
+  "metadata": {
+    "title": "Document Title",
+    "author": "Author Name",
+    "subject": "Subject",
+    "keywords": ["keyword1", "keyword2"],
+    "creator": "Application",
+    "producer": "PDF Producer",
+    "creation_date": "2026-01-01T00:00:00Z",
+    "modified_date": "2026-05-25T12:00:00Z"
+  }
+}
+```
+
+### How do I handle password-protected PDFs?
+
+Provide the password via the `--password` flag:
+
+```bash
+pdftract extract document.pdf --password secret123
+```
+
+For security, avoid passing passwords on the command line in production, where they are visible to other users in process listings. Use environment variables or a config file:
+
+```bash
+export PDFTRACT_PASSWORD=secret123
+pdftract extract document.pdf
+```
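When pdftract is launched from a script, the password can be scoped to the child process instead of being exported in the shell. The Python sketch below assumes only the `PDFTRACT_PASSWORD` variable shown above; `env_with_password` and `extract_protected` are hypothetical helper names.

```python
import os
import subprocess
from typing import Dict, Optional


def env_with_password(password: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and add PDFTRACT_PASSWORD without mutating the original."""
    env = dict(os.environ if base is None else base)
    env["PDFTRACT_PASSWORD"] = password
    return env


def extract_protected(pdf_path: str, password: str) -> int:
    """Run pdftract with the password visible only to the child process."""
    result = subprocess.run(
        ["pdftract", "extract", pdf_path],
        env=env_with_password(password),
    )
    return result.returncode


# Interactive use, prompting without echoing the password:
#   import getpass
#   extract_protected("document.pdf", getpass.getpass("PDF password: "))
```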
+
+---
+
+## Troubleshooting
+
+### Why is extraction failing with an error?
+
+Check the error message and consult the [Troubleshooting Guide](troubleshooting/README.md). Common issues:
+
+- **Encrypted PDFs**: Use `--password` to decrypt
+- **Corrupted PDFs**: pdftract attempts recovery; check diagnostics
+- **Missing dependencies**: Verify Tesseract and language packs are installed
+
+Run diagnostics:
+```bash
+pdftract doctor
+```
+
+### Why is my output empty or incomplete?
+
+Possible causes:
+
+1. **No text layer**: PDF may be image-only. Enable OCR.
+2. **Encoding issues**: Check diagnostics for `FONT_GLYPH_UNMAPPED` warnings
+3. **Page range issue**: Verify your `--pages` argument
+4. **Confidence filter**: Lower `--min-confidence` if set too high
+
+Check diagnostics output:
+```bash
+pdftract extract document.pdf --verbose
+```
+
+### How do I debug extraction issues?
+
+Enable verbose output and diagnostics:
+
+```bash
+# Full diagnostic output
+pdftract extract document.pdf --verbose --diagnostics
+
+# Save diagnostics for analysis
+pdftract extract document.pdf --diagnostics -o diagnostics.json
+```
+
+Common diagnostic codes:
+- `FONT_GLYPH_UNMAPPED`: Glyph couldn't be mapped to Unicode
+- `STREAM_DECODE_ERROR`: Stream decompression failed
+- `STRUCT_INVALID_TYPE`: Unexpected object type
+
+See [Diagnostics Reference](troubleshooting/diagnostics.md) for a complete list.
+
+### Why does extraction use so much memory?
+
+Memory usage depends on:
+
+- **PDF size**: Larger PDFs with many images use more memory
+- **OCR**: Tesseract loads image data into memory
+- **Output buffering**: Large JSON outputs are buffered in memory
+
+**To reduce memory usage:**
+```bash
+# Process page-by-page (replace 100 with the document's page count)
+for page in {1..100}; do
+    pdftract extract document.pdf --pages "$page" -o "page-$page.json"
+done
+
+# Disable OCR if not needed
+pdftract extract document.pdf --disable-ocr
+
+# Stream output (if supported)
+pdftract extract document.pdf --stream-output
+```
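Processing one page per invocation can be slow for long documents, so a middle ground is to work in fixed-size page chunks. The Python sketch below generates `--pages` arguments and the matching command lines. It assumes the caller already knows the page count, and `page_chunks` and `chunk_commands` are hypothetical helper names.

```python
from typing import Iterator, List


def page_chunks(page_count: int, chunk_size: int) -> Iterator[str]:
    """Yield --pages arguments such as "1-10", "11-20", "21-25"."""
    for start in range(1, page_count + 1, chunk_size):
        end = min(start + chunk_size - 1, page_count)
        yield str(start) if start == end else f"{start}-{end}"


def chunk_commands(pdf: str, page_count: int, chunk_size: int) -> List[List[str]]:
    """Build one pdftract command line per chunk, each with its own output file."""
    return [
        ["pdftract", "extract", pdf, "--pages", pages, "-o", f"pages-{pages}.json"]
        for pages in page_chunks(page_count, chunk_size)
    ]


# Print the commands for a 25-page document in chunks of 10 pages.
for command in chunk_commands("document.pdf", page_count=25, chunk_size=10):
    print(" ".join(command))
```

Smaller chunks lower peak memory at the cost of more process start-ups; the right size depends on the document.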
+
+---
+
+## Still have questions?
+
+- Check the [Troubleshooting Guide](troubleshooting/README.md)
+- Review the [CLI Reference](cli/README.md)
+- Open an issue on [GitHub](https://github.com/your-org/pdftract/issues)