pdftract/docs/research/extraction-pipeline-overview.md
jedarden 04b60a1cf7 Add three research documents: CJK encoding, pipeline synthesis, linearization
- cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0
  composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1,
  Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via
  Adobe CID tables, full-width normalization, vertical text detection
- extraction-pipeline-overview: end-to-end 9-stage synthesis referencing
  all 36 research documents; stages: file open, metadata, page classification,
  content extraction (4 sub-paths), font pipeline, span assembly, normalization
  and quality, supplementary content, output serialization; ASCII data-flow
  diagram
- linearized-pdf-and-streaming: linearization dict keys, hint stream
  bitfield tables, first-page xref lazy parsing, HTTP range request pattern,
  staleness validation, incremental update interaction, NDJSON streaming,
  partial file extraction, lazy PageIter API with rayon par_bridge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:26:36 -04:00

pdftract Extraction Pipeline: End-to-End Architectural Overview

This document synthesizes the 36 specialized research documents in this directory into a coherent architectural blueprint for implementing the pdftract Rust PDF text extraction library. It describes the ordered sequence of stages, decision points, and data transformations that take a PDF file as input and produce readable, structured text as output. Engineers implementing pdftract should treat this as the canonical pipeline reference and consult the named component documents for deeper detail on each subsystem.


Pipeline Inputs and Outputs

Input. The pipeline accepts either a file path (opened via memory-mapped I/O for zero-copy reads) or an in-memory byte slice. All subsequent parsing operates on the raw bytes through a shared reference; no additional buffering is introduced at the entry point. Configuration is provided via an ExtractionOptions struct with fields including: ocr_enabled: bool, ocr_language: Vec<String>, extract_forms: bool, extract_annotations: bool, extract_attachments: bool, extract_images: bool, readability_threshold: f32, ocr_fallback_threshold: f32, include_invisible_text: bool, and streaming: bool.
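As a rough illustration, the configuration struct could be declared as below. The field names and types are taken from the list above; every default value is an assumption made for this sketch, not a documented pdftract default.

```rust
/// Sketch of the configuration struct described above.
/// Field names and types come from the text; all defaults are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractionOptions {
    pub ocr_enabled: bool,
    pub ocr_language: Vec<String>,
    pub extract_forms: bool,
    pub extract_annotations: bool,
    pub extract_attachments: bool,
    pub extract_images: bool,
    pub readability_threshold: f32,
    pub ocr_fallback_threshold: f32,
    pub include_invisible_text: bool,
    pub streaming: bool,
}

impl Default for ExtractionOptions {
    fn default() -> Self {
        Self {
            ocr_enabled: false,
            // "eng" is the usual Tesseract pack name for English (assumed default).
            ocr_language: vec!["eng".to_string()],
            extract_forms: false,
            extract_annotations: false,
            extract_attachments: false,
            extract_images: false,
            // Placeholder thresholds; the source does not state real values.
            readability_threshold: 0.5,
            ocr_fallback_threshold: 0.3,
            include_invisible_text: false,
            streaming: false,
        }
    }
}
```

A caller would then override only the fields it cares about, for example `ExtractionOptions { ocr_enabled: true, ..Default::default() }`.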

Output. The pipeline produces a structured JSON document (or NDJSON stream in streaming mode) with the following top-level shape:

{
  "metadata": { ... },          // document-level metadata and diagnostics
  "outline": [ ... ],           // bookmark tree
  "pages": [ ... ],             // per-page content
  "form_fields": [ ... ],       // AcroForm / XFA fields (if enabled)
  "annotations": [ ... ],       // page annotations (if enabled)
  "attachments": [ ... ],       // embedded files (if enabled)
  "warnings": [ ... ]           // extraction warnings across all stages
}

Each page entry carries blocks (containing spans with per-glyph Unicode and confidence), extraction_method, classification_signals, reading_order_algorithm, readability_score, and a page-level warnings array. The --text flag collapses all block content to plain text separated by \n\n. Exit codes follow quality: 0 = clean, 1 = warnings present, 2 = errors or low-confidence pages below threshold.


Stage 1: File Opening and Structure Parsing

See: pdf-specification.md, malformed-pdf-repair-and-recovery.md, pdfa-compliance-and-extraction.md, pdf-encryption-and-security.md.

The pipeline opens the input via mmap and immediately checks the %PDF- header to confirm a valid PDF container, recording pdf_version in the output metadata. Parsing then works backward from the end of file to locate the startxref offset.
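The header check and the backward search for startxref can be sketched over a plain byte slice (the memory-mapping step is left out). The function names and the 1024-byte tail window are assumptions for illustration; the header check is strict and does not tolerate leading junk bytes.

```rust
/// Returns the version text after "%PDF-" (for example "1.7"), if the
/// slice starts with a PDF header.
pub fn pdf_version(bytes: &[u8]) -> Option<&str> {
    if !bytes.starts_with(b"%PDF-") {
        return None;
    }
    let rest = &bytes[5..];
    let end = rest
        .iter()
        .position(|b| !(b.is_ascii_digit() || *b == b'.'))
        .unwrap_or(rest.len());
    if end == 0 {
        return None;
    }
    std::str::from_utf8(&rest[..end]).ok()
}

/// Searches the tail of the file for the last "startxref" keyword and
/// parses the decimal byte offset that follows it.
pub fn find_startxref(bytes: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    // Only the last 1024 bytes are searched (assumed window size).
    let tail_start = bytes.len().saturating_sub(1024);
    let tail = &bytes[tail_start..];
    let pos = tail.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let digits: Vec<u8> = tail[pos + KEYWORD.len()..]
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```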

Encryption detection. The trailer dictionary is scanned for a /Encrypt entry. If present, the encryption handler is identified (standard password, certificate, or custom). ExtractionOptions may supply a password; if decryption fails or no password is provided, the pipeline returns an EncryptionError immediately. See pdf-encryption-and-security.md for the full handler decision tree.

Cross-reference resolution. The pipeline first attempts to parse the traditional xref table at the startxref offset. If that fails (common in repaired or linearized files), it falls back to xref streams (PDF 1.5+). If both fail, it falls back to a forward object scan — a full-file sequential pass that reconstructs the object map from obj / endobj markers. This scan is slower but handles severely malformed files. Recovered objects are flagged in warnings. The complete strategy is documented in malformed-pdf-repair-and-recovery.md.
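The last-resort forward object scan might look like the following simplified sketch. It records the byte offset of every "N G obj" header it finds, with later definitions overwriting earlier ones. It does not skip stream bodies or comments, which a real recovery pass would need to handle, and the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Rebuilds a map of (object number, generation) -> byte offset by
/// scanning the whole file for "N G obj" headers.
pub fn forward_object_scan(bytes: &[u8]) -> HashMap<(u32, u16), usize> {
    let mut map = HashMap::new();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"obj") {
            if let Some((num, generation, start)) = header_before(bytes, i) {
                // A later definition of the same object replaces the earlier one.
                map.insert((num, generation), start);
            }
            i += 3;
        } else {
            i += 1;
        }
    }
    map
}

/// Parses "<num> <generation> " immediately before the `obj` keyword.
fn header_before(bytes: &[u8], obj_pos: usize) -> Option<(u32, u16, usize)> {
    let (generation, p) = int_backward(bytes, obj_pos)?;
    let (num, start) = int_backward(bytes, p)?;
    Some((u32::try_from(num).ok()?, u16::try_from(generation).ok()?, start))
}

/// Reads whitespace and then decimal digits backward from `end`.
/// Returns the parsed value and the index of its first digit.
fn int_backward(bytes: &[u8], end: usize) -> Option<(u64, usize)> {
    let mut p = end;
    while p > 0 && bytes[p - 1].is_ascii_whitespace() {
        p -= 1;
    }
    if p == end {
        return None; // no separating whitespace, e.g. the "obj" inside "endobj"
    }
    let digits_end = p;
    while p > 0 && bytes[p - 1].is_ascii_digit() {
        p -= 1;
    }
    if p == digits_end {
        return None;
    }
    let text = std::str::from_utf8(&bytes[p..digits_end]).ok()?;
    Some((text.parse().ok()?, p))
}
```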

Document catalog and page tree. With a valid object map, the pipeline resolves the /Root entry to the document catalog. The page tree (/Pages subtree) is traversed once to build a flat index of page dictionaries with their inherited attributes (media box, resources, rotation), enabling O(1) lookup by page number for parallel access in Stage 4.
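A minimal sketch of the flattening step, assuming a simplified in-memory tree where each node carries only the attributes it sets itself. The type names are hypothetical, resources are omitted for brevity, and a real traversal would also need to guard against cycles in malformed files.

```rust
/// Attributes a page may inherit from its ancestors in the page tree.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct Inherited {
    pub media_box: Option<[f32; 4]>,
    pub rotate: Option<i32>,
}

/// Simplified page tree: interior /Pages nodes and leaf /Page nodes.
pub enum PageNode {
    Pages { attrs: Inherited, kids: Vec<PageNode> },
    Page { attrs: Inherited },
}

/// A value set on the node itself wins; otherwise the ancestor's value is used.
fn merge(own: Inherited, parent: Inherited) -> Inherited {
    Inherited {
        media_box: own.media_box.or(parent.media_box),
        rotate: own.rotate.or(parent.rotate),
    }
}

/// Depth-first walk that appends one resolved entry per leaf page, in
/// document order, so that `out[n]` is page n (0-based).
pub fn flatten(node: &PageNode, parent: Inherited, out: &mut Vec<Inherited>) {
    match node {
        PageNode::Pages { attrs, kids } => {
            let merged = merge(*attrs, parent);
            for kid in kids {
                flatten(kid, merged, out);
            }
        }
        PageNode::Page { attrs } => out.push(merge(*attrs, parent)),
    }
}
```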

PDF/A and tagging detection. The catalog's /Metadata XMP stream is decoded and inspected for pdfaid:conformance and pdfaid:part to record the conformance level. The /MarkInfo dictionary's /Marked flag records whether the document is tagged. Both influence downstream path selection. See pdfa-compliance-and-extraction.md.


Stage 2: Document-Level Metadata

See: xmp-and-document-metadata.md, pdf-specification.md.

Metadata extraction runs once before per-page work. The pipeline first attempts the XMP metadata stream from the catalog /Metadata key, parsing it as an RDF/XML document to extract standard Dublin Core and PDF namespace fields: title, author, creator, producer, creation date, modification date, keywords, and subject. If the XMP stream is absent or malformed, it falls back to the /Info dictionary, which carries the same fields in PDF string encoding.

When both sources exist, conflicts are resolved in favor of XMP for all fields where XMP provides a value — XMP is the authoritative source in PDF 1.4+ documents. The resolved values are written to metadata in the output.
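The precedence rule amounts to a per-field "XMP if present, else /Info" merge. The sketch below uses a hypothetical DocInfo struct with only three of the fields to keep it short.

```rust
/// Subset of the document metadata fields (hypothetical struct).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct DocInfo {
    pub title: Option<String>,
    pub author: Option<String>,
    pub producer: Option<String>,
}

/// Per-field merge: the XMP value wins when present, otherwise the
/// /Info dictionary value is used.
pub fn resolve_metadata(xmp: &DocInfo, info: &DocInfo) -> DocInfo {
    DocInfo {
        title: xmp.title.clone().or_else(|| info.title.clone()),
        author: xmp.author.clone().or_else(|| info.author.clone()),
        producer: xmp.producer.clone().or_else(|| info.producer.clone()),
    }
}
```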

The pipeline also extracts the document outline (bookmarks) by walking the /Outlines tree, recording title, destination, and nesting level for each entry. Page labels from the /PageLabels number tree are extracted and stored in metadata.page_labels, enabling human-readable page numbering in output.


Stage 3: Per-Page Classification

See: scanned-vs-vector-page-classification.md, pdfa-compliance-and-extraction.md, raster-ocr-pipeline.md.

Before any expensive extraction work, each page is classified to select the optimal extraction path. Classification runs a sequence of fast pre-checks on the page content stream and resource dictionary:

  1. No text operators. If the content stream contains no text-showing operators (Tj, TJ, ', ") and no text-positioning operators (TD, Tm), the page is initially flagged as Scanned.
  2. Full-page Tr=3 + image. If all text operators set rendering mode 3 (invisible) and a full-page image XObject covers the media box, the page is classified as BrokenVector (a PDF/A OCR layer pattern where real text is hidden beneath a scan). See invisible-and-hidden-text.md.
  3. Image coverage fraction. The pipeline computes the fraction of the page media box area covered by raster image XObjects. Coverage above a configurable threshold (default 0.85) is a strong scanned signal.
  4. Character validity rate. Text operators are parsed and character codes are passed through a quick validity check (ToUnicode CMap lookup + AGL probe). A validity rate below a threshold (default 0.4) indicates a broken or symbolic font encoding, yielding BrokenVector.
  5. High-density valid text. Pages with validity rate above 0.85 and no significant image coverage are classified as Vector.

The result is one of four PageClass values — Vector, Scanned, Hybrid, BrokenVector — each with an associated confidence score. Classification signals are recorded in the page output for diagnostics.
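The pre-checks can be read as an ordered decision function. The sketch below uses the default thresholds quoted above. Two points are assumptions, since the text does not spell them out: "no significant image coverage" is read as coverage at or below the 0.85 threshold, and pages matching no rule fall through to Hybrid. The confidence score is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageClass {
    Vector,
    Scanned,
    Hybrid,
    BrokenVector,
}

/// Raw signals gathered by the pre-checks (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct Signals {
    pub has_text_operators: bool,
    pub all_text_invisible: bool, // every text operator uses Tr = 3
    pub full_page_image: bool,    // an image XObject covers the media box
    pub image_coverage: f32,      // fraction of the media box, 0.0 to 1.0
    pub validity_rate: f32,       // fraction of valid character codes
}

const IMAGE_COVERAGE_SCANNED: f32 = 0.85;
const VALIDITY_BROKEN: f32 = 0.4;
const VALIDITY_VECTOR: f32 = 0.85;

pub fn classify(s: &Signals) -> PageClass {
    if !s.has_text_operators {
        return PageClass::Scanned;
    }
    if s.all_text_invisible && s.full_page_image {
        return PageClass::BrokenVector;
    }
    if s.validity_rate < VALIDITY_BROKEN {
        return PageClass::BrokenVector;
    }
    if s.validity_rate > VALIDITY_VECTOR && s.image_coverage <= IMAGE_COVERAGE_SCANNED {
        return PageClass::Vector;
    }
    // Assumption: anything left mixes usable text with heavy image content.
    PageClass::Hybrid
}
```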


Stage 4: Content Extraction (Per-Page, Parallelized)

See: content-stream-concatenation.md, graphics-state-tracking.md, raster-ocr-pipeline.md, word-boundary-reconstruction.md, type3-font-extraction.md, optional-content-groups.md.

Stage 4 is the core extraction stage and is parallelized across pages using rayon. Each page runs one of four sub-paths determined by its PageClass.

4a. Vector Path

Content streams are concatenated (handling /Length mismatches, flate-decoding, and multi-stream pages) per content-stream-concatenation.md. A PDF graphics state machine processes operators in order, maintaining a stack of GraphicsState structs that track the current transformation matrix (CTM), text matrix (Tm), text line matrix (Tlm), font, font size, character spacing, word spacing, horizontal scaling, and text rise. See graphics-state-tracking.md.
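A minimal sketch of the state stack, covering only the q (save), Q (restore), and cm (concatenate matrix) operators and a few of the tracked fields. The type and method names are hypothetical. Matrices use the PDF row-vector convention, so cm pre-multiplies the current CTM.

```rust
/// PDF affine matrix [a b c d e f], row-vector convention.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Matrix {
    pub a: f32, pub b: f32, pub c: f32, pub d: f32, pub e: f32, pub f: f32,
}

impl Matrix {
    pub const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Returns the matrix that applies `self` first and `next` second.
    pub fn then(self, next: Matrix) -> Matrix {
        Matrix {
            a: self.a * next.a + self.b * next.c,
            b: self.a * next.b + self.b * next.d,
            c: self.c * next.a + self.d * next.c,
            d: self.c * next.b + self.d * next.d,
            e: self.e * next.a + self.f * next.c + next.e,
            f: self.e * next.b + self.f * next.d + next.f,
        }
    }

    /// Transforms a point.
    pub fn apply(self, x: f32, y: f32) -> (f32, f32) {
        (self.a * x + self.c * y + self.e, self.b * x + self.d * y + self.f)
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct GraphicsState {
    pub ctm: Matrix,
    pub font_size: f32,
    pub char_spacing: f32,
    pub word_spacing: f32,
}

pub struct Interpreter {
    pub current: GraphicsState,
    stack: Vec<GraphicsState>,
}

impl Interpreter {
    pub fn new() -> Self {
        let current = GraphicsState {
            ctm: Matrix::IDENTITY,
            font_size: 0.0,
            char_spacing: 0.0,
            word_spacing: 0.0,
        };
        Interpreter { current, stack: Vec::new() }
    }

    /// `q`: push a copy of the current state.
    pub fn save(&mut self) {
        self.stack.push(self.current);
    }

    /// `Q`: pop the saved state; an unbalanced `Q` is ignored here.
    pub fn restore(&mut self) {
        if let Some(saved) = self.stack.pop() {
            self.current = saved;
        }
    }

    /// `cm`: the new matrix is applied before the existing CTM.
    pub fn concat(&mut self, m: Matrix) {
        self.current.ctm = m.then(self.current.ctm);
    }
}
```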

For each glyph, the text matrix is combined with the CTM to produce a device-space bounding box. Character codes are passed to the font pipeline (Stage 5) for Unicode resolution. Inter-glyph gaps are measured in glyph-space units normalized by the current font size; gaps exceeding the word-boundary threshold produce synthetic space characters. See word-boundary-reconstruction.md. Optional content group state (/OC entries) is tracked to suppress content from hidden layers. See optional-content-groups.md.
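Word-boundary reconstruction can be illustrated with a single left-to-right line of glyphs. The Glyph struct and the threshold parameter are hypothetical. The gap between one glyph's right edge and the next glyph's left edge is divided by the font size, and a space is emitted when that ratio exceeds the threshold.

```rust
/// One positioned glyph on a horizontal, left-to-right line (hypothetical).
pub struct Glyph {
    pub ch: char,
    pub x: f32,         // left edge
    pub width: f32,     // advance width, same units as `x`
    pub font_size: f32,
}

/// Emits the glyph text, inserting a synthetic space wherever the gap to
/// the previous glyph exceeds `threshold_em` (a fraction of the font size).
pub fn insert_spaces(glyphs: &[Glyph], threshold_em: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Glyph> = None;
    for g in glyphs {
        if let Some(p) = prev {
            let gap = g.x - (p.x + p.width);
            if p.font_size > 0.0 && gap / p.font_size > threshold_em {
                out.push(' ');
            }
        }
        out.push(g.ch);
        prev = Some(g);
    }
    out
}
```

With a 10-unit font and a threshold of 0.25, a 4-unit gap (0.4 em) produces a space while abutting glyphs do not.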

4b. OCR Path

The page is rendered to a 300 DPI raster using a PDF renderer. The raster undergoes preprocessing: deskew via Hough line detection, binarization via Sauvola local thresholding, and optional denoising. Tesseract is invoked with the language pack(s) specified in ExtractionOptions.ocr_language. The hOCR output is parsed into glyph-level spans with bounding boxes and confidence scores. See raster-ocr-pipeline.md for the full preprocessing and Tesseract integration.

4c. Hybrid Path

Vector regions and image regions are identified by comparing text operator bounding boxes and image XObject placements. Regions where vector text is present use sub-path (a); regions covered by raster images with no overlapping vector text use sub-path (b). Spans from both sub-paths are merged by page coordinate order into a unified span list.
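One way to merge the two span lists "by page coordinate order" is sketched below. The ordering rule is an assumption: spans are bucketed into lines by their y coordinate, lines run top to bottom (PDF y grows upward), and spans within a line run left to right. Bucketing keeps the comparator a consistent total order.

```rust
/// A positioned span from either sub-path (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub x: f32,
    pub y: f32, // baseline in PDF user space (y grows upward)
    pub from_ocr: bool,
}

/// Concatenates both lists and sorts by (line bucket descending, x ascending).
/// `line_height` sets the bucket size and must be greater than zero.
pub fn merge_spans(vector: Vec<Span>, ocr: Vec<Span>, line_height: f32) -> Vec<Span> {
    let mut all = vector;
    all.extend(ocr);
    let bucket = |s: &Span| (s.y / line_height).round() as i64;
    all.sort_by(|a, b| bucket(b).cmp(&bucket(a)).then(a.x.total_cmp(&b.x)));
    all
}
```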

4d. Assisted OCR (BrokenVector)

Sub-path (a) is run first in position-hint mode: glyph bounding boxes are collected but Unicode values are discarded. These bounding boxes seed Tesseract's segmentation, improving word boundary detection. The OCR output then resolves the actual characters. Conflicts between position hints and OCR word boundaries are resolved in favor of OCR character shapes.


Stage 5: Font Pipeline

See: pdf-fonts-and-encoding.md, cmap-format-and-cid-encoding.md, glyph-recognition-and-unicode-recovery.md, type3-font-extraction.md.

For every character code encountered in the Vector path, the font pipeline resolves a Unicode scalar value through a prioritized fallback chain:

  1. ToUnicode CMap. If the font dictionary carries a /ToUnicode stream, the CMap is parsed and the character code is looked up. If the result is a non-sentinel value (not U+FFFD, not empty), it is used and unicode_source is set to "to_unicode". See cmap-format-and-cid-encoding.md.
  2. Encoding vector + AGL. If ToUnicode is absent or returns a sentinel, the font's encoding vector maps the character code to a glyph name. The Adobe Glyph List resolves the glyph name to a Unicode code point. unicode_source = "agl". See pdf-fonts-and-encoding.md.
  3. Font fingerprint cache. A precomputed database of known font program checksums maps directly to per-glyph Unicode tables. If the font program hash matches a database entry, the precomputed mapping is used. unicode_source = "fingerprint".
  4. Glyph shape recognition. The glyph is rendered to a small bitmap and hashed. If the shape hash matches an entry in the glyph recognition database, the Unicode value is assigned. unicode_source = "shape_match". See glyph-recognition-and-unicode-recovery.md.
  5. Failure. If all four steps fail, U+FFFD is emitted and confidence is set to 0.0.
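The fallback chain above can be sketched as an ordered list of lookup steps, where the first step returning a non-sentinel character wins. The Step struct, the closure-based lookups, and the per-step confidence values are assumptions for illustration. Real steps would consult the font's CMap, encoding vector, and the two databases.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnicodeSource {
    ToUnicode,
    Agl,
    Fingerprint,
    ShapeMatch,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Resolved {
    pub codepoint: char,
    pub source: UnicodeSource,
    pub confidence: f32,
}

/// One step of the chain: a lookup plus the source label and the
/// confidence to report on success (confidence values are assumed).
pub struct Step<'a> {
    pub source: UnicodeSource,
    pub confidence: f32,
    pub lookup: &'a dyn Fn(u32) -> Option<char>,
}

/// Tries each step in order; U+FFFD results are treated as sentinels
/// and cause a fall-through to the next step.
pub fn resolve_glyph(code: u32, chain: &[Step<'_>]) -> Resolved {
    for step in chain {
        if let Some(ch) = (step.lookup)(code) {
            if ch != '\u{FFFD}' {
                return Resolved { codepoint: ch, source: step.source, confidence: step.confidence };
            }
        }
    }
    Resolved { codepoint: '\u{FFFD}', source: UnicodeSource::Failed, confidence: 0.0 }
}
```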

Type 3 fonts, which define glyph shapes as content stream fragments, are handled specially: each glyph's content stream is rasterized and passed to the shape recognition step. See type3-font-extraction.md.

Each glyph in the output carries codepoint, unicode_source, and confidence.


Stage 6: Span and Block Assembly

See: complex-layout-reading-order.md, tagged-pdf-structure-and-reading-order.md, document-classification-and-zone-labeling.md, watermark-and-background-separation.md, invisible-and-hidden-text.md.

Raw glyphs are grouped into spans by continuity of font, font size, color (fill and stroke), and rendering mode. A new span begins whenever any of these attributes changes, or when a word boundary gap is detected.
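A small sketch of the grouping rule, assuming each glyph already carries its style attributes and a flag set by word-boundary detection. All type and field names are hypothetical.

```rust
/// Attributes whose change starts a new span.
#[derive(Debug, Clone, PartialEq)]
pub struct Style {
    pub font: String,
    pub size: f32,
    pub fill_rgb: [u8; 3],
    pub stroke_rgb: [u8; 3],
    pub render_mode: u8,
}

pub struct StyledGlyph {
    pub ch: char,
    pub style: Style,
    /// True when a word-boundary gap was detected before this glyph.
    pub starts_word: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub style: Style,
}

/// Appends each glyph to the current span while the style is unchanged
/// and no word boundary intervenes; otherwise opens a new span.
pub fn group_spans(glyphs: &[StyledGlyph]) -> Vec<Span> {
    let mut spans: Vec<Span> = Vec::new();
    for g in glyphs {
        let continues = spans
            .last()
            .map_or(false, |last| last.style == g.style && !g.starts_word);
        if continues {
            if let Some(last) = spans.last_mut() {
                last.text.push(g.ch);
            }
        } else {
            spans.push(Span { text: g.ch.to_string(), style: g.style.clone() });
        }
    }
    spans
}
```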

Reading order. If the document is tagged (/MarkInfo /Marked true) or declares PDF/A conformance level A (for example PDF/A-1a), the structure tree (/StructTreeRoot) is traversed to derive reading order, and reading_order_algorithm is set to "struct_tree". For untagged documents, the pipeline applies XY-cut decomposition (for rectilinear layouts) or Docstrum (for documents with irregular column boundaries). See complex-layout-reading-order.md and tagged-pdf-structure-and-reading-order.md.

Zone labeling. After reading order is established, spans are assigned to document zones: body, heading, header, footer, footnote, caption, or sidebar. Zone assignment uses margin heuristics (vertical position relative to media box), font size clustering (headings are statistical outliers in the size distribution), and cross-page consistency (running headers/footers appear at similar positions across pages). See document-classification-and-zone-labeling.md.

Watermark and invisible text filtering. Spans in rendering mode 3 (invisible) are suppressed unless ExtractionOptions.include_invisible_text is true. Spans classified as watermarks (low opacity, Z-order beneath body text, or matching common watermark patterns) are filtered per policy. See watermark-and-background-separation.md and invisible-and-hidden-text.md.

Spans are assembled into blocks representing paragraphs or other logical units, and blocks are ordered within each page according to the reading order algorithm's output.


Stage 7: Text Normalization and Quality

See: post-extraction-normalization.md, post-ocr-text-correction.md, text-readability-validation.md, semantic-text-reconstruction.md, language-detection-and-script-handling.md.

Normalization runs as an ordered pipeline applied to each span's text:

  1. Ligature expansion. Standard ligature code points (U+FB01 ﬁ, U+FB02 ﬂ, U+FB03 ﬃ, U+FB04 ﬄ, U+FB05 ﬅ, U+FB06 ﬆ) are expanded to their component characters.
  2. Unicode normalization. All text is normalized to NFC.
  3. Whitespace collapse. Runs of whitespace within a span are collapsed to a single space; leading and trailing whitespace is stripped.
  4. Hyphen joining. Lines ending in a hyphen are joined to the next line's first word, with the hyphen removed, if the joined form appears in a language dictionary.
  5. Paragraph reconstruction. Short lines that do not end with sentence-terminal punctuation are joined to the following line when their right edge falls significantly short of the text block width. See semantic-text-reconstruction.md.
  6. Header/footer deduplication. Spans in the header and footer zones that appear with identical or near-identical text across three or more consecutive pages are flagged as deduplicated and excluded from the main text flow. They remain in the output under their zone label for reference.
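Steps 1 and 3 of the list above need only the standard library and can be sketched as follows. NFC normalization (step 2) would need a Unicode normalization library and is not shown. The function names are hypothetical, and the ligature set mirrors the six listed in step 1.

```rust
/// Step 1: replaces ligature code points with their component letters.
pub fn expand_ligatures(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            // Long s + t ligature: keeps the long s (U+017F) as a component.
            '\u{FB05}' => out.push_str("\u{017F}t"),
            '\u{FB06}' => out.push_str("st"),
            other => out.push(other),
        }
    }
    out
}

/// Step 3: collapses whitespace runs to one space and trims both ends.
pub fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```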

Readability scoring. Each span is scored on three signals: Shannon entropy of the character distribution, dictionary hit rate against a word list for the detected language, and character validity rate (fraction of non-U+FFFD codepoints). The composite readability_score per block (0.0 to 1.0) is written to the output. Blocks scoring below ExtractionOptions.ocr_fallback_threshold trigger an OCR fallback for that region on vector pages, re-running the block through sub-path (b) of Stage 4. See text-readability-validation.md.
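The three signals can each be computed with the standard library, as in the sketch below. How they are weighted into the composite score is not specified here, so no combining function is shown. The function names and the word-trimming rule are assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Shannon entropy of the character distribution, in bits per character.
pub fn shannon_entropy(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for ch in text.chars() {
        *counts.entry(ch).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Fraction of characters that are not U+FFFD.
pub fn validity_rate(text: &str) -> f64 {
    let total = text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = text.chars().filter(|&c| c != '\u{FFFD}').count();
    valid as f64 / total as f64
}

/// Fraction of words found in `dictionary` after lowercasing and
/// trimming leading and trailing punctuation.
pub fn dictionary_hit_rate(text: &str, dictionary: &HashSet<&str>) -> f64 {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words.iter().filter(|w| dictionary.contains(w.as_str())).count();
    hits as f64 / words.len() as f64
}
```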

Post-OCR correction. For spans produced by the OCR path, a correction pass applies: confusable character substitution (0↔O, 1↔l, rn↔m), regex-based pattern correction (dates, identifiers), and bigram/trigram context correction using a language model. See post-ocr-text-correction.md.

Language detection runs on the assembled block text to confirm or override the per-page language hint. The detected language is used to select the appropriate dictionary and Tesseract language pack for any OCR fallback runs. See language-detection-and-script-handling.md.


Stage 8: Supplementary Content

See: form-fields-and-annotations.md, embedded-files-and-portfolios.md, image-and-figure-extraction.md.

Supplementary extraction runs after all pages complete, guarded by the relevant ExtractionOptions flags.

Forms. If extract_forms is true, the AcroForm dictionary is located in the catalog. Each field in the /Fields array is walked recursively. Field type (Tx, Btn, Ch, Sig), name, value, and appearance state are extracted. If an /XFA stream is present, it is parsed as XFA XML and field values are extracted from the XFA data model. See form-fields-and-annotations.md.

Annotations. If extract_annotations is true, each page's /Annots array is iterated. For text and link annotations, Contents and RC (rich content) fields are extracted. Annotation type, rectangle, and flags are recorded. Redaction annotations (/Redact) are noted in warnings.

Attachments. If extract_attachments is true, the /EmbeddedFiles name tree in the catalog is walked. Each Filespec dictionary yields a filename, description, MIME type, creation date, and the raw file bytes (or a size-limited excerpt if the attachment is large). See embedded-files-and-portfolios.md.

Images. If extract_images is true, image XObjects referenced from each page's resource dictionary are collected. Metadata (width, height, color space, bits per component, filter chain) is always included. Pixel data is decoded and included as base64 only if ExtractionOptions.include_image_data is true. See image-and-figure-extraction.md.


Stage 9: Output Serialization

See: performance-and-streaming-architecture.md, chunking-for-llm-consumption.md.

The final stage assembles all collected data and serializes it.

Buffered JSON mode (default). The complete document tree is serialized to a single JSON object. Field ordering follows the schema defined in the Pipeline Inputs and Outputs section above. serde_json with BufWriter is used; the output is written to stdout or a specified file path.

Streaming NDJSON mode (ExtractionOptions.streaming = true). Metadata is emitted as the first JSON line. Each page is serialized and emitted as a JSON line immediately after it completes extraction, allowing consumers to begin processing before the full document is done. This mode is documented in performance-and-streaming-architecture.md and is designed to support the LLM consumption patterns described in chunking-for-llm-consumption.md.
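The NDJSON framing itself is simple: one JSON value per line, metadata first, then pages as they complete. The sketch below assumes each record has already been serialized to a single-line JSON string (compact serializer output contains no raw newlines). The function name is hypothetical.

```rust
use std::io::{self, Write};

/// Writes the metadata record, then one line per page as pages arrive.
/// Flushing after each page lets consumers start before extraction ends.
pub fn write_ndjson<W, I>(out: &mut W, metadata_json: &str, pages: I) -> io::Result<()>
where
    W: Write,
    I: IntoIterator<Item = String>,
{
    writeln!(out, "{}", metadata_json)?;
    out.flush()?;
    for page_json in pages {
        writeln!(out, "{}", page_json)?;
        out.flush()?;
    }
    Ok(())
}
```

Passing a lazy iterator (or a channel receiver's iterator) as `pages` means each line is written as soon as that page is available.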

Each page object in both modes carries:

  • page_number (1-based)
  • extraction_method: one of "vector", "ocr", "hybrid", "assisted_ocr"
  • classification_signals: the raw signals from Stage 3 (image coverage fraction, character validity rate, operator counts)
  • reading_order_algorithm: "struct_tree", "xy_cut", or "docstrum"
  • readability_score: composite 0.0 to 1.0 for the page
  • blocks: ordered array of text blocks with spans
  • warnings: page-level warning array

Exit code semantics. After all pages are processed, the pipeline computes the worst-case quality across pages. If every page has a readability score above the clean threshold, exit code 0 is returned. If any page emits warnings (OCR fallback triggered, low-confidence spans, unsupported features), exit code 1 is returned. If any page fails extraction entirely or contains errors, exit code 2 is returned. This allows shell pipelines and CI systems to gate on extraction quality without parsing the output JSON.
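A worst-case reduction over pages could look like the sketch below. The PageOutcome struct is hypothetical. Treating a page below the threshold as exit code 2 follows the summary in the Pipeline Inputs and Outputs section and is an interpretation, not a confirmed rule.

```rust
/// Per-page result summary (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct PageOutcome {
    pub failed: bool,
    pub warning_count: usize,
    pub readability_score: f32,
}

/// Returns 2 if any page failed or scored below `threshold`,
/// 1 if any page carries warnings, and 0 otherwise.
pub fn exit_code(pages: &[PageOutcome], threshold: f32) -> i32 {
    if pages.iter().any(|p| p.failed || p.readability_score < threshold) {
        return 2;
    }
    if pages.iter().any(|p| p.warning_count > 0) {
        return 1;
    }
    0
}
```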


Summary: Stage Ordering and Data Flow

Input (file path / bytes)
  │
  ▼
Stage 1: File opening, xref, decryption, page tree index
  │
  ▼
Stage 2: Document metadata (XMP, /Info, outline, page labels)
  │
  ▼
Stage 3: Per-page classification → PageClass × confidence
  │
  ▼
Stage 4: Content extraction (rayon parallelism across pages)
  ├─ Vector  → graphics state machine → raw glyphs
  ├─ OCR     → raster render → Tesseract → raw spans
  ├─ Hybrid  → Vector regions + OCR regions → merged spans
  └─ BrokenVector → position hints + OCR → spans
       │
       ▼ (from Vector path)
Stage 5: Font pipeline → Unicode + confidence per glyph
  │
  ▼
Stage 6: Span + block assembly → reading order → zone labels
  │
  ▼
Stage 7: Normalization → readability scoring → OCR fallback → correction
  │
  ▼
Stage 8: Forms, annotations, attachments, images (conditional)
  │
  ▼
Stage 9: JSON / NDJSON serialization → exit code

Each stage boundary is a well-defined data contract. Stages 1 and 2 produce document-scoped structures shared across all pages. Stage 3 produces per-page PageClass values that gate Stage 4 sub-path selection. Stages 4 through 7 are the per-page pipeline and are the primary targets for parallelism and optimization. Stages 8 and 9 are sequential post-processing passes over the fully assembled extraction result.