diff --git a/docs/research/cjk-and-asian-script-encoding.md b/docs/research/cjk-and-asian-script-encoding.md new file mode 100644 index 0000000..b2a68e3 --- /dev/null +++ b/docs/research/cjk-and-asian-script-encoding.md @@ -0,0 +1,262 @@ +# CJK and Asian Script Encoding in PDF + +CJK documents—Chinese, Japanese, Korean—are among the most common non-Latin PDFs in the wild. Their encoding pipelines differ fundamentally from Latin-script PDFs: multi-byte code spaces, large predefined CMaps, identity mappings, and character sets defined by government standards rather than Unicode consortiums. This document covers the full encoding stack that pdftract must understand to produce readable text from CJK sources. + +--- + +## 1. CJK Encoding Systems Overview + +CJK text in PDF derives from legacy national encoding standards that predate Unicode by decades. + +**Japanese** uses three encodings in PDF contexts: +- **Shift-JIS** (SJIS): variable-width 1–2 byte encoding. The dominant encoding for Japanese Windows software. Covers JIS X 0208 kanji plus hiragana, katakana, and half-width katakana. +- **EUC-JP**: Extended Unix Code, 2-byte encoding used on Unix systems, also covers JIS X 0208 with a simpler lead-byte scheme (0xA1–0xFE range). +- **JIS X 0208**: the underlying 94×94 character table that Shift-JIS and EUC-JP both reference; not itself a byte encoding but the source character set. + +**Chinese Simplified** uses: +- **GB2312** (1981): 6,763 characters in a 94×94 table, 2-byte EUC-style encoding (lead 0xA1–0xFE). +- **GBK** (1993): extends GB2312 to 20,902 characters; lead bytes 0x81–0xFE, trail 0x40–0xFE. +- **GB18030** (2000, mandatory): extends GBK with 4-byte sequences, covering all Unicode planes. + +**Chinese Traditional** uses: +- **Big5**: 2-byte encoding covering ~13,000 traditional characters, widely used in Taiwan and Hong Kong. +- **Big5-HKSCS**: Hong Kong government extension adding characters for Cantonese and Hong Kong-specific usage. 
+ +**Korean** uses: +- **EUC-KR**: 2-byte encoding based on KS X 1001 (formerly KSC 5601), covering ~2,350 Hangul syllables plus Hanja. +- **CP949** (Unified Hangul Code): Microsoft extension of EUC-KR covering all 11,172 modern Hangul syllables. + +**Unicode-based encodings in PDF**: `Identity-H` and `Identity-V` treat the 2-byte character code directly as a CID. In some modern CJK PDFs, generated by applications that use OpenType CFF fonts with Unicode `cmap` tables internally, that CID coincides with the Unicode codepoint; in others it is simply a glyph index, so the equivalence must not be assumed. + +Because CJK character sets have thousands to tens of thousands of codepoints, they cannot fit in a Type 1 or TrueType simple font (limited to 256 character codes). This is why **CJK PDFs overwhelmingly use Type 0 composite fonts**. A Type 0 font references a CIDFont (a font whose glyph space is indexed by character IDs rather than a 256-entry encoding vector) and a CMap that maps byte sequences to CIDs. + +--- + +## 2. Type 0 Composite Font Structure + +A Type 0 font dictionary contains: + +``` +/Type /Font +/Subtype /Type0 +/BaseFont /HeiseiKakuGo-W5 % or a subset tag + font name +/Encoding /90ms-RKSJ-H % a CMap name or inline stream +/DescendantFonts [<<...>>] % always an array of exactly one CIDFont dict +/ToUnicode stream % optional but critical for text extraction +``` + +The **CIDFont dictionary** (the single element in `DescendantFonts`) contains: + +- **`/CIDSystemInfo`**: dictionary with `/Registry` (e.g., `Adobe`), `/Ordering` (e.g., `Japan1`), `/Supplement` (integer). This identifies the character collection and its version. Key values: `Adobe/Japan1`, `Adobe/CNS1`, `Adobe/GB1`, `Adobe/Korea1`. +- **`/DW`**: default glyph width in glyph-space units (1/1000 of a text unit). Typically 1000 for full-width CJK glyphs. +- **`/W`**: width exceptions array. Format: `[startCID [w1 w2 ... wn]]` or `[startCID endCID w]`. Essential for correct glyph advance computation. 
+- **`/CIDToGIDMap`**: either the name `/Identity` (CID equals GID in the embedded font file) or a stream of 2-byte big-endian GID values indexed by CID. + +The encoding pipeline for a CJK text string is: + +``` +raw bytes → CMap lookup → CID → CIDToGIDMap → GID → glyph in font file +``` + +For text extraction, pdftract needs: `raw bytes → character code → ToUnicode (if present) → Unicode codepoint` (a ToUnicode CMap is keyed by character code, not by CID), or, lacking ToUnicode, `character code → CMap lookup → CID → compiled-in CID-to-Unicode table for the given CIDSystemInfo`. + +--- + +## 3. Predefined CMap Names + +ISO 32000-1 (section 9.7.5.2, Table 118) lists the predefined CMaps that a conforming PDF processor must know without an embedded stream. These must be compiled into pdftract as lookup tables. + +**Japanese (Adobe/Japan1)**: + +| CMap Name | Encoding | Direction | +|---|---|---| +| `83pv-RKSJ-H` | Shift-JIS (1983 JIS) | horizontal | +| `90ms-RKSJ-H` | Shift-JIS (MS Windows) | horizontal | +| `90ms-RKSJ-V` | Shift-JIS (MS Windows) | vertical | +| `90msp-RKSJ-H` | Shift-JIS proportional | horizontal | +| `EUC-H` | EUC-JP | horizontal | +| `EUC-V` | EUC-JP | vertical | +| `UniJIS-UTF16-H` | UTF-16 → Japan1 CIDs | horizontal | +| `UniJIS-UTF16-V` | UTF-16 → Japan1 CIDs | vertical | +| `UniJIS2004-UTF32-H` | UTF-32 (Unicode 2004) | horizontal | + +**Chinese Simplified (Adobe/GB1)**: + +| CMap Name | Encoding | +|---|---| +| `GB-EUC-H` | GB2312 EUC | +| `GBT-EUC-H` | GB2312 Traditional EUC | +| `UniGB-UCS2-H` | UCS-2 → GB1 CIDs | +| `UniGB-UTF16-H` | UTF-16 → GB1 CIDs | + +**Chinese Traditional (Adobe/CNS1)**: + +| CMap Name | Encoding | +|---|---| +| `ETen-B5-H` | Big5 (ETen extension) | +| `ETen-B5-V` | Big5 (ETen extension), vertical | +| `UniCNS-UCS2-H` | UCS-2 → CNS1 CIDs | +| `UniCNS-UTF16-H` | UTF-16 → CNS1 CIDs | + +**Korean (Adobe/Korea1)**: + +| CMap Name | Encoding | +|---|---| +| `KSC-EUC-H` | EUC-KR | +| `KSC-EUC-V` | EUC-KR, vertical | +| `UniKS-UCS2-H` | UCS-2 → Korea1 CIDs | +| `UniKS-UTF16-H` | UTF-16 → Korea1 CIDs | + +**Universal 
pass-throughs**: `Identity-H` and `Identity-V` treat the 2-byte big-endian character code directly as the CID. Used by modern tools generating Unicode-mapped CJK fonts. + +Implementation: store each predefined CMap as a sorted `&[(u16, u16)]` slice of `(code, cid)` pairs in a `static` array. For variable-width CMaps (Shift-JIS, GB18030), represent the codespace as a trie or range table keyed on the lead byte. + +--- + +## 4. Shift-JIS Encoding in Detail + +Shift-JIS is a variable-width encoding: + +- **Single-byte** `0x00–0x7F`: ASCII-compatible. +- **Single-byte** `0xA1–0xDF`: the half-width katakana block (U+FF61–U+FF9F, 63 characters, including the half-width punctuation marks). No second byte follows. +- **Lead bytes** `0x81–0x9F` and `0xE0–0xFC`: introduce a 2-byte sequence. The trail byte range is `0x40–0x7E` and `0x80–0xFC` (i.e., `0x40–0xFC` excluding `0x7F`). + +The 2-byte pairs map to JIS X 0208 row/column indices via: + +``` +row = (lead - (lead < 0xA0 ? 0x70 : 0xB0)) * 2 - (trail < 0x9F ? 1 : 0) +col = trail - (trail < 0x7F ? 0x1F : trail < 0x9F ? 0x20 : 0x7E) +``` + +As written, both expressions yield JIS X 0208 byte values in the range `0x21–0x7E`; subtract `0x20` from each to obtain the 1-based row and column (1–94). Each JIS X 0208 cell maps to a Unicode codepoint via the published 94×94 table. The full Shift-JIS→Unicode mapping has approximately 6,879 entries. + +**CP932** (Windows Shift-JIS): adds NEC special characters (`0x8740–0x879C`), IBM extension characters (`0xFA40–0xFC4B`), and maps `0x80` → U+005C (backslash in some contexts). pdftract should treat `90ms-RKSJ-H` as CP932 specifically, not plain Shift-JIS, as it targets Windows-generated PDFs. + +--- + +## 5. GB18030 Encoding + +GB18030 is China's mandatory national standard since 2000. It is a multi-length encoding: + +- **1-byte** `0x00–0x7F`: ASCII. +- **2-byte**: lead `0x81–0xFE`, trail `0x40–0xFE` (excluding `0x7F`). Covers GBK characters (~20,000 codepoints). +- **4-byte**: lead `0x81–0xFE`, second `0x30–0x39`, third `0x81–0xFE`, fourth `0x30–0x39`. Covers the remainder of Unicode through plane 16. 
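
The three byte-length rules above can be sketched as a small classifier. This is a minimal illustration, not pdftract's actual decoder: the function name `gb18030_seq_len` is hypothetical, and it only validates byte ranges, without mapping anything to Unicode.

```rust
/// Returns the length in bytes (1, 2, or 4) of the GB18030 sequence that
/// starts at `bytes[0]`, or `None` if the bytes do not form a valid or
/// complete sequence. Hypothetical helper for illustration only.
pub fn gb18030_seq_len(bytes: &[u8]) -> Option<usize> {
    let lead = *bytes.first()?;
    if lead <= 0x7F {
        return Some(1); // ASCII
    }
    if !(0x81..=0xFE).contains(&lead) {
        return None; // 0x80 and 0xFF are not valid lead bytes
    }
    let second = *bytes.get(1)?;
    match second {
        // Two-byte form: trail 0x40-0xFE, excluding 0x7F.
        0x40..=0x7E | 0x80..=0xFE => Some(2),
        // Four-byte form: a digit here means two more bytes must follow.
        0x30..=0x39 => {
            let third = *bytes.get(2)?;
            let fourth = *bytes.get(3)?;
            if (0x81..=0xFE).contains(&third) && (0x30..=0x39).contains(&fourth) {
                Some(4)
            } else {
                None
            }
        }
        _ => None,
    }
}
```

The second byte alone distinguishes the 2-byte and 4-byte forms, because the digit range `0x30–0x39` never overlaps the 2-byte trail range.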
+ +The 4-byte space provides a linear mapping to Unicode codepoints via a range table: GB18030 4-byte values map to Unicode in monotonically increasing order, enabling binary search over ~1,787 range entries from Adobe's published GB18030→Unicode table. + +In PDF, GB18030 content is identified by `CIDSystemInfo` with `/Registry (Adobe)` `/Ordering (GB1)`. The CMap `UniGB-UTF16-H` maps UTF-16 codes to Adobe/GB1 CIDs. The GB1 character collection contains ~30,284 glyphs as of supplement 5. + +--- + +## 6. Big5 and Big5-HKSCS + +**Big5** is a 2-byte encoding: +- Lead bytes: `0xA1–0xFE`. +- Trail bytes: `0x40–0x7E` and `0xA1–0xFE` (gap at `0x7F–0xA0`). +- Total: ~13,053 Traditional Chinese characters, mapped to Unicode via the CNS 11643 standard. + +The ETen extension (used by `ETen-B5-H`) adds characters at lead bytes `0xC6–0xC8` and `0xF9` ranges, commonly seen in Taiwanese documents. + +**Big5-HKSCS** (Hong Kong Supplementary Character Set, 2016 edition) adds: +- Characters in `0x8740–0xA0FE` (lead bytes below the standard Big5 range). +- Additional characters in `0xC6A1–0xC8FE`. +- Maps to Unicode including characters outside the BMP (requires surrogate pairs in UTF-16 or 4-byte UTF-8). + +Detected via `CIDSystemInfo /Ordering (CNS1)`. The CNS1 collection covers planes 1–7 of CNS 11643. pdftract should carry both the base Big5 mapping table and the HKSCS delta table (~5,000 additional entries). + +--- + +## 7. ToUnicode CMaps for CJK + +When present, a `ToUnicode` stream is the most reliable path to Unicode output. CJK ToUnicode CMaps commonly use `beginbfrange` to cover large contiguous blocks: + +``` +beginbfrange + [ ... ] % row A1 of the 94×94 table +... + [...] 
% last row +endbfrange +``` + +Some CMaps use a simpler linear `bfrange` when the Unicode mapping is contiguous: + +``` +<4E00> <9FFF> <4E00> % CJK Unified Ideographs: CID == Unicode codepoint +``` + +Unicode block coverage to expect in CJK ToUnicode CMaps: +- **U+3040–U+309F**: Hiragana +- **U+30A0–U+30FF**: Katakana +- **U+4E00–U+9FFF**: CJK Unified Ideographs +- **U+3400–U+4DBF**: CJK Extension A +- **U+20000–U+2A6DF**: CJK Extension B (requires surrogate pairs in UTF-16 bfrange entries) +- **U+AC00–U+D7AF**: Hangul Syllables +- **U+F900–U+FAFF**: CJK Compatibility Ideographs + +Validate extracted CJK codepoints against these ranges relative to `CIDSystemInfo /Ordering`. A Japanese PDF should not produce Hangul; if it does, the CMap was misread. Identity-mapped CMaps (where CID equals Unicode codepoint) appear commonly with `UniJIS-UTF16-H` and modern OpenType-based tools—in these cases ToUnicode is often omitted and the CID is used directly as a Unicode scalar value. + +--- + +## 8. Missing ToUnicode Recovery for CJK + +Many CJK PDFs, especially older ones produced by Japanese or Chinese desktop publishing software, omit `ToUnicode`. Recovery requires: + +1. **Identify the character collection** from `CIDSystemInfo`: Registry + Ordering + Supplement determines which Adobe CID table applies. +2. **Look up CID in the compiled-in table**: Adobe publishes CID-to-Unicode mapping files for each collection: + - `Adobe-Japan1-UCS2.txt`: ~14,664 entries mapping Japan1 CIDs to Unicode. + - `Adobe-CNS1-UCS2.txt`: ~18,964 entries for CNS1. + - `Adobe-GB1-UCS2.txt`: ~30,284 entries for GB1. + - `Adobe-Korea1-UCS2.txt`: ~18,352 entries for Korea1. + +These files are freely redistributable. Compile each into a sorted `&[(u16, u32)]` static slice (CID → Unicode scalar). At runtime, binary-search by CID. For CIDs mapping to multiple Unicode codepoints (compatibility variants), store the primary mapping. 
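
The binary-search lookup described above might look like the following sketch. The table rows are placeholders chosen for illustration, not real Adobe data, and the names `SAMPLE_CID_TO_UNICODE` and `cid_to_char` are hypothetical.

```rust
/// Placeholder rows for illustration only. A real table would be generated
/// at build time from the Adobe mapping data for one character collection
/// and must be sorted by CID.
pub static SAMPLE_CID_TO_UNICODE: &[(u16, u32)] = &[
    (1, 0x0020),
    (2, 0x0021),
    (3, 0x0022),
    (10, 0x3042),
];

/// Looks up `cid` in a table sorted by CID and returns the mapped character.
/// Returns `None` if the CID is absent or the stored value is not a valid
/// Unicode scalar value (for example, a surrogate).
pub fn cid_to_char(table: &[(u16, u32)], cid: u16) -> Option<char> {
    let index = table.binary_search_by_key(&cid, |&(c, _)| c).ok()?;
    char::from_u32(table[index].1)
}
```

Storing the scalar as `u32` rather than `char` keeps the table trivially constructible from generated data; validation happens once per lookup through `char::from_u32`.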
+ +For very large tables (Japan1), a 64 KB memory-mapped file loaded once at startup is more practical than a static array; alternatively, the `adobe-cid-tables` crate can provide compiled-in data. + +--- + +## 9. Full-Width and Half-Width Normalization + +CJK documents routinely mix full-width and half-width character forms: + +- **Full-width ASCII/Latin**: U+FF01 (`！`) through U+FF5E (`～`). Appear in Japanese text for typographic consistency. +- **Full-width currency symbols**: U+FFE0–U+FFE6 (e.g., U+FFE5 `￥`). +- **Half-width katakana**: U+FF65–U+FF9F. Commonly appear in older Japanese documents and data entry. +- **Full-width katakana**: U+30A0–U+30FF. The standard form in modern Japanese. + +For **search and indexing**, apply NFKC normalization: full-width Latin → ASCII, half-width katakana → full-width katakana. This ensures `Ａ` (U+FF21) matches `A` (U+0041) in search. + +For **display output**, preserve the original forms. pdftract should expose a normalization flag; the default for its text extraction output should be to preserve, with NFKC normalization available as a post-processing step. + +--- + +## 10. Vertical CJK Text Extraction + +Japanese documents—books, newspapers, legal documents—frequently use vertical writing mode (top-to-bottom, right-to-left column order). + +**Detection**: +- The CMap name ends in `-V` (e.g., `90ms-RKSJ-V`, `UniJIS-UTF16-V`). Check the `/Encoding` value in the Type 0 font dictionary. +- The CMap stream contains `/WMode 1` in its dictionary section. +- The CTM (current transformation matrix) for text-drawing operators shows a 90° rotation (approximately `[0 -1 1 0 tx ty]` or `[0 1 -1 0 tx ty]`). + +**Vertical glyph substitutions**: vertical CMaps substitute specific glyphs—brackets, parentheses, and punctuation rotate to their vertical forms. 
CIDs in the vertical range (e.g., Japan1 CIDs 8284–8285 for vertical brackets) should map to the same Unicode codepoint as their horizontal counterparts (U+FF08/U+FF09, not a separate codepoint) since Unicode encodes only the logical character, not the presentation form. + +**Tate-chu-yoko** (縦中横): short sequences of Latin characters or digits (e.g., "20", "AB") typeset horizontally inline within vertical text. These appear as a horizontal text run with a rotation in the CTM. Detect by the surrounding WMode context and the CTM rotation reversal; output the characters inline in logical order. + +**Column reconstruction**: vertical Japanese text reads top-to-bottom within a column, and columns read right-to-left. After extracting character positions, sort glyphs first by X position descending (right column first), then by Y position descending (top first) within each column. Expose `writing_mode: "ttb"` in the per-page metadata so downstream consumers can reflow correctly. + +--- + +## Implementation Priority + +For pdftract, the recommended implementation order: + +1. Embed predefined CMap lookup tables as `static` byte slices compiled from Adobe's published CMap resources for the predefined CMaps named in ISO 32000-1 (Table 118). +2. Implement Shift-JIS (CP932) and EUC-JP decoders; these cover the majority of Japanese PDF traffic. +3. Implement GBK/GB18030 decoder for Chinese Simplified. +4. Implement Big5/ETen decoder for Chinese Traditional. +5. Implement EUC-KR/CP949 decoder for Korean. +6. Compile Adobe CID-to-Unicode tables as `static` sorted arrays for ToUnicode-absent recovery. +7. Add WMode detection and vertical text column sorting. +8. Expose normalization flags for full-width/half-width conversion. + +Each encoding decoder should return `Option` (or an iterator of `char`) given a byte slice and current position, advancing the position by 1, 2, or 4 bytes. Feed the resulting character code to a CMap lookup to obtain the CID, then pass the CID to the Unicode resolution layer. 
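
As one possible shape for such a decoder, the sketch below reads a single Shift-JIS character code using the CP932 lead-byte ranges given in section 4, and converts a JIS X 0208 pair to its 1-based row and column. The names `Decoded`, `next_sjis_code`, and `sjis_to_kuten` are hypothetical, and returning the consumed length instead of mutating a position is an assumption made for the example.

```rust
/// One character code plus the number of bytes it consumed.
/// Hypothetical return type; the caller advances its position by `len`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Decoded {
    pub code: u32,
    pub len: usize,
}

/// Reads the next Shift-JIS character code at `pos`. Returns `None` at end
/// of input, on an invalid lead byte, or on a missing or invalid trail byte.
pub fn next_sjis_code(bytes: &[u8], pos: usize) -> Option<Decoded> {
    let lead = *bytes.get(pos)?;
    match lead {
        // ASCII-compatible range and the single-byte half-width block.
        0x00..=0x7F | 0xA1..=0xDF => Some(Decoded { code: lead as u32, len: 1 }),
        0x81..=0x9F | 0xE0..=0xFC => {
            let trail = *bytes.get(pos + 1)?;
            match trail {
                0x40..=0x7E | 0x80..=0xFC => Some(Decoded {
                    code: ((lead as u32) << 8) | trail as u32,
                    len: 2,
                }),
                _ => None,
            }
        }
        _ => None,
    }
}

/// Converts a two-byte pair to a JIS X 0208 (row, column), both in 1..=94.
/// Only lead bytes 0x81-0x9F and 0xE0-0xEF fall inside JIS X 0208; the
/// CP932 extension rows above 0xEF are rejected here.
pub fn sjis_to_kuten(lead: u8, trail: u8) -> Option<(u8, u8)> {
    let base: u16 = match lead {
        0x81..=0x9F => 0x70,
        0xE0..=0xEF => 0xB0,
        _ => return None,
    };
    if !matches!(trail, 0x40..=0x7E | 0x80..=0xFC) {
        return None;
    }
    let (lead, trail) = (lead as u16, trail as u16);
    // These two values are the JIS byte pair, each in 0x21..=0x7E.
    let j1 = (lead - base) * 2 - if trail < 0x9F { 1 } else { 0 };
    let j2 = trail - if trail < 0x7F { 0x1F } else if trail < 0x9F { 0x20 } else { 0x7E };
    Some(((j1 - 0x20) as u8, (j2 - 0x20) as u8))
}
```

For instance, the pair `0x88 0x9F` decodes as the code `0x889F` with length 2, and converts to row 16, column 1.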
diff --git a/docs/research/extraction-pipeline-overview.md b/docs/research/extraction-pipeline-overview.md new file mode 100644 index 0000000..811bb9c --- /dev/null +++ b/docs/research/extraction-pipeline-overview.md @@ -0,0 +1,231 @@ +# pdftract Extraction Pipeline: End-to-End Architectural Overview + +This document synthesizes the 36 specialized research documents in this directory into a coherent architectural blueprint for implementing the pdftract Rust PDF text extraction library. It describes the ordered sequence of stages, decision points, and data transformations that take a PDF file as input and produce readable, structured text as output. Engineers implementing pdftract should treat this as the canonical pipeline reference and consult the named component documents for deeper detail on each subsystem. + +--- + +## Pipeline Inputs and Outputs + +**Input.** The pipeline accepts either a file path (opened via memory-mapped I/O for zero-copy reads) or an in-memory byte slice. All subsequent parsing operates on the raw bytes through a shared reference; no additional buffering is introduced at the entry point. Configuration is provided via an `ExtractionOptions` struct with fields including: `ocr_enabled: bool`, `ocr_language: Vec`, `extract_forms: bool`, `extract_annotations: bool`, `extract_attachments: bool`, `extract_images: bool`, `readability_threshold: f32`, `ocr_fallback_threshold: f32`, `include_invisible_text: bool`, and `streaming: bool`. + +**Output.** The pipeline produces a structured JSON document (or NDJSON stream in streaming mode) with the following top-level shape: + +``` +{ + "metadata": { ... }, // document-level metadata and diagnostics + "outline": [ ... ], // bookmark tree + "pages": [ ... ], // per-page content + "form_fields": [ ... ], // AcroForm / XFA fields (if enabled) + "annotations": [ ... ], // page annotations (if enabled) + "attachments": [ ... ], // embedded files (if enabled) + "warnings": [ ... 
] // extraction warnings across all stages +} +``` + +Each page entry carries `blocks` (containing `spans` with per-glyph Unicode and confidence), `extraction_method`, `classification_signals`, `reading_order_algorithm`, `readability_score`, and a page-level `warnings` array. The `--text` flag collapses all block content to plain text separated by `\n\n`. Exit codes follow quality: `0` = clean, `1` = warnings present, `2` = errors or low-confidence pages below threshold. + +--- + +## Stage 1: File Opening and Structure Parsing + +See: `pdf-specification.md`, `malformed-pdf-repair-and-recovery.md`, `pdfa-compliance-and-extraction.md`, `pdf-encryption-and-security.md`. + +The pipeline opens the input via `mmap` and immediately checks the `%PDF-` header to confirm a valid PDF container, recording `pdf_version` in the output metadata. Parsing then works backward from the end of file to locate the `startxref` offset. + +**Encryption detection.** The trailer dictionary is scanned for a `/Encrypt` entry. If present, the encryption handler is identified (standard password, certificate, or custom). `ExtractionOptions` may supply a password; if decryption fails or no password is provided, the pipeline returns an `EncryptionError` immediately. See `pdf-encryption-and-security.md` for the full handler decision tree. + +**Cross-reference resolution.** The pipeline first attempts to parse the traditional xref table at the `startxref` offset. If that fails (common in repaired or linearized files), it falls back to xref streams (PDF 1.5+). If both fail, it falls back to a forward object scan — a full-file sequential pass that reconstructs the object map from `obj` / `endobj` markers. This scan is slower but handles severely malformed files. Recovered objects are flagged in `warnings`. The complete strategy is documented in `malformed-pdf-repair-and-recovery.md`. 
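
The last-resort forward object scan could be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, headers are only recognized at the start of a line in the exact form `<num> <gen> obj`, and no attempt is made to skip stream bodies, so binary data that happens to contain a matching line would produce a false entry. A production scan would need to be more tolerant and more defensive.

```rust
use std::collections::BTreeMap;

/// Scans the whole file for `<num> <gen> obj` headers and returns a map from
/// object number to byte offset. A later definition of the same object
/// number overrides an earlier one, mirroring incremental updates.
pub fn forward_object_scan(data: &[u8]) -> BTreeMap<u32, usize> {
    let mut map = BTreeMap::new();
    for i in 0..data.len() {
        let at_line_start = i == 0 || data[i - 1] == b'\n' || data[i - 1] == b'\r';
        if at_line_start {
            if let Some(num) = parse_obj_header(&data[i..]) {
                map.insert(num, i);
            }
        }
    }
    map
}

/// Parses `<num> <gen> obj` at the start of `s`, returning the object number.
fn parse_obj_header(s: &[u8]) -> Option<u32> {
    let (num, rest) = take_uint(s)?;
    let rest = rest.strip_prefix(b" ")?;
    let (_generation, rest) = take_uint(rest)?;
    let rest = rest.strip_prefix(b" obj")?;
    // `obj` must be followed by whitespace, a delimiter, or end of input,
    // so that a token such as `objx` is not accepted.
    match rest.first().copied() {
        None | Some(b' ' | b'\r' | b'\n' | b'<' | b'[' | b'(' | b'/') => Some(num),
        _ => None,
    }
}

/// Reads a run of ASCII digits as a `u32`, returning the value and the rest.
fn take_uint(s: &[u8]) -> Option<(u32, &[u8])> {
    let digits = s.iter().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        return None;
    }
    let text = std::str::from_utf8(&s[..digits]).ok()?;
    let value: u32 = text.parse().ok()?;
    Some((value, &s[digits..]))
}
```

Each offset recovered this way would then be recorded alongside a warning, since the object map no longer comes from the file's own cross-reference data.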
+ +**Document catalog and page tree.** With a valid object map, the pipeline resolves the `/Root` entry to the document catalog. The page tree (`/Pages` subtree) is traversed once to build a flat index of page dictionaries with their inherited attributes (media box, resources, rotation), enabling O(1) lookup by page number for parallel access in Stage 4. + +**PDF/A and tagging detection.** The catalog's `/Metadata` XMP stream is decoded and inspected for `pdfaid:conformance` and `pdfaid:part` to record the conformance level. The `/MarkInfo` dictionary's `/Marked` flag records whether the document is tagged. Both influence downstream path selection. See `pdfa-compliance-and-extraction.md`. + +--- + +## Stage 2: Document-Level Metadata + +See: `xmp-and-document-metadata.md`, `pdf-specification.md`. + +Metadata extraction runs once before per-page work. The pipeline first attempts the XMP metadata stream from the catalog `/Metadata` key, parsing it as an RDF/XML document to extract standard Dublin Core and PDF namespace fields: title, author, creator, producer, creation date, modification date, keywords, and subject. If the XMP stream is absent or malformed, it falls back to the `/Info` dictionary, which carries the same fields in PDF string encoding. + +When both sources exist, conflicts are resolved in favor of XMP for all fields where XMP provides a value — XMP is the authoritative source in PDF 1.4+ documents. The resolved values are written to `metadata` in the output. + +The pipeline also extracts the document outline (bookmarks) by walking the `/Outlines` tree, recording title, destination, and nesting level for each entry. Page labels from the `/PageLabels` number tree are extracted and stored in `metadata.page_labels`, enabling human-readable page numbering in output. + +--- + +## Stage 3: Per-Page Classification + +See: `scanned-vs-vector-page-classification.md`, `pdfa-compliance-and-extraction.md`, `raster-ocr-pipeline.md`. 
+ +Before any expensive extraction work, each page is classified to select the optimal extraction path. Classification runs a sequence of fast pre-checks on the page content stream and resource dictionary: + +1. **No text operators.** If the content stream contains no `Tj`, `TJ`, `'`, `"`, or `TD`/`Tm` operators, the page is initially flagged as `Scanned`. +2. **Full-page Tr=3 + image.** If all text operators set rendering mode 3 (invisible) and a full-page image XObject covers the media box, the page is classified as `BrokenVector` (a PDF/A OCR layer pattern where real text is hidden beneath a scan). See `invisible-and-hidden-text.md`. +3. **Image coverage fraction.** The pipeline computes the fraction of the page media box area covered by raster image XObjects. Coverage above a configurable threshold (default 0.85) is a strong scanned signal. +4. **Character validity rate.** Text operators are parsed and character codes are passed through a quick validity check (ToUnicode CMap lookup + AGL probe). A validity rate below a threshold (default 0.4) indicates a broken or symbolic font encoding, yielding `BrokenVector`. +5. **High-density valid text.** Pages with validity rate above 0.85 and no significant image coverage are classified as `Vector`. + +The result is one of four `PageClass` values — `Vector`, `Scanned`, `Hybrid`, `BrokenVector` — each with an associated `confidence` score. Classification signals are recorded in the page output for diagnostics. + +--- + +## Stage 4: Content Extraction (Per-Page, Parallelized) + +See: `content-stream-concatenation.md`, `graphics-state-tracking.md`, `raster-ocr-pipeline.md`, `word-boundary-reconstruction.md`, `type3-font-extraction.md`, `optional-content-groups.md`. + +Stage 4 is the core extraction stage and is parallelized across pages using `rayon`. Each page runs one of four sub-paths determined by its `PageClass`. + +### 4a. 
Vector Path + +Content streams are concatenated (handling `/Length` mismatches, flate-decoding, and multi-stream pages) per `content-stream-concatenation.md`. A PDF graphics state machine processes operators in order, maintaining a stack of `GraphicsState` structs that track the current transformation matrix (CTM), text matrix (Tm), text line matrix (Tlm), font, font size, character spacing, word spacing, horizontal scaling, and text rise. See `graphics-state-tracking.md`. + +For each glyph, the text matrix is combined with the CTM to produce a device-space bounding box. Character codes are passed to the font pipeline (Stage 5) for Unicode resolution. Inter-glyph gaps are measured in glyph-space units normalized by the current font size; gaps exceeding the word-boundary threshold produce synthetic space characters. See `word-boundary-reconstruction.md`. Optional content group state (`/OC` entries) is tracked to suppress content from hidden layers. See `optional-content-groups.md`. + +### 4b. OCR Path + +The page is rendered to a 300 DPI raster using a PDF renderer. The raster undergoes preprocessing: deskew via Hough line detection, binarization via Sauvola local thresholding, and optional denoising. Tesseract is invoked with the language pack(s) specified in `ExtractionOptions.ocr_language`. HOCR output is parsed into glyph-level spans with bounding boxes and confidence scores. See `raster-ocr-pipeline.md` for the full preprocessing and Tesseract integration. + +### 4c. Hybrid Path + +Vector regions and image regions are identified by comparing text operator bounding boxes and image XObject placements. Regions where vector text is present use sub-path (a); regions covered by raster images with no overlapping vector text use sub-path (b). Spans from both sub-paths are merged by page coordinate order into a unified span list. + +### 4d. 
Assisted OCR (BrokenVector) + +Sub-path (a) is run first in position-hint mode: glyph bounding boxes are collected but Unicode values are discarded. These bounding boxes seed Tesseract's segmentation, improving word boundary detection. The OCR output then resolves the actual characters. Conflicts between position hints and OCR word boundaries are resolved in favor of OCR character shapes. + +--- + +## Stage 5: Font Pipeline + +See: `pdf-fonts-and-encoding.md`, `cmap-format-and-cid-encoding.md`, `glyph-recognition-and-unicode-recovery.md`, `type3-font-extraction.md`. + +For every character code encountered in the Vector path, the font pipeline resolves a Unicode scalar value through a prioritized fallback chain: + +1. **ToUnicode CMap.** If the font dictionary carries a `/ToUnicode` stream, the CMap is parsed and the character code is looked up. If the result is a non-sentinel value (not U+FFFD, not empty), it is used and `unicode_source` is set to `"to_unicode"`. See `cmap-format-and-cid-encoding.md`. +2. **Encoding vector + AGL.** If ToUnicode is absent or returns a sentinel, the font's encoding vector maps the character code to a glyph name. The Adobe Glyph List resolves the glyph name to a Unicode code point. `unicode_source` = `"agl"`. See `pdf-fonts-and-encoding.md`. +3. **Font fingerprint cache.** A precomputed database of known font program checksums maps directly to per-glyph Unicode tables. If the font program hash matches a database entry, the precomputed mapping is used. `unicode_source` = `"fingerprint"`. +4. **Glyph shape recognition.** The glyph is rendered to a small bitmap and hashed. If the shape hash matches an entry in the glyph recognition database, the Unicode value is assigned. `unicode_source` = `"shape_match"`. See `glyph-recognition-and-unicode-recovery.md`. +5. **Failure.** If all four steps fail, U+FFFD is emitted and `confidence` is set to `0.0`. 
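
The prioritized fallback chain above can be sketched as an ordered list of resolvers tried in turn. This is an illustrative shape only: the type and function names are hypothetical, and the per-step confidence values are supplied by the caller because the text above specifies only that a total failure yields `0.0`.

```rust
/// Which step of the chain produced the codepoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnicodeSource {
    ToUnicode,
    Agl,
    Fingerprint,
    ShapeMatch,
    Failed,
}

/// Per-glyph result carrying the three fields named in the text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ResolvedGlyph {
    pub codepoint: char,
    pub unicode_source: UnicodeSource,
    pub confidence: f32,
}

/// A resolver maps a character code to a Unicode scalar, or `None`.
pub type Resolver<'a> = &'a dyn Fn(u32) -> Option<char>;

/// Tries each `(source, confidence, resolver)` entry in order and returns the
/// first result that is not the U+FFFD sentinel. If every step fails, emits
/// U+FFFD with confidence 0.0.
pub fn resolve_glyph(code: u32, chain: &[(UnicodeSource, f32, Resolver)]) -> ResolvedGlyph {
    for &(source, confidence, resolver) in chain {
        if let Some(c) = resolver(code) {
            if c != '\u{FFFD}' {
                return ResolvedGlyph {
                    codepoint: c,
                    unicode_source: source,
                    confidence,
                };
            }
        }
    }
    ResolvedGlyph {
        codepoint: '\u{FFFD}',
        unicode_source: UnicodeSource::Failed,
        confidence: 0.0,
    }
}
```

Keeping the chain as data rather than nested `if` statements makes the ordering explicit and lets a font type that skips a step, such as Type 3, pass a shorter chain.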
+ +Type 3 fonts, which define glyph shapes as content stream fragments, are handled specially: each glyph's content stream is rasterized and passed to the shape recognition step. See `type3-font-extraction.md`. + +Each glyph in the output carries `codepoint`, `unicode_source`, and `confidence`. + +--- + +## Stage 6: Span and Block Assembly + +See: `complex-layout-reading-order.md`, `tagged-pdf-structure-and-reading-order.md`, `document-classification-and-zone-labeling.md`, `watermark-and-background-separation.md`, `invisible-and-hidden-text.md`. + +Raw glyphs are grouped into **spans** by continuity of font, font size, color (fill and stroke), and rendering mode. A new span begins whenever any of these attributes changes, or when a word boundary gap is detected. + +**Reading order.** If the document is tagged (`/MarkInfo /Marked true`) or conforms to PDF/A-a, the StructTree is traversed to derive reading order. `reading_order_algorithm` is set to `"struct_tree"`. For untagged documents, the pipeline applies XY-cut decomposition (for rectilinear layouts) or Docstrum (for documents with irregular column boundaries). See `complex-layout-reading-order.md` and `tagged-pdf-structure-and-reading-order.md`. + +**Zone labeling.** After reading order is established, spans are assigned to document zones: `body`, `heading`, `header`, `footer`, `footnote`, `caption`, or `sidebar`. Zone assignment uses margin heuristics (vertical position relative to media box), font size clustering (headings are statistical outliers in the size distribution), and cross-page consistency (running headers/footers appear at similar positions across pages). See `document-classification-and-zone-labeling.md`. + +**Watermark and invisible text filtering.** Spans in rendering mode 3 (invisible) are suppressed unless `ExtractionOptions.include_invisible_text` is true. Spans classified as watermarks (low opacity, Z-order beneath body text, or matching common watermark patterns) are filtered per policy. 
See `watermark-and-background-separation.md` and `invisible-and-hidden-text.md`. + +Spans are assembled into **blocks** representing paragraphs or other logical units, and blocks are ordered within each page according to the reading order algorithm's output. + +--- + +## Stage 7: Text Normalization and Quality + +See: `post-extraction-normalization.md`, `post-ocr-text-correction.md`, `text-readability-validation.md`, `semantic-text-reconstruction.md`, `language-detection-and-script-handling.md`. + +Normalization runs as an ordered pipeline applied to each span's text: + +1. **Ligature expansion.** Standard ligatures (fi, fl, ffi, ffl, ſt, st) are expanded to their component characters. +2. **Unicode normalization.** All text is normalized to NFC. +3. **Whitespace collapse.** Runs of whitespace within a span are collapsed to a single space; leading and trailing whitespace is stripped. +4. **Hyphen joining.** Lines ending in a hyphen are joined to the next line's first word, with the hyphen removed, if the joined form appears in a language dictionary. +5. **Paragraph reconstruction.** Short lines that do not end with sentence-terminal punctuation are joined to the following line when their right edge falls significantly short of the text block width. See `semantic-text-reconstruction.md`. +6. **Header/footer deduplication.** Spans in the `header` and `footer` zones that appear with identical or near-identical text across three or more consecutive pages are flagged as `deduplicated` and excluded from the main text flow. They remain in the output under their zone label for reference. + +**Readability scoring.** Each span is scored on three signals: Shannon entropy of the character distribution, dictionary hit rate against a word list for the detected language, and character validity rate (fraction of non-U+FFFD codepoints). The composite `readability_score` per block (0.0–1.0) is written to the output. 
Blocks scoring below `ExtractionOptions.ocr_fallback_threshold` trigger an OCR fallback for that region on vector pages, re-running the block through sub-path (b) of Stage 4. See `text-readability-validation.md`. + +**Post-OCR correction.** For spans produced by the OCR path, a correction pass applies: confusable character substitution (0↔O, 1↔l, rn↔m), regex-based pattern correction (dates, identifiers), and bigram/trigram context correction using a language model. See `post-ocr-text-correction.md`. + +Language detection runs on the assembled block text to confirm or override the per-page language hint. The detected language is used to select the appropriate dictionary and Tesseract language pack for any OCR fallback runs. See `language-detection-and-script-handling.md`. + +--- + +## Stage 8: Supplementary Content + +See: `form-fields-and-annotations.md`, `embedded-files-and-portfolios.md`, `image-and-figure-extraction.md`. + +Supplementary extraction runs after all pages complete, guarded by the relevant `ExtractionOptions` flags. + +**Forms.** If `extract_forms` is true, the AcroForm dictionary is located in the catalog. Each field in the `/Fields` array is walked recursively. Field type (`Tx`, `Btn`, `Ch`, `Sig`), name, value, and appearance state are extracted. If an `/XFA` stream is present, it is parsed as XFA XML and field values are extracted from the XFA data model. See `form-fields-and-annotations.md`. + +**Annotations.** If `extract_annotations` is true, each page's `/Annots` array is iterated. For text and link annotations, `Contents` and `RC` (rich content) fields are extracted. Annotation type, rectangle, and flags are recorded. Redaction annotations (`/Redact`) are noted in warnings. + +**Attachments.** If `extract_attachments` is true, the `/EmbeddedFiles` name tree in the catalog is walked. 
Each `Filespec` dictionary yields a filename, description, MIME type, creation date, and the raw file bytes (or a size-limited excerpt if the attachment is large). See `embedded-files-and-portfolios.md`. + +**Images.** If `extract_images` is true, image XObjects referenced from each page's resource dictionary are collected. Metadata (width, height, color space, bits per component, filter chain) is always included. Pixel data is decoded and included as base64 only if `ExtractionOptions.include_image_data` is true. See `image-and-figure-extraction.md`. + +--- + +## Stage 9: Output Serialization + +See: `performance-and-streaming-architecture.md`, `chunking-for-llm-consumption.md`. + +The final stage assembles all collected data and serializes it. + +**Buffered JSON mode** (default). The complete document tree is serialized to a single JSON object. Field ordering follows the schema defined in the Pipeline Inputs and Outputs section above. `serde_json` with `BufWriter` is used; the output is written to stdout or a specified file path. + +**Streaming NDJSON mode** (`ExtractionOptions.streaming = true`). Metadata is emitted as the first JSON line. Each page is serialized and emitted as a JSON line immediately after it completes extraction, allowing consumers to begin processing before the full document is done. This mode is documented in `performance-and-streaming-architecture.md` and is designed to support the LLM consumption patterns described in `chunking-for-llm-consumption.md`. 
+ +Each page object in both modes carries: + +- `page_number` (1-based) +- `extraction_method`: one of `"vector"`, `"ocr"`, `"hybrid"`, `"assisted_ocr"` +- `classification_signals`: the raw signals from Stage 3 (image coverage fraction, character validity rate, operator counts) +- `reading_order_algorithm`: `"struct_tree"`, `"xy_cut"`, or `"docstrum"` +- `readability_score`: composite 0.0–1.0 for the page +- `blocks`: ordered array of text blocks with spans +- `warnings`: page-level warning array + +**Exit code semantics.** After all pages are processed, the pipeline computes the worst-case quality across pages. If all pages have readability score above the clean threshold, exit code `0` is returned. If any page emits warnings (OCR fallback triggered, low-confidence spans, unsupported features), exit code `1` is returned. If any page fails extraction entirely or contains errors, exit code `2` is returned. This allows shell pipelines and CI systems to gate on extraction quality without parsing the output JSON. 
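The exit code rule above amounts to taking the worst per-page outcome. A minimal sketch, assuming hypothetical `PageSummary` and `Severity` types (these names are illustrative, not pdftract's actual API) and assuming that a page at or below the clean threshold counts as a warning and that an empty document exits with `0`:

```rust
/// Minimal per-page facts the exit code depends on.
/// Hypothetical type: field names are illustrative, not pdftract's API.
#[derive(Debug, Clone, Copy)]
pub struct PageSummary {
    pub readability_score: f64,
    pub warning_count: usize,
    pub failed: bool,
}

/// Severity levels declared in ascending order so `max` picks the worst.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Clean = 0,
    Warned = 1,
    Failed = 2,
}

/// Classifies one page. Assumption: a score that is not strictly above
/// the clean threshold (including NaN) is treated as a warning.
pub fn page_severity(page: &PageSummary, clean_threshold: f64) -> Severity {
    if page.failed {
        Severity::Failed
    } else if page.warning_count > 0 || !(page.readability_score > clean_threshold) {
        Severity::Warned
    } else {
        Severity::Clean
    }
}

/// Worst-case severity across all pages, mapped to the process exit code.
/// Assumption: a document with no pages exits with 0.
pub fn exit_code(pages: &[PageSummary], clean_threshold: f64) -> i32 {
    pages
        .iter()
        .map(|page| page_severity(page, clean_threshold))
        .max()
        .unwrap_or(Severity::Clean) as i32
}
```

A caller would pass the result to `std::process::exit` once serialization has finished, so the exit status reflects the whole document rather than the last page processed.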
+ +--- + +## Summary: Stage Ordering and Data Flow + +``` +Input (file path / bytes) + │ + ▼ +Stage 1: File opening, xref, decryption, page tree index + │ + ▼ +Stage 2: Document metadata (XMP, /Info, outline, page labels) + │ + ▼ +Stage 3: Per-page classification → PageClass × confidence + │ + ▼ +Stage 4: Content extraction (rayon parallelism across pages) + ├─ Vector → graphics state machine → raw glyphs + ├─ OCR → raster render → Tesseract → raw spans + ├─ Hybrid → Vector regions + OCR regions → merged spans + └─ BrokenVector → position hints + OCR → spans + │ + ▼ (from Vector path) +Stage 5: Font pipeline → Unicode + confidence per glyph + │ + ▼ +Stage 6: Span + block assembly → reading order → zone labels + │ + ▼ +Stage 7: Normalization → readability scoring → OCR fallback → correction + │ + ▼ +Stage 8: Forms, annotations, attachments, images (conditional) + │ + ▼ +Stage 9: JSON / NDJSON serialization → exit code +``` + +Each stage boundary is a well-defined data contract. Stages 1–2 produce document-scoped structures shared across all pages. Stage 3 produces per-page `PageClass` values that gate Stage 4 sub-path selection. Stages 4–7 are the per-page pipeline and are the primary targets for parallelism and optimization. Stages 8–9 are sequential post-processing passes over the fully assembled extraction result. diff --git a/docs/research/linearized-pdf-and-streaming.md b/docs/research/linearized-pdf-and-streaming.md new file mode 100644 index 0000000..783b22a --- /dev/null +++ b/docs/research/linearized-pdf-and-streaming.md @@ -0,0 +1,242 @@ +# Linearized PDF and Streaming Extraction + +## Overview + +Linearized PDFs (also called "web-optimized" PDFs) are files reorganized so that a conforming reader can display the first page after receiving only the first portion of the file. 
For `pdftract`, this structure provides two distinct opportunities: fast first-page extraction without reading the full file, and demand-driven page-by-page streaming when extracting from a remote URL. + +--- + +## 1. What Linearization Is + +A standard PDF places the cross-reference table (xref) at the end of the file. A reader must download the entire file, seek to the `startxref` offset, parse the xref, then locate any object. Linearization (PDF specification §F) reorders the file so that all objects needed to render the first page appear first, enabling a single HTTP request covering only the initial byte range to produce a renderable first page. + +The file layout for a linearized PDF is: + +1. **Linearization dictionary** — the first object in the file. +2. **First-page xref and trailer** — a small xref covering only first-page objects. +3. **First-page content objects** — page dictionary, content streams, fonts, and resources used on page 1. +4. **Primary hint stream** — page offset and shared object tables. +5. **Remaining pages and shared objects** — in page order. +6. **Main xref and trailer** — covers all objects, at the end of the file. + +The linearization dictionary is a regular PDF dictionary object with the key `Linearized` set to the real number `1.0` (or `1` in some implementations). Its required entries are: + +| Key | Meaning | +|-----|---------| +| `L` | Total file length in bytes | +| `H` | Array of two or four integers: `[offset, length]` of the primary hint stream, optionally followed by `[offset, length]` of the overflow hint stream | +| `O` | Object number of the first page's page object | +| `E` | Byte offset of the end of the first page section | +| `N` | Number of pages in the document | +| `T` | Byte offset of the main xref table (the one at the end of the file) | + +Contrast with a non-linearized file: its only xref is at the end, and objects are stored in creation order with no guarantees about the first page appearing first. 
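The required entries in the table can be held in a small struct and checked against the actual file length before anything else is trusted. This is a sketch only: `LinearizationParams`, `LinearizationError`, and `validate` are hypothetical names, and the bounds rules are the simplest reading of the table (every offset must fall inside the file, and `L` must equal the file length).

```rust
/// Parsed values of the linearization dictionary keys.
/// Hypothetical type: names are illustrative, not pdftract's API.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LinearizationParams {
    pub file_length: u64,                  // L
    pub primary_hint: (u64, u64),          // H[0], H[1]: offset, length
    pub overflow_hint: Option<(u64, u64)>, // H[2], H[3] when present
    pub first_page_object: u32,            // O
    pub end_of_first_page: u64,            // E
    pub page_count: u32,                   // N
    pub main_xref_offset: u64,             // T
}

#[derive(Debug, PartialEq, Eq)]
pub enum LinearizationError {
    LengthMismatch { recorded: u64, actual: u64 },
    OutOfBounds(&'static str),
    NoPages,
}

impl LinearizationParams {
    /// Checks the dictionary against the real file length.
    pub fn validate(&self, actual_len: u64) -> Result<(), LinearizationError> {
        if self.file_length != actual_len {
            return Err(LinearizationError::LengthMismatch {
                recorded: self.file_length,
                actual: actual_len,
            });
        }
        if self.page_count == 0 {
            return Err(LinearizationError::NoPages);
        }
        // An (offset, length) pair is valid if it ends inside the file
        // and the addition does not overflow.
        let in_bounds = |(offset, len): (u64, u64)| {
            offset.checked_add(len).map_or(false, |end| end <= actual_len)
        };
        if !in_bounds(self.primary_hint) {
            return Err(LinearizationError::OutOfBounds("H"));
        }
        if let Some(range) = self.overflow_hint {
            if !in_bounds(range) {
                return Err(LinearizationError::OutOfBounds("H (overflow)"));
            }
        }
        if self.end_of_first_page > actual_len {
            return Err(LinearizationError::OutOfBounds("E"));
        }
        if self.main_xref_offset >= actual_len {
            return Err(LinearizationError::OutOfBounds("T"));
        }
        Ok(())
    }
}
```

Returning a typed error rather than a boolean lets the caller log which key failed before falling back to ordinary xref parsing.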
+ +--- + +## 2. Hint Streams + +The primary hint stream (located at the byte range given by `H[0]` and `H[1]`) contains at least two tables serialized in a compact binary format: the page offset hint table and the shared object hint table. + +**Page offset hint table.** A header followed by one entry per page. The header records the location of the first page's page object, the smallest per-page object count and byte length, and the bit width of each per-page field. Each per-page entry encodes, as a delta from the header minimum: + +- The number of objects belonging to that page. +- The total byte length of all objects on that page. + +Absolute page offsets are not stored per entry; in outline, they are derived by accumulating page lengths from a known starting location. Locations in the hint tables are relative to the start of the file but are computed as if the primary hint stream were absent, so any location at or beyond `H[0]` must have `H[1]` added to it. Once resolved, a page's offset and length are sufficient to compute the exact `Range: bytes=N-M` request needed to fetch that page's raw object data without reading the xref for that page. + +**Shared object hint table.** Objects shared across multiple pages (a common font embedded once but referenced from every page, a logo image) are listed separately. Each entry records the byte length of one shared object group; object numbers and file offsets are derived from the table header by accumulation, as for pages. When fetching an arbitrary page, the extractor must also fetch any shared objects that page references; the shared object hint table makes this a direct seek rather than an xref lookup. + +The optional overflow hint stream at `H[2]`/`H[3]` is not related to file size or 32-bit overflow. It holds hint table data that does not fit in the primary hint stream, so when `H` has four entries the extractor needs both byte ranges to see every hint table. + +Parsing the hint stream requires reading a bitfield-packed binary structure: the spec defines a table of `nSharedObjects` entries, each encoded with a fixed bit width recorded at the top of the table. This is not length-prefixed text; it requires a bit-level reader that tracks the current bit position within the decompressed stream buffer. + +--- + +## 3. First-Page Xref + +Linearized files contain two xref structures: + +- **First-page xref**: immediately follows the linearization dictionary. It is a conventional xref table (or cross-reference stream for PDF 1.5+) covering the objects needed for the first page, together with document-level objects such as the catalog. Its trailer has a `Size` entry equal to the combined number of entries in the first-page and main xref sections, not the count of first-page objects alone.
+- **Main xref**: at the end of the file, covering the objects not listed in the first-page xref (the remaining pages and shared objects). In a linearized file its trailer typically carries little beyond `Size`; `Root`, `Info`, and a `Prev` entry pointing at the main xref sit in the first-page trailer, and the `startxref` at the end of the file points back to the first-page xref. + +Parsing strategy for `pdftract`: + +1. Read the first 1 KB to locate the linearization dictionary, which must appear within the first 1024 bytes of the file. +2. Validate the dictionary (see §5). +3. Parse the first-page xref. Together with the bytes up to `E`, this xref is sufficient to extract page 1 without consulting the main xref. +4. Defer parsing the main xref until a non-first-page object is requested. + +This lazy strategy means that for a request extracting only the first page (common in preview generation), the main xref, which may be many megabytes into a large file, is never read. + +--- + +## 4. Streaming Extraction for HTTP Delivery + +Consider extracting text from a 500-page PDF hosted at a remote URL. Waiting for a full download before beginning extraction is wasteful in both latency and peak memory. + +**Protocol.** HTTP/1.1 and HTTP/2 both support `Range: bytes=N-M` requests. A `HEAD` request first confirms `Accept-Ranges: bytes` and retrieves `Content-Length` (needed to validate the `L` key and compute the total file size). + +**Fetch sequence for a linearized file:** + +1. Fetch bytes `0` through `E` (the end-of-first-page offset from the linearization dict). This yields the linearization dictionary, first-page xref, and first-page objects, and usually the primary hint stream as well; if the `H` range extends past `E`, fetch it separately. +2. Parse and emit page 1 immediately. +3. For each subsequent page `i`, compute the half-open byte range from the page offset hint table: `[page_offset[i], page_offset[i] + page_length[i])`. Fetch that range plus any referenced shared object ranges. +4. Parse and emit page `i`. + +Using `reqwest` with range requests: + +```rust +use reqwest::header::RANGE; + +// `end` is inclusive: HTTP byte ranges cover both endpoints, so a +// half-open span [a, b) is requested as bytes=a-(b - 1). +let response = client + .get(&url) + .header(RANGE, format!("bytes={}-{}", start, end)) + .send() + .await?; +// A server that ignores Range replies 200 with the full body; +// expect 206 Partial Content before trusting the length. +let bytes = response.bytes().await?; +``` + +Each range fetch is independent.
For pages whose shared object dependencies are already cached from a prior fetch, no additional request is needed. A local `HashMap` cache keyed by object number and holding the fetched object bytes avoids re-fetching shared fonts and images. + +For non-linearized remote files, fall back to: fetch the last 1 KB to read `startxref`, fetch the main xref, then fetch individual pages using xref offsets. + +--- + +## 5. Detecting and Validating Linearization + +**Detection.** The linearization dictionary must be the first indirect object in the file and must carry `Linearized 1.0` (or `1`). In practice, many tools emit linearized-looking files that fail validation. Check: + +1. The first object in the file is a dictionary with the `Linearized` key. +2. `L` matches `file.metadata()?.len()` exactly. If the file length does not match, the linearization is stale. +3. `H`, `O`, `E`, `N`, and `T` are all present, and the byte offsets among them (`H`, `E`, `T`) fall within file bounds. + +**Invalid linearization.** Incremental updates (see §6) are the most common cause. If `L` does not match the actual file size, the hint stream offsets are unreliable. Fall back to standard xref parsing: seek to `startxref` at the end of the file, parse the main xref chain, and process the document normally. Record the fallback as a structured `tracing::debug!` event. + +**False positives.** Some non-linearized PDFs happen to have their first object numbered 1 and start with a dictionary. Confirm the `Linearized` key is present and the value is a number equal to 1 before treating the file as linearized. + +--- + +## 6. Incremental Update Interaction + +When a linearized PDF is updated incrementally (a common operation for annotation tools, form fillers, and digital-signature workflows), the update is appended at the end of the file. This invalidates the `L` key (file is now longer) and renders all hint stream offsets stale for any updated object. + +The hint streams still reflect the original layout.
For first-page extraction on a file that has not had its first page modified, the hint stream may still be usable. However, this is difficult to determine without reading the full incremental update delta. + +**Safe strategy:** + +- Use the linearization structure only for detecting that the first-page xref is available; read and extract page 1 from the first-page xref as long as `L` is consistent with the original (pre-update) length. +- For any non-first-page content, or any file where `L` mismatches, follow the full xref chain from the end of the file. The last trailer's `Prev` pointer chains back through all prior xref sections. The most recently appended xref section (the one the final `startxref` points to) is authoritative: its entries override those in earlier sections for every object it lists, including updates. +- Never trust hint stream page offsets for updated files. + +--- + +## 7. Memory-Efficient Streaming Output + +For large documents, accumulating the full extraction result in memory before writing output is not viable. `pdftract` supports NDJSON (newline-delimited JSON) streaming output: each page's `PageExtraction` is serialized and written to stdout before the next page is fetched or parsed. + +```rust +use std::io::{BufWriter, Write}; + +let stdout = std::io::stdout(); +let mut writer = BufWriter::new(stdout.lock()); + +for page_result in extractor.pages() { + let page = page_result?; + // One compact JSON object per line (NDJSON). + serde_json::to_writer(&mut writer, &page)?; + writer.write_all(b"\n")?; + // Hand the finished page to the consumer before extracting the next one. + writer.flush()?; +} +``` + +`BufWriter` amortizes the system-call cost of the many small writes that `serde_json` issues while serializing a page. The `flush()` after each page ensures the consumer receives complete objects as they are produced rather than waiting for the buffer to fill. + +**Tradeoff.** Streaming output precludes any feature requiring a full-document pass before emitting output: assembling the document outline, resolving cross-page table structures, or applying page labels to page numbers. These features require either a pre-pass (§8) or a second pass over already-extracted data.
If neither is acceptable, emit those features at the end of the stream as a final summary object. + +--- + +## 8. Pre-Pass for Document-Level Features + +When streaming output is requested, a lightweight pre-pass fetches the document catalog and a small set of document-level structures before per-page streaming begins: + +- **Document catalog**: contains `Outlines`, `PageLabels`, `AcroForm`, `Metadata` (XMP), and `MarkInfo` references. The catalog object is listed in the main trailer's `Root` entry — fetch it with a single seek using the main xref. +- **Outline tree**: the `Outlines` dictionary with its full `First`/`Next`/`Last` child chain. For typical documents this is a few dozen objects; fetch them all upfront. +- **Page labels**: a small number tree in `PageLabels`; fetch and resolve once. +- **XMP metadata**: a single stream object referenced from `Metadata`. + +For linearized files, the hint stream's shared object table may include the catalog's dependents. For non-linearized files, these objects are clustered near the main xref at the end of the file and can be fetched in a single range request covering the last 64 KB. + +After the pre-pass, emit one NDJSON line containing a `DocumentMetadata` object, then begin per-page streaming. + +--- + +## 9. Partial File Extraction + +For truncated downloads or interrupted network reads, `pdftract` extracts all pages whose object byte ranges fall within the available bytes. + +Detection using the hint stream page offset table is direct: for page `i`, if `page_offset[i] + page_length[i] <= available_bytes`, the page is extractable. Iterate until the condition fails. + +Output metadata for partial extractions: + +```json +{"type": "metadata", "partial": true, "pages_extracted": 12, "pages_total": 500} +``` + +For linearized files, page 1 is always available from any file at least `E` bytes long. A file truncated to its first few kilobytes still yields the first page. 
For non-linearized files, page availability depends entirely on xref accessibility; if `startxref` is missing (file truncated before the end), attempt to reconstruct the xref by scanning for `obj` keywords, although this belongs to malformed PDF recovery rather than linearization handling. + +--- + +## 10. Implementation: Lazy Page Iterator + +The Rust API for streaming extraction exposes a lazy iterator: + +```rust +// `Result<T>` is assumed to be a crate-level alias with a fixed error type. +pub struct PdfExtractor { /* ... */ } + +impl PdfExtractor { + pub fn pages(&mut self) -> PageIter<'_> { /* ... */ } +} + +pub struct PageIter<'a> { + extractor: &'a mut PdfExtractor, + page_index: usize, + xref: ParsedXref, + graphics_state: GraphicsState, +} + +impl<'a> Iterator for PageIter<'a> { + type Item = Result<PageExtraction>; + + fn next(&mut self) -> Option<Self::Item> { + if self.page_index >= self.extractor.page_count() { + return None; + } + self.graphics_state.reset(); + // Mutable borrow: interpreting the content stream updates the state. + let result = self.extractor.extract_page(self.page_index, &self.xref, &mut self.graphics_state); + self.page_index += 1; + Some(result) + } +} +``` + +Key design points: + +- `graphics_state.reset()` at each page boundary discards font state and CTM from the prior page; graphics state never carries over between PDF pages (only resource dictionaries are inherited, through the page tree). +- For linearized files, `extract_page` uses the hint stream to compute the byte range for each page, issuing a range fetch (or a seek on a local file) on demand. +- For standard files, `extract_page` uses xref offsets directly. +- The iterator holds a `&mut PdfExtractor` rather than `Arc<Mutex<PdfExtractor>>` to avoid lock contention in the single-threaded path. + +**Parallel extraction.** `rayon`'s `par_bridge()` converts any `Send` iterator into a parallel iterator, but it does not preserve the original order: + +```rust +use rayon::iter::{ParallelBridge, ParallelIterator}; + +// Results arrive in arbitrary order; sort by page number afterwards if needed. +extractor.pages() + .par_bridge() + .map(|page_result| page_result.map(render_page)) + .collect::<Result<Vec<_>>>()?; +``` + +`par_bridge()` pulls items from `next()` one at a time under a lock, so extraction inside `next()` stays sequential and only the `map` stage runs in parallel. Because results arrive in whatever order worker threads finish, sort the collected output by page number when order matters.
For I/O-bound extraction (remote range fetches), rayon's worker threads would spend most of their time blocked on the network; prefer async concurrency over rayon for the HTTP case (for example `tokio::join!` across a fixed set of range fetches). For CPU-bound extraction (complex content streams from a local file), rayon's thread pool is appropriate. + +--- + +## Summary + +Linearized PDFs expose byte-level structure that enables three extraction optimizations: first-page extraction from the initial byte range alone, demand-driven page fetching via hint stream offsets, and partial-file extraction from truncated downloads. The critical implementation discipline is validating the `L` key before trusting any hint offset, falling back to main xref parsing when linearization is stale, and always treating the most recently appended xref section in the incremental update chain as authoritative. The lazy `PageIter` API makes these optimizations composable with NDJSON streaming output and optional document-level pre-pass metadata.
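Section 2 notes that decoding the hint tables needs a bit-level reader over the decompressed stream buffer. A minimal sketch of such a reader follows, assuming fields are packed most-significant bit first with no byte alignment between them; `BitReader` and its methods are hypothetical names, not part of any existing crate.

```rust
/// Reads fixed-width unsigned fields from a byte buffer, MSB first.
/// Hypothetical helper: illustrative only, not pdftract's actual type.
pub struct BitReader<'a> {
    data: &'a [u8],
    bit_pos: usize, // absolute bit position from the start of `data`
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, bit_pos: 0 }
    }

    /// Reads `width` bits (0..=32) as an unsigned integer.
    /// Returns `None`, without consuming anything, if `width` exceeds 32
    /// or fewer than `width` bits remain.
    pub fn read_bits(&mut self, width: u32) -> Option<u32> {
        if width > 32 {
            return None;
        }
        let end = self.bit_pos.checked_add(width as usize)?;
        if end > self.data.len() * 8 {
            return None;
        }
        let mut value: u64 = 0;
        for pos in self.bit_pos..end {
            let byte = self.data[pos / 8];
            // Bit 0 of the stream is the high-order bit of the first byte.
            let bit = (byte >> (7 - (pos % 8))) & 1;
            value = (value << 1) | u64::from(bit);
        }
        self.bit_pos = end;
        Some(value as u32)
    }

    /// Advances to the next byte boundary (no-op if already aligned).
    pub fn align_to_byte(&mut self) {
        self.bit_pos = (self.bit_pos + 7) / 8 * 8;
    }

    /// Number of unread bits left in the buffer.
    pub fn remaining_bits(&self) -> usize {
        self.data.len() * 8 - self.bit_pos
    }
}
```

A hint table decoder would read the header's bit widths first and then call `read_bits` with those widths once per page or per shared object entry, treating a `None` result as a truncated or corrupt table.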