diff --git a/docs/research/benchmark-and-test-methodology.md b/docs/research/benchmark-and-test-methodology.md new file mode 100644 index 0000000..81956d1 --- /dev/null +++ b/docs/research/benchmark-and-test-methodology.md @@ -0,0 +1,165 @@ +# Benchmark and Test Methodology for PDF Text Extraction + +## 1. Why Benchmarking Matters + +PDF text extraction has no agreed-upon standard benchmark. Without one, it is impossible to compare extraction strategies objectively, communicate quality guarantees to users, or detect when a code change causes a regression. A library can claim "high accuracy" while measuring only on clean born-digital PDFs and silently failing on scanned documents or complex table layouts. + +A complete benchmark must cover multiple orthogonal quality dimensions: + +- **Character accuracy** — are the correct Unicode codepoints recovered? +- **Word accuracy** — are word boundaries preserved after ligature expansion and whitespace reconstruction? +- **Reading order correctness** — does the extracted sequence match human reading order, not PDF paint order? +- **Table structure accuracy** — are row/column relationships preserved across merged cells? +- **Form field extraction** — are AcroForm field names, values, and types correctly recovered? +- **Metadata correctness** — does the XMP/DocInfo metadata round-trip without truncation or encoding errors? + +The risk of single-metric optimization is real. Tuning for character error rate on clean PDFs often involves aggressive Unicode normalization that destroys mathematical symbols or CJK ideographs. Tuning for table extraction can introduce extraneous whitespace that degrades WER on prose documents. A benchmark suite must surface these trade-offs rather than hide them. + +--- + +## 2. Ground Truth Corpus Construction + +Ground truth can be obtained through four approaches, each with distinct tradeoffs. 
+ +**Synthetic PDFs from known text.** A PDF library (e.g., `printpdf`, `lopdf`, or Python's `reportlab`) generates PDFs programmatically from a UTF-8 source string. Because the source is known exactly, comparison is unambiguous and deterministic. Synthetic documents are cheap to generate at scale and cover arbitrary scripts and layouts. Their weakness is that they do not capture real-world PDF quirks: embedded CMaps with broken ToUnicode entries, overlapping glyphs, scanned images masquerading as text layers. + +**Manually verified human-labeled PDFs.** A human reads the PDF and produces a ground-truth text file, recording the expected extraction character-for-character. This captures real documents but is expensive: expert annotators typically label 2–5 pages per hour for dense academic material. Inter-annotator agreement for ambiguous whitespace or hyphenation decisions is rarely above 95%, introducing irreducible noise into the ground truth. + +**Round-trip from source documents.** When the authoring source is available (LaTeX `.tex` files, Word `.docx`, LibreOffice `.odt`), the plain-text content can be derived from the source rather than re-annotated. LaTeX is particularly clean: stripping macros and math yields the expected prose. The limitation is that PDF layout engines can reflow, hyphenate, and kern text differently from the source, so extracted text is legitimately different from source text without being wrong. + +**Crowd-sourced annotation.** Platforms like Amazon Mechanical Turk or Label Studio can produce annotations at scale with majority-vote aggregation. Quality is lower than expert annotation but suitable for coarse WER measurement on large corpora. Reject outlier annotators with high per-document disagreement. + +**Minimum corpus size.** For CER/WER to have 95% confidence intervals narrower than ±1 percentage point, a corpus of 500–1000 pages across diverse categories is the practical minimum. 
Fewer pages produce wide intervals that make small improvements statistically indistinguishable from noise. + +--- + +## 3. Corpus Categories + +Different document types stress different extraction code paths. A representative corpus must include: + +- **Academic papers** — multi-column layouts, inline math, reference lists with dense hyperlinking, footnotes interleaved with body text. +- **Financial filings** — SEC 10-K/10-Q documents with nested tables, numerical columns, boilerplate legal paragraphs, and XBRL-tagged inline content. +- **Legal documents** — dense prose, numbered exhibits as appendix PDFs, footnotes with hierarchical numbering, redacted (blacked-out) regions. +- **Scanned historical documents** — OCR-rendered image-only PDFs, degraded scan quality, skewed pages, handwritten marginalia. +- **Forms** — AcroForm with checkboxes, radio buttons, combo boxes, text fields, digital signature widgets. +- **Technical manuals** — figures with captions, sidebars offset from main text flow, numbered step lists, code blocks rendered as images. +- **Multilingual documents** — Arabic/Hebrew right-to-left text, CJK ideographs with vertical typesetting options, mixed-script documents. +- **Born-digital word processor output** — PDFs exported from Word, LibreOffice, or Google Docs, representing the dominant document type in enterprise use. + +--- + +## 4. Character Error Rate (CER) + +CER is the standard metric inherited from OCR research. It is defined as: + +``` +CER = (S + D + I) / N +``` + +where S is substitutions, D is deletions, I is insertions at the character level, and N is the number of characters in the ground-truth string. This is the normalized Levenshtein edit distance between the extracted and reference character sequences. + +Before computing CER, normalize whitespace: collapse runs of spaces and newlines into a single space, strip leading/trailing whitespace per paragraph, and optionally Unicode-normalize both strings to NFC. 
Failing to normalize causes inflated CER from formatting differences rather than extraction errors. + +For efficient computation over long documents, use the `rapidfuzz` library (available as a Python package and as a Rust crate of the same name), or implement the Wagner-Fischer DP with O(min(m,n)) space. For a 10,000-character document page, naive O(mn) DP is fast enough; for full-document comparisons exceeding 100,000 characters, partition by paragraph and sum. + +Report CER broken down by corpus category and compute a weighted overall CER where each category is weighted by its share of the corpus page count. A single overall CER hides category-specific failures. + +--- + +## 5. Word Error Rate (WER) + +WER tokenizes both extracted and reference text into word tokens and computes the edit distance at the word level: + +``` +WER = (S_w + D_w + I_w) / N_w +``` + +WER is more meaningful than CER for downstream NLP pipelines (named entity recognition, summarization, retrieval) because word-level errors map directly to missed or corrupted tokens. + +Tokenization decisions matter. Punctuation attached to words (`"end."`) should be stripped or split into a separate token before comparison — otherwise a missing period inflates WER by creating a substitution (`end.` → `end`). A consistent tokenization scheme must be documented and applied identically to both extracted and ground-truth text. + +For CJK scripts (Chinese, Japanese, Korean), word boundaries are not marked by whitespace. WER is undefined without a word segmenter (e.g., MeCab for Japanese, jieba for Chinese). Use CER only for CJK content. For Arabic and Hebrew, apply a morphological tokenizer if available; otherwise use whitespace tokenization with appropriate caveats noted in the report. + +--- + +## 6. Reading Order Accuracy + +Extracting correct text is necessary but insufficient if that text appears in the wrong sequence. 
A PDF stores content streams in paint order, which frequently diverges from reading order in multi-column layouts, sidebars, or documents with footnotes. + +The ground truth encodes an explicit word ordering: a sequence `w_1, w_2, ..., w_n` in human reading order. The extractor produces its own sequence `e_1, e_2, ..., e_m`. To measure alignment, compute **Kendall's τ** rank correlation between the ground-truth position of each word and its position in the extracted sequence. τ = 1.0 indicates perfect order; τ ≈ 0 indicates no correlation with the reading order (as in a random shuffle); τ = −1.0 indicates fully reversed order. + +For documents where word identity is ambiguous (repeated words), use a longest-common-subsequence alignment to match ground-truth words to extracted words before computing rank correlation. + +Report per-page reading order τ, and flag pages with τ < 0.8 as layout failures. Two-column academic papers are the canonical hard case and should constitute at least 20% of the reading order sub-corpus. + +--- + +## 7. Table Extraction Metrics + +Tables require structure metrics beyond string edit distance. The standard is **TEDS (Tree Edit Distance based Similarity)**: + +1. Represent each table as a tree: the root is the table node, children are rows, each row's children are cells. Cells carry `rowspan` and `colspan` attributes and a text payload. +2. Compute the (unnormalized) tree edit distance between the extracted tree and the ground-truth tree using the Zhang-Shasha algorithm. +3. Normalize by the larger tree: `TEDS = 1 − (tree_edit_distance / max(|T_gt|, |T_extracted|))` where `|T|` is the node count. + +TEDS ranges from 0 to 1, with 1 indicating perfect structural and content match. + +Report TEDS alongside two supplementary metrics: + +- **Cell-level text accuracy** — for cells matched by structural alignment, compute CER on cell contents. This separates structural errors from text extraction errors within correctly located cells. 
+- **Header detection precision/recall** — label which rows are headers in the ground truth, and measure how accurately the extractor identifies them. False-positive header detection (promoting body rows) is the most common failure mode. + +--- + +## 8. Regression Testing Infrastructure + +The benchmark corpus is too large to run on every commit. The regression suite is a fast-path subset: 50–100 deterministic PDFs (synthetic PDFs covering edge cases plus a curated set of real PDFs with stable known output) with expected JSON stored in the repository. + +Each test case produces structured output: + +```json +{ + "pages": [...], + "metadata": {...}, + "tables": [...], + "form_fields": [...] +} +``` + +Use the `insta` crate for snapshot testing. On first run, `insta` captures the JSON output as a pending snapshot file, which is committed once accepted. On subsequent runs, any deviation causes the test to fail and `cargo insta review` presents a diff for human approval. This prevents silent regressions while allowing intentional changes to be reviewed and accepted explicitly. + +CI integration uses the Argo Workflows system. The workflow step runs `cargo test` and `cargo insta test --check --unreferenced reject`, failing the build on any unreviewed snapshot change or orphaned snapshot file. The full benchmark suite (all corpus categories, all metrics) runs nightly rather than per-commit, with results posted to a persistent store for trend visualization. + +--- + +## 9. Existing Public Test Corpora + +Several public datasets provide ready-made ground truth for specific document categories: + +- **PDF Association test suite (pdfa.org/test-suite)** — conformance tests for PDF specification compliance; useful for metadata and structure correctness, not extraction quality. +- **PRImA Layout Analysis Dataset** — scanned newspaper and magazine pages with ground-truth layout regions and reading order. Strong for multi-column layout and region segmentation evaluation. +- **FUNSD** — 199 noisy scanned forms with field-level annotations. 
Small but directly applicable to form extraction evaluation; free for research use. +- **PubLayNet** — 360,000 academic paper pages from PubMed with region-level annotations (text, title, list, figure, table). Token-level text is not included, but layout regions are. +- **DocBank** — 500,000 academic paper pages from arXiv with token-level annotations extracted by aligning LaTeX source to PDF rendering. The best available resource for reading order and fine-grained text annotation. +- **DeepForm** — scanned political-advertising disclosure forms filed by U.S. TV stations with the FCC, with field-level ground truth. Useful for invoice-style financial document extraction and form field accuracy, though the extraction targets are specific named fields rather than full-document text. + +Each dataset has limitations: PubLayNet lacks text content; DocBank is academic-only; FUNSD is small and noisy; DeepForm covers a narrow financial niche. A production benchmark corpus should draw from all of them and supplement with synthetically generated documents to fill gaps. + +--- + +## 10. Performance Benchmarks + +Extraction quality metrics are necessary but not sufficient. A library that achieves 99% character accuracy (1% CER) at 0.1 pages/second is not production-viable. Track throughput and memory alongside accuracy. + +**Metrics to track:** + +- **Pages/second** — primary throughput metric; measure on a fixed corpus of representative PDFs. +- **MB/second** — file size throughput; useful for comparing against I/O overhead. +- **Peak RSS per document** — critical for large PDF handling; a document should not require more than 10× its file size in memory. +- **Time-to-first-page** — for streaming APIs; measures latency before any output is available. + +Use the `criterion` crate for statistically rigorous benchmarking. Criterion runs each benchmark function multiple times, discards warm-up iterations, and computes mean and confidence intervals. 
Store benchmark results in a JSON history file (committed or artifact-stored) and compare each run against the baseline commit. + +Define acceptable regression thresholds: a throughput drop greater than 5% on the representative corpus triggers mandatory investigation before merge. Memory regressions greater than 10% on any document category also block merge. These thresholds should be enforced in CI by a script that reads Criterion's comparison output and exits non-zero on threshold violation. + +Benchmark PDFs must be fixed and versioned — using randomly selected documents introduces variance across runs. Commit a set of 10–20 representative PDFs (covering each corpus category) as binary fixtures in the repository, kept small enough (total < 10 MB) that checkout time is not impacted. diff --git a/docs/research/embedded-files-and-portfolios.md b/docs/research/embedded-files-and-portfolios.md new file mode 100644 index 0000000..141e420 --- /dev/null +++ b/docs/research/embedded-files-and-portfolios.md @@ -0,0 +1,242 @@ +# Embedded Files and PDF Portfolios + +**Project:** pdftract — Rust PDF text extraction library +**Scope:** Handling PDFs that carry embedded files (attachments) and PDF Portfolio/Collection containers + +--- + +## 1. Embedded File Streams + +An embedded file is a PDF stream object with `/Type /EmbeddedFile`. It carries the raw bytes of the attached file in its (usually compressed) stream body, exactly like any other stream object. + +**Stream dictionary keys of interest:** + +| Key | Type | Description | +|-----|------|-------------| +| `/Subtype` | name | MIME type written as a PDF name, with the slash escaped as `#2F`, e.g. `/application#2Fxml`, `/text#2Fplain`. Optional but common. | +| `/Params` | dictionary | File metadata: `/Size` (integer, uncompressed byte count), `/CreationDate` and `/ModDate` (PDF date strings), `/CheckSum` (MD5 digest as a 16-byte string). | +| `/Filter` | name or array | Decompression chain applied to the stream. 
Typically `/FlateDecode`; may be a multi-stage chain like `[/ASCII85Decode /FlateDecode]`. | +| `/Length` | integer | Compressed byte count in the file. Always present. | + +To read the raw bytes of an embedded file: locate the stream object, apply the `/Filter` chain in sequence (outermost filter first in array order), and the result is the uncompressed file payload. The `/Params` `/Size` entry should match the decompressed length — use it as a sanity check. + +--- + +## 2. File Specification Dictionaries + +A **Filespec dictionary** wraps an embedded file stream and carries filename and relationship metadata. It is the object that the rest of the document refers to, not the raw stream directly. + +``` +<< + /Type /Filespec + /F (invoice.xml) % PDFDocEncoding filename + /UF % UTF-16BE Unicode filename (preferred) + /Desc (Factur-X invoice XML) % Human-readable description + /AFRelationship /Data % Semantic relationship to the document (PDF 2.0) + /EF << + /F 12 0 R % EmbeddedFile stream for /F name + /UF 12 0 R % May point to the same or different stream + >> +>> +``` + +**`/EF` sub-dictionary** maps the platform filename keys (`/F`, `/UF`, `/DOS`, `/Mac`, `/Unix`) to indirect references to EmbeddedFile stream objects. In modern PDFs, `/F` and `/UF` point to the same stream; the platform-specific keys are legacy. + +**`/AFRelationship`** (PDF 2.0, ISO 32000-2 §7.11.3) declares how the embedded file relates to the containing PDF: + +| Value | Meaning | +|-------|---------| +| `Source` | The embedded file is the source for generating this PDF. | +| `Data` | Structured data that this PDF was generated from (e.g. XML invoice). | +| `Alternative` | Alternative representation of document content. | +| `Supplement` | Supplemental information. | +| `EncryptedPayload` | Encrypted payload, used in PDF encryption workflows. | +| `FormData` | Data submitted from a form. | +| `Schema` | An XSD or schema file describing another attachment. 
| +| `Unspecified` | No declared relationship. | + +--- + +## 3. Document-Level Attachments — The `EmbeddedFiles` Name Tree + +Document-level attachments are registered in the document catalog under `/Names` → `EmbeddedFiles`. This is a PDF **name tree** — a B-tree structure mapping UTF-8 (or PDFDocEncoding) string keys to Filespec dictionary references. + +**Name tree structure:** + +- **Leaf node:** `<< /Names [ (key1) ref1 (key2) ref2 ... ] >>` +- **Intermediate node:** `<< /Limits [ (first-key) (last-key) ] /Kids [ ref ... ] >>` + +To iterate all entries, walk the tree depth-first: if a node has `/Kids`, recurse into each child; if a node has `/Names`, extract the key/value pairs in sequence. Keys are the filename strings used for display; values are indirect references to Filespec dictionaries. + +Navigation from catalog: + +``` +Catalog → /Names → /EmbeddedFiles → (name tree root node) +``` + +A PDF can have document-level attachments without being a Portfolio. Check for the `/Collection` key (§6 below) to distinguish the two cases. + +--- + +## 4. Annotation-Based Attachments + +A `FileAttachment` annotation attaches a file to a specific page location. It appears in the page's `/Annots` array with `/Subtype /FileAttachment`. + +**Relevant annotation keys:** + +| Key | Description | +|-----|-------------| +| `/FS` | Indirect reference to a Filespec dictionary. | +| `/Name` | Icon style hint: `PushPin`, `Graph`, `Paperclip`, `Tag`. | +| `/Contents` | Tooltip/description string shown on hover. | + +The Filespec under `/FS` is identical in structure to document-level Filespecs. To extract the file bytes, follow `/FS` → Filespec → `/EF` → EmbeddedFile stream → decompress. + +When iterating pages for text extraction, collect all `/Annots` entries and filter for `/Subtype /FileAttachment`. Record the page number as the `page` field in the attachment output (see §9). + +--- + +## 5. 
Associated Files (PDF 2.0) + +PDF 2.0 (ISO 32000-2) introduces the `/AF` (associated files) array. Unlike `EmbeddedFiles`, which is a document-wide name tree, `/AF` arrays appear directly on specific objects and express a tighter structural association. + +**Where `/AF` can appear:** + +- Document catalog (document-wide association) +- Page dictionaries (page-specific file) +- Content stream dictionaries +- XObject dictionaries +- Structure element dictionaries (tagged PDF) + +Each entry in an `/AF` array is an indirect reference to a Filespec dictionary carrying an `AFRelationship`. + +**Key use cases:** + +- **PDF/A-3** (ISO 19005-3): requires that embedded source data be declared via `/AF` with `AFRelationship /Data` or `/Source`. PDF/A-3 is the conformance level that permits arbitrary embedded files; PDF/A-1 forbids embedded files entirely, and PDF/A-2 permits only attachments that are themselves PDF/A-conformant. +- **ZUGFeRD / Factur-X**: a PDF/A-3 invoice with an embedded XML file referenced via `/AF` on the catalog and also present in `EmbeddedFiles`. Both access paths should be checked. +- **PDF/UA-2**: MathML representations of mathematical content can be attached via `/AF` on structure elements with `AFRelationship /Alternative`. + +When processing a document, collect `/AF` arrays from all of these object types. Deduplicate by object number — the same Filespec may be referenced from multiple `/AF` arrays. + +--- + +## 6. PDF Portfolio / Collection + +A PDF Portfolio (PDF 1.7+, ISO 32000-1 §12.3.5) is a PDF whose catalog contains a `/Collection` dictionary. The Portfolio container acts as a navigator shell; the actual attached documents are Filespecs in the `EmbeddedFiles` name tree. + +**Detection:** `Catalog → /Collection` present → this is a Portfolio. + +**`/Collection` dictionary keys:** + +| Key | Description | +|-----|-------------| +| `/Schema` | Dictionary of column field definitions (name, type, visibility, order) for the Portfolio UI. Each entry maps a field name to a collection field dictionary such as `<< /Subtype /S /N (label) /O 1 /V true >>`, where `/Subtype` is `/S` (string), `/D` (date), `/N` (number), or `/F` (file name), `/N` is the displayed label, `/O` the column order, and `/V` the visibility flag. 
| +| `/D` | The default document to display — a string key into the `EmbeddedFiles` name tree, or a string naming the cover sheet. | +| `/View` | Preferred initial view: `D` (details), `T` (tile), `H` (hidden). | +| `/Navigator` | Indirect reference to a Filespec pointing to a Navigator PDF (a separate PDF providing the portfolio UI shell). | +| `/Sort` | Specifies the default sort column and order. | + +**Portfolio items** are regular Filespecs in the `EmbeddedFiles` name tree. The `/Collection` dictionary's `/Schema` assigns metadata field names; individual Filespecs carry those field values in their `/CI` (collection item) dictionary. + +**Navigator PDF:** Some Portfolios embed a separate PDF as the visual container/shell. This Navigator PDF is itself an embedded file referenced from `/Collection /Navigator`. It should be treated as infrastructure rather than a content document — do not recurse into it for text extraction unless explicitly requested. + +**Distinguishing Portfolio from a PDF with attachments:** if `/Collection` is present in the catalog, treat all `EmbeddedFiles` entries as Portfolio items (each is a first-class document). Without `/Collection`, `EmbeddedFiles` entries are attachments supplementary to the parent document's content. + +--- + +## 7. Recursive Extraction + +When an embedded file's MIME type is `application/pdf` (declared in `/Subtype`) or detected as PDF by magic bytes (`%PDF-`), the file can be extracted and parsed as a standalone PDF. + +**Depth limiting:** maintain a recursion depth counter; enforce a configurable maximum (default: 3). Beyond the limit, record the file in the attachment list with a `recursion_limit_reached` flag but do not parse it. This prevents pathological inputs with circular or deeply nested PDFs from exhausting memory. + +**Binary file handling:** identify the MIME type from `/Subtype` and confirm with magic byte inspection. 
Files whose MIME type is not `application/pdf` and whose content does not begin with `%PDF-` are binary attachments — extract bytes and metadata only, do not attempt PDF parsing. + +**Object streams and cross-reference:** each recursively parsed PDF is an independent document with its own cross-reference table and object numbering. Do not share object caches across recursion levels. + +--- + +## 8. ZUGFeRD / Factur-X + +ZUGFeRD (DE) and Factur-X (FR) are electronic invoicing standards that encode a human-readable PDF invoice (conforming to PDF/A-3) with an embedded XML invoice payload conforming to EN 16931 / UN/CEFACT CII Cross Industry Invoice. + +**Detection pattern:** + +1. Catalog has `/AF` array (PDF/A-3 conformance). +2. `EmbeddedFiles` name tree contains an entry with filename `factur-x.xml` (Factur-X) or `zugferd-invoice.xml` / `ZUGFeRD-invoice.xml` (ZUGFeRD). +3. The Filespec's `AFRelationship` is `/Data`. +4. MIME type is `application/xml` or `text/xml`. + +**Extraction value:** for financial document processing pipelines, the embedded XML carries structured data (line items, amounts, VAT, party identifiers) that is far more reliable than text extracted from the PDF rendering. pdftract should expose this XML in the `attachments` array output alongside the extracted text, enabling callers to consume both without reparsing the PDF. + +A secondary indicator is the PDF/A conformance metadata in the XMP stream (`pdfaid:part = 3`, `pdfaid:conformance = B` or `U`). + +--- + +## 9. Extraction Policy and Output + +**JSON output schema for attachments:** + +```json +{ + "attachments": [ + { + "filename": "factur-x.xml", + "mime_type": "application/xml", + "size_bytes": 14230, + "description": "Factur-X structured invoice", + "af_relationship": "Data", + "page": null, + "data_b64": "" + } + ] +} +``` + +| Field | Notes | +|-------|-------| +| `filename` | From `/UF` if present, otherwise `/F`. | +| `mime_type` | From EmbeddedFile `/Subtype`; `null` if absent. 
| +| `size_bytes` | From `Params /Size`; `null` if absent. | +| `description` | From Filespec `/Desc`; `null` if absent. | +| `af_relationship` | String value of `AFRelationship`; `null` if not PDF 2.0. | +| `page` | 0-based page index for annotation-sourced attachments; `null` for document-level. | +| `data_b64` | Present only when `extract_attachments: true` in caller options. | + +**Caller options:** + +```rust +pub struct ExtractionOptions { + /// Include attachment byte payloads (base64) in output. + pub extract_attachments: bool, + /// Recursively extract text from PDF attachments. + pub recursive: bool, + /// Maximum recursion depth for recursive PDF extraction. + pub max_recursion_depth: u8, +} +``` + +Regardless of `extract_attachments`, always populate the metadata fields (`filename`, `mime_type`, `size_bytes`, etc.) so callers can make informed decisions about which files to retrieve. + +--- + +## 10. Security Considerations + +Embedded files are opaque byte payloads from potentially untrusted sources. pdftract must extract bytes without executing or interpreting them. + +**Execution surface:** never pass embedded file bytes to a shell, interpreter, or OS loader. The extraction pipeline is: locate stream → apply `/Filter` decompression chain → write raw bytes. No further processing. + +**MIME type verification:** the declared `/Subtype` in the EmbeddedFile stream is author-controlled and untrustworthy. Perform independent MIME detection from the first 512 bytes of the decompressed payload using magic byte patterns. If the detected type differs materially from the declared type, record both and emit a warning. 
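
The MIME verification step above can be sketched in a few lines of Rust. This is a minimal illustration, not pdftract's actual implementation: the function names `sniff_mime`, `is_high_risk`, and `security_flags` are hypothetical, the signature table is deliberately tiny, and "differs materially" is simplified to a case-insensitive string comparison (so `text/xml` versus `application/xml` would be reported as a mismatch unless aliases are added).

```rust
/// Illustrative magic-byte table (prefix, MIME type). Not exhaustive.
const SIGNATURES: &[(&str, &str)] = &[
    ("%PDF-", "application/pdf"),
    ("\x7FELF", "application/x-elf"),
    ("MZ", "application/x-msdownload"),
    ("PK\x03\x04", "application/zip"),
    ("<?xml", "application/xml"),
    ("#!", "application/x-sh"),
];

/// A subset of the high-risk types flagged by this document.
const HIGH_RISK: &[&str] = &[
    "application/javascript",
    "text/javascript",
    "application/x-executable",
    "application/x-elf",
    "application/x-mach-binary",
    "application/x-msdownload",
    "application/x-sh",
    "application/x-java-archive",
];

/// Detect a MIME type from the first 512 bytes of the decompressed payload.
/// Returns `None` when no known signature matches.
pub fn sniff_mime(payload: &[u8]) -> Option<&'static str> {
    let head = &payload[..payload.len().min(512)];
    for (magic, mime) in SIGNATURES {
        if head.starts_with(magic.as_bytes()) {
            return Some(*mime);
        }
    }
    None
}

/// True when the MIME type appears in the high-risk list.
pub fn is_high_risk(mime: &str) -> bool {
    HIGH_RISK.iter().any(|risky| risky.eq_ignore_ascii_case(mime))
}

/// Compute the `security_flags` entries for one attachment, given the
/// author-declared `/Subtype` (if any) and the decompressed bytes.
pub fn security_flags(declared: Option<&str>, payload: &[u8]) -> Vec<&'static str> {
    let detected = sniff_mime(payload);
    let mut flags = Vec::new();
    // A mismatch is only reported when both a declared and a detected type exist.
    if let (Some(d), Some(s)) = (declared, detected) {
        if !d.eq_ignore_ascii_case(s) {
            flags.push("mime_mismatch");
        }
    }
    // Flag the attachment if either the declared or the detected type is risky.
    if declared.map_or(false, is_high_risk) || detected.map_or(false, is_high_risk) {
        flags.push("high_risk_type");
    }
    flags
}
```

Under these assumptions, a payload starting with `MZ` that is declared as `application/xml` would carry both flags, while a genuine PDF declared as `application/pdf` would carry none. Checking both the declared and the detected type against the high-risk list is a design choice made here so that a benign-looking declaration cannot hide an executable payload.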
+ +**High-risk MIME types to flag:** + +- `application/javascript`, `text/javascript` +- `application/x-executable`, `application/x-elf`, `application/x-mach-binary`, `application/x-msdownload` +- `application/x-sh`, `application/x-bat`, `application/x-powershell` +- `application/x-java-archive` (JAR files can contain auto-run manifests) + +**Output field:** add `"security_flags": ["mime_mismatch", "high_risk_type"]` to the attachment JSON object when either condition is detected. + +**Decompression safety:** the `/Filter` chain may cause decompression bombs. Enforce a maximum decompressed size (configurable, default 256 MB per file). Abort decompression and record a `decompression_limit_exceeded` flag if the limit is reached. Streaming decompression (rather than full buffer allocation) is preferred — decompress incrementally and track bytes written. + +**Name traversal:** Filespec `/F` and `/UF` filenames may contain path separators or `..` sequences. When writing extracted files to disk (if pdftract exposes a save-to-disk API), sanitize filenames by stripping directory components and rejecting names that resolve outside the target directory. diff --git a/docs/research/form-fields-and-annotations.md b/docs/research/form-fields-and-annotations.md new file mode 100644 index 0000000..a573d21 --- /dev/null +++ b/docs/research/form-fields-and-annotations.md @@ -0,0 +1,170 @@ +# Form Fields and Annotations: AcroForm, XFA, and Annotation Text Extraction + +## 1. AcroForm Overview + +The document catalog (`/Type /Catalog`) may contain an `/AcroForm` dictionary. This dictionary is the root of all interactive form machinery in the document. Its primary entries are: + +- **`Fields`** — an array of indirect references to field dictionaries that are direct children of the field hierarchy (the root fields). Fields not referenced here are reachable only through their parent's `Kids` array. 
+- **`DA`** — a document-level default appearance string, used when a field lacks its own `DA`. +- **`DR`** — a resource dictionary shared across all fields; typically contains the `/Font` sub-dictionary mapping font names used in appearance strings. +- **`NeedAppearances`** — a boolean flag; when `true`, viewers must regenerate appearance streams before rendering. A library performing text extraction should not depend on pre-generated appearances being present. +- **`XFA`** — present only in XFA documents; see section 5. + +The field hierarchy is a tree. Non-terminal (intermediate) nodes group related fields and act as inheritance sources; they carry no value themselves. Terminal fields are the leaf nodes — they define a field type and hold a value. The distinction is made by the presence or absence of the `/FT` (field type) entry: a terminal field has `/FT`; a non-terminal node may omit it, inheriting from a parent or leaving it undefined. + +**Inherited attributes.** A terminal field that lacks `DA`, `FT`, `Ff` (field flags), or `DV` inherits those values by walking up the `Parent` chain until a value is found or the chain is exhausted. When extracting field data, the implementation must perform this walk for each attribute independently. + +## 2. Field Types + +### 2.1 Text Fields (`/FT /Tx`) + +A text field stores a user-entered string. Key entries: + +- **`V`** — the current value, a string object (PDFDocEncoding or UTF-16BE with BOM). +- **`DV`** — the default value, same encoding rules as `V`. +- **`MaxLen`** — maximum number of characters permitted. +- **`DA`** — default appearance string: a content-stream fragment such as `/Helv 12 Tf 0 g`. The font name is resolved against the `DR` dictionary in `/AcroForm`. +- **`Ff` bit 13** (`Multiline`) — when set, the field accepts multiple lines of text. Extraction should preserve embedded newlines in `V`. 
+- **`Ff` bit 14** (`Password`) — the value should be treated as sensitive; some implementations may redact it. + +### 2.2 Button Fields (`/FT /Btn`) + +Three subtypes are distinguished by `Ff`: + +- **Pushbutton** (`Ff` bit 17 set) — carries no persistent value; its purpose is to trigger actions. No text value to extract. +- **Checkbox** — `V` holds the current appearance state name (e.g., `/Yes` or `/Off`); `DV` holds the default. The `AS` entry in the widget annotation mirrors the checked state and is the authoritative indicator when rendering; extraction should prefer `V` on the field, cross-referencing `AS` to confirm. +- **Radio group** — a non-terminal field node whose `Kids` are individual radio button widgets. Each kid widget has an `AS` entry whose value matches the export value when selected. The parent's `V` holds the export value of the currently selected option. To find the selected label, match `V` against the `AS` values of all kids. + +### 2.3 Choice Fields (`/FT /Ch`) + +- **`Opt`** — an array of options. Each element is either a string (the export value equals the display value) or a two-element array `[export_value, display_string]`. +- **`V`** — a string (single selection) or array of strings (multi-select when `Ff` bit 22 is set). Contains the export value of the selected option(s). +- **`TI`** — top index; the first visible option in a scrollable listbox. +- **`Ff` bit 18** — when set, the field is a combo box rather than a listbox. + +To extract the display text for a selection, locate the entry in `Opt` whose export value matches `V`. + +### 2.4 Signature Fields (`/FT /Sig`) + +`V` is a signature dictionary, not a string. Text extraction is out of scope for signature fields; record the field name and type, but emit no text value. + +## 3. Field Value Extraction + +**String decoding.** A string value in a field is encoded in either PDFDocEncoding or UTF-16BE. 
The BOM `\xFE\xFF` at the start of the byte sequence signals UTF-16BE; otherwise, assume PDFDocEncoding. Implement a lookup table for the 39 PDFDocEncoding code points that differ from Latin-1. + +**Text fields.** Read `V` directly. If `V` is null or absent, fall back to `DV`. If both are absent, emit an empty string. + +**Checkboxes.** Read `V` from the field; the value is a name object. Any value other than `/Off` (the conventional unchecked state) indicates a checked state. Confirm against `AS` in the widget annotation. + +**Radio buttons.** Read `V` from the parent field. Walk `Kids`; the selected kid is the one whose `AS` matches `V`. Emit the matching export value or display string from `Opt` if present. + +**Choice fields.** Read `V` (string or array). For each selected export value, find the corresponding display string in `Opt`. If `Opt` contains plain strings, export value equals display string. + +## 4. Widget Annotations + +Every terminal field is associated with one or more widget annotations. A field may merge with its single widget (the same dictionary object serves both roles) or the field may have a `Kids` array of separate widget dictionaries, each with `/Subtype /Widget`. Widgets carry: + +- **`Rect`** — a four-element array `[x1 y1 x2 y2]` in default user space units, giving the bounding box of the field on the page. This is the `bbox` used in output. +- **`P`** — indirect reference to the page object on which the widget appears. +- **`AP`** — appearance dictionary with up to three sub-dictionaries: `N` (normal), `R` (rollover), `D` (down). Each entry is either a Form XObject stream or a sub-dictionary keyed by appearance state names (used for checkboxes and radio buttons). + +**Extracting text from appearance streams.** When `V` is absent or when the document sets `NeedAppearances false` and has pre-generated streams, the `N` appearance stream is a Form XObject containing a content stream. 
This stream can be processed identically to a page content stream: extract text operators (`Tj`, `TJ`, `'`, `"`) using the font resources in the stream's own `/Resources` dictionary. This is the fallback path for fields whose value is encoded only in the rendered appearance. + +## 5. XFA Forms + +The `/XFA` entry in `/AcroForm` contains the XFA form data. Its value is either a single stream (the complete XFA document as XML) or an array of alternating name/stream pairs representing named XFA packets: + +``` +[ /xdp:xdp stream /template stream /datasets stream /config stream ... ] +``` + +XFA versions range from 2.0 through 3.3; the version is declared in the `xdp:xdp` root element's namespace URI. + +**Relevant packets:** + +- **`template`** — defines the form structure: field names, types, binding expressions, and layout. Field names follow XPath-like dot-notation (`form1.page1.subform1.field1`). +- **`datasets`** — contains the actual data bound to the template. The `xfa:data` element holds a tree of XML elements whose tag names and text content correspond to field values. + +**Extraction algorithm for XFA.** Parse the `datasets` XML; walk the element tree depth-first. For each leaf text node, construct its full path by joining ancestor element names with `.`. Emit `(path, text_content)` pairs. For structured arrays, the XFA spec uses sibling elements with the same tag name; track occurrence indices. + +**Hybrid XFA documents.** Some PDFs contain both `/XFA` and an AcroForm `Fields` array. The AcroForm fields serve as a compatibility layer for viewers that do not support XFA. When `/XFA` is present, prefer XFA data extraction; the AcroForm values may be stale or absent. + +## 6. Annotation Types Relevant to Text Extraction + +Annotations are listed in the `Annots` array of a page dictionary. Each annotation dictionary has `/Type /Annot` and a `/Subtype` that determines its semantics. 
+ +- **`Text`** (sticky note) — `Contents` holds the annotation text; `T` holds the author name; `RC` may hold rich text. +- **`FreeText`** — text rendered directly on the page surface. `Contents` is the plain text; `DA` and `DS` control styling; `RC` may carry formatted content. +- **Markup annotations** (`Highlight`, `Underline`, `Squiggly`, `StrikeOut`) — these reference existing page text via `QuadPoints`, an array of 8n numbers defining n quadrilaterals over the marked text. `Contents` carries the reviewer's comment. +- **`Link`** — `Contents` may hold descriptive text; the `A` entry holds an action dictionary (`/S /URI` with a `URI` string, or `/S /GoTo` with a destination). +- **`Stamp`** — `Contents` is the stamp text (e.g., "Approved"). +- **`Popup`** — associated with a markup annotation via `Parent`; `Contents` mirrors the parent's comment. Skip independently; capture through the parent. + +## 7. Rich Text (`RC` Field) + +The `RC` entry in both annotation dictionaries and text field dictionaries holds an XHTML-like string defined by PDF spec §12.7.3.4. The markup uses a restricted subset: `<body>`, `<p>`, `<span>` elements with inline style attributes (`font-family`, `font-size`, `font-weight`, `font-style`, `color`, `text-decoration`). + +**Plain-text extraction.** Parse the XML, discard all tags, and concatenate text node content. `<p>` boundaries map to newlines. `<br>` within a paragraph maps to a newline. + +**Formatted extraction.** For callers that want span metadata, capture each `<span>` with its computed style. The style attribute follows a semicolon-separated CSS-like syntax; parse it into a key-value map. Relevant keys: `font-weight: bold`, `font-style: italic`, `color: #rrggbb` or `color: rgb(r,g,b)`. + +When both `RC` and `Contents` are present, `RC` is the richer source. When `RC` is absent, fall back to `Contents`. + +## 8. Extracting Annotation Text + +**Iteration.** For each page, read the `Annots` array. Each element is an indirect reference to an annotation dictionary. Resolve each reference and filter by `Subtype`. + +**Fields to extract per annotation:** + +| Entry | Meaning | +|---|---| +| `Contents` | Primary text content | +| `RC` | Rich text override (parse for plain text) | +| `T` | Author / title | +| `Subtype` | Annotation kind | +| `Rect` | Bounding box on the page | +| `QuadPoints` | Highlighted region (markup annotations only) | + +**Spatial ordering.** To interleave annotation text with body text, compute the center of `Rect` (or the centroid of all `QuadPoints` quads) and sort annotations by their vertical position (descending `y`) then horizontal position (ascending `x`), matching the reading-order convention used for body text. + +**Markup annotation text recovery.** For `Highlight`, `Underline`, `Squiggly`, and `StrikeOut`, the `QuadPoints` array identifies the page content already extracted by the main text extraction pipeline. A library can optionally resolve these quads against the extracted glyph positions to return the marked span as a first-class excerpt, in addition to the `Contents` comment. + +## 9. Output Representation + +**Form fields.** Emit a top-level `form_fields` array. 
Each entry is a struct: + +```rust +pub struct FormField { + pub name: String, // fully qualified field name (dot-joined) + pub field_type: FieldType, // Tx | Btn | Ch | Sig + pub value: Option<String>, // decoded current value + pub default_value: Option<String>, + pub page: Option<usize>, // 0-indexed page number from widget P entry + pub bbox: Option<[f32; 4]>, // [x1, y1, x2, y2] from widget Rect +} +``` + +**Annotations.** Emit an `annotations` array per page: + +```rust +pub struct Annotation { + pub kind: AnnotationKind, // Text | FreeText | Highlight | ... + pub contents: Option<String>, + pub rich_text: Option<String>, // raw RC XML if present + pub author: Option<String>, + pub bbox: [f32; 4], + pub quad_points: Vec<[f32; 8]>, // populated for markup annotations +} +``` + +**Caller-controlled inclusion.** Expose boolean flags on the extraction configuration: + +```rust +pub struct ExtractionOptions { + pub extract_forms: bool, + pub extract_annotations: bool, + pub prefer_xfa: bool, // when XFA present, skip AcroForm field scan +} +``` + +When `extract_forms` is `false`, skip the AcroForm traversal entirely. When `extract_annotations` is `false`, skip the `Annots` array on each page. Both default to `true`. When `prefer_xfa` is `true` and `/XFA` is present, use XFA dataset extraction and suppress AcroForm field output to avoid duplicates. diff --git a/docs/research/image-and-figure-extraction.md b/docs/research/image-and-figure-extraction.md new file mode 100644 index 0000000..a8dcc90 --- /dev/null +++ b/docs/research/image-and-figure-extraction.md @@ -0,0 +1,182 @@ +# Image and Figure Extraction in PDF + +## 1. Image XObjects + +PDF images are most commonly embedded as **Image XObjects**. An XObject is an indirect object whose dictionary contains `/Type /XObject` and `/Subtype /Image`. 
It is invoked from a content stream using the `Do` operator: + +``` +/ImageName Do +``` + +where `ImageName` is a name key in the current resource dictionary's `/XObject` subdictionary that maps to the image's indirect reference. + +The XObject dictionary must contain: + +| Key | Type | Description | +|---|---|---| +| `Width` | integer | Pixel width of the raster | +| `Height` | integer | Pixel height of the raster | +| `ColorSpace` | name or array | Color space of the image samples | +| `BitsPerComponent` | integer | Bits per color component (1, 2, 4, 8, or 16) | +| `Filter` | name or array | Compression filter(s) applied to the stream | + +The stream body following the dictionary contains the raw (filtered) image data. BitsPerComponent is omitted when the image uses a mask or JBIG2 encoding where bit depth is implied. + +### Positioning via the CTM + +The `Do` operator renders the image into a 1×1 unit square anchored at the origin. The **current transformation matrix (CTM)** at the point of `Do` invocation maps that unit square into page space. A canonical image placement looks like: + +``` +q +72 0 0 96 144 432 cm +/Im1 Do +Q +``` + +The `cm` operator concatenates `[72 0 0 96 144 432]` onto the CTM: this scales the unit square to 72×96 points and translates its origin to (144, 432) in page coordinates. The rendered bounding box in page units is thus derived from the CTM columns — specifically, the x-extent is the length of the first column vector and the y-extent is the length of the second column vector. When the matrix contains rotation or shear, the bounding box must be computed as the convex hull of the four transformed corners: `(0,0)`, `(1,0)`, `(0,1)`, `(1,1)`. + +## 2. Inline Images + +Inline images embed pixel data directly into the content stream using a three-operator sequence: + +``` +BI + /W 320 /H 240 /CS /RGB /BPC 8 /F /DCT +ID + +EI +``` + +The `BI` (Begin Image) operator introduces the inline dictionary. 
Key abbreviations are standardized: + +| Abbreviation | Full key | +|---|---| +| `/W` | `Width` | +| `/H` | `Height` | +| `/CS` | `ColorSpace` | +| `/BPC` | `BitsPerComponent` | +| `/F` | `Filter` | +| `/DP` | `DecodeParms` | + +`ID` (Image Data) marks the transition from dictionary to binary payload. The parser must switch to raw byte mode immediately after the whitespace following `ID`. The payload ends at the `EI` operator. Because the binary data can itself contain the bytes `EI`, reliably detecting the terminator requires either tracking the filter's expected byte count or scanning for `EI` preceded by a whitespace character. + +Inline images are limited to simpler use cases: they cannot be referenced by name, cannot be reused across content streams, and are typically restricted to JPEG, CCITT, or uncompressed data. They carry no indirect object overhead but complicate stream parsing significantly. + +## 3. Filter Decoding + +Both XObject streams and inline images may be compressed with one or more filters listed in the `/Filter` key (a name for a single filter, or an array for a chain). The array lists filters in the order in which they must be applied to decode the data; the encoder applied them in the reverse order. + +**Common filters:** + +- **`DCTDecode`** — JPEG (ISO 10918). The stream is a complete JFIF/JPEG file. `DecodeParms` may specify `ColorTransform` (0 = no transform, 1 = YCbCr→RGB for three-component images; when absent, the default is 1 for three-component images and 0 otherwise). Standard JPEG decoders handle the DCT coefficients, quantization, and Huffman decoding. +- **`JPXDecode`** — JPEG 2000 (ISO 15444). The stream is a complete JP2 or J2C codestream. Color space information may be embedded in the JP2 container; when present it overrides the PDF `/ColorSpace` key. +- **`JBIG2Decode`** — Bi-level (1 bpp) compression. `DecodeParms` may contain a `JBIG2Globals` key whose value is a stream containing the global segment data that must be prepended before decoding. Requires a full JBIG2 decoder (e.g., the `jbig2dec` library via FFI). +- **`CCITTFaxDecode`** — Group 3 or Group 4 fax encoding. 
`DecodeParms` specifies `K` (0=Group3 1D, -1=Group4, positive=Group3 2D with K rows between EOL), `Columns`, `Rows`, `BlackIs1`, `EncodedByteAlign`. +- **`FlateDecode`** — zlib/deflate (RFC 1950 wrapper). `DecodeParms` may specify a PNG predictor via `Predictor` (10–15 for PNG filter types None/Sub/Up/Average/Paeth applied row-by-row). After inflation, the predictor must be undone row by row. +- **`LZWDecode`** — LZW compression. `DecodeParms` supports `EarlyChange` (1 by default, meaning the code size increases one code early). The LZW variant matches the TIFF LZW convention, not GIF. +- **`RunLengthDecode`** — PackBits run-length encoding. Each control byte `n` signals either `(257-n)` copies of the next byte (if n > 128) or `(n+1)` literal bytes (if n < 128). Byte 128 signals end-of-data. + +For chained filters — e.g., `[/ASCII85Decode /FlateDecode]` — the decoder applies ASCII85 first to produce binary, then inflates the result. + +## 4. Color Spaces + +The `/ColorSpace` value determines how to interpret decoded sample bytes. + +**Device spaces** are the simplest: `DeviceGray` (1 component, 0=black, 1=white), `DeviceRGB` (3 components), `DeviceCMYK` (4 components, subtractive). + +**Calibrated spaces** embed a viewing condition: `CalGray` and `CalRGB` specify a `WhitePoint`, optional `BlackPoint`, and `Gamma`/`Matrix`. `Lab` uses the CIE L*a*b* model with `WhitePoint` and `Range` bounds on a* and b*. + +**`ICCBased`** references an embedded ICC profile stream. The profile's `N` value gives the component count. ICC profiles provide a precise device-independent color path; for sRGB output, apply the ICC forward transform to XYZ and then the sRGB matrix. + +**`Indexed`** defines a palette: `[/Indexed base hival lookup]`. The base space specifies the color model of palette entries; `hival` is the maximum index (palette has `hival+1` entries); `lookup` is either a string or stream of `(hival+1) * N` bytes where N is the component count of the base space. 
Each 8-bit sample is a palette index. + +**`Separation`** addresses a single named colorant: `[/Separation name alternateSpace tintTransform]`. The tint transform (a PDF function) maps a tint value [0,1] to the alternate space. When the target device does not support the colorant, apply the tint transform as a fallback to the alternate space (which is typically `DeviceCMYK` or `DeviceRGB`). + +**`DeviceN`** generalizes Separation to multiple colorants: `[/DeviceN names alternateSpace tintTransform attributes]`. Each channel maps to a named colorant; the tint transform maps the N-component input to the alternate space. + +For pipeline output, convert all color spaces to sRGB: device spaces use standard matrices (CMYK to RGB: `R=1-min(1,C+K)`, `G=1-min(1,M+K)`, `B=1-min(1,Y+K)`); calibrated/ICC spaces go through XYZ intermediate. + +## 5. Image Geometry + +The CTM at `Do` invocation is a 3×3 affine matrix stored as six values `[a b c d e f]`, representing: + +``` +| a b 0 | +| c d 0 | +| e f 1 | +``` + +The rendered width in page units is `sqrt(a² + b²)` and the rendered height is `sqrt(c² + d²)`. Rotation angle is `atan2(b, a)`. Shear is present when the dot product `(a·c + b·d)` is nonzero. + +DPI is computed as: + +``` +dpi_x = Width_px / (rendered_width_pts / 72.0) +dpi_y = Height_px / (rendered_height_pts / 72.0) +``` + +For rotated images, the bounding box in axis-aligned page coordinates is the AABB of the four corners produced by transforming `(0,0)`, `(1,0)`, `(0,1)`, `(1,1)` through the CTM. + +## 6. Form XObjects + +A `/Subtype /Form` XObject is a self-contained reusable content stream — not an AcroForm widget. It may embed text, images, paths, and other XObjects (including nested Form XObjects). The parser must recurse into each Form XObject encountered via `Do`. + +The Form XObject dictionary includes: + +- `/BBox` — the bounding box of the form in its own local coordinate system. 
+- `/Matrix` (optional) — a transformation applied before the form's content stream executes, in addition to the invoking CTM. +- `/Resources` — its own resource dictionary, independent of the page's. + +When a `Do` operator names a Form XObject, the renderer pushes the current graphics state, concatenates the form's `/Matrix` onto the CTM, clips to `/BBox`, then executes the content stream. The combined CTM (page CTM × form matrix) must be tracked at every `Do` invocation inside the form to correctly compute image geometry. + +## 7. Soft Masks and Transparency + +Images participate in the PDF transparency model in three ways: + +- **`ImageMask`** (boolean) — when `true`, the image is a 1-bpp stencil. Samples with value 0 paint the current color; samples with value 1 are transparent. `ColorSpace` and `BitsPerComponent` are not used. +- **`/Mask`** — either a color key mask (an array of `[min max]` pairs per component defining transparent ranges) or a reference to a 1-bpp image stream serving as a hard mask. +- **`/SMask`** — a reference to a grayscale image stream interpreted as an alpha channel (0=fully transparent, 255=fully opaque). The SMask stream is itself an Image XObject with its own filter, width, height, and `ColorSpace` of `DeviceGray`. + +For figure detection purposes, an image that is pure stencil (`ImageMask=true`) or has very low average alpha may be a decorative overlay rather than a content figure. + +## 8. Detecting Figure Regions + +Building the figure inventory for a page: + +1. Walk the content stream, tracking the CTM stack. At each `Do` invocation for an Image XObject, compute the AABB bounding box in page coordinates and record the image metadata. +2. Apply size thresholds: images smaller than a minimum area threshold (e.g., 1% of page area) are likely icons or decorative glyphs; images covering more than 90% of the page are likely scanned-page backgrounds. +3. 
Apply position heuristics: watermarks are typically centered and semi-transparent; logos appear near page margins and are small; content figures appear in the body region with substantial rendered area. +4. Caption association: scan for text runs within a vertical proximity band (e.g., ±2× line-height) below or above the image bounding box. Text beginning with "Figure", "Fig.", or a numeric pattern is a strong caption signal. Associate the nearest qualifying text run as the figure's caption. +5. Detect full-page images: rendered size within 5% of the page's MediaBox and positioned at or near the origin — flag as `scanned_page`. + +## 9. Output Representation + +Each detected figure is emitted as a JSON object in the extraction output: + +```json +{ + "kind": "figure", + "page": 3, + "bbox": [72.0, 300.0, 540.0, 600.0], + "width_px": 1200, + "height_px": 900, + "dpi": 144.0, + "color_space": "DeviceRGB", + "filter": ["DCTDecode"], + "caption": "Figure 4. Loss curves over 100 training epochs.", + "image_b64": "" +} +``` + +`bbox` is `[x_min, y_min, x_max, y_max]` in PDF page units (origin at bottom-left per PDF convention, or converted to top-left depending on the caller's coordinate system preference). `dpi` is the effective horizontal DPI rounded to two decimal places. `filter` is the original filter chain as decoded from the XObject dictionary. `caption` is null when no caption is detected. + +## 10. Extraction Use Cases and Caller Options + +Decoding image bytes is expensive — allocating a decompressed raster for a 300 DPI full-page image at A4 size requires ~25 MB. The extraction pipeline exposes two caller-controlled options: + +- **`extract_images: bool`** — when `false`, only metadata (`bbox`, `width_px`, `height_px`, `dpi`, `color_space`, `filter`) is emitted. The image stream is not decompressed. This is the default for text-extraction workflows where image content is not needed. 
+- **`max_image_dpi: u32`** — when `extract_images` is `true`, images whose effective DPI exceeds this threshold are downsampled before encoding. The downsampled dimensions are `round(rendered_width_pts / 72.0 * max_image_dpi)` × `round(rendered_height_pts / 72.0 * max_image_dpi)`. A common default is 150 DPI for document previews or 300 DPI for archival quality. + +For very large images (e.g., 10,000×10,000 px TIFF-equivalent embedded as FlateDecode), the decoder should process row-by-row rather than inflating the entire stream into a contiguous buffer. FlateDecode with PNG predictors naturally supports row-granularity streaming: inflate one PNG-filtered row, unfilter it (applying Sub/Up/Average/Paeth as indicated), emit to the output buffer, then continue. This keeps peak memory bounded to `2 × stride_bytes` regardless of image height. + +JBIG2 and JPEG 2000 streams require external codec libraries; callers without FFI dependencies available should fall back to emitting raw stream bytes under a `raw_stream_b64` key rather than failing. The `filter` field in the output indicates which codec is needed for the caller to decode the bytes independently. diff --git a/docs/research/invisible-and-hidden-text.md b/docs/research/invisible-and-hidden-text.md new file mode 100644 index 0000000..3991fbd --- /dev/null +++ b/docs/research/invisible-and-hidden-text.md @@ -0,0 +1,158 @@ +# Invisible and Hidden Text in PDFs + +## Overview + +PDF files routinely contain text that is present in the byte stream but not visually rendered to a reader. This occurs through several independent mechanisms: the text rendering mode operator, color matching with the page background, zero-opacity graphics states, clip-path suppression, and near-zero scaling. For a text extraction library, invisible text is often the most valuable content on the page — particularly in scan-based PDF/A files where an OCR layer carries the only machine-readable text. 
This document covers detection algorithms for each invisibility mechanism and the output policy `pdftract` should apply. + +--- + +## 1. Text Rendering Modes (`Tr`) + +The PDF specification (ISO 32000-2 §9.3.6) defines the `Tr` (text rendering mode) operator, which controls how glyph outlines are applied to the page. The argument is an integer 0–7: + +| Mode | Name | Fill | Stroke | Clip | +|------|------|------|--------|------| +| 0 | Fill | yes | no | no | +| 1 | Stroke | no | yes | no | +| 2 | Fill then stroke | yes | yes | no | +| 3 | Invisible | no | no | no | +| 4 | Fill + clip | yes | no | yes | +| 5 | Stroke + clip | no | yes | yes | +| 6 | Fill + stroke + clip | yes | yes | yes | +| 7 | Clip only | no | no | yes | + +Mode 3 is the canonical invisible text mechanism. The glyph is processed by the text engine — Unicode mapping, advance width, and spacing operators all apply normally — but nothing is painted. This is the mechanism used by scan-based PDF/A files to overlay OCR output. Mode 7 is similarly invisible but accumulates the glyph outline into the current clip path. + +During content stream parsing, the current `Tr` value must be tracked as part of the graphics state. It defaults to 0 at the start of each page content stream and is reset by `q`/`Q` pushes and pops along with the rest of the graphics state. Every text span extracted should carry the rendering mode at the time of its `Tj`, `TJ`, `'`, `"`, or similar text-showing operator. + +--- + +## 2. Invisible Text Over Scans (PDF/A Pattern) + +The dominant real-world source of mode-3 text is the OCR-over-scan pattern used in PDF/A-3 and related archival formats. The structure is: + +1. A raster image XObject is placed on the page via `Do`, covering substantially the full page area (typically the entire MediaBox). +2. A sequence of mode-3 text spans is overlaid at positions that correspond to the OCR engine's bounding box output for each word or glyph. 
+ +**Detection heuristic.** Flag a page as using this pattern when: +- At least one image XObject with an area ≥ 80% of the page MediaBox is present. +- At least one text span with `Tr == 3` exists on the same page. +- The text spans cluster within the image bounding box bounds. + +When this pattern is detected, the mode-3 text spans are the authoritative extraction result. Re-running OCR on the raster would be redundant and potentially lower quality. Mark these spans with `source: "ocr_invisible_layer"` so callers can distinguish them from normally rendered text. The raster image itself should not be forwarded to an OCR pipeline when invisible text is already present. + +**Coordinate correspondence.** OCR layers typically place each word or character at the correct position on the page coordinate system. Verify plausibility by checking that the text spans, when rendered at their specified positions, fall within the image XObject's bounding box. Spans placed outside the image area are likely artifacts and should be flagged separately. + +--- + +## 3. White Text on White Background + +Text whose fill color matches the page background is visually hidden even at `Tr 0`. Detecting this requires tracking the current fill color through the content stream and comparing it against the effective background. 
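
As a rough illustration of that comparison, the sketch below checks a device-space fill color against white, under the simplifying assumption that the page background is the viewer default (white). The `FillColor` enum, the `is_white` function, and the `EPS` tolerance are hypothetical names chosen for this example; they are not part of `pdftract`.

```rust
/// Hypothetical fill color in one of the three device color spaces.
/// Component values follow the PDF convention of 0.0..=1.0.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FillColor {
    Gray(f32),
    Rgb(f32, f32, f32),
    Cmyk(f32, f32, f32, f32),
}

/// Assumed tolerance for treating a component as "fully on" or "fully off".
pub const EPS: f32 = 0.01;

/// Returns true when the fill is (near) white in its own color space.
/// Gray and RGB are additive, so white is all components near 1.0;
/// CMYK is subtractive, so white is all components near 0.0.
pub fn is_white(color: FillColor) -> bool {
    match color {
        FillColor::Gray(g) => g >= 1.0 - EPS,
        FillColor::Rgb(r, g, b) => r >= 1.0 - EPS && g >= 1.0 - EPS && b >= 1.0 - EPS,
        FillColor::Cmyk(c, m, y, k) => c <= EPS && m <= EPS && y <= EPS && k <= EPS,
    }
}
```

A caller would evaluate `is_white` on the fill color in effect at each text-showing operator. Pages that paint a non-white background would need the z-order lookup instead of this shortcut.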
+ +**Color tracking operators.** The current fill color is set by: +- `rg r g b` — DeviceRGB fill color (values 0.0–1.0) +- `RG r g b` — DeviceRGB stroke color +- `k c m y k` — DeviceCMYK fill color +- `K c m y k` — DeviceCMYK stroke color +- `g gray` — DeviceGray fill +- `G gray` — DeviceGray stroke +- `cs name` — set fill color space to a named space +- `CS name` — set stroke color space +- `sc`/`scn` — set fill color components in current fill color space +- `SC`/`SCN` — set stroke color components in current stroke color space + +The graphics state stack (`q`/`Q`) must save and restore the full color state including both the current color space and the current color value vector. + +**White in each color space.** The canonical white values are: +- DeviceGray: `1.0` +- DeviceRGB: `1.0 1.0 1.0` +- DeviceCMYK: `0.0 0.0 0.0 0.0` +- CalRGB, CalGray, ICCBased: requires converting to a perceptual space (e.g., CIELAB) and checking L* ≥ 95. + +**Background color determination.** The page background is ambiguous. The PDF viewer default is white, but a content stream may paint a filled rectangle covering the MediaBox with an arbitrary color before placing text. The most reliable approach is to build a simple z-order list of opaque filled rectangles that cover each point of the page, then for any text glyph center point, walk the z-order list downward from the text to find the topmost background element. If the background is an image XObject, extracting the background color at a point requires sampling the image raster — a heavier operation. In practice, comparing the fill color against `white` (per-color-space definition above) catches the overwhelming majority of white-on-white cases without full compositing. + +--- + +## 4. Zero-Opacity and Transparency + +PDF transparency (ISO 32000-2 §11) introduces alpha values separate from the color operators. + +**Graphics state alpha.** The `gs` operator references an ExtGState resource dictionary. 
The relevant keys: +- `ca` — constant alpha for non-stroking (fill) operations; float 0.0–1.0 +- `CA` — constant alpha for stroking operations; float 0.0–1.0 + +A text span with `ca == 0.0` (or effectively zero, e.g., < 0.01) at `Tr 0` is invisible. At `Tr 1`, invisibility is governed by `CA`. At `Tr 2`, both `ca` and `CA` must be checked. Track the current `ca` and `CA` values as part of the graphics state, initializing them to 1.0 per the PDF default. + +**Soft masks.** A soft mask (`SMask` in the ExtGState dictionary) may reduce effective alpha further. An `SMask` of type `Luminosity` or `Alpha` applied to a transparency group containing text can render that text invisible even if `ca` is nonzero. Full soft mask evaluation requires compositing the transparency group, which is expensive. For detection purposes, flag any text span inside a content stream with an active `SMask` (i.e., `SMask` is not `/None`) as potentially invisible and emit it with `visibility_confidence: low`. + +--- + +## 5. Clipped-Away Text + +The clip path operators `W` (nonzero winding rule) and `W*` (even-odd rule) modify the current clipping region by intersecting it with the current path. Text rendered when the clip region has zero or negligible area is visually absent. + +**Clip path tracking.** The clipping region is part of the graphics state and is saved/restored by `q`/`Q`. It starts as the page MediaBox. Each `W` or `W*` narrows it by intersecting with the path constructed by the preceding `m`/`l`/`c`/`re` operators. The current transformation matrix (`cm`) transforms subsequent coordinates and must be applied to path coordinates before intersection. + +**Detection.** For each text glyph, compute its bounding box in default user space (using the current text matrix, font metrics, and font size). Intersect this rectangle with the current clip region. If the intersection area is below a threshold (e.g., < 0.01 square points), mark the glyph as clipped-invisible. 
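
The detection step can be sketched as a rectangle intersection test. This is a minimal illustration that assumes the clip region has already been reduced to a single axis-aligned rectangle in default user space; the `Rect` type, `MIN_VISIBLE_AREA`, and `is_clipped_invisible` are hypothetical names, not an existing `pdftract` API.

```rust
/// Hypothetical axis-aligned rectangle in default user space (points).
/// Invariant assumed by this sketch: x0 <= x1 and y0 <= y1.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

impl Rect {
    /// Area of the overlap between two rectangles, or 0.0 if they are disjoint
    /// or touch only along an edge.
    pub fn intersection_area(&self, other: &Rect) -> f32 {
        let w = self.x1.min(other.x1) - self.x0.max(other.x0);
        let h = self.y1.min(other.y1) - self.y0.max(other.y0);
        if w <= 0.0 || h <= 0.0 {
            0.0
        } else {
            w * h
        }
    }
}

/// Threshold from the text: below 0.01 square points counts as clipped away.
pub const MIN_VISIBLE_AREA: f32 = 0.01;

/// True when the glyph's bounding box has effectively no visible area
/// inside the (rectangular) clip region.
pub fn is_clipped_invisible(glyph: &Rect, clip: &Rect) -> bool {
    glyph.intersection_area(clip) < MIN_VISIBLE_AREA
}
```

Because a rectangle can only over-approximate a curved or concave clip path, this test errs toward reporting glyphs as visible.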
+ +Exact clip path intersection for arbitrary Bézier paths is expensive. A practical approximation: represent the clip path as an axis-aligned bounding box (AABB) at each step. This will produce false negatives for concave clip paths but catches the common case of clipping to a zero-width or zero-height rectangle. + +--- + +## 6. Text Scaled to Near-Zero + +A font size of 0.0 or near-zero renders glyphs at sub-pixel scale, making them invisible: + +- `Tf fontname size` — if `size < 0.1`, the rendered glyph height is negligible. +- `Tz scale` — horizontal scaling as a percentage; `Tz 0` collapses all glyph advance widths to zero, stacking all characters at a single point. + +**Detection thresholds.** Flag a text span as size-invisible when: +- The effective font size (after applying the current transformation matrix scale factor) is < 0.1 points, or +- `Tz` is < 1.0 (1% horizontal scaling). + +The effective font size must account for the CTM. Compute the scale factor as `sqrt(a² + b²)` from the current CTM `[a b c d e f]` and multiply by the `Tf` size argument. + +--- + +## 7. Color Space Detection for Fills + +Determining whether a fill is white requires correctly resolving the current color space. The fill color space is set explicitly by `cs`, or implicitly by `g`, `rg`, and `k`, and is DeviceGray in the initial graphics state. Color space names resolve through the page's `Resources/ColorSpace` dictionary. The categories: + +- **Device spaces** (DeviceGray, DeviceRGB, DeviceCMYK): white values are fixed as above. +- **CIE-based spaces** (CalGray, CalRGB, Lab): convert the color value to CIE L*a*b* and check L* ≥ 95, |a*| ≤ 5, |b*| ≤ 5. +- **ICCBased**: requires loading and evaluating the embedded ICC profile. For extraction purposes, inspect the `Alternate` entry in the ICCBased stream dictionary as a fallback color space and apply its whiteness rule. +- **Indexed**: the color value is a table index; look up the base color and apply the base space rule. 
+- **Pattern** and **Separation/DeviceN**: too complex for simple whiteness detection; flag as `visibility_confidence: low`. + +--- + +## 8. Intentional Obfuscation and DRM + +Some PDFs deliberately exploit text extraction to prevent accurate copying while maintaining visual fidelity: + +**Position shuffling.** Individual characters are placed at arbitrary positions via separate `Tj` or `TJ` operators with large kerning adjustments, making the logical reading order in the byte stream non-sequential. Visually, the PDF renderer draws the correct text because the positions are meticulously computed. Extraction that reads characters in byte-stream order produces gibberish. Detection: flag pages where the average glyph-center-to-glyph-center distance divided by glyph advance width exceeds a threshold (e.g., > 5.0), suggesting non-linear character placement. + +**Deliberate CMap corruption.** The `ToUnicode` CMap in the font dictionary maps glyph IDs to Unicode code points. An adversarial PDF may install a ToUnicode CMap where the mappings are deliberately wrong — e.g., all glyphs map to `U+0041` (A), or the CMap is omitted entirely. The visual rendering uses the actual glyph outlines and is correct; extraction using ToUnicode returns nonsense. Detection: compare the extracted Unicode string entropy against the expected entropy for the detected language. A string of all-identical characters or a very low-entropy sequence over a full paragraph is a strong signal. `pdftract` has no reliable recovery path for this case; it should document the limitation and report `extraction_quality: obfuscated`. + +--- + +## 9. Output Policy + +**Default behavior.** Extract all text spans regardless of rendering mode or computed visibility. This is the most useful default for search indexing and RAG pipelines, which benefit from invisible OCR layers. 
+ +**Span metadata.** Each extracted `TextSpan` should carry: + +```rust +pub struct TextSpan { + pub text: String, + pub rendering_mode: u8, // Tr value 0–7 + pub visible: bool, // false if any invisibility mechanism applies + pub visibility_flags: VisibilityFlags, // bitfield: INVISIBLE_TR | WHITE_COLOR | ZERO_ALPHA | CLIPPED | NEAR_ZERO_SIZE + pub source: SpanSource, // Normal | OcrInvisibleLayer | Unknown + pub visibility_confidence: Confidence, // High | Low (low when SMask or DeviceN color) +} +``` + +**Caller filtering.** Provide an extraction option `visible_only: bool` that filters the output to spans where `visible == true`. This is appropriate for display-faithful extraction. Default: `false`. + +**OCR invisible layer.** Spans with `rendering_mode == 3` on a page matching the scan-pattern heuristic are assigned `source: SpanSource::OcrInvisibleLayer`. These spans should not be deduplicated against OCR pipeline output — they are the preferred result. diff --git a/docs/research/malformed-pdf-repair-and-recovery.md b/docs/research/malformed-pdf-repair-and-recovery.md new file mode 100644 index 0000000..8d52bd3 --- /dev/null +++ b/docs/research/malformed-pdf-repair-and-recovery.md @@ -0,0 +1,168 @@ +# Malformed PDF Repair and Recovery + +**Project:** pdftract — Rust PDF text extraction library +**Scope:** Graceful handling of corrupt, truncated, and malformed PDF files + +--- + +## 1. Prevalence and Categories of Malformed PDFs + +Production PDF extraction cannot assume well-formed input. Malformed PDFs arrive from several distinct failure modes. + +**Truncated downloads** are among the most common: a file fetched over HTTP where the connection dropped mid-transfer produces valid PDF prefix bytes followed by an abrupt EOF. The cross-reference table and trailer, which appear at the end of a standard PDF, are typically lost entirely. 
+ +**Disk write failures** produce files where the last few kilobytes were never flushed — a power loss or filesystem error after the application finished writing page content but before it wrote the xref. The byte offset recorded after `startxref` then points to a location containing garbage or nothing. + +**Buggy authoring tools** contribute a large share of structurally malformed but visually correct PDFs. Microsoft Word's PDF export has historically produced incorrect `/Length` entries in stream dictionaries, off by one or two bytes due to CR/LF normalization mismatches. LibreOffice edge cases include object dictionaries with duplicate keys (ISO 32000-1 §7.3.7 forbids duplicate keys, so the resolution is a parser convention; last-value-wins is a common choice), missing `endobj` tokens on the final object in a file, and xref tables with incorrect byte offsets when the file was written on a platform with different newline conventions. + +**Aggressive compression** can produce xref streams (PDF 1.5+) whose compressed payload, when decompressed, is shorter than the dictionary's `/W` field widths and entry count imply, causing out-of-bounds reads if the parser trusts the field counts blindly. + +**Incremental update corruption** occurs when a PDF viewer appends an update section (new xref + trailer + `%%EOF`) but the process was interrupted. The appended section may be syntactically incomplete, yet the original body of the file remains intact. + +**Legacy pre-ISO PDFs** (pre-1.0 through PDF 1.3, from the 1990s) use non-standard comment syntax, allow object numbers starting at values other than 1, and sometimes omit the `%PDF-` header entirely. Some PostScript-derived exporters embed raw PostScript fragments as PDF stream data with no proper dictionary wrapper. + +A production extractor must handle all of these rather than surfacing a hard error to the caller. The cost of failure is high: in document processing pipelines, a single corrupt file that panics or returns an opaque error can stall an entire batch. + +--- + +## 2. 
Cross-Reference Table Recovery + +The standard parse path reads `startxref` by scanning backward from `%%EOF`, then seeks to that offset to read the xref section. Recovery proceeds in stages when this fails. + +**Stage 1 — Backward scan for `startxref`.** Read the last 1024 bytes of the file. Search backward for the literal token `startxref` followed by a decimal integer on the next line. If the stated offset is within the file bounds and the bytes there begin with `xref` (for traditional xref tables) or match an indirect object header `N G obj` (for xref streams), proceed normally. + +**Stage 2 — Full-file object scan.** If stage 1 yields an offset pointing to garbage, scan the entire file byte-by-byte for the pattern `\d+ \d+ obj`. For each match, record `(object_number, generation, byte_offset)`. This reconstructed table is used as a fallback xref. Scanning must handle the case where the bytes `obj` appear inside a stream — use the heuristic that a valid object header is preceded by a newline or is at file start, and that the object and generation numbers are plausible (object number > 0, generation number typically 0 or 1). + +**Multiple `%%EOF` markers** appear in linearized PDFs (one near the front for first-page delivery, one at the end) and in every incrementally updated file. The parser must not stop at the first `%%EOF` it encounters when scanning backward — it must collect all `%%EOF` positions and process xref sections anchored to each. + +**Object number conflicts** arise when the same object number appears in multiple xref sections. For incremental updates, the correct rule (ISO 32000-1 §7.5.6) is last-definition-wins: the xref section closest to the end of the file takes precedence. During recovery from a full-file object scan, if two `obj` tokens claim the same object number, prefer the one at the higher byte offset, consistent with the incremental update semantics. + +--- + +## 3. 
Object Stream Recovery + +PDF 1.5 introduced xref streams, which replace the plaintext xref table with a compressed binary stream embedded in an indirect object. When this stream is itself corrupt, the parser must fall back to the Stage 2 object scan described above. + +Within object streams (`/Type /ObjStm`), multiple objects are packed sequentially. The stream dictionary's `/N` field states the object count and `/First` gives the byte offset of the first object within the decompressed payload. If decompression fails or the `/First` offset exceeds the decompressed length, attempt to extract whatever objects are readable from the start of the decompressed data, stopping at the first parse error rather than discarding all objects in the stream. + +--- + +## 4. Stream Length Repair + +The `/Length` entry in a stream dictionary specifies how many bytes to read before `endstream`. This value is wrong frequently enough that every parser needs a repair path. + +**Algorithm:** + +1. Seek to the start of stream data (the byte immediately after the newline following the `stream` keyword). +2. Read exactly `/Length` bytes. +3. Scan the next 32 bytes for the `endstream` token, allowing for leading whitespace and CR/LF variants. +4. If `endstream` is found within that window, the length was correct. Continue. +5. If not found, the stated length is wrong. Scan forward from the start of stream data for the literal bytes `endstream` preceded by a newline. Use the byte count from stream start to that newline as the actual length. Log a warning with the offset, the stated length, and the actual length. +6. If `/Length` is missing entirely, scan for `endstream` from the start of stream data immediately. A missing `/Length` is a hard spec violation but appears in real files from legacy exporters. +7. The `endobj` token serves as a hard upper boundary: if `endstream` is not found before `endobj`, the stream data is truncated. Extract what is available and mark the stream as partial. 
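The repair steps above can be sketched as a single function over an in-memory byte slice. This is a minimal illustration, not pdftract's actual API: the names `StreamLength` and `repair_stream_length` are hypothetical, `data` is assumed to begin at the first byte of stream data, and the first `endobj` found is taken as the upper boundary even though those bytes could in principle occur inside binary stream data.

```rust
/// How the end of a stream's data was located (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum StreamLength {
    /// `/Length` was confirmed by an `endstream` token just after the data.
    Stated(usize),
    /// `/Length` was wrong or missing; the length was found by scanning.
    Repaired(usize),
    /// No `endstream` before `endobj` or end of input; data is truncated.
    Partial(usize),
}

/// Position of the first occurrence of `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// `data` starts at the first byte of stream data; `stated` is `/Length` if present.
pub fn repair_stream_length(data: &[u8], stated: Option<usize>) -> StreamLength {
    const WINDOW: usize = 32;

    // Steps 2-4: trust `/Length` only if `endstream` follows within the window.
    if let Some(len) = stated {
        if len <= data.len() {
            let tail = &data[len..(len + WINDOW).min(data.len())];
            let ws = tail.iter().take_while(|b| b.is_ascii_whitespace()).count();
            if tail[ws..].starts_with(b"endstream") {
                return StreamLength::Stated(len);
            }
        }
    }

    // Step 7: `endobj` is the hard upper boundary for the scan.
    let limit = find(data, b"endobj").unwrap_or(data.len());

    // Steps 5-6: scan for `endstream` preceded by an end-of-line marker.
    let mut from = 0;
    while let Some(rel) = find(&data[from..limit], b"endstream") {
        let pos = from + rel;
        if pos == 0 {
            // Empty stream: the EOL after the `stream` keyword precedes the token.
            return StreamLength::Repaired(0);
        }
        if matches!(data[pos - 1], b'\n' | b'\r') {
            // The length runs up to the EOL marker; treat CR LF as one marker.
            let mut len = pos - 1;
            if data[pos - 1] == b'\n' && len > 0 && data[len - 1] == b'\r' {
                len -= 1;
            }
            return StreamLength::Repaired(len);
        }
        from = pos + 1;
    }
    StreamLength::Partial(limit)
}
```

A caller would log a warning for `Repaired` (with the stated and actual lengths) and mark the stream as partial for `Partial`, matching steps 5 and 7.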
+ +--- + +## 5. Syntax Error Tolerance + +**Missing `endobj`.** If the parser encounters the header of the next object (or an `xref`, `trailer`, or `startxref` keyword) while still parsing object N, treat the boundary as an implicit `endobj`. This covers the common LibreOffice case where the final object in a file has no terminator. + +**Unbalanced `q`/`Q` in content streams.** The graphics state stack must not overflow or underflow. Track depth; on underflow (extra `Q`), ignore the operator and log a warning. On EOF with nonzero depth (unclosed `q`), synthesize the missing `Q` operators before returning from stream parsing. + +**Invalid object references.** A reference to object 0 is always invalid (object 0 is the head of the free list). A reference to an object number not in the xref is a dangling reference. In both cases, return a null object rather than an error, consistent with how PDF readers handle missing optional entries. + +**Non-integer generation numbers.** If the generation field in an object header is non-numeric, treat it as generation 0 and continue. + +**Dictionary keys without values.** If a dictionary contains a name token immediately followed by another name token (the first has no value), insert a null value for the value-less key and continue parsing the dictionary. This prevents the parser from misaligning all subsequent key-value pairs. + +--- + +## 6. Linearization Failures + +A linearized PDF places a linearization parameter dictionary as the first object, followed by a first-page xref section. When the linearization dictionary's `/L` (file length) field does not match the actual file size, treat the file as non-linearized and parse from the end using the main xref. + +When the hint tables (referenced by `/H` in the linearization dictionary) are corrupt or point past EOF, skip hint table processing entirely. The hint tables are an optimization for byte-range requests; ignoring them does not affect completeness.
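The two checks above reduce to a small decision function. The sketch below is illustrative only: `LinearizationUse` and `linearization_use` are hypothetical names, `/H` is assumed to have been parsed into an `(offset, length)` pair, and only the bounds of the hint range are checked, not the contents of the hint tables.

```rust
/// What to do with the linearization data (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum LinearizationUse {
    /// `/L` matches and the hint range lies inside the file.
    WithHints,
    /// `/L` matches but hints are absent, unparseable, or out of bounds.
    WithoutHints,
    /// `/L` disagrees with the file size: parse as non-linearized.
    Ignore,
}

/// `declared_len` is `/L`; `hint_range` is `(offset, length)` from `/H`, if parsed.
pub fn linearization_use(
    declared_len: u64,
    file_len: u64,
    hint_range: Option<(u64, u64)>,
) -> LinearizationUse {
    if declared_len != file_len {
        return LinearizationUse::Ignore;
    }
    // `checked_add` guards against overflow from hostile offset/length values.
    let hints_in_bounds = hint_range
        .and_then(|(offset, length)| offset.checked_add(length))
        .map_or(false, |end| end <= file_len);
    if hints_in_bounds {
        LinearizationUse::WithHints
    } else {
        LinearizationUse::WithoutHints
    }
}
```

Returning a three-way result rather than a boolean keeps the "skip hints but still use the first-page xref" case distinct from the full fallback to end-of-file parsing.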
+ +False linearization — where the first object claims `/Linearized` but the file structure is actually a standard non-linearized layout — is detected by checking whether the first-page xref section that immediately follows the linearization dictionary is present and valid, and whether the declared `/T` offset actually lands in the main cross-reference table. If either check fails, fall back to end-of-file xref processing unconditionally. + +--- + +## 7. Incremental Update Repair + +Each incremental update appends: updated objects, a new xref section, a new trailer dictionary, a `startxref` offset, and `%%EOF`. The trailer's `/Prev` field chains back to the previous xref offset. + +When following `/Prev` chains, a corrupt intermediate update presents as an xref section at the chained offset that fails to parse. The repair strategy is to abandon chain-following at that point and instead scan the entire file for all `xref` or xref-stream markers (Stage 2), then sort them by byte offset ascending. Process them in ascending order, applying each xref section's entries to the object table, with later entries overwriting earlier ones. This produces the correct last-definition-wins semantics even when the `/Prev` chain is broken. + +A degenerate case is a cyclic `/Prev` chain (offset A's trailer points to B, B's trailer points back to A). Detect cycles by tracking visited offsets in a `HashSet` and breaking on revisit. + +--- + +## 8. Content Stream Error Recovery + +Content stream parsing should be operator-by-operator. On encountering an error, the parser skips to the next operator boundary (next newline or whitespace-separated token that is a known operator or the start of an operand sequence) and resumes. + +**Unknown operators** — discard the operands accumulated for the operator and continue with the next token; content streams are not required to place operators on separate lines. Emit an info-level log entry. + +**Unmatched `BT`/`ET`.** A missing `ET` at EOF of the stream: synthesize `ET` before returning, preserving any accumulated text. A spurious `ET` with no preceding `BT`: ignore it. + +**Wrong operand count.** If a `Tf` operator receives one operand instead of two, skip the operator.
Do not attempt to infer missing operands — the result would be garbage text. + +**Corrupt glyph data in `Tj` or `TJ`.** If a string operand contains byte sequences that do not map to any glyph in the current font's encoding, emit a replacement character (U+FFFD) for each unmappable byte and continue. Do not abort the text object. + +--- + +## 9. Partial File Extraction + +When a file is truncated mid-stream, extraction proceeds over all pages whose objects are fully recoverable. The extractor tracks the highest page index for which all required content streams and resources were available. + +Output metadata includes: + +```json +{ + "partial": true, + "pages_recovered": 14, + "pages_total_claimed": 20, + "truncation_offset": 1048576 +} +``` + +`partial: true` signals to callers that the output is incomplete. `pages_recovered` is the count of pages for which text was extracted. `pages_total_claimed` reflects the page count in the document catalog, which may itself be in the corrupt region (in which case it is omitted). `truncation_offset` is the byte offset at which the first unrecoverable structure was encountered. + +--- + +## 10. Error Reporting + +Every recovery action is logged as a structured entry alongside the extracted content. The top-level output object contains a `warnings` array: + +```json +{ + "warnings": [ + { + "severity": "warning", + "offset": 204800, + "object": 42, + "error_type": "wrong_stream_length", + "stated_value": 1024, + "actual_value": 1031, + "recovery": "scanned_for_endstream" + }, + { + "severity": "error", + "offset": 819200, + "object": null, + "error_type": "xref_corrupt", + "recovery": "full_file_object_scan" + } + ] +} +``` + +**Severity levels:** + +- `info` — a deviation that was resolved without ambiguity (e.g., missing `endobj` at end of file where the next object header was unambiguous). 
+- `warning` — a deviation that required a heuristic recovery; the extracted content is likely correct but not guaranteed (e.g., wrong `/Length` corrected by `endstream` scan). +- `error` — a structural failure that caused partial loss of content (e.g., an xref section that could not be reconstructed, resulting in unreachable objects). + +The `object` field is null when the error occurs in a structural region (xref, trailer) rather than within a specific object. The `recovery` field uses a fixed vocabulary of strategy identifiers so callers can programmatically assess extraction quality without parsing human-readable strings. + +Callers should treat any `error`-severity entry as grounds for flagging the output for human review, while `warning`-severity entries indicate likely-correct extractions from imperfect input. diff --git a/docs/research/optional-content-groups.md b/docs/research/optional-content-groups.md new file mode 100644 index 0000000..14d221f --- /dev/null +++ b/docs/research/optional-content-groups.md @@ -0,0 +1,265 @@ +# Optional Content Groups (Layers) in PDF Extraction + +## 1. OCG Overview + +Optional Content Groups (OCGs) are named, independently togglable layers defined in ISO 32000-2 §8.11. Each OCG represents a logical grouping of PDF content — text, graphics, images, or annotations — that can be rendered or suppressed as a unit. Layers were introduced to support use cases such as technical drawings (where construction lines appear on a separate layer), multilingual documents (one layer per locale), and print-vs-screen variants. + +From an extraction standpoint, OCGs are critical because content on an off layer **must not** be treated as visible text. A PDF viewer filters out off-layer content before rendering; an extraction library that ignores OCG state will silently include invisible text — watermarks, alternate-language duplicates, or suppressed annotations — mixed with the visible body text. 
+ +OCGs are registered in the document catalog under `/OCProperties` (ISO 32000-2 §8.11.4). The structure is: + +``` +/OCProperties << + /OCGs [ ref1 ref2 ref3 ] % all OCGs in the document + /D << ... >> % default configuration dictionary + /Configs [ << ... >> ] % optional additional named configurations +>> +``` + +`/OCGs` lists every OCG indirect reference in the document. A conforming reader must process this array to build the initial visibility table before parsing any content stream. + +--- + +## 2. OCG Dictionary + +Each OCG is an indirect object of the form: + +``` +<< /Type /OCG + /Name (English Text) + /Intent /View + /Usage << ... >> +>> +``` + +- `/Type /OCG` — required; marks the object as an OCG. +- `/Name` — required; a UTF-16BE or PDFDocEncoding string giving a human-readable layer name (e.g., `"Background"`, `"English"`, `"HeaderFooter"`). Used for display in layer panels and, in pdftract, as the `ocg_name` tag on extracted spans. +- `/Intent` — optional; a name or array of names (`/View`, `/Design`, or application-defined). `/View` means the OCG governs visibility for screen rendering and, by convention, for extraction. `/Design` means it governs visibility in design tools. If absent, treat as `/View`. +- `/Usage` — optional dictionary; machine-readable context hints that drive automatic state computation from the `/AS` (auto-state) rules in the default configuration. + +OCGs are referenced from content streams via the property list mechanism (see §6), from XObject dictionaries via `/OC`, and from annotation dictionaries via `/OC`. + +--- + +## 3. Usage Dictionary + +The `/Usage` dictionary on an OCG supplies structured metadata for automatic visibility determination. The relevant subkeys are: + +- **`CreatorInfo`** — `<< /Creator (ApplicationName) /Subtype /Technical >>`. Informational; identifies the originating application and layer purpose. +- **`Language`** — `<< /Lang (fr-CA) /Preferred /ON >>`. The `/Lang` value is a BCP 47 language tag. 
`/Preferred` specifies `ON` or `OFF` — whether this is the preferred language layer when automatic language selection is active. Critical for multilingual PDFs: an auto-state rule with event `/View` applied to `/Language` will turn on only the preferred-language layers and turn off others. +- **`Export`** — `<< /ExportState /ON >>`. Gives the recommended layer state when the document is saved to a format that does not support optional content (for example, a raster image). Values: `/ON` or `/OFF`. +- **`Zoom`** — `<< /min 0.5 /max 2.0 >>`. The layer is visible only when the zoom factor is within `[min, max]`. For extraction, zoom is conventionally treated as 1.0 unless the caller specifies otherwise. +- **`Print`** — `<< /Subtype /Watermark /PrintState /ON >>`. Governs layer state when printing. `/Subtype` can be `/Watermark` or application-defined. Watermark layers visible only on print should be excluded from extraction by default. +- **`View`** — `<< /ViewState /ON >>`. Explicitly sets layer state for on-screen viewing. This is the primary signal for extraction; a `/ViewState /OFF` layer is invisible on screen and should be excluded. +- **`User`** — `<< /Type /Ind /Name [(Alice)] >>`. User-based visibility; `/Type` is `/Ind` (individual), `/Ttl` (title), or `/Org` (organisation). Rarely relevant for extraction. +- **`PageElement`** — `<< /Subtype /HF >>`. Marks the layer as containing page elements of a specific functional type. The defined values are `/HF` (header/footer), `/FG` (foreground image or graphic), `/BG` (background image or graphic), and `/L` (logo); only `/HF` matters here. Extraction policy for HF layers is covered in §9. + +--- + +## 4. Optional Content Membership Dictionary (OCMD) + +An OCMD expresses a boolean combination of OCG states. It is not an OCG itself but acts as a computed visibility gate: + +``` +<< /Type /OCMD + /OCGs [ ref1 ref2 ] + /P /AnyOn +>> +``` + +- `/OCGs` — a single OCG reference or an array. If a single reference, the OCMD is equivalent to a direct OCG reference (with policy `/AnyOn`). +- `/P` — the policy applied to the `/OCGs` set: + - `/AllOn` — visible iff every listed OCG is on.
+ - `/AllOff` — visible iff every listed OCG is off. + - `/AnyOn` — visible iff at least one listed OCG is on. + - `/AnyOff` — visible iff at least one listed OCG is off. +- `/VE` — optional visibility expression array (introduced in PDF 1.6; §8.11.2.3); a recursive boolean expression using `And`, `Or`, `Not` operators over OCG references. Implement `/VE` evaluation as a tree walk; fall back to `/P`+`/OCGs` if `/VE` is absent. + +Resolving OCMD state: + +```rust +use std::collections::HashMap; + +// `OcgId` is an assumed name for the key identifying an OCG's indirect object. +fn resolve_ocmd(ocmd: &Ocmd, states: &HashMap<OcgId, bool>) -> bool { + // An OCG missing from the state table is treated as on. + let ocg_states: Vec<bool> = ocmd.ocgs.iter() + .map(|id| *states.get(id).unwrap_or(&true)) + .collect(); + // An empty `/OCGs` set has no effect on visibility. + if ocg_states.is_empty() { + return true; + } + match ocmd.policy { + Policy::AllOn => ocg_states.iter().all(|&s| s), + Policy::AllOff => ocg_states.iter().all(|&s| !s), + Policy::AnyOn => ocg_states.iter().any(|&s| s), + Policy::AnyOff => ocg_states.iter().any(|&s| !s), + } +} +``` + +--- + +## 5. Default Viewing State + +The `/D` entry of `/OCProperties` is the default configuration dictionary. It establishes the initial OCG visibility table: + +``` +/D << + /Name (Default) + /BaseState /OFF + /ON [ ref1 ref2 ] + /OFF [ ref3 ] + /AS [ << /Event /View /OCGs [ ref1 ref2 ] /Category [ /View /Zoom ] >> ] + /Order [ ... ] + /RBGroups [ ... ] + /Locked [ ... ] +>> +``` + +**Computing initial visible set:** + +1. Set all OCGs to the `/BaseState` value (`ON`, `OFF`, or `Unchanged`; for the `/D` entry, `Unchanged` is equivalent to `ON`). +2. Apply the `/ON` array: set each listed OCG to on. +3. Apply the `/OFF` array: set each listed OCG to off. `/ON` and `/OFF` take explicit precedence over `/BaseState`. +4. Process `/AS` (auto-state) entries. Each entry specifies an event (e.g., `/View`), a set of OCGs, and usage categories. For each category, read the corresponding key from the OCG's `/Usage` dictionary and apply the state. For extraction, process only entries with `/Event /View`. + +`/RBGroups` defines radio-button groups — only one OCG in the group may be on at a time.
Honor this constraint when applying `/AS` overrides. + +`/Locked` lists OCGs whose state may not be changed by the user; treat locked OCGs as fixed at their computed state. + +--- + +## 6. Content Stream Marking + +OCGs gate content in content streams through the **Marked Content** mechanism (ISO 32000-2 §14.6). The operator pair is `BDC` / `EMC`. When an OCG or OCMD governs a content region, the marking takes the form: + +``` +/OC /Lyr1 BDC + ... text operators ... +EMC +``` + +where `/Lyr1` is a name that resolves via the page's `/Resources /Properties` dictionary to an OCG or OCMD indirect reference: + +``` +/Resources << + /Properties << + /Lyr1 ref_to_ocg_or_ocmd + >> +>> +``` + +Some non-conforming files instead inline the OCG dictionary directly in the `BDC` property list: + +``` +/OC << /Type /OCG /Name (English) >> BDC +``` + +Because the specification requires OCGs to be indirect objects referenced by name, this inline form is rare and should be tolerated rather than expected. + +**Nesting.** `BDC`/`EMC` pairs can be nested. A content stream may have an outer OCG marking a section and inner OCG markings for subsections. The visibility rule is conjunctive: a span is visible only if **all** enclosing OCG contexts are visible. Implement this as a stack: + +```rust +struct OcgStack(Vec<bool>); + +impl OcgStack { + fn push(&mut self, visible: bool) { self.0.push(visible); } + fn pop(&mut self) { self.0.pop(); } + fn is_visible(&self) -> bool { self.0.iter().all(|&v| v) } +} +``` + +On each `BDC` with an `/OC` property, resolve the referenced OCG or OCMD to a boolean and push it. On `BMC`, and on `BDC` with any other tag, push `true` so that every `EMC` pops exactly the entry its own opener pushed. On `EMC`, pop. Text operators encountered while `is_visible()` returns `false` are discarded. + +--- + +## 7. XObject and Annotation OCG References + +**Form XObjects** — a Form XObject (stream with `/Subtype /Form`) may carry an `/OC` entry: + +``` +<< /Type /XObject /Subtype /Form /OC ref_to_ocg ... >> +``` + +Before descending into the XObject's content stream to extract text, resolve the `/OC` entry. If the referenced OCG or OCMD is off, skip the entire XObject.
This check is independent of any `BDC`/`EMC` marking inside the XObject itself; both must be satisfied for content to be visible. + +**Annotations** — annotation dictionaries also support `/OC`: + +``` +<< /Type /Annot /Subtype /Widget /OC ref_to_ocg ... >> +``` + +For annotations with appearance streams (`/AP`), the appearance stream text is visible only if the annotation's `/OC` resolves to on. Text from invisible annotation appearances must be excluded. + +--- + +## 8. Multilingual Layer Pattern + +A common authoring pattern places translations of the same document on separate OCGs, one per locale, with identical layout geometry. The Usage dictionary's `/Language` subkey carries the BCP 47 tag: + +``` +OCG_EN: /Usage << /Language << /Lang (en) /Preferred /ON >> >> +OCG_FR: /Usage << /Language << /Lang (fr) /Preferred /OFF >> >> +OCG_ES: /Usage << /Language << /Lang (es) /Preferred /OFF >> >> +``` + +The `/AS` entry in the default configuration fires on `/Event /View` with `/Category [/Language]`, turning on the preferred language layer and turning off others. + +For pdftract, extraction policy options: + +- **Default locale extraction** — compute the visible set from `/D` (including `/AS` processing); only extract text from the resulting on-layers. The caller gets clean, single-language output. +- **Target locale extraction** — caller specifies a BCP 47 tag; the library overrides the state table to enable only OCGs whose `/Usage/Language/Lang` matches (exact or prefix match per BCP 47 §4.4) and disables others before extraction. +- **All-layers extraction** — extract all layers regardless of state; tag each span's `ocg_name` with the layer's `/Name` value. The caller can then filter by locale post-extraction. + +When all-layers extraction is active, spans at the same position from different language layers will have overlapping bounding boxes. The caller must deduplicate or select by `ocg_name`. + +--- + +## 9. 
PageElement HF Layers + +The `PageElement` usage subtype `/HF` explicitly designates a layer as containing headers and/or footers. This is the PDF specification's own semantic label for running header/footer content. + +``` +/Usage << /PageElement << /Subtype /HF >> >> +``` + +Extraction policy for HF layers: + +- **Default:** exclude HF-layer content from the primary body text stream; emit it in a separate `headers_footers` bucket or label spans with `zone: HeaderFooter`. +- **Explicit inclusion:** caller opts in via an extraction flag to include HF content merged into body text (rarely useful but sometimes required for form extraction). +- **Detection fallback:** if a layer has no `PageElement` usage entry but its `/Name` matches heuristics like `"Header"`, `"Footer"`, `"Running Head"`, log a warning rather than auto-excluding — only the Usage dictionary is normative. + +--- + +## 10. Extraction Policy + +### Default behavior + +Extract only content on layers that are **on** in the default viewing state (computed per §5). This matches what a conforming viewer displays. No `ocg_name` metadata is emitted on spans; OCG structure is transparent to the caller. + +### Extraction modes + +| Mode | Description | `ocg_name` on span | +|---|---|---| +| `DefaultVisible` | Only on-layers per `/D` | absent | +| `TargetLayer(name)` | Only the named OCG by `/Name` match | absent | +| `TargetLocale(lang)` | Only OCGs matching BCP 47 tag in `/Language` | absent | +| `AllLayers` | All layers regardless of state | present | +| `AllLayersVisible` | Only on-layers, but tagged | present | + +### Span metadata + +When `ocg_name` tagging is active, each span carries: + +```rust +pub struct Span { + pub text: String, + pub bbox: Rect, + pub ocg_name: Option<String>, // None if not inside any OCG marking + // ... other fields +} +``` + +`ocg_name` reflects the **innermost** named OCG in the `BDC` stack at the point the span was extracted.
If a span is covered by multiple nested OCG markings, the innermost `/Name` is used; all enclosing states must be on for the span to be included in non-`AllLayers` modes. + +### Implementation notes + +- Build the OCG state table once per document from `/OCProperties/D`; cache it. +- Reuse the same table for all pages — OCG state is document-scoped, not page-scoped. +- The `/Configs` array provides alternative named configurations (e.g., "Print", "Screen"). Expose these to callers who need to extract against a non-default configuration. +- When `/OCProperties` is absent, treat all content as unconditionally visible (the document has no layers). +- Log unresolvable `/OC` references (dangling indirect refs) as warnings; do not silently discard the content — treat it as visible to avoid false negatives. diff --git a/docs/research/page-geometry-and-document-structure.md b/docs/research/page-geometry-and-document-structure.md new file mode 100644 index 0000000..03e1ebe --- /dev/null +++ b/docs/research/page-geometry-and-document-structure.md @@ -0,0 +1,207 @@ +# Page Geometry and Document Structure + +## Scope + +This document covers the structural and geometric elements of the PDF specification that a Rust text extraction library must correctly model: the page tree, box hierarchy, coordinate system, rotation handling, page labels, outlines, named destinations, the document catalog, the resources dictionary, and viewer preferences. Correct handling of these elements is a prerequisite for placing extracted glyphs in meaningful, reading-order coordinates. + +--- + +## 1. Page Tree + +The PDF page tree is rooted at the `Pages` object referenced from the document catalog. The tree is composed of two node types: + +- **Intermediate nodes** (`/Type /Pages`): have a `Kids` array of indirect references to child nodes, and a `Count` integer giving the total number of leaf pages in the subtree. They may carry inheritable attributes. 
+- **Leaf nodes** (`/Type /Page`): represent individual pages and hold or inherit all page attributes required for rendering. + +The `Count` at each intermediate node enables **O(log n) random-access lookup** in a balanced tree without traversing every leaf. To locate page index *k*, inspect `Count` at each child in `Kids` and descend into the subtree whose cumulative count covers *k*. Implementing sequential traversal is simpler but produces O(n) cost per lookup, which is unacceptable for large documents. + +**Inherited attributes** propagate from ancestor intermediate nodes to descendant pages unless a descendant overrides them. The inheritable keys are: + +| Key | Default if absent | +|---|---| +| `MediaBox` | Required, either on the page or inherited; no PDF default | +| `CropBox` | Equals `MediaBox` | +| `Rotate` | `0` | +| `Resources` | Empty dictionary | + +`UserUnit` (PDF 1.6+) is not inheritable: read it from the page dictionary itself and default to `1` when absent. + +Inheritance resolution: when building the page object for extraction, walk from the leaf upward through each `Parent` reference, collecting values for keys not yet set on the leaf. Stop at the root `Pages` node. Do this once per page and cache the result; never re-traverse the parent chain for each attribute access. + +--- + +## 2. Page Boxes + +All boxes are arrays of four numbers `[x0 y0 x1 y1]` in default user space units (points at 1/72 inch per unit unless `UserUnit` is set). The values represent the lower-left and upper-right corners; the specification does not require `x0 < x1` or `y0 < y1`, so normalize to `(min, max)` when reading. + +| Box | Key | Defaults to | +|---|---|---| +| **MediaBox** | `MediaBox` | Required | +| **CropBox** | `CropBox` | MediaBox | +| **BleedBox** | `BleedBox` | CropBox | +| **TrimBox** | `TrimBox` | CropBox | +| **ArtBox** | `ArtBox` | CropBox | + +For text extraction, `CropBox` is the correct extraction boundary. Content outside the CropBox is not visible to the user and should be clipped before including glyphs in output.
BleedBox, TrimBox, and ArtBox carry print-production semantics and are generally irrelevant to text extraction, but should be exposed in the library's page metadata API for callers that need them. + +**`UserUnit`** (PDF 1.6, optional): a positive number specifying the size of one default user-space unit in units of 1/72 inch. Default is `1`. Multiply all box coordinates and glyph positions by `UserUnit` to convert to points before any further geometry work. Most documents set `UserUnit` to `1`; documents generated for large-format printing may set it to values like `4` or `72` (the latter making 1 unit = 1 inch). + +--- + +## 3. Page Rotation + +The `Rotate` key is an integer, one of `{0, 90, 180, 270}`, specifying **clockwise rotation** applied during rendering. A page with `MediaBox [0 0 612 792]` and `Rotate 90` is rendered as a landscape page 792 units wide and 612 units tall, with the origin at the bottom-left of the rotated view. + +Rotation does not change any coordinates stored in the content stream. The coordinate system in the stream is always in the page's unrotated space. When extraction is complete and glyph positions are in unrotated page space, apply the inverse transform to produce display-space coordinates: + +- **0°**: no transform. +- **90° CW** (Rotate=90): display point `(x', y') = (y, W - x)` where `W` is the unrotated MediaBox width. +- **180°** (Rotate=180): `(x', y') = (W - x, H - y)`. +- **270° CW** (Rotate=270): `(x', y') = (H - y, x)`. + +The effective page width and height in display space also swap for 90° and 270°: + +``` +if rotate in {90, 270}: + display_width = media_height + display_height = media_width +else: + display_width = media_width + display_height = media_height +``` + +Apply the rotation transform after inverting the y-axis (see Section 4), not before. The correct order is: extract glyphs in content-stream coordinates → invert y for reading order → apply rotation to map to display space. + +--- + +## 4. 
Coordinate System Origin + +PDF default user space has the **origin at the bottom-left corner of the page**, with x increasing rightward and y increasing upward. Human reading order is top-to-bottom. To convert a glyph's PDF y-coordinate to reading-order y: + +``` +reading_y = page_height - pdf_y +``` + +where `page_height` is the height of the CropBox (which defaults to the MediaBox when absent). This form assumes the box's lower edge is at `y = 0`; for a box with a non-zero origin, subtract from the box's top edge instead. Apply this inversion to every bounding box edge: a box `[x0, y0, x1, y1]` in PDF space becomes `[x0, page_height - y1, x1, page_height - y0]` in reading-order space (the vertical extents swap because the top of the original box is at `y1`, which maps to the smaller reading_y). + +For **rotated pages**, the effective page height used in the inversion is the height of the display-space page, not the unrotated MediaBox height. Concretely: after computing the display_width/height swap from Section 3, use `display_height` as `page_height` in the inversion formula. Implement the full pipeline as: (1) apply CTM and text matrix to obtain unrotated page coordinates, (2) apply the rotation transform from Section 3 to get display-space coordinates (still y-up), (3) invert y using `display_height` to get display-space reading-order coordinates. + +--- + +## 5. Page Labels + +The `PageLabels` entry in the document catalog is a number tree mapping page indices (zero-based) to label range dictionaries. Each entry marks the start of a new labeling range. A range entry may contain: + +| Key | Description | +|---|---| +| `S` | Numbering style: `D` (decimal), `r` (lowercase roman), `R` (uppercase roman), `a` (lowercase alpha), `A` (uppercase alpha) | +| `P` | Prefix string (any PDF string) | +| `St` | Starting value (integer ≥ 1, default `1`) | + +To compute the human-readable label for physical page index *i*: + +1. Find the greatest key in the number tree that is ≤ *i*. That key is the range start *r*. +2. Offset within the range: `offset = i - r`. +3. Numeric value: `n = St + offset` (default `St = 1`). +4. 
Format *n* according to `S`; prepend `P` if present. + +If no `S` key is present, the page has only the prefix (or is unlabeled). Documents with front matter commonly use lowercase roman numerals for the first several pages and decimal for the body; the labeled numbers therefore do not match the physical page order. The library must expose both the zero-based physical index and the string label independently. + +--- + +## 6. Document Outline (Bookmarks) + +The `/Outlines` entry in the catalog references the root outline dictionary. Each outline item dictionary contains: + +- `Title`: a PDF string in either PDFDocEncoding or UTF-16BE (detected by the BOM `0xFE 0xFF`). +- `First` / `Last`: references to the first and last child items. +- `Next` / `Prev`: sibling links for items at the same level. +- `Count`: if present and positive, the item is open with that many visible descendants; negative means closed. +- `Dest` or `A`: a destination array/string or an action dictionary. + +Traverse the tree with a recursive descent: for each item, process `Title` and destination, then recurse into `First` child, then follow `Next` siblings. When `A` is present and its `S` key is `/GoTo`, the `D` entry within `A` is the destination. When `Dest` is a string, resolve it via named destinations (Section 7). When `Dest` is an array, parse it directly. + +Expose the outline as a flat or nested table-of-contents structure, each entry carrying the title string (decoded to Rust `String`), nesting depth, and resolved zero-based page index. + +--- + +## 7. Named Destinations + +Named destinations are stored in the document catalog under `Names` → `Dests` (a name tree) or, in older documents, directly as a `Dests` dictionary under the catalog. In either case, a name maps to a destination array. 
+ +Destination array formats and their semantics: + +| Format | Meaning | +|---|---| +| `[page /XYZ left top zoom]` | Specific position on page | +| `[page /Fit]` | Fit entire page in viewport | +| `[page /FitH top]` | Fit page width, scroll to `top` | +| `[page /FitV left]` | Fit page height, scroll to `left` | +| `[page /FitR l b r t]` | Fit rectangle | +| `[page /FitB*]` | Variants of bounding-box fit | + +In all cases, the first element is an indirect reference to a `/Page` object. Resolve this reference to a page index by walking the page tree to find the matching object number. Cache the object-number-to-index mapping after the first full tree traversal. + +--- + +## 8. Document Catalog + +The document catalog is reached via `trailer → Root`. Its entries relevant to text extraction: + +| Key | Purpose | +|---|---| +| `Pages` | Root of the page tree | +| `Outlines` | Root of the outline tree | +| `Names` | Name trees including `Dests`, `EmbeddedFiles`, etc. | +| `PageLabels` | Number tree for page labeling | +| `AcroForm` | Interactive form fields | +| `Metadata` | Stream containing XMP metadata | +| `MarkInfo` | Indicates tagged PDF; `Marked: true` signals reading order is in StructTree | +| `StructTreeRoot` | Root of the logical structure tree | +| `Lang` | BCP 47 language tag for the document | +| `OCProperties` | Optional content (layers) configuration | + +`Lang` should be used as the fallback language when no glyph-level or span-level language is specified. `MarkInfo` determines whether to prefer structure-tree reading order over geometric order (covered in the tagged-PDF research document). `OCProperties` affects which content streams are active; for extraction, treat all optional content as visible unless the caller specifies otherwise. + +--- + +## 9. Resources Dictionary + +Resources provide the named objects (fonts, images, graphics states, etc.) referenced in a content stream. 
A `Resources` dictionary has sub-dictionaries keyed by resource category: + +| Key | Contents | +|---|---| +| `Font` | Map of resource name → font dictionary reference | +| `XObject` | Map of resource name → XObject stream reference | +| `ExtGState` | Map of resource name → graphics state parameter dictionary | +| `ColorSpace` | Map of resource name → color space definition | +| `Pattern` / `Shading` | Pattern and shading resources | +| `ProcSet` | Legacy array, ignore for extraction | + +Resource names in content streams (e.g., `/F1` in `Tf`) are resolved against the active `Resources` dictionary. For a page's main content stream, use the page-level `Resources`; if absent, use the inherited resources resolved per Section 1. For Form XObjects and Type 3 fonts, each has its own `Resources` dictionary that takes precedence within its content stream. + +Resolution is always strictly local: a resource name in a Form XObject is looked up in that XObject's own `Resources`, not the parent page's. Implement resource resolution as a stack that pushes the current stream's dictionary on entry and pops on exit. + +--- + +## 10. Viewer Preferences and Page Layout + +`ViewerPreferences` (in the catalog) and `PageLayout` affect multi-page presentation but not individual page content. Relevant keys: + +| Key | Values | Extraction relevance | +|---|---|---| +| `PageLayout` | `SinglePage`, `OneColumn`, `TwoColumnLeft`, `TwoColumnRight`, `TwoPageLeft`, `TwoPageRight` | Two-column/two-page layouts imply pages are displayed as spreads; expose to caller for spread-aware output | +| `Direction` (in `ViewerPreferences`) | `L2R` (default), `R2L` | R2L affects which page is the left page in a spread; relevant for logical page ordering in output | +| `DisplayDocTitle` | boolean | Whether the viewer shows the document title from `Info` or the filename; informational only | + +For extraction, `Direction: R2L` means that in a two-page spread, the higher-numbered page is on the left. 
The library should expose this flag and let a caller that assembles pages into spreads decide how to reorder output. At the single-page extraction level, `Direction` and `PageLayout` have no effect on glyph coordinates. + +--- + +## Implementation Notes + +- Build the object-number-to-page-index map eagerly on document open; it is used by destination resolution, outline traversal, and link annotation handling. +- Normalize all box arrays to `(x_min, y_min, x_max, y_max)` at parse time. +- Resolve inherited attributes into a flat `PageAttributes` struct at page-open time; do not re-traverse the parent chain during glyph extraction. +- Apply `UserUnit` scaling before any geometry comparison or coordinate inversion. +- Store the raw `Rotate` value from the resolved page dictionary; apply the transform matrix as the last step after all content-stream coordinate math is complete. +- Decode `Title` strings in outline items by checking for the UTF-16BE BOM; fall back to PDFDocEncoding (largely ISO Latin-1, with PDF-specific assignments in the 0x18–0x1F and 0x80–0x9F ranges) if the BOM is absent. diff --git a/docs/research/pdf-encryption-and-security.md b/docs/research/pdf-encryption-and-security.md new file mode 100644 index 0000000..7c4db08 --- /dev/null +++ b/docs/research/pdf-encryption-and-security.md @@ -0,0 +1,191 @@ +# PDF Encryption and Security + +## Purpose + +This document describes how `pdftract` detects, decrypts, and processes encrypted PDFs to enable text extraction. It covers the Standard security handler across all revisions, key derivation algorithms, object-level decryption, permission flag semantics, and the implementation approach using the RustCrypto ecosystem. + +--- + +## 1. Encryption Detection + +Before parsing any content objects, `pdftract` must inspect the trailer dictionary for the `/Encrypt` entry. The trailer is located by following the `startxref` offset, and its dictionary is parsed before any cross-reference resolution. 
If the trailer contains an `/Encrypt` key, the file is encrypted. + +The `/Encrypt` value is usually an indirect reference to the encryption dictionary. The `Filter` name within that dictionary identifies the security handler: + +- `/Standard`: password-based encryption (all PDF versions) +- `/Adobe.PubSec`: certificate-based encryption using public-key cryptography + +If `/Encrypt` is present and the caller has supplied no password, `pdftract` should first attempt authentication with the empty user password, which many permission-restricted documents use. Only if that attempt fails must it fail fast with `EncryptionError::PasswordRequired` before attempting object decoding. Proceeding without decryption produces garbage text instead of a clear error. + +Cross-reference streams (PDF 1.5+) are themselves not encrypted even in encrypted documents. The trailer and cross-reference data remain plaintext so the reader can locate the encryption dictionary before decrypting anything else. + +--- + +## 2. Standard Security Handler + +The Standard security handler (`/Filter /Standard`) is the ubiquitous password-based scheme. Its revision history maps directly to the cryptographic strength available: + +| V | R | Algorithm | Key size | PDF version | +|---|---|---|---|---| +| 1 | 2 | RC4 | 40-bit | 1.1–1.3 | +| 2 | 3 | RC4 | 40–128-bit | 1.4 | +| 4 | 4 | RC4 or AES-128 (via crypt filters) | 128-bit | 1.5 (AES from 1.6) | +| 5 | 5 | AES-256 (single SHA-256) | 256-bit | 1.7 ext3 | +| 5 | 6 | AES-256 (iterated SHA-256/384/512) | 256-bit | PDF 2.0 | + +The `/Encrypt` dictionary contains: + +- `Filter`: `/Standard` +- `SubFilter`: optional; specifies a more specific handler +- `V`: encryption algorithm version (integer) +- `R`: revision of the Standard handler +- `O` (32 bytes for R2–R4; 48 bytes for R5/R6): owner password entry (the encrypted user password for R2–R4; a hash plus salts for R5/R6) +- `U` (32 bytes for R2–R4; 48 bytes for R5/R6): user password verifier (a hash plus salts for R5/R6) +- `OE` / `UE` (32 bytes; R5/R6 only): file encryption key encrypted under the owner/user intermediate key +- `P`: signed 32-bit integer encoding permission flags +- `Length`: 
key length in bits (R3/R4; default 40 for R2) +- `EncryptMetadata`: boolean; if false, the XMP metadata stream is not encrypted (default true) +- `CF` / `StmF` / `StrF`: crypt filter table and per-stream/string filter names (R4+) + +--- + +## 3. Key Derivation — R2, R3, R4 + +The file encryption key is derived via MD5. The algorithm follows these steps precisely: + +1. **Password padding.** Take the user-supplied password. If it is longer than 32 bytes, truncate it to 32 bytes; if it is shorter, append bytes from the start of the canonical 32-byte padding string defined in the specification until the length is exactly 32 bytes. An empty password therefore becomes the padding string itself. + +2. **Hash construction.** Initialize MD5 and feed it: + - The 32-byte padded password + - The 32-byte `O` entry from the encryption dictionary + - The 4-byte `P` value in little-endian order + - The first element of the `/ID` array in the trailer (the file identifier, normally 16 bytes) + - If `R >= 4` and `EncryptMetadata` is false, the 4-byte sequence `0xFF 0xFF 0xFF 0xFF` + +3. **Iteration (R3+).** For revisions 3 and above, re-hash 50 times: each round takes the first `n` bytes of the previous MD5 output as its input, where `n = Length / 8`. + +4. **Truncation.** The file encryption key is the first `n` bytes of the final MD5 output. For R2, `n = 5` (40-bit key). For R3/R4, `n = Length / 8` (up to 16 bytes). + +**User password verification.** For R2, RC4-encrypt the 32-byte padding string with the derived key and compare against the full 32-byte `U` entry. For R3/R4, compute MD5 over the padding string followed by the file identifier, RC4-encrypt the 16-byte digest with the file key, then encrypt the result 19 more times with modified keys (every key byte XORed with the iteration counter 1–19), and compare against the first 16 bytes of `U`; the remaining 16 bytes of `U` are arbitrary padding. + +**Owner password.** The `O` entry holds the padded user password, RC4-encrypted under a key derived from the owner password. 
To check the owner password: derive an MD5-based key from the padded owner password (same padding step, no O/P/ID involvement; for R3+, re-hash the digest 50 more times), RC4-decrypt the `O` entry to recover the padded user password (for R3+, 20 RC4 passes with the key bytes XORed by the counter 19 down to 0), then run the user key derivation and verification with the recovered password. + +--- + +## 4. Key Derivation — R5 and R6 (PDF 2.0) + +R5 and R6 replace MD5 with SHA-based hashing and use a two-stage key structure. The file encryption key is a random 256-bit value stored encrypted inside the `OE` and `UE` entries. Passwords are UTF-8 strings truncated to 127 bytes; the R2–R4 padding step does not apply. + +**R5 (deprecated).** The 48-byte `U` entry consists of a 32-byte hash followed by an 8-byte validation salt and an 8-byte key salt. To verify the user password: compute SHA-256 of `password || validation_salt` and compare against the first 32 bytes of `U`. To derive the intermediate key: compute SHA-256 of `password || key_salt`. Decrypt `UE` using AES-256-CBC with this intermediate key (zero IV, no padding) to obtain the 32-byte file encryption key. R5 is deprecated because its single SHA-256 round provides insufficient password hashing strength and is vulnerable to GPU-accelerated brute force. + +**R6 (PDF 2.0).** R6 keeps the same entry layout but replaces each single SHA-256 with an iterative hash (Algorithm 2.B in ISO 32000-2). Starting from `K = SHA-256(password || salt)`, each round builds 64 repetitions of `password || K`, AES-128-CBC-encrypts that buffer using the first 16 bytes of `K` as the key and the next 16 bytes as the IV, selects SHA-256, SHA-384, or SHA-512 from the first 16 bytes of the ciphertext taken as a big-endian integer modulo 3, and hashes the ciphertext to produce the next `K`. The loop runs for at least 64 rounds and stops once the last ciphertext byte is no greater than the round number minus 32 (consult the specification for the exact round indexing); the first 32 bytes of the final `K` are the result. This makes password guessing significantly more expensive than under R5. The structure of `U`, `UE`, `O`, and `OE` is otherwise analogous to R5. + +The owner variants (`O`, `OE`) additionally include the 48-byte `U` value in the hash input, binding the owner key to the specific user key entry. + +--- + +## 5. 
Object-Level Decryption + +**R2–R4.** Each encrypted string and stream body uses a per-object key derived from the file encryption key. The derivation: + +1. Take the file encryption key bytes. +2. Append the 3 low-order bytes of the object number in little-endian order. +3. Append the 2 low-order bytes of the generation number in little-endian order. +4. For AES-encrypted data (R4 with the `StmF` or `StrF` crypt filter specifying AES), additionally append the 4-byte sequence `0x73 0x41 0x6C 0x54` ("sAlT"). +5. Compute MD5 of this concatenation and take the first `min(n + 5, 16)` bytes as the per-object key. + +RC4 is a symmetric stream cipher, so decryption is the same operation as encryption: apply it directly to the ciphertext with the per-object key. AES-128 uses CBC mode with a 16-byte random IV prepended to the ciphertext; strip the IV before decryption and remove the PKCS#5 padding afterwards. + +**R5/R6.** No per-object key derivation. AES-256-CBC is used for all strings and streams with a 16-byte random IV prepended to each individual ciphertext. The same file encryption key applies to every object. + +**Crypt filter opt-out.** A stream may include a `Crypt` filter with `/Name /Identity` in its filter pipeline to declare that it is not encrypted. This mechanism is used for streams that must be readable before the encryption context is established. `pdftract` must check for this before applying decryption to any stream. + +**Cross-reference streams** are never encrypted, regardless of handler or revision. + +--- + +## 6. Permission Flags + +The `P` entry is a signed 32-bit integer. Bits are numbered from 1 (LSB). Bits 1–2 are reserved and must be zero; `P` is fed into the key derivation exactly as stored. 
The semantically significant bits: + +| Bit | Meaning | +|-----|------------------------------------------| +| 3 | Print the document | +| 4 | Modify the document | +| 5 | Copy or extract text and graphics | +| 6 | Add or modify annotations and forms | +| 9 | Fill interactive form fields | +| 10 | Extract text for accessibility | +| 11 | Assemble the document | +| 12 | Print at high fidelity | + +A bit value of 0 means the permission is denied. Bits not listed are reserved. + +**Bit 5 (extract text)** is the primary flag governing `pdftract`'s core operation. By default, `pdftract` should respect this flag and return `EncryptionError::ExtractionNotPermitted` if bit 5 is clear. Bit 10 (extract for accessibility) provides an alternative grant; if bit 10 is set and bit 5 is clear, accessibility-mode extraction may be allowed. `pdftract` should expose an `allow_accessibility_extraction: bool` option in `ExtractionConfig` to give callers explicit control over this behavior, particularly for screen readers and assistive tooling. + +--- + +## 7. Public-Key Encryption + +When `Filter` is `/Adobe.PubSec`, the file is encrypted for one or more specific certificate holders. The encryption dictionary contains a `Recipients` array; each entry is a CMS `EnvelopedData` structure containing the file encryption key wrapped for a specific recipient's X.509 certificate using the recipient's RSA public key. + +To decrypt, the caller supplies their private key. `pdftract` iterates the recipient list, attempts decryption of each `EnvelopedData` blob, and uses the unwrapped key as the file encryption key. The actual content encryption algorithm (RC4 or AES at various key sizes) is specified within the CMS structure. + +Public-key encryption is not required for the initial implementation. The detection path must be present: if `Filter` is `/Adobe.PubSec`, `pdftract` returns `EncryptionError::UnsupportedSecurityHandler` with a descriptive message identifying the handler name. 
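
The detection path described above can be sketched roughly as follows. This is a minimal illustration, not the `pdftract` implementation: the `SecurityHandler` enum and the `select_handler` function are hypothetical names, the error enum is a reduced stand-in for the one defined in Section 10, and the `Filter` name is assumed to arrive as a plain string without its leading slash.

```rust
/// Reduced stand-in for the `EncryptionError` enum from Section 10;
/// only the variant needed for this sketch is reproduced.
#[derive(Debug, PartialEq)]
enum EncryptionError {
    UnsupportedSecurityHandler(String),
}

/// Hypothetical marker for the handlers this sketch knows how to process.
#[derive(Debug, PartialEq)]
enum SecurityHandler {
    Standard,
}

/// Maps the `Filter` name of the encryption dictionary to a handler.
/// `filter` is assumed to be the PDF name without its leading slash.
fn select_handler(filter: &str) -> Result<SecurityHandler, EncryptionError> {
    match filter {
        "Standard" => Ok(SecurityHandler::Standard),
        // Adobe.PubSec and any unknown handler take the same rejection path,
        // and the message carries the handler name for diagnostics.
        other => Err(EncryptionError::UnsupportedSecurityHandler(format!(
            "security handler /{other} is not supported"
        ))),
    }
}
```

Routing every unrecognized name through one arm keeps the rejection explicit, so a new handler cannot silently fall into the Standard decryption path.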
+ +--- + +## 8. Encrypted Metadata + +When `EncryptMetadata` is `false` (not the default), the XMP metadata stream (if present as an indirect object) is excluded from encryption. Its stream data can be decoded without a password. This is relevant for applications that need document metadata (title, author, creation date) without performing full content decryption. + +The `/Info` dictionary, however, is always encrypted in Standard-handler documents. Its string values (title, subject, keywords, etc.) require the file encryption key to decode. `pdftract` must apply per-object decryption to `/Info` strings whenever the file is encrypted, regardless of `EncryptMetadata`. + +--- + +## 9. Implementation Approach + +The RustCrypto ecosystem provides all necessary primitives: + +- `md-5`: MD5 for R2–R4 key derivation and per-object key computation +- `sha2`: SHA-256/384/512 for R5/R6 intermediate key derivation +- `rc4`: RC4 stream cipher for R2/R3/R4 string and stream decryption +- `aes` + `cbc`: AES-128-CBC (R4) and AES-256-CBC (R5/R6) decryption; `cbc` is the CBC mode wrapper built on the `cipher` traits + +Structure the decryptor as a `Decryptor` type that wraps the parsed encryption dictionary and holds the derived file encryption key. The object parser passes raw ciphertext bytes through `Decryptor::decrypt_string(obj_num, gen_num, ciphertext)` and `Decryptor::decrypt_stream(obj_num, gen_num, ciphertext)` before returning parsed objects to higher layers. This keeps decryption transparent to the text extraction layer. + +Cache the file encryption key on the `Decryptor` after first derivation. Per-object keys for R2–R4 are cheap to compute (one MD5 per object) and need not be cached; computing them inline avoids the memory overhead of a per-object map. + +--- + +## 10. Error Handling + +Define an `EncryptionError` enum in the public API: + +```rust +pub enum EncryptionError { + /// Encrypted document; no password was provided. 
+ PasswordRequired, + /// The supplied password did not match the document's owner or user password. + WrongPassword, + /// The encryption dictionary is missing required entries or is structurally invalid. + InvalidEncryptionDictionary(String), + /// The revision or V value is not supported. + UnsupportedRevision { v: u8, r: u8 }, + /// The security handler is not supported (e.g., Adobe.PubSec). + UnsupportedSecurityHandler(String), + /// The permission flags deny text extraction. + ExtractionNotPermitted, + /// Decryption of a specific object failed (truncated ciphertext, bad IV, etc.). + ObjectDecryptionFailed { obj_num: u32, gen_num: u16 }, +} +``` + +**Wrong password** is detected at key derivation time by comparing the derived key's output against the `U` entry as described in Section 3. Return `WrongPassword` immediately rather than allowing garbage decryption to propagate. + +**Corrupted encryption dictionary** — missing `V`, `R`, `O`, `U`, `P`, or `/ID` — should return `InvalidEncryptionDictionary` with a message identifying the missing field. + +**Unsupported revision** — any `V`/`R` combination outside the table in Section 2 — returns `UnsupportedRevision`. This handles future revisions gracefully without panicking. + +**Partially encrypted files** are rare but real: some producers mark the file as encrypted while leaving certain streams unencrypted. If decryption of an individual object's ciphertext fails (e.g., AES block size mismatch), return `ObjectDecryptionFailed` and continue processing other objects, allowing partial text extraction with a warning in the result set. 
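
As an illustration of how the `ExtractionNotPermitted` variant might be produced from the `P` flags of Section 6, here is a minimal sketch. The function name `check_extraction` and the constants are assumptions made for this example, and the error enum is reduced to the single variant used; only the bit positions and the `allow_accessibility_extraction` option come from the text above.

```rust
/// Bit 5 (1-based) of `P`: copy or extract text and graphics.
const PERM_EXTRACT: u32 = 1 << 4;
/// Bit 10 (1-based) of `P`: extract text for accessibility.
const PERM_ACCESSIBILITY: u32 = 1 << 9;

/// Reduced stand-in for the public `EncryptionError` enum.
#[derive(Debug, PartialEq)]
enum EncryptionError {
    ExtractionNotPermitted,
}

/// Decides whether extraction may proceed for a given `P` value.
/// Hypothetical helper; a real check would also consider whether the
/// document was opened with the owner password.
fn check_extraction(
    p: i32,
    allow_accessibility_extraction: bool,
) -> Result<(), EncryptionError> {
    // Reinterpret the signed value as raw bits; a set bit grants the permission.
    let bits = p as u32;
    let may_extract = (bits & PERM_EXTRACT) != 0;
    let may_extract_for_accessibility = (bits & PERM_ACCESSIBILITY) != 0;

    if may_extract || (allow_accessibility_extraction && may_extract_for_accessibility) {
        Ok(())
    } else {
        Err(EncryptionError::ExtractionNotPermitted)
    }
}
```

Keeping the accessibility grant behind an explicit flag means the default behaviour stays conservative, while assistive tooling can opt in deliberately.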
diff --git a/docs/research/performance-and-streaming-architecture.md b/docs/research/performance-and-streaming-architecture.md new file mode 100644 index 0000000..64992b6 --- /dev/null +++ b/docs/research/performance-and-streaming-architecture.md @@ -0,0 +1,175 @@ +# Performance and Streaming Architecture + +## Overview + +Handling large PDFs (100 MB+, 1000+ pages) efficiently requires deliberate architectural decisions at every layer: file I/O, object parsing, content stream processing, output serialization, and concurrency. This document specifies the performance-critical patterns for `pdftract` and the rationale behind each choice. + +--- + +## 1. Memory-Mapped File Access + +Use `memmap2::Mmap` rather than `std::fs::read()` or `BufReader`. Reading the entire file into a `Vec` allocates contiguous heap memory proportional to file size, which is unacceptable for 500 MB+ inputs. With `mmap`, the kernel maps the file's pages into the virtual address space; physical RAM is allocated only when pages are accessed, and unused pages are evicted under memory pressure without any application code involvement. + +The critical advantage for PDF parsing is **random access without sequential read cost**. The cross-reference table at the end of the file maps object numbers to byte offsets throughout the file. With `mmap`, seeking to object offset `0x1A3F00` is a pointer addition — `&mmap[offset..]` — with no syscall. The OS page fault mechanism fetches only the 4 KB page containing that offset. + +On 64-bit Linux, the virtual address space is 128 TB; mapping a 1 GB PDF consumes one entry in the process's VMA table and a trivial amount of page table space until pages are touched. The 32-bit limitation (4 GB VA space) is not a concern for any modern deployment target. + +**Sequential vs. 
random access tradeoff:** For a sequential single-pass parse (linearized PDFs, reading content streams in order), `BufReader` with a 64–128 KB buffer can match or exceed `mmap` throughput because the kernel's readahead prefetches pages ahead of the cursor. For the dominant PDF use case, random access to objects scattered across the file, `mmap` is superior. A practical hybrid: open with `mmap`, and call `madvise(MADV_SEQUENTIAL)` on regions known to be read linearly (e.g., large content streams). + +```rust +use memmap2::MmapOptions; +use std::fs::File; + +let file = File::open(path)?; +// SAFETY: assumes the file is not truncated or modified while it is mapped. +let mmap = unsafe { MmapOptions::new().map(&file)? }; +// Treat &mmap[..] as &[u8] for all subsequent parsing +``` + +--- + +## 2. Lazy Object Loading + +A PDF's xref table (or xref stream in PDF 1.5+) provides a complete map from `(object_number, generation)` to byte offset. Parse this table eagerly at open time, since it is compact relative to the file, but defer parsing each object until its first access. + +Maintain an object cache as a `HashMap` keyed by object identifier. For documents with thousands of objects, bound the cache with an LRU eviction policy (`lru` crate). A capacity of 4096 entries is a reasonable starting point for the working set of a typical page range and prevents unbounded growth. + +**Object streams (`/ObjStm`)**: PDF 1.5 compresses groups of objects into a single stream (FlateDecode, typically). When any object from a given `/ObjStm` is requested, decompress the entire stream once, parse all contained objects, and insert them all into the cache. The decompressed bytes can be stored in a `Bytes` handle (from the `bytes` crate) to allow zero-copy slicing across multiple parsed objects from the same stream. + +```rust +// `ObjectId`, `XrefEntry`, and `PdfObject` are placeholder type names. +struct ObjectCache { + xref: HashMap<ObjectId, XrefEntry>, + parsed: LruCache<ObjectId, Arc<PdfObject>>, + objstm_cache: HashMap<ObjectId, Bytes>, // decompressed stream bytes +} +``` + +--- + +## 3. Streaming Page Output + +Accumulating extraction results for a 1000-page document into a single `Vec` before serializing is prohibitive in memory. 
Instead, emit NDJSON (newline-delimited JSON): one JSON object per line, flushed to the output `io::Write` as each page is processed. + +`serde_json`'s streaming API via `serde_json::Serializer::new(writer)` writes directly to any `io::Write` without an intermediate `String` allocation. Wrap the output in a `BufWriter` to amortize `write` syscalls. + +**Tradeoff**: Streaming output is incompatible with features requiring a full document pass: +- **Outline (bookmark) building**: PDF outlines reference destination pages; the full outline tree must be resolved before any page is emitted if outline data is included per-page. +- **Page label resolution**: `/PageLabels` is a document-level number tree; it can be parsed once before streaming begins. +- **Cross-page table detection**: Table cells spanning page breaks require buffering multiple pages. This feature must be opt-in and implies non-streaming mode. + +Default to streaming mode; expose `--no-stream` for use cases requiring full-document analysis. + +--- + +## 4. Parallel Page Processing + +Each page's content stream is self-contained: it references resources (fonts, XObjects) by name within its resource dictionary, resolves them via the document's shared object graph, but produces output independent of other pages. This makes page processing embarrassingly parallel. + +Use rayon's `into_par_iter()` over the range of page indices. Shared mutable state must be wrapped in an `Arc` around an `RwLock`: +- Object cache: read locks dominate on cache hits; contention is low if the cache is warm. Note that `lru::LruCache::get` takes `&mut self` to update recency, so a true read-only path needs `peek` or a `Mutex` instead. +- Font cache: keyed by font object reference; write locks occur only on first use of each font. +- Image XObject cache: keyed by XObject reference, same pattern. + +**Avoiding lock contention on the hot path**: Do not hold a `RwLock` read guard across the content stream parse loop. The pattern is: acquire the lock, clone the `Arc` for the needed font, release the lock immediately, then use the unguarded `Arc` for the duration of parsing. 
Font data and CMap tables should be `Arc`-wrapped immutable structs: once written, never mutated. + +**Output ordering**: `rayon` does not guarantee the order in which pages finish, but collecting an indexed parallel iterator into a `Vec` preserves input order, so no explicit sort is needed in that case. If results are instead delivered through a channel or `for_each`, carry `(page_index, PageResult)` pairs and reorder them before output. For memory efficiency with large documents, process in chunks (e.g., 64 pages at a time) and stream each chunk's ordered output before beginning the next. + +--- + +## 5. Content Stream Parsing Performance + +PDF content streams are sequences of operands followed by operator names (e.g., `(Hello) Tj`, `10 0 0 10 72 720 cm`). Parsing is dominated by the tokenizer. + +A hand-rolled byte-level tokenizer over `&[u8]` can outperform regex-based approaches substantially for this workload (5–10x is the working estimate here; confirm it with the benchmarks in Section 8): there is no regex engine overhead, no capture group allocation, and no UTF-8 validation on the raw stream. Decode string operands to Unicode only when constructing text output. + +Operator names are short ASCII strings. Match them against a static lookup table (a `phf::Map<&[u8], Operator>` built at compile time) to avoid heap allocation for operator dispatch. For the ~70 PDF operators, a perfect hash or simple match on a `&[u8]` slice is O(1). + +**Parser combinator crates**: `winnow` (a fork of `nom`) offers a clean combinator API with competitive performance and good error recovery. It operates on `&[u8]` natively. For content streams, a hand-rolled state machine may still win on throughput because content stream tokens are regular enough that the overhead of combinator composition is visible in profiles. Use `winnow` for the structural parser (cross-reference streams, object syntax) where correctness matters more than raw throughput, and a hand-rolled tokenizer for content streams. + +--- + +## 6. Font and Glyph Caching + +Font objects (Type1, TrueType, CIDFont) are referenced by resource name within each page but backed by document-level indirect objects. The same font object is typically used across hundreds of pages. 
Cache at the object reference level, not the resource name level. + +Per font entry, cache: +- The decoded `ToUnicode` CMap as a `HashMap<u16, char>` (or `Vec<(u16, char)>` sorted for binary search when the map is dense and ordered). +- The encoding vector (a 256-entry array of `Option` values, one per character code, for simple fonts). +- The glyph width table as a `Vec` of widths indexed by character code, used for text position tracking. + +CMap parsing, especially for CIDFont CMaps with `beginbfrange` sections covering thousands of code points, is the most expensive per-font operation. Wrap the parsed result in `Arc` and store in the font cache. Worker threads clone the `Arc`, not the data. + +Glyph-to-Unicode lookup should be O(1) on the hot path, or at worst a binary search over a small sorted table. Use `HashMap` for sparse CMaps (CID fonts with sparse mappings) and a direct-index `Vec` for dense simple-font encodings. + +--- + +## 7. Image Decoding Performance + +PDF image XObjects use several compression filters: + +- **FlateDecode**: `flate2` with the `miniz_oxide` backend. Fast, pure Rust, no FFI overhead. Suitable for in-process decoding. +- **DCTDecode (JPEG)**: Prefer `zune-jpeg` over `jpeg-decoder` if benchmarks on the target corpus confirm the expected throughput advantage (20–40% is the working estimate for typical PDF-embedded JPEGs). Both are pure Rust. +- **JPEG2000 (JPXDecode)**: No mature pure-Rust decoder exists. Use OpenJPEG via FFI (`openjpeg-sys`) or defer to a subprocess. This is a correctness requirement for scanned PDFs from certain scanners. +- **JBIG2**: Used in scanned document PDFs. The only production-grade decoder is `jbig2dec` (C). Invoke via `jbig2dec-sys` FFI bindings or a subprocess. Do not block the rayon thread pool on subprocess I/O; use a dedicated blocking thread pool (`tokio::task::spawn_blocking` or `std::thread::spawn`). +- **CCITTFaxDecode**: Pure Rust implementation is feasible; a reference exists in the `pdf` crate ecosystem. + +Cache decoded image data in an `Arc`-wrapped buffer keyed by XObject reference. A page that places the same image 50 times (e.g., a watermark) should decode once. + +--- + +## 8. 
Benchmarking Methodology + +Measure at multiple granularities: + +- **Throughput**: pages/second and MB of PDF input/second, end-to-end. +- **Memory**: peak RSS via `/proc/self/status` snapshots, and heap allocations via `dhat` (compile with `dhat` feature, profile with `dhat-viewer`). +- **Latency distribution**: tail latency (p99) matters for the HTTP server mode. + +Representative corpus categories: +- Academic papers (LaTeX-generated, many Type1/TrueType fonts, dense text). +- Financial filings (SEC EDGAR PDFs: forms, tables, mixed fonts). +- Scanned documents (rasterized pages, JBIG2/JPEG images, minimal text layer). +- Technical manuals (large page counts, complex layouts, embedded vector graphics). +- PDF forms (AcroForm, interactive fields — primarily object graph stress test). + +Use `criterion` for microbenchmarks of hot functions (tokenizer, CMap lookup, FlateDecode). For end-to-end benchmarks, drive with `hyperfine` against a fixed corpus. Profile with `cargo flamegraph` (wraps `perf record` + `inferno`) to identify throughput bottlenecks. Use `dhat` specifically for allocation hotspots — it attributes each allocation to its call stack, which is essential for finding unnecessary `String` clones in the parse path. + +--- + +## 9. Binary Size and Startup Time + +Full feature compilation (font handling, JBIG2 FFI, JPEG2000, Tesseract OCR) produces a binary well over 50 MB. Mitigate with Cargo feature flags: + +```toml +[features] +default = ["flate", "jpeg"] +jbig2 = ["jbig2dec-sys"] +jpeg2000 = ["openjpeg-sys"] +ocr = ["tesseract-sys"] +``` + +Apply LTO and size optimization for release builds: + +```toml +[profile.release] +lto = "thin" # "fat" for maximum but slow; "thin" is a good default +opt-level = "z" # minimize binary size; switch to "3" if throughput is more important +codegen-units = 1 +``` + +Use `cargo-bloat` (`cargo bloat --release --crates`) to identify which crates dominate binary size. 
Common offenders: `regex`, `unicode-data` tables, and statically linked C libraries. Link Tesseract dynamically (`tesseract-sys` supports this) to keep the binary distributable without embedding the full OCR runtime. + +Avoid `lazy_static!` or `once_cell::sync::Lazy` initializations on the startup critical path for the CLI. Prefer computing lookup tables (`phf::Map`) at compile time. + +--- + +## 10. HTTP Server Mode Performance + +Use `axum` for the `pdftract serve` endpoint: ergonomic handler composition, `tower` middleware ecosystem, and `tokio` integration. Key performance considerations: + +**Request-level memory bounding**: A naive implementation that buffers the full multipart body before parsing can OOM under concurrent large-PDF submissions. Stream the multipart body into a temporary file (via `axum::extract::Multipart` + `tokio::fs::File`), then open the temp file with `mmap` for parsing. This limits in-flight memory per request to roughly the working set of one PDF parse. + +**Concurrency control**: Bound concurrent extraction jobs to `num_cpus::get()` with a `tokio::sync::Semaphore`. Requests beyond this limit queue with a configurable timeout. Without this, four simultaneous 500 MB PDFs can saturate RAM before any job completes. + +**Connection keep-alive**: Enable HTTP/1.1 keep-alive (axum default) and consider HTTP/2 for high-throughput callers. HTTP/2 multiplexing allows the client to issue multiple extraction requests concurrently on one connection without HTTP-level head-of-line blocking. + +**Response streaming**: Use `axum::body::Body::from_stream()` with a `tokio_stream::wrappers::ReceiverStream` to stream NDJSON output as pages complete, rather than buffering the full extraction result before sending the first byte. This reduces time-to-first-byte significantly for large documents.
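The response-streaming idea can be sketched without any async runtime. The block below is a minimal stand-in, not the server code: it uses `std::sync::mpsc` in place of a `tokio` channel and a plain `Write` sink in place of an HTTP body, and `PageResult`, `stream_ndjson`, and `demo` are hypothetical names introduced only for illustration. It shows the shape of the approach: a worker sends pages as they finish, and the writer emits one NDJSON line per page and flushes immediately.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread;

/// Hypothetical per-page result; the real output schema is not specified here.
pub struct PageResult {
    pub index: usize,
    pub text: String,
}

/// Writes one NDJSON line per page as results arrive on the channel.
/// Flushing after every line lets the first bytes leave before extraction ends.
pub fn stream_ndjson<W: Write>(rx: mpsc::Receiver<PageResult>, mut out: W) -> io::Result<usize> {
    let mut written = 0;
    // The loop ends when every sender has been dropped.
    for page in rx {
        // Assumption: `{:?}` quoting is close enough to JSON for plain ASCII text;
        // a real implementation would use a JSON serializer.
        writeln!(out, "{{\"page\":{},\"text\":{:?}}}", page.index, page.text)?;
        out.flush()?;
        written += 1;
    }
    Ok(written)
}

/// Runs a worker thread that produces two pages and collects the streamed output.
pub fn demo() -> io::Result<String> {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        for (i, t) in ["first page", "second page"].iter().enumerate() {
            tx.send(PageResult { index: i, text: t.to_string() }).unwrap();
        }
        // `tx` is dropped here, which closes the channel.
    });
    let mut buf = Vec::new();
    stream_ndjson(rx, &mut buf)?;
    worker.join().unwrap();
    Ok(String::from_utf8(buf).unwrap())
}
```

In the actual server the receiving end would presumably be wrapped in a stream and handed to the response body; the per-line flush is the part that carries over.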
diff --git a/docs/research/raster-ocr-pipeline.md b/docs/research/raster-ocr-pipeline.md new file mode 100644 index 0000000..03b6c9f --- /dev/null +++ b/docs/research/raster-ocr-pipeline.md @@ -0,0 +1,196 @@ +# Raster OCR Pipeline for PDF Text Recovery + +## Overview + +Not all PDF pages carry extractable vector text. Scanned documents, image-only PDFs, and PDFs with corrupt or dummy text layers require OCR to recover readable content. This document describes the full pipeline from trigger detection through output alignment, as it applies to `pdftract`. + +--- + +## 1. When to Trigger OCR + +### Detection Signals + +Four independent signals indicate that a page requires OCR: + +**No text operators.** A content stream parse that yields zero `Tj`, `TJ`, `'`, `"`, or `Do` (form XObject with text) operators is the strongest indicator. If the page contains only image XObjects and path operators, OCR is mandatory. + +**Suspiciously low character density.** Compute the ratio of character glyph bounding box area to page area. Body text pages should yield densities above roughly 0.03 (3%). A page with a large raster image and a handful of stray characters (OCR artifacts from a prior tool, or a page number alone) falls below this threshold and warrants re-examination. + +**Bounding box misalignment (fake text layer).** Some scanned PDFs carry an invisible text layer placed by a prior OCR pass. Validate each character's glyph bounding box against the underlying raster. Render the page and sample pixel intensity in the region each glyph should occupy. If the dominant pixel value is white (background) in >80% of sampled glyphs, the text layer is synthetic and untrustworthy. The character positions may also be zero-width or all positioned at a single coordinate, which is another reliable indicator of a dummy layer. + +**Below-threshold extraction confidence.** If the PDF uses a Type3 or CIDFont with missing ToUnicode entries, character codes cannot be mapped reliably. 
Track the fraction of unmapped characters per page; above 25% missing, confidence in vector extraction is too low and OCR should take over. + +### Decision Algorithm + +Use a **vector-first with OCR fallback** strategy. Attempt vector extraction; if any of the above signals fire, queue the page for OCR. Do not run both in parallel by default — OCR is expensive and the result comparison logic is non-trivial. The parallel-and-compare approach is justified only when assisted OCR (section 4) is in use and you need to resolve conflicts between the two sources. In that case, run the two passes concurrently and arbitrate at merge time. + +--- + +## 2. Image Preprocessing Pipeline + +Raw page rasters fed directly into Tesseract produce poor results. A deterministic preprocessing chain is essential. + +### Rasterization DPI + +Render the PDF page to a raster using a PDF rendering backend (e.g., `pdfium-render` or `mupdf` bindings). Use **300 DPI minimum** for standard body text. For pages with font sizes below 8pt or fine print, use **400 DPI**. Higher DPI yields better Tesseract accuracy up to roughly 600 DPI; beyond that, gains plateau and memory cost dominates. + +Store the raster as a grayscale or 8-bit image. Color channels add no accuracy benefit for Latin-script OCR and increase memory pressure. + +### Deskewing + +Scanned pages are rarely axis-aligned. Two reliable methods: + +- **Hough line transform on text baselines.** Apply a Canny edge detector, then accumulate Hough votes for near-horizontal lines (angles within ±10° of horizontal). The mode angle of the dominant cluster is the skew angle. Rotate the image by the negative of that angle before OCR. +- **Projection profile maximization.** For each candidate rotation angle in a sweep (e.g., -10° to +10° in 0.1° steps), compute the horizontal projection profile (sum of white pixels per row). 
Text baselines produce sharp peaks; maximize the variance of this profile across candidate angles to find true horizontal alignment. + +The projection profile method is more robust for low-resolution or lightly printed pages; the Hough approach is faster for clean scans. + +### Binarization + +Convert grayscale to binary (black text on white background): + +- **Otsu thresholding** works well for uniformly lit pages with bimodal intensity histograms. It minimizes intra-class variance and requires no tuning. +- **Sauvola local adaptive thresholding** is essential for pages with uneven illumination (e.g., curved book spines, shadow gradients). It computes a per-pixel threshold from a local window mean and standard deviation: `T(x,y) = mean * (1 + k * (std/R - 1))` where `k ≈ 0.5` and `R = 128`. Window size of 15–31 pixels at 300 DPI is typical. + +Prefer Sauvola for physical scans; prefer Otsu for digital-origin documents printed and re-scanned at a consistent exposure. + +### Denoising and Morphological Cleanup + +After binarization: +- Apply a **median filter** (3×3 or 5×5 kernel) to suppress salt-and-pepper noise without blurring character strokes. +- Apply **morphological opening** (erosion then dilation) with a 2×2 structuring element to remove isolated single-pixel noise blobs. A 1×1 element is a no-op. +- Do not apply closing (dilation then erosion) before OCR, because it merges character strokes and degrades accuracy. + +### Contrast Normalization + +Before binarization, stretch the grayscale histogram so that the 2nd percentile maps to 0 and the 98th percentile maps to 255. This compensates for faded or overexposed scans. Apply this before Sauvola to ensure the local statistics are computed on a well-conditioned input. + +--- + +## 3. Tesseract Integration + +### Engine and API Mode + +The relevant OEM (OCR Engine Mode) values are: +- `OEM_TESSERACT_ONLY` (0): legacy (pre-LSTM) engine; fast, lower accuracy. +- `OEM_LSTM_ONLY` (1): LSTM-based neural engine; best accuracy for most scripts.
+- `OEM_TESSERACT_LSTM_COMBINED` (2): runs both and combines; marginally better, significantly slower. + +Use `OEM_LSTM_ONLY` (1) as the default. Fall back to `OEM_TESSERACT_LSTM_COMBINED` only if LSTM alone produces below-threshold confidence on a page. + +### Page Segmentation Mode + +PSM selection critically affects accuracy: +- `PSM_AUTO` (3): default; suitable for full pages with mixed content. +- `PSM_SINGLE_BLOCK` (6): a single uniform block of text; use when the page is a known body-text region. +- `PSM_SINGLE_LINE` (7): use when processing a single text line extracted from a larger region. +- `PSM_SINGLE_COLUMN` (4): multi-size text in a single column; useful for narrow document columns. +- `PSM_SPARSE_TEXT` (11): page with scattered text, no assumed reading order; use for form fields or tables with isolated cells. +- `PSM_SINGLE_BLOCK_VERT_TEXT` (5): a single block of vertically aligned text; used for vertical CJK (see section 6). + +For full-page OCR, start with `PSM_AUTO`. For region-level OCR (where bounding boxes are already known), use `PSM_SINGLE_BLOCK` or `PSM_SINGLE_LINE` depending on region height. + +### `tesseract-rs` Crate Interface + +The `tesseract` crate (wrapping `leptonica` + `libtesseract`) exposes a Rust-safe interface. Key initialization: + +```rust +// `new` returns a Result, so unwrap it before configuring the instance. +let mut api = tesseract::Tesseract::new(Some("/usr/share/tessdata"), Some("eng"))?; +api.set_page_seg_mode(tesseract::PageSegMode::PsmAuto); +// An empty whitelist means all characters are allowed. +let api = api.set_variable("tessedit_char_whitelist", "")?; +``` + +Pass a pre-binarized image rather than letting Tesseract binarize internally. Tesseract's internal Otsu implementation ignores Sauvola-style adaptation, which degrades accuracy on uneven scans. Use `SetImage` with a Leptonica `PIX*` allocated from your preprocessed raster. + +### Language Packs and Confidence + +Confidence scores are available at two granularities: +- `GetMeanTextConf()`: page-level mean confidence, 0–100. +- Per-word `Confidence()` from the `ResultIterator` at `RIL_WORD` level.
+ +A page-level confidence below 60 signals that OCR failed and a fallback (different preprocessing, different PSM, or marking the page as unextractable) is needed. Per-word confidence is used to tag individual spans (section 9). + +--- + +## 4. Assisted OCR (Vector Hints) + +When vector text is partially valid (low-confidence but spatially correct), use it to guide OCR rather than discarding it. Two mechanisms: + +**`SetRectangle` per known word region.** If vector extraction produced bounding boxes for individual words, crop the raster to each word's bounding box (with a small margin, e.g., 5px), set PSM to `PSM_SINGLE_LINE`, and run Tesseract on each crop independently. This restricts the LSTM's attention to a known region and avoids segmentation errors on surrounding noise. + +**HOCR alignment.** Run full-page OCR with HOCR output, then match HOCR word boxes against vector word boxes using IoU (intersection over union). Where IoU > 0.7 and vector confidence is above threshold, prefer the vector text (it carries the correct encoding from the font). Where IoU > 0.7 but vector confidence is below threshold, prefer the OCR text. Unmatched OCR words (no corresponding vector box) are accepted as new content. + +Conflict resolution rule: when vector and OCR produce different strings for the same box, prefer OCR if vector confidence < 0.4, prefer vector if OCR word confidence < 50, and flag the span as ambiguous otherwise. + +--- + +## 5. HOCR Output and Coordinate Alignment + +Tesseract's HOCR output is an HTML document with a hierarchy of classed elements: `ocr_page`, `ocr_carea` (content area/block), `ocr_par`, `ocr_line`, `ocr_word`. Each element's `title` attribute contains a `bbox x0 y0 x1 y1` value in raster pixel coordinates (origin top-left). + +To map back to PDF coordinate space: +1. Divide pixel coordinates by the raster DPI to get inches. +2. Multiply by 72 to get PDF user units (points). +3. 
Flip the Y axis: `pdf_y = page_height_pts - (pixel_y / dpi * 72)`. + +Parse the HOCR with a SAX or DOM HTML parser. Extract `ocr_word` elements, reading `bbox`, `x_wconf` (confidence), and text content. Map each to a `Span` in the same schema used for vector-extracted content, setting the `source` and `confidence` fields accordingly. + +Group `ocr_word` spans into lines using the `ocr_line` parent, then into blocks using `ocr_carea`. This mirrors the block/line/span hierarchy produced by vector extraction. + +--- + +## 6. Multi-Language OCR + +Before invoking Tesseract with a language pack, detect the dominant script per page region: + +- Sample character glyph bitmaps and classify by Unicode block after a first-pass OCR with `osd` (orientation and script detection) mode (`PSM_OSD_ONLY`). Tesseract's OSD returns a script name string. +- Split the page into regions with differing scripts (e.g., a header in Latin, body in Arabic) and process each with the appropriate language pack. + +For **mixed-script pages**, segment regions by script first using OSD on sub-regions, then pass each region with its own `Tesseract` instance initialized to the correct language. + +**CJK vertical text** requires `PSM_SINGLE_BLOCK_VERT_TEXT` (5) and the `chi_sim_vert`, `chi_tra_vert`, `jpn_vert`, or `kor_vert` language packs. Vertical glyph metrics differ from horizontal; do not reuse a horizontal-mode session. + +--- + +## 7. JBIG2 and CCITT Encoded Scans + +Scanned PDFs predominantly use two image compression formats for bitonal (black-and-white) rasters: + +**CCITT Group 4** (T.6 fax compression): lossless, row-by-row 2D encoding. Decoding is exact; the raster recovered is pixel-identical to the original scan. No quality loss affects OCR. Most PDF rasterization backends decode CCITT natively. + +**JBIG2**: an adaptive dictionary-based bitonal compressor. Standard JBIG2 (lossless mode) is also exact. 
However, **lossy JBIG2** substitutes visually similar symbol bitmaps from a shared dictionary, so a glyph that "looks like" another is silently replaced. This can produce character substitutions that are invisible to casual visual inspection but corrupt extraction. When the PDF stream dictionary has `/Filter /JBIG2Decode` and the JBIG2 global segments contain a lossy-mode marker, log a warning and consider elevating OCR confidence thresholds or flagging output as potentially degraded. Use `jbig2dec` or equivalent for decoding. + +--- + +## 8. Performance Considerations + +OCR throughput is limited by CPU and, secondarily, by rasterization cost. + +**Caching.** Cache rasterized page images keyed by `(pdf_hash, page_index, dpi)`. If the same document is processed repeatedly (e.g., during development or re-extraction), the rasterization cost can be eliminated on repeat runs. + +**Parallelism.** OCR pages in parallel using a thread pool (`rayon` is appropriate). Each `Tesseract` instance is not `Send`; initialize one instance per thread using thread-local storage. A pool of 4–8 threads is typical; beyond that, memory pressure from holding multiple full-page rasters simultaneously may become the bottleneck. + +**GPU acceleration.** Upstream Tesseract does not provide a CUDA backend for its LSTM engine; the only GPU path it has shipped is experimental OpenCL support. Treat GPU-accelerated OCR as something that needs a custom build or a different engine, measure any speedup on the target corpus before relying on it, and expose it as an optional Cargo feature (`ocr-gpu`) so the default build carries no GPU dependency. + +**DPI/accuracy tradeoff.** For documents known to have large font sizes (e.g., presentation slides), 200 DPI is sufficient and cuts raster memory to about 44% of the 300 DPI figure, since pixel count scales with the square of DPI. For documents with mixed font sizes, use 300 DPI and accept the overhead.
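To make the DPI tradeoff concrete, the following sketch estimates the memory held by one 8-bit grayscale page raster. The function name `gray_raster_bytes` is hypothetical, and the estimate assumes one byte per pixel with no row padding; it is meant for back-of-the-envelope sizing, not as the renderer's actual allocation logic.

```rust
/// Bytes needed for an 8-bit grayscale raster of one page.
/// `width_pts` and `height_pts` are page dimensions in PDF points (1/72 inch).
/// Assumes one byte per pixel and no row padding.
pub fn gray_raster_bytes(width_pts: f64, height_pts: f64, dpi: u32) -> u64 {
    // points * dpi / 72 = pixels; round up so partial pixels are counted.
    let w = (width_pts * dpi as f64 / 72.0).ceil() as u64;
    let h = (height_pts * dpi as f64 / 72.0).ceil() as u64;
    w * h
}
```

For a US Letter page (612 × 792 pt) this gives 2550 × 3300 pixels at 300 DPI and 1700 × 2200 pixels at 200 DPI, which is useful when sizing the OCR thread pool against available RAM.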
+ +**Skip conditions.** Skip OCR entirely for: (a) documents whose `Encrypt` dictionary restricts content copying, if the restriction is enforced; (b) pages the user has explicitly marked as skip via configuration; (c) pages where the page area is below a minimum threshold (e.g., < 1 cm²), which are likely decorative or separator elements. + +--- + +## 9. Confidence and Provenance Tagging + +Every span in `pdftract`'s output model carries a `source` field. OCR-derived spans must be tagged: + +```rust +pub struct OcrProvenance { + pub source: &'static str, // "ocr" + pub engine: String, // "tesseract-5.3.1" + pub dpi: u32, // rasterization DPI used + pub word_confidence: f32, // 0.0–1.0, from Tesseract per-word Confidence() + pub page_confidence: f32, // 0.0–1.0, from GetMeanTextConf() + pub preprocessing: Vec<String>, // ["deskew", "sauvola", "median_filter"] +} +``` + +Distinguish OCR spans from vector spans at the consumer level: downstream tools (chunkers, classifiers) may apply different trust weights. Never silently merge OCR and vector text without recording which source each span came from. When assisted OCR (section 4) resolves a conflict by preferring OCR over a vector candidate, record both the original vector text and the OCR text in the span's `alternatives` field so the consumer can audit the decision. diff --git a/docs/research/text-readability-validation.md b/docs/research/text-readability-validation.md new file mode 100644 index 0000000..e2a396a --- /dev/null +++ b/docs/research/text-readability-validation.md @@ -0,0 +1,197 @@ +# Text Readability Validation + +## Overview + +Extracting bytes from a PDF font stream and producing a sequence of Unicode codepoints is necessary but not sufficient.
A PDF can encode every character correctly at the byte level while still emitting text that is semantically unreadable — because the font has no ToUnicode map, because a custom encoding overlaps with a standard encoding at the wrong offset, or because the renderer selected the wrong code path. This document defines the algorithms and data structures that `pdftract` uses to detect and remediate unreadable output before it reaches a caller. + +--- + +## 1. Failure Modes: What "Unreadable" Looks Like + +Unreadable extraction output falls into several distinct categories, each with a different root cause and remediation path. + +**Mojibake** occurs when bytes are decoded with the wrong code page. The classic form is Latin-1 interpreted as UTF-8 (or vice versa), producing sequences like `é` for `é` or `’` for `'`. These are valid Unicode codepoints, but they are wrong ones. + +**Replacement characters** (U+FFFD) appear when a decoder encounters byte sequences that are invalid in the target encoding. A high density of U+FFFD is an unambiguous signal of encoding mismatch. + +**Private Use Area codepoints** (U+E000–U+F8FF) are legitimately used in some PDF fonts to encode glyphs that have no standard Unicode assignment, but a prose span where more than a small fraction of codepoints are PUA almost certainly reflects a missing or incorrect ToUnicode map. + +**Control characters** in the range U+0000–U+001F (excluding U+0009 TAB and U+000A LF) should never appear in prose extracted from a document. Their presence indicates that glyph IDs are being emitted directly without Unicode mapping. + +**Symbol font bleed-through** happens when a font that uses Zapf Dingbats, Symbol, or a custom pi font is decoded as if it were a text font. The result is runs of symbols — ♦ ♣ ♥ ♠ — where letters should be. + +**Impossible character sequences** for the detected language include strings like `xzqbvw` in English or `aeiouaeiou` in Czech. 
Natural languages have strong constraints on consonant/vowel alternation and on which n-grams can appear adjacently. + +**Mixed-directionality fragments** without a Unicode Bidirectional Algorithm context marker produce visually disordered text when a span mixes Arabic or Hebrew runs with Latin runs and the bidi embedding levels are absent. + +**Zero-width characters** — U+200B ZERO WIDTH SPACE, U+200C/D ZWNJ/ZWJ, U+FEFF BOM used mid-stream — should be rare in extracted prose; dense runs of them indicate malformed CMap output. + +--- + +## 2. Character-Level Validity Checks + +Character-level checks are the first filter in the validation pipeline. They operate per-span in O(n) time with no external data dependencies. + +**U+FFFD density:** compute `replacement_ratio = fffd_count / total_codepoints`. Flag the span as `"garbled"` if `replacement_ratio > 0.10` and `"low"` quality if `replacement_ratio > 0.02`. + +**PUA density:** compute `pua_ratio` over U+E000–U+F8FF and Supplementary PUA (U+F0000–U+FFFFF). Flag as `"garbled"` if `pua_ratio > 0.40`. A small PUA ratio (< 0.05) may be acceptable for documents using custom ligature glyphs. + +**Control character scan:** iterate codepoints; any U+0000–U+0008, U+000B–U+001F (excluding 0x09, 0x0A) in a prose span is an immediate `"low"` flag and adds `"control_chars"` to `quality_signals`. + +**Combining character orphans:** a sequence of combining characters (Unicode category M) not preceded by a base character (category L, N, or P) indicates CMap corruption. Detect runs of three or more consecutive combining characters. + +**Anomalous Unicode block concentration:** compute the fraction of codepoints falling in Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF) or Enclosed Alphanumerics (U+2460–U+24FF). Values above 0.15 in a prose context indicate symbol font confusion. 
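The density checks above (U+FFFD, PUA, and control characters) fit in a single pass over the span. The sketch below is one possible arrangement, using the thresholds and codepoint ranges exactly as stated in the text; `CharStats`, `char_stats`, `ratio`, and `density_flag` are hypothetical names, and the string return values stand in for whatever quality type the library actually uses.

```rust
/// Per-span character counts for the density checks.
#[derive(Debug, Default, PartialEq)]
pub struct CharStats {
    pub total: usize,
    pub replacement: usize,
    pub pua: usize,
    pub control: usize,
}

/// Counts U+FFFD, Private Use Area, and disallowed control characters in one pass.
pub fn char_stats(text: &str) -> CharStats {
    let mut s = CharStats::default();
    for c in text.chars() {
        s.total += 1;
        match c {
            '\u{FFFD}' => s.replacement += 1,
            // BMP PUA plus the supplementary range quoted in the text.
            '\u{E000}'..='\u{F8FF}' | '\u{F0000}'..='\u{FFFFF}' => s.pua += 1,
            // Control characters other than TAB (U+0009) and LF (U+000A).
            '\u{0000}'..='\u{0008}' | '\u{000B}'..='\u{001F}' => s.control += 1,
            _ => {}
        }
    }
    s
}

/// Fraction of `count` in `total`, defined as 0.0 for an empty span.
pub fn ratio(count: usize, total: usize) -> f32 {
    if total == 0 { 0.0 } else { count as f32 / total as f32 }
}

/// Maps the counts to "garbled", "low", or "ok" using the thresholds in the text.
pub fn density_flag(s: &CharStats) -> &'static str {
    let rep = ratio(s.replacement, s.total);
    let pua = ratio(s.pua, s.total);
    if rep > 0.10 || pua > 0.40 {
        "garbled"
    } else if rep > 0.02 || s.control > 0 {
        "low"
    } else {
        "ok"
    }
}
```

Keeping the counting separate from the thresholding makes it straightforward to reuse the same counts for block-level aggregation.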
+ +**Unicode category distribution:** for a valid English paragraph, the dominant categories are `Ll` (lowercase letter), `Lu` (uppercase letter), `Nd` (decimal digit), `Po` (other punctuation), and `Zs` (space separator). Compute category histograms and compare against expected priors. A span where `Lo` (other letter — CJK, Arabic, etc.) exceeds 0.60 but the document language is detected as Latin-script warrants a `"medium"` flag. + +--- + +## 3. Word-Level Validity + +Word-level checks require tokenizing the span on whitespace and punctuation boundaries, then evaluating each token. + +**Bloom filter word list:** maintain a Bloom filter over a ~500,000-word corpus (one per supported language). At a 0.1% false positive rate the optimal filter needs about 14.4 bits per word, or roughly 0.9 MB per language, with 10 hash functions. The filter supports O(1) probabilistic membership queries. In Rust, the `bloomfilter` crate or a hand-rolled implementation over `xxhash` works well. Load the filter lazily per detected language. + +**Real-word ratio:** `real_word_ratio = (dictionary_hits + numeric_tokens) / total_tokens`. Require `real_word_ratio >= 0.60` for `"high"` quality. Values in [0.35, 0.60) map to `"medium"`. Below 0.35, flag `"low"`. + +**Consonant/vowel ratio:** for Latin-script languages, compute the ratio of consonant letters to vowel letters in the span. English prose clusters around 1.4–1.8. A ratio above 5.0 or below 0.3 is anomalous. This check catches both garbled encoding and accidental extraction of phoneme tables. + +**Character n-gram plausibility:** build a bigram or trigram presence set from a reference corpus (compactly encoded as a sorted array of 16-bit hashes). For each character trigram in the extracted text, check membership. If more than 20% of trigrams are absent from the reference set, add `"ngram_anomaly"` to `quality_signals`. Sequences like `fsqz`, `bxwk`, or `qzjv` consist of trigrams with near-zero frequency in English, and their presence is diagnostic. + +--- + +## 4.
Entropy-Based Detection + +Shannon entropy provides a language-agnostic, O(n) garble detector. Compute character-level entropy over a span as: + +``` +H = -Σ p(c) * log2(p(c)) +``` + +where `p(c)` is the empirical frequency of codepoint `c` in the span. + +Expected entropy ranges: +- English prose: 4.0–5.0 bits/char +- Random Unicode glyphs: 7.0–8.0 bits/char +- Repeated patterns or single-glyph runs: < 1.5 bits/char +- Base64 or hex strings: 5.0–6.0 bits/char + +Spans with `H > 6.5` are likely garbled; spans with `H < 1.5` are likely repeated/template noise. Both conditions add `"entropy_anomaly"` to `quality_signals` and reduce quality to at most `"medium"`. + +Per-block entropy scoring: divide each page block into 128-codepoint windows and compute entropy per window. A bimodal distribution within a single block (some windows normal, some high entropy) indicates interleaved readable and garbled content, which may call for span-level rather than block-level remediation. + +Entropy alone cannot distinguish high-entropy valid content (technical identifiers, URLs, code snippets) from garble. It is a necessary but not sufficient signal; always pair with word-level and n-gram checks. + +--- + +## 5. Language Model Perplexity Scoring + +A character-level n-gram language model assigns a probability to each character given its context, enabling perplexity scoring without a word boundary assumption. + +**Model choice:** a 4-gram character model trained on language-specific Common Crawl shards. Store log-probabilities in a compact trie or a flat sorted array of (n-gram hash → log-prob) pairs. A 4-gram model for English requires approximately 8–20 MB in this encoding. The `whichlang` crate provides language identification but not perplexity; build or embed a separate compact model. + +**Perplexity computation:** for a span of length N, perplexity is `PP = exp(-1/N * Σ log P(c_i | c_{i-3}..c_{i-1}))`. 
Valid English text has perplexity roughly in [5, 30] under a well-trained model. Garbled text commonly exceeds 200. + +**Threshold:** flag spans with perplexity > 100 as `"low"` quality; above 300 as `"garbled"`. Spans below 10 may indicate repeated boilerplate and are worth a separate low-entropy check. + +**Runtime tradeoff:** perplexity scoring is more expensive than entropy. Apply it only to spans that pass character-level checks but fail word-level checks — treating it as a second-pass arbiter rather than a first-line filter. + +--- + +## 6. Cross-Validation Between Extraction Paths + +When both vector text extraction and OCR output are available for the same page region (e.g., a page with embedded text on which OCR was also run as a confidence check), compare the two using normalized edit distance (Levenshtein distance divided by the length of the longer string). + +**Agreement criterion:** if `normalized_edit_distance < 0.15`, both paths agree and confidence is high regardless of individual quality signals. If `0.15 ≤ distance < 0.40`, flag for review but prefer the vector path. If `distance ≥ 0.40`, the paths disagree significantly; use OCR output as a spell-check oracle — compute per-word overlap between OCR and vector output, and prefer whichever achieves higher real-word ratio. + +This cross-validation is also useful for detecting symbol font bleed-through: OCR on a symbol font region will produce incoherent results too, which confirms the region is non-textual, whereas OCR on correctly encoded text that the vector path garbled will produce coherent text that diverges significantly from the vector output. + +--- + +## 7. Symbol Font Detection and Recovery + +Symbol fonts are the most common source of coherent-looking but semantically wrong text. Detection combines font metadata with codepoint analysis. + +**Font-level signal:** inspect the font's `FontDescriptor.Flags` bit field. 
Bit 3 (`Symbolic`) set and bit 6 (`Nonsymbolic`) clear indicates the font self-declares as symbolic. Additionally, check the font name against known symbol font names: `Symbol`, `ZapfDingbats`, `Wingdings`, `Webdings`, and variants. + +**Codepoint-level signal:** compute the fraction of output codepoints in Unicode Dingbats (U+2700–U+27BF), Miscellaneous Symbols (U+2600–U+26FF), Mathematical Operators (U+2200–U+22FF), and Box Drawing (U+2500–U+257F). A combined fraction above 0.30 in a body-text span is strongly indicative. + +**Remediation:** do not emit these spans as prose. Annotate them with `readable: false`, `quality: "garbled"`, and add `"symbol_font"` to `quality_signals`. If the caller has requested exhaustive extraction, emit the raw codepoints under a `raw_glyphs` field. Do not attempt character correction on symbol font output — the mapping is fundamentally wrong at the encoding level, not the decoding level. + +--- + +## 8. Post-Detection Remediation + +When a span fails validation, the remediation decision tree is: + +1. **Try font encoding recovery** (see `glyph-recognition-and-unicode-recovery.md`). If the font has a usable glyph outline and the issue is a missing ToUnicode map, heuristic name-based mapping or shape similarity to a reference font may recover the correct codepoints. Re-run validation on the recovered span. + +2. **Re-run OCR on the page region** if encoding recovery fails or if the span is flagged `"garbled"` and the page has raster content at sufficient DPI. OCR is slow but authoritative on the visual content. Store the OCR result under `ocr_text` alongside the vector extraction. + +3. **Emit with degraded quality metadata** if neither recovery path succeeds or is available. Set `quality: "low"` or `quality: "garbled"` and `readable: false`. Populate `quality_signals` with the list of triggered checks. This allows callers to filter, log, or surface the spans without crashing on unexpected content. + +4. 
**Character-level correction** using edit distance to the nearest dictionary word is a last resort, applicable only to short tokens (≤ 12 characters) that fail the real-word check by a small margin. Compute Levenshtein distance to candidates within distance 2 using a BK-tree over the word list. Apply correction only if a unique nearest neighbor exists at distance 1 and the corrected span passes n-gram validation. + +--- + +## 9. Span-Level Quality Metadata + +Each extracted `TextSpan` carries the following readability fields: + +```rust +pub struct TextSpan { + pub text: String, + pub quality: SpanQuality, // High, Medium, Low, Garbled + pub readable: bool, // true iff quality is High or Medium + pub quality_signals: Vec<QualitySignal>, // which checks triggered + pub confidence: f32, // 0.0–1.0 composite score +} + +pub enum SpanQuality { High, Medium, Low, Garbled } + +pub enum QualitySignal { + ReplacementChars, + PuaCodepoints, + ControlChars, + EntropyAnomaly, + NgramAnomaly, + LowRealWordRatio, + SymbolFont, + CvRatioAnomaly, + CombiningOrphan, +} +``` + +`quality: "high"` requires: `real_word_ratio ≥ 0.60`, `replacement_ratio < 0.02`, `pua_ratio < 0.05`, entropy in [3.5, 6.5], no `quality_signals` triggered. + +`quality: "medium"` requires: at most two non-critical signals triggered, `real_word_ratio ≥ 0.35`, no garble-level entropy. + +`quality: "low"` means the span may contain recoverable text but significant anomalies are present. + +`quality: "garbled"` means the span almost certainly does not contain readable prose in its current form. + +--- + +## 10. Block-Level Readability Score + +Aggregate span quality into a block-level score using a weighted mean: + +``` +block_score = Σ (span_confidence * span_char_count) / Σ span_char_count +``` + +Map `SpanQuality` to a base confidence: `High → 1.0`, `Medium → 0.65`, `Low → 0.30`, `Garbled → 0.0`. Adjust by the `confidence` field if finer-grained scoring is available from perplexity.
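The weighted mean above can be sketched as follows. This is a minimal illustration that uses only the base-confidence mapping and ignores any perplexity-based adjustment; `base_confidence` and `block_score` are hypothetical helper names, and `SpanQuality` is redeclared so the block compiles on its own.

```rust
/// Quality tiers from section 9, redeclared so this sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SpanQuality {
    High,
    Medium,
    Low,
    Garbled,
}

/// Base confidence per tier, using the mapping given in the text.
pub fn base_confidence(q: SpanQuality) -> f32 {
    match q {
        SpanQuality::High => 1.0,
        SpanQuality::Medium => 0.65,
        SpanQuality::Low => 0.30,
        SpanQuality::Garbled => 0.0,
    }
}

/// Character-weighted mean of span confidences for one block.
/// Each entry is (quality, character count). Returns None when the block
/// has no characters, so callers never divide by zero.
pub fn block_score(spans: &[(SpanQuality, usize)]) -> Option<f32> {
    let total: usize = spans.iter().map(|&(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    let weighted: f32 = spans
        .iter()
        .map(|&(q, n)| base_confidence(q) * n as f32)
        .sum();
    Some(weighted / total as f32)
}
```

Weighting by character count means a short garbled fragment barely moves the score of a long clean block, while a long garbled run dominates it.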
+ +**Page-level readability score:** compute the character-weighted mean of block scores across the page. A score below 0.50 on a nominally vector page should trigger automatic OCR fallback for that page. Expose both block and page scores in the output: + +```rust +pub struct PageReadability { + pub score: f32, // 0.0–1.0 + pub ocr_recommended: bool, // score < threshold + pub block_scores: Vec<(BlockId, f32)>, +} +``` + +The threshold for `ocr_recommended` is configurable, defaulting to 0.50. Callers building pipelines that prioritize accuracy over speed can raise this to 0.70, so that more pages are sent to OCR; callers that trust vector extraction for well-formed documents can lower it to 0.35 or disable the check entirely. + +The page-level score also serves as a signal for the block-level zone labeling pipeline (see `document-classification-and-zone-labeling.md`): a page with a score below 0.30 is a candidate for whole-page OCR rather than incremental span recovery. diff --git a/docs/research/xmp-and-document-metadata.md b/docs/research/xmp-and-document-metadata.md new file mode 100644 index 0000000..0958b6c --- /dev/null +++ b/docs/research/xmp-and-document-metadata.md @@ -0,0 +1,330 @@ +# XMP and Document Metadata in PDF + +**Project:** pdftract — Rust PDF text extraction library +**Scope:** Structured metadata extraction from PDF files, covering the legacy `/Info` dictionary and XMP metadata streams + +--- + +## 1. The /Info Dictionary + +The document information dictionary is an optional indirect object referenced by the `Info` key in the PDF file trailer (`trailer << /Root ... /Info N G R >>`). It predates XMP and was the sole metadata mechanism through PDF 1.3; metadata streams were introduced in PDF 1.4. 
+ +### Standard Keys + +| Key | Type | Description | +|-----|------|-------------| +| `Title` | text string | Human-readable document title | +| `Author` | text string | Name of the person who created the document | +| `Subject` | text string | Subject matter summary | +| `Keywords` | text string | Space- or comma-delimited keyword list | +| `Creator` | text string | The authoring application (e.g., "Microsoft Word 2019") | +| `Producer` | text string | The PDF-writing library that generated the file (e.g., "Acrobat Distiller 23.0") | +| `CreationDate` | date string | When the document was first created | +| `ModDate` | date string | When the document was last modified | +| `Trapped` | name | `/True`, `/False`, or `/Unknown` — trapping status for print production | + +### Date Format + +PDF date strings use the format `D:YYYYMMDDHHmmSSOHH'mm'` where: + +- `D:` is a required literal prefix +- `YYYY` through `SS` are the year, month, day, hour, minute, and second (all numeric, left-zero-padded) +- `O` is the timezone offset sign: `+`, `-`, or `Z` (for UTC) +- `HH'mm'` are the timezone hour and minute offsets, separated by a literal apostrophe, with a trailing apostrophe + +All components after `YYYY` are optional but must be omitted from the right. For example, `D:20230415143022+05'30'` is valid; `D:202304` is also valid. When parsing, treat missing components as their minimum values (month 01, day 01, time 00:00:00, timezone UTC). + +### String Encoding + +`/Info` text string values follow two encoding paths depending on a leading BOM: + +- **PDFDocEncoding**: The default single-byte encoding when no BOM is present. It is a superset of Latin-1 with custom assignments in the `0x18`–`0x1F` and `0x80`–`0x9F` ranges. Rust extraction must implement the PDFDocEncoding-to-Unicode mapping table (PDF spec Annex D). +- **UTF-16BE with BOM**: If the first two bytes of the string object are `0xFE 0xFF`, the entire string (including BOM) is UTF-16BE. 
Rust's `std::str::from_utf8` cannot handle this; use `encoding_rs` or a manual UTF-16BE decoder. + +### Deprecation in PDF 2.0 + +PDF 2.0 (ISO 32000-2:2020) formally deprecates the `/Info` dictionary. Processors conforming to PDF 2.0 must treat XMP as the authoritative source. `/Info` may still appear in PDF 2.0 files for backward compatibility but shall not be used as the definitive source when XMP is present. + +--- + +## 2. XMP Overview + +XMP (Extensible Metadata Platform) is defined by ISO 16684-1 and ISO 16684-2. It encodes metadata as an RDF/XML document embedded in a PDF metadata stream. + +### Embedding in PDF + +The document-level XMP stream is attached to the document catalog (`/Type /Catalog`) via the `/Metadata` key: + +``` +1 0 obj +<< /Type /Catalog /Pages 2 0 R /Metadata 10 0 R >> +endobj + +10 0 obj +<< /Type /Metadata /Subtype /XML /Length ... >> +stream +<?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?> +<x:xmpmeta xmlns:x="adobe:ns:meta/"> + <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> + ... + </rdf:RDF> +</x:xmpmeta> +<?xpacket end="w"?> +endstream +endobj +``` + +The `<?xpacket begin="..." id="..."?>` processing instruction marks the start of an XMP packet. The `begin` attribute value is the Unicode BOM character (`U+FEFF`, encoded as UTF-8: `0xEF 0xBB 0xBF`), serving as an encoding hint. The `id` value `W5M0MpCehiHzreSzNTczkc9d` is a fixed magic string defined by the XMP specification. + +The closing `<?xpacket end="..."?>` instruction uses `end="r"` (read-only, fixed-size packet) or `end="w"` (writable, in-place update permitted). + +--- + +## 3. 
XMP Namespaces Relevant to PDF + +XMP organizes properties into namespaces identified by URI, conventionally bound to prefixes: + +### Dublin Core (`dc:` — `http://purl.org/dc/elements/1.1/`) +- `dc:title` — document title (typically an `rdf:Alt` with language tags) +- `dc:creator` — author(s) as `rdf:Seq` of strings +- `dc:description` — abstract or summary (often `rdf:Alt`) +- `dc:subject` — topics as `rdf:Bag` of strings +- `dc:rights` — copyright statement (often `rdf:Alt`) +- `dc:date` — `rdf:Seq` of ISO 8601 date strings (last modification usually most relevant) +- `dc:format` — MIME type, typically `application/pdf` +- `dc:language` — `rdf:Bag` of BCP 47 language tags +- `dc:identifier` — document identifier + +### XMP Basic (`xmp:` — `http://ns.adobe.com/xap/1.0/`) +- `xmp:CreateDate` — ISO 8601 creation timestamp +- `xmp:ModifyDate` — ISO 8601 last modification timestamp +- `xmp:MetadataDate` — when the XMP metadata itself was last written +- `xmp:CreatorTool` — authoring application string +- `xmp:Label` — user-assigned label string +- `xmp:Rating` — numeric rating (integer) + +### XMP Media Management (`xmpMM:` — `http://ns.adobe.com/xap/1.0/mm/`) +- `xmpMM:DocumentID` — persistent unique identifier assigned at document creation; does not change across saves +- `xmpMM:InstanceID` — unique identifier for this specific rendition; changes on every save +- `xmpMM:History` — `rdf:Seq` of `stEvt:` resource events recording revision history + +### PDF-Specific (`pdf:` — `http://ns.adobe.com/pdf/1.3/`) +- `pdf:Keywords` — keyword string (mirrors `/Info` Keywords) +- `pdf:PDFVersion` — string such as `"1.7"` or `"2.0"` +- `pdf:Producer` — PDF-writing library string + +### Photoshop (`photoshop:` — `http://ns.adobe.com/photoshop/1.0/`) +- `photoshop:Instructions` — special handling instructions +- `photoshop:Source` — originating organization +- `photoshop:City`, `photoshop:Country` — geographic metadata (common in editorial/press workflows) + +--- + +## 4. 
RDF/XML Parsing + +XMP uses a constrained subset of RDF/XML (W3C RDF 1.1 XML Syntax). + +### Core Structure + +```xml +<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> + <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/"> + <xmp:CreatorTool>Adobe InDesign</xmp:CreatorTool> + <dc:title> + <rdf:Alt> + <rdf:li xml:lang="x-default">My Document</rdf:li> + </rdf:Alt> + </dc:title> + </rdf:Description> +</rdf:RDF> +``` + +The `rdf:about` attribute on `rdf:Description` is typically the empty string for document-level metadata, identifying the document itself. + +### Collection Types + +| RDF Type | Semantics | Use in XMP | +|----------|-----------|------------| +| `rdf:Seq` | Ordered list | Author list, date list, history | +| `rdf:Bag` | Unordered set | Subject keywords, language list | +| `rdf:Alt` | Alternatives | Localized strings (one value per language) | + +Items within these collections are `rdf:li` elements. + +### Localized Strings + +`rdf:Alt` containers carry `xml:lang` attributes on each `rdf:li`. The special tag `x-default` marks the preferred default. When extracting a title or description for a non-localized use case, select the `x-default` item; if absent, use the first item. + +### Attribute vs. Element Form + +Simple string properties may appear as XML attributes on `rdf:Description` rather than child elements: + +```xml +<rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/" xmp:CreatorTool="Adobe InDesign"/> +``` + +Both forms are semantically equivalent. A conforming parser must handle both. + +### Inline Objects (`rdf:parseType="Resource"`) + +Structured sub-properties can be inlined without a wrapper element: + +```xml +<xmpMM:History> + <rdf:Seq> + <rdf:li rdf:parseType="Resource"> + <stEvt:action>saved</stEvt:action> + <stEvt:instanceID>xmp.iid:abc123</stEvt:instanceID> + </rdf:li> + </rdf:Seq> +</xmpMM:History> +``` + +--- + +## 5. Conflict Resolution Between /Info and XMP + +When `/Info` and XMP both exist and disagree, the resolution rules are: + +1. **PDF 2.0 mandates XMP as authoritative.** When the PDF version is 2.0, discard `/Info` values in favor of XMP with no ambiguity. +2. **For pre-2.0 PDFs, prefer XMP when present.** Tools that update metadata often write only to XMP (Adobe Acrobat, LibreOffice), leaving `/Info` stale. XMP is more likely to be current. +3. **Fall back to `/Info` when an XMP field is absent.** Not all producers write all XMP namespaces. +4. 
**Log discrepancies as structured warnings.** Expose a list of conflict records (field name, `/Info` value, XMP value) in the extraction result so callers can decide how to handle them. + +**Common divergence pattern:** Microsoft Word exports PDFs with synchronized `/Info` and XMP initially. If Acrobat subsequently edits the XMP (adding subject keywords, changing title), `/Info` remains at the Word-exported values while XMP reflects the Acrobat edits. The `xmp:MetadataDate` field can help determine which was written later, but it is not always present. + +--- + +## 6. Page-Level and Object-Level Metadata + +XMP streams are not limited to the document catalog. Most stream and dictionary objects may carry a `/Metadata` key: + +- **Page objects** (`/Type /Page`): carry provenance for that specific page, important when pages from different source documents are merged. The page-level `xmpMM:DocumentID` will differ from the document-level one. +- **Image XObjects** (`/Subtype /Image`): carry rights, author, and capture metadata for individual embedded images. +- **Form XObjects** and other content streams: less common but permitted. + +When extracting, enumerate all page objects and XObjects for `/Metadata` keys. Collect page-level XMP into a per-page metadata array in the output, recording the page index and any differing `DocumentID` or `InstanceID` values. This supports provenance tracking in document assembly workflows. + +--- + +## 7. Encrypted Metadata + +The `/Encrypt` dictionary controls whether the `/Metadata` stream participates in encryption: + +- **`EncryptMetadata false`**: The metadata stream is stored in plaintext regardless of the file password. This is explicitly permitted so that document management systems can index XMP without possessing the decryption key. Detect this by checking the `EncryptMetadata` boolean in the `/Encrypt` dictionary (default is `true` if the key is absent). 
+- **`EncryptMetadata true` (default)**: The metadata stream is encrypted with the same key derivation as all other streams. XMP is inaccessible without the user or owner password. + +In pdftract, check `EncryptMetadata` before attempting to parse the `/Metadata` stream on encrypted files. If `false`, parse it unconditionally. If `true` and no decryption key is available, record the metadata as unavailable rather than emitting a parse error. + +--- + +## 8. XMP Packets and In-Place Update + +The xpacket wrapper enables XMP to be updated in a PDF file without rewriting the entire file. The packet is padded with ASCII spaces between the closing `</x:xmpmeta>` tag and the `<?xpacket end="w"?>` instruction: + +``` +</x:xmpmeta> + [hundreds of spaces] +<?xpacket end="w"?> +``` + +Implications for extraction: + +- **Padded XMP is not malformed.** Strip trailing whitespace before passing to an XML parser if the parser does not tolerate trailing content after the document element. +- **`end="r"`** signals a read-only packet (fixed-size, not intended for in-place rewrite). **`end="w"`** signals a writable packet. pdftract is read-only, so this distinction matters only for completeness. +- When the metadata stream length differs significantly from the XML content length, the excess is padding. Do not treat this as a parse error. + +--- + +## 9. Practical Extraction Output + +The normalized metadata structure pdftract should expose: + +```rust +pub struct DocumentMetadata { + pub title: Option<String>, + pub authors: Vec<String>, + pub subject: Option<String>, + pub keywords: Vec<String>, + pub creator_tool: Option<String>, // authoring application + pub producer: Option<String>, // PDF-writing library + pub creation_date: Option<DateTime<Utc>>, + pub modification_date: Option<DateTime<Utc>>, + pub metadata_date: Option<DateTime<Utc>>, + pub language: Option<String>, // BCP 47 + pub document_id: Option<String>, // xmpMM:DocumentID + pub instance_id: Option<String>, // xmpMM:InstanceID + pub page_count: u32, + pub pdf_version: Option<String>, // e.g. "1.7" + pub trapped: Option<bool>, + pub raw_xmp: Option<String>, // full XMP XML for callers needing fidelity + pub metadata_conflicts: Vec<MetadataConflict>, +} + +pub struct MetadataConflict { + pub field: &'static str, + pub info_value: String, + pub xmp_value: String, +} +``` + +**Sourcing each field:** + +| Field | Primary | Fallback | +|-------|---------|----------| +| `title` | `dc:title` (x-default) | `/Info` Title | +| `authors` | `dc:creator` (Seq) | `/Info` Author (split on `;`) | +| `subject` | `dc:description` (x-default) | `/Info` Subject | +| `keywords` | `pdf:Keywords` or `dc:subject` (Bag) | `/Info` Keywords (split on `,`/`;`/space) | +| `creator_tool` | `xmp:CreatorTool` | `/Info` Creator | +| `producer` | `pdf:Producer` | `/Info` Producer | +| `creation_date` | `xmp:CreateDate` | `/Info` CreationDate | +| `modification_date` | `xmp:ModifyDate` or last `dc:date` | `/Info` ModDate | +| `language` | `dc:language` (first item) | — | +| `document_id` | `xmpMM:DocumentID` | — | +| `instance_id` | `xmpMM:InstanceID` | — | +| `pdf_version` | `pdf:PDFVersion` | Header `%PDF-x.y` | +| `trapped` | `/Info` Trapped (name) | — | + +Always expose `raw_xmp` so callers with domain-specific namespaces (e.g., IPTC, PRISM, MusicXML-adjacent) can parse the full packet themselves without re-reading the file. + +--- + +## 10. Thumbnail and Preview Images + +### XMP Thumbnails + +The `xmp:Thumbnails` property (namespace `http://ns.adobe.com/xap/1.0/`) is an `rdf:Alt` of thumbnail structures. Each item uses the `xmpGImg:` namespace (`http://ns.adobe.com/xap/1.0/g/img/`): + +- `xmpGImg:width`, `xmpGImg:height` — pixel dimensions +- `xmpGImg:format` — typically `"JPEG"` +- `xmpGImg:image` — base64-encoded image data + +Decoding these requires base64 decoding followed by interpretation as the declared format (usually JPEG). 
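The decoding step for `xmpGImg:image` can be sketched without external crates. The following is a minimal illustration under stated assumptions, not the pdftract API: `decode_base64` and `looks_like_jpeg` are hypothetical helper names, the decoder handles only the standard base64 alphabet, and it skips ASCII whitespace on the assumption that a serializer may wrap the encoded data across lines. A production build would more likely depend on an established base64 crate.

```rust
/// Decode standard-alphabet base64, skipping ASCII whitespace.
/// Returns `None` on a character outside the alphabet or on data
/// that appears after `=` padding. (Hypothetical helper.)
pub fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() / 4 * 3);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of pending bits held in `acc`
    let mut seen_pad = false;

    for b in input.bytes() {
        if b.is_ascii_whitespace() {
            continue;
        }
        if b == b'=' {
            seen_pad = true;
            continue;
        }
        if seen_pad {
            return None; // payload characters after padding
        }
        let v = match b {
            b'A'..=b'Z' => b - b'A',
            b'a'..=b'z' => b - b'a' + 26,
            b'0'..=b'9' => b - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            _ => return None,
        };
        acc = (acc << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1u32 << bits) - 1; // keep only the leftover bits
        }
    }
    Some(out)
}

/// JPEG data begins with the SOI marker 0xFF 0xD8.
pub fn looks_like_jpeg(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFF, 0xD8])
}
```

Checking the leading marker before trusting the declared `xmpGImg:format` value is a cheap guard against thumbnails whose declared format does not match the payload.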
+ +### /Thumb on Page Dictionaries + +Page objects may carry a `/Thumb` entry pointing to an image XObject (a stream with `/Subtype /Image`) that represents a low-resolution preview of that page. This is independent of XMP. + +### Extraction Recommendation + +Thumbnail data is large (base64 JPEG) and rarely needed by text-extraction callers. The recommended approach is to exclude thumbnail bytes from the default `DocumentMetadata` output and expose thumbnail extraction as an opt-in API: + +```rust +pub fn extract_thumbnail(doc: &PdfDocument, page_index: u32) -> Option<Vec<u8>>; +pub fn extract_document_thumbnail(doc: &PdfDocument) -> Option<Vec<u8>>; +``` + +This prevents unnecessary allocations when the caller only needs text or structured metadata. The presence of a thumbnail can still be signaled via a boolean flag in `DocumentMetadata` without materializing the bytes. + +--- + +## References + +- ISO 32000-2:2020 — PDF 2.0 specification (§ 14.3 Metadata, § 7.9.2 String Object Encoding) +- ISO 16684-1:2019 — XMP specification, Part 1: Data model, serialization and core properties +- ISO 16684-2:2014 — XMP specification, Part 2: Description of core schemas +- W3C RDF 1.1 XML Syntax — `https://www.w3.org/TR/rdf-syntax-grammar/` +- PDF spec Annex D — PDFDocEncoding character set table