diff --git a/docs/research/digital-signatures-and-certification.md b/docs/research/digital-signatures-and-certification.md
new file mode 100644
index 0000000..6d05a53
--- /dev/null
+++ b/docs/research/digital-signatures-and-certification.md
@@ -0,0 +1,120 @@
+# Digital Signatures, Certification, and Document Integrity in PDFs
+
+**Project:** pdftract, a Rust PDF text extraction library
+**Scope:** Signature field types, cryptographic metadata extraction, incremental update semantics, and integrity context for extraction confidence
+
+---
+
+## 1. Signature Field Types and Permitted Modifications
+
+PDF defines three kinds of signature, each with different implications for how much of a document's content can be treated as the original signed payload.
+
+An **approval signature** means a named entity approved the document at signing time. It does not restrict subsequent modifications; later editors may append content via incremental updates, and multiple approval signatures may appear in one document, each covering the cumulative byte range up to the moment that signature was applied.
+
+A **certification signature** (DocMDP) is always the first signature applied. It is registered at the catalog level through a `/Perms` dictionary with a `/DocMDP` reference, and its `/Reference` array contains a signature reference dictionary with `/TransformMethod /DocMDP`, whose `/TransformParams` dictionary carries the `/P` entry encoding the permissions level. Any modification exceeding the permitted level technically invalidates the certification. pdftract extracts the certification signer name and permissions level and annotates extracted text regions accordingly.
+
+A **usage rights signature** (UR3) enables otherwise-restricted reader features such as saving filled form data. UR3 signatures yield signer identity and certificate metadata for extraction but do not constrain which text is readable.
+
+---
+
+## 2. 
The /Sig Dictionary and Extractable Fields
+
+A signature field's value object is a signature dictionary (`/Type /Sig`). Its entries are the primary source of structured metadata pdftract surfaces.
+
+**`ByteRange`** is an array of four integers `[offset1, length1, offset2, length2]` defining the two byte spans covered by the cryptographic hash. Everything in the signed revision is hashed except the `Contents` value, which sits in the gap between those two spans. pdftract uses ByteRange to distinguish signed bytes from post-signature incremental additions.
+
+**`Contents`** holds the cryptographic payload as a hexadecimal string wrapping a DER blob, typically a PKCS#7 (CMS) SignedData structure. pdftract parses this to retrieve the embedded signer certificate, the signing time from authenticated attributes, and any embedded timestamp token. It yields no extractable prose but is the source of all certificate metadata.
+
+**`SubFilter`** identifies the format: `adbe.pkcs7.detached` is the most common, embedding a full SignedData with certificate and optional RFC 3161 timestamp; `adbe.pkcs7.sha1` is a legacy SHA-1 variant; `ETSI.CAdES.detached` is used in European qualified signature workflows with stricter attribute requirements; `ETSI.RFC3161` contains a bare timestamp token rather than a signer signature, asserting document existence at a moment in time.
+
+The signature dictionary may also carry four text strings pdftract extracts directly: **`Name`** (signer name as entered in the UI), **`M`** (signing time in PDF date format as claimed by the signing application), **`Reason`** (free-text explanation), and **`Location`** (signer's reported location). These are caller-supplied and unverified; pdftract marks them as `claimed` in the output schema.
+
+---
+
+## 3. Signature Appearance Streams
+
+A signature field widget annotation may have an `/AP` dictionary whose normal appearance (`/N`) is a Form XObject: the visual stamp rendered on the page, typically showing the signer's name, date, and reason text. 
+
+From pdftract's perspective, this appearance stream is a standard content stream parsed identically to any other Form XObject. Text-showing operators (`Tj`, `TJ`, `'`, `"`) show strings in whatever font the XObject's `Resources` dictionary declares. pdftract resolves character codes through font encoding and ToUnicode maps exactly as it does for page content. The extracted appearance text is often more composed than the raw dictionary fields, because signing applications format date and reason into readable sentences, and pdftract includes it alongside the structured metadata in the signature record.
+
+---
+
+## 4. DocMDP Permissions and Extraction Confidence
+
+The `/P` integer in the DocMDP transform parameters takes one of three values, each with a specific meaning for post-certification modifications: **1** permits no changes; **2** permits filling in forms, instantiating page templates, and signing; **3** additionally permits annotation creation, deletion, and modification.
+
+These levels have direct implications for extraction confidence. Content within the original certified byte range was hash-covered by the certification. Content from incremental updates may or may not be within the scope the certifier permitted. pdftract surfaces the DocMDP level so consuming applications can decide how much trust to place in each extracted text region.
+
+---
+
+## 5. Incremental Updates After Signing
+
+When a user fills a form or adds an approval signature after an earlier signature, the PDF receives an incremental update: an additional body appended after the previous `%%EOF` marker, with its own cross-reference data and new or modified objects. pdftract must process all incremental updates to extract complete document content, chaining cross-reference sections from the latest trailer `/Prev` offset back to the original, building an object table where each number resolves to the most recently written version. 
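To make the relationship between appended revisions and signature coverage concrete, here is a minimal sketch that classifies a file offset against one signature's `/ByteRange`. The names `ByteRange`, `Coverage`, and `classify` are hypothetical illustrations, not pdftract's actual API, and the sketch assumes a well-formed range whose first span starts at offset 0.

```rust
/// A signature's `/ByteRange`: `[offset1, length1, offset2, length2]`.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteRange(pub [u64; 4]);

/// Where a file offset sits relative to one signature's coverage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Coverage {
    /// Inside one of the two hashed spans.
    Signed,
    /// Not hashed, but before the end of the second span. In a
    /// well-formed file this is the `/Contents` value itself.
    ContentsGap,
    /// At or past the end of the second span: appended by a later
    /// incremental update.
    AppendedLater,
}

impl ByteRange {
    /// End offset (exclusive) of the second hashed span.
    pub fn signed_end(&self) -> u64 {
        self.0[2].saturating_add(self.0[3])
    }

    /// Classify a byte offset against this signature's coverage.
    pub fn classify(&self, offset: u64) -> Coverage {
        let [o1, l1, o2, l2] = self.0;
        let in_first = offset >= o1 && offset < o1.saturating_add(l1);
        let in_second = offset >= o2 && offset < o2.saturating_add(l2);
        if in_first || in_second {
            Coverage::Signed
        } else if offset >= self.signed_end() {
            Coverage::AppendedLater
        } else {
            Coverage::ContentsGap
        }
    }
}
```

Objects whose file offsets classify as `AppendedLater` for every signature in the document are the ones that carry no cryptographic cover at all.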
+
+Any byte that falls beyond the end of a given signature's second ByteRange span was added after that signature was computed. pdftract records for each text region which signatures' byte ranges cover it: regions covered by at least one signature carry an associated integrity note; regions added after all signatures lack any cryptographic cover.
+
+---
+
+## 6. LTV Signatures and the DSS Dictionary
+
+Long Term Validation (LTV) embeds certificate revocation data into the PDF via the Document Security Store (`/DSS` at the catalog level). The DSS dictionary contains arrays of binary DER-encoded blobs: **`OCSPs`** (OCSP responses), **`CRLs`** (Certificate Revocation Lists), and **`Certs`** (chain certificates). A `/VRI` sub-dictionary maps signature hashes to their associated validation data.
+
+None of this is prose text. pdftract detects the presence of a populated DSS dictionary and reports an `ltv_present` boolean plus counts of embedded OCSP responses and CRLs. The presence of LTV data indicates that revocation information was gathered at or after signing. Presence alone does not show that the reported status was good, but it is meaningful context for how much confidence to attach to the extracted content.
+
+---
+
+## 7. Timestamp Tokens and Reliable Signing Time
+
+An **embedded timestamp** appears as an unsigned attribute (`id-aa-signatureTimeStampToken`) within the PKCS#7 SignedData in the signature's Contents field. pdftract parses the `unsignedAttrs` of the first SignerInfo and, if present, extracts the TSA's `genTime`, a GeneralizedTime value that is attested by the TSA's own signature.
+
+A **document timestamp** uses `SubFilter ETSI.RFC3161` and contains a bare RFC 3161 TimeStampToken as the Contents blob. These are commonly added by archival workflows to establish a trusted time anchor.
+
+The TSA-asserted `genTime` is significantly more reliable than the `/M` entry in the signature dictionary, which is set by the signing application and can be misconfigured or forged. 
pdftract's hierarchy for the `signing_time` output field is: RFC 3161 `genTime` if present, annotated `timestamp_token`; otherwise `/M`, annotated `claimed`.
+
+---
+
+## 8. Certificate Information Extraction
+
+The signer certificate embedded in the PKCS#7 Contents field is a DER-encoded X.509 certificate. pdftract parses the TBSCertificate to extract the following text fields: **Subject DN** (including `commonName`, `organizationName`, `organizationalUnitName`, `countryName`, `emailAddress`); **Issuer DN** (same RDN structure, identifying the Certificate Authority); **Serial number** (rendered as colon-separated hex); and the **validity period** (`notBefore` / `notAfter`). If the resolved `signing_time` falls outside the certificate's validity period, pdftract flags the record with `certificate_validity_issue: true`. Chain certificates from the PKCS#7 `certificates` field are parsed for subject and issuer DNs but not deeply surfaced; pdftract records chain depth and root issuer name.
+
+---
+
+## 9. Signature Field Widget Annotations
+
+Signature fields manifest on the page as widget annotations with `/Rect` (bounding rectangle in page user space), `/P` (page reference), and `/AP` (appearance dictionary). The widget `/F` flags field includes bit 2 (`Hidden`) and bit 6 (`NoView`), counting bit positions from 1 as the PDF specification does; if either is set, the signature has no visible representation and pdftract records `visible: false`. A non-visible signature can still be cryptographically valid but contributes no appearance text for extraction. The bounding rectangle is recorded as `[x1, y1, x2, y2]` so consuming applications can associate signature metadata with a specific page region.
+
+---
+
+## 10. Output Schema for Signature Metadata
+
+pdftract emits signature metadata as a `signatures` array at the top level of its JSON output, parallel to `pages` and `metadata`. 
Each entry is a `SignatureRecord` with the following fields:
+
+| Field | Type | Source |
+|-------|------|--------|
+| `field_name` | string | Field `/T` |
+| `signature_type` | enum | SubFilter / transform method |
+| `signer_name` | string | `/Name` (claimed) |
+| `signer_common_name` | string | Certificate CN (cryptographic) |
+| `signer_organization` | string | Certificate O (cryptographic) |
+| `issuer_dn` | string | Certificate issuer RFC 4514 string |
+| `serial_number` | string | Hex with colon separators |
+| `cert_not_before` | RFC 3339 | Certificate notBefore |
+| `cert_not_after` | RFC 3339 | Certificate notAfter |
+| `signing_time` | RFC 3339 | TSA genTime or `/M` |
+| `signing_time_source` | enum | `timestamp_token` or `claimed` |
+| `reason` | string\|null | `/Reason` (claimed) |
+| `location` | string\|null | `/Location` (claimed) |
+| `mdp_permissions` | int\|null | DocMDP `/P`; null if not certification |
+| `byte_range` | [u64; 4] | `/ByteRange` |
+| `ltv_present` | bool | DSS dictionary presence |
+| `ocsp_count` | integer | DSS OCSPs array length |
+| `crl_count` | integer | DSS CRLs array length |
+| `appearance_text` | string\|null | Extracted from `/AP/N` XObject |
+| `visible` | bool | Widget annotation flags |
+| `page` | integer | 1-based page number |
+| `rect` | [f32; 4] | `[x1, y1, x2, y2]` user space |
+| `certificate_validity_issue` | bool | Computed from signing_time vs. cert window |
+
+The `signing_time_source` field is essential for calibrating trust in extracted content. A `timestamp_token` source means a third-party TSA has cryptographically bound the document hash to a specific moment; a `claimed` source means the signing application reported the time. For extraction confidence, regions whose only covering signature has `signing_time_source: claimed` and no LTV data carry softer provenance than regions covered by a fully LTV-enabled, TSA-timestamped certification. 
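The signing-time hierarchy and the validity check behind `certificate_validity_issue` can be sketched as follows. This is a minimal illustration rather than pdftract's implementation: the function names are hypothetical, and it assumes all times have already been rendered as RFC 3339 strings in UTC with identical formatting, so that plain string comparison orders them correctly.

```rust
/// Provenance of the reported signing time, mirroring the
/// `signing_time_source` values in the schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SigningTimeSource {
    TimestampToken,
    Claimed,
}

/// Prefer the TSA `genTime`; fall back to the claimed `/M` value.
/// Returns `None` when neither is available.
pub fn resolve_signing_time(
    tsa_gen_time: Option<&str>,
    claimed_m: Option<&str>,
) -> Option<(String, SigningTimeSource)> {
    tsa_gen_time
        .map(|t| (t.to_string(), SigningTimeSource::TimestampToken))
        .or_else(|| claimed_m.map(|t| (t.to_string(), SigningTimeSource::Claimed)))
}

/// True when the signing time lies outside `[not_before, not_after]`.
/// Assumption: all three are UTC RFC 3339 strings of the same shape
/// (for example `2024-03-01T10:00:00Z`), which sort lexicographically.
pub fn validity_issue(signing_time: &str, not_before: &str, not_after: &str) -> bool {
    signing_time < not_before || signing_time > not_after
}
```

A real implementation would more likely compare parsed timestamps than strings; the string form is used here only to keep the sketch dependency-free.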
+
+The `byte_range` field enables the consuming application to cross-reference each signature's coverage against the page-level text regions in pdftract's `pages` output, determining with byte-level precision which extracted strings were present when each signature was applied and which were added after the fact.
diff --git a/docs/research/medical-and-scientific-pdf-patterns.md b/docs/research/medical-and-scientific-pdf-patterns.md
new file mode 100644
index 0000000..2a6c21f
--- /dev/null
+++ b/docs/research/medical-and-scientific-pdf-patterns.md
@@ -0,0 +1,103 @@
+# Medical and Scientific PDF Extraction Patterns
+
+## Overview
+
+Medical and scientific PDFs represent some of the most structurally demanding documents that a text extraction engine will encounter. They combine dense tabular data, specialized notation, hierarchical section numbering, regulatory formatting conventions, and typographic symbols that are frequently mangled by naive extraction approaches. This document describes the patterns pdftract must handle correctly to produce reliable, readable output from clinical, pharmaceutical, and academic scientific documents.
+
+---
+
+## 1. Clinical Trial Reports (ICH E3 Format)
+
+Clinical study reports following the ICH E3 guideline present a layered section hierarchy: numbered top-level sections (1. Title Page, 2. Synopsis, 3. Table of Contents, through 16. Appendices) with multi-level subsections such as 9.4.1.1 and deeper. pdftract must preserve this numbering scheme exactly; dropped or merged numbers corrupt the logical outline that downstream systems rely on for navigation.
+
+The protocol synopsis table at the front of an ICH E3 report is a structured two-column layout: parameter name on the left, value on the right. Rows cover study title, protocol number, phase, objectives, investigational product, dose regimen, patient population, and statistical design. 
Because these tables are sometimes rendered as grid-lined tables and sometimes as borderless aligned text, pdftract must detect both forms and emit them as coherent key-value pairs rather than interleaved character streams.
+
+Patient flow is typically represented as a CONSORT diagram: a flowchart showing enrolled, randomized, allocated, lost to follow-up, and analyzed counts. The diagram itself is a vector graphic or rasterized image; when it is rasterized, its internal numbers are not extractable from the image data. CONSORT diagrams are usually accompanied by a figure caption (e.g., "Figure 2. Patient disposition") that pdftract must pair with the figure block. The caption text, including the embedded counts if they appear in text form beneath the figure, must be extracted and associated with the figure reference.
+
+Adverse event tables are dense, multi-column structures with system organ class groupings, preferred terms, and per-arm count columns. Column headers often span multiple rows, and the system organ class rows use bold or indented formatting to distinguish them from the preferred term rows underneath. pdftract must preserve row-level indentation signals to allow downstream consumers to reconstruct the hierarchy without ambiguity.
+
+---
+
+## 2. Medical Journal Articles (IMRaD Format)
+
+The IMRaD structure (Abstract, Introduction, Methods, Results, Discussion) is the dominant organizational pattern for biomedical journal articles. Section headers may be formatted as bold inline text, as visually distinct heading blocks, or, in two-column layouts, as text spanning the full page width while body text flows in columns. pdftract must detect IMRaD section boundaries regardless of whether they carry PDF structural tags, and emit them as block-level labels so that downstream processing can distinguish Methods text from Results text without manual parsing. 
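As a rough illustration of the text-matching half of that detection, the sketch below maps a candidate heading string to an IMRaD label. It is an assumption-laden simplification: the enum, the function name, and the small synonym list are all hypothetical, and a real detector would combine this with font weight, size, and layout signals before accepting a line as a heading.

```rust
/// Block-level labels for IMRaD sections (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImradSection {
    Abstract,
    Introduction,
    Methods,
    Results,
    Discussion,
}

/// Classify a heading candidate by its text alone.
/// Leading numbering ("2.", "3.1") and trailing ':' or '.' are ignored,
/// and matching is case-insensitive.
pub fn classify_heading(text: &str) -> Option<ImradSection> {
    let label = text
        .trim()
        .trim_start_matches(|c: char| c.is_ascii_digit() || c == '.' || c.is_whitespace())
        .trim_end_matches(|c: char| c == ':' || c == '.' || c.is_whitespace())
        .to_lowercase();
    match label.as_str() {
        "abstract" => Some(ImradSection::Abstract),
        "introduction" => Some(ImradSection::Introduction),
        "methods" | "materials and methods" => Some(ImradSection::Methods),
        "results" => Some(ImradSection::Results),
        "discussion" => Some(ImradSection::Discussion),
        _ => None,
    }
}
```

Returning `None` for anything unrecognized keeps ordinary bold text, such as a figure title, from being promoted to a section boundary on wording alone.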
+
+Structured abstracts add a sublevel: Background, Objective, Methods, Results, Conclusions appear as bold inline labels within a single abstract block. These must be captured as labeled sub-blocks rather than collapsed into undifferentiated paragraph text.
+
+Two-column journal layouts require careful column ordering. Text in the left column must be read before text in the right column at equivalent vertical positions, and figure captions placed between columns or spanning both must be associated with the correct figure rather than inserted mid-sentence into the adjacent body text stream.
+
+---
+
+## 3. Drug Labeling PDFs (FDA Prescribing Information)
+
+FDA-format package inserts follow a mandated section structure numbered 1 through 17: Indications and Usage (1), Dosage and Administration (2), Dosage Forms and Strengths (3), Contraindications (4), Warnings and Precautions (5), Adverse Reactions (6), Drug Interactions (7), Use in Specific Populations (8), Drug Abuse and Dependence (9), Overdosage (10), Description (11), Clinical Pharmacology (12), Nonclinical Toxicology (13), Clinical Studies (14), References (15), How Supplied/Storage and Handling (16), Patient Counseling Information (17). pdftract must recognize and tag these numbered headings to enable downstream systems to locate, for example, all Warnings and Precautions content across a batch of package inserts.
+
+The Highlights of Prescribing Information box appears at the top of the label as a visually distinct shaded or bordered block, formatted in a condensed multi-column layout that summarizes the most critical sections. Extraction must preserve the box as a distinct block separate from the full prescribing information body, because regulatory systems treat Highlights as a standalone artifact.
+
+Boxed Warnings, colloquially Black Box Warnings, are surrounded by a heavy border and typically set in bold text. They represent the highest severity safety signal in US labeling. 
pdftract must detect boxed warning blocks and apply a safety-critical flag to the extracted text so that downstream consumers can surface these warnings without parsing the full document. Detection heuristics include: a bordered rectangle enclosing bold text near the document top, the phrase "WARNING" or "BOXED WARNING" as a heading within the bordered region, and positioning before section 1 of the main label body.
+
+---
+
+## 4. Lab Reports and Pathology Reports
+
+Laboratory and pathology report PDFs are typically generated from laboratory information systems (LIS) and follow a predictable structure: a patient metadata header containing name, date of birth, collection date/time, ordering provider, specimen type, and accession number; followed by a results table with columns for test name, result value, reference range, units, and an abnormality flag (H for high, L for low, C for critical).
+
+pdftract must segment these tables by column boundaries before reading across rows, rather than concatenating each row's text in stream order, to prevent units from being concatenated with result values or flags from drifting to incorrect rows. Reference ranges expressed as "3.5–5.0" must survive extraction with the en dash intact rather than being converted to a hyphen or dropped entirely. Flag values (H, L, HH, LL, C, A) are short and may be misread as part of the unit column if column alignment is not honored.
+
+Patient metadata headers warrant special handling: the fields appear in varied multi-column layouts and sometimes as label-colon-value inline text. pdftract must recognize the header region and extract it as structured metadata distinct from result rows.
+
+---
+
+## 5. Scientific Paper Extraction: DOIs, Citations, and Author Affiliations
+
+Reference sections in scientific papers follow numbered or author-date citation formats. DOIs appear as "https://doi.org/10.xxxx/..." or as bare "DOI: 10.xxxx/..." strings. 
pdftract must extract DOIs as intact strings; line-wrapping within a DOI is a frequent extraction failure point because a hyphenated break inside the DOI path produces an invalid identifier. pdftract must detect mid-DOI line breaks and rejoin them.
+
+Author affiliation blocks link author names to institutions via superscript numerals or symbols. The author line appears at the top of the article, with each author followed by one or more superscript markers (¹, ², *), and the corresponding affiliations appear in smaller text below. pdftract must associate superscript markers with their author names and with their expanded affiliation strings, preserving the many-to-many relationship in the extraction output.
+
+ORCID iDs appear as 16-character strings in the format 0000-0002-1825-0097 (15 digits followed by a check character that is a digit or the letter X), often following the ORCID logo (an image) or the label "ORCID:". pdftract must extract these as text, treating the image as non-extractable but capturing the adjacent identifier string.
+
+---
+
+## 6. Chemical Structures in PDFs
+
+Chemical structure diagrams are rendered as vector graphics or embedded images. The atoms and bonds that constitute a structural formula are not encoded as text characters in the PDF content stream; they are geometric drawing operations. pdftract must not attempt to interpret structure diagrams as text and must instead mark the bounding region as a figure placeholder.
+
+The extractable chemical identity information appears in surrounding text as IUPAC systematic names (e.g., "(2S)-2-amino-3-(4-hydroxyphenyl)propanoic acid"), CAS registry numbers (e.g., CAS 60-18-4), and occasionally as InChI strings (e.g., "InChI=1S/..."). InChI strings are long and may wrap across lines; pdftract must detect them by prefix and rejoin wrapped segments. SMILES strings may also appear in supplementary tables. All of these are plain text and must be extracted verbatim.
+
+---
+
+## 7. 
Statistical Notation
+
+Medical papers are dense with statistical notation that is vulnerable to extraction errors. Confidence intervals appear as "95% CI: 1.23–4.56" or "95% CI [1.23, 4.56]"; the en dash or em dash separating the bounds must survive as the correct Unicode character rather than being dropped or replaced with an ASCII hyphen. p-values are expressed with the less-than symbol: "p < 0.001"; the `<` character must not be interpreted as an XML/HTML tag boundary. Similarly, ≤ and ≥ are used in threshold expressions and must be extracted as their Unicode code points (U+2264, U+2265).
+
+Hazard ratios and odds ratios appear with confidence intervals: "HR 0.72 (95% CI 0.58–0.89)"; the parenthetical CI must be kept on the same logical line as the ratio. The multiplication sign × (U+00D7) appears in cell count expressions and must be preserved; the common failure is replacement with the letter x.
+
+---
+
+## 8. Units and Measurement Notation
+
+SI unit notation in medical PDFs includes the micro prefix µ (U+00B5 or U+03BC), which must not be replaced with the ASCII letter u. Degree symbols ° (U+00B0) appear in temperature measurements. These are frequently lost when PDFs are produced from fonts that map these characters to non-standard code points.
+
+Subscript characters in chemical formulas (H₂O, CO₂, CaCO₃) may be encoded as actual Unicode subscript digits (such as U+2082 and U+2083) or as font-size-reduced characters positioned below the baseline. pdftract must normalize both representations to Unicode subscript characters or, where appropriate, to a plaintext markup form, rather than omitting them or rendering them at the same baseline as surrounding text.
+
+---
+
+## 9. Supplementary Material References
+
+Authors routinely direct readers to supplementary material that exists as a separate file. 
Phrases such as "Supplementary Table S1", "Supplementary Figure S2", "see online supplementary appendix", and "eTable 3 in the Supplement" appear as inline text and must be extracted verbatim. These cross-references are textually meaningful even when the supplement is not present, because they indicate that a referenced data artifact exists. pdftract must not strip these references as unreachable hyperlink targets.
+
+---
+
+## 10. Regulatory Submission PDFs (NDA/BLA eCTD Format)
+
+Electronic Common Technical Document (eCTD) submissions are organized into five modules: Module 1 (Administrative), Module 2 (Summaries), Module 3 (Quality), Module 4 (Nonclinical Study Reports), Module 5 (Clinical Study Reports). Individual PDF files within an eCTD submission are navigated via a bookmark tree that mirrors the eCTD section hierarchy (e.g., m5/5.3/5.3.5.1/). pdftract must extract the PDF bookmark tree and emit it as a structured outline alongside the document text, because eCTD reviewers navigate primarily by section number.
+
+Form FDA 356h is the cover sheet for NDA and BLA submissions. It is a structured form with labeled fields: applicant name, address, NDA/BLA number, date of submission, proposed proprietary name, established name, pharmacological class, dosage form, route of administration, and a checklist of attached documents. pdftract must recognize form field regions and extract label-value pairs from them, preserving field boundaries even when the form is rendered as a static PDF image of a filled form rather than an interactive AcroForm.
+
+The Module 2 summaries (Quality Overall Summary, Nonclinical Overview, Clinical Overview, Clinical Summary) are narrative documents that reference section numbers in Modules 3–5. These cross-references (e.g., "see section 5.3.5.1") must be preserved as text so that downstream indexing can reconstruct the cross-module citation graph. 
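One way a downstream indexer might harvest those cross-references from preserved text is sketched below. This is a hypothetical helper, not part of pdftract: it only recognizes a dotted number immediately following the word "section" or "sections", which is an assumption about phrasing that real submissions will not always satisfy.

```rust
/// Collect dotted section numbers that directly follow the word
/// "section" or "sections" (case-insensitive). Hypothetical helper.
pub fn section_refs(text: &str) -> Vec<String> {
    let mut refs = Vec::new();
    let mut words = text.split_whitespace().peekable();
    while let Some(word) = words.next() {
        // Strip surrounding punctuation such as "(" before comparing.
        let bare = word.trim_matches(|c: char| !c.is_alphanumeric());
        let is_keyword =
            bare.eq_ignore_ascii_case("section") || bare.eq_ignore_ascii_case("sections");
        if !is_keyword {
            continue;
        }
        if let Some(next) = words.peek() {
            // Drop leading and trailing non-digits, e.g. the ")" in "5.3.5.1)".
            let candidate = next.trim_matches(|c: char| !c.is_ascii_digit());
            let well_formed = !candidate.is_empty()
                && candidate
                    .split('.')
                    .all(|part| !part.is_empty() && part.chars().all(|c| c.is_ascii_digit()));
            if well_formed {
                refs.push(candidate.to_string());
            }
        }
    }
    refs
}
```

Because the helper works on extracted text only, it depends on the extractor having kept the section number on the same logical line as the word that introduces it.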
+
+---
+
+## Summary
+
+Correct extraction from medical and scientific PDFs requires pdftract to handle section hierarchy preservation, multi-column layout ordering, special Unicode characters, table column integrity, figure-caption association, form field recognition, and bookmark tree extraction as first-class concerns. Failures in any of these areas produce output that is either unreadable to human reviewers or unparseable by downstream regulatory and clinical data systems. The patterns described here define the minimum correctness bar for pdftract to be a reliable tool in pharmaceutical, clinical research, and academic scientific workflows.
diff --git a/docs/research/multilingual-document-extraction.md b/docs/research/multilingual-document-extraction.md
new file mode 100644
index 0000000..b719f05
--- /dev/null
+++ b/docs/research/multilingual-document-extraction.md
@@ -0,0 +1,73 @@
+# Multilingual and Mixed-Script PDF Extraction
+
+## Overview
+
+PDF extraction from multilingual documents is one of the most demanding problems in the text extraction domain. A document mixing Latin prose with Arabic footnotes, or a Japanese academic paper that includes Hebrew proper nouns in a citation, demands that the extraction engine handle fundamentally different directionality models, encoding conventions, joining behaviors, and layout assumptions simultaneously. This document defines what pdftract must implement to produce logically ordered, Unicode-correct text from such documents.
+
+## 1. Mixed-Script Pages and Reading Order at the Span Level
+
+When a page contains both Latin and Arabic text (or Latin and CJK, or Latin and Hebrew), the naive approach of concatenating glyphs left-to-right by x-coordinate will produce garbage. Directionality is not a page property; it is a span-level property. 
A single paragraph can contain a Latin phrase, an embedded Arabic noun, and a Latin continuation, each segment flowing in a different direction before the paragraph proceeds left-to-right.
+
+pdftract must assign a base directionality to each span based on its script content, then compose spans using the Unicode Bidirectional Algorithm rather than geometric order alone. Span boundaries must be drawn where script or directionality changes, not solely where font or size changes. Each code point is tested against Unicode script property tables (UCD `Script` property), and a new span is opened when the dominant script transitions between families with different base directions.
+
+For Latin+CJK mixed text, directionality is less fraught because CJK is nominally left-to-right in horizontal layout, but span assembly still requires care: CJK characters carry different line metrics, and naively joining them with Latin characters using the same inter-glyph spacing assumptions breaks word boundaries.
+
+## 2. Unicode Bidirectional Algorithm (UAX #9) and Paragraph Embedding Levels
+
+The Unicode Bidirectional Algorithm (UBA, defined in UAX #9) is the normative framework for ordering characters within a paragraph of mixed-directionality text. pdftract must implement a conforming UBA resolver for span assembly rather than relying on geometric ordering.
+
+The Paragraph Embedding Level (PEL) is the foundational concept: an integer (0 for LTR, 1 for RTL) determined by the first strong directional character in the paragraph. All relative ordering of runs derives from this level. pdftract should derive the PEL from the dominant script of the first strong-character span on each logical text block.
+
+Explicit embedding uses Unicode control characters: Left-to-Right Embedding (LRE, U+202A), Right-to-Left Embedding (RLE, U+202B), and Pop Directional Formatting (PDF, U+202C). Some generators insert these into ToUnicode output streams. 
pdftract must preserve them and feed them to the UBA resolver rather than stripping them as noise. When absent, implicit bidi infers run boundaries from strong directional characters. After resolving bidi levels, spans must be reordered to logical order before emission, because search engines, NLP pipelines, and screen readers all expect logical order.
+
+## 3. RTL Language Extraction: Visual vs. Logical Order Detection
+
+Arabic and Hebrew are natively right-to-left. The critical divergence is how a given PDF encodes RTL text: in logical order (as a native reader would type) or in visual order (glyph positions left-to-right by x-coordinate, readable only when reversed).
+
+A PDF from XeLaTeX or a modern word processor stores RTL glyphs in logical order, relying on the glyph positioning matrix for visual placement. PDFs from older Acrobat workflows or PostScript distillers may store glyphs in visual order. pdftract must detect which convention applies by comparing the geometric glyph sequence against the UBA-resolved expected sequence. If RTL glyphs are already reversed relative to the UBA prediction, the stream is in visual order and must be reversed before output. Detection must occur per text block, not per page, since a single page may contain content from different software pipelines.
+
+## 4. Arabic Cursive Joining and Encoding Source Detection
+
+Arabic letters have up to four contextual forms: isolated, initial, medial, and final. The base letters are encoded in the primary Arabic block (U+0600–U+06FF), while the individual contextual forms are encoded separately in the Arabic Presentation Forms blocks (U+FB50–U+FDFF, U+FE70–U+FEFF). Presentation forms are legacy encodings; logical-order extraction must normalize them back to base code points.
+
+pdfLaTeX, XeLaTeX, and modern Word PDF export encode Arabic in logical order using base code points. Acrobat distilling from PostScript may encode Arabic using presentation-form code points in visual order. 
pdftract detects which applies by inspecting the ToUnicode CMap of each font covering the Arabic range. When presentation forms are present, the pipeline must: (a) reverse the glyph sequence from visual to logical order, and (b) normalize presentation-form code points to their base equivalents via Unicode compatibility decomposition (NFKC or NFKD; canonical normalization leaves presentation forms unchanged) or an explicit shaping reversal table. This normalization must precede span assembly so that word-boundary detection operates on canonical code points.
+
+## 5. Hebrew: Logical Order, Legacy Visual Order, and Nikud
+
+Modern Hebrew PDFs from Word, LibreOffice, or XeLaTeX store text in logical order, with glyph positioning matrices handling visual placement. Legacy PDFs from WordPerfect for DOS or early Windows word processors may use visual order; the same detection heuristic as Arabic applies.
+
+Hebrew nikud (vowel diacritics) and cantillation marks (trop) are Unicode combining characters in the Hebrew block (U+05B0–U+05C7 for nikud, U+0591–U+05AF for cantillation). They attach to their base consonant as combining character sequences. pdftract must preserve these sequences in canonical combining class order. Extractors that process glyphs atomically silently drop nikud, producing Hebrew text that loses vowelization, a significant data loss for liturgical, classical, or pedagogically annotated texts.
+
+## 6. Mixed-Column Layout with RTL Reading Order
+
+Arabic-language newspapers and RTL-dominant documents use multi-column layouts where columns themselves flow right-to-left: the rightmost column is read first, not last. Column detection algorithms that assume LTR column order will produce a completely inverted reading sequence.
+
+pdftract's column detection must be script-aware. When the dominant script on a page is RTL, columns are sorted by descending x-coordinate and inter-column reading order proceeds right-to-left; lines within each column remain top-to-bottom. The page's dominant PEL drives this decision. 
Pages with mixed column layouts — an RTL main body alongside an LTR sidebar — require column-level script classification to assign independent reading orders per region. + +## 7. CJK Vertical Text + +Japanese and Chinese documents frequently use vertical writing mode, where text flows top-to-bottom within columns that themselves flow right-to-left. PDF represents this with Td and Tm operators: in vertical mode, dominant glyph displacement is along the y-axis, and advance values come from the font's vertical metrics (vmtx table). + +pdftract must detect vertical writing mode by examining the Tm matrix for 90-degree rotations and checking whether the font declares vertical metrics. When detected, span assembly uses y-coordinate ordering to sequence glyphs and "column width" replaces "line height" for layout reconstruction. Vertical text also intermixes horizontal Latin numerals and abbreviations via tatechuyoko (horizontal-in-vertical) sub-mode; pdftract must detect these horizontal runs within a vertical column and preserve them inline without rotating their characters. + +## 8. Font Fallback and Cross-Script Unicode Contamination + +A multilingual PDF may embed ten or more fonts covering different scripts. ToUnicode CMaps must be applied strictly per-font: the same byte sequence in two different font resources may map to entirely different Unicode code points. pdftract must maintain font context for every glyph and never apply a CMap from one font to glyphs rendered in another. + +The contamination risk is highest when a PDF uses PUA (Private Use Area) ranges in one font to encode a script that another font encodes in a standard range, or when two fonts share overlapping byte ranges mapped differently. pdftract's span assembly must tag each character with its source font identifier and permit cross-font joining only when Unicode ranges are orthogonal. 
When joining across font boundaries, Unicode normalization (NFC for most scripts, NFD for scripts with combining characters) must be applied consistently. + +## 9. Transliterated Text in Scientific Documents + +Scientific and linguistic papers routinely present native-script terms alongside Latin transliterations: a paper on Arabic morphology may write "كتب (kataba)" in a single inline run. pdftract must preserve both the native-script sequence and the transliteration in their presented order, without discarding either as decorative. + +Span-merging heuristics that consolidate spans by script will incorrectly split these inline mixed pairs. pdftract's span merger must treat script-mixed spans as atomic when they appear within a single bounding box at the same baseline, joining them as a single logical unit whose internal directionality is resolved by the UBA before output. + +## 10. Language Tagging and the PDF Lang Attribute + +PDF's Tagged PDF structure supports a `Lang` attribute at the document, page, structure element, and span levels, following BCP 47 syntax (e.g., `ar-SA`, `he-IL`, `ja-JP`). NLP pipelines use these tags to select tokenizers, search engines use them for stemmers, and accessibility tools use them for text-to-speech voice selection. + +pdftract must extract `Lang` attributes at every level and propagate them via structure tree inheritance: a span with no explicit `Lang` inherits from its enclosing structure element, which inherits from the page, which inherits from the document. Fine-grained values override coarser defaults. The resolved tag attaches to each output span as metadata, letting downstream consumers validate language detection and correctly process spans whose script alone is ambiguous — Latin text may be English, Turkish, Vietnamese, or dozens of other languages depending on diacritic patterns. 
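
The inheritance rule described above can be sketched as a walk from a node toward the document root that stops at the first explicit `Lang`. This is a minimal illustration, not pdftract's actual data model: the `LangNode` type, the flat index-based tree, and the `resolve_lang` function are all hypothetical names introduced here.

```rust
/// Hypothetical node in a simplified structure tree (document, page,
/// structure element, or span). Not pdftract's real type.
#[derive(Debug)]
struct LangNode {
    /// Index of the enclosing node, or `None` for the document root.
    parent: Option<usize>,
    /// Explicit BCP 47 tag declared on this node, if any.
    lang: Option<String>,
}

/// Return the nearest explicit `Lang` found on `node` or any ancestor.
/// Finer-grained tags win because the walk starts at the node itself.
fn resolve_lang(nodes: &[LangNode], node: usize) -> Option<&str> {
    let mut idx = node;
    // Bound the walk by the node count so a malformed (cyclic) parent
    // chain cannot loop forever.
    for _ in 0..nodes.len() {
        let n = nodes.get(idx)?;
        if let Some(tag) = n.lang.as_deref() {
            return Some(tag);
        }
        // No explicit tag here: move up, or give up at the root.
        idx = n.parent?;
    }
    None
}
```

With a document tagged `en-US` and one structure element tagged `ar-SA`, a span inside that element resolves to `ar-SA`, while a sibling span with no tagged ancestor below the document resolves to `en-US`. Returning `None` when nothing is declared leaves the decision to downstream language detection.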
+ +## Summary + +Correct multilingual extraction requires pdftract to implement: UBA-conforming bidi resolution with PEL detection; visual-to-logical normalization for Arabic and Hebrew with generator-convention detection; Arabic presentation-form normalization; Hebrew combining character preservation; RTL-aware column ordering; vertical CJK handling with tatechuyoko support; per-font CMap enforcement without cross-contamination; preservation of transliterated inline pairs; and propagation of PDF `Lang` attributes to output spans. Each is a distinct implementation concern. Failure in any one produces silently incorrect output — the bytes are present, but in the wrong order or normalization form. diff --git a/docs/research/shading-pattern-and-text-visibility.md b/docs/research/shading-pattern-and-text-visibility.md new file mode 100644 index 0000000..90fa5ac --- /dev/null +++ b/docs/research/shading-pattern-and-text-visibility.md @@ -0,0 +1,79 @@ +# Shading, Pattern Fills, and Their Interaction with Text Visibility + +## Overview + +PDF color is a layered system: color spaces define how numeric values map to perceptual colors, paint operators apply those values to the current path or glyph, and compositing operators blend painted content onto the page. For a text extraction library, the critical distinction is between color attributes that affect *what characters exist in the content stream* and those that affect only *how those characters are rendered*. Character codes, Unicode mappings, and glyph positions survive every color operation; what can genuinely suppress extraction value is text that was intentionally painted to be invisible — matching or nearly matching its background — as a watermark defense or layout artifact. pdftract must navigate this distinction without discarding valid content and without treating deliberately hidden text as a false negative. + +## PDF Color Space Taxonomy and Text Readability + +PDF defines color spaces across three tiers. 
Device spaces (DeviceGray, DeviceRGB, and DeviceCMYK) map values directly to output device primaries with no calibration curve. CIE-based spaces (CalGray, CalRGB, Lab, and ICCBased) are device-independent: CalGray and CalRGB embed a white point and gamma, Lab embeds a white point, and ICCBased embeds an ICC profile. Special spaces (Indexed, Pattern, Separation, and DeviceN) add indirection through lookup tables, pattern dictionaries, or alternate spaces. + +From a text-extraction standpoint, device spaces and the CalGray, CalRGB, and Lab spaces are straightforward: a single numeric tuple fully describes the painted color, and pdftract can compute its luminance for contrast analysis. ICCBased spaces reduce to an alternate color space when the embedded profile is not interpreted, which is the common case at extraction time; pdftract should treat them as their declared alternate or, if none is declared, as the device space implied by the component count `/N`. Indexed spaces map a single integer through a lookup table to entries in a base space, so the effective color is always resolvable given the lookup table embedded in the PDF. + +Pattern and Separation spaces require separate treatment and are addressed in later sections. DeviceN is a generalization of Separation that covers multi-ink systems; it includes an alternate space and a tinting function that approximates the multi-ink blend in that alternate space, and pdftract uses that alternate for luminance estimation. + +## Pattern Color Spaces: Tiling and Shading + +When the current color space is set to `/Pattern`, paint operations use a pattern dictionary rather than a numeric tuple. Type 1 patterns (tiling) define a cell that is replicated across the painted region; Type 2 patterns (shading) compute color from a shading function and paint it across the bounding box. + +A critical architectural point: when text glyphs are painted with a pattern fill, the character codes are already present in the content stream, bound to their Unicode mappings through the font's ToUnicode CMap or encoding vector. The pattern fill is a rendering instruction applied *after* the character is decoded.
pdftract reads character codes during content stream parsing, before any paint step, so pattern-filled text requires no special extraction handling. The text is fully extractable regardless of the pattern type. + +The only practical implication is metadata: pdftract may annotate a span with `fill_type: pattern` to signal that visual rendering requires pattern evaluation, but this annotation carries no effect on Unicode recovery or position confidence. + +## Shading Types and Gradient-Rendered Text + +PDF defines seven shading types under `/ShadingType`: function-based (type 1), axial/linear (type 2), radial (type 3), free-form Gouraud-shaded triangle mesh (type 4), lattice-form triangle mesh (type 5), Coons patch mesh (type 6), and tensor-product patch mesh (type 7). Types 2 and 3 are the gradient forms most commonly seen in practice; types 4–7 appear in complex illustration work. + +When text itself is painted with a shading fill — an uncommon but valid PDF construction that requires a special graphics state sequence involving a clip to the glyph outlines followed by a shading paint — extraction is entirely unaffected. The character codes, positions, and font metadata were already parsed. pdftract records the presence of a shading fill as span metadata for downstream rendering systems, but takes no special extraction action. + +## Gradient Backgrounds and Contrast Detection + +The more common interaction between shading and text is a background shading rectangle painted before the text. An axial or radial gradient is drawn across the page or column region, and then text is painted on top in a fixed color. Here, the text color is deterministic but the background is spatially varying. + +pdftract must estimate the effective luminance of the background at each span's bounding region to assess whether sufficient contrast exists. 
For axial gradients, this means evaluating the shading function at the x- and y-coordinates corresponding to the span's center point or, for spans spanning a wide gradient region, evaluating at both endpoints and taking the minimum contrast. The shading function (a sampled function of type 0, exponential of type 2, stitching of type 3, or PostScript calculator of type 4) is embedded in the shading dictionary and can be evaluated at arbitrary coordinates. + +For mesh shadings (types 4–7), evaluating the function at an arbitrary point is computationally involved. pdftract should fall back to sampling the declared color space bounds of the shading dictionary (the `/BBox` combined with the `/Function` domain endpoints) to compute a worst-case luminance range. If the entire luminance range produces contrast ratios above the threshold, the text is confidently visible; if any part of the range falls below threshold, pdftract applies the `low_contrast` confidence penalty and records the computed range in span metadata. + +## White-on-White and Low-Contrast Text Detection + +Text painted in a color that matches or nearly matches the page background is the primary class of intentionally invisible text. Detection requires computing the WCAG 2.1 relative luminance contrast ratio between the text color and the background color at that span's location. + +Relative luminance L is computed from linearized RGB: each sRGB channel c is linearized as `c/12.92` when `c <= 0.04045` and `((c + 0.055)/1.055)^2.4` otherwise, then combined as `L = 0.2126*R + 0.7152*G + 0.0722*B`. The contrast ratio is `(L_lighter + 0.05) / (L_darker + 0.05)`.
WCAG 2.1 AA requires a ratio of 4.5:1 for normal text. pdftract uses a much lower internal cutoff of 1.5:1, below which text is classified as color-hidden. Text with a ratio between 1.5 and 4.5 is hard to read but may still be intentionally visible, for example watermarks or light-colored metadata annotations that the author designed for a custom background. + +When the page background is white (the default), the computation simplifies: any text color whose luminance L satisfies `(1.05) / (L + 0.05) < 1.5`, that is `L > 0.65`, is treated as color-hidden. This covers light gray, near-white, and white text. + +The background color at any span position is resolved by walking backward through the sequence of paint operations to find the most recently painted opaque fill intersecting the span bounding box. For complex pages with layered content, pdftract performs this analysis during a two-pass content stream parse: the first pass builds a background color map by recording all filled rectangles and their colors; the second pass assigns a background color to each text span. + +## Spot Colors and DeviceN Alternate Space Mapping + +Spot colors (Separation color spaces referencing named inks such as Pantone values) are defined with an alternate color space and a tinting function that approximates the spot ink in the alternate space. When text is painted in a spot color, pdftract evaluates the tinting function at full coverage (tint value 1.0) in the alternate space to estimate the effective color. If the alternate space is DeviceRGB or DeviceCMYK, the resulting value is converted to luminance through the standard path. + +When the alternate space maps to near-white at full tint (a luminance above 0.65), pdftract applies the same `color_hidden` classification as for direct white-text cases.
If the alternate space is unavailable or the tinting function is a PostScript type 4 function that pdftract cannot evaluate, the span is extracted with full confidence and annotated with `fill_type: spot_color_unknown_alternate`. Dropping text because the spot color alternate could not be resolved would produce false negatives; the conservative policy is to extract and annotate. + +DeviceN follows the same logic. pdftract evaluates the DeviceN tinting function at the colorant values specified in the text paint operation, maps the result through the alternate space, and applies luminance-based contrast analysis. If evaluation fails, the span is extracted with reduced confidence rather than suppressed. + +## Pattern-Filled Backgrounds and OCR Fallback Policy + +When a tiling pattern creates a textured background — hatching, stippling, or repeating imagery — underneath vector text, the vector extraction path is entirely unaffected: character codes come from the text content stream, not from rendered pixels. pdftract's vector extraction reads the font and text operators directly and requires no image processing. OCR is a fallback mechanism for pages or regions where vector text is absent — scanned pages, image-only XObjects, or rasterized text. + +The policy is: attempt vector extraction first; if the content stream contains text operators for a given region, accept those results regardless of the visual complexity of the background. OCR is only invoked when the vector extraction pass returns no text for a region that contains rasterized content. A background tiling pattern does not demote the page to OCR status. + +## Transparency Groups and Reduced-Opacity Text + +Form XObjects may declare themselves as transparency groups, and their contents can be painted onto the page with a reduced alpha value via the `ca` (fill opacity) and `CA` (stroke opacity) graphics state parameters. 
When text inside a transparency group is painted at reduced opacity (for instance, a watermark group at `ca 0.3`), the character codes, font references, and positions within the XObject content stream are fully parsed by pdftract's content stream reader during XObject traversal. Opacity is a compositing parameter applied at rendering time; it does not remove characters from the stream. + +pdftract's content stream parser recursively descends into Form XObjects, inheriting the graphics state at the invocation point. Opacity values from the invoking graphics state are recorded in span metadata as `opacity: 0.3` but have no bearing on whether the span is extracted. A text span at any nonzero opacity is extractable. + +## Blend Modes and Unicode Recovery + +The `/BM` graphics state key sets the blend mode used when painting onto the page. Non-Normal blend modes (Multiply, Screen, Overlay, Darken, Lighten, ColorDodge, ColorBurn, HardLight, SoftLight, Difference, Exclusion, Hue, Saturation, Color, Luminosity) affect the composited pixel output but have no effect on character decoding. pdftract records the active blend mode in span metadata as `blend_mode: Multiply` (for example) to allow downstream consumers to reason about visual appearance, but Unicode recovery is identical for all blend modes. + +The only extraction-relevant consequence of a non-Normal blend mode is the contrast analysis step. Multiply applied to black text on a white background yields black, which is visible; applied to light gray text on white, it yields the same light gray, because multiplying by a white backdrop leaves the source color unchanged, so the text stays low-contrast. Over a darker backdrop, the Multiply result is never lighter than either the text or the backdrop. If contrast analysis is triggered, pdftract must account for the blend mode when computing the effective composited color. For Normal, this is a direct substitution; for other modes, pdftract computes the composited result using the standard blend mode equations before applying the luminance threshold check.
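
As a rough illustration of computing the composited color before the contrast check, the sketch below implements a handful of the separable blend functions and applies fill opacity over a backdrop. It assumes a fully opaque backdrop and colors already expressed as RGB components in `[0, 1]`; the `BlendMode` enum, `blend_channel`, and `composite` are hypothetical names for this example, and only five of the modes are covered.

```rust
/// A small subset of the separable blend modes, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BlendMode {
    Normal,
    Multiply,
    Screen,
    Darken,
    Lighten,
}

/// Per-channel blend function B(backdrop, source) for the modes above.
fn blend_channel(mode: BlendMode, backdrop: f64, source: f64) -> f64 {
    match mode {
        BlendMode::Normal => source,
        BlendMode::Multiply => backdrop * source,
        BlendMode::Screen => backdrop + source - backdrop * source,
        BlendMode::Darken => backdrop.min(source),
        BlendMode::Lighten => backdrop.max(source),
    }
}

/// Composite `source` over an opaque `backdrop` at fill opacity `alpha`.
/// Assumption: backdrop alpha is 1.0, so the result reduces to
/// (1 - alpha) * backdrop + alpha * B(backdrop, source) per channel.
fn composite(mode: BlendMode, backdrop: [f64; 3], source: [f64; 3], alpha: f64) -> [f64; 3] {
    let a = alpha.clamp(0.0, 1.0);
    let mut out = [0.0; 3];
    for i in 0..3 {
        let blended = blend_channel(mode, backdrop[i], source[i]);
        out[i] = (1.0 - a) * backdrop[i] + a * blended;
    }
    out
}
```

The composited color, rather than the raw text color, is what would be fed into the luminance threshold check. For example, light gray text under Multiply on white composites to the same light gray, whereas under Screen it composites to white and disappears entirely.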
+ +## Extraction Policy for Color-Hidden Text + +Spans that fall below the contrast threshold receive a `color_hidden: true` flag in the extraction output. This flag does not suppress the span from the results. The character data, position, and font metadata are included in full; the flag is advisory, informing the caller that the text was likely not intended to be read by a human viewer of the rendered document. Extraction confidence is reduced in two tiers: spans at contrast ratio below 1.1 (near-invisible) receive a confidence penalty of 0.4; spans between 1.1 and 1.5 receive a penalty of 0.2. + +The rationale for extracting rather than suppressing is that extraction consumers (search indexers, accessibility tools, content pipelines) derive value from the text regardless of its visual presentation. Invisible text in a PDF may represent hidden metadata, copy-protection watermarks, or template artifacts; all of these are legitimate extraction targets. Suppression would be a silent false negative. The `color_hidden` flag gives callers the information they need to apply their own policy. + +When reporting extraction output in structured form, pdftract groups `color_hidden` spans in a dedicated section of the output manifest, alongside their contrast ratios and the color values resolved for both text and background. This audit trail allows callers to verify the classification and override it if domain knowledge suggests the text was intentionally visible in a specialty printing context that pdftract's sRGB-based luminance model does not capture.
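
The luminance formula, contrast ratio, and penalty tiers described in this document can be tied together in a short sketch. This is an illustration under stated assumptions rather than pdftract's implementation: colors are sRGB components in `[0, 1]`, the names `ContrastVerdict` and `classify` are hypothetical, and a ratio of exactly 1.1 is assigned to the milder tier, a boundary choice the text does not specify.

```rust
/// Linearize one sRGB channel in [0, 1].
fn linearize(c: f64) -> f64 {
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// Relative luminance L = 0.2126*R + 0.7152*G + 0.0722*B on linearized channels.
fn relative_luminance(rgb: [f64; 3]) -> f64 {
    0.2126 * linearize(rgb[0]) + 0.7152 * linearize(rgb[1]) + 0.0722 * linearize(rgb[2])
}

/// Contrast ratio (L_lighter + 0.05) / (L_darker + 0.05); always >= 1.
fn contrast_ratio(a: [f64; 3], b: [f64; 3]) -> f64 {
    let (la, lb) = (relative_luminance(a), relative_luminance(b));
    let (lighter, darker) = if la >= lb { (la, lb) } else { (lb, la) };
    (lighter + 0.05) / (darker + 0.05)
}

/// Hypothetical result record; field names mirror the flags in the text.
#[derive(Debug)]
struct ContrastVerdict {
    ratio: f64,
    color_hidden: bool,
    confidence_penalty: f64,
}

/// Apply the 1.5 cutoff and the two penalty tiers (0.4 below 1.1, 0.2 below 1.5).
fn classify(text: [f64; 3], background: [f64; 3]) -> ContrastVerdict {
    let ratio = contrast_ratio(text, background);
    let (color_hidden, confidence_penalty) = if ratio < 1.1 {
        (true, 0.4)
    } else if ratio < 1.5 {
        (true, 0.2)
    } else {
        (false, 0.0)
    };
    ContrastVerdict { ratio, color_hidden, confidence_penalty }
}
```

Calling `classify` with white text on a white background gives a ratio of about 1.0 and the 0.4 penalty, while light gray at `[0.9, 0.9, 0.9]` on white lands in the 0.2 tier. In both cases the span is still returned; the verdict only annotates it.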