diff --git a/docs/research/color-management-and-icc-profiles.md b/docs/research/color-management-and-icc-profiles.md new file mode 100644 index 0000000..e3f2d81 --- /dev/null +++ b/docs/research/color-management-and-icc-profiles.md @@ -0,0 +1,75 @@ +# Color Management, ICC Profiles, and Color Space Conversions in PDFs + +## Overview + +PDF color is richer than extraction requires, but understanding its structure is necessary to implement one extraction-critical feature: luminance estimation for text visibility detection. pdftract does not perform color-managed rendering — it never converts pixel values through ICC profiles or simulates halftone screens. Its goal is narrower: for each text span, estimate approximate luminance well enough to classify the span as visible or color-hidden. This document maps the full PDF color space hierarchy onto that goal, identifying which constructs require active handling and which can be ignored. + +## PDF Color Space Hierarchy + +PDF organizes color spaces into three tiers with different implications for text extraction. + +**Device-dependent spaces** — DeviceGray, DeviceRGB, and DeviceCMYK — map numeric values directly to output device primaries without calibration. These are the most common spaces for text coloring in practice. Because they carry no embedded profile, converting to luminance requires only a formula, not profile evaluation. + +**Device-independent (calibrated) spaces** — CalGray, CalRGB, and Lab — embed a white point and gamma. CalGray is a single-channel calibrated space; CalRGB extends that to three channels with a matrix to XYZ; Lab encodes color in CIE L\*a\*b\* where L\* is perceptual lightness on a 0–100 scale. These spaces are fully resolvable to luminance without a renderer. + +**Special spaces** — ICCBased, Indexed, Pattern, Separation, and DeviceN — add indirection. An ICCBased space wraps an embedded ICC profile with a declared alternate space as fallback. 
Indexed maps a single integer through a byte table to entries in a base space. Pattern replaces numeric color with a pattern dictionary. Separation and DeviceN represent spot inks with alternate-space approximations. + +For text visibility detection, the relevant spaces are those that can appear as the current color when a text-showing operator executes. Pattern-filled text is extractable without special handling — character codes are decoded before any paint step. The color-relevant spaces for luminance estimation are: DeviceGray, DeviceRGB, DeviceCMYK, CalGray, CalRGB, Lab, ICCBased, Separation, and DeviceN. Halftone dictionaries, transfer functions, rendering intents, and color rendering dictionaries affect rasterized pixel output only and can be ignored entirely. + +## ICC Profile Streams and the ICCBased Color Space + +An ICCBased color space is declared as `[/ICCBased stream]` where the stream dictionary carries `/N` (number of components: 1, 3, or 4) and an `/Alternate` fallback space. The profile encodes a full colorimetric transform including a rendering intent — perceptual, relative colorimetric, saturation, or absolute colorimetric — that controls out-of-gamut mapping during device rendering. + +For pdftract, rendering intent is irrelevant: it controls a rendering step that pdftract does not perform. What matters is recognizing the component count and the alternate space, then using the alternate for luminance estimation. If the alternate is DeviceRGB, treat the paint values as DeviceRGB. If the alternate is DeviceCMYK, apply the CMYK approximation. If no alternate is declared, infer from N (1 = gray, 3 = RGB, 4 = CMYK). Full ICC evaluation — loading the profile, building a transform chain — is not required for the luminance classification task. + +## CMYK to Luminance Conversion + +DeviceCMYK is common in professionally typeset documents. A full ICC transform from CMYK to Lab requires the printer profile, which is unavailable at extraction time. 
The practical approximation used by pdftract is: + +``` +L ≈ 1.0 - (C + M + Y + K * 0.25) clamped to [0, 1] +``` + +The K weighting at 0.25 rather than 1.0 prevents all-K compositions (rich black with no CMY contribution) from being miscalculated when CMY channels are contributing lightness. This formula is not colorimetrically accurate, but it reliably identifies high-luminance CMYK values that would be nearly invisible on a white page versus low-luminance values with strong contrast. The use case is binary classification — visible versus color-hidden — not perceptual color reproduction, and the formula's accuracy for that binary task is sufficient. + +## Lab Color Space: Direct Luminance Extraction + +The CIE L\*a\*b\* space is the one case where luminance requires no conversion formula. L\* is perceptual lightness normalized to 0–100: 0 is perceptual black, 100 is the reference white. When text is painted in a Lab color space, the luminance estimate is directly `L* / 100`. Text with L\* above approximately 87 on a white background produces a contrast ratio below 1.5 and is classified as color-hidden. This makes Lab the lowest-cost space for pdftract's luminance pipeline — no transform required. + +## Separation Color Spaces and Spot Color Luminance + +Separation color spaces represent a single named colorant — a spot ink such as a Pantone value — defined by an alternate color space and a tinting function. The tinting function takes a single tint input (0 to 1) and returns a value in the alternate space. + +For luminance estimation, pdftract evaluates the tinting function at the tint value specified in the paint operator and converts the result through the alternate-space formula. The most common alternate is DeviceCMYK; DeviceRGB alternates also occur. 
When the tinting function is a PostScript type 4 calculator function that cannot be evaluated at extraction time, pdftract extracts the span at full confidence and annotates it with `fill_type: spot_color_tint_unevaluated` rather than suppressing it. Dropping text because the spot alternate could not be resolved would produce false negatives. + +## DeviceN: Multi-Ink Luminance Approximation + +DeviceN generalizes Separation to N named colorants. Its tinting function takes N inputs (one per colorant) and returns a value in the alternate space. N can be as small as 1 (equivalent to Separation) or larger for multi-ink systems. + +For luminance estimation, pdftract reads the N component values from the paint operator, evaluates the tinting function, and converts the alternate-space result to luminance. The same evaluation-failure fallback applies: if the function cannot be evaluated, extract with reduced confidence and annotate rather than suppress. + +## Overprint Mode + +Overprint is controlled by three graphics state parameters: `op` (overprint for non-stroking operations), `OP` (overprint for stroking), and `OPM` (overprint mode 0 or 1). Overprint mode 1, meaningful only in CMYK and DeviceN, suppresses painting for colorant channels whose value is zero so that underlying ink shows through. + +For text extraction, overprint has no effect on character decoding. The practical implication for luminance estimation is narrow: a CMYK value of all zeros under OPM=1 would show underlying content in a renderer, but pdftract sees the tuple and computes luminance of 1.0 (white), correctly classifying the span as a potential color-hidden case. The overprint state is recorded in span metadata but does not alter extraction behavior. + +## Halftone and Transfer Functions + +Halftone screens and transfer functions are legacy print-production features. 
Halftone dictionaries specify screen angle, frequency, and spot function for each colorant, controlling how continuous-tone values convert to dot patterns on a physical print device. Transfer functions apply per-channel tone curves to compensate for press gain. + +Neither affects digital display or text extraction. Character codes, positions, and color values are read from the content stream before any halftone or transfer processing. pdftract ignores both constructs entirely. + +## Color in Type 3 Fonts + +Type 3 fonts define glyphs through arbitrary PDF content streams. A glyph procedure can include color operators, painting sub-paths in different colors or using pattern fills within a single glyph. This makes Type 3 the one case where a single character may render in multiple colors. + +Character codes and Unicode mappings are recovered from Type 3 fonts through the font's Encoding and ToUnicode entries exactly as with other font types. The colored rendering inside a glyph procedure does not affect character identity. However, luminance estimation cannot rely on the graphics state color when the glyph is a colored Type 3 (signaled by the `d0` operator at the glyph procedure start rather than `d1`). For these glyphs, pdftract records `fill_type: type3_colored` and does not apply the contrast check — executing the glyph procedure to sample its internal colors is beyond extraction scope. The character is extracted at full confidence. + +## Extraction Implications: Implementing the color_hidden Flag + +pdftract's luminance estimation resolves the current color space through a chain of fallbacks: ICCBased → alternate space → formula; Indexed → base space → formula; Separation → alternate via tinting function → formula; DeviceN → alternate via tinting function → formula; CalGray/CalRGB/Lab → direct or near-direct formula; DeviceGray/DeviceRGB/DeviceCMYK → direct formula. 
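The fallback chain above can be sketched for the direct-formula spaces and for ICCBased as follows. This is a minimal illustration, not pdftract's implementation: the function names are hypothetical, and the DeviceRGB weights (Rec. 709 luma) are an assumption, since the text does not name its RGB formula. Only the CMYK approximation, the `L* / 100` rule, and the N-to-device inference come from this document.

```python
from typing import Optional, Sequence

# Hypothetical helper names; only the CMYK and Lab formulas and the
# N -> device-space inference are taken from the text above.
_N_TO_DEVICE = {1: "DeviceGray", 3: "DeviceRGB", 4: "DeviceCMYK"}


def _clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def direct_luminance(space: str, values: Sequence[float]) -> Optional[float]:
    """Return an approximate luminance in [0, 1], or None if no direct formula applies."""
    if space == "DeviceGray":
        return _clamp01(values[0])
    if space == "DeviceRGB":
        r, g, b = values
        # Assumption: Rec. 709 luma weights; the document does not specify them.
        return _clamp01(0.2126 * r + 0.7152 * g + 0.0722 * b)
    if space == "DeviceCMYK":
        c, m, y, k = values
        # The document's approximation: L = 1 - (C + M + Y + 0.25 * K), clamped.
        return _clamp01(1.0 - (c + m + y + k * 0.25))
    if space == "Lab":
        # L* is already perceptual lightness on a 0-100 scale.
        return _clamp01(values[0] / 100.0)
    return None


def iccbased_luminance(
    n: int, alternate: Optional[str], values: Sequence[float]
) -> Optional[float]:
    """ICCBased: use the declared alternate, else infer from N, then apply the formula."""
    space = alternate if alternate is not None else _N_TO_DEVICE.get(n)
    if space is None:
        return None  # caller annotates the span rather than suppressing it
    return direct_luminance(space, values)
```

Returning `None` instead of raising keeps the failure path explicit, so a caller can attach a failure annotation to the span. Under the CMYK formula as written, a pure-K value `(0, 0, 0, 1)` evaluates to 0.75, which is worth checking against real files before relying on the threshold.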
+ +When any step in the chain fails — missing alternate, unevaluable function, malformed profile — the fallback is to extract without a luminance estimate, annotate the span with the failure reason, and assign neutral confidence. The conservative direction is always to include rather than suppress. + +The `color_hidden` flag is set when the computed contrast ratio between the estimated text luminance and the background luminance at the span's position falls below 1.5. This threshold covers white-on-white, light-gray-on-white, and analogous near-invisible cases. The flag does not suppress the span — character data, position, and font metadata are included in full. The goal throughout is not colorimetric accuracy but reliable binary classification: clearly visible versus potentially invisible. The approximations used — CMYK simplified formula, ICCBased alternate substitution, CalRGB treated as DeviceRGB — are calibrated for that binary task, not for reproducing the visual appearance of the document. diff --git a/docs/research/content-stream-operators.md b/docs/research/content-stream-operators.md new file mode 100644 index 0000000..d08f2f3 --- /dev/null +++ b/docs/research/content-stream-operators.md @@ -0,0 +1,107 @@ +# PDF Content Stream Operator Reference for Text Extraction + +## Overview + +A PDF content stream is a sequence of operands followed by operators, processed left to right. Text extraction requires accurate parsing of this stream, including correct handling of operator arguments, encoding subtleties, and interactions with the graphics state. This document covers every operator class relevant to pdftract's content stream parser. + +--- + +## 1. Text Object Delimiters: BT and ET + +Text objects are bracketed by `BT` (begin text) and `ET` (end text). The text matrix (Tm) and text line matrix (Tlm) are initialized to the identity matrix at `BT` and discarded at `ET`. 
Text-positioning and text-showing operators are only valid inside a text object (text state operators such as `Tf` may also appear outside one); invoking them outside is an error that real-world PDFs nonetheless commit. pdftract must tolerate `BT`/`ET` mismatches, since unpaired or nested occurrences exist in producer output, and should maintain a nesting counter rather than a simple boolean flag. + +--- + +## 2. Font and Size: Tf + +`name size Tf` sets the current font to the resource named `name` (a PDF name object) and the text font size to `size` in unscaled text space units. The font resource must be looked up in the current resource dictionary's `Font` subdictionary. Failure to track the current font and size means character-to-Unicode mapping cannot be performed, because glyph encoding is font-specific. Every `Tf` invocation must trigger a font cache lookup or load. + +--- + +## 3. Text Positioning Operators: Tm, Td, TD, T* + +`a b c d e f Tm` sets both the text matrix and the text line matrix to the provided six-element matrix. This is an absolute positioning operation that replaces, not concatenates, the existing text matrix. + +`tx ty Td` moves to the start of the next line, offset by `(tx, ty)` in text space from the start of the current line (the text line matrix, not the current glyph position), and sets both the text matrix and the text line matrix to the result. `tx ty TD` is equivalent to `-ty TL` followed by `tx ty Td`, so it also updates the leading parameter `TL`. + +`T*` moves to the next line, equivalent to `0 -TL Td` where `TL` is the current text leading value. Tracking `TL` across `TD` and `TL` operator invocations is required for correct line break detection. + +--- + +## 4. String Show Operators: Tj, TJ, ', " + +`string Tj` paints the glyphs for the given string and advances the text position by the sum of the glyph widths plus character spacing and word spacing adjustments. + +`array TJ` accepts a PDF array whose elements are string objects and numeric objects, typically alternating. Each string element is rendered in sequence; each numeric element adjusts the horizontal text position by `-value / 1000` unscaled text space units, which is then multiplied by the font size and horizontal scaling, before rendering the next string.
The sign convention is critical: a positive number moves left (tightens spacing), and a negative number moves right (adds space). Treating the sign incorrectly reverses kern direction and corrupts word boundary detection. Multiple strings within a single `TJ` array are logically concatenated text: they must be joined without inserting spurious word separators unless a sufficiently negative numeric element indicates a word gap. + +`string '` is exactly equivalent to `T* string Tj`: it moves to the next line and then shows the string. Here `'` is an operator name, not a string delimiter; PDF strings are delimited only by parentheses or angle brackets. + +`aw ac string "` sets the word spacing to `aw` and the character spacing to `ac`, then moves to the next line and shows the string, equivalent to `aw Tw ac Tc T* string Tj`. The two numeric operands precede the string operand, so `"` takes three operands where `'` takes one. Treating `"` as a one-operand operator leaves two stray numbers on the operand stack and can corrupt operand handling for subsequent operators in the stream. + +--- + +## 5. String Encoding: Literal and Hex Strings + +Both `Tj` and `TJ` accept PDF string arguments in one of two encodings. + +Literal strings are enclosed in parentheses: `(Hello)`. Parentheses must be balanced or escaped with a backslash. A backslash followed by one to three octal digits introduces an octal escape. A backslash at a line break signals a continuation with no newline character in the string value. + +Hex strings are enclosed in angle brackets: `<48656C6C6F>`. Each pair of hex digits encodes one byte. An odd number of hex digits is completed with an implicit trailing zero. Hex strings may contain whitespace between digits for readability, which must be ignored during decoding. + +Both encodings can represent arbitrary byte values, including null bytes (0x00). Some parsers terminate string reading at a null byte.
pdftract must treat null bytes as valid string content and pass them through to the character mapping stage. Encodings such as UTF-16BE prefix their content with a BOM (0xFE 0xFF) and embed null bytes for ASCII characters; failing to read past nulls silently truncates text. + +--- + +## 6. Graphics State Operators Affecting Text: q, Q, cm, gs + +`q` pushes a copy of the entire graphics state, including all text state parameters (font, size, Tc, Tw, Tz, TL, Tr, Ts), onto the graphics state stack. `Q` pops and restores it. The text matrix and text line matrix belong to the text object rather than the graphics state, so the specification does not define them as saved by `q`. Form XObjects commonly bracket their content with `q`/`Q`, so every recursive call into an XObject stream must begin with an implicit save and end with an implicit restore; a `q` without a matching `Q` inside an XObject is contained by the recursive frame. + +`a b c d e f cm` concatenates the provided matrix with the current transformation matrix (CTM). Because text coordinates are transformed through the CTM into device space, changes to the CTM affect the computed position of rendered text even when no text positioning operator is invoked. pdftract must maintain the full current transformation matrix to compute accurate bounding boxes or reading order. + +`name gs` applies a named entry from the current resource dictionary's `ExtGState` subdictionary. For text, the relevant ExtGState entries are `Font` (an array holding a font reference and a size) and `TK` (text knockout). `CA` and `ca` set stroking and non-stroking opacity, and `TR`/`TR2` are transfer functions, not the text rendering mode. Character spacing, word spacing, leading, rise, and rendering mode have no ExtGState keys and change only through the `Tc`, `Tw`, `TL`, `Ts`, and `Tr` operators. pdftract must inspect the ExtGState dictionary and update its current font and size accordingly. + +--- + +## 7. Inline Images: BI, ID, EI + +An inline image is introduced by `BI`, followed by key-value pairs describing the image (width, height, color space, filter, etc.), then `ID` and a single whitespace character, followed immediately by the raw binary image data, then `EI`.
+ +Detecting `EI` is non-trivial because the raw image data may contain the byte sequence `EI` as part of its payload. The robust algorithm is: determine the byte length of the image data, read exactly that many bytes after `ID`, then expect `EI` as the next token. For unfiltered data, the length follows from the width, height, bits per component, and number of color components, with each row padded to a whole byte. For filtered data, run the filter's decoder over the bytes after `ID` until it reports end of data and count the bytes consumed. If the length is not determinable (e.g., the filter is unknown), fall back to scanning for a whitespace-preceded `EI` followed by whitespace or an operator name, keeping in mind that this heuristic can misfire. pdftract should prefer length-based detection wherever possible and treat inline images as opaque blobs; they contain no text operators. + +--- + +## 8. Marked Content: BDC, BMC, EMC, MP, DP + +Marked content operators are the structural hooks for tagged PDF. `tag properties BDC` begins a marked content sequence with an associated property list (an inline dictionary or a name resolved through the `Properties` resource subdictionary); `tag BMC` begins one without properties. `EMC` ends the innermost open marked content sequence. `tag MP` and `tag properties DP` are point operators that mark a location without delimiting a span. + +For text extraction, marked content enables mapping of content to logical structure (headings, paragraphs, table cells, artifacts). The `Artifact` tag marks content that should be excluded from extracted text (headers, footers, page numbers, decorative rules). The `ActualText` attribute in a property dictionary provides an explicit Unicode string to substitute for the rendered glyph sequence, handling ligatures, special characters, and layout artifacts. pdftract should track open marked content sequences and expose their tags and properties to the extraction layer. + +--- + +## 9. The Do Operator and XObjects + +`name Do` invokes the XObject named `name` from the current resource dictionary's `XObject` subdictionary. The named object is either a Form XObject or an Image XObject.
+ +A Form XObject is a PDF stream with its own content stream and its own resource dictionary. It must be processed recursively: the current graphics state is saved, the Form's matrix is concatenated with the CTM, the Form's resource dictionary becomes the active resource context, and the Form's content stream is parsed and executed as if it were inline. Text generated inside a Form XObject appears in device space at positions determined by the combined transformation. Failing to recurse into Form XObjects silently drops text. + +An Image XObject is a raster image. It contains no text operators and must be skipped. The distinction between Form and Image XObjects is the `Subtype` key in the XObject's dictionary: `/Form` versus `/Image`. + +--- + +## 10. Compatible Extensions: BX and EX + +`BX` and `EX` bracket a sequence of operators that may not be defined in the PDF specification version being parsed. A conforming reader that does not understand an operator inside a `BX`/`EX` pair must skip it without error. pdftract must track `BX`/`EX` nesting depth and, when inside a compatibility section, discard each unrecognized operator together with its operands while continuing to execute recognized operators normally, so that text shown inside the section is still extracted. Crashing or raising an error on unknown operators inside `BX`/`EX` violates the PDF specification and will fail on real-world files produced by non-standard tools. + +--- + +## 11. Tokenizer Edge Cases + +PDF defines six whitespace characters: space (0x20), horizontal tab (0x09), carriage return (0x0D), line feed (0x0A), form feed (0x0C), and null (0x00). Any combination of these may appear between operands and operators. The tokenizer must skip all whitespace between tokens and must not treat any whitespace character as significant except inside string literals and when locating the line ending that terminates a comment. + +Comments begin with `%` and extend to the end of the line (the next CR, LF, or CR+LF sequence). Comment content is ignored.
However, comments may appear anywhere between tokens — including between an operand and the operator it belongs to. The tokenizer must treat comments exactly as whitespace. + +PDF files commonly begin with a high-bit-byte comment such as `%âãÏÓ` or `%¥±ë` immediately after the `%PDF-1.x` header line. This comment signals to transfer protocols that the file is binary. The tokenizer must handle these high-byte characters without misinterpreting them as tokens; since they appear in a comment, they are discarded before any token scanning begins. + +Binary data at the start of compressed streams (after `stream\n`) may begin with bytes that coincidentally match operator names. pdftract must never parse stream data as operator tokens; the stream body is always accessed through its decoded filter output, not by scanning raw bytes inline in the content stream tokenizer. + +Operator names in PDF are composed of characters from a defined set; some operators use characters outside the alphanumeric range (`'`, `"`, `*`). The tokenizer must include these in its operator character set and must distinguish them from the delimiters `(`, `)`, `<`, `>`, `[`, `]`, `{`, `}`, `/`, `%` that begin other object types. diff --git a/docs/research/pdfx-prepress-extraction.md b/docs/research/pdfx-prepress-extraction.md new file mode 100644 index 0000000..afe5ac8 --- /dev/null +++ b/docs/research/pdfx-prepress-extraction.md @@ -0,0 +1,99 @@ +# PDF/X Prepress and Print Production PDF Extraction + +## Overview + +PDF/X is a family of ISO standards designed for reliable, blind exchange of print-ready content between creators and print service providers. Unlike general-purpose PDFs, PDF/X documents carry strict conformance requirements that constrain what is allowed inside the file—mandating font embedding, controlling color spaces, restricting transparency, and specifying page geometry boxes. 
These constraints, while imposing on document creators, are a benefit for extraction: a conformant PDF/X file makes strong guarantees that simplify several of pdftract's jobs. Understanding where those guarantees apply, and where prepress production artifacts complicate extraction, is essential to handling magazine spreads, packaging artwork, and newspaper print PDFs correctly. + +## The PDF/X Conformance Family + +The PDF/X lineage spans five major standards, each building on or refining the previous. + +**PDF/X-1a** (ISO 15930-1, later revised as 15930-4) is the strictest and most widely deployed standard for commercial print. It prohibits all device-independent color spaces in favor of DeviceCMYK, DeviceGray, Separation (for spot colors), and DeviceN. RGB images and ICC-tagged objects are forbidden outright. PDF/X-1a also disallows transparency, requiring fully flattened artwork, and mandates that all fonts be embedded without subsetting restrictions. Because color space choices are locked to CMYK and spot, extraction does not need to resolve color management to identify text color—text is either black, a gray level, a CMYK mix, or a named separation color. + +**PDF/X-3** (ISO 15930-3) relaxes the color restriction. ICC-tagged RGB color spaces are permitted alongside CMYK, provided the document includes an OutputIntent that names the intended output device profile. Transparency remains prohibited. PDF/X-3 is common in European print workflows where ICC-managed RGB photography is delivered to print alongside CMYK editorial content. Extraction must be prepared to encounter RGB text objects alongside CMYK content in the same file. + +**PDF/X-4** (ISO 15930-7) is the current preferred standard for high-quality commercial printing. It permits live transparency—meaning objects with non-opaque alpha can appear in the content stream without prior flattening. PDF/X-4 also allows optional content groups (layers) to be present. 
For extraction, PDF/X-4 files may contain text objects inside transparency groups, requiring that pdftract track the graphics state alpha stack (detailed in the graphics-state-tracking research) rather than assuming all text is fully opaque. + +**PDF/X-4p** extends PDF/X-4 by allowing the ICC output profile to be stored externally rather than embedded in the file. The `DestOutputProfileRef` dictionary entry points to a named external profile. This has no practical effect on text extraction. + +**PDF/X-5** covers partial exchange—used when referencing external graphical content (external ICC profiles, external artwork). PDF/X-5 documents may legitimately omit embedded content that would be resolved at output. For pdftract, this means some image content may be absent (represented by OPI proxies), but text content should still be fully present per the standard. + +The conformance level can be read from the `GTS_PDFXVersion` key in the document's XMP metadata (`pdfxid:GTS_PDFXVersion`) or from the `Info` dictionary's `GTS_PDFXVersion` entry. pdftract should capture this value and tag extraction output accordingly. Knowing the conformance level tells pdftract which assumptions are safe: on PDF/X-1a, transparency never needs to be resolved; on PDF/X-4, it does. + +## OutputIntent and Device Classification + +Every PDF/X file must contain an `/OutputIntents` array with at least one entry describing the intended output device. The entry is a dictionary with `/DestOutputProfile` (an embedded ICC profile stream), `/OutputConditionIdentifier` (a string like `FOGRA39` or `U.S. Web Coated (SWOP) v2`), and `/Info`. + +pdftract does not need to parse the ICC profile binary to benefit from this structure. The presence of a valid OutputIntent is sufficient to classify the document as a print-production PDF. The `OutputConditionIdentifier` string can be surfaced in metadata output as a hint about the intended press standard. 
If pdftract is producing structured output for downstream consumers (such as content management ingest pipelines for magazine archives), tagging the document class as `print_production` with the output condition identifier gives consumers useful provenance without requiring pdftract to interpret colorimetry. + +The `/OutputIntents` array key is also used by PDF/A. pdftract should check each entry's `/S` subtype, `/GTS_PDFX` versus `/GTS_PDFA1`, to distinguish the two standards families. + +## Page Geometry Boxes and the Bleed Zone Problem + +PDF/X mandates specific usage of the page geometry boxes defined in the PDF specification. Understanding these boxes is critical for separating body content from print production artifacts. + +The **TrimBox** defines the final finished page dimensions: the boundary where the physical paper will be cut after printing. All editorial content (article text, photographs, page numbers) intended to be read by the end user lives within the TrimBox. The **BleedBox** extends beyond the TrimBox by the bleed amount, typically 3 mm on each side. Content in the bleed zone is intentionally printed beyond the trim edge to prevent white slivers appearing if the cut is slightly off-register. The **ArtBox**, when present, describes the meaningful content area as defined by the document creator, which may be inset from the TrimBox. + +For text extraction, the TrimBox is the authoritative boundary for body content. Text objects whose bounding rectangles fall entirely outside the TrimBox should be labeled `bleed_content` in pdftract's zone classification. Text objects that straddle the TrimBox boundary (partially inside, partially outside) are the most ambiguous case. These are typically glyphs at the edge of a page-bleed background element or production labels placed in the bleed zone. pdftract should classify straddling glyphs based on whether their typographic origin (the baseline point) falls within the TrimBox.
A glyph whose origin is inside the TrimBox but whose descenders extend into the bleed zone is body content; a glyph whose origin is in the bleed zone is bleed content. + +The `/CropBox`, if present, typically matches or is larger than the TrimBox. Some PDF/X workflows set the CropBox to the BleedBox size to show the full bleed when viewed on screen. pdftract must use TrimBox as the primary boundary for content classification and not assume CropBox represents finished dimensions. + +## Spot Colors and Separation Spaces + +PDF/X-1a files are saturated with Separation and DeviceN color spaces carrying Pantone names, brand color names, or custom identifiers. A text object might be specified in `[/Separation /PANTONE-485-C /DeviceCMYK <...alternate tint function...>]`. The tint function provides a CMYK fallback for screen preview and low-fidelity output, but the canonical color is the named separation. + +For extraction, the separation name itself is semantically meaningful in packaging and magazine workflows. pdftract should record the spot color name for text objects rendered in Separation spaces, rather than resolving to the CMYK alternate. This allows downstream systems to identify, for instance, that a logo mark or legal notice is printed in a specific Pantone ink—information that might affect content priority or processing rules in a brand asset workflow. + +DeviceN spaces, which combine multiple colorants into a single space, may name several spot inks simultaneously. A DeviceN array like `[/PANTONE-485-C /PANTONE-Cool-Gray-9-C]` identifies a duotone or multi-ink object. pdftract records all named colorants from the DeviceN components array when annotating text color. + +## OPI Proxy Images + +OPI (Open Prepress Interface) is a workflow mechanism where high-resolution images are replaced by low-resolution proxies during creative layout. 
The proxy carries OPI comments, either 1.3 (%%BeginOPI...%%EndOPI PostScript-style comments wrapped in a marked content sequence) or 2.0 (an `/OPI` dictionary in the image XObject's dictionary or in a surrounding content stream marked content section), pointing to the path of the full-resolution original. + +For text extraction, OPI images are irrelevant but noteworthy. pdftract does not attempt to retrieve or process the high-resolution original. However, if pdftract is producing a document structure report alongside the text extraction, the presence of OPI comments should be flagged. OPI images appear in prepress-stage PDFs that have not yet been through final output processing; a document containing OPI 2.0 dictionaries may also contain other pre-output artifacts that affect fidelity. + +The OPI dictionary is found at the `/OPI` key within an image XObject's dictionary, containing subkeys `/1.3` or `/2.0` with the original file path. pdftract's image enumeration pass should check for this key and emit an `opi_proxy_detected` diagnostic when present. + +## Font Embedding and Extraction Confidence + +PDF/X mandates that all fonts, without exception, be embedded. Subsetting is permitted, but the subsetting restriction flag that would prevent copying or editing must not be set. This is a meaningful signal for extraction confidence. A PDF/X-conformant file with valid fonts should yield clean ToUnicode CMap coverage for all glyphs present, which means character-to-Unicode mapping should succeed without heuristic fallback. + +pdftract's confidence scoring for individual text spans can be elevated when the enclosing document carries a valid `GTS_PDFXVersion` identifier and all fonts encountered are embedded (a `FontDescriptor` is present and carries `FontFile`, `FontFile2`, or `FontFile3`). The `FontDescriptor` `Flags` entry is not part of this test: its bits describe glyph characteristics (bit 2, for example, marks a serif face), not embedding. This is distinct from PDF/A, where embedding is also mandated but ToUnicode presence is more strictly required for Level A conformance.
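The embedding test described above can be sketched as follows. This is a minimal illustration under stated assumptions: font descriptors are represented as plain mappings whose keys are PDF names without the leading slash, the function names are hypothetical, and the check looks only at the font-file keys. It does not validate the `GTS_PDFXVersion` value beyond requiring that one was found.

```python
from typing import Iterable, Mapping, Optional

# Keys under which a FontDescriptor can carry an embedded font program.
EMBEDDED_FONT_KEYS = ("FontFile", "FontFile2", "FontFile3")


def is_embedded(font_descriptor: Optional[Mapping[str, object]]) -> bool:
    """True when a FontDescriptor is present and carries an embedded font program."""
    if not font_descriptor:
        return False
    return any(key in font_descriptor for key in EMBEDDED_FONT_KEYS)


def may_elevate_confidence(
    pdfx_version: Optional[str],
    descriptors: Iterable[Optional[Mapping[str, object]]],
) -> bool:
    """Hypothetical gate: a PDF/X identifier was found and every font seen is embedded."""
    seen = list(descriptors)
    # An empty font list gives no evidence either way, so do not elevate.
    return bool(pdfx_version) and len(seen) > 0 and all(is_embedded(d) for d in seen)
```

A caller would pass one descriptor per font encountered on the page, using `None` for fonts that have no descriptor at all, so a single unembedded font keeps the confidence at its default level.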
+ +When a PDF/X file is encountered with a missing or incomplete font embedding (a non-conformance), pdftract should treat it identically to any other font-missing case (triggering the fallback glyph recognition pipeline) but emit a conformance violation diagnostic. The document's claim of PDF/X conformance does not guarantee it. + +## Overprint Settings + +Overprint is fundamental to CMYK prepress. When an object overprints, its ink is added on top of the ink already on the substrate rather than knocking out the background. This is controlled by the `/OP` (overprint for stroking), `/op` (overprint for non-stroking operations such as filling), and `/OPM` (overprint mode, 0 or 1) entries in ExtGState dictionaries applied via the `gs` operator. + +For text in PDF/X prepress documents, overprint is frequently set for black text to ensure it prints correctly on top of CMYK imagery (100% K overprinting is standard practice). This does not affect which Unicode characters are present, but it does affect visual rendering: overprinting black on a colored background produces a richer black over the underlying inks rather than a white knockout. + +pdftract records overprint settings from the current graphics state when processing text runs but does not simulate the visual overprint result. This is correct behavior: extraction concerns itself with textual content, not ink simulation. The recorded OPM and overprint flags can be surfaced in structured output if a downstream consumer needs them. + +## Trapping Annotations + +PDF/X-3 and later versions permit TrapNet annotations, that is, annotations of subtype `/TrapNet` that encode trap geometry for the press. These are rectangles or paths describing where ink spread has been applied at color boundaries to prevent misregistration gaps. They appear in the `/Annots` array of a page dictionary. + +TrapNet annotations contain no text content and are irrelevant to extraction. pdftract should enumerate and skip them without raising warnings.
The presence of TrapNet annotations is a useful provenance signal (the document was processed by a trapping engine before delivery) and can be noted in the document structure report. + +## Print Production Artifacts in the Bleed Zone + +Commercial print PDFs—particularly magazine advertising pages and packaging—routinely include production marks placed outside the TrimBox in the bleed and slug zones: crop marks (hairlines showing where the sheet is to be cut), registration marks (bull's-eye targets for aligning color separations), color bars (rows of color swatches for press calibration), and cut guides. These marks frequently include text: the name of the print service provider, a job number, a timestamp, Pantone color names next to the color bar swatches, and instructional text such as "CUT HERE" or "DO NOT PRINT." + +All of this text is production metadata, not editorial content. Because these elements are positioned outside the TrimBox, pdftract's zone classifier will label them `bleed_content` by the spatial rule described above. When pdftract is configured for clean body text output (the default extraction mode), bleed_content zones are excluded from the primary text stream. When pdftract is in full-page or audit mode, bleed_content is included in a separate output section labeled accordingly. + +## Magazine and Newspaper PDFs + +Magazine and newspaper PDFs are the dominant real-world use case for PDF/X-1a and PDF/X-4. They exhibit several structural patterns that extraction must accommodate. + +Multi-column editorial layouts are nearly universal. Text runs in narrow columns with precisely controlled gutter spacing. Adjacent columns may contain independent articles, requiring that pdftract's reading-order heuristics identify column boundaries before linearizing text. The column detection approach documented in the complex-layout-reading-order research applies directly here. 
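The TrimBox-based zone rule that this document relies on (a glyph is classified by where its origin falls, not by where its outline extends) can be sketched as below. The glyph record layout, the `body_content` label, and the function names are assumptions made for illustration; only `bleed_content` is a label used in the text.

```python
def classify_glyph_zone(origin, trim_box):
    """Classify a glyph by its origin relative to the TrimBox.

    `trim_box` is [x0, y0, x1, y1] in default user space; the corners are
    normalized because PDF rectangles may list them in any order.
    """
    x, y = origin
    x0, y0, x1, y1 = trim_box
    left, right = min(x0, x1), max(x0, x1)
    bottom, top = min(y0, y1), max(y0, y1)
    inside = left <= x <= right and bottom <= y <= top
    return "body_content" if inside else "bleed_content"


def primary_text_stream(glyphs, trim_box):
    """Default extraction mode: keep only glyphs whose origin is in the TrimBox."""
    return [g for g in glyphs
            if classify_glyph_zone(g["origin"], trim_box) == "body_content"]
```

A descender that dips below the trim edge does not change the result, because only the origin is tested; a full-page or audit mode would keep the rejected glyphs in a separate section instead of discarding them.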
+ +Article threading links article text across non-contiguous pages using the PDF `/Threads` array of `/Thread` dictionaries, each containing a chain of `/Bead` dictionaries. Each bead references a page and a rectangle. Threading is the PDF mechanism for encoding "article continues on page 47"—the beads mark the reading order of article fragments across the publication. pdftract should traverse the thread beads to reconstruct article continuity, appending thread-linked text boxes in bead order rather than page order, and flagging the result as a threaded article in structured output. + +Pull quotes—enlarged excerpts from the article body, set in display type and overlaid on the column layout—are a common source of duplicate text. The same sentence appears once in the body text run and once as a pull quote in a larger font size at a different position. pdftract's post-extraction deduplication pass should identify these by comparing text proximity and string similarity, preferring the body text instance and tagging the pull quote as `display_duplicate`. + +Headlines, decks (subheadlines), bylines, and section labels are all in the trim box but at different font sizes and positions. pdftract's zone classifier should distinguish these by font size and position relative to column boundaries, surfacing them as `headline`, `byline`, and `section_label` zones rather than body text. + +## Summary + +PDF/X print production files offer pdftract several reliable signals: font embedding is guaranteed (raising extraction confidence), the conformance level constrains color spaces and transparency behavior, and OutputIntent provides device classification without profile parsing. The critical extraction challenge is spatial—separating editorial body text inside the TrimBox from the rich ecosystem of bleed content, production marks, and prepress artifacts that surround it. 
By anchoring zone classification on the TrimBox boundary, recording spot color names rather than resolving them, skipping OPI and TrapNet entries, and threading article beads for multi-page continuity, pdftract can deliver clean body text from even the most complex commercial print PDF. diff --git a/docs/research/text-positioning-and-font-metrics.md b/docs/research/text-positioning-and-font-metrics.md new file mode 100644 index 0000000..3fbaf82 --- /dev/null +++ b/docs/research/text-positioning-and-font-metrics.md @@ -0,0 +1,119 @@ +# Text Positioning, Font Metrics, and Spacing Precision in PDF Extraction + +## PDF Text State Parameters + +PDF text rendering is driven by a text state that accumulates across operators within a BT/ET block. pdftract must maintain all seven text state parameters in its content stream interpreter, because each one contributes to the final glyph position and advance. + +**Tc (character spacing)** is a scalar in unscaled text space units added to the horizontal advance of every glyph after it is placed. It is additive with the advance width from the font. Because Tc accumulates on every character, even a Tc of 0.5 units visibly spreads a long word. pdftract must apply Tc after each glyph's advance, including glyphs inside a TJ array, before applying any TJ kerning number that follows. + +**Tw (word spacing)** is a scalar applied in addition to Tc, but only when the character code is the single-byte code 32 (0x20). This holds for simple fonts and for composite fonts whose CMap defines 32 as a single-byte code. Tw is not applied to multi-byte codes, even when one of their bytes is 0x20 or the code maps to U+0020 through the ToUnicode CMap, so pdftract must test the character code itself rather than the decoded Unicode value. When Tw is nonzero, a qualifying space glyph effectively has advance = glyph_advance + Tc + Tw. + +**Tz (horizontal scaling)** is a percentage value (100 = normal). It scales all horizontal distances (including advance widths and Tc and Tw) in text space. The scaling factor is Tz / 100. It does not affect vertical positioning or ascent/descent.
pdftract must multiply every horizontal displacement by this factor when computing the glyph's contribution to the text matrix. + +**TL (text leading)** is the vertical distance between baselines used by the line-advancing operators T*, ' and "; the TD operator sets it as a side effect. It is stored but does not affect individual glyph placement; it determines the vertical offset applied by T*. + +**Tf (font and size)** selects the current font resource and sets the text font size. The font size is a scale factor applied to glyph coordinates expressed in font units. pdftract must look up the named font in the current resource dictionary, decode its subtype (Type1, TrueType, CIDFont via Type0, Type3), and load its metrics accordingly. + +**Tr (rendering mode)** affects whether the glyph is filled, stroked, clipped, or invisible. Tr = 3 paints nothing (neither fill nor stroke), and modes 4 through 7 additionally add the glyph outlines to the clipping path. For text extraction purposes only Tr = 3 is functionally significant: it is widely used for the hidden text layer that OCR tools place over scanned page images, so pdftract should tag Tr = 3 glyphs as invisible and suppress them only when the caller asks for visible text, rather than dropping them unconditionally. + +**Ts (text rise)** shifts the baseline vertically, positive values moving the glyph upward, used for superscripts and subscripts. The rise is in unscaled text space and must be added to the y-component of the text position before transforming to user space. pdftract must include Ts in the baseline coordinate used for line grouping, since a superscript glyph on the same nominal line as its base text will have a different baseline y value and should not be merged into the same text span. + +## The Text Matrix Tm and Line Matrix Tlm + +At the start of a BT block, both the text matrix Tm and the line matrix Tlm are initialized to the identity matrix. These are 3×3 homogeneous matrices maintained in parallel. + +The **Tm operator** sets both Tm and Tlm to the supplied matrix. It completely replaces the current position; it does not accumulate.
+ +The **Td operator** (lowercase) moves the text position by (tx, ty) relative to the start of the current line: Tlm = [[1,0,0],[0,1,0],[tx,ty,1]] × Tlm, and Tm is set to the new Tlm. The **TD operator** (uppercase) is identical but also sets TL = −ty. + +The **T* operator** is equivalent to Td(0, −TL). + +After each glyph is rendered, the text position advances. The advance in text space is: + +``` +tx = (glyph_advance_in_font_units / 1000) * font_size * (Tz / 100) + Tc * (Tz / 100) +``` + +plus Tw * (Tz/100) if the character code is the single-byte space (32). This advance is applied to Tm by multiplying a translation matrix on the left, Tm = [[1,0,0],[0,1,0],[tx,0,1]] × Tm, in the same row-vector convention used for Td. Tlm is not modified by glyph advance; it retains the position of the start of the current line until a line-movement operator is encountered. + +To convert the text position to page coordinates, pdftract multiplies the current text position vector by Tm and then by the current transformation matrix (CTM) accumulated from `cm` operators on the graphics state stack. Because pdftract has no output device, its CTM starts at identity, so the result is in default user space (points) rather than device pixels. + +## Font Units vs. User Space Units + +Glyph metrics in PDF fonts are expressed in font units. For Type 1 and most TrueType fonts the coordinate system is defined so that 1 em = 1000 font units. Type 3 fonts define their own font matrix via the /FontMatrix entry; the standard value is [0.001, 0, 0, 0.001, 0, 0], mapping the 1000-unit space into a 1-unit space consistent with Type 1. + +The conversion from font units to text space is: + +``` +text_space_units = font_units * font_size / 1000 +``` + +For Type 3 fonts, pdftract must apply the /FontMatrix to glyph coordinates before applying the font size, because the font matrix may not be the standard 1/1000 scaling. + +The resulting text-space coordinate is then transformed to the current user space by Tm, and the CTM maps that to default user space (page coordinates) and, in a renderer, onward to device space.
For bounding box extraction in a device-independent form, pdftract should output coordinates in default user space (points, where 1 pt = 1/72 inch): apply Tm and the part of the CTM accumulated from `cm` operators in the content stream, but no device transform, unless the caller requests device-pixel coordinates. + +## Width Arrays and WX Entries + +Width information for each glyph is critical for accurate advance computation. pdftract must load widths from the font dictionary rather than relying on glyph outlines, which may not be embedded. + +For **simple fonts** (Type1, TrueType, Type3), the /Widths array contains the advance widths for glyph codes from /FirstChar to /LastChar, in font units. The /MissingWidth entry in the font descriptor provides the default for codes not covered by /Widths. pdftract must gracefully handle the case where /Widths is absent, as older files permit for the standard 14 fonts whose metrics are built in, by falling back to built-in metrics where available and to /MissingWidth otherwise. + +For **CIDFonts** (used as descendants of Type0 fonts), the /W array uses a compact, sparse encoding. It alternates between two forms: a range form `c_first [w1 w2 ... wn]` mapping consecutive CIDs starting at c_first, or a run form `c_first c_last w` assigning the same width w to all CIDs in the range. The /DW entry gives the default width for CIDs not covered by /W. pdftract must parse both forms and build a lookup table for O(1) advance retrieval per glyph. + +When the TJ operator provides kerning numbers, those numbers are expressed in thousandths of a unit of text space and are subtracted from the current x position, so a positive number moves the next glyph left and a negative number moves it right. Like glyph widths from /W or /Widths (thousandths of an em), they are scaled by the font size: a TJ value of −100 moves the next glyph 0.1 × font_size text space units to the right, while a glyph width of 1000 in /W advances a full em (1.0 × font_size). + +## Horizontal vs. Vertical Writing Modes + +The /WMode entry in the CMap used as the Type0 font's /Encoding determines writing direction: 0 for horizontal (default), 1 for vertical.
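Before the per-mode advance rules, the two /W entry forms described in the width section can be made concrete with a small parser. This is a minimal sketch that assumes the /W array has already been decoded into a Python list of integers and nested lists; the function names are illustrative, and a malformed array simply raises an exception here.

```python
def parse_cid_widths(w_array):
    """Expand a CIDFont /W array into a {cid: width} lookup table.

    Handles both entry forms:
      c_first [w1 w2 ... wn]  -> consecutive CIDs starting at c_first
      c_first c_last w        -> the same width for every CID in the range
    """
    widths = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        second = w_array[i + 1]
        if isinstance(second, list):
            # Range form: one width per consecutive CID.
            for offset, width in enumerate(second):
                widths[first + offset] = width
            i += 2
        else:
            # Run form: a single width shared by c_first..c_last inclusive.
            width = w_array[i + 2]
            for cid in range(first, second + 1):
                widths[cid] = width
            i += 3
    return widths


def cid_width(widths, cid, default_width=1000):
    """Look up a CID's width, falling back to /DW (1000 if /DW is absent)."""
    return widths.get(cid, default_width)
```

For very long runs a production parser might store the run form as intervals instead of expanding it, trading the O(1) dictionary lookup for lower memory use.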
+ +In **horizontal mode**, the advance vector after each glyph is along the positive x-axis in text space. The text matrix is updated by tx as described above; ty is zero for normal glyphs. + +In **vertical mode**, the advance vector is along the negative y-axis. pdftract must use the /W2 array (with /DW2 supplying defaults), which provides for each CID a vertical displacement w1y and a position vector (vx, vy) that locates the vertical-writing origin relative to the horizontal one. The position vector shifts the glyph's origin, typically to the top-center of the glyph. Writing v for the magnitude of w1y (which is normally negative), the effective advance after each glyph is: + +``` +ty = -(v / 1000) * font_size +``` + +with the position vector (vx, vy) applied as a per-glyph placement offset that is not accumulated into the advance. Vertical text extraction requires grouping glyphs by their x-coordinate (the column baseline) rather than the y-coordinate. + +## Kerning in TJ Arrays and Word Boundary Reconstruction + +The TJ operator accepts an array whose elements are either byte strings (rendered as glyphs) or numbers (kerning adjustments). In horizontal mode a positive number shifts the text position leftward, effectively closing space between glyphs; a negative number opens space. The displacement in text space units is: + +``` +displacement = -(kerning_number / 1000) * font_size * (Tz / 100) +``` + +When reconstructing word boundaries, pdftract must decide whether a TJ number represents intentional word spacing or typographic kerning. A reliable heuristic: if the displacement is rightward (positive) and exceeds a threshold (typically 0.2 to 0.3 times the space character's advance width at the current font size), treat it as a word gap and inject a synthetic space. Below this threshold, or for leftward displacements, treat it as kerning and accumulate it into the preceding glyph's bounding box. + +## Character Spacing and Word Spacing Interaction + +Tc and Tw interact with TJ kerning in a specific order.
For each glyph emitted by a Tj or TJ string segment: first apply the glyph's advance from /W or /Widths, scaled to text space; then add Tc scaled by Tz/100; then add Tw scaled by Tz/100 if the character code is the single-byte space (32). TJ kerning numbers are applied between string segments, not between individual glyphs within a segment. + +This means that if a word is encoded as a single string inside a TJ array, Tc accumulates across every character in that string. pdftract must not apply Tc only once per string; it applies once per glyph code emitted. + +## Sub-pixel Precision + +PDF coordinates are floating-point. A character at x = 72.35 pt and another at x = 72.36 pt are distinct positions that affect layout analysis. pdftract must preserve at least two decimal places in all bounding box coordinates. Rounding glyph positions to integers introduces errors that misalign characters on the same baseline by enough to break line-grouping logic, particularly in justified text where inter-word spacing is adjusted to sub-point precision. + +All intermediate computations (Tm multiplication, advance accumulation, Tc/Tw addition) must be carried out in 64-bit floating point. Output bounding boxes should be serialized at two decimal places minimum. + +## Bounding Box Computation + +A glyph's bounding box in user space is computed as follows. The lower-left corner x-coordinate is the current text position's x after transforming through Tm and CTM. The width of the bounding box, in text space units, is: + +``` +bb_width = ((glyph_advance - kerning_applied_after) * font_size / 1000 + Tc + Tw_if_space) * (Tz / 100) +``` + +where glyph_advance is in font units and kerning_applied_after is the TJ number that follows this glyph in the array, if any. Tc and Tw are already in text space units, so they are not scaled by font_size / 1000. For the vertical extent, pdftract uses the font's /Ascent and /Descent values from the font descriptor, scaled by font_size / 1000. These are the typographic ascent and descent in font units and define the bounding box height from baseline − |Descent| to baseline + Ascent, adjusted by Ts.
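The per-glyph advance order and the TJ word-gap heuristic described above can be sketched together. This is an illustrative model only: it assumes a simple font with single-byte codes that coincide with Latin-1 code points, a `widths` dictionary in font units, horizontal writing mode, and a hypothetical `gap_ratio` of 0.25; none of these names come from pdftract.

```python
def glyph_advance(width, font_size, tc, tw, tz, is_space):
    """Advance in text space: ((w / 1000) * Tfs + Tc + Tw) * (Tz / 100)."""
    tx = width / 1000.0 * font_size + tc
    if is_space:
        tx += tw  # Tw applies only to the single-byte code 32
    return tx * (tz / 100.0)


def layout_tj(tj_array, widths, font_size, tc=0.0, tw=0.0, tz=100.0,
              gap_ratio=0.25, default_width=500):
    """Walk a TJ array, returning (text_with_synthetic_spaces, final_x)."""
    x = 0.0
    out = []
    # Reference advance of the space glyph, used for the word-gap threshold.
    space_advance = widths.get(0x20, default_width) / 1000.0 * font_size * (tz / 100.0)
    for element in tj_array:
        if isinstance(element, (int, float)):
            # TJ numbers are subtracted: negative values open space.
            displacement = -(element / 1000.0) * font_size * (tz / 100.0)
            x += displacement
            if displacement > gap_ratio * space_advance:
                out.append(" ")  # rightward gap wide enough to be a word break
        else:
            for code in element:  # iterating bytes yields integer codes
                out.append(chr(code))
                x += glyph_advance(widths.get(code, default_width),
                                   font_size, tc, tw, tz, code == 0x20)
    return "".join(out), x
```

For example, `layout_tj([b"A", -300, b"B"], {0x41: 600, 0x42: 600, 0x20: 250}, 10.0)` treats the −300 adjustment as a word gap, whereas a value of 50 would be handled as ordinary kerning and pull the second glyph slightly left.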
+ +For a tight span bounding box spanning multiple glyphs on the same baseline, x_min is the text position at the first glyph and x_max is the text position after the last glyph's advance (including Tc, excluding post-span kerning). The vertical extent uses the maximum Ascent and minimum Descent across all fonts used in the span. + +## Baseline Grid and Line Detection + +After transforming through Tm and CTM, each glyph has a baseline y-coordinate in user space. Glyphs with identical or nearly identical baseline y values belong to the same line. Because floating-point arithmetic in PDF generators is imprecise, pdftract must use a tolerance for baseline comparison — a reasonable value is 0.5 pt. + +The grouping algorithm should sort glyphs by their baseline y-coordinate and cluster them into lines using single-linkage: two glyphs join the same line if their baseline y values differ by less than the tolerance. Within a line, glyphs are sorted by their x-coordinate (or y-coordinate in vertical mode). After clustering, pdftract checks for text rise: glyphs within a line that have nonzero Ts should be tagged as superscript or subscript rather than merged into the main text run, since their visual y position differs but their logical line membership is retained for reading-order purposes. + +The final output — an ordered sequence of glyph records each carrying Unicode text, user-space bounding box, font size, and baseline y — gives downstream consumers the data needed for accurate word segmentation, column detection, and table extraction without requiring re-processing of the raw content stream.
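The baseline clustering step can be illustrated with a short sketch. It assumes each glyph is a dictionary with hypothetical `x` and `baseline_y` keys in user space and horizontal writing mode; because the glyphs are sorted by baseline first, comparing each glyph with its predecessor is equivalent to single-linkage clustering in one dimension.

```python
def group_lines(glyphs, tolerance=0.5):
    """Cluster glyphs into lines by baseline y, then order each line by x.

    Lines are returned top to bottom (PDF y increases upward).
    """
    lines = []
    ordered = sorted(glyphs, key=lambda g: g["baseline_y"], reverse=True)
    for glyph in ordered:
        # Single linkage: join the current line if this baseline is within
        # the tolerance of the previously placed glyph's baseline.
        if lines and abs(lines[-1][-1]["baseline_y"] - glyph["baseline_y"]) < tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    for line in lines:
        line.sort(key=lambda g: g["x"])
    return lines
```

One known property of single linkage is chaining: a long run of glyphs whose baselines drift by slightly less than the tolerance each time will all land on one line, so a stricter implementation might also cap the total spread within a line.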