Add research: span merging, Unicode normalization, implementation plan
Two new research documents covering the glyph-to-span-to-block assembly pipeline (inter-operator merging, adaptive word gap threshold, column detection, ligature bbox splitting, multi-granularity output) and Unicode post-processing (NFC normalization, selective NFKC decomposition for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ handling, combining character reordering).

Also adds docs/plan/implementation-plan.md: the full 7-phase Rust implementation roadmap covering core parser, font/encoding pipeline, content stream processing, text assembly, OCR integration, API surface, and advanced features, with crate selections, complexity ratings, test strategy, and v0.1–v1.0 release milestones.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 6b96d8d637
commit 12fad41596
3 changed files with 1262 additions and 0 deletions
1048
docs/plan/implementation-plan.md
Normal file
File diff suppressed because it is too large
74
docs/research/span-merging-and-text-run-assembly.md
Normal file
@@ -0,0 +1,74 @@
# Span Merging, Text Run Assembly, and Glyph-to-Word-to-Line Pipeline
## The Extraction Atom: The Single Glyph
Every text extraction pipeline begins with the smallest meaningful unit: the individual glyph. When pdftract processes a content stream, each `Tj` or `TJ` operator produces one or more glyphs, and each glyph is a self-contained record: a character code resolved to a Unicode scalar, a bounding box in user space computed from the current text matrix and font metrics, a reference to the active font and its size, and the current rendering mode. This is the atom from which all higher-level structure is assembled.
Starting at the glyph level is the only semantically correct choice. PDF does not encode words, lines, or paragraphs — it encodes positioned drawing commands for individual glyphs. Any grouping imposed above the glyph is an inference made by the extractor. By preserving every glyph's position and font state independently, pdftract retains the information needed to correctly evaluate every subsequent merging decision. Discarding glyph-level detail earlier in the pipeline is irreversible and will cause incorrect span boundaries, particularly for documents with mixed fonts, tracking adjustments, or complex kerning.
The bounding box of a glyph is computed as: origin at the glyph's current text position, width equal to the glyph's advance width scaled by the font size and the current horizontal scaling factor, height derived from the font's ascender and descender metrics. These four values, combined with the font reference and rendering mode, constitute the complete glyph record.
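
The box computation described above can be sketched as follows. This is a minimal illustration, not pdftract's actual code: the type and field names (`BBox`, `FontMetrics`, `glyph_bbox`) are hypothetical, metrics are assumed to be in glyph space (thousandths of an em), and the result is in text space, before the text matrix and CTM are applied.

```rust
/// Axis-aligned box in text space (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

/// Vertical font metrics in glyph space (thousandths of an em).
#[derive(Debug, Clone, Copy)]
pub struct FontMetrics {
    pub ascent: f64,
    /// Negative for fonts whose glyphs extend below the baseline.
    pub descent: f64,
}

/// Box for one glyph whose origin is at (x, y).
/// `advance` is the advance width in glyph space; `h_scale` is the
/// horizontal scaling as a fraction (1.0 means 100%).
pub fn glyph_bbox(
    x: f64,
    y: f64,
    advance: f64,
    font_size: f64,
    h_scale: f64,
    metrics: FontMetrics,
) -> BBox {
    let width = advance / 1000.0 * font_size * h_scale;
    BBox {
        x0: x,
        y0: y + metrics.descent / 1000.0 * font_size,
        x1: x + width,
        y1: y + metrics.ascent / 1000.0 * font_size,
    }
}
```

A full implementation would additionally transform the four corners into user space, which this sketch leaves out.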
## Intra-Operator Span Assembly
Within a single `Tj` operator, all glyphs share the same font, size, rendering mode, and text matrix state at the start of the operator. They are trivially concatenated into a span — the bounding box of the assembled span is the union of the individual glyph bounding boxes, and the text content is the concatenation of their decoded Unicode characters.
The `TJ` operator introduces kerning displacements between glyphs via numeric elements in its array. Most of these displacements are fine-grained tracking adjustments that do not represent word boundaries. `TJ` numbers are expressed in thousandths of a unit of text space and are subtracted from the current position, so in horizontal writing a negative number widens the gap before the next glyph. pdftract treats a `TJ` adjustment as a word-break signal only when it opens a gap of at least 0.25 times the current font size, that is, a magnitude of at least 250 units. Adjustments that open a smaller gap, or that tighten the spacing, move glyph positions but do not split the span. Adjustments at or above the threshold cause the current span to be closed and a new span to begin after the gap. This threshold was calibrated against a broad corpus of PDFs where inter-word spacing in `TJ` arrays consistently falls in the 200–600 unit range while intra-word kerning is typically below 100 units.
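
As a rough sketch of this rule, the function below splits a decoded `TJ` array into spans at the 250-unit threshold. The `TjElem` type and `split_tj` function are hypothetical names, strings are assumed to be already decoded to Unicode, and the sign handling follows the PDF convention that `TJ` numbers are subtracted from the pen position (so a negative number opens a gap in horizontal writing).

```rust
/// One element of a TJ array: a decoded string or a numeric adjustment.
#[derive(Debug, Clone, PartialEq)]
pub enum TjElem {
    Text(String),
    /// Thousandths of a text space unit; negative values widen the gap.
    Adjust(f64),
}

/// Word-break threshold from the text: a gap of 0.25 em.
pub const WORD_BREAK_UNITS: f64 = 250.0;

/// Splits a TJ array into span strings at adjustments that open a gap
/// of at least `WORD_BREAK_UNITS`.
pub fn split_tj(elems: &[TjElem]) -> Vec<String> {
    let mut spans = Vec::new();
    let mut current = String::new();
    for elem in elems {
        match elem {
            TjElem::Text(s) => current.push_str(s),
            TjElem::Adjust(n) => {
                // The adjustment is subtracted from the position,
                // so the gap it opens is -n.
                let gap = -*n;
                if gap >= WORD_BREAK_UNITS && !current.is_empty() {
                    spans.push(std::mem::take(&mut current));
                }
            }
        }
    }
    if !current.is_empty() {
        spans.push(current);
    }
    spans
}
```

Only the text is tracked here; a real span would also carry the bounding box and font state described earlier.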
## Inter-Operator Merging
Consecutive `Tj` and `TJ` operators frequently represent a single continuous text run that was split across operators for reasons internal to the producing application — trailing kerning adjustments, color changes that were reverted, or simply the output of rich text compositors that emit one operator per styled run. pdftract merges consecutive operators into a single span when all of the following conditions hold:
- The font and font size are identical.
- The rendering mode is identical.
- The vertical deviation between the baseline of the new operator and the baseline of the current open span is less than 0.1 times the font size.
- The horizontal gap between the right edge of the last glyph in the current span and the left edge of the first glyph in the new operator is less than 0.5 times the space width for the active font (the advance width of the space glyph, or 0.25 em if the font has no space glyph).
Small `Td` adjustments between operators — the common idiom for micro-positioning in PDF generators — do not prevent merging as long as the resulting positions fall within these tolerances. The vertical tolerance accommodates sub-pixel rounding in the text matrix, and the horizontal tolerance accommodates the natural variation in inter-character spacing without admitting gaps large enough to represent inter-word spacing.
When a merge is performed, the bounding box of the receiving span is extended to cover the new glyphs, and the text content is concatenated. When the conditions are not met, the current span is closed and a new span is opened for the incoming operator.
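
The four merge conditions can be expressed as a single predicate. The sketch below is illustrative only: `SpanState` and `can_merge` are hypothetical names, all coordinates are assumed to be in user space, and `space_width` is assumed to be the space glyph's advance already scaled to the current font size (or `None` when the font has no space glyph).

```rust
/// Minimal state needed to evaluate a merge (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct SpanState {
    pub font_id: u32,
    pub font_size: f64,
    pub render_mode: u8,
    pub baseline_y: f64,
    pub left_x: f64,
    pub right_x: f64,
}

/// Returns true when `next` may be merged into the open span `open`.
pub fn can_merge(open: &SpanState, next: &SpanState, space_width: Option<f64>) -> bool {
    // Fall back to 0.25 em when the font has no space glyph.
    let space = space_width.unwrap_or(0.25 * open.font_size);
    let same_style = open.font_id == next.font_id
        && open.font_size == next.font_size
        && open.render_mode == next.render_mode;
    let baseline_delta = (next.baseline_y - open.baseline_y).abs();
    let gap = next.left_x - open.right_x;
    same_style && baseline_delta < 0.1 * open.font_size && gap < 0.5 * space
}
```

Note that a negative gap (overlapping glyphs after a tightening adjustment) passes the horizontal test in this sketch; whether that is desirable for large overlaps is a design choice the text does not settle.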
## Line Formation
Spans are grouped into lines by baseline proximity. Two spans belong to the same line if their baselines differ by no more than 0.5 points in user space. This tolerance is tighter than the inter-line spacing of any typical document, which means it will not merge glyphs from adjacent lines while still accommodating the small floating-point rounding errors that accumulate during text matrix computation.
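
A minimal sketch of baseline clustering with the 0.5-point tolerance is shown below. The function name `group_by_baseline` is hypothetical, and the sketch makes one assumption the text does not state: each span is compared against the first baseline of the current line rather than the previous span, so that small deviations cannot chain into a large drift.

```rust
/// Baseline tolerance from the text, in user space points.
pub const BASELINE_TOLERANCE_PT: f64 = 0.5;

/// Groups span indices into lines by baseline. Lines are returned top to
/// bottom (PDF user space has y increasing upward).
pub fn group_by_baseline(baselines: &[f64]) -> Vec<Vec<usize>> {
    let mut order: Vec<usize> = (0..baselines.len()).collect();
    // Sort indices by descending baseline so the topmost span comes first.
    order.sort_by(|&a, &b| baselines[b].total_cmp(&baselines[a]));

    let mut lines: Vec<Vec<usize>> = Vec::new();
    let mut anchor = 0.0_f64; // baseline of the first span in the current line
    for i in order {
        let y = baselines[i];
        let same_line = !lines.is_empty() && (anchor - y).abs() <= BASELINE_TOLERANCE_PT;
        if same_line {
            lines.last_mut().unwrap().push(i);
        } else {
            lines.push(vec![i]);
            anchor = y;
        }
    }
    lines
}
```

Sorting within each line by x-coordinate, and the superscript and subscript handling, would follow as separate passes.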
Superscripts and subscripts present a special case. A superscript glyph has a baseline elevated above the line by roughly 30–40% of the font size, and a subscript is depressed by a similar amount. pdftract detects these as glyphs whose baseline deviates from the dominant baseline of the current line cluster by more than 0.5 points but whose font size is detectably smaller than the line's primary font size (typically less than 75% of the dominant size). These glyphs are assigned to the nearest line cluster rather than being promoted to a new line, and they are tagged with a `superscript` or `subscript` flag on their span.
Within a line, spans are sorted by x-coordinate for left-to-right text. For RTL text, detected by the presence of characters whose Unicode bidirectional class is right-to-left (Bidi_Class R or AL, which covers Hebrew and Arabic letters), spans are sorted by reverse x-coordinate. Mixed-direction lines preserve the visual reading order by applying the Unicode Bidirectional Algorithm at the span level after positional sorting.
## Word Boundary Injection
After spans within a line are assembled and sorted, pdftract injects word boundaries by scanning the inter-glyph gaps along the line. The challenge is that the "correct" gap threshold for word separation varies by font, point size, and document style. A fixed threshold produces over-segmentation in tightly spaced text and under-segmentation in loosely tracked text.
pdftract uses an adaptive threshold computed per line via a gap histogram. For each consecutive glyph pair within the line, pdftract records the horizontal gap between the right edge of the first glyph and the left edge of the second (after accounting for kerning displacements). These gaps are binned into a histogram. In documents with normal word spacing, this histogram is bimodal: a dense cluster of small intra-word gaps near zero (including negative values from kerning) and a second cluster of inter-word gaps centered around the space width of the font. The threshold is placed at the valley between these two peaks, found by scanning from the intra-word peak toward larger gap values until the bin count begins increasing again.
When the histogram is unimodal (e.g., in very short lines with one or two words), pdftract falls back to a fixed threshold of 0.3 times the space width of the dominant font. A space character is injected into the output at each gap that exceeds the threshold.
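
The valley search can be sketched as follows. This is one possible reading of the description, not the confirmed implementation: `adaptive_threshold` and its `bin_width` parameter are hypothetical, the intra-word peak is assumed to be the fullest bin, and the function returns `None` for a unimodal histogram so the caller can apply the fixed fallback.

```rust
/// Returns a word-gap threshold for one line, or `None` when the histogram
/// has no valley (the caller then uses the fixed fallback threshold).
pub fn adaptive_threshold(gaps: &[f64], bin_width: f64) -> Option<f64> {
    if gaps.is_empty() || bin_width <= 0.0 {
        return None;
    }
    let min = gaps.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = gaps.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let n_bins = ((max - min) / bin_width).floor() as usize + 1;

    let mut bins = vec![0usize; n_bins];
    for &g in gaps {
        let idx = (((g - min) / bin_width).floor() as usize).min(n_bins - 1);
        bins[idx] += 1;
    }

    // Intra-word peak: the fullest bin, preferring the leftmost on ties.
    let peak = (0..n_bins).max_by_key(|&i| (bins[i], std::cmp::Reverse(i)))?;

    // Walk right from the peak until the count starts rising again.
    let mut i = peak;
    while i + 1 < n_bins && bins[i + 1] <= bins[i] {
        i += 1;
    }
    if i + 1 >= n_bins {
        return None; // the count never rose: unimodal histogram
    }
    // Place the threshold at the centre of the valley bin.
    Some(min + (i as f64 + 0.5) * bin_width)
}
```

With a wide empty valley this places the threshold just left of the inter-word cluster; choosing the middle of the valley instead would be an equally defensible variant.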
## Block Formation from Lines
Lines are grouped into text blocks — contiguous regions of related text such as paragraphs, headings, captions, or table cells. Block formation uses three signals:
1. **Inter-line spacing**: consecutive lines whose vertical gap falls within 20% of the median inter-line spacing for the local region are candidates for the same block. A gap more than 1.5 times the median spacing signals a block break.
2. **Left margin alignment**: lines within a block share a left margin within a tolerance of 2 points (accounting for first-line indentation, which is detected as a single-line offset and does not trigger a block break).
3. **Font size consistency**: a shift in the dominant font size between consecutive lines signals a block break and a potential heading boundary.
Each block is assigned a `kind` label derived from font characteristics: `heading` if the dominant font size exceeds 1.2 times the body text size for the page, `body` for standard paragraph text, `caption` for small-font text adjacent to figures, and `code` for monospaced font blocks.
## Column-Aware Line Grouping
On multi-column pages, naive line grouping by vertical proximity will incorrectly merge lines from separate columns into the same block. pdftract detects column boundaries before block formation by analyzing the x-coordinate distribution of all span left edges on the page. A significant gap in this distribution — a bin with zero or near-zero occupancy flanked by dense clusters on both sides — marks a column boundary. Lines whose x-ranges fall entirely within a single column band are constrained to merge only with other lines in the same column. Lines that span column boundaries (e.g., full-width headings) are identified by their x-extent and excluded from the column constraint.
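
One way to find such gaps is to histogram the left edges and look for empty runs with enough spans on either side. The sketch below is an assumption-laden illustration: `column_gaps`, `min_gap_bins`, and `min_cluster` are hypothetical names and tuning knobs, and "dense cluster" is approximated by a minimum span count to the left and right of the empty run.

```rust
/// Returns the (start_x, end_x) extent of each empty run in the histogram
/// of span left edges that separates two sufficiently populated regions.
pub fn column_gaps(
    left_edges: &[f64],
    page_width: f64,
    bin_width: f64,
    min_gap_bins: usize,
    min_cluster: usize,
) -> Vec<(f64, f64)> {
    if page_width <= 0.0 || bin_width <= 0.0 {
        return Vec::new();
    }
    let n_bins = (page_width / bin_width).ceil().max(1.0) as usize;
    let mut bins = vec![0usize; n_bins];
    for &x in left_edges {
        if x >= 0.0 && x < page_width {
            bins[((x / bin_width) as usize).min(n_bins - 1)] += 1;
        }
    }

    let mut gaps = Vec::new();
    let mut i = 0;
    while i < n_bins {
        if bins[i] != 0 {
            i += 1;
            continue;
        }
        // Measure the run of empty bins starting at `start`.
        let start = i;
        while i < n_bins && bins[i] == 0 {
            i += 1;
        }
        let left: usize = bins[..start].iter().sum();
        let right: usize = bins[i..].iter().sum();
        if i - start >= min_gap_bins && left >= min_cluster && right >= min_cluster {
            gaps.push((start as f64 * bin_width, i as f64 * bin_width));
        }
    }
    gaps
}
```

Because only left edges are histogrammed, the reported gap ends where the next column's text begins; the actual gutter is narrower and would need right edges to locate precisely.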
## Mixed-Font Spans
When a single logical word is rendered with a font change mid-word — an italic letter in a roman word, a Greek character in a Latin text — pdftract must decide whether to split the word at the font boundary or preserve the word as a single unit with a mixed-font flag. Splitting at the font boundary produces incorrect tokenization that breaks downstream search and selection.
pdftract preserves word integrity by merging glyphs across font transitions within a word. A font transition within a word is detected when the horizontal gap between the last glyph of the current font and the first glyph of the incoming font is below the word-break threshold. The merged span carries a `flags` field with bits for `bold`, `italic`, and `mixed_font`. The `mixed_font` flag signals to consumers that the span's font reference is nominal (the font of the first glyph) and that per-glyph font information is available in the glyph record array if needed.
## Ligature Handling
Ligatures such as fi, fl, ffi, and ffl are encoded in many fonts as single glyph codes that map to multi-character Unicode sequences. The glyph occupies a single bounding box that covers the combined extent of all its constituent characters. pdftract expands ligatures to their Unicode equivalents in the text output but must distribute the bounding box across the expanded characters for character-granularity output.
The distribution is proportional: the ligature's bounding box width is divided among the constituent characters according to the nominal advance widths of those characters in the font's character metrics. If the font does not provide individual metrics for the components (common with symbol fonts), the box is divided equally. This approximation introduces a small positional error — typically less than 0.5 points — that is acceptable for word-level bbox queries but should be noted when sub-character precision is required.
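
The proportional split, with the equal-share fallback, might look like the sketch below. `split_ligature_box` is a hypothetical name; the function works on the horizontal extent only and takes each component's nominal advance as `Option<f64>`, where `None` means the font provides no metric for that character.

```rust
/// Splits the horizontal extent [x0, x1] of a ligature glyph among its
/// component characters. Returns one (left, right) pair per component.
pub fn split_ligature_box(x0: f64, x1: f64, advances: &[Option<f64>]) -> Vec<(f64, f64)> {
    let n = advances.len();
    if n == 0 {
        return Vec::new();
    }
    // Use real advances only when every component has a positive one;
    // otherwise divide the box equally.
    let all_known = advances.iter().all(|a| matches!(a, Some(w) if *w > 0.0));
    let weights: Vec<f64> = if all_known {
        advances.iter().map(|a| a.unwrap()).collect()
    } else {
        vec![1.0; n]
    };
    let total: f64 = weights.iter().sum();

    let mut out = Vec::with_capacity(n);
    let mut left = x0;
    let mut acc = 0.0;
    for w in &weights {
        acc += w;
        // Compute each right edge from the running total so the last
        // edge lands on x1 rather than accumulating rounding error.
        let right = x0 + (x1 - x0) * acc / total;
        out.push((left, right));
        left = right;
    }
    out
}
```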
## Output Granularity
A single extraction pass through the content stream produces the complete glyph record array. From this array, pdftract assembles word-granularity spans (one span per word, with the merged bounding box and concatenated text), character-granularity spans (one span per glyph, preserving individual bounding boxes), and paragraph-granularity spans (one span per block, with the block's bounding box and full text content) without re-parsing the PDF. The granularity is selected at query time by applying the appropriate merge level to the cached glyph array. This design means that a single page parse supports all output modes — character-level bbox queries for text selection, word-level spans for search, and paragraph-level spans for document structure analysis — without redundant work.
140
docs/research/unicode-normalization-and-text-cleanup.md
Normal file
@@ -0,0 +1,140 @@
# Unicode Normalization, Text Cleanup, and Post-Processing Pipeline
## Overview
Raw text extracted from PDF streams is rarely clean Unicode. Glyph-to-character mappings in PDF fonts encode text in forms optimized for rendering, not for downstream consumption: ligature glyphs stand in for character sequences, soft hyphens interrupt words at line breaks, Private Use Area code points mask unresolved glyphs, and combining diacritics may arrive in visual rather than logical order. The pdftract post-processing pipeline exists to resolve all of these issues in a defined, predictable sequence before text reaches the caller.
This document specifies the precise Unicode transformations applied in that pipeline stage.
---
## 1. Unicode Normalization Form
pdftract outputs text in **NFC** (Canonical Decomposition followed by Canonical Composition, as defined in Unicode Standard Annex #15). This choice is mandated by the PDF/UA-2 standard, which requires ActualText and ToUnicode output to be in NFC. Beyond standards compliance, NFC is the most compact canonical form and is what most downstream tools—text search indexes, NLP tokenizers, and string comparison routines—expect to receive.
The distinction between normalization forms matters here. **NFD** decomposes every precomposed character into its base letter and combining diacritical sequence (e.g., `é` → `e` + U+0301), which is useful for diacritic-stripping but produces inflated code unit counts. **NFKD** and **NFKC** additionally apply compatibility decompositions, which collapse ligatures and other compatibility variants into their compatibility equivalents. That is useful for search indexing, but too destructive for general-purpose extraction output where the caller may need to distinguish the ligature `ﬁ` (U+FB01) from the two-character sequence `fi` for layout reconstruction. pdftract applies compatibility decompositions selectively and explicitly (see Section 2) rather than globally via NFKC, then composes the result to NFC.
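The normalization step runs last among the Unicode transformations, after every step that can change code points, so that earlier steps do not produce NFD intermediate forms that subsequently compose incorrectly. Only the whitespace collapse applied in `--text` mode (Section 9) follows it.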
---
## 2. Selective Compatibility Decomposition
PDF fonts frequently encode typographic ligatures as single glyphs and map them to Compatibility Area code points rather than multi-character sequences. A naïve extraction that preserves these code points produces text where `"efficient"` is stored as `"e` U+FB03 `cient"`, which breaks substring search, spell checking, and word-boundary detection.
pdftract applies the following compatibility decompositions unconditionally when the code point appears in body text content:
| Code Point | Name | Expansion |
|---|---|---|
| U+FB00 | LATIN SMALL LIGATURE FF | `ff` |
| U+FB01 | LATIN SMALL LIGATURE FI | `fi` |
| U+FB02 | LATIN SMALL LIGATURE FL | `fl` |
| U+FB03 | LATIN SMALL LIGATURE FFI | `ffi` |
| U+FB04 | LATIN SMALL LIGATURE FFL | `ffl` |
| U+FB05 | LATIN SMALL LIGATURE LONG S T | `st` |
| U+FB06 | LATIN SMALL LIGATURE ST | `st` |
| U+FB00–U+FB4F | Full Alphabetic Presentation Forms block | per Unicode decomposition mapping |
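
The Latin ligature rows of the table can be applied with a direct lookup. The sketch below covers only U+FB00 through U+FB06; the function names are hypothetical, and the remainder of the block would need the full Unicode decomposition data, which the standard library does not provide.

```rust
/// Expansion for the Latin ligatures U+FB00..=U+FB06, or `None` for any
/// other character.
pub fn expand_latin_ligature(c: char) -> Option<&'static str> {
    match c {
        '\u{FB00}' => Some("ff"),
        '\u{FB01}' => Some("fi"),
        '\u{FB02}' => Some("fl"),
        '\u{FB03}' => Some("ffi"),
        '\u{FB04}' => Some("ffl"),
        // Both LONG S T and ST expand to "st" under compatibility mapping.
        '\u{FB05}' | '\u{FB06}' => Some("st"),
        _ => None,
    }
}

/// Replaces every Latin ligature in `input` with its expansion.
pub fn expand_ligatures(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match expand_latin_ligature(c) {
            Some(expansion) => out.push_str(expansion),
            None => out.push(c),
        }
    }
    out
}
```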
For **superscript and subscript digits** (e.g., U+00B2 SUPERSCRIPT TWO, U+00B3 SUPERSCRIPT THREE, U+2070–U+2079, U+2080–U+2089), pdftract applies compatibility decomposition only when the character is in a run of body text where no mathematical context is detected. When the span is tagged as a formula, equation, or appears within a mathematical font context identified during the glyph-mapping stage, the superscript/subscript code points are preserved verbatim so that the caller can reconstruct notation correctly. Body text heuristics check for surrounding alphanumeric characters and the absence of mathematical operator adjacency; when ambiguous, the code point is preserved and flagged in the confidence metadata.
Presentation forms outside the Latin ligature set, such as Arabic Presentation Forms-A (U+FB50–U+FDFF) and Arabic Presentation Forms-B (U+FE70–U+FEFF), are decomposed only when the current script run is Latin. For Arabic text, these forms carry distinct semantic weight in some legacy encodings and are left for the Arabic handling in Section 10.
---
## 3. Private Use Area Handling
Code points in the Private Use Areas (U+E000–U+F8FF in the BMP, plus the supplementary ranges U+F0000–U+FFFFD and U+100000–U+10FFFD) appear when a PDF's ToUnicode CMap assigns PUA values to glyphs whose actual characters are unknown, commonly in hand-crafted or scanned documents with embedded bitmapped fonts where the font vendor used PUA internally.
pdftract does **not** silently drop PUA code points and does not attempt heuristic substitution. Instead, each PUA code point is preserved verbatim in the output string and annotated in the structured JSON output with `confidence_source: "Synthetic"` and `confidence: 0.0`. This makes the gap visible to downstream processors without corrupting the surrounding text or altering character offsets. Callers that need clean plain text can filter on the confidence metadata; callers that need to audit extraction quality can locate every unresolved glyph precisely.
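
A sketch of the annotation step is shown below. The `CharAnnotation` struct is a hypothetical stand-in for whatever the JSON output actually serializes; only the `confidence_source` and `confidence` values come from the text. The ranges checked are the three Private Use Areas defined by Unicode (U+E000–U+F8FF, U+F0000–U+FFFFD, U+100000–U+10FFFD).

```rust
/// True for code points in any of the three Unicode Private Use Areas.
pub fn is_private_use(c: char) -> bool {
    matches!(
        c as u32,
        0xE000..=0xF8FF | 0xF0000..=0xFFFFD | 0x100000..=0x10FFFD
    )
}

/// Per-character annotation (hypothetical shape for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct CharAnnotation {
    /// Index of the character in the string, counted in chars.
    pub char_index: usize,
    pub confidence_source: &'static str,
    pub confidence: f32,
}

/// Flags every PUA character without altering the text itself.
pub fn annotate_pua(text: &str) -> Vec<CharAnnotation> {
    text.chars()
        .enumerate()
        .filter(|(_, c)| is_private_use(*c))
        .map(|(i, _)| CharAnnotation {
            char_index: i,
            confidence_source: "Synthetic",
            confidence: 0.0,
        })
        .collect()
}
```

Because the text is left untouched, character offsets computed before and after this step stay identical.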
PUA cleanup is out of scope for pdftract: resolving these requires either a per-font encoding table provided by the caller or OCR fallback, both of which are caller responsibilities.
---
## 4. Soft Hyphen Handling
U+00AD (SOFT HYPHEN) is inserted by TeX and similar typesetters at potential line-break positions within words. In the rendered PDF the glyph may or may not be visible depending on whether the line actually broke at that point. When text is extracted naïvely, these soft hyphens appear mid-word in the output stream regardless of rendering context.
pdftract resolves soft hyphens using glyph position data from the PDF content stream. When a U+00AD is followed by a line break in the glyph sequence (detected by a vertical position delta exceeding the line height threshold) and the first character on the next line is a lowercase letter, the soft hyphen and the line break are both removed and the two word fragments are joined. This heuristic covers the dominant TeX case. When the next line begins with an uppercase letter or a digit, the soft hyphen is removed but a space is inserted, under the assumption that the break was between words. In all other cases the soft hyphen is removed unconditionally—U+00AD has no display semantics in plain text output and is never useful to callers.
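
The joining rules can be sketched at the string level, assuming line breaks have already been detected from glyph positions. `join_lines` and `strip_soft_hyphens` are hypothetical helpers; the behaviour when the next line starts with neither a letter nor a digit (the hyphen is dropped and the line break kept) is this sketch's reading of "all other cases".

```rust
pub const SOFT_HYPHEN: char = '\u{00AD}';

/// Joins two consecutive lines, resolving a soft hyphen at the end of
/// `line` according to the first character of `next`.
pub fn join_lines(line: &str, next: &str) -> String {
    match line.strip_suffix(SOFT_HYPHEN) {
        Some(head) => match next.chars().next() {
            // Lowercase continuation: the word was split across lines.
            Some(c) if c.is_lowercase() => format!("{}{}", head, next),
            // Uppercase or digit: assume the break fell between words.
            Some(c) if c.is_uppercase() || c.is_ascii_digit() => {
                format!("{} {}", head, next)
            }
            // Anything else: drop the hyphen, keep the line break.
            _ => format!("{}\n{}", head, next),
        },
        None => format!("{}\n{}", line, next),
    }
}

/// Removes any soft hyphens that remain inside a line.
pub fn strip_soft_hyphens(s: &str) -> String {
    s.chars().filter(|&c| c != SOFT_HYPHEN).collect()
}
```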
---
## 5. Non-Breaking Space Normalization
U+00A0 (NO-BREAK SPACE), U+202F (NARROW NO-BREAK SPACE), and U+2007 (FIGURE SPACE) are treated differently depending on the output mode and the detected content type.
In **body text** spans—paragraphs, headings, captions—all three are normalized to U+0020 (SPACE). Typographers use NBSP to prevent line breaks at specific positions, but that layout intent is lost in text extraction; preserving NBSP in body text output causes unexpected behavior in search indexes and tokenizers that do not normalize it.
In **formatted content** spans—table cells, form fields, code blocks, and structured data regions detected by layout analysis—the original code points are preserved. A figure space (U+2007) in a numeric column is semantically significant for alignment; a narrow NBSP (U+202F) in a date or unit string, such as the one between `100` and `km`, is intentional and its removal would corrupt the value.
The `--text` output mode applies body-text normalization globally. The JSON structured output preserves the original code points and annotates the span's content-type classification so the caller can make their own normalization decision.
---
## 6. Control Character Filtering
C0 control characters (U+0000–U+001F) and C1 control characters (U+0080–U+009F) appear in extracted text as artifacts of encoding errors, particularly in documents that mix single-byte encodings or use MacRoman/WinANSI code pages where byte values in the C1 range map to printable characters in the source encoding but are incorrectly interpreted as Unicode.
pdftract strips all C0 and C1 control characters from body text with two exceptions: U+0009 (CHARACTER TABULATION) and U+000A (LINE FEED) are retained when they appear in form field values, where they are legitimate content. U+000D (CARRIAGE RETURN) is normalized to U+000A rather than stripped, because some PDF generators use CR to terminate form field lines. The null byte U+0000 is stripped unconditionally regardless of context.
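
A character-level sketch of this filter follows. `filter_controls` and its `in_form_field` flag are hypothetical; the sketch also assumes that a carriage return is first treated as a line feed and then subject to the same form-field rule as U+000A, which is one reading of the paragraph above.

```rust
/// Strips C0 and C1 control characters. Tab and line feed survive only
/// inside form field values; carriage return is treated as line feed.
pub fn filter_controls(input: &str, in_form_field: bool) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '\r' | '\n' => {
                if in_form_field {
                    out.push('\n');
                }
            }
            '\t' => {
                if in_form_field {
                    out.push('\t');
                }
            }
            // Remaining C0 (including U+0000) and all C1 controls.
            '\u{0000}'..='\u{001F}' | '\u{0080}'..='\u{009F}' => {}
            _ => out.push(c),
        }
    }
    out
}
```

In this form a CR LF pair inside a form field yields two line feeds; collapsing the pair first would be a natural refinement.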
---
## 7. Zero-Width Characters
U+200B (ZERO WIDTH SPACE), U+FEFF (BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE), U+200C (ZERO WIDTH NON-JOINER), and U+200D (ZERO WIDTH JOINER) require distinct treatment.
**U+200B** and **U+FEFF** are stripped from all output. ZWSP is used by some PDF generators as an internal glyph separator with no semantic content; BOM has no meaning within a string body. Neither survives into pdftract output.
**U+200C (ZWNJ)** and **U+200D (ZWJ)** affect shaping in Arabic, Indic, and other complex scripts. In runs where the span's detected language is Arabic (`ar`), Persian (`fa`), Hindi (`hi`), or any other language whose script relies on ZWJ/ZWNJ for correct glyph selection, these code points are preserved verbatim. Stripping a ZWNJ from a Persian compound word or a ZWJ from an Indic conjunct consonant produces incorrect text that cannot be faithfully re-rendered. In Latin-script spans where these code points appear due to encoding errors, they are stripped.
Language detection for this decision uses the script-run classification produced during the glyph-mapping stage, falling back to the Unicode Script property of the surrounding characters.
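
The zero-width rules reduce to a small filter once the script run is known. In the sketch below, `ScriptRun` is a hypothetical two-way classification standing in for the script-run data from the glyph-mapping stage; how unknown scripts should be treated is not specified in the text and is not modelled here.

```rust
/// Coarse script classification for a span (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScriptRun {
    /// Latin-script text, where joiners are treated as encoding noise.
    Latin,
    /// Arabic, Persian, Indic, and other scripts that rely on ZWJ/ZWNJ.
    ComplexShaping,
}

/// Strips ZWSP and BOM everywhere; keeps ZWNJ and ZWJ only in
/// complex-shaping runs.
pub fn filter_zero_width(input: &str, script: ScriptRun) -> String {
    let keep_joiners = script == ScriptRun::ComplexShaping;
    input
        .chars()
        .filter(|&c| match c {
            '\u{200B}' | '\u{FEFF}' => false,
            '\u{200C}' | '\u{200D}' => keep_joiners,
            _ => true,
        })
        .collect()
}
```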
---
## 8. Smart Quotes and Typographic Punctuation
U+2018 (LEFT SINGLE QUOTATION MARK), U+2019 (RIGHT SINGLE QUOTATION MARK), U+201C (LEFT DOUBLE QUOTATION MARK), U+201D (RIGHT DOUBLE QUOTATION MARK), U+2013 (EN DASH), and U+2014 (EM DASH) are **preserved as-is** in all output modes.
These are correct Unicode characters, not encoding errors. Normalizing them to ASCII apostrophes, straight quotation marks, or hyphens would constitute lossy transformation that destroys typographic information present in the source document. Downstream tools that require ASCII-only punctuation must perform their own substitution; pdftract does not make that decision for the caller.
---
## 9. Whitespace Collapse in Plain Text Output
The `--text` output mode applies a final whitespace normalization pass that is not applied to JSON structured output:
- Multiple consecutive U+0020 SPACE characters within a line are collapsed to a single space.
- U+000D, U+000D U+000A, and U+000C (FORM FEED) line endings are normalized to U+000A.
- Trailing whitespace is stripped from every line.
- Multiple consecutive blank lines are collapsed to a single blank line.
This pass runs after all other transformations. It is intentionally absent from JSON output, where span-level whitespace reflects the actual character sequence returned by the extraction engine and the caller controls presentation.
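
The four rules above fit in one short function. This is a sketch using only the standard library; `collapse_whitespace` is a hypothetical name. CR LF is replaced before lone CR so that the pair becomes one line feed rather than two.

```rust
/// Final whitespace pass for plain-text output.
pub fn collapse_whitespace(input: &str) -> String {
    // Normalize line endings: CR LF first, then lone CR and form feed.
    let unified = input
        .replace("\r\n", "\n")
        .replace(&['\r', '\u{000C}'][..], "\n");

    let mut lines: Vec<String> = Vec::new();
    for line in unified.split('\n') {
        // Collapse runs of U+0020 to a single space.
        let mut cleaned = String::with_capacity(line.len());
        for c in line.chars() {
            if c == ' ' && cleaned.ends_with(' ') {
                continue;
            }
            cleaned.push(c);
        }
        // Strip trailing whitespace.
        let cleaned = cleaned.trim_end().to_string();
        // Collapse consecutive blank lines to one.
        let prev_blank = lines.last().map_or(false, |l| l.is_empty());
        if cleaned.is_empty() && prev_blank {
            continue;
        }
        lines.push(cleaned);
    }
    lines.join("\n")
}
```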
---
## 10. Combining Character Ordering
Some legacy PDF generators, particularly those targeting RTL scripts, write glyph sequences in visual order rather than logical Unicode order. A base character may be followed by combining diacritics in the order they appear left-to-right on screen rather than in Unicode canonical combining class (ccc) order. When combining marks with different canonical combining classes arrive in the wrong sequence, a composition step that assumes canonically ordered input can produce a different composed character than intended, or fail to compose at all.
pdftract detects out-of-order combining sequences by examining the canonical combining class of each code point in a combining character sequence. When the sequence is not in non-decreasing ccc order, the marks are sorted by ccc before the final NFC composition pass. The sort is stable, so marks that share a ccc value keep their relative order; this is the same reordering that the Canonical Ordering Algorithm in UAX #15 defines. This reordering is applied only to sequences where all marks belong to the same base character (i.e., the sequence is a single combining character sequence in Unicode terms) to avoid incorrectly reordering marks that belong to adjacent base characters.
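
The reordering itself is a stable sort over each run of marks. The sketch below uses a deliberately tiny, hand-written `ccc` table covering a handful of common Latin combining marks; a real implementation would take combining classes from the full Unicode Character Database. Both function names are hypothetical.

```rust
/// Canonical combining class for a few common marks. Illustrative subset
/// only; every other character is treated as a starter (ccc 0).
fn ccc(c: char) -> u8 {
    match c {
        '\u{0327}' | '\u{0328}' => 202, // cedilla, ogonek
        '\u{0323}' | '\u{0331}' => 220, // dot below, macron below
        '\u{0300}'..='\u{0304}' | '\u{0308}' => 230, // marks above
        _ => 0,
    }
}

/// Sorts each maximal run of non-starter marks into non-decreasing ccc
/// order, leaving starters and everything else in place.
pub fn reorder_marks(input: &str) -> String {
    let mut chars: Vec<char> = input.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        if ccc(chars[i]) == 0 {
            i += 1;
            continue;
        }
        let start = i;
        while i < chars.len() && ccc(chars[i]) != 0 {
            i += 1;
        }
        // `sort_by_key` is stable: marks with equal ccc keep their order.
        chars[start..i].sort_by_key(|&c| ccc(c));
    }
    chars.into_iter().collect()
}
```

Since a run ends at the next starter, marks are never moved across a base character boundary.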
For Arabic and Hebrew text where the visual-to-logical reordering problem is pervasive, pdftract additionally applies the Unicode Bidirectional Algorithm to the extracted character sequence before the combining character sort, ensuring that the logical string order matches Unicode's expected representation for RTL text.
---
## Pipeline Execution Order
The transformations above execute in the following sequence to avoid interactions between steps:
1. Control character filtering (C0/C1 strip)
2. Zero-width character handling (strip ZWSP/BOM; preserve ZWJ/ZWNJ in complex-script spans)
3. PUA annotation (flag and pass through)
4. Soft hyphen resolution (requires raw glyph positions, must precede whitespace normalization)
5. Ligature and compatibility decomposition (selective, as specified in Section 2)
6. Superscript/subscript resolution (context-dependent)
7. Non-breaking space normalization (body text only)
8. Combining character reordering
9. NFC normalization (final composition)
10. Whitespace collapse (`--text` mode only)
Smart quote and typographic punctuation preservation requires no active transformation—it is an absence of normalization—and is therefore not a discrete step.