jedarden 116db89c95 Add three research documents on routing and text reconstruction

- word-boundary-reconstruction: expected position formula with Tc/Tw/Tz,
  TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold
  strategies including adaptive histogram, multi-column gap discrimination
- scanned-vs-vector-page-classification: four-category taxonomy, fast
  pre-checks, image coverage AABB computation, character density ratio,
  validity rate, glyph bbox plausibility, region routing map, confidence
  scoring with cost-aware OCR threshold
- pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP
  pdfaid detection, Level B/U/A guarantee implications for extraction,
  font embedding requirements, artifact tagging, PDF/A-3 embedded files,
  PdfaLevel enum with per-level fast-path branching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:22:08 -04:00


Word Boundary Reconstruction

Problem Statement

A substantial fraction of real-world PDFs — especially those produced by TeX/LaTeX toolchains, legacy CAD exporters, and older desktop publishing systems — contain no explicit space characters (U+0020) in their content streams. The visual whitespace between words is produced entirely through glyph positioning arithmetic. When a text extractor naively concatenates glyph-to-Unicode mappings without accounting for positional gaps, every word runs together and the output is unreadable. Reconstructing word boundaries is therefore one of the highest-impact correctness problems in PDF text extraction.

1. Why Spaces Are Missing

The PDF content stream model does not require producers to emit space characters. The spec defines word spacing (Tw) and character spacing (Tc) as graphics state parameters precisely because positioning is expected to substitute for literal space glyphs.

TeX/dvips and pdfTeX operate character-by-character. Each glyph is placed at an absolute or relative position computed by TeX's box-and-glue model. Inter-word glue is converted to a Td offset or a negative numeric element inside a TJ array (negative elements move the position forward); no 0x20 byte ever appears in the string arguments. This is by design: TeX fonts often lack a space glyph entirely, and in classic TeX encodings such as OT1 character code 0x20 is assigned to a non-space glyph, so a width found at that code cannot be assumed to be a space width.

Advance-width substitution is the general pattern: rather than encoding a space glyph, authoring tools advance the text position by a computed amount equal to the intended inter-word gap, then begin the next word. The result is visually identical to a space but structurally absent from the character stream.

2. Glyph Advance Width and Position

Every glyph has an advance width defined in the font's metric tables. In PDF:

Type 1 / TrueType fonts: the Widths array in the font dictionary maps character codes to glyph widths in 1/1000 of the font's em unit.
CIDFonts: the DW key provides a default advance width; the W key provides per-glyph overrides as a compact run-length encoding.
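The two lookups above can be sketched as follows. This is a minimal illustration, not a parser: `SimpleFontWidths`, `WEntry`, and `cid_width` are hypothetical names, and the sketch assumes the font dictionary has already been decoded into these structures. The `W` array forms modelled here are the two the PDF spec defines: `c [w1 ... wn]` and `c_first c_last w`.

```rust
/// Hypothetical, already-decoded width data for a simple (Type 1 / TrueType) font.
pub struct SimpleFontWidths {
    pub first_char: u32,    // FirstChar entry of the font dictionary
    pub widths: Vec<f32>,   // Widths array, in 1/1000 units
    pub missing_width: f32, // FontDescriptor MissingWidth (0 when absent)
}

impl SimpleFontWidths {
    /// Width for a character code; codes outside the Widths array
    /// fall back to MissingWidth.
    pub fn advance(&self, code: u32) -> f32 {
        code.checked_sub(self.first_char)
            .and_then(|i| self.widths.get(i as usize).copied())
            .unwrap_or(self.missing_width)
    }
}

/// Hypothetical decoded form of one entry in a CIDFont W array.
pub enum WEntry {
    /// `c [w1 w2 ... wn]`: consecutive CIDs starting at `first`.
    List { first: u32, widths: Vec<f32> },
    /// `c_first c_last w`: one width shared by a CID range.
    Range { first: u32, last: u32, width: f32 },
}

/// Width for a CID: the first matching W entry wins, otherwise DW.
pub fn cid_width(w: &[WEntry], dw: f32, cid: u32) -> f32 {
    for entry in w {
        match entry {
            WEntry::List { first, widths } => {
                if let Some(i) = cid.checked_sub(*first) {
                    if let Some(width) = widths.get(i as usize) {
                        return *width;
                    }
                }
            }
            WEntry::Range { first, last, width } => {
                if (*first..=*last).contains(&cid) {
                    return *width;
                }
            }
        }
    }
    dw
}
```

Keeping both lookups total (they always return a width) means the gap detector never has to handle a missing advance; the fallback values simply make the expected position less precise.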

After rendering glyph g whose advance width is w_g (in glyph units), the text position advances to:

x_next_expected = x_current + (w_g * font_size / 1000)

If the actual x-position of the following glyph exceeds x_next_expected by more than a threshold, a gap exists. The magnitude of that gap determines its likely meaning: a small gap is probably a word space, while a larger gap may indicate a sentence boundary, a tab stop, or a column separator.

3. Computing Expected Position Accurately

The simplified formula above omits three graphics state parameters that the PDF spec requires to be applied:

x_next_expected =
    x_current
    + (w_g / 1000 * font_size + Tc + Tw_if_space) * Tz / 100

Where:

Tc (character spacing, set by the Tc operator): added to the advance of every glyph.
Tw (word spacing, set by the Tw operator): added only after a single-byte character code 0x20. It applies to simple fonts and to composite fonts whose CMap defines 0x20 as a single-byte code; it never applies to codes of two or more bytes.
Tz (horizontal scaling percentage, set by the Tz operator, default 100): scales the entire horizontal advance.

Failure to apply Tc and Tz causes systematic over- or under-estimation of expected positions and produces false gap detections. The expected position is computed in text space; apply the text matrix (set by Tm or updated by Td) and the current transformation matrix to convert it into device space before comparing it with the next glyph's actual device-space coordinates.
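The full advance formula can be written as a small function. This is a sketch under stated assumptions: `AdvanceState` is a hypothetical, reduced view of the text state holding only the four parameters the formula needs, and the result is in text space (the matrix transformation to device space is left out).

```rust
/// Hypothetical reduced text state: only what the advance formula needs.
pub struct AdvanceState {
    pub font_size: f32,
    pub tc: f32, // character spacing (Tc)
    pub tw: f32, // word spacing (Tw)
    pub tz: f32, // horizontal scaling in percent (Tz), default 100
}

/// Horizontal advance in text space after showing one glyph.
/// `w_g` is the glyph width in 1/1000 units; `is_single_byte_space`
/// is true only for a single-byte character code 0x20.
pub fn advance_text_space(w_g: f32, is_single_byte_space: bool, s: &AdvanceState) -> f32 {
    let tw = if is_single_byte_space { s.tw } else { 0.0 };
    (w_g / 1000.0 * s.font_size + s.tc + tw) * s.tz / 100.0
}

/// Expected x of the next glyph, given the current x in text space.
pub fn expected_next_x(x_current: f32, w_g: f32, is_single_byte_space: bool, s: &AdvanceState) -> f32 {
    x_current + advance_text_space(w_g, is_single_byte_space, s)
}
```

For example, a 500-unit glyph at font size 10 with Tc = 0.5 and Tz = 50 advances (5.0 + 0.5) * 0.5 = 2.75 units, not the 5.0 that the simplified formula would predict.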

4. The Gap Threshold

The central parameter is the minimum gap magnitude that triggers space insertion. Several strategies exist; an adaptive combination is most robust:

Fixed fraction of font size. A gap exceeding 0.2 * font_size is commonly cited. This works for typical roman typefaces at body text sizes but breaks for narrow condensed faces or for documents that mix font sizes.

Fraction of average glyph width. Compute the mean advance width of the glyphs observed on the current text line (excluding outliers). A gap exceeding 0.3 * mean_advance adapts better to condensed or wide typefaces.

Font space glyph width. If the font's Widths array contains an entry for character code 0x20, that width (converted to device units as w_space * font_size / 1000) is the canonical space reference. This is the most accurate signal when available.

Fallback half-em. When no space glyph is defined, use 500 glyph units (half the em) as the reference width: 0.5 * font_size.

Adaptive histogram method. Collect all observed inter-glyph gaps on a page. The distribution is typically bimodal: a sharp peak near zero (tight kerning pairs) and a broader peak near the space width. Fit or locate these two peaks; use the valley between them as the threshold. This requires sufficient glyph count (at least ~50 gaps) to be reliable and can be computed incrementally per-font-size class.

In practice, use the font space glyph width when available, fall back to the adaptive histogram when sufficient data exists, and use 0.25 * font_size otherwise.
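The selection order in the last paragraph might look like this in code. It is a sketch only: `select_space_reference` and its parameters are hypothetical names, the histogram value is assumed to be already computed in device units, and how strictly a gap is compared against the returned reference is left to the caller.

```rust
/// Pick the reference gap, in device units, following the priority:
/// font space glyph width, then histogram estimate, then 0.25 * font_size.
///
/// `space_width_units`: width of code 0x20 in 1/1000 units, if the font has one.
/// `histogram_valley`: adaptive estimate in device units, if enough gaps were seen.
pub fn select_space_reference(
    space_width_units: Option<f32>,
    histogram_valley: Option<f32>,
    font_size: f32,
) -> f32 {
    // A zero or negative width carries no information, so skip it.
    if let Some(w) = space_width_units.filter(|w| *w > 0.0) {
        return w * font_size / 1000.0;
    }
    if let Some(valley) = histogram_valley {
        return valley;
    }
    0.25 * font_size
}
```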

5. TJ Operator Kerning Arrays

The TJ operator accepts an array whose elements are byte strings and numeric offsets, usually interleaved. Per the PDF spec, a numeric element displaces the text position by -(offset / 1000) * font_size in text space. In horizontal writing a positive element therefore moves the position backward and kerns tighter, while a negative element moves the position forward and adds a gap.

TeX-generated streams use small offsets for kerning between adjacent letters and large negative offsets (typically below −250 in 1000-unit space) to implement word separation.

The space-detection rule for TJ numeric elements:

if offset < -space_threshold_in_glyph_units {
    insert_space()
}

Where space_threshold_in_glyph_units maps the device-space threshold back to 1000-unit glyph space: threshold_device * 1000 / font_size. TeX-generated PDFs commonly use offsets around −250 to −350 to represent a normal inter-word space in a 1000-unit font. Treat each transition between a string element and a numeric element, and back to a string, as a potential gap site.
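A minimal sketch of the TJ rule follows. It assumes the string elements have already been decoded to Unicode, which a real extractor would do through the font's encoding or ToUnicode map; `TjElement` and `reconstruct_tj` are hypothetical names.

```rust
/// Hypothetical decoded element of a TJ array.
pub enum TjElement {
    /// A string element, already mapped to Unicode.
    Text(String),
    /// A numeric element in 1/1000 units (negative moves forward).
    Offset(f32),
}

/// Concatenate a TJ array, inserting a space wherever a numeric element
/// is more negative than the threshold.
pub fn reconstruct_tj(elements: &[TjElement], threshold_device: f32, font_size: f32) -> String {
    // Map the device-space threshold back into 1000-unit glyph space.
    let threshold_units = threshold_device * 1000.0 / font_size;
    let mut out = String::new();
    for element in elements {
        match element {
            TjElement::Text(s) => out.push_str(s),
            TjElement::Offset(offset) => {
                // Avoid leading or doubled spaces from consecutive offsets.
                if *offset < -threshold_units && !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

With a threshold of 2.0 at font size 10 (200 glyph units), the array `[(Hello) -300 (W) 40 (orld)]` yields "Hello World": the −300 element inserts a space, and the +40 kerning element is ignored.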

6. Td/TD/Tm Positioning

When the PDF content stream transitions between text positioning commands, the text matrix changes. Relevant operators:

Td tx ty: moves the text line position by (tx, ty) in text space.
TD tx ty: same as Td but also sets TL = -ty.
Tm a b c d e f: sets the text matrix directly.

Between consecutive text-showing operators (Tj, TJ, ', and "), if the text matrix changes such that the new horizontal position in device space exceeds x_last_glyph_end by more than the space threshold, insert a space.

Rules:

Positive horizontal jump (new x > expected x by threshold): insert a space.
Negative horizontal jump (new x < expected x): do not insert a space; this is a backtrack, indicating overlapping text, a correction, a superscript/subscript, or right-to-left text reordering. Log as a backtrack event in debug metadata.
Jump between BT/ET blocks: treat the start of each new text object as a potential word boundary using the same threshold rule, comparing the new block's starting position to the ending position of the last glyph from the previous block.
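The three rules reduce to one comparison on the horizontal delta. The sketch below assumes both positions are already in device space and on the same baseline; `HorizontalJump` and `classify_jump` are hypothetical names. The same function can be applied across BT/ET boundaries by passing the last glyph end of the previous text object.

```rust
/// Outcome of comparing a new text position with the end of the last glyph.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum HorizontalJump {
    /// Gap at or below the threshold: no space.
    Contiguous,
    /// Forward jump beyond the threshold: insert a space.
    InsertSpace,
    /// Backward jump: no space, record a backtrack event.
    Backtrack,
}

pub fn classify_jump(last_glyph_end_x: f32, new_x: f32, space_threshold: f32) -> HorizontalJump {
    let delta = new_x - last_glyph_end_x;
    if delta > space_threshold {
        HorizontalJump::InsertSpace
    } else if delta < 0.0 {
        HorizontalJump::Backtrack
    } else {
        HorizontalJump::Contiguous
    }
}
```

As written, any negative delta counts as a backtrack, including a tiny one caused by tight kerning; an implementation may want a small tolerance before logging the event.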

7. Vertical Gap Interpretation

A change in the y-coordinate of the text position signals a line change rather than a word gap. The threshold:

if abs(delta_y) > 0.5 * line_height {
    emit_line_break()
}

Where line_height is approximated as the current font size multiplied by the leading factor (default 1.2 if no explicit TL is set). A vertical gap exceeding approximately 1.5× the line height with no intervening content suggests a paragraph break.

Output conventions:

Line break: emit \n.
Paragraph break: emit \n\n.
Continuation on same line after vertical micro-adjustment (|Δy| < 0.1 × font_size): treat as same line, no break; this covers subscript/superscript corrections.

Avoid inserting a horizontal space when a vertical line break is also emitted, as the two are mutually exclusive for a given gap event.
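The vertical rules can be sketched as a single classifier. Two assumptions are made here that the text does not settle: an explicit TL, when set, is used directly as the line height, and a vertical delta between the micro-adjustment bound and half a line height is treated as staying on the same line. `VerticalGap` and `classify_vertical` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VerticalGap {
    SameLine,
    LineBreak,      // emit "\n"
    ParagraphBreak, // emit "\n\n"
}

/// Classify a vertical move. `leading` is the explicit TL value, if any.
pub fn classify_vertical(delta_y: f32, font_size: f32, leading: Option<f32>) -> VerticalGap {
    // Fall back to 1.2 * font_size when no usable TL is set.
    let line_height = leading
        .map(f32::abs)
        .filter(|tl| *tl > 0.0)
        .unwrap_or(1.2 * font_size);
    let dy = delta_y.abs();
    if dy > 1.5 * line_height {
        VerticalGap::ParagraphBreak
    } else if dy > 0.5 * line_height {
        VerticalGap::LineBreak
    } else {
        // Includes sub/superscript micro-adjustments (|dy| < 0.1 * font_size).
        VerticalGap::SameLine
    }
}

/// Text to emit for a vertical gap; never combined with an inferred space.
pub fn break_text(gap: VerticalGap) -> &'static str {
    match gap {
        VerticalGap::SameLine => "",
        VerticalGap::LineBreak => "\n",
        VerticalGap::ParagraphBreak => "\n\n",
    }
}
```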

8. Font-Specific Space Width

The space threshold must be font-local. A narrow condensed typeface may have an inter-word space of only 150 glyph units (15% of em), while a wide serif face may use 350 units (35%). Using a global threshold produces both false positives (spaces inserted inside words at loosely set letter pairs) and false negatives (missing spaces in dense faces).

Resolution strategy (in priority order):

Look up character code 0x20 in the font's Widths array. If present and nonzero, use it.
For CIDFonts, look up the CID that the font's CMap assigns to the space character in the W array (it is not necessarily 0x0020), then fall back to DW.
Consult the font's FontDescriptor for MissingWidth; if the space glyph is absent, this is the width assigned to unknown glyphs (often useful as a lower bound).
If all metrics are absent, use 500 glyph units as the default half-em heuristic.
Override with the adaptive histogram estimate when ≥50 inter-glyph gaps are available for the current font at the current nominal size.

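Cache the resolved space width per (font object, font_size) pair to avoid redundant lookups per glyph. Key on the font's object ID rather than its resource name, because names such as /F1 are only unique within a single resource dictionary.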
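The resolution order and the cache can be sketched together. Everything named here is hypothetical: `SpaceMetrics` stands for values the font loader has already extracted, the font is identified by a plain `u32` object number, and the font size is keyed by its bit pattern so that the example needs only the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical pre-extracted metrics relevant to the space width (1/1000 units).
pub struct SpaceMetrics {
    pub simple_0x20: Option<f32>,   // Widths entry for code 0x20
    pub cid_space: Option<f32>,     // W/DW width for the space CID
    pub missing_width: Option<f32>, // FontDescriptor MissingWidth
}

fn positive(w: &f32) -> bool {
    *w > 0.0
}

/// Resolve the space width in glyph units. A histogram estimate, when
/// present, overrides the metric chain; 500 units is the last resort.
pub fn resolve_space_units(m: &SpaceMetrics, histogram_units: Option<f32>) -> f32 {
    histogram_units
        .filter(positive)
        .or(m.simple_0x20.filter(positive))
        .or(m.cid_space.filter(positive))
        .or(m.missing_width.filter(positive))
        .unwrap_or(500.0)
}

/// Cache keyed by (font object number, font size bits).
#[derive(Default)]
pub struct SpaceWidthCache {
    map: HashMap<(u32, u32), f32>,
}

impl SpaceWidthCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Return the cached width, computing it with `resolve` on first use.
    pub fn get_or_resolve(&mut self, font_id: u32, font_size: f32, resolve: impl FnOnce() -> f32) -> f32 {
        *self
            .map
            .entry((font_id, font_size.to_bits()))
            .or_insert_with(resolve)
    }
}
```

Because a histogram estimate can arrive after the first lookup, a real cache would also need a way to replace an entry once enough gaps have been observed; that is omitted here.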

9. Multi-Column Gap vs. Word Gap

A horizontal gap exceeding approximately 2 * font_size in device space on the same baseline is not a word gap — it is a tab stop, column separator, or layout gutter. Inserting a space at such a site produces a run of text that incorrectly merges content from separate columns.

Detection heuristic: if delta_x > 2.0 * font_size and abs(delta_y) < 0.1 * font_size, classify the gap as a layout gap rather than a word gap. The appropriate response depends on the layout mode:

In single-column mode: preserve as a sequence of tab characters or whitespace (extractor-configuration-dependent).
In multi-column mode: treat as a column boundary and do not concatenate the two spans into the same text run at this point; defer ordering to the reading-order algorithm.

This decision point integrates with the column detection logic described in complex-layout-reading-order.md. The word-boundary reconstructor should expose the gap classification (word_gap, layout_gap, line_break, paragraph_break) in its span metadata so that the layout stage can consume it without re-deriving it.
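A sketch of the same-baseline classification, using the 2.0 and 0.1 factors from the heuristic above. `BaselineGap` and `classify_baseline_gap` are hypothetical names, and the `OffBaseline` variant is an assumption of this sketch: it hands the decision back to the vertical-gap logic rather than guessing.

```rust
/// Gap classification for two consecutive spans.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BaselineGap {
    NoGap,
    WordGap,
    LayoutGap,
    /// The spans are not on the same baseline; vertical rules apply instead.
    OffBaseline,
}

pub fn classify_baseline_gap(
    delta_x: f32,
    delta_y: f32,
    font_size: f32,
    word_threshold: f32,
) -> BaselineGap {
    if delta_y.abs() >= 0.1 * font_size {
        return BaselineGap::OffBaseline;
    }
    if delta_x > 2.0 * font_size {
        // Tab stop, column separator, or gutter: do not merge as a word space.
        BaselineGap::LayoutGap
    } else if delta_x > word_threshold {
        BaselineGap::WordGap
    } else {
        BaselineGap::NoGap
    }
}
```

Storing the returned value in span metadata lets the layout stage distinguish a gutter from a word space without recomputing positions.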

10. Output and Configuration

Inferred space tagging. Explicitly encoded space glyphs (character code 0x20 present in the stream) and inferred spaces (inserted by gap detection) must be distinguishable in the intermediate representation. Each inferred space span carries inferred: true in its debug metadata. This enables downstream consumers to audit false positives without reprocessing the PDF.

Configuration parameter: space_detection_threshold. Expose a per-extractor configuration value:

pub enum SpaceThreshold {
    /// Automatically select per font using the priority strategy above.
    Auto,
    /// Fixed fraction of font size (e.g., 0.25).
    FractionOfFontSize(f32),
    /// Absolute value in device-space points.
    AbsolutePoints(f32),
}

Default: SpaceThreshold::Auto. When Auto, the extractor uses font metric lookups with adaptive histogram fallback. Callers processing documents where every inter-word gap is explicit can set SpaceThreshold::AbsolutePoints(f32::MAX) to disable inference entirely.
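One way the configuration value could be turned into a concrete device-space threshold is sketched below. The enum is repeated so the example compiles on its own; the `resolve` method and its `auto_estimate` parameter (the value produced by the font-metric and histogram logic) are hypothetical additions, not part of the API described above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SpaceThreshold {
    /// Automatically select per font using the priority strategy.
    Auto,
    /// Fixed fraction of font size (e.g., 0.25).
    FractionOfFontSize(f32),
    /// Absolute value in device-space points.
    AbsolutePoints(f32),
}

impl SpaceThreshold {
    /// Resolve to a device-space threshold. `auto_estimate` is whatever the
    /// automatic strategy produced for the current font and size.
    pub fn resolve(&self, font_size: f32, auto_estimate: f32) -> f32 {
        match *self {
            SpaceThreshold::Auto => auto_estimate,
            SpaceThreshold::FractionOfFontSize(fraction) => fraction * font_size,
            SpaceThreshold::AbsolutePoints(points) => points,
        }
    }
}

/// A gap becomes an inferred space only when it exceeds the threshold.
pub fn is_inferred_space(gap: f32, threshold: f32) -> bool {
    gap > threshold
}
```

With `AbsolutePoints(f32::MAX)` no finite gap can exceed the threshold, which is why that setting disables inference.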

Per-page statistics. The PageOutput structure exposes:

pub struct PageSpaceStats {
    pub explicit_space_count: u32,
    pub inferred_space_count: u32,
    pub backtrack_event_count: u32,
    pub layout_gap_count: u32,
}

A high inferred_space_count relative to explicit_space_count (ratio > 5:1) is a strong hint that the document was produced by a TeX toolchain or a similarly space-omitting authoring system. This signal can inform downstream heuristics such as ligature normalization and hyphenation handling.
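The ratio test can be expressed without division so that a page with no explicit spaces is handled naturally. The struct is repeated so the example is self-contained; `looks_space_omitting` is a hypothetical helper, and a real caller would probably also require a minimum number of inferred spaces before trusting the result.

```rust
#[derive(Debug, Default, Clone, Copy)]
pub struct PageSpaceStats {
    pub explicit_space_count: u32,
    pub inferred_space_count: u32,
    pub backtrack_event_count: u32,
    pub layout_gap_count: u32,
}

impl PageSpaceStats {
    /// True when inferred spaces outnumber explicit ones by more than 5:1.
    /// Widening to u64 keeps the multiplication from overflowing.
    pub fn looks_space_omitting(&self) -> bool {
        u64::from(self.inferred_space_count) > 5 * u64::from(self.explicit_space_count)
    }
}
```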

Implementation Notes for Rust

Maintain a TextState struct that tracks Tc, Tw, Tz, font_size, text_matrix, and line_matrix as mutable graphics state, updated by the corresponding PDF operators.
After each glyph is rendered, record glyph_end_x (device space) as glyph_start_x + advance_device.
Before rendering the next glyph, compute expected_x from the full formula including Tc and Tz; compare actual x to expected_x; classify and emit gap events.
For TJ arrays, iterate elements in order; accumulate string runs and emit gap events at each sign-significant numeric element before consuming the next string run.
Store the font space width cache in a HashMap<(ObjectId, OrderedFloat<f32>), f32> keyed by font object ID and nominal font size to handle fonts used at multiple sizes.
The adaptive histogram should bucket gaps into bins of width 0.01 * font_size and perform a simple two-peak scan (find the global maximum, zero out ±3 bins, find the second maximum) to locate the space-width peak without a full GMM fit.
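The two-peak scan from the last note might be implemented roughly as follows. This is a sketch under assumptions: gaps are for a single font at a single size, negative and non-finite gaps are discarded, and the valley is taken as the emptiest bin between the peaks, preferring the bin closest to their midpoint when several are equally empty. `histogram_valley_threshold` is a hypothetical name.

```rust
/// Index of the fullest bin (0 for an empty histogram).
fn argmax(hist: &[u32]) -> usize {
    hist.iter()
        .enumerate()
        .max_by_key(|(_, count)| **count)
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// Estimate the space threshold from observed gaps (device units).
/// Returns None with fewer than 50 usable gaps or no second peak.
pub fn histogram_valley_threshold(gaps: &[f32], font_size: f32) -> Option<f32> {
    let gaps: Vec<f32> = gaps
        .iter()
        .copied()
        .filter(|g| g.is_finite() && *g >= 0.0)
        .collect();
    if gaps.len() < 50 || !(font_size > 0.0) {
        return None;
    }

    // Bins of width 0.01 * font_size.
    let bin_width = 0.01 * font_size;
    let max_gap = gaps.iter().copied().fold(0.0_f32, f32::max);
    let n_bins = (max_gap / bin_width) as usize + 1;
    let mut hist = vec![0u32; n_bins];
    for g in &gaps {
        let bin = ((g / bin_width) as usize).min(n_bins - 1);
        hist[bin] += 1;
    }

    // First peak: global maximum. Second peak: maximum after masking +/-3 bins.
    let p1 = argmax(&hist);
    let mut masked = hist.clone();
    let start = p1.saturating_sub(3);
    let end = (p1 + 3).min(n_bins - 1);
    for count in &mut masked[start..=end] {
        *count = 0;
    }
    let p2 = argmax(&masked);
    if masked[p2] == 0 {
        return None; // unimodal: no separate space-width peak
    }

    // Valley: emptiest bin between the peaks, ties broken toward the midpoint.
    let (lo, hi) = (p1.min(p2), p1.max(p2));
    let twice_mid = lo + hi;
    let valley = (lo..=hi).min_by_key(|&i| (hist[i], (2 * i).abs_diff(twice_mid)))?;
    Some((valley as f32 + 0.5) * bin_width)
}
```

For 40 gaps near 0.05 and 20 gaps near 2.55 at font size 10, the peaks fall in bins 0 and 25 and the threshold lands near the middle of the empty stretch between them.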
