Add six research documents covering output-side extraction topics

- table-structure-reconstruction: line detection, gap analysis, Hough
  transform, graph-based cell reconstruction, merged cells, multi-page tables
- mathematical-expression-handling: five encoding cases, OpenType MATH table,
  symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers
- language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi,
  CJK vertical text, ligature normalization, whatlang/lingua integration
- document-classification-and-zone-labeling: margin heuristics, font
  clustering, cross-page recurrence, footnote/caption/sidebar detection
- post-extraction-normalization: hyphen handling, ligature expansion,
  paragraph reconstruction, Unicode normalization, pipeline ordering
- chunking-for-llm-consumption: semantic snapping, heading hierarchy,
  sliding window overlap, table chunking strategies, token budget, late chunking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jedarden 2026-05-16 14:56:25 -04:00
parent ef9c03095d
commit b805593973
6 changed files with 1131 additions and 0 deletions


@ -0,0 +1,198 @@
# Chunking for LLM Consumption
**Project:** pdftract — Rust PDF text extraction library
**Scope:** Algorithms and output formats for chunking structured extraction results into LLM-ready segments
---
## 1. Why Chunking Is a pdftract Concern
A PDF extraction pipeline typically ends at flat text. The consuming application then applies chunking — splitting that text into segments sized for embedding models or LLM context windows. This division of labor has a significant defect: the chunker must re-infer structure that the extractor already computed.
pdftract operates at the block level. Each block carries a `kind` (paragraph, heading, table, figure, footnote, list item), a `bbox` (bounding box on the page), a `zone` label (body, sidebar, header, footer, caption), and full Unicode text. These properties encode the semantic structure of the document. A paragraph boundary in pdftract's output is not a heuristic — it is derived from the PDF's glyph stream geometry and, where present, its logical structure tree.
A downstream chunker working from a flat string has none of this. It must guess paragraph boundaries from double newlines, infer heading levels from font-size differences it cannot see, and split tables it cannot identify. Every inference the downstream tool makes is a degraded approximation of what pdftract already resolved with precision.
The practical consequence is chunk contamination: a heading gets merged with the paragraph preceding it from the previous section; a table row straddles a chunk boundary; a footnote bleeds into body text. Each of these reduces embedding quality and retrieval precision.
Exposing chunking as a built-in output mode — a `--mode chunks` flag or a `chunks` field in the JSON envelope — allows pdftract to apply its structural knowledge directly. The semantic boundaries are already known at extraction time. Chunking is the correct layer at which to use them.
---
## 2. Semantic Boundary Types
pdftract identifies several block transition types that make natural, high-quality chunk boundaries:
- **Heading transitions.** A block with `kind: heading` at any level (H1, H2, H3) marks the start of a new document section. This is the strongest semantic boundary available.
- **Paragraph breaks.** Adjacent paragraph blocks with no heading between them represent continuous prose in the same section. The gap between them is a valid split point.
- **Table boundaries.** A `kind: table` block is a self-contained unit with a defined start and end. Splitting inside a table loses row coherence and column semantics.
- **Figure and caption units.** A `kind: figure` block paired with an adjacent `kind: caption` block should be kept together. Separating them makes the caption uninterpretable in retrieval.
- **Footnote blocks.** `kind: footnote` blocks often belong to specific body paragraphs by reference number. They are candidates for inclusion with their referencing paragraph or for separate indexing, but should not straddle arbitrary boundaries.
- **List boundaries.** A sequence of `kind: list_item` blocks forms a unit. Splitting a list mid-item degrades readability and breaks the syntactic completeness of the item.
Each of these is already labeled in pdftract's block output. A chunker with access to the block stream can use these labels directly without any re-inference.
---
## 3. Fixed-Size Chunking with Semantic Snapping
The baseline chunking strategy targets a maximum of N tokens per chunk. Naive fixed-size chunking splits at exactly N tokens, producing fragments that end mid-sentence or mid-paragraph.
Semantic snapping improves on this: accumulate blocks until the token budget is reached, then extend or retract to the nearest clean semantic boundary before closing the chunk. In practice, this means:
1. Accumulate blocks in order.
2. After adding each block, check whether the running token estimate exceeds the target.
3. When it does, close the chunk at the end of the current block (if the block itself is within budget) or at the last sentence boundary within the current block's text.
4. Begin the next chunk at the start of the next block.
This approach keeps block integrity. A paragraph that fits within the budget is never split. A paragraph that exceeds the budget is split at a sentence boundary — identified by terminal punctuation followed by whitespace — rather than at a character offset.
Blocks larger than the target chunk size (long tables, large prose paragraphs) require special handling. For prose blocks, split on sentence boundaries and emit each sentence group as its own chunk, preserving the block's metadata (page, zone, heading context) on every sub-chunk. For table blocks, see Section 6.
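The accumulation loop above can be sketched as follows. This is a minimal illustration, not pdftract's implementation: the `Block` struct and `snap_chunks` function are hypothetical names, the sketch uses the "retract" variant (it closes the chunk before a block that would overflow the budget), and sentence-level splitting of oversized blocks is left out.

```rust
/// Minimal stand-in for a pdftract block; the field names are assumptions.
#[derive(Debug, Clone)]
pub struct Block {
    pub text: String,
    pub token_estimate: usize,
}

/// Groups whole blocks into chunks of at most `max_tokens` estimated tokens.
/// A block that alone exceeds the budget becomes its own chunk here; splitting
/// such a block at sentence boundaries is not shown in this sketch.
pub fn snap_chunks(blocks: &[Block], max_tokens: usize) -> Vec<Vec<Block>> {
    let mut chunks = Vec::new();
    let mut current: Vec<Block> = Vec::new();
    let mut running = 0usize;

    for block in blocks {
        // Close the current chunk before this block would push it over budget.
        if !current.is_empty() && running + block.token_estimate > max_tokens {
            chunks.push(std::mem::take(&mut current));
            running = 0;
        }
        running += block.token_estimate;
        current.push(block.clone());
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Because the unit of accumulation is the block, a chunk boundary can only fall between blocks, which is the property the section relies on.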
---
## 4. Heading-Based Hierarchical Chunking
Heading-based chunking uses H1/H2/H3 transitions as primary split points, producing chunks that correspond to document sections rather than token windows.
The algorithm builds a document tree from the heading block sequence:
1. Scan the block stream in order.
2. When a heading block is encountered, pop any heading on the stack at the same or a deeper level, then push the new heading (an H2 pops a preceding H2 or H3 but not a preceding H1).
3. Accumulate subsequent non-heading blocks as children of the current heading node.
4. Each leaf node (heading + its body blocks) becomes a chunk candidate.
Every chunk inherits the full heading path from root to its immediate heading, forming a breadcrumb: `["Introduction", "Background", "Prior Work"]`. This breadcrumb is included in the chunk's metadata and optionally prepended to the chunk text so that embedding models encode the section context alongside the content.
For very large sections (a single H2 section spanning 4,000 tokens), hierarchical chunking falls back to paragraph-boundary splitting within the section, carrying the heading breadcrumb forward on each sub-chunk.
Documents with no headings degrade gracefully to paragraph-boundary chunking. The heading breadcrumb is omitted or replaced with a page-range label.
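A minimal sketch of the heading stack, assuming a simplified block stream: `Item` and `breadcrumbs` are illustrative names, not part of pdftract's API. Each body block is paired with the heading path in effect when it is reached.

```rust
/// Simplified block kinds; `level` is 1 for H1, 2 for H2, and so on.
pub enum Item {
    Heading { level: u8, text: String },
    Body(String),
}

/// Returns each body block paired with its heading breadcrumb.
pub fn breadcrumbs(items: &[Item]) -> Vec<(Vec<String>, String)> {
    let mut stack: Vec<(u8, String)> = Vec::new();
    let mut out = Vec::new();
    for item in items {
        match item {
            Item::Heading { level, text } => {
                // Pop headings at the same or a deeper level, then push.
                while stack.last().map_or(false, |(l, _)| *l >= *level) {
                    stack.pop();
                }
                stack.push((*level, text.clone()));
            }
            Item::Body(text) => {
                let path: Vec<String> = stack.iter().map(|(_, t)| t.clone()).collect();
                out.push((path, text.clone()));
            }
        }
    }
    out
}
```

A document with no headings yields an empty breadcrumb for every body block, which matches the fallback described above.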
---
## 5. Sliding Window with Overlap
RAG retrieval systems suffer from boundary loss: a query whose answer spans two adjacent chunks retrieves neither chunk with high confidence because the relevant content is split across a boundary. Sliding window chunking with overlap addresses this by including a suffix of the previous chunk at the start of the current one.
Typical overlap sizing is 10–20% of the target chunk size. For a 512-token target, 64–100 tokens of overlap is standard. Overlap beyond 25% produces diminishing returns while significantly inflating index size.
Semantic snapping interacts with overlap in a non-obvious way: the overlap region should not begin mid-sentence. When computing the overlap suffix, walk backward from the chunk boundary to the nearest sentence start, then include from that point forward. This ensures the overlap text is syntactically complete and embeds correctly.
Overlap helps when:
- Queries target local context (a specific fact, a named entity, a numeric value) that might fall near a chunk boundary.
- Documents are dense prose with high local coherence.
Overlap hurts when:
- Documents are primarily tabular or list-based (overlap duplicates structured data without semantic benefit).
- The embedding model has a very short context window (overlap consumes budget needed for content).
- Index size is a hard constraint (every overlapping token appears in two chunk embeddings).
pdftract's block structure supports overlap implementation cleanly: overlap is measured in blocks (include the last M blocks of the previous chunk at the start of the current one) rather than in raw characters, preserving semantic integrity.
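Block-level overlap can be sketched in a few lines. The function below is hypothetical and assumes chunks are represented as lists of block indices into the document's block stream.

```rust
/// Prepends the last `m` blocks of each chunk to the chunk that follows it.
/// Chunks are lists of block indices; the first chunk is left unchanged.
pub fn add_block_overlap(chunks: &[Vec<usize>], m: usize) -> Vec<Vec<usize>> {
    let mut out = Vec::with_capacity(chunks.len());
    for (i, chunk) in chunks.iter().enumerate() {
        let mut merged = Vec::new();
        if i > 0 {
            let prev = &chunks[i - 1];
            // Take at most `m` trailing blocks, fewer if the chunk is short.
            let start = prev.len().saturating_sub(m);
            merged.extend_from_slice(&prev[start..]);
        }
        merged.extend_from_slice(chunk);
        out.push(merged);
    }
    out
}
```

The overlap is taken from the original (non-overlapped) previous chunk, so overlap does not accumulate from one chunk to the next.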
---
## 6. Table Handling in Chunks
Tables require special treatment because row-level coherence is critical for embedding quality. Three strategies are viable, each with distinct tradeoffs:
**A. Whole-table as single chunk.** Emit every table as one chunk regardless of size. This preserves row and column relationships completely. The drawback is unbounded chunk size — a 200-row financial table becomes a single embedding that may exceed model context limits and produces a coarse retrieval unit.
**B. Row-boundary splitting with header repetition.** Split the table into N-row segments, repeating the header row at the start of each segment. This bounds chunk size while preserving column semantics. The repeated header adds token overhead (proportional to column count and row segment count) but makes each sub-chunk independently interpretable. This is the recommended strategy for wide or long tables.
**C. Serialize as markdown within surrounding prose.** Convert the table to GitHub-flavored markdown and include it in the prose chunk that precedes or follows it. This works well for small tables (2–5 rows) embedded in analytical text where the table is subordinate to the prose argument. It fails for large tables where the serialized markdown dominates the chunk and overwhelms the prose context.
The appropriate strategy depends on table size and document type. pdftract can expose a `table_chunk_strategy` parameter with values `single`, `row_split`, and `inline_markdown`.
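Strategy B (`row_split`) can be illustrated with a short sketch. The function name and the representation of a table as rows of cell strings are assumptions made for the example.

```rust
/// Splits table rows into segments of at most `rows_per_chunk` data rows,
/// repeating `header` as the first row of every segment.
pub fn split_table(
    header: &[String],
    rows: &[Vec<String>],
    rows_per_chunk: usize,
) -> Vec<Vec<Vec<String>>> {
    assert!(rows_per_chunk > 0, "rows_per_chunk must be positive");
    rows.chunks(rows_per_chunk)
        .map(|segment| {
            let mut part = Vec::with_capacity(segment.len() + 1);
            part.push(header.to_vec()); // repeated header row
            part.extend_from_slice(segment);
            part
        })
        .collect()
}
```

In practice `rows_per_chunk` would be derived from `max_tokens` and the estimated token cost of the header and of a typical row; a fixed row count keeps the sketch short.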
---
## 7. Token Budget Awareness
Chunk size must be measured in tokens, not characters, because language models have token-count context limits and embedding models have token-count input limits. The character-to-token ratio is not fixed: English prose averages roughly 4 characters per token under byte-pair encoding; CJK text averages 1–2 characters per token due to high-entropy characters that do not merge into multi-character tokens.
pdftract should implement a fast token estimator that does not depend on a specific model's tokenizer. A practical approach:
- For ASCII-dominant text, estimate `token_count ≈ char_count / 4.0`.
- For text with high Unicode density (detected via codepoint range sampling), adjust the denominator toward 1.5–2.0.
- For mixed content, compute a weighted average based on character class proportions.
This estimate is exposed as `token_estimate` in chunk output and used internally to enforce `max_tokens` budget limits. The estimate is intentionally conservative (slightly over-counts) to avoid producing chunks that overflow model context limits at inference time.
`max_tokens` should be a first-class chunking parameter alongside `strategy` and `overlap_tokens`.
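A tokenizer-independent estimator along these lines might look like the sketch below. The split into ASCII versus non-ASCII characters and the 1.5 divisor for non-ASCII text are simplifying assumptions; summing the per-class estimates is equivalent to the weighted average described above.

```rust
/// Rough token estimate that does not depend on any model's tokenizer.
/// ASCII characters are costed at 4 chars per token, all other characters
/// at 1.5 chars per token; both figures are tuning assumptions.
pub fn estimate_tokens(text: &str) -> usize {
    let (mut ascii, mut other) = (0usize, 0usize);
    for c in text.chars() {
        if c.is_ascii() { ascii += 1 } else { other += 1 }
    }
    let estimate = ascii as f64 / 4.0 + other as f64 / 1.5;
    // Round up so the estimate errs toward over-counting.
    estimate.ceil() as usize
}
```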
---
## 8. Metadata per Chunk
Every chunk emitted by pdftract must carry the following metadata fields:
- **`pages`** — the 1-indexed page range covered by the chunk's source blocks.
- **`heading_breadcrumb`** — ordered array of heading texts from the document root to the section containing this chunk.
- **`zone`** — the dominant zone label of the chunk's blocks (`body`, `sidebar`, `header`, `footer`, `caption`). Determined by the zone label appearing in the majority of the chunk's blocks by character count.
- **`char_offset_start` / `char_offset_end`** — character offsets into the full document text (defined as the concatenation of all block texts in document order). These enable citation generation: given a chunk retrieved by a RAG system, the citing application can locate the exact span in the source document.
- **`chunk_index`** — zero-indexed position of this chunk in the full chunk sequence.
- **`total_chunks`** — total number of chunks emitted for the document.
This metadata feeds retrieval ranking (prefer body-zone chunks over sidebar-zone chunks for general queries), citation generation (reconstruct the page and paragraph reference), and debug inspection (verify chunk boundaries align with document structure).
---
## 9. Late Chunking Compatibility
Late chunking is a retrieval technique where the full document is passed to a long-context embedding model and the resulting token embeddings are pooled per chunk region after the forward pass. This preserves global document context in local chunk embeddings — a quality improvement over independent chunk embedding.
Late chunking requires two things from the extraction layer: (a) the full document text as a single string, and (b) the character or token offsets of each chunk within that string, so that the post-pass pooling step knows which embeddings to aggregate.
pdftract can expose a `full_text_with_offsets` mode that emits:
1. A single `full_text` string — the concatenation of all block texts in reading order with standardized separators.
2. A `chunks` array where each entry contains only `char_offset_start`, `char_offset_end`, and the metadata fields from Section 8 (no repeated text).
The consuming application passes `full_text` to the embedding model and uses the offset array to pool the resulting embedding matrix. This decouples chunking strategy from the embedding call, allowing the same pdftract output to drive both standard independent-chunk embedding and late-chunking pipelines without re-extraction.
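Building `full_text` and the matching offsets could be sketched as follows. The function name and the choice to count offsets in Unicode scalar values (rather than bytes) are assumptions; whichever unit is chosen must be documented, since the consumer has to map offsets onto token positions.

```rust
/// Joins block texts with `sep` and records each block's character span.
/// Offsets count Unicode scalar values, not bytes; end offsets are exclusive.
pub fn full_text_with_offsets(blocks: &[&str], sep: &str) -> (String, Vec<(usize, usize)>) {
    let mut full = String::new();
    let mut spans = Vec::with_capacity(blocks.len());
    let mut pos = 0usize;
    let sep_len = sep.chars().count();
    for (i, text) in blocks.iter().enumerate() {
        if i > 0 {
            // Separators advance the position but belong to no block span.
            full.push_str(sep);
            pos += sep_len;
        }
        let len = text.chars().count();
        spans.push((pos, pos + len));
        full.push_str(text);
        pos += len;
    }
    (full, spans)
}
```

A chunk's `char_offset_start` and `char_offset_end` are then the start of its first block span and the end of its last.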
---
## 10. Output Format
When chunking mode is enabled, pdftract emits a top-level `chunks` array in its JSON output. Each element conforms to:
```json
{
"chunk_index": 0,
"total_chunks": 42,
"text": "...",
"token_estimate": 380,
"pages": [3, 4],
"heading_breadcrumb": ["Introduction", "Background"],
"zone": "body",
"char_offset_start": 1240,
"char_offset_end": 2890
}
```
Field semantics:
| Field | Type | Description |
|---|---|---|
| `chunk_index` | integer | Zero-based position in the chunk sequence |
| `total_chunks` | integer | Total chunks in this document |
| `text` | string | The chunk's full text content |
| `token_estimate` | integer | Estimated token count (conservative BPE estimate) |
| `pages` | integer[] | 1-indexed page numbers spanned by this chunk |
| `heading_breadcrumb` | string[] | Heading path from document root to this chunk's section |
| `zone` | string | Dominant zone label of source blocks |
| `char_offset_start` | integer | Start offset in the full document text string |
| `char_offset_end` | integer | End offset (exclusive) in the full document text string |
The `text` field is omitted in `full_text_with_offsets` mode, where the consuming application derives text from the full document string using the offset pair.
Chunking parameters are specified in the extraction request:
```json
{
"strategy": "heading_hierarchical",
"max_tokens": 512,
"overlap_tokens": 64,
"table_chunk_strategy": "row_split"
}
```
Valid `strategy` values: `fixed_size`, `heading_hierarchical`, `sliding_window`, `full_text_with_offsets`.


@ -0,0 +1,176 @@
# Document Classification and Zone Labeling
## Overview
After raw text extraction, each glyph or span has a position, font reference, and character content — but no semantic role. Zone labeling is the process of assigning a role to each text block: `body`, `heading`, `header`, `footer`, `footnote`, `caption`, `sidebar`, `marginalia`, or `page_number`. This pass runs after block assembly (grouping spans into lines and lines into paragraphs) but before reading-order resolution.
---
## 1. Why Zone Labeling Matters
Without zone labeling, extracted text is a raw positional dump. The damage is concrete:
- **Running headers interleaved with body paragraphs.** A header reading "Chapter 3: Results" appears between sentences because its y-coordinate places it between two body blocks on the same page.
- **Page numbers embedded mid-sentence.** A numeric "42" extracted in column order falls between the last word of one paragraph and the first word of the next.
- **Footnote markers disrupting prose flow.** Superscript `³` extracted inline pulls the following footnote text — located at the bottom of the page — into the paragraph body.
- **Sidebar text inserted at random positions.** A pull quote in the right margin, if read left-to-right by x-coordinate, bisects the main-column paragraph it is adjacent to.
The cost compounds downstream. Language models, search indexers, and screen readers all treat the extracted string as coherent prose. Injected non-body content corrupts sentence boundary detection, keyword density, and logical paragraph structure. Zone labeling is the gate that filters what reaches the output string.
---
## 2. Page Margin Heuristics
The simplest zone signals are geometric: headers and footers live at fixed vertical positions near the page boundary.
**Threshold definition.** With y measured downward from the top of the page (native PDF user space puts the origin at the bottom-left, so flip y first), define `header_zone_max_y` as the y-coordinate above which a block must lie entirely to be considered a header candidate. A reliable default is 10–15% of page height. Similarly, `footer_zone_min_y` is the y-coordinate below which a block must lie entirely to be a footer candidate, set 10–15% of page height up from the bottom edge.
```
header_zone_max_y = page_height * 0.12
footer_zone_min_y = page_height - page_height * 0.12
```
Blocks whose bounding box falls entirely within these bands are header/footer candidates, not yet confirmed.
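The band test can be sketched as below, assuming bounding boxes have already been converted to a top-origin coordinate system. `BBox`, `Band`, and `margin_band` are illustrative names.

```rust
/// Vertical extent of a block, with y measured downward from the page top.
pub struct BBox {
    pub y0: f32, // top edge
    pub y1: f32, // bottom edge
}

#[derive(Debug, PartialEq)]
pub enum Band {
    Header,
    Footer,
    Neither,
}

/// Classifies a block as a header or footer candidate when it lies entirely
/// inside the top or bottom `fraction` of the page.
pub fn margin_band(bbox: &BBox, page_height: f32, fraction: f32) -> Band {
    let header_zone_max_y = page_height * fraction;
    let footer_zone_min_y = page_height - page_height * fraction;
    if bbox.y1 <= header_zone_max_y {
        Band::Header
    } else if bbox.y0 >= footer_zone_min_y {
        Band::Footer
    } else {
        Band::Neither
    }
}
```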
**Page number pattern detection.** Within the footer (or header) band, apply regex against the extracted text:
```
^\d+$ // bare number: 42
^Page\s+\d+(\s+of\s+\d+)?$ // Page 3 of 12
^[ivxlcdmIVXLCDM]+$ // roman numerals: xiv
^[-–—]\s*\d+\s*[-–—]$      // dash framing: — 42 —
```
A block matching any of these within the margin band receives label `page_number` at high confidence (≥ 0.90).
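For illustration, the four patterns can be approximated without a regex dependency. The hand-written matcher below is a sketch with a hypothetical function name; a real implementation could equally compile the regexes above with a regex library.

```rust
fn is_digits(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

/// Approximates the four page-number patterns listed above.
pub fn is_page_number(text: &str) -> bool {
    let t = text.trim();
    // Bare number: "42"
    if is_digits(t) {
        return true;
    }
    // Roman numerals: "xiv"
    if !t.is_empty() && t.chars().all(|c| "ivxlcdmIVXLCDM".contains(c)) {
        return true;
    }
    // "Page 3" or "Page 3 of 12"
    let words: Vec<&str> = t.split_whitespace().collect();
    match words.as_slice() {
        ["Page", n] if is_digits(n) => return true,
        ["Page", n, "of", m] if is_digits(n) && is_digits(m) => return true,
        _ => {}
    }
    // Dash framing with a hyphen, en dash, or em dash on both sides.
    let is_dash = |c: char| matches!(c, '-' | '\u{2013}' | '\u{2014}');
    let mut chars = t.chars();
    if let (Some(first), Some(last)) = (chars.next(), chars.next_back()) {
        if is_dash(first) && is_dash(last) {
            return is_digits(chars.as_str().trim());
        }
    }
    false
}
```

Like the roman-numeral regex, this accepts short words made only of the letters `ivxlcdm` (for example "mix"), which is one reason the match is only trusted inside the margin band.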
**Stability filter.** A single page cannot confirm a header or footer — any text can appear near the top by chance. Apply the stability filter (described in section 5) before committing the label.
---
## 3. Font-Based Classification
Font metadata distinguishes heading hierarchy from body text, and body from ancillary text like captions and footnotes.
**Build a font inventory.** On first pass over the document, collect `(font_name, font_size, is_bold, is_italic)` tuples from every span. Normalize font sizes to points. Cluster sizes into bins using a simple histogram with a 0.5pt merge tolerance to collapse rounding artifacts. The bin with the highest total character count is the **dominant body size** — call it `body_pt`.
**Heading detection.** A block where all spans share a font size `> body_pt * 1.25` and `is_bold == true` is a strong heading candidate. Multiple heading levels are recoverable by ordered font-size clustering: the largest non-body size is `h1`, the next is `h2`, and so on, up to three levels before the signal becomes unreliable.
**Caption and footnote detection.** Blocks where the font size is `< body_pt * 0.85` are small-text candidates. Combine with position (bottom-of-page for footnotes, adjacent to a whitespace gap for captions) and font style (often italic for captions) to disambiguate.
**Dominance rule.** If a block mixes body-sized and heading-sized spans (e.g., a sentence with a bold lead word), classify by the dominant span — the one covering more than 60% of character width.
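Finding `body_pt` from the font inventory can be sketched as a small histogram. The input shape (pairs of font size and character count) and the function name are assumptions, and snapping sizes to 0.5pt bins is a simple approximation of the merge tolerance described above.

```rust
use std::collections::BTreeMap;

/// Returns the dominant body size in points from (font_size_pt, char_count)
/// pairs. Sizes are snapped to 0.5pt bins to absorb rounding artifacts.
pub fn dominant_body_size(spans: &[(f32, usize)]) -> Option<f32> {
    let mut bins: BTreeMap<i32, usize> = BTreeMap::new();
    for &(size_pt, chars) in spans {
        let bin = (size_pt * 2.0).round() as i32; // bin index in half points
        *bins.entry(bin).or_insert(0) += chars;
    }
    // The bin with the highest total character count is the body size.
    bins.into_iter()
        .max_by_key(|&(_, chars)| chars)
        .map(|(bin, _)| bin as f32 / 2.0)
}
```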
---
## 4. Positional Heuristics
**Centred text as heading signal.** Compute the horizontal midpoint of a block's bounding box. If it falls within 5% of page width from the page centre, and the block is a single line, raise the heading confidence. Centring alone is not sufficient — font size must also exceed body size.
**Indentation patterns.** Measure the left-edge x-coordinate of the first line vs. subsequent lines in a paragraph block. Standard body paragraphs have a consistent left margin with optional first-line indent (positive or negative). A hanging indent — where the first line starts further left than continuation lines — is a strong footnote or bibliography signal. A large positive indent on every line suggests a block quote.
**Column boundary detection.** Collect the left-edge x-coordinates of all body-candidate blocks on a page. Cluster them; two tight clusters indicate a two-column layout, defining column boundaries. Any block whose x-origin falls outside both columns and within the page margin is a marginalia candidate.
**Outer margin detection.** For a single-column document, define the body column as the region bounded by the median left and right x-extents of body blocks (±5% tolerance). Text that starts to the right of `body_right + page_width * 0.05` is marginalia.
---
## 5. Cross-Page Consistency
A text block that recurs at the same position across multiple pages is definitionally a running element — header or footer — regardless of whether it triggered the margin-band heuristic.
**Position fingerprint.** For each page, record `(y_normalized, height, width)` for every candidate block, where `y_normalized = block_top / page_height`. Two blocks across pages are positionally equivalent if their `y_normalized` values differ by less than 0.01 and their widths are within 5%.
**Sliding window.** Process pages in groups of five (or fewer at document boundaries). A block position that appears in at least four of five consecutive pages is a running element. Assign `header` or `footer` based on whether it sits in the top or bottom margin band; if it falls outside both bands but recurs consistently, assign the closer one.
**Recto/verso alternation.** Academic and book PDFs often alternate left-aligned headers on even pages (verso) with right-aligned headers on odd pages (recto). Detect this by checking whether positionally equivalent blocks alternate between page-parity groups. When alternation is confirmed, apply the header label to both positions. Text content may differ (e.g., chapter title vs. section title); only position need match.
**Recurring text fragments.** Normalize extracted text (trim whitespace, collapse runs) and hash each candidate block. A hash appearing on more than 50% of pages is a strong running-element signal even if position varies slightly (e.g., centred headers on different-width pages).
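The position fingerprint and the four-of-five window can be sketched as follows. The struct and function names are hypothetical, and shrinking the window and hit threshold for documents shorter than the window is an assumption about how document boundaries are handled.

```rust
/// Normalized position of a candidate block on one page.
#[derive(Clone, Copy)]
pub struct Fingerprint {
    pub y_normalized: f32, // block_top / page_height
    pub width: f32,
}

/// Two blocks are positionally equivalent if their normalized y differs by
/// less than 0.01 and their widths are within 5% of each other.
pub fn equivalent(a: Fingerprint, b: Fingerprint) -> bool {
    let dy = (a.y_normalized - b.y_normalized).abs();
    let dw = (a.width - b.width).abs() / a.width.max(b.width);
    dy < 0.01 && dw <= 0.05
}

/// True if `target` recurs on at least `min_hits` pages within some run of
/// `window` consecutive pages. `pages[i]` holds page i's candidate blocks.
pub fn is_running_element(
    pages: &[Vec<Fingerprint>],
    target: Fingerprint,
    window: usize,
    min_hits: usize,
) -> bool {
    let hits: Vec<bool> = pages
        .iter()
        .map(|page| page.iter().any(|&fp| equivalent(fp, target)))
        .collect();
    if window == 0 || hits.is_empty() {
        return false;
    }
    // Short documents use a smaller window and threshold (an assumption).
    let w = window.min(hits.len());
    let needed = min_hits.min(w);
    hits.windows(w)
        .any(|win| win.iter().filter(|&&h| h).count() >= needed)
}
```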
---
## 6. Footnote Detection
Footnote detection requires matching two artifacts: the inline marker and the footnote body.
**Inline markers.** During span assembly, track spans where `font_size < body_pt * 0.75` and the span baseline is raised above the line baseline by more than 2pt. These are superscript candidates. Extract the character: if it is a digit, letter, or standard footnote symbol (†, ‡, §, ¶), record it as a marker with its position.
**Footnote body location.** On the same page, look for blocks in the lower region (below `page_height * 0.65`) that begin with a matching marker character, optionally followed by a period or space. The block's font size is typically `< body_pt * 0.85`.
**Separator rule.** Many PDF producers render a short horizontal rule (a thin rectangle path, typically 30–50% of column width, 0.5–1pt thick) immediately above the footnote area. When such a path is detected, all text blocks below it and above the footer band are footnote candidates, raising their confidence.
**Overflow footnotes.** A footnote body that begins on page N and continues on page N+1 has no marker on page N+1. Detect this by tracking whether the last footnote block on a page ends mid-sentence (no terminal punctuation followed by whitespace). If so, the first small-font block at the bottom of the next page inherits the footnote label.
---
## 7. Caption Detection
**Proximity to image placeholders.** PDF image XObjects (type `/XObject`, subtype `/Image`) and form XObjects used as figures occupy rectangular regions on the page. After extracting all XObject bounding boxes, identify text blocks whose bounding box top is within `body_pt * 3` of an XObject's bottom (for below-figure captions) or whose bottom is within the same threshold of an XObject's top (for above-figure captions).
**Prefix pattern.** Apply regex to the block's first token:
```
^(Figure|Fig\.|Table|Tbl\.|Scheme|Plate|Exhibit|Supplementary\s+Figure)\s+\d+
```
A prefix match raises caption confidence to ≥ 0.85 independent of position.
**Short block heuristic.** Captions are rarely longer than three lines. If a block adjacent to an image XObject contains more than three lines, treat only the first three as caption and reclassify the remainder as body.
---
## 8. Sidebar and Pull Quote Detection
**Narrow column detection.** A sidebar occupies a column significantly narrower than the main body column. If body column width is `W`, a block whose bounding box width is `< W * 0.45` and whose x-extent overlaps the body column by at least 10% is a sidebar candidate.
**Font differentiation.** Pull quotes are typically set in a larger or italic typeface to distinguish them visually from body text. A block that is bold or italic, centred or right-aligned, and horizontally overlaps the main column is a pull quote candidate. Assign label `sidebar` for narrow-column placement, or remain `body` with reduced confidence if the signal is ambiguous.
**Bounding box overlap logic.** Compute the intersection-over-union (IoU) of the candidate block and the main body column rectangle. IoU above 0.3 but below 0.9 indicates a partial overlap consistent with sidebar placement.
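For reference, IoU of two axis-aligned rectangles can be computed as in this sketch; the `Rect` type is an illustrative stand-in for whatever bounding-box type pdftract uses.

```rust
#[derive(Clone, Copy)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

impl Rect {
    /// Area, treating inverted (empty) rectangles as zero.
    fn area(&self) -> f32 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }
}

/// Intersection-over-union of two axis-aligned rectangles, in [0.0, 1.0].
pub fn iou(a: Rect, b: Rect) -> f32 {
    let inter = Rect {
        x0: a.x0.max(b.x0),
        y0: a.y0.max(b.y0),
        x1: a.x1.min(b.x1),
        y1: a.y1.min(b.y1),
    }
    .area();
    let union = a.area() + b.area() - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}
```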
---
## 9. Confidence and Fallback
Each block receives a `zone_confidence: f32` in [0.0, 1.0] computed from a weighted sum of signals:
| Signal | Weight |
|---|---|
| Margin band (geometric) | 0.30 |
| Font size deviation from body | 0.25 |
| Cross-page recurrence | 0.25 |
| Regex / prefix pattern match | 0.15 |
| Positional heuristic (indent, centre) | 0.05 |
Weights are normalized per label. When no label achieves confidence ≥ 0.50, default to `body`. This is the safe fallback: false negatives (unlabeled headers/footers passed through as body) are preferable to false positives (body text discarded as a header).
Expose the confidence in output so callers can tune their own threshold. A caller building a full-text search index may accept all blocks regardless of zone. A caller building a clean prose renderer may filter to `zone == body && zone_confidence >= 0.70`.
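The weighted sum and the `body` fallback could be sketched as follows. The `Signals` struct, its field names, and the assumption that each signal is pre-scaled to [0.0, 1.0] are illustrative; per-label weight normalization is omitted.

```rust
/// Per-signal scores in [0.0, 1.0] for one candidate label.
pub struct Signals {
    pub margin_band: f32,
    pub font_size: f32,
    pub recurrence: f32,
    pub pattern: f32,
    pub positional: f32,
}

/// Weighted sum using the weights from the table above.
pub fn zone_confidence(s: &Signals) -> f32 {
    let score = 0.30 * s.margin_band
        + 0.25 * s.font_size
        + 0.25 * s.recurrence
        + 0.15 * s.pattern
        + 0.05 * s.positional;
    score.clamp(0.0, 1.0)
}

/// Picks the highest-confidence label, defaulting to "body" below 0.50.
pub fn pick_zone<'a>(candidates: &[(&'a str, f32)]) -> &'a str {
    let mut best: Option<(&'a str, f32)> = None;
    for &(label, conf) in candidates {
        if best.map_or(true, |(_, b)| conf > b) {
            best = Some((label, conf));
        }
    }
    match best {
        Some((label, conf)) if conf >= 0.50 => label,
        _ => "body",
    }
}
```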
---
## 10. Output Representation
Every block in the JSON output carries:
```json
{
"text": "...",
"zone": "body",
"zone_confidence": 0.82,
"bbox": { "x0": 72.0, "y0": 144.0, "x1": 540.0, "y1": 160.5 },
"page": 3
}
```
Valid `zone` values:
| Value | Description |
|---|---|
| `body` | Main prose content |
| `heading` | Section or chapter heading |
| `header` | Running page header |
| `footer` | Running page footer |
| `footnote` | Footnote body text |
| `caption` | Figure, table, or scheme caption |
| `sidebar` | Sidebar or pull quote |
| `marginalia` | Margin annotation or note |
| `page_number` | Standalone page number |
The `zone` field is always present. `zone_confidence` is always a finite `f32`. Callers that want unfiltered text iterate all blocks; callers that want clean prose filter to `zone == "body"` or `zone in ["body", "heading"]`. Zone information is never used to modify `text` content — it is metadata only.


@ -0,0 +1,214 @@
# Language Detection and Script Handling in pdftract
## Overview
Multilingual PDF documents expose three distinct problems for a text extraction library: identifying which Unicode script a sequence of codepoints belongs to, reconstructing logical order from glyphs that may have been stored in visual order, and normalizing script-specific presentation variants to canonical Unicode forms. This document covers each problem, the relevant standards, and the implementation strategy for `pdftract`.
---
## 1. Script Detection from Glyph Data
### Unicode Script Property (UAX #24)
Every Unicode codepoint carries a `Script` property defined in UAX #24. The Unicode Character Database (UCD) ships `Scripts.txt` and the companion `ScriptExtensions.txt`. Script extensions matter because some codepoints (most common-use punctuation, digits U+0030–U+0039, and combining marks) are legitimately shared across scripts and carry the `Common` or `Inherited` value rather than a specific script name.
A `pdftract` span classifier should resolve script assignments in this priority order:
1. **Specific script** — codepoints with a single non-`Common`, non-`Inherited` script assignment are classified directly.
2. **Script extensions** — codepoints with multiple entries in `ScriptExtensions.txt` (e.g., U+0300 COMBINING GRAVE ACCENT extends into `Latin`, `Greek`, `Cyrillic`) inherit the script of the surrounding run.
3. **Common/Inherited** — treated as transparent; they attach to the script of the nearest resolved codepoint within the same bidi run.
### Mixed-Script Spans
A single PDF text object can contain codepoints from multiple scripts (e.g., a Japanese sentence with embedded Latin product names). The standard approach is **script-run segmentation**: scan the codepoint sequence left to right, maintaining a current script state, and emit a new span boundary whenever the resolved script changes from one specific value to another. `Common` and `Inherited` codepoints do not trigger boundaries.
The Unicode `ScriptExtensions` data can be used to suppress spurious splits: if a `Common` punctuation character appears between two Latin spans with no intervening RTL text, it should remain in the Latin span rather than producing a one-character `Common` fragment.
### CJK Script Identification
CJK requires distinguishing four overlapping script blocks:
| Script | Key Ranges |
|--------|-----------|
| Han | U+4E00–U+9FFF (BMP), U+3400–U+4DBF (Extension A), U+20000–U+2A6DF (Extension B) |
| Hiragana | U+3041–U+3096 |
| Katakana | U+30A1–U+30FA, U+31F0–U+31FF |
| Hangul | U+AC00–U+D7A3 (syllables), U+1100–U+11FF (jamo) |
Han is shared across Chinese, Japanese, and Korean. Language detection (Section 7) must disambiguate Han-dominant runs; script detection alone cannot.
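The script-run segmentation described under Mixed-Script Spans can be sketched with a deliberately coarse classifier. The classifier below covers only the ranges in the table plus ASCII letters and treats everything else as `Common`; a real implementation would load `Scripts.txt` and `ScriptExtensions.txt`. All names are illustrative.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Latin,
    Han,
    Hiragana,
    Katakana,
    Hangul,
    Common,
}

/// Coarse classifier: table ranges plus ASCII letters, otherwise Common.
pub fn script_of(c: char) -> Script {
    match c as u32 {
        0x0041..=0x005A | 0x0061..=0x007A => Script::Latin,
        0x4E00..=0x9FFF | 0x3400..=0x4DBF | 0x20000..=0x2A6DF => Script::Han,
        0x3041..=0x3096 => Script::Hiragana,
        0x30A1..=0x30FA | 0x31F0..=0x31FF => Script::Katakana,
        0xAC00..=0xD7A3 | 0x1100..=0x11FF => Script::Hangul,
        _ => Script::Common,
    }
}

/// Splits `text` into runs of one specific script. Common characters never
/// start a boundary: they join the current run, and a leading Common run
/// adopts the first specific script that follows it.
pub fn script_runs(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    for c in text.chars() {
        let s = script_of(c);
        if let Some((cur, buf)) = runs.last_mut() {
            if s == *cur || s == Script::Common {
                buf.push(c);
                continue;
            }
            if *cur == Script::Common {
                *cur = s;
                buf.push(c);
                continue;
            }
        }
        runs.push((s, c.to_string()));
    }
    runs
}
```

Attaching `Common` characters to the preceding run is the simplest policy; the document's suggestion of resolving them against both neighbours would need a second pass.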
### PDF `/Lang` Attribute
Tagged PDFs may carry a `/Lang` entry (BCP 47 language tag) on the document catalog, individual structure elements, or marked-content sequences. When present, `/Lang` is a strong prior:
- `ja` → expect Han + Hiragana + Katakana, writing mode potentially vertical.
- `ar` or `he` → expect RTL bidi direction, visual-order glyph storage likely.
- `zh-TW` vs. `zh-CN` → disambiguates Traditional vs. Simplified Han.
When `/Lang` is absent or when extracted text falls outside the declared language's expected scripts, fall back to character-level detection. Never suppress the fallback entirely: many PDFs carry a top-level `/Lang` that does not apply uniformly to all content (e.g., an English document with a Hebrew quotation).
---
## 2. Unicode Bidirectional Algorithm (UBA, UAX #9)
### Algorithm Structure
UAX #9 defines a multi-pass algorithm over a paragraph of codepoints. Each codepoint has a **bidi character type** (Strong: L/R/AL; Weak: EN/ES/ET/AN/CS/NSM/BN; Neutral: B/S/WS/ON; Explicit: LRE/RLE/LRO/RLO/PDF/LRI/RLI/FSI/PDI).
Key steps:
1. **Paragraph embedding level**: if the first strong character is R or AL, the paragraph is RTL (embedding level 1); otherwise LTR (level 0).
2. **Explicit level runs**: `LRE`/`RLE` push a new embedding level; `PDF` pops. The isolate controls (`LRI`/`RLI`/`FSI`/`PDI`, introduced in Unicode 6.3) create isolated bidi contexts that do not affect the surrounding paragraph's level stack.
3. **Weak type resolution**: sequences of weak types are resolved based on surrounding strong types per a finite-state table.
4. **Neutral resolution**: neutral characters between two same-direction strong runs take that direction; between opposing runs they take the paragraph direction.
5. **Reorder**: within each level run, apply the level-based reordering algorithm to produce visual order.
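Step 1 is the simplest to illustrate. The sketch below assumes a deliberately rough strong-type lookup (Hebrew letters as R, a subset of Arabic letters as AL, every other alphabetic character as L); a real implementation would read bidi classes from the Unicode Character Database and would skip characters inside isolates. The names `Strong`, `strong_type`, and `paragraph_level` are illustrative only.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Strong {
    L,
    R,
    AL,
}

/// Rough strong-type lookup. Assumption: only Hebrew and basic Arabic
/// letters are treated as RTL; all other alphabetic characters count as L.
fn strong_type(c: char) -> Option<Strong> {
    match c as u32 {
        0x05D0..=0x05EA => Some(Strong::R),
        0x0620..=0x064A | 0x0671..=0x06D3 => Some(Strong::AL),
        _ if c.is_alphabetic() => Some(Strong::L),
        _ => None, // digits, punctuation, whitespace: not strong
    }
}

/// Paragraph embedding level: 1 if the first strong character is R or AL,
/// otherwise 0 (including when no strong character exists).
pub fn paragraph_level(text: &str) -> u8 {
    match text.chars().find_map(strong_type) {
        Some(Strong::R) | Some(Strong::AL) => 1,
        _ => 0,
    }
}
```

Leading digits and punctuation are skipped because they are not strong types, so `"123 abc"` resolves to level 0 and a Hebrew-initial paragraph resolves to level 1.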
### Why PDF Breaks Bidi
PDF authoring tools generally emit glyphs in **visual order** for RTL text rather than in logical (Unicode) order. The content stream positions each glyph individually on the page via the text matrix; there is no implicit cursor advance that encodes reading direction. An Arabic sentence rendered right-to-left appears in the content stream starting from the rightmost glyph.
Consequences for extraction:
- Naively reading content-stream character codes left-to-right from a page produces reversed Arabic/Hebrew words.
- Mixed LTR/RTL content is interleaved in spatial order: the leftmost object on the page comes first in the stream, regardless of its logical position in the paragraph.
### Detecting and Reversing Visual-Order RTL
Detection heuristic: after Unicode recovery, if a run of characters with strong R or AL bidi type appears in left-to-right spatial order (i.e., X coordinates increase as the content-stream position increases), the run is stored in visual order and must be reversed. The threshold for "increasing X" should tolerate per-glyph kerning noise (±2 units in text space).
Reversal procedure:
1. Identify the visual-order run boundaries (the span between two LTR-direction glyphs or page-object boundaries).
2. Reverse the codepoint sequence within each RTL word (space-delimited or width-gap-delimited).
3. Apply UBA to the reassembled logical string to verify paragraph direction.
Note: some PDF producers (notably newer versions of Adobe Acrobat) do store RTL text in logical order with correct ToUnicode. The detection heuristic must be conditional, not unconditional.
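A minimal sketch of the conditional detection and reversal, under simplifying assumptions: the run is a single word, glyphs are given in content-stream order with their X position in text space, and "RTL character" is approximated by the Hebrew and basic Arabic letter ranges. `Glyph`, `is_visual_order`, and `to_logical` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Glyph {
    pub ch: char,
    pub x: f32, // left edge in text space
}

/// Kerning noise tolerated when testing for increasing X (text-space units).
const KERN_TOLERANCE: f32 = 2.0;

/// Assumption: Hebrew and basic Arabic letters stand in for bidi types R/AL.
fn is_rtl_char(c: char) -> bool {
    matches!(c as u32, 0x05D0..=0x05EA | 0x0620..=0x064A)
}

/// True when an all-RTL run has X increasing along stream order,
/// which means it was stored in visual order.
pub fn is_visual_order(run: &[Glyph]) -> bool {
    run.len() >= 2
        && run.iter().all(|g| is_rtl_char(g.ch))
        && run.windows(2).all(|w| w[1].x >= w[0].x - KERN_TOLERANCE)
}

/// Return the run's text in logical order, reversing only when the
/// visual-order test fires.
pub fn to_logical(run: &[Glyph]) -> String {
    if is_visual_order(run) {
        run.iter().rev().map(|g| g.ch).collect()
    } else {
        run.iter().map(|g| g.ch).collect()
    }
}
```

Because the reversal is gated on the geometry test, a producer that already writes logical order (X decreasing along the stream) passes through unchanged.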
---
## 3. Arabic and Hebrew Specifics
### Arabic Shaping and Presentation Forms
Arabic uses a joining model: each base letter has up to four contextual glyph forms — **isolated**, **initial**, **medial**, and **final** — determined by whether the character joins to the preceding and/or following letter. Critically, all four forms map to the same base Unicode codepoint. A PDF font may embed glyphs named `uniFE8D` (isolated alef) or `uniFE8E` (final alef), which are Arabic Presentation Forms from the blocks U+FB50–U+FDFF (Presentation Forms-A) and U+FE70–U+FEFF (Presentation Forms-B).
Normalization: apply Unicode compatibility decomposition (NFKD or NFKC) to map presentation forms to their base codepoints. In the Presentation Forms-A block (U+FB50–U+FDFF), some entries (e.g., U+FD3E ORNATE LEFT PARENTHESIS) have no compatibility decomposition and should be preserved as-is. After normalization, the shaping context is lost, but the logical character identity is recovered, which is what text extraction requires.
Mandatory ligatures such as **lam-alef** (U+0644 + U+0627 and variants) have precomposed forms in the presentation block. These should be expanded back to their two-codepoint sequences during normalization.
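As an illustration, the mapping can be written as a small lookup. This is a sketch covering only a handful of codepoints (alef, lam, and the plain lam-alef ligature); a complete implementation would take the mappings from the Unicode compatibility decomposition data. The function names are hypothetical.

```rust
/// Map a few Arabic presentation forms to their base codepoint sequence.
/// Illustrative subset only.
pub fn collapse_presentation_form(c: char) -> Option<&'static str> {
    match c {
        // alef: isolated and final forms
        '\u{FE8D}' | '\u{FE8E}' => Some("\u{0627}"),
        // lam: isolated, final, initial, medial forms
        '\u{FEDD}'..='\u{FEE0}' => Some("\u{0644}"),
        // lam-alef ligature: isolated and final forms expand to two codepoints
        '\u{FEFB}' | '\u{FEFC}' => Some("\u{0644}\u{0627}"),
        _ => None,
    }
}

/// Replace every mapped presentation form; pass other characters through.
pub fn normalize_arabic(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match collapse_presentation_form(c) {
            Some(base) => out.push_str(base),
            None => out.push(c),
        }
    }
    out
}
```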
### Hebrew Vowel Points and Cantillation
Hebrew base letters (U+05D0–U+05EA) may be followed by **nikud** (vowel points, U+05B0–U+05C7) and **cantillation marks** (U+0591–U+05AF). These are combining characters with bidi type NSM (nonspacing mark), so they take the direction of the base letter they follow and attach to it in logical order. For plain-text extraction, nikud and cantillation can be optionally stripped or preserved depending on the output mode; `pdftract` should expose a normalization flag `strip_combining_marks: bool` per script.
### RTL Word Boundaries Without Spaces
Some Arabic PDFs omit inter-word spaces in the content stream (words are positioned by glyph advances rather than space characters). Word boundary detection falls back to **X-gap analysis**: a gap between adjacent glyphs significantly larger than the average intra-word advance (heuristic: > 0.25 × em) is treated as a word boundary.
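The gap test can be sketched as follows, assuming glyphs arrive in logical order with their horizontal extents and that the em size is known for the run. `PlacedGlyph` and `split_words` are hypothetical names; the 0.25 × em threshold is the heuristic stated above.

```rust
pub struct PlacedGlyph {
    pub ch: char,
    pub x0: f32, // left edge
    pub x1: f32, // right edge
}

/// Split a glyph sequence into words wherever the gap between adjacent
/// glyph boxes exceeds 0.25 * em. Works for either direction of travel.
pub fn split_words(glyphs: &[PlacedGlyph], em: f32) -> Vec<String> {
    let threshold = 0.25 * em;
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev: Option<&PlacedGlyph> = None;
    for g in glyphs {
        if let Some(p) = prev {
            // Next glyph lies to the right (LTR travel) or to the left (RTL travel).
            let gap = if g.x0 >= p.x1 { g.x0 - p.x1 } else { p.x0 - g.x1 };
            if gap > threshold && !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
        }
        current.push(g.ch);
        prev = Some(g);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

Overlapping glyph boxes produce a negative gap and therefore never trigger a split.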
---
## 4. CJK Handling
### Horizontal vs. Vertical Writing Modes
PDF CMaps carry a `/WMode` entry: `0` = horizontal, `1` = vertical. A font may embed two CMaps — a horizontal CMap (name ending in `-H`) and a vertical CMap (name ending in `-V`). The content stream selects between them via the font resource's `/Encoding` or via direct CIDFont reference.
CJK punctuation normalization: fullwidth forms (U+FF01–U+FF60) are compatibility equivalents of their ASCII counterparts. For prose extraction, map fullwidth to halfwidth via NFKC unless the output is destined for layout-sensitive consumers. The `pdftract` normalization pipeline should apply NFKC only to `Common`-script fullwidth/halfwidth punctuation, not to Han or Kana characters (NFKC decomposes some compatibility Kana which should be preserved).
### CJK Line-Break Rules (UAX #14)
The Unicode Line Breaking Algorithm (UAX #14) defines **non-starter** characters (closing brackets, closing quotation marks, Japanese small kana: ぁぃぅぇぉっゃゅょ) that cannot begin a line, and **non-ender** characters (opening brackets) that cannot end a line. When `pdftract` reassembles lines from individual glyphs, these rules inform the merge heuristic: a glyph with a non-starter break class that appears at the apparent start of a new line in the spatial layout should be joined to the preceding line.
---
## 5. Vertical Text
### PDF Encoding of Vertical CJK
In vertical writing mode, glyphs advance downward rather than rightward. With a `/WMode 1` CMap, the advance comes from the CIDFont's vertical metrics (`/W2` and `/DW2`) and the glyphs themselves are not rotated. Some producers instead emulate vertical setting with a horizontal font and a text matrix rotated by 90 degrees; in that case the glyph's width in the font metrics becomes its vertical advance, and the horizontal dimension becomes the em-square height.
Detection: examine the `Tm` (text matrix) operator. A matrix of the form `[0 -1 1 0 tx ty]` or `[0 1 -1 0 tx ty]` indicates vertical text. Combined with `/WMode 1` in the CMap, this is a reliable signal.
Reconstruction: to recover horizontal reading order from a vertical column:
1. Sort glyphs by decreasing Y within a column (top-to-bottom).
2. Sort columns by decreasing X (right-to-left; columns of vertical Japanese and Chinese text conventionally progress from the right edge of the page toward the left).
3. Assign direction `ttb` to the span.
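The reconstruction steps can be sketched as a two-level sort. The code below is an illustration under stated assumptions: glyphs carry PDF user-space coordinates (Y increases upward), columns are separated by more than `col_tolerance`, and the column progression is passed in explicitly rather than hard-coded. `VGlyph`, `ColumnOrder`, and `read_vertical` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
pub struct VGlyph {
    pub ch: char,
    pub x: f32,
    pub y: f32,
}

#[derive(Debug, Clone, Copy)]
pub enum ColumnOrder {
    LeftToRight,
    RightToLeft,
}

/// Group glyphs into columns by X, order the columns, then read each
/// column top-to-bottom (decreasing Y in PDF coordinates).
pub fn read_vertical(glyphs: &[VGlyph], col_tolerance: f32, order: ColumnOrder) -> String {
    let mut sorted: Vec<VGlyph> = glyphs.to_vec();
    sorted.sort_by(|a, b| a.x.total_cmp(&b.x));

    // Consecutive glyphs whose X is close to the column's first glyph share a column.
    let mut columns: Vec<Vec<VGlyph>> = Vec::new();
    for g in sorted {
        let start_new = match columns.last() {
            Some(col) => (g.x - col[0].x).abs() > col_tolerance,
            None => true,
        };
        if start_new {
            columns.push(vec![g]);
        } else {
            columns.last_mut().unwrap().push(g);
        }
    }
    if let ColumnOrder::RightToLeft = order {
        columns.reverse();
    }

    let mut out = String::new();
    for mut col in columns {
        col.sort_by(|a, b| b.y.total_cmp(&a.y)); // top of page first
        out.extend(col.iter().map(|g| g.ch));
    }
    out
}
```

Keeping the column order as a parameter lets the caller pick the progression from `/Lang` or other document-level evidence instead of baking in one convention.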
### Tate-Chu-Yoko
Tate-chu-yoko (縦中横) is a typographic convention where a short horizontal sequence (typically 2–4 Latin characters or digits) is set horizontally within a vertical line. In PDF, these glyphs appear without the 90-degree rotation applied to surrounding CJK glyphs. Detection: within a vertical column, glyphs with a non-rotated text matrix and Latin/digit script classification form a tate-chu-yoko inline sequence. They should be extracted as a single horizontal sub-span with direction `ltr`, embedded within the enclosing `ttb` span.
---
## 6. Ligatures and Script-Specific Normalization
### Unicode Normalization Forms
| Form | Definition | Use in pdftract |
|------|-----------|----------------|
| NFC | Canonical decomposition then canonical composition | Default for Latin, Greek, Cyrillic output |
| NFD | Canonical decomposition only | Internal processing of combining marks |
| NFKC | Compatibility decomposition then canonical composition | Arabic presentation forms, fullwidth CJK punctuation |
| NFKD | Compatibility decomposition only | Intermediate step for specific scripts |
Apply NFKC selectively: Arabic (to collapse presentation forms), fullwidth punctuation (U+FF01–U+FF60), and Latin ligatures from the Alphabetic Presentation Forms block (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st).
### Latin Ligatures
The glyphs `fi`, `fl`, `ff`, `ffi`, `ffl` have explicit Unicode codepoints (U+FB01, U+FB02, U+FB00, U+FB03, U+FB04). PDF fonts commonly use these as single glyphs mapped via ToUnicode to either the precomposed ligature or the two-character sequence. For text search and NLP compatibility, always expand to the constituent characters: `fi` → U+0066 U+0069. Preserve the original ligature codepoint in a `raw_codepoints` field if the consumer needs to reconstruct original layout.
### Devanagari Conjunct Consonants
Devanagari conjunct consonants (Sanskrit: saṃyuktākṣara) are encoded in Unicode as a base consonant + virama (U+094D) + following consonant. PDF fonts may embed precomposed conjunct glyphs that have no standard Unicode representation. Recovery requires mapping via the font's glyph name (e.g., `kka` → U+0915 U+094D U+0915) using a glyph-name-to-sequence table. NFD decomposition of Devanagari preserves the logical structure and should be preferred over NFC for output.
---
## 7. Language Detection
### Statistical and Dictionary Approaches
For runs of 50+ characters with a known script, statistical **n-gram language identification** is reliable. The `whatlang` crate (Rust) uses trigram frequency profiles for 69 languages; the `lingua` crate supports 75 languages with a higher-accuracy model built on n-grams of length 1 through 5, at the cost of a larger compiled profile set. Both crates accept `&str` and return a language tag with confidence score.
For shorter spans (10–50 characters), dictionary-based detection — checking whether the top-N most frequent words from a candidate language appear in the span — outperforms n-gram models. Maintain per-script stop-word lists (the 200 most frequent words per language) compiled into the binary.
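A minimal sketch of the dictionary approach for short spans. The three five-word lists are placeholders standing in for the 200-word lists described above, and `StopWordProfile`, `PROFILES`, and `detect_by_stop_words` are hypothetical names. Ties between the best-scoring languages are reported as "no decision".

```rust
pub struct StopWordProfile {
    pub lang: &'static str,
    pub words: &'static [&'static str],
}

/// Placeholder profiles; real lists would hold ~200 words per language.
pub const PROFILES: &[StopWordProfile] = &[
    StopWordProfile { lang: "en", words: &["the", "and", "of", "to", "is"] },
    StopWordProfile { lang: "de", words: &["der", "die", "und", "ist", "nicht"] },
    StopWordProfile { lang: "fr", words: &["le", "la", "et", "est", "les"] },
];

/// Return the language whose stop words occur most often in `span`,
/// or `None` when nothing matches or the top score is tied.
pub fn detect_by_stop_words(span: &str) -> Option<&'static str> {
    let tokens: Vec<String> = span
        .split(|c: char| !c.is_alphabetic())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect();

    let mut best: Option<(&'static str, usize)> = None;
    let mut tie = false;
    for profile in PROFILES {
        let hits = tokens
            .iter()
            .filter(|t| profile.words.iter().any(|w| *w == t.as_str()))
            .count();
        if hits == 0 {
            continue;
        }
        match best {
            Some((_, n)) if hits == n => tie = true,
            Some((_, n)) if hits < n => {}
            _ => {
                best = Some((profile.lang, hits));
                tie = false;
            }
        }
    }
    if tie { None } else { best.map(|(lang, _)| lang) }
}
```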
### Using `/Lang` as a Prior
When the PDF supplies `/Lang`, use it to bias detection: if the extracted text scores above 0.4 confidence for the declared language, accept the declaration. If the text scores below 0.4 for the declared language but above 0.7 for another, emit a `lang_conflict` warning and use the detected language. If detection confidence is below 0.4 for all candidates, emit `und` (undetermined).
Confidence threshold summary:
| Condition | Output |
|-----------|--------|
| `/Lang` present, detection ≥ 0.4 for declared | Use `/Lang` tag |
| `/Lang` present, conflict detected (other ≥ 0.7) | Use detected tag, warn |
| `/Lang` absent, detection ≥ 0.6 | Use detected tag |
| Any path, confidence < 0.4 | `und` |
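The threshold table maps directly onto a small decision function. This sketch assumes the detector can report a confidence for the declared language as well as its own best guess; combinations the table does not cover (for example `/Lang` absent with confidence between 0.4 and 0.6) fall through to `Undetermined`, which is an assumption of this sketch rather than a rule from the table. All names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum LangDecision {
    /// Accept the `/Lang` declaration.
    Declared(String),
    /// Use the detected language and emit a `lang_conflict` warning.
    Conflict(String),
    /// No `/Lang`; use the detected language.
    Detected(String),
    /// Emit `und`.
    Undetermined,
}

/// `declared`: the `/Lang` tag and the detector's confidence for that tag.
/// `best`: the detector's top candidate and its confidence.
pub fn resolve_lang(declared: Option<(&str, f32)>, best: (&str, f32)) -> LangDecision {
    match declared {
        Some((tag, score)) if score >= 0.4 => LangDecision::Declared(tag.to_string()),
        Some(_) if best.1 >= 0.7 => LangDecision::Conflict(best.0.to_string()),
        None if best.1 >= 0.6 => LangDecision::Detected(best.0.to_string()),
        // Uncovered combinations are treated as undetermined (assumption).
        _ => LangDecision::Undetermined,
    }
}
```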
---
## 8. Output Metadata on Spans and Blocks
Each extracted `Span` and `Block` in the `pdftract` JSON output carries the following language and script metadata:
```json
{
  "text": "مرحباً بالعالم",
  "lang": "ar",
  "script": "Arab",
  "direction": "rtl",
  "normalization": ["nfkc", "visual_order_reversed"],
  "lang_confidence": 0.92,
  "writing_mode": "horizontal"
}
```
Field definitions:
- **`lang`** — BCP 47 language tag (e.g., `ar`, `he`, `ja`, `zh-TW`, `und`). Sourced from `/Lang` or detection.
- **`script`** — ISO 15924 four-letter script code (e.g., `Arab`, `Hebr`, `Hani`, `Hira`, `Hang`, `Deva`, `Thai`, `Latn`). Derived from UAX #24 per-codepoint classification, taking the dominant script of the span.
- **`direction`** — One of `ltr`, `rtl`, or `ttb`. Derived from UBA paragraph direction for horizontal text; `ttb` set when vertical writing mode is detected via CTM analysis and `/WMode 1`.
- **`normalization`** — Array of normalization operations applied, in application order. Valid values: `nfc`, `nfkc`, `nfd`, `nfkd`, `visual_order_reversed`, `ligature_expanded`, `presentation_forms_collapsed`, `combining_marks_stripped`.
- **`lang_confidence`** — Float in [0.0, 1.0] from the language detector. Omitted when `lang` is sourced from `/Lang` and no conflict was detected. Set to `null` when `lang` is `und`.
- **`writing_mode`** — `horizontal` or `vertical`. `vertical` implies `direction` is `ttb`; tate-chu-yoko sub-spans within a vertical block carry `direction: ltr` and `writing_mode: horizontal`.
Blocks aggregate span metadata: the `script` and `lang` of a block are the modal values across its constituent spans. Blocks containing spans from more than one script carry a `mixed_script: true` flag and list all scripts in a `scripts` array alongside the dominant `script` field.


@ -0,0 +1,142 @@
# Mathematical Expression Handling in pdftract
## Overview
Mathematical notation in PDFs does not follow a single encoding scheme. Depending on the authoring tool and font stack, the same rendered equation may be stored as structured XML, as a sequence of Unicode code points from specialized fonts, as legacy symbol-mapped glyphs, as a raster image, or as procedural vector drawing instructions. A robust extraction library must detect which encoding is present, apply the appropriate recovery path, and produce a normalized structured output. This document specifies each encoding case, the algorithms for handling it, and the output representation.
---
## 1. How Math Is Encoded in PDF
### (a) MathML in Tagged PDF StructTree
PDF/UA-compliant documents and Word-exported PDFs with the "Save as PDF" accessibility option may embed MathML directly in the logical structure tree. The `StructTree` dictionary contains `Formula` structure elements whose `ActualText` or associated file attachment holds a well-formed MathML fragment. The extraction path is unambiguous: walk the `StructTree`, locate `Formula` nodes, extract the `ActualText` string or the associated `AF` file stream, and validate the XML. No font decoding is needed.
### (b) OpenType Math Fonts with Correct ToUnicode
Authoring tools that target Unicode-native math (MathType, LibreOffice, newer LaTeX engines with `lualatex`/`xelatex`) embed OpenType fonts such as STIX Two Math, Latin Modern Math, or Cambria Math and include correct `ToUnicode` CMap entries. Glyphs map directly to Unicode Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF) and operator blocks. The PDF content stream is legible at the character level; the challenge is spatial reconstruction — determining which glyphs form a numerator, denominator, superscript, or radical argument.
### (c) Legacy TeX/LaTeX Output with Computer Modern or AMS Fonts
`pdflatex` and `dvips`-produced PDFs embed Type1 or Type2 fonts in legacy TeX encodings. These fonts carry no `ToUnicode` entries, or carry entries that map to PUA code points. The CM family uses OT1 encoding for text, OML for math italic, OMS for math symbols, and OMX for large operators and delimiters. Recovery requires consulting the encoding vector, which maps slot numbers to glyph names, then resolving those glyph names to Unicode via the Adobe Glyph List extended with TeX-specific names.
### (d) Math as Embedded Raster Images
Word processors and equation editors sometimes render complex expressions to a bitmap and embed it as an image XObject (`/Subtype /Image`) in the content stream. EPS figures included via `\includegraphics` appear as Form XObjects. In these cases no character data is recoverable from the content stream. Detection relies on aspect ratio, position within a text block, and the absence of any text operators in the surrounding XObject. The extraction fallback is to crop the rendered page image at the object's bounding box and encode it as base64.
### (e) Math as Type 3 Fonts with Arbitrary Glyph Procedures
Type 3 fonts define each glyph as a PDF content stream of drawing commands. Some equation editors and older TeX backends embed math characters this way. The glyph streams contain no semantic information — only `m`, `l`, `c`, and fill operators. Recovery is strictly visual: render each glyph to a small bitmap and run it through a shape classifier. Given the cost, Type 3 math is best treated as a raster fallback after an attempt to match glyph bitmaps against a reference atlas of common math symbols.
---
## 2. The OpenType MATH Table
The `MATH` table (introduced in OpenType 1.8) is the canonical source of math layout metadata for fonts like Cambria Math and STIX Two Math. It contains three subtables.
**MathConstants** holds the scalar values that govern layout, most of them in font design units: `ScriptPercentScaleDown` and `ScriptScriptPercentScaleDown` give the size ratios for script and script-script levels as percentages; `FractionNumeratorDisplayStyleShiftUp` and `FractionDenominatorGapMin` control fraction layout; `RadicalVerticalGap` and `RadicalRuleThickness` describe radical construction; `UpperLimitGapMin` and `LowerLimitGapMin` cover large operator limits.
**MathGlyphInfo** associates per-glyph metadata with specific glyph IDs: italic correction (the horizontal overhang of an italic glyph, used for correct accent placement), top accent attachment points (the x-coordinate at which a combining accent centers itself over the base glyph), and extended shape flags (marking glyphs that require special italic correction behavior).
**MathVariants** maps base glyph IDs to size variants and glyph construction recipes. Each extensible glyph — a bracket, brace, radical sign, integral, or arrow — has a list of prebuilt size variants followed by a `GlyphAssembly` that describes how to assemble an arbitrary-height or arbitrary-width version from parts (a start piece, one or more extender pieces that repeat, and an end piece). Parsing `MathVariants` allows pdftract to recognize that a sequence of component glyphs in the content stream constitutes a single large delimiter or radical, rather than treating each piece as an independent character.
Inference from the MATH table: if a glyph's bounding box places it above the current baseline by more than `SuperscriptShiftUp` and its font size is roughly `ScriptPercentScaleDown` percent of the enclosing font size, classify it as a superscript argument. Similar logic applies to subscripts, fractions, and radicals.
---
## 3. Symbol Font Encoding Recovery
Legacy TeX fonts use four encoding vectors relevant to math:
- **OT1** — 128 slots, mostly Latin; slot 0x0B is the `ff` ligature, whereas the same slot in OML is `\alpha`, so the font's encoding must be identified before any slot is mapped.
- **OML** — 128 slots of math italic: lowercase and uppercase Latin italic, Greek lowercase, and special math glyphs.
- **OMS** — 128 slots of math symbols: operators, relations, arrows.
- **OMX** — 128 slots of large operators and extensible delimiters; many glyphs are halves or extenders.
Recovery procedure: (1) extract the font's encoding array from the Type1 `Encoding` dictionary or the `cmap` subtable; (2) map slot numbers to glyph names using the encoding vector; (3) look up glyph names in an augmented glyph-name-to-Unicode table that covers TeX-specific names (`arrowlefttophalf`, `bracketleftbt`, etc.) and the AGLFN; (4) for slots that resolve to PUA or remain unmapped, cross-reference the font's `CharStrings` dictionary name against a compiled symbol atlas.
Unicode Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF) provide distinct code points for mathematical italic, bold, script, fraktur, double-struck, monospace, and sans-serif variants of Latin and Greek letters. When a glyph name resolves to a plain Latin letter but the enclosing font is identified as a math italic font (via the font name containing `MathItalic`, `CMMI`, or `OML`), remap to the corresponding U+1D4xx italic code point.
Font identification heuristics: a font is a math font if its name matches known math font families (`cmsy`, `cmex`, `cmmi`, `msam`, `msbm`, `esint`, `stmary`, `txsy`, `pxsy`), or if its `FontDescriptor` `Flags` field has bit 3 (Symbolic) set alongside an encoding with more than 20% glyph-name matches against the math symbol glyph atlas.
---
## 4. Spatial Heuristics for Expression Detection
**Inline vs. display math.** Display math is centered on the page (horizontal center within 5% of page width) and surrounded by inter-paragraph vertical gaps larger than the prevailing line spacing. Inline math shares the baseline grid of surrounding text runs and has no exceptional vertical gap.
**Superscript and subscript detection.** A glyph run is a superscript if its baseline offset from the enclosing line's baseline is positive and falls within the range `[SuperscriptShiftUp * 0.5, SuperscriptShiftUp * 1.5]` (in scaled font units). Subscripts shift negative by an analogous range. A secondary check compares the font size: script-level glyphs are typically 60–71% of the base size.
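The two checks combine into a small classifier. In this sketch the shift constants are assumed to be already scaled to the same units as the measured baseline offset, and the acceptable size-ratio band is passed in by the caller. `ScriptLevel` and `classify_script` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ScriptLevel {
    Base,
    Superscript,
    Subscript,
}

/// `offset`: glyph baseline minus line baseline (positive = raised).
/// `size_ratio`: glyph font size divided by the base font size.
/// `shift_up` / `shift_down`: SuperscriptShiftUp / SubscriptShiftDown, scaled.
/// `ratio_range`: accepted (min, max) size ratio for script-level glyphs.
pub fn classify_script(
    offset: f32,
    size_ratio: f32,
    shift_up: f32,
    shift_down: f32,
    ratio_range: (f32, f32),
) -> ScriptLevel {
    let size_ok = size_ratio >= ratio_range.0 && size_ratio <= ratio_range.1;
    // Accept offsets between 0.5x and 1.5x of the nominal shift.
    let in_band = |value: f32, shift: f32| value >= 0.5 * shift && value <= 1.5 * shift;
    if size_ok && offset > 0.0 && in_band(offset, shift_up) {
        ScriptLevel::Superscript
    } else if size_ok && offset < 0.0 && in_band(-offset, shift_down) {
        ScriptLevel::Subscript
    } else {
        ScriptLevel::Base
    }
}
```

Requiring both the offset band and the size check keeps raised full-size glyphs (for example accents or ordinal markers) from being classified as scripts.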
**Grouping into expression trees.** After baseline classification, group glyphs using a modified connected-components pass: two glyphs belong to the same expression if their bounding boxes overlap on the horizontal axis or are separated by less than one em-width, and they share a common ancestor in the script-level hierarchy. Radical constructs are detected by locating an OMX radical glyph followed by a horizontal bar (`radicalex`) and grouping all glyphs under the bar into the radicand argument. Fraction structures are detected by a horizontal rule glyph with two vertically separated glyph groups straddling it.
**Bracket matching.** Opening delimiter glyphs from OMX or MathVariants are matched to their closing counterparts by tracking a depth counter. Assembled delimiters (multi-part from GlyphAssembly) are collapsed to a single logical delimiter before matching.
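A sketch of delimiter matching over a sequence of already-collapsed delimiter characters. It uses a stack rather than a bare depth counter so that mixed delimiter kinds can be paired; that is a choice of this sketch, not something the text above prescribes. The function names are hypothetical.

```rust
/// Closing delimiter expected for an opening delimiter, if `c` is one.
fn closer_for(c: char) -> Option<char> {
    match c {
        '(' => Some(')'),
        '[' => Some(']'),
        '{' => Some('}'),
        '\u{27E8}' => Some('\u{27E9}'), // mathematical angle brackets
        _ => None,
    }
}

fn is_closer(c: char) -> bool {
    matches!(c, ')' | ']' | '}' | '\u{27E9}')
}

/// Pair opening and closing delimiters by index.
/// Returns `Err(i)` with the index of the first delimiter that cannot be paired.
pub fn match_delimiters(glyphs: &[char]) -> Result<Vec<(usize, usize)>, usize> {
    let mut stack: Vec<(usize, char)> = Vec::new();
    let mut pairs = Vec::new();
    for (i, &c) in glyphs.iter().enumerate() {
        if let Some(expected) = closer_for(c) {
            stack.push((i, expected));
        } else if is_closer(c) {
            match stack.pop() {
                Some((open_idx, expected)) if expected == c => pairs.push((open_idx, i)),
                _ => return Err(i),
            }
        }
    }
    // Any opener left on the stack is unmatched; report the outermost one.
    match stack.first() {
        Some(&(open_idx, _)) => Err(open_idx),
        None => Ok(pairs),
    }
}
```

An `Err` result is a natural input to the structural-ambiguity term of the confidence score, since unmatched delimiters often indicate a missing glyph.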
---
## 5. MathML in Tagged PDFs
The extraction path for tagged PDFs proceeds as follows. Parse the `StructTreeRoot` from the document catalog. Traverse the structure tree depth-first, collecting nodes with `/S /Formula`. For each `Formula` node, inspect the `A` (attribute) dictionary and the `AF` (associated files) array. MathML may appear as:
- A UTF-16BE string in `ActualText` — decode to UTF-8 and parse as XML.
- A file specification in `AF` with `AFRelationship /Supplement` and a MIME type of `application/mathml+xml` — decompress the embedded stream and parse.
Validate the extracted XML against the MathML 3 schema subset. Common defects in Word-exported MathML include missing namespace declarations, `mfenced` elements with non-standard separators, and empty `mrow` wrappers. Apply a normalization pass: add the `xmlns` attribute if absent, replace `mfenced` with explicit `mo` delimiters and `mrow`, and strip empty `mrow` elements.
---
## 6. LaTeX Reconstruction
When the source is glyph sequences (cases b and c) rather than embedded MathML, LaTeX reconstruction proceeds in two phases.
**Phase 1 — symbol mapping.** Map each Unicode math code point (after encoding recovery) to a LaTeX command string using a compiled lookup table covering the full Unicode math range: U+2200–U+22FF (mathematical operators), U+27C0–U+27EF (miscellaneous mathematical symbols-A), U+1D400–U+1D7FF (alphanumerics), and AMS extension blocks. Characters with multiple LaTeX representations (e.g., U+2212 `−` mapping to both `\minus` and `-`) prefer the representation appropriate to context (operator position).
**Phase 2 — structure reconstruction.** Apply the expression tree from the spatial heuristics pass: superscript groups become `^{...}`, subscripts become `_{...}`, fraction numerators and denominators become `\frac{num}{denom}`, radicands become `\sqrt{...}` (or `\sqrt[n]{...}` if an index argument is detected above the radical glyph), and integral glyphs with limit arguments become `\int_{...}^{...}`. Delimiter pairs from bracket matching become `\left( ... \right)` using the appropriate delimiter command.
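Phase 2 amounts to a recursive walk over the expression tree. The sketch below assumes a tree type with one variant per structure named above; `Expr` and `to_latex` are hypothetical names, and symbols are assumed to have been mapped to LaTeX strings already in Phase 1.

```rust
pub enum Expr {
    /// A symbol already mapped to its LaTeX string.
    Sym(String),
    /// A horizontal sequence of sub-expressions.
    Seq(Vec<Expr>),
    /// Base with superscript.
    Sup(Box<Expr>, Box<Expr>),
    /// Base with subscript.
    Sub(Box<Expr>, Box<Expr>),
    /// Numerator over denominator.
    Frac(Box<Expr>, Box<Expr>),
    /// Optional index and radicand.
    Sqrt(Option<Box<Expr>>, Box<Expr>),
}

/// Serialize an expression tree to a LaTeX string.
pub fn to_latex(e: &Expr) -> String {
    match e {
        Expr::Sym(s) => s.clone(),
        Expr::Seq(items) => items.iter().map(to_latex).collect::<Vec<_>>().join(" "),
        Expr::Sup(base, arg) => format!("{}^{{{}}}", to_latex(base), to_latex(arg)),
        Expr::Sub(base, arg) => format!("{}_{{{}}}", to_latex(base), to_latex(arg)),
        Expr::Frac(num, den) => format!("\\frac{{{}}}{{{}}}", to_latex(num), to_latex(den)),
        Expr::Sqrt(None, rad) => format!("\\sqrt{{{}}}", to_latex(rad)),
        Expr::Sqrt(Some(index), rad) => {
            format!("\\sqrt[{}]{{{}}}", to_latex(index), to_latex(rad))
        }
    }
}
```

Script arguments are always wrapped in braces, which is redundant for single symbols but keeps the output unambiguous for nested groups.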
Limitations: reconstruction is heuristic and degrades for deeply nested structures, for multi-line display environments (`align`, `cases`), and for any glyph that has no Unicode mapping. Confidence decreases with nesting depth and increases with the proportion of glyphs that resolve cleanly to Unicode.
---
## 7. Fallback Strategies
Fallbacks are selected based on a per-expression confidence score (0.0–1.0) computed from: fraction of glyphs with clean Unicode mappings, availability of MATH table data, presence of MathML in StructTree, and structural ambiguity (unmatched delimiters, zero-width gaps suggesting missing glyphs).
| Confidence | Output Strategy |
|---|---|
| ≥ 0.85 | Full LaTeX and/or MathML reconstruction |
| 0.60–0.84 | Unicode math string only (`unicode` field populated, `latex` omitted) |
| 0.30–0.59 | Placeholder `[MATH]` with bounding box; Unicode field if partially recoverable |
| < 0.30 | Base64 image crop of the rendered expression region; all text fields omitted |
Image crops are produced by rendering the page to a raster at 150 DPI (sufficient for readability) and cropping to the expression bounding box with 4-point padding on each side.
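The tier table and the crop arithmetic can be sketched together. `MathOutput`, `select_strategy`, and `points_to_pixels` are hypothetical names; values that fall between two listed tiers (for example 0.845) are assigned to the lower tier, which is an assumption of this sketch.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MathOutput {
    FullReconstruction,
    UnicodeOnly,
    Placeholder,
    ImageCrop,
}

/// Pick the output strategy for a per-expression confidence score.
pub fn select_strategy(confidence: f32) -> MathOutput {
    if confidence >= 0.85 {
        MathOutput::FullReconstruction
    } else if confidence >= 0.60 {
        MathOutput::UnicodeOnly
    } else if confidence >= 0.30 {
        MathOutput::Placeholder
    } else {
        MathOutput::ImageCrop
    }
}

/// Convert a length in PDF points (1/72 inch) to pixels at the given DPI.
pub fn points_to_pixels(points: f32, dpi: f32) -> f32 {
    points * dpi / 72.0
}
```

At 150 DPI the 4-point padding corresponds to roughly 8.3 pixels on each side of the crop.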
---
## 8. Output Representation
Math blocks appear as JSON objects in the extraction output with the following schema:
```json
{
  "kind": "math",
  "subtype": "inline",
  "latex": "\\frac{d}{dx}\\left(x^2\\right) = 2x",
  "mathml": "<math xmlns=\"http://www.w3.org/1998/Math/MathML\">...</math>",
  "unicode": "d/dx(x²) = 2x",
  "confidence": 0.91,
  "bbox": { "page": 3, "x0": 144.0, "y0": 612.5, "x1": 310.2, "y1": 628.0 },
  "image_b64": null
}
```
Field semantics:
- `kind`: always `"math"` for math blocks.
- `subtype`: `"inline"` for expressions within a text run; `"display"` for centered block equations.
- `latex`: LaTeX source string if confidence ≥ 0.85 and reconstruction succeeded; `null` otherwise.
- `mathml`: MathML 3 XML string if extracted from StructTree or reconstructed with high confidence; `null` otherwise.
- `unicode`: Best-effort Unicode rendering of the expression; populated when confidence ≥ 0.30.
- `confidence`: Float in [0.0, 1.0] reflecting the extraction reliability estimate.
- `bbox`: Page number (1-indexed) and coordinates in PDF user-space units (origin at bottom-left).
- `image_b64`: Base64-encoded PNG crop of the rendered expression; populated only when confidence < 0.30 and a raster render is available; `null` otherwise.
When both `latex` and `mathml` are present, they are independently derived (one from StructTree, one from reconstruction) and may differ in normalization. Consumers should prefer `mathml` when present, as it is either source-authoritative or structurally more complete than the heuristic LaTeX.


@ -0,0 +1,183 @@
# Post-Extraction Normalization Pipeline
Raw text extracted from a PDF is not presentation-ready text. Glyphs are decoded individually, positioned by absolute coordinates, and carry no semantic information about word boundaries, paragraph structure, or typographic intent. This document describes the normalization pipeline that transforms raw extracted text into clean, semantically coherent output.
---
## 1. Hyphenation Handling
PDF typesetters insert hyphens at line boundaries for optical justification. Three distinct codepoints appear in practice:
- **U+002D HYPHEN-MINUS** — the workhorse character. Used both as a hard (intentional) hyphen and as an end-of-line break marker inserted by the typesetter.
- **U+00AD SOFT HYPHEN** — a Unicode line-break hint. When present mid-word, it signals that the word *may* be broken here, but is not itself visible. Remove it unconditionally during normalization.
- **U+2010 HYPHEN** — unambiguous hard hyphen, always intentional.
The difficult case is U+002D at end-of-line. Detecting it requires combining positional evidence with lexical evidence:
1. **Positional test**: the glyph is the last character on the line, and its right edge is within a configurable threshold (typically 5–10% of column width) of the right text margin for that column.
2. **Lexical test**: concatenate the word prefix (characters before the hyphen on the current line) with the word suffix (first token on the next line). Query a language-appropriate dictionary or spell-checker. If the concatenated form is a known word and the hyphenated form is not, the hyphen is a break artifact and should be removed when joining lines.
3. **Compound-word fallback**: if neither form resolves cleanly, preserve the hyphen. Compound words in German, Dutch, and Norwegian are frequently hyphenated intentionally even mid-line.
Language-specific rules add complexity. German has mandatory spelling hyphens (e.g., *Dampf-schiff* as a stylistic compound variant) that must not be joined. Arabic and Hebrew text flow right-to-left; end-of-line positions are mirrored. Thai and CJK scripts do not use hyphens at all.
The implementation strategy: build a `HyphenResolver` trait with a method `fn should_join(prefix: &str, suffix: &str, lang: Language) -> bool`, backed by a word-frequency dictionary for the target language. For `lang = Unknown`, default to preserving the hyphen.
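A minimal sketch of that trait with a dictionary-backed implementation. Compared with the signature above, the method takes `&self` so the implementation can hold its word list; the `Language` variants, `DictionaryResolver`, and the single shared word set are assumptions for illustration (a real resolver would select a dictionary per language).

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    German,
    Unknown,
}

pub trait HyphenResolver {
    /// True when the end-of-line hyphen between `prefix` and `suffix`
    /// is a break artifact and the halves should be joined without it.
    fn should_join(&self, prefix: &str, suffix: &str, lang: Language) -> bool;
}

/// Resolver backed by one lowercase word set (hypothetical).
pub struct DictionaryResolver {
    pub words: HashSet<String>,
}

impl HyphenResolver for DictionaryResolver {
    fn should_join(&self, prefix: &str, suffix: &str, lang: Language) -> bool {
        // Unknown language: preserve the hyphen.
        if lang == Language::Unknown {
            return false;
        }
        let joined = format!("{}{}", prefix, suffix).to_lowercase();
        let hyphenated = format!("{}-{}", prefix, suffix).to_lowercase();
        // Join only when the joined form is known and the hyphenated form is not.
        self.words.contains(&joined) && !self.words.contains(&hyphenated)
    }
}
```

When neither form is in the dictionary the method returns `false`, which matches the compound-word fallback of preserving the hyphen.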
---
## 2. Ligature Expansion
OpenType fonts frequently encode multi-character sequences as single glyphs in the Private Use Area or as Unicode Compatibility Area codepoints. The standard Latin ligatures with assigned Unicode codepoints are:
| Codepoint | Form | Expansion |
|-----------|------|-----------|
| U+FB00 | ff | f + f |
| U+FB01 | fi | f + i |
| U+FB02 | fl | f + l |
| U+FB03 | ffi | f + f + i |
| U+FB04 | ffl | f + f + l |
| U+FB05 | ſt | ſ + t |
| U+FB06 | st | s + t |
Expand all of these unconditionally for search-oriented output. A full-text search index that receives `U+FB01` will not match the query `fi`; expansion to component letters is required.
For display-fidelity output where the caller wants to preserve typographic forms, expansion should be optional. Expose a `LigatureMode` enum: `Expand` (default for search), `Preserve` (for display). Note that NFKC normalization (see §5) collapses these ligatures automatically, so if NFKC is applied, ligature mode has no additional effect.
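A sketch of the mode switch for the seven Latin ligatures in the table. `LigatureMode` follows the enum named above; `expand_ligature` and `apply_ligature_mode` are hypothetical helper names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LigatureMode {
    Expand,
    Preserve,
}

/// Expansion for the Latin ligatures U+FB00 through U+FB06.
pub fn expand_ligature(c: char) -> Option<&'static str> {
    match c {
        '\u{FB00}' => Some("ff"),
        '\u{FB01}' => Some("fi"),
        '\u{FB02}' => Some("fl"),
        '\u{FB03}' => Some("ffi"),
        '\u{FB04}' => Some("ffl"),
        '\u{FB05}' => Some("\u{017F}t"), // long s + t, as in the table
        '\u{FB06}' => Some("st"),
        _ => None,
    }
}

/// Apply the selected mode to a string.
pub fn apply_ligature_mode(input: &str, mode: LigatureMode) -> String {
    match mode {
        LigatureMode::Preserve => input.to_string(),
        LigatureMode::Expand => {
            let mut out = String::with_capacity(input.len());
            for c in input.chars() {
                match expand_ligature(c) {
                    Some(expansion) => out.push_str(expansion),
                    None => out.push(c),
                }
            }
            out
        }
    }
}
```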
**Arabic ligatures** are more complex. The mandatory ligature *lam-alef* (U+FEFB/U+FEFC) must be expanded to lam (U+0644) + alef (U+0627) for correct text processing. Other Arabic presentation forms in the FBxx–FExx range should similarly be decomposed. Arabic shaping is the font renderer's responsibility, not the extraction layer's; after expansion, a bidirectional algorithm (Unicode Bidirectional Algorithm, UBA) determines display order.
**Devanagari** uses conjunct consonants that are orthographically distinct from their component sequences. These are *not* ligatures in the presentation sense; they represent distinct orthographic units. Do not attempt to decompose them.
---
## 3. Line and Paragraph Break Reconstruction
PDF text streams contain positioned runs, not lines. Reconstruction requires:
**Soft wrap detection (same paragraph)**: a line break is a soft wrap when:
- The vertical gap between the bottom of line *n* and the top of line *n+1* is within 1.2× the line height (the typeset leading).
- The last character of line *n* is not sentence-ending punctuation (`.`, `?`, `!`, `:`), or it is but the next line begins with a lowercase letter (indicating a mid-sentence break).
- The right edge of the last glyph on line *n* is within the right-margin proximity threshold (the line was wrapped, not short).
When all conditions hold, join with a single U+0020 SPACE.
**Hard paragraph break detection**: a vertical gap exceeding 1.5× the line height, or a first-line indent on the following line (detected as a left-edge offset exceeding a threshold), signals a paragraph boundary. Emit a double newline or a paragraph separator (U+2029 PARAGRAPH SEPARATOR) depending on output format.
**Short lines**: a line whose right edge falls well inside the right margin that is followed by a line with a reset left margin signals a paragraph break even without a large vertical gap (common in ragged-right body text and poetry).
Store each text segment with its bounding box `(x0, y0, x1, y1)` in page coordinates. Sort by `(y0, x0)` for left-to-right scripts; use the dominant reading direction for bidi content.
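The rules above can be combined into one decision per line pair. This sketch takes the vertical gap as an input so it does not depend on a coordinate convention, and it treats every case the rules leave open (for example a gap between 1.2× and 1.5× the line height) as a paragraph break, which is an assumption. All type and field names are hypothetical.

```rust
pub struct LineInfo<'a> {
    pub text: &'a str,
    pub left: f32,
    pub right: f32,
}

pub struct Layout {
    pub line_height: f32,
    pub left_margin: f32,
    pub right_margin: f32,
    /// How close to the right margin a line must end to count as wrapped.
    pub margin_tolerance: f32,
    /// Left-edge offset that counts as a first-line indent.
    pub indent_threshold: f32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Break {
    SoftWrap,
    Paragraph,
}

pub fn classify_break(prev: &LineInfo, next: &LineInfo, vertical_gap: f32, layout: &Layout) -> Break {
    // Hard break: large vertical gap or indented following line.
    if vertical_gap > 1.5 * layout.line_height
        || next.left - layout.left_margin > layout.indent_threshold
    {
        return Break::Paragraph;
    }
    // Short line: the previous line ended well inside the right margin.
    if layout.right_margin - prev.right > layout.margin_tolerance {
        return Break::Paragraph;
    }
    let ends_sentence = prev
        .text
        .trim_end()
        .chars()
        .last()
        .map_or(false, |c| matches!(c, '.' | '?' | '!' | ':'));
    let next_lowercase = next
        .text
        .trim_start()
        .chars()
        .next()
        .map_or(false, |c| c.is_lowercase());
    if vertical_gap <= 1.2 * layout.line_height && (!ends_sentence || next_lowercase) {
        Break::SoftWrap
    } else {
        Break::Paragraph
    }
}
```

A `SoftWrap` result means the two lines are joined with a single U+0020; `Paragraph` means a paragraph separator is emitted.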
---
## 4. Whitespace Normalization
PDF character positioning uses absolute coordinates. Adjacent glyphs separated by a small positive advance (less than one-third of the space glyph width for the current font) are concatenated without a space. Larger gaps produce either an explicit space glyph or an implicit word boundary.
After joining glyph runs:
- **Collapse runs of U+0020**: replace any sequence of two or more SPACE characters with a single SPACE.
- **Remove invisible Unicode spaces**: strip U+200B ZERO WIDTH SPACE, U+200C ZERO WIDTH NON-JOINER, U+200D ZERO WIDTH JOINER, and U+FEFF BOM/ZWNBSP where they appear mid-text.
- **NO-BREAK SPACE (U+00A0)**: normalize to U+0020 in body text. Preserve in contexts where breaking is semantically wrong (between a number and its unit, e.g., *42 kg*) if the caller opts in.
- **Trim per block**: strip leading and trailing whitespace from each reconstructed paragraph block before emitting.
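A minimal sketch of these four rules in one pass, using only the standard library. The function name and the `keep_nbsp` flag are illustrative assumptions; in particular, the flag keeps every NO-BREAK SPACE rather than only those between a number and its unit.

```rust
/// Normalize whitespace inside one reconstructed paragraph block.
/// `keep_nbsp` is a simplified stand-in for the caller opt-in described above.
pub fn normalize_whitespace(block: &str, keep_nbsp: bool) -> String {
    let mut out = String::with_capacity(block.len());
    let mut prev_space = false;
    for c in block.chars() {
        let c = match c {
            // Invisible spaces and joiners are dropped outright.
            '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{FEFF}' => continue,
            // NO-BREAK SPACE becomes a plain space unless the caller opted out.
            '\u{00A0}' if !keep_nbsp => ' ',
            other => other,
        };
        if c == ' ' {
            if prev_space {
                continue; // collapse runs of U+0020
            }
            prev_space = true;
        } else {
            prev_space = false;
        }
        out.push(c);
    }
    // Trim leading and trailing whitespace for the block.
    out.trim().to_string()
}
```

Because dropped zero-width characters do not reset `prev_space`, a sequence such as space, U+200B, space still collapses to one space.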
---
## 5. Unicode Normalization
Unicode defines four normalization forms:
- **NFD**: Canonical Decomposition. Precomposed characters are decomposed into base + combining sequences. Useful for accent stripping downstream.
- **NFC**: Canonical Decomposition followed by Canonical Composition. The standard interchange form; round-trips with NFD.
- **NFKD**: Compatibility Decomposition. Collapses compatibility variants: fullwidth ASCII, circled letters, fraction characters, ligatures, superscripts.
- **NFKC**: Compatibility Decomposition followed by Canonical Composition. The most aggressive normalization.
For PDF extraction:
- **Apply NFC** to the output by default. It normalizes precomposed characters extracted via different code paths into a consistent form without destroying content.
- **Do not apply NFKC by default.** NFKC maps `ﬁ` (U+FB01) to `fi` (collapsing the ligature, which is usually correct), but it also maps `①` to `1`, `½` to `1⁄2` (digits joined by U+2044 FRACTION SLASH), and fullwidth `Ａ` to `A`. This alters content that may be semantically significant (fractions in mathematical texts, circled numbers in diagrams). Expose NFKC as a caller-controlled option.
- **Surrogates and noncharacters**: codepoints U+D800–U+DFFF (lone surrogates) and U+FDD0–U+FDEF plus U+FFFE/U+FFFF (noncharacters) must be removed. They appear when a font's CMap maps a glyph to a malformed Unicode value. Replace with U+FFFD or drop, depending on caller preference.
- **Private Use Area codepoints**: U+E000–U+F8FF are frequently used as glyph placeholders in symbolic fonts. Strip them unless the caller's glyph recovery layer has already mapped them to real codepoints (see the glyph recognition research document).
The `unicode-normalization` crate provides `nfc()`, `nfd()`, `nfkc()`, `nfkd()` iterators over `char` streams and is the canonical implementation for Rust.
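The noncharacter and Private Use Area rules do not need the normalization crate and can be sketched with the standard library alone. The `BadCodepoint` enum and the `sanitize` function are hypothetical names for illustration. Lone surrogates are not handled here because a Rust `char` can never hold one; they have to be caught where CMap output is decoded, before a `String` is built.

```rust
/// What to do with a noncharacter (hypothetical caller preference).
#[derive(Clone, Copy)]
pub enum BadCodepoint {
    Replace, // emit U+FFFD
    Drop,    // emit nothing
}

/// The noncharacters named in this section: U+FDD0..=U+FDEF, U+FFFE, U+FFFF.
pub fn is_noncharacter(c: char) -> bool {
    let cp = c as u32;
    (0xFDD0..=0xFDEF).contains(&cp) || cp == 0xFFFE || cp == 0xFFFF
}

/// The BMP Private Use Area, U+E000..=U+F8FF.
pub fn is_bmp_private_use(c: char) -> bool {
    ('\u{E000}'..='\u{F8FF}').contains(&c)
}

/// Remove or replace noncharacters and, when `strip_pua` is set, drop
/// Private Use Area codepoints that no recovery layer has resolved.
pub fn sanitize(input: &str, policy: BadCodepoint, strip_pua: bool) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if is_noncharacter(c) {
            if let BadCodepoint::Replace = policy {
                out.push('\u{FFFD}');
            }
        } else if strip_pua && is_bmp_private_use(c) {
            // dropped
        } else {
            out.push(c);
        }
    }
    out
}
```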
---
## 6. Quote and Dash Normalization
PDFs from professional typesetters use typographic punctuation. Two normalization strategies are useful:
**Preserve typographic forms** (default, for display fidelity):
- Left single quotation mark: U+2018 `‘`
- Right single quotation mark / apostrophe: U+2019 `’`
- Left double quotation mark: U+201C `“`
- Right double quotation mark: U+201D `”`
- Em dash: U+2014 `—`
- En dash: U+2013 `–`
**Normalize to ASCII equivalents** (for search and downstream NLP):
- U+2018, U+2019 → U+0027 `'`
- U+201C, U+201D → U+0022 `"`
- U+2014, U+2013, U+2012 → U+002D `-` (or preserve dashes as-is; NLP pipelines vary)
The figure dash (U+2012) is rare but appears in some European typesetting. The horizontal bar (U+2015) appears in Greek text for dialogue attribution.
Expose this as a `QuoteMode` enum (`Preserve`, `AsciiEquivalents`) and a `DashMode` enum (`Preserve`, `HyphenMinus`, `Retain`). Neither should default to normalization; lossy transformations require explicit opt-in.
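One way the two enums might look, as a hedged sketch rather than a settled API. The `normalize_punctuation` function is a hypothetical name, and `DashMode` carries only the `Preserve` and `HyphenMinus` variants here because the text does not say how `Retain` differs from `Preserve`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuoteMode {
    Preserve,
    AsciiEquivalents,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DashMode {
    Preserve,
    HyphenMinus,
}

/// Map typographic quotes and dashes to ASCII only when the caller asks for it.
pub fn normalize_punctuation(text: &str, quotes: QuoteMode, dashes: DashMode) -> String {
    text.chars()
        .map(|c| match c {
            // Single quotation marks, including the apostrophe use of U+2019.
            '\u{2018}' | '\u{2019}' if quotes == QuoteMode::AsciiEquivalents => '\'',
            // Double quotation marks.
            '\u{201C}' | '\u{201D}' if quotes == QuoteMode::AsciiEquivalents => '"',
            // Figure dash, en dash, em dash.
            '\u{2012}' | '\u{2013}' | '\u{2014}' if dashes == DashMode::HyphenMinus => '-',
            other => other,
        })
        .collect()
}
```

Passing `Preserve` for both modes returns the input unchanged, which matches the requirement that lossy transformations are opt-in.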
---
## 7. Running Header and Footer Deduplication
After zone classification (header zone, footer zone, body zone), headers and footers must be removed from the primary text stream. The extraction strategy:
1. Classify text blocks by vertical position: blocks in the top 10% or bottom 10% of the page area are candidates.
2. Across a document, compare candidate blocks across pages. A block whose text (ignoring page numbers) appears on ≥ 80% of pages is a running header or footer.
3. Strip these blocks from the text stream. Optionally emit them into a parallel `headers: Vec<String>` / `footers: Vec<String>` field on the page output.
4. Page numbers embedded in headers/footers are identified by the pattern of incrementing integers. Normalize them out when stripping. If a header reads `Chapter 3 — Methodology 42`, the page number `42` varies per page while `Chapter 3 — Methodology` is the repeated fragment.
For deduplication, a normalized comparison (lowercased, whitespace-collapsed, digits wildcarded) across pages is sufficient. Store a fingerprint `(text_without_digits_normalized, frequency_count)` per candidate block.
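The fingerprint comparison can be sketched as follows. The function names and the `#` wildcard character are assumptions made for illustration; the input is assumed to be the candidate blocks already selected by vertical position, one list per page.

```rust
use std::collections::HashMap;

/// Lowercase, replace each run of ASCII digits with '#', and collapse whitespace.
pub fn fingerprint(text: &str) -> String {
    let mut out = String::new();
    let mut in_digits = false;
    for c in text.to_lowercase().chars() {
        if c.is_ascii_digit() {
            if !in_digits {
                out.push('#');
            }
            in_digits = true;
        } else {
            in_digits = false;
            out.push(c);
        }
    }
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// `pages[i]` holds the candidate header/footer texts of page i.
/// Returns the fingerprints seen on at least `min_fraction` of all pages.
pub fn running_blocks(pages: &[Vec<String>], min_fraction: f64) -> Vec<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for page in pages {
        // Count each fingerprint at most once per page.
        let mut seen: Vec<String> = page.iter().map(|t| fingerprint(t)).collect();
        seen.sort();
        seen.dedup();
        for fp in seen {
            *counts.entry(fp).or_insert(0) += 1;
        }
    }
    let total = pages.len() as f64;
    let mut out: Vec<String> = counts
        .into_iter()
        .filter(|(_, n)| *n as f64 >= min_fraction * total)
        .map(|(fp, _)| fp)
        .collect();
    out.sort();
    out
}
```

Wildcarding every digit run means `Chapter 3` and `Chapter 4` share a fingerprint, which is what lets a header survive chapter changes but also means two numerically distinct headers cannot be told apart by this key alone.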
---
## 8. Control Character and Artifact Removal
Strip the following unconditionally:
- **U+000C FORM FEED** — page separators inserted by some PDF export tools.
- **U+000D CARRIAGE RETURN** not followed by U+000A — normalize CR+LF to LF; standalone CR to LF.
- **U+0000 NULL** — produced by malformed CMap entries.
- **U+FFFD REPLACEMENT CHARACTER** — indicates a failed codepoint decode upstream; remove or log and drop.
- **Private Use Area codepoints** (U+E000–U+F8FF, U+F0000–U+FFFFD, U+100000–U+10FFFD) that were not resolved by the glyph recovery layer.
These characters are artifacts of the extraction process, not content. A downstream consumer encountering a NULL byte or PUA codepoint in extracted text has no correct interpretation for it.
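A minimal sketch of this pass, assuming it runs on text whose Private Use Area codepoints have already had their chance at recovery. The function names are illustrative. Carriage returns are normalized to LF rather than deleted, following the rule in the list.

```rust
/// True for codepoints in Unicode's three Private Use Areas.
fn is_private_use(c: char) -> bool {
    matches!(
        c as u32,
        0xE000..=0xF8FF | 0xF0000..=0xFFFFD | 0x100000..=0x10FFFD
    )
}

/// Strip extraction artifacts and normalize line endings to LF.
pub fn strip_artifacts(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // CR+LF collapses to LF; a standalone CR becomes LF.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push('\n');
            }
            // FORM FEED, NULL, and REPLACEMENT CHARACTER are dropped.
            '\u{000C}' | '\u{0000}' | '\u{FFFD}' => {}
            c if is_private_use(c) => {}
            c => out.push(c),
        }
    }
    out
}
```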
---
## 9. Number and Digit Form Normalization
Unicode encodes digit sequences for multiple scripts:
- **Arabic-Indic**: U+0660–U+0669 (`٠١٢٣٤٥٦٧٨٩`)
- **Extended Arabic-Indic**: U+06F0–U+06F9
- **Devanagari**: U+0966–U+096F (`०१२३४५६७८९`)
- **Thai**: U+0E50–U+0E59 (`๐๑๒๓๔๕๖๗๘๙`)
Normalizing these to ASCII digits (U+0030–U+0039) aids downstream numeric parsing but destroys information in multilingual documents. This normalization must be opt-in. The default should preserve the original digit forms.
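As an illustration of the opt-in transform, the sketch below maps the four digit blocks listed above onto ASCII. The function name is an assumption, and other scripts' digits are deliberately left untouched because the text does not list them.

```rust
/// Map Arabic-Indic, Extended Arabic-Indic, Devanagari, and Thai digits to ASCII.
/// Intended to be called only when the caller has enabled digit normalization.
pub fn normalize_digits(text: &str) -> String {
    text.chars()
        .map(|c| {
            let cp = c as u32;
            // Find the codepoint of the script's zero digit, if `c` is a listed digit.
            let zero = match cp {
                0x0660..=0x0669 => 0x0660, // Arabic-Indic
                0x06F0..=0x06F9 => 0x06F0, // Extended Arabic-Indic
                0x0966..=0x096F => 0x0966, // Devanagari
                0x0E50..=0x0E59 => 0x0E50, // Thai
                _ => return c,
            };
            // Offset from the script's zero, re-based onto U+0030.
            char::from_u32(0x30 + (cp - zero)).unwrap_or(c)
        })
        .collect()
}
```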
Date normalization (parsing and re-emitting dates in a canonical format) is out of scope for the extraction layer and belongs in a higher-level application.
---
## 10. Pipeline Ordering
The normalization steps must execute in the following order to avoid interactions:
1. **Ligature expansion** — before Unicode normalization, so that NFKC (if applied) does not need to handle ligatures separately; expansion maps are simpler than NF decompositions.
2. **Unicode normalization (NFC)** — after ligature expansion but before any string comparison operations; ensures that precomposed characters from different code paths produce identical byte sequences.
3. **Control character and artifact removal** — after NFC so that NFC does not accidentally compose an artifact codepoint with a preceding base character.
4. **Whitespace collapse** — after artifact removal, which may produce adjacent spaces when stripped codepoints had surrounding whitespace.
5. **Hyphen joining / line reconstruction** — requires clean whitespace and consistent codepoints to correctly detect end-of-line positions and perform dictionary lookups.
6. **Paragraph reconstruction** — after line joining; requires final line boundaries to be determined.
7. **Header and footer removal** — after paragraph reconstruction, so that block boundaries are stable before cross-page comparison.
8. **Quote and dash normalization (optional)** — last, so it operates on coherent paragraph text rather than on fragments that might contain split quotation contexts.
Order matters concretely: running hyphen joining before whitespace collapse forces it to detect end-of-line positions through the stray spaces that artifact removal leaves behind, and running quote normalization before paragraph reconstruction applies it to line fragments in which a quotation may still be split across pieces.
The pipeline should be implemented as a sequence of `fn normalize(input: &str, config: &NormalizationConfig) -> String` transforms chained via iterator adapters, with `NormalizationConfig` carrying all opt-in flags (`ligature_mode`, `nfkc`, `quote_mode`, `dash_mode`, `digit_normalization`, `no_break_space_handling`). Each step is independently testable, and a disabled step is skipped without affecting the rest of the chain.
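A reduced sketch of such a chain, under several assumptions: the `Step` struct, `default_steps`, and `run_pipeline` are hypothetical names, the config carries only three boolean flags instead of the full set, and just two of the eight steps are implemented to keep the example short.

```rust
/// Simplified stand-in for the real config; only `ascii_quotes` is consulted below.
#[derive(Default)]
pub struct NormalizationConfig {
    pub nfkc: bool,
    pub ascii_quotes: bool,
    pub digit_normalization: bool,
}

/// One pipeline stage: a predicate deciding whether it runs, and the transform itself.
pub struct Step {
    pub name: &'static str,
    pub enabled: fn(&NormalizationConfig) -> bool,
    pub run: fn(&str, &NormalizationConfig) -> String,
}

fn always(_: &NormalizationConfig) -> bool {
    true
}

fn wants_ascii_quotes(cfg: &NormalizationConfig) -> bool {
    cfg.ascii_quotes
}

fn collapse_spaces(s: &str, _: &NormalizationConfig) -> String {
    s.split(' ').filter(|w| !w.is_empty()).collect::<Vec<_>>().join(" ")
}

fn ascii_quotes(s: &str, _: &NormalizationConfig) -> String {
    s.chars()
        .map(|c| match c {
            '\u{201C}' | '\u{201D}' => '"',
            '\u{2018}' | '\u{2019}' => '\'',
            other => other,
        })
        .collect()
}

/// Steps appear in pipeline order: whitespace collapse before quote normalization.
pub fn default_steps() -> Vec<Step> {
    vec![
        Step { name: "whitespace_collapse", enabled: always, run: collapse_spaces },
        Step { name: "quote_normalization", enabled: wants_ascii_quotes, run: ascii_quotes },
    ]
}

/// Apply every enabled step in order, feeding each step's output to the next.
pub fn run_pipeline(input: &str, steps: &[Step], cfg: &NormalizationConfig) -> String {
    steps
        .iter()
        .filter(|s| (s.enabled)(cfg))
        .fold(input.to_string(), |text, s| (s.run)(&text, cfg))
}
```

Keeping the order in a plain `Vec<Step>` makes the sequence itself a piece of data that a test can inspect, which is one way to guard against accidental reordering.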


@ -0,0 +1,218 @@
# Table Structure Reconstruction
## The Problem
PDF is a presentation format. Its content streams describe where ink lands on a page — not what that ink means. There is no semantic concept of "table", "row", or "cell" in an untagged PDF. Every glyph and path operator exists only to produce visual output; the burden of interpretation falls entirely on the reader.
This creates several compounding difficulties:
**No semantic markup.** Even what appears to be a neatly formatted table with ruled lines may be represented as a collection of `re` (rectangle) fill operations, scattered glyph positioning commands, and `l`/`S` (line/stroke) operators — all independent of one another in the content stream. The association between a drawn border and the text it encloses is purely geometric, not encoded.
**Borderless tables indistinguishable from columnar prose.** A two-column table with no borders is visually identical to a two-column prose layout. The only distinguishing signals are: whether the number of rows exceeds a threshold, whether horizontal alignment is consistent across all rows, and whether adjacent columns carry semantically distinct data types. None of these signals are definitive on their own.
**Merged cells.** A cell spanning two columns has no path operators uniquely identifying the span. From the drawing perspective, the grid simply has a missing interior line segment. A cell spanning two rows may be identified only by the absence of a horizontal divider and the presence of text centered between two horizontal rules. These absences must be inferred from the reconstructed grid, not read directly.
**Multi-page tables.** A table split across a page break leaves no continuation marker. The bottom of page N and the top of page N+1 must be matched using column count, column width fingerprints, and optionally a repeated header row as a structural anchor.
**Nested tables.** A cell may contain a second table. The inner table's lines intersect with the outer table's coordinate space; naive grid reconstruction will produce spurious cells unless nested table bounding boxes are detected and isolated before the outer grid is finalized.
**Mixed cell types.** A table may contain text cells, numeric cells, and cells whose primary content is a raster image or vector graphic rather than glyphs. The reconstruction algorithm must allocate cell bounding boxes correctly even when a cell contains no glyphs at all.
---
## Line-Based Detection
The most reliable signal for table structure is explicit ruling lines drawn with PDF path operators.
### Identifying Path Operators
In a PDF content stream, lines are drawn with sequences like:
```
x0 y0 m % moveto
x1 y1 l % lineto
S % stroke
```
Rectangles are drawn with `re`:
```
x y w h re S % stroke a rectangle
x y w h re f % fill a rectangle
```
Rectangles drawn with `re` and stroked produce four line segments implicitly. When parsing, expand each `re` into its four constituent segments before analysis.
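The expansion is small enough to show in full. The `Segment` type and `expand_rect` function are illustrative names, and the normalization of negative widths and heights is an added precaution so that each segment runs from its smaller to its larger coordinate.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Segment {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Expand the operands of an `re` operator into bottom, top, left, and right edges.
pub fn expand_rect(x: f32, y: f32, w: f32, h: f32) -> [Segment; 4] {
    // Normalize so that (xa, ya) is the lower-left and (xb, yb) the upper-right corner.
    let (xa, xb) = if w >= 0.0 { (x, x + w) } else { (x + w, x) };
    let (ya, yb) = if h >= 0.0 { (y, y + h) } else { (y + h, y) };
    [
        Segment { x0: xa, y0: ya, x1: xb, y1: ya }, // bottom
        Segment { x0: xa, y0: yb, x1: xb, y1: yb }, // top
        Segment { x0: xa, y0: ya, x1: xa, y1: yb }, // left
        Segment { x0: xb, y0: ya, x1: xb, y1: yb }, // right
    ]
}
```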
### Reconstructing the Grid
Once all horizontal and vertical line segments are collected, cluster them by orientation:
- **Horizontal:** segments where |y0 - y1| < epsilon (typically 0.5 pt in PDF space).
- **Vertical:** segments where |x0 - x1| < epsilon.
Merge collinear segments that share the same y (or x) coordinate and whose extents along the line overlap or are contiguous within a small gap threshold (e.g., 2 pt). This handles dashed or dotted rules that a producer has emitted as many short `l`/`S` segments instead of a single path with a dash pattern (the `d` operator); those segments must be merged back into one logical line.
Hairline rules (line width < 0.5 pt) are nearly invisible at normal zoom but still define table structure. Do not filter by line width; instead, track line width as metadata for later rendering decisions.
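The merge step for horizontal segments might be sketched as below; vertical segments would be handled the same way with the axes swapped. The `HLine` type and the bucketing of y by `eps` are assumptions of this sketch, and segments are assumed to be stored with `x0 <= x1`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct HLine {
    pub y: f32,
    pub x0: f32,
    pub x1: f32,
}

/// Quantize y so that segments within roughly `eps` of each other share a bucket.
fn bucket(y: f32, eps: f32) -> i64 {
    (y / eps).round() as i64
}

/// Merge horizontal segments that lie on the same line and overlap
/// or are separated by at most `max_gap`.
pub fn merge_horizontal(mut segs: Vec<HLine>, eps: f32, max_gap: f32) -> Vec<HLine> {
    // Group by y bucket first, then order left to right within each bucket.
    segs.sort_by(|a, b| {
        bucket(a.y, eps)
            .cmp(&bucket(b.y, eps))
            .then(a.x0.total_cmp(&b.x0))
    });
    let mut out: Vec<HLine> = Vec::new();
    for s in segs {
        match out.last_mut() {
            Some(last)
                if bucket(last.y, eps) == bucket(s.y, eps) && s.x0 <= last.x1 + max_gap =>
            {
                // Extend the current logical line; its y stays that of the first segment.
                last.x1 = last.x1.max(s.x1);
            }
            _ => out.push(s),
        }
    }
    out
}
```

Bucketing can split two segments that straddle a bucket boundary even though they are closer than `eps`; a production version would more likely cluster y-values than quantize them.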
After merging, find all intersection points between horizontal and vertical segments. These intersections are candidate grid vertices. Build the grid by:
1. For each unique y-coordinate of a horizontal line, record its x-extent.
2. For each unique x-coordinate of a vertical line, record its y-extent.
3. A valid grid cell exists between four vertices (x0,y0), (x1,y0), (x0,y1), (x1,y1) where all four edges are present.
### Partial Borders
Many real-world tables use only top and bottom borders (no vertical separators), or only an outer frame (no interior lines). Handle this by relaxing the grid completeness requirement: a cell boundary edge need not exist as a drawn line — it may instead be inferred from whitespace gaps (see next section). A mixed detection pass first identifies all explicit lines, then applies gap analysis only in the regions where lines are absent.
---
## Whitespace Gap Analysis (Borderless Tables)
When no ruling lines are present, column boundaries must be inferred from the distribution of glyph bounding boxes.
### Projection Profiles
For each horizontal band (row) of glyphs, compute the union of all glyph x-extents. This produces an "occupied" interval set. Its complement, the gaps between occupied intervals, gives the candidate column separators.
To find separators that are consistent across multiple rows, build a **vertical projection profile**: for each x-coordinate, count how many rows have glyph coverage at that x. A column separator is a contiguous x-range where glyph coverage across all rows falls to zero (or near-zero, to tolerate small overhangs).
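Instead of sampling every x-coordinate, the profile can be computed with a sweep over interval endpoints. The sketch below is one possible formulation, not a prescribed one: `column_gaps` is a hypothetical name, each row is assumed to hold disjoint occupied intervals, and `max_covering_rows` expresses the near-zero tolerance.

```rust
/// `rows[i]` is the occupied interval set of row i, as disjoint (x0, x1) pairs.
/// Returns the interior x-ranges covered by at most `max_covering_rows` rows.
pub fn column_gaps(rows: &[Vec<(f32, f32)>], max_covering_rows: usize) -> Vec<(f32, f32)> {
    // Each interval contributes a +1 event at its start and a -1 event at its end.
    let mut events: Vec<(f32, i32)> = Vec::new();
    for row in rows {
        for &(x0, x1) in row {
            events.push((x0, 1));
            events.push((x1, -1));
        }
    }
    // At equal x, ends sort before starts, so touching intervals yield no gap.
    events.sort_by(|a, b| a.0.total_cmp(&b.0).then(a.1.cmp(&b.1)));

    let mut gaps = Vec::new();
    let mut cover: i32 = 0;
    let mut gap_start: Option<f32> = None;
    for (x, delta) in events {
        cover += delta;
        let low = cover <= max_covering_rows as i32;
        match (low, gap_start) {
            (true, None) => gap_start = Some(x),
            (false, Some(start)) => {
                if x > start {
                    gaps.push((start, x));
                }
                gap_start = None;
            }
            _ => {}
        }
    }
    // A gap still open after the last event is trailing whitespace, not a separator.
    gaps
}
```

With `max_covering_rows` set to 0 only fully empty ranges are returned; the widths of the returned gaps still need to be checked against a minimum threshold before they are treated as column boundaries.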
### Minimum Gap Threshold
Not every gap is a column boundary. Word spacing within a cell also creates gaps. A practical threshold is:
```
min_column_gap = median_word_space * K
```
where `median_word_space` is the median inter-word gap in the document (estimated from the distribution of x-advances within text runs) and `K` is an empirically determined factor, typically 2.0 to 3.0. Gaps narrower than this threshold are word spaces, not column separators.
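The threshold formula translates directly into code. The function name and the choice to return `None` for an empty sample are assumptions; how the inter-word gaps are collected is left to the caller.

```rust
/// Compute `median_word_space * k` from a sample of inter-word gaps.
/// Returns `None` when no gaps were observed.
pub fn min_column_gap(word_gaps: &[f32], k: f32) -> Option<f32> {
    if word_gaps.is_empty() {
        return None;
    }
    let mut sorted = word_gaps.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    // For an even-sized sample, average the two middle values.
    let median = if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    };
    Some(median * k)
}
```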
### Distinguishing Prose from Tabular Data
A multi-column prose layout (newspaper columns) also exhibits consistent vertical gaps. Distinguish it from a table by:
- **Row count:** Tables typically have more than 3–4 rows with consistent column structure. A two-column prose block may span many rows, but its column boundary is not reused at the cell level.
- **Alignment consistency:** In a table, text within a column tends to share a dominant alignment (left, right, or decimal-aligned). In prose, each column is independently left-justified without cross-column structural meaning.
- **Column count stability:** In a table, the number of occupied columns per row is near-constant. In prose, partial final paragraphs may occupy only one column.
A block is classified as tabular if at least 60% of its detected rows share the same column separator positions within a ±2 pt tolerance.
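The 60% rule can be sketched as a pairwise comparison of per-row separator positions. This is an illustrative formulation with hypothetical names: each row is represented by its sorted separator x-positions, and rows with no separators never serve as the reference pattern.

```rust
/// Two rows agree when they have the same number of separators
/// and each pair of positions differs by at most `tol`.
fn same_separators(a: &[f32], b: &[f32], tol: f32) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| (x - y).abs() <= tol)
}

/// `rows[i]` holds the sorted column separator positions detected in row i.
/// Returns true when the most common separator pattern is shared by at
/// least `min_fraction` of all rows.
pub fn is_tabular(rows: &[Vec<f32>], tol: f32, min_fraction: f32) -> bool {
    let best = rows
        .iter()
        .filter(|r| !r.is_empty())
        .map(|r| rows.iter().filter(|o| same_separators(r, o, tol)).count())
        .max()
        .unwrap_or(0);
    best > 0 && best as f32 >= min_fraction * rows.len() as f32
}
```

The comparison is quadratic in the number of rows, which is acceptable for page-sized inputs but would want a clustering step for very long tables.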
---
## Hough Transform Approach
When neither explicit path operators nor clean whitespace gaps are available — for example, in scanned-and-re-embedded PDFs where glyphs are rasterized but positioned with high precision — glyph bounding box edges can serve as line evidence.
### Parameter Space
For each glyph bounding box, emit four candidate line segments: top, bottom, left, right edges. Accumulate votes in a discretized (rho, theta) Hough space, restricted to near-horizontal (|theta| < 5 degrees) and near-vertical (|theta - 90| < 5 degrees) bins. The angular restriction eliminates the need to search the full 180-degree space and reduces noise from diagonal text.
### Practical Thresholds
In PDF coordinate space (72 units per inch), a meaningful accumulator bin width is approximately 1.0 unit in rho (roughly 1/72 inch). A line is considered detected when its accumulator bin exceeds a count threshold proportional to the expected number of cells in that row or column — typically max(3, row_count * 0.5).
Post-process detected lines with non-maximum suppression in rho: within a 3-unit window, keep only the rho value with the highest accumulator count.
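The voting and suppression steps can be illustrated with a deliberately reduced version. The sketch below fixes the angle to exactly horizontal or exactly vertical, so it is a one-dimensional accumulator over rho rather than a full (rho, theta) Hough transform; the angular bins described above are not modeled. The function name and parameters are assumptions, and the vote threshold is passed in by the caller.

```rust
use std::collections::BTreeMap;

/// `edge_coords` are the y-values of glyph box tops and bottoms (for horizontal
/// lines) or the x-values of lefts and rights (for vertical lines).
/// Returns the detected line positions in ascending order.
pub fn detect_lines(
    edge_coords: &[f32],
    bin_width: f32,
    min_votes: usize,
    nms_window: f32,
) -> Vec<f32> {
    // Accumulate one vote per edge in its rho bin.
    let mut acc: BTreeMap<i64, usize> = BTreeMap::new();
    for &c in edge_coords {
        *acc.entry((c / bin_width).round() as i64).or_insert(0) += 1;
    }
    // Keep bins that reach the vote threshold, strongest first.
    let mut peaks: Vec<(i64, usize)> =
        acc.into_iter().filter(|&(_, n)| n >= min_votes).collect();
    peaks.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));

    // Non-maximum suppression: a bin survives only if no stronger kept bin
    // lies closer than the window.
    let window_bins = (nms_window / bin_width).round() as i64;
    let mut kept: Vec<i64> = Vec::new();
    for (bin, _) in peaks {
        if kept.iter().all(|k| (k - bin).abs() >= window_bins) {
            kept.push(bin);
        }
    }
    kept.sort();
    kept.into_iter().map(|b| b as f32 * bin_width).collect()
}
```

A caller following the thresholds above would pass a bin width of 1.0, a window of 3.0, and a `min_votes` derived from the expected cell count.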
---
## Graph-Based Cell Reconstruction
Treat the set of detected line segments (from explicit paths, gap analysis, or Hough) as a planar straight-line graph (PSLG). Cells correspond to bounded faces of this graph.
### Finding Rectangular Faces
For each horizontal segment endpoint (x0, y), search rightward along y for the nearest vertical segment at x1 > x0 that spans y. Then search downward from x1 at y for the nearest horizontal segment at y1 < y. Then verify a closing vertical segment exists at x0 spanning [y1, y]. If all four edges are found, the region (x0, x1, y1, y) is a candidate cell.
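A sketch of this search follows, with one deliberate variation: instead of starting only from horizontal segment endpoints, it starts from every vertical segment that descends from a horizontal line, so that long merged rules still yield one candidate per cell. All type and function names are hypothetical, and segments are assumed to be merged and stored with their smaller coordinate first.

```rust
#[derive(Debug, Clone, Copy)]
pub struct HSeg { pub y: f32, pub x0: f32, pub x1: f32 }
#[derive(Debug, Clone, Copy)]
pub struct VSeg { pub x: f32, pub y0: f32, pub y1: f32 }
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Cell { pub x0: f32, pub x1: f32, pub y0: f32, pub y1: f32 }

fn h_covers(h: &HSeg, xa: f32, xb: f32, tol: f32) -> bool {
    h.x0 <= xa + tol && h.x1 >= xb - tol
}
fn v_covers(v: &VSeg, ya: f32, yb: f32, tol: f32) -> bool {
    v.y0 <= ya + tol && v.y1 >= yb - tol
}
/// True when the vertical segment reaches `y` and continues below it.
fn descends_from(v: &VSeg, y: f32, tol: f32) -> bool {
    v.y1 >= y - tol && v.y0 < y - tol
}

pub fn find_cells(hs: &[HSeg], vs: &[VSeg], tol: f32) -> Vec<Cell> {
    let mut cells = Vec::new();
    for top in hs {
        let lefts = vs
            .iter()
            .filter(|v| descends_from(v, top.y, tol) && h_covers(top, v.x, v.x, tol));
        for left in lefts {
            // Nearest vertical to the right that also hangs from the top edge.
            let right = vs
                .iter()
                .filter(|v| {
                    v.x > left.x + tol
                        && descends_from(v, top.y, tol)
                        && h_covers(top, left.x, v.x, tol)
                })
                .min_by(|a, b| a.x.total_cmp(&b.x));
            let right = match right { Some(r) => r, None => continue };
            // Nearest horizontal below the top edge spanning left..right.
            let bottom = hs
                .iter()
                .filter(|h| h.y < top.y - tol && h_covers(h, left.x, right.x, tol))
                .max_by(|a, b| a.y.total_cmp(&b.y));
            let bottom = match bottom { Some(b) => b, None => continue };
            // Both side edges must run the full height of the candidate cell.
            if v_covers(left, bottom.y, top.y, tol) && v_covers(right, bottom.y, top.y, tol) {
                cells.push(Cell { x0: left.x, x1: right.x, y0: bottom.y, y1: top.y });
            }
        }
    }
    cells
}
```

Because a missing interior divider simply fails the `descends_from` or `h_covers` test, a merged cell falls out of the same search as a wider or taller rectangle.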
### Junction Handling
T-junctions (three segments meeting) and L-junctions (two segments meeting at a corner) indicate partial borders. At a T-junction, the crossing segment does not divide the face; the cell extends across the missing interior edge. Track junction types during segment intersection enumeration and mark edges as "border present" or "border absent" accordingly.
### Row and Column Index Assignment
After all cells are identified, assign integer row and column indices:
1. Sort cells by top-left corner: primary key y (descending, since PDF y increases upward), secondary key x (ascending).
2. Group cells into rows by y-coordinate proximity (tolerance ±2 pt).
3. Within each row, assign column indices by x-order.
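The three steps can be sketched as below. The `Cell` type is a hypothetical reduction to the fields this step needs. Rows are grouped before the x-sort so that small differences in y within a row cannot disturb the column order; column indices here are plain positions within the row and do not yet account for spans.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Cell {
    pub left: f32, // x of the top-left corner
    pub top: f32,  // y of the top-left corner (PDF space, y increases upward)
    pub row: usize,
    pub col: usize,
}

/// Assign row and column indices in place; cells end up in row-major order.
pub fn assign_indices(cells: &mut [Cell], tol: f32) {
    // Highest y first, because the top of the page has the largest y.
    cells.sort_by(|a, b| b.top.total_cmp(&a.top));
    let mut start = 0;
    let mut row = 0;
    while start < cells.len() {
        // Extend the row while tops stay within `tol` of the row's first cell.
        let row_top = cells[start].top;
        let mut end = start + 1;
        while end < cells.len() && (row_top - cells[end].top).abs() <= tol {
            end += 1;
        }
        let group = &mut cells[start..end];
        group.sort_by(|a, b| a.left.total_cmp(&b.left));
        for (col, cell) in group.iter_mut().enumerate() {
            cell.row = row;
            cell.col = col;
        }
        row += 1;
        start = end;
    }
}
```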
---
## Merged Cell Detection
A merged cell spanning multiple columns is identified by the absence of a vertical interior border between two adjacent column positions. When the graph traversal finds a cell whose x-extent covers more than one column-width interval, set `col_span > 1`.
A merged cell spanning multiple rows is identified by the absence of a horizontal interior border between two adjacent row positions. Set `row_span > 1` accordingly.
Validate merges by checking that the combined bounding box of the merged cell is flush with the enclosing grid lines: the outer border must exist even if the interior dividers do not.
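Given the sorted positions of the grid lines along one axis, both the span and the flush check reduce to a few comparisons. The function names are illustrative; `bounds` is assumed to hold column x-positions when computing `col_span` and row y-positions (ascending) when computing `row_span`.

```rust
/// Count how many grid intervals the extent [lo, hi] covers; never less than 1.
pub fn span(bounds: &[f32], lo: f32, hi: f32, tol: f32) -> usize {
    bounds
        .windows(2)
        .filter(|w| w[0] >= lo - tol && w[1] <= hi + tol)
        .count()
        .max(1)
}

/// A merged cell is valid only if both of its outer edges sit on grid lines.
pub fn is_flush(bounds: &[f32], lo: f32, hi: f32, tol: f32) -> bool {
    let on_line = |v: f32| bounds.iter().any(|b| (b - v).abs() <= tol);
    on_line(lo) && on_line(hi)
}
```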
---
## Header Row Detection
Header rows carry column labels and are distinguished from data rows by multiple signals, each assigned a weight:
| Signal | Weight |
|--------|--------|
| Font weight bold (detected from font name or `FontDescriptor.StemV`) | High |
| Font size larger than modal data row font size | High |
| Background fill color distinct from data rows (detected from `re f` operations covering the row) | High |
| First row in the table | Medium |
| Text content matches all-uppercase or title-case pattern | Low |
| Text content contains no numeric-only cells | Low |
A row scores as a header if the weighted sum exceeds a threshold. In practice, a bold font alone is usually sufficient.
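The table gives qualitative weights only, so any numeric scoring has to pick values. The sketch below assumes High = 3, Medium = 2, Low = 1; these numbers, the `RowSignals` type, and the function names are illustrative choices, selected so that a bold font alone clears a threshold of 2.

```rust
/// Signals observed for one table row (hypothetical type).
#[derive(Default)]
pub struct RowSignals {
    pub bold: bool,
    pub larger_font: bool,
    pub distinct_fill: bool,
    pub first_row: bool,
    pub caps_or_title_case: bool,
    pub no_numeric_only_cells: bool,
}

// Assumed numeric stand-ins for the High / Medium / Low weights.
const HIGH: u32 = 3;
const MEDIUM: u32 = 2;
const LOW: u32 = 1;

pub fn header_score(s: &RowSignals) -> u32 {
    let mut score = 0;
    if s.bold { score += HIGH; }
    if s.larger_font { score += HIGH; }
    if s.distinct_fill { score += HIGH; }
    if s.first_row { score += MEDIUM; }
    if s.caps_or_title_case { score += LOW; }
    if s.no_numeric_only_cells { score += LOW; }
    score
}

/// A row is a header when its weighted sum exceeds the threshold.
pub fn is_header(s: &RowSignals, threshold: u32) -> bool {
    header_score(s) > threshold
}
```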
---
## Multi-Page Tables
When the last detected table on page N and the first detected structure on page N+1 share a compatible column fingerprint, treat them as a continued table.
A **column fingerprint** is a sorted tuple of (normalized_x_start, normalized_x_end) pairs for each column, where coordinates are normalized to the page width. Two fingerprints match if their column count is equal and each corresponding column boundary pair differs by less than 3% of page width.
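A minimal sketch of building and comparing fingerprints, with hypothetical function names. Column extents are given in page units and normalized by page width; the tolerance is passed as a fraction, so 3% of page width is `0.03`.

```rust
/// Normalize column extents to the page width and sort them left to right.
pub fn column_fingerprint(columns: &[(f32, f32)], page_width: f32) -> Vec<(f32, f32)> {
    let mut fp: Vec<(f32, f32)> = columns
        .iter()
        .map(|&(start, end)| (start / page_width, end / page_width))
        .collect();
    fp.sort_by(|a, b| a.0.total_cmp(&b.0));
    fp
}

/// Fingerprints match when they have the same column count and every
/// corresponding boundary differs by less than `tol` (a fraction of page width).
pub fn fingerprints_match(a: &[(f32, f32)], b: &[(f32, f32)], tol: f32) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(p, q)| (p.0 - q.0).abs() < tol && (p.1 - q.1).abs() < tol)
}
```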
If the first row of the continuation page is a header row (detected as above) and its text content matches the header of the initial page, strip the repeated header from the continuation and record it as a `repeated_header` flag on the table.
---
## Output Representation
A reconstructed table is encoded in the extraction JSON as follows:
```json
{
"type": "table",
"page": 1,
"bounding_box": { "x0": 72.0, "y0": 400.0, "x1": 540.0, "y1": 680.0 },
"col_count": 3,
"row_count": 5,
"rows": [
{
"index": 0,
"is_header": true,
"cells": [
{
"row": 0,
"col": 0,
"row_span": 1,
"col_span": 1,
"bounding_box": { "x0": 72.0, "y0": 640.0, "x1": 216.0, "y1": 680.0 },
"text": "Product",
"border_present": { "top": true, "bottom": true, "left": true, "right": true }
}
]
}
],
"continued_from_page": null,
"continues_on_page": 2
}
```
Key field semantics:
- `bounding_box` uses PDF coordinate space (origin at bottom-left, y increases upward). Consumers converting to screen space must flip y.
- `row_span` and `col_span` are always >= 1. A standard unmerged cell has both equal to 1.
- `border_present` encodes which of the four cell edges had an explicit path operator or a sufficiently strong gap signal. This allows downstream renderers to faithfully reproduce the visual structure.
- `text` is the concatenation of glyphs within the cell bounding box, in reading order (left-to-right, top-to-bottom). Cells containing only images have an empty `text` field.
- `is_header` is set on each row classified as a header, so every cell in that row carries header status, including merged header cells that span multiple columns.
- `continued_from_page` and `continues_on_page` are `null` when the table fits on a single page, or contain the 1-based page index of the adjacent page fragment.
This representation is lossless with respect to the detected structure and provides sufficient metadata for downstream consumers to reconstruct a DOM-equivalent table, apply styling, or perform data extraction without re-analyzing geometry.
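For consumers that need screen coordinates, the y flip mentioned under `bounding_box` looks like this. The `BBox` type and `to_screen` function are illustrative names, and the sketch assumes an unrotated page whose origin is at the bottom-left corner with no crop-box offset.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Convert a PDF-space box (y increases upward) to screen space (y increases
/// downward), keeping y0 <= y1 in the result.
pub fn to_screen(b: &BBox, page_height: f32) -> BBox {
    BBox {
        x0: b.x0,
        y0: page_height - b.y1, // the PDF top edge becomes the smaller screen y
        x1: b.x1,
        y1: page_height - b.y0,
    }
}
```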