jedarden b805593973 Add six research documents covering output-side extraction topics

- table-structure-reconstruction: line detection, gap analysis, Hough
  transform, graph-based cell reconstruction, merged cells, multi-page tables
- mathematical-expression-handling: five encoding cases, OpenType MATH table,
  symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers
- language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi,
  CJK vertical text, ligature normalization, whatlang/lingua integration
- document-classification-and-zone-labeling: margin heuristics, font
  clustering, cross-page recurrence, footnote/caption/sidebar detection
- post-extraction-normalization: hyphen handling, ligature expansion,
  paragraph reconstruction, Unicode normalization, pipeline ordering
- chunking-for-llm-consumption: semantic snapping, heading hierarchy,
  sliding window overlap, table chunking strategies, token budget, late chunking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 14:56:25 -04:00

Document Classification and Zone Labeling

Overview

After raw text extraction, each glyph or span has a position, font reference, and character content — but no semantic role. Zone labeling is the process of assigning a role to each text block: body, heading, header, footer, footnote, caption, sidebar, marginalia, or page_number. This pass runs after block assembly (grouping spans into lines and lines into paragraphs) but before reading-order resolution.

1. Why Zone Labeling Matters

Without zone labeling, extracted text is a raw positional dump. The damage is concrete:

Running headers interleaved with body paragraphs. A header reading "Chapter 3: Results" appears between sentences because its y-coordinate places it between two body blocks on the same page.
Page numbers embedded mid-sentence. A numeric "42" extracted in column order falls between the last word of one paragraph and the first word of the next.
Footnote markers disrupting prose flow. Superscript ³ extracted inline pulls the following footnote text — located at the bottom of the page — into the paragraph body.
Sidebar text inserted at random positions. A pull quote in the right margin, if read left-to-right by x-coordinate, bisects the main-column paragraph it is adjacent to.

The cost compounds downstream. Language models, search indexers, and screen readers all treat the extracted string as coherent prose. Injected non-body content corrupts sentence boundary detection, keyword density, and logical paragraph structure. Zone labeling is the gate that filters what reaches the output string.

2. Page Margin Heuristics

The simplest zone signals are geometric: headers and footers live at fixed vertical positions near the page boundary.

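Threshold definition. Work in top-origin coordinates, with y = 0 at the top edge of the page and y increasing downward (native PDF user space has its origin at the bottom-left, so flip y first if needed). header_zone_max_y is the largest y at which a header candidate may end: the block must lie entirely above this line. A reliable default is 10–15% of page height from the top. Similarly, footer_zone_min_y is the smallest y at which a footer candidate may start: the block must lie entirely below this line, which sits 10–15% of page height above the bottom edge.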

header_zone_max_y = page_height * 0.12
footer_zone_min_y = page_height - page_height * 0.12

Blocks whose bounding box falls entirely within these bands are header/footer candidates, not yet confirmed.
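The band test above can be sketched as follows. This is a minimal illustration, not the project's implementation: the Band enum, the margin_band function, and its parameter names are hypothetical, and it assumes top-origin coordinates (y grows downward).

```rust
/// Vertical band a block falls into. Assumes top-origin coordinates:
/// y = 0 at the top edge of the page, increasing downward.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Band {
    Header,
    Footer,
    None,
}

/// Hypothetical helper. `top` and `bottom` are the block's vertical extent;
/// `band_fraction` is the share of page height given to each band (e.g. 0.12).
pub fn margin_band(top: f32, bottom: f32, page_height: f32, band_fraction: f32) -> Band {
    let header_zone_max_y = page_height * band_fraction;
    let footer_zone_min_y = page_height - page_height * band_fraction;
    if top >= 0.0 && bottom <= header_zone_max_y {
        // Entire bounding box sits inside the top band.
        Band::Header
    } else if top >= footer_zone_min_y && bottom <= page_height {
        // Entire bounding box sits inside the bottom band.
        Band::Footer
    } else {
        // Straddles a band edge or lies in the body region.
        Band::None
    }
}
```

A block that straddles a band edge is deliberately left as Band::None, matching the "falls entirely within" wording; the result is still only a candidate until the stability filter confirms it.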

Page number pattern detection. Within the footer (or header) band, apply regex against the extracted text:

^\d+$                          // bare number: 42
^Page\s+\d+(\s+of\s+\d+)?$     // Page 3 of 12
^[ivxlcdmIVXLCDM]+$            // roman numerals: xiv
^[-–—]\s*\d+\s*[-–—]$          // dash framing (hyphen, en dash, or em dash): – 42 –

A block matching any of these within the margin band receives label page_number at high confidence (≥ 0.90).
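For readers who want to avoid a regex dependency, the same four patterns can be approximated with the standard library alone. The sketch below is an assumption-laden illustration: looks_like_page_number and its helpers are hypothetical names, the input is trimmed before matching, and the dash check accepts hyphen, en dash, and em dash.

```rust
/// True if `s` is non-empty and consists only of ASCII digits.
fn is_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// True if `s` is non-empty and consists only of roman-numeral letters.
fn is_roman(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| "ivxlcdmIVXLCDM".contains(c))
}

/// Hyphen-minus, en dash, or em dash.
fn is_dash(c: char) -> bool {
    matches!(c, '-' | '\u{2013}' | '\u{2014}')
}

/// Hypothetical std-only counterpart of the four page-number patterns.
pub fn looks_like_page_number(text: &str) -> bool {
    let t = text.trim();
    if is_digits(t) || is_roman(t) {
        return true;
    }
    // "Page 3" or "Page 3 of 12"
    let words: Vec<&str> = t.split_whitespace().collect();
    match words.as_slice() {
        ["Page", n] if is_digits(n) => return true,
        ["Page", n, "of", m] if is_digits(n) && is_digits(m) => return true,
        _ => {}
    }
    // Dash framing: a dash, optional spaces, digits, optional spaces, a dash.
    let mut chars = t.chars();
    if let (Some(first), Some(last)) = (chars.next(), chars.next_back()) {
        if is_dash(first) && is_dash(last) {
            return is_digits(chars.as_str().trim());
        }
    }
    false
}
```

Note that the roman-numeral pattern also accepts ordinary words made of those letters, such as "mix" or "dim", which is one reason the check is only applied inside the margin band.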

Stability filter. A single page cannot confirm a header or footer — any text can appear near the top by chance. Apply the stability filter (described in section 5) before committing the label.

3. Font-Based Classification

Font metadata distinguishes heading hierarchy from body text, and body from ancillary text like captions and footnotes.

Build a font inventory. On first pass over the document, collect (font_name, font_size, is_bold, is_italic) tuples from every span. Normalize font sizes to points. Cluster sizes into bins using a simple histogram with a 0.5pt merge tolerance to collapse rounding artifacts. The bin with the highest total character count is the dominant body size — call it body_pt.
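A minimal sketch of the body-size estimate follows. SpanInfo and dominant_body_size are hypothetical names, and the fixed 0.5pt quantization is a simplification of a true merge tolerance: two sizes that straddle a bin edge (for example 9.74 and 9.76) land in different bins.

```rust
use std::collections::BTreeMap;

/// Hypothetical span summary: font size in points and number of characters.
pub struct SpanInfo {
    pub font_size_pt: f32,
    pub char_count: usize,
}

/// Returns the dominant body size in points, or None if there are no spans.
/// Sizes are quantized to 0.5pt bins to absorb rounding artifacts.
pub fn dominant_body_size(spans: &[SpanInfo]) -> Option<f32> {
    let mut bins: BTreeMap<i32, usize> = BTreeMap::new();
    for s in spans {
        // Bin index counts half-points: 10.0pt -> 20, 10.5pt -> 21.
        let bin = (s.font_size_pt * 2.0).round() as i32;
        *bins.entry(bin).or_insert(0) += s.char_count;
    }
    bins.into_iter()
        .max_by_key(|&(_, count)| count)
        .map(|(bin, _)| bin as f32 / 2.0)
}
```

Weighting by character count rather than span count keeps a document with many short headings from outvoting its long body paragraphs.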

Heading detection. A block where all spans share a font size > body_pt * 1.25 and is_bold == true is a strong heading candidate. Multiple heading levels are recoverable by ordered font-size clustering: the largest non-body size is h1, the next is h2, and so on, up to three levels before the signal becomes unreliable.

Caption and footnote detection. Blocks where the font size is < body_pt * 0.85 are small-text candidates. Combine with position (bottom-of-page for footnotes, adjacent to a whitespace gap for captions) and font style (often italic for captions) to disambiguate.

Dominance rule. If a block mixes body-sized and heading-sized spans (e.g., a sentence with a bold lead word), classify by the dominant style: the size class whose spans cover more than 60% of the block's total character width. If no class reaches 60%, keep the block as body.
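The size thresholds and the dominance rule can be combined in a short sketch. The types and function names here are hypothetical, and falling back to Body when neither the heading nor the small-text class covers more than 60% of the width is an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SizeClass {
    Heading,
    Body,
    Small,
}

/// Hypothetical span summary; `width` is the span's horizontal extent in points.
pub struct StyledSpan {
    pub font_size_pt: f32,
    pub is_bold: bool,
    pub width: f32,
}

/// Per-span class from the thresholds in the text (1.25x bold, 0.85x).
pub fn classify_span(span: &StyledSpan, body_pt: f32) -> SizeClass {
    if span.font_size_pt > body_pt * 1.25 && span.is_bold {
        SizeClass::Heading
    } else if span.font_size_pt < body_pt * 0.85 {
        SizeClass::Small
    } else {
        SizeClass::Body
    }
}

/// Block class: a non-body class wins only if its spans cover more than
/// 60% of the block's total width; otherwise the block stays Body.
pub fn classify_block(spans: &[StyledSpan], body_pt: f32) -> SizeClass {
    let total: f32 = spans.iter().map(|s| s.width).sum();
    if total <= 0.0 {
        return SizeClass::Body;
    }
    for class in [SizeClass::Heading, SizeClass::Small] {
        let covered: f32 = spans
            .iter()
            .filter(|s| classify_span(s, body_pt) == class)
            .map(|s| s.width)
            .sum();
        if covered / total > 0.60 {
            return class;
        }
    }
    SizeClass::Body
}
```

A Small result is only a caption or footnote candidate; position and style still have to disambiguate it as described above.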

4. Positional Heuristics

Centred text as heading signal. Compute the horizontal midpoint of a block's bounding box. If it falls within 5% of page width from the page centre, and the block is a single line, raise the heading confidence. Centring alone is not sufficient — font size must also exceed body size.

Indentation patterns. Measure the left-edge x-coordinate of the first line vs. subsequent lines in a paragraph block. Standard body paragraphs have a consistent left margin with an optional positive first-line indent. A hanging indent, where the first line starts further left than the continuation lines, is a strong footnote or bibliography signal. A large positive indent on every line suggests a block quote.

Column boundary detection. Collect the left-edge x-coordinates of all body-candidate blocks on a page. Cluster them; two tight clusters indicate a two-column layout, defining column boundaries. Any block whose x-origin falls outside both columns and within the page margin is a marginalia candidate.
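One simple way to cluster the left edges is a one-dimensional gap split, sketched below. The function name, the max_gap tolerance, and the min_members cutoff are all assumptions of this sketch; the text does not prescribe a clustering method.

```rust
/// Groups left-edge x-coordinates into clusters: after sorting, a new cluster
/// starts whenever the gap to the previous value exceeds `max_gap`.
/// Returns the mean x of every cluster with at least `min_members` blocks.
pub fn column_origins(left_edges: &[f32], max_gap: f32, min_members: usize) -> Vec<f32> {
    let mut xs: Vec<f32> = left_edges.to_vec();
    xs.sort_by(|a, b| a.total_cmp(b));

    let mut clusters: Vec<Vec<f32>> = Vec::new();
    for x in xs {
        let start_new = match clusters.last() {
            Some(c) => x - c[c.len() - 1] > max_gap,
            None => true,
        };
        if start_new {
            clusters.push(vec![x]);
        } else {
            clusters.last_mut().unwrap().push(x);
        }
    }

    clusters
        .iter()
        .filter(|c| c.len() >= min_members)
        .map(|c| c.iter().sum::<f32>() / c.len() as f32)
        .collect()
}
```

Two returned origins suggest a two-column layout. Sparse clusters, such as a lone margin note, are dropped by min_members and can then be tested as marginalia candidates.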

Outer margin detection. For a single-column document, define the body column as the region bounded by the median left and right x-extents of body blocks (±5% tolerance). Text that starts to the right of body_right + page_width * 0.05, or ends to the left of body_left - page_width * 0.05, is marginalia.

5. Cross-Page Consistency

A text block that recurs at the same position across multiple pages is definitionally a running element — header or footer — regardless of whether it triggered the margin-band heuristic.

Position fingerprint. For each page, record (y_normalized, height, width) for every candidate block, where y_normalized = block_top / page_height. Two blocks across pages are positionally equivalent if their y_normalized values differ by less than 0.01 and their widths are within 5%.

Sliding window. Process pages in groups of five (or fewer at document boundaries). A block position that appears in at least four of five consecutive pages is a running element. Assign header or footer based on whether it sits in the top or bottom margin band; if it falls outside both bands but recurs consistently, assign the closer one.
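The fingerprint comparison and the window test might look like the sketch below. Fingerprint, positionally_equivalent, and is_running_element are hypothetical names; measuring the 5% width tolerance relative to the wider block is an assumption, and documents with fewer pages than min_hits never qualify in this version.

```rust
/// Hypothetical position fingerprint of a candidate block.
#[derive(Debug, Clone, Copy)]
pub struct Fingerprint {
    pub y_normalized: f32, // block_top / page_height
    pub width: f32,
}

/// Same vertical position (within 0.01) and widths within 5% of the wider one.
pub fn positionally_equivalent(a: &Fingerprint, b: &Fingerprint) -> bool {
    let dy = (a.y_normalized - b.y_normalized).abs();
    let wider = a.width.max(b.width);
    dy < 0.01 && wider > 0.0 && (a.width - b.width).abs() / wider <= 0.05
}

/// `pages[i]` holds the fingerprints of the candidate blocks on page i.
/// True if `target` has an equivalent block on at least `min_hits` pages
/// within some run of `window` consecutive pages.
pub fn is_running_element(
    target: &Fingerprint,
    pages: &[Vec<Fingerprint>],
    window: usize,
    min_hits: usize,
) -> bool {
    let hits: Vec<bool> = pages
        .iter()
        .map(|blocks| blocks.iter().any(|b| positionally_equivalent(target, b)))
        .collect();
    // Shrink the window for short documents; slice::windows panics on 0.
    let w = window.min(hits.len());
    if w == 0 {
        return false;
    }
    hits.windows(w)
        .any(|win| win.iter().filter(|&&h| h).count() >= min_hits)
}
```

With window = 5 and min_hits = 4 this reproduces the four-of-five rule. A header that alternates position by page parity will not pass it directly, so one option is to run the check separately on the odd-page and even-page sequences.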

Recto/verso alternation. Academic and book PDFs often alternate left-aligned headers on even pages (verso) with right-aligned headers on odd pages (recto). Detect this by checking whether positionally equivalent blocks alternate between page-parity groups. When alternation is confirmed, apply the header label to both positions. Text content may differ (e.g., chapter title vs. section title); only position need match.

Recurring text fragments. Normalize extracted text (trim whitespace, collapse runs) and hash each candidate block. A hash appearing on more than 50% of pages is a strong running-element signal even if position varies slightly (e.g., centred headers on different-width pages).
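A sketch of the text-recurrence check is shown below. Instead of computing an explicit hash, it uses the normalized string as a HashMap key, which hashes it internally; normalize and recurring_texts are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Trims the text and collapses every whitespace run to a single space.
pub fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// `pages[i]` holds the texts of the candidate blocks on page i.
/// Returns the normalized texts that appear on more than 50% of pages.
pub fn recurring_texts(pages: &[Vec<String>]) -> HashSet<String> {
    let mut page_counts: HashMap<String, usize> = HashMap::new();
    for blocks in pages {
        // Count each distinct text at most once per page.
        let unique: HashSet<String> = blocks.iter().map(|t| normalize(t)).collect();
        for text in unique {
            *page_counts.entry(text).or_insert(0) += 1;
        }
    }
    page_counts
        .into_iter()
        .filter(|(_, n)| *n * 2 > pages.len())
        .map(|(text, _)| text)
        .collect()
}
```

Exact matching will miss running elements whose text changes on every page, such as page numbers, so this signal complements the positional fingerprint rather than replacing it.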

6. Footnote Detection

Footnote detection requires matching two artifacts: the inline marker and the footnote body.

Inline markers. During span assembly, track spans where font_size < body_pt * 0.75 and the span baseline is raised above the line baseline by more than 2pt. These are superscript candidates. Extract the character: if it is a digit, letter, or standard footnote symbol (∗ † ‡ § ¶), record it as a marker with its position.
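The superscript test can be sketched as a single function. The name footnote_marker and its parameters are hypothetical; rise_pt is assumed to be precomputed as the distance by which the span baseline sits above the line baseline, which sidesteps the coordinate-direction question. Accepting multi-digit markers such as "12" and the ASCII asterisk is an assumption beyond the text.

```rust
/// Returns the marker text if the span looks like a footnote marker.
/// `rise_pt` is how far the span baseline sits above the line baseline.
pub fn footnote_marker(
    text: &str,
    font_size_pt: f32,
    rise_pt: f32,
    body_pt: f32,
) -> Option<String> {
    // Superscript candidate: clearly smaller than body text and raised by more than 2pt.
    if font_size_pt >= body_pt * 0.75 || rise_pt <= 2.0 {
        return None;
    }
    let t = text.trim();
    // One or more numeric characters, e.g. "3", "12", or a superscript digit.
    let all_numeric = !t.is_empty() && t.chars().all(char::is_numeric);
    // Exactly one letter or one standard footnote symbol.
    let mut chars = t.chars();
    let single_symbol = match (chars.next(), chars.next()) {
        (Some(c), None) => {
            c.is_alphabetic() || "*\u{2217}\u{2020}\u{2021}\u{00A7}\u{00B6}".contains(c)
        }
        _ => false,
    };
    if all_numeric || single_symbol {
        Some(t.to_string())
    } else {
        None
    }
}
```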

Footnote body location. On the same page, look for blocks in the lower 35% of the page (block top greater than page_height * 0.65, measured from the top) that begin with a matching marker character, optionally followed by a period or space. The block's font size is typically < body_pt * 0.85.

Separator rule. Many PDF producers render a short horizontal rule (a thin rectangle path, typically 30–50% of column width, 0.5–1pt thick) immediately above the footnote area. When such a path is detected, all text blocks below it and above the footer band are footnote candidates, raising their confidence.

Overflow footnotes. A footnote body that begins on page N and continues on page N+1 has no marker on page N+1. Detect this by checking whether the last footnote block on a page ends mid-sentence, that is, whether its final non-whitespace character is something other than terminal punctuation (., !, or ?). If so, the first small-font block at the bottom of the next page inherits the footnote label.
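A minimal version of the mid-sentence test is sketched below. The function name is hypothetical, and skipping trailing closing quotes and brackets before checking the final character is an assumption of this sketch.

```rust
/// True if the text appears to stop mid-sentence: after trailing whitespace
/// and closing quotes or brackets are skipped, the last character is not
/// terminal punctuation. Empty text returns false.
pub fn ends_mid_sentence(text: &str) -> bool {
    let t = text.trim_end().trim_end_matches(|c: char| {
        matches!(c, '"' | '\'' | ')' | ']' | '\u{201D}' | '\u{2019}')
    });
    match t.chars().last() {
        Some(c) => !matches!(c, '.' | '!' | '?'),
        None => false,
    }
}
```

A footnote that ends with a hyphenated line break or a trailing conjunction is reported as mid-sentence; one that ends with an abbreviation such as "et al." is not, so the check is a heuristic rather than a guarantee.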

7. Caption Detection

Proximity to image placeholders. PDF image XObjects (type /XObject, subtype /Image) and form XObjects used as figures occupy rectangular regions on the page. After extracting all XObject bounding boxes, identify text blocks whose bounding box top is within body_pt * 3 of an XObject's bottom (for below-figure captions) or whose bottom is within the same threshold of an XObject's top (for above-figure captions).
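The proximity test can be sketched as below, assuming top-origin rectangles where y0 is the top and y1 the bottom. Rect, CaptionSide, and caption_side are hypothetical names, and the requirement that the text and figure overlap horizontally is an extra assumption not stated in the text.

```rust
/// Axis-aligned box in top-origin coordinates: y0 is the top, y1 the bottom.
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

#[derive(Debug, PartialEq)]
pub enum CaptionSide {
    Below,
    Above,
}

/// Returns where the text block sits relative to the figure if it is within
/// `body_pt * 3` of the figure's bottom or top edge, else None.
pub fn caption_side(text: Rect, figure: Rect, body_pt: f32) -> Option<CaptionSide> {
    let max_gap = body_pt * 3.0;
    // Assumption: a caption shares some horizontal extent with its figure.
    let overlaps_x = text.x0 < figure.x1 && figure.x0 < text.x1;
    if !overlaps_x {
        return None;
    }
    if (text.y0 - figure.y1).abs() <= max_gap {
        // Text top is close to the figure bottom.
        Some(CaptionSide::Below)
    } else if (figure.y0 - text.y1).abs() <= max_gap {
        // Text bottom is close to the figure top.
        Some(CaptionSide::Above)
    } else {
        None
    }
}
```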

Prefix pattern. Apply regex to the start of the block's text:

^(Figure|Fig\.|Table|Tbl\.|Scheme|Plate|Exhibit|Supplementary\s+Figure)\s+\d+

A prefix match raises caption confidence to ≥ 0.85 independent of position.
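Without a regex crate, the prefix check can be approximated with string operations. This sketch uses hypothetical names (PREFIXES, has_caption_prefix) and collapses whitespace first so that a prefix such as "Supplementary  Figure" with extra spaces still matches.

```rust
/// Caption prefixes from the pattern above; the longer alternative comes first.
const PREFIXES: [&str; 8] = [
    "Supplementary Figure",
    "Figure",
    "Fig.",
    "Table",
    "Tbl.",
    "Scheme",
    "Plate",
    "Exhibit",
];

/// True if the text starts with a caption prefix, then whitespace, then a digit.
pub fn has_caption_prefix(text: &str) -> bool {
    // Collapse whitespace runs so each separator is exactly one space.
    let t = text.split_whitespace().collect::<Vec<_>>().join(" ");
    PREFIXES.iter().any(|p| {
        t.strip_prefix(*p)
            .and_then(|rest| rest.strip_prefix(' '))
            .map_or(false, |rest| rest.starts_with(|c: char| c.is_ascii_digit()))
    })
}
```

Requiring the digit keeps body sentences such as "Table of contents" or "Figures show" from being promoted to captions.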

Short block heuristic. Captions are rarely longer than three lines. If a block adjacent to an image XObject contains more than three lines, treat only the first three as caption and reclassify the remainder as body.

8. Sidebar and Pull Quote Detection

Narrow column detection. A sidebar occupies a column significantly narrower than the main body column. If body column width is W, a block whose bounding box width is < W * 0.45 and whose x-extent overlaps the body column by at least 10% is a sidebar candidate.
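The narrow-column test might be written as follows. The function name is hypothetical, and reading "overlaps the body column by at least 10%" as 10% of the block's own width is an assumption of this sketch.

```rust
/// True if the block is narrower than 45% of the body column and at least
/// 10% of the block's width lies inside the body column's x-extent.
pub fn is_sidebar_candidate(block_x0: f32, block_x1: f32, body_x0: f32, body_x1: f32) -> bool {
    let block_w = block_x1 - block_x0;
    let body_w = body_x1 - body_x0;
    if block_w <= 0.0 || body_w <= 0.0 {
        return false;
    }
    // Width of the horizontal intersection, clamped at zero when disjoint.
    let overlap = (block_x1.min(body_x1) - block_x0.max(body_x0)).max(0.0);
    block_w < body_w * 0.45 && overlap / block_w >= 0.10
}
```

A narrow block that does not touch the body column at all fails this test and is better handled by the marginalia rule.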

Font differentiation. Pull quotes are typically set in a larger, bold, or italic typeface to distinguish them visually from body text. A block that is bold or italic, centred or right-aligned, and horizontally overlaps the main column is a pull quote candidate. Assign the label sidebar when the block also meets the narrow-column test; if the signal is ambiguous, leave the block as body with reduced confidence.

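Bounding box overlap logic. Measure how much of the candidate block's x-extent lies inside the main body column: the width of the intersection divided by the block's own width. A ratio above 0.3 but below 0.9 indicates a partial overlap consistent with sidebar placement. A full two-dimensional intersection-over-union (IoU) against the whole column rectangle is not a useful measure here, because the column is far taller than any single block and the IoU stays near zero.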

9. Confidence and Fallback

Each block receives a zone_confidence: f32 in [0.0, 1.0] computed from a weighted sum of signals:

Signal	Weight
Margin band (geometric)	0.30
Font size deviation from body	0.25
Cross-page recurrence	0.25
Regex / prefix pattern match	0.15
Positional heuristic (indent, centre)	0.05

Weights are normalized per label. When no label achieves confidence ≥ 0.50, default to body. This is the safe fallback: false negatives (unlabeled headers/footers passed through as body) are preferable to false positives (body text discarded as a header).
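One reading of "normalized per label" is that only the signals evaluated for a label contribute, and their weights are rescaled to sum to 1. The sketch below follows that reading; Signal, label_confidence, and pick_zone are hypothetical names, and each signal's score in [0, 1] is assumed to be computed elsewhere.

```rust
/// One evaluated signal: its table weight and a score in [0, 1].
pub struct Signal {
    pub weight: f32,
    pub score: f32,
}

/// Weighted mean over the signals that apply to one label.
/// Dividing by the weight total rescales the applied weights to sum to 1.
pub fn label_confidence(signals: &[Signal]) -> f32 {
    let total: f32 = signals.iter().map(|s| s.weight).sum();
    if total <= 0.0 {
        return 0.0;
    }
    let sum: f32 = signals
        .iter()
        .map(|s| s.weight * s.score.clamp(0.0, 1.0))
        .sum();
    sum / total
}

/// Picks the highest-confidence label at or above 0.50, else falls back to "body".
pub fn pick_zone<'a>(candidates: &[(&'a str, f32)]) -> &'a str {
    candidates
        .iter()
        .copied()
        .filter(|(_, conf)| conf.is_finite() && *conf >= 0.50)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map_or("body", |(label, _)| label)
}
```

For example, a footer candidate with a full margin-band score (0.30), full recurrence score (0.25), and no regex match (0.15) gets 0.55 / 0.70, roughly 0.79, which clears the 0.50 threshold.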

Expose the confidence in output so callers can tune their own threshold. A caller building a full-text search index may accept all blocks regardless of zone. A caller building a clean prose renderer may filter to zone == body && zone_confidence >= 0.70.

10. Output Representation

Every block in the JSON output carries:

{
  "text": "...",
  "zone": "body",
  "zone_confidence": 0.82,
  "bbox": { "x0": 72.0, "y0": 144.0, "x1": 540.0, "y1": 160.5 },
  "page": 3
}

Valid zone values:

Value	Description
`body`	Main prose content
`heading`	Section or chapter heading
`header`	Running page header
`footer`	Running page footer
`footnote`	Footnote body text
`caption`	Figure, table, or scheme caption
`sidebar`	Sidebar or pull quote
`marginalia`	Margin annotation or note
`page_number`	Standalone page number

The zone field is always present. zone_confidence is always a finite f32. Callers that want unfiltered text iterate all blocks; callers that want clean prose filter to zone == "body" or zone in ["body", "heading"]. Zone information is never used to modify text content — it is metadata only.
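On the consuming side, the filter described above could be sketched as follows. Zone, Block, and clean_prose are hypothetical names that mirror the JSON fields; deserialization is omitted, and the bbox field is left out for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Zone {
    Body,
    Heading,
    Header,
    Footer,
    Footnote,
    Caption,
    Sidebar,
    Marginalia,
    PageNumber,
}

impl Zone {
    /// String form used in the JSON `zone` field.
    pub fn as_str(self) -> &'static str {
        match self {
            Zone::Body => "body",
            Zone::Heading => "heading",
            Zone::Header => "header",
            Zone::Footer => "footer",
            Zone::Footnote => "footnote",
            Zone::Caption => "caption",
            Zone::Sidebar => "sidebar",
            Zone::Marginalia => "marginalia",
            Zone::PageNumber => "page_number",
        }
    }
}

/// Hypothetical in-memory form of one output block.
pub struct Block {
    pub text: String,
    pub zone: Zone,
    pub zone_confidence: f32,
    pub page: u32,
}

/// Joins body blocks at or above `min_confidence`, leaving their text untouched.
pub fn clean_prose(blocks: &[Block], min_confidence: f32) -> String {
    blocks
        .iter()
        .filter(|b| b.zone == Zone::Body && b.zone_confidence >= min_confidence)
        .map(|b| b.text.as_str())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Because the filter only selects blocks and never rewrites them, a caller that wants headings as well can widen the predicate without touching the extraction step.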
