- table-structure-reconstruction: line detection, gap analysis, Hough transform, graph-based cell reconstruction, merged cells, multi-page tables - mathematical-expression-handling: five encoding cases, OpenType MATH table, symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers - language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi, CJK vertical text, ligature normalization, whatlang/lingua integration - document-classification-and-zone-labeling: margin heuristics, font clustering, cross-page recurrence, footnote/caption/sidebar detection - post-extraction-normalization: hyphen handling, ligature expansion, paragraph reconstruction, Unicode normalization, pipeline ordering - chunking-for-llm-consumption: semantic snapping, heading hierarchy, sliding window overlap, table chunking strategies, token budget, late chunking Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
176 lines
12 KiB
Markdown
176 lines
12 KiB
Markdown
# Document Classification and Zone Labeling
|
||
|
||
## Overview
|
||
|
||
After raw text extraction, each glyph or span has a position, font reference, and character content — but no semantic role. Zone labeling is the process of assigning a role to each text block: `body`, `heading`, `header`, `footer`, `footnote`, `caption`, `sidebar`, `marginalia`, or `page_number`. This pass runs after block assembly (grouping spans into lines and lines into paragraphs) but before reading-order resolution.
|
||
|
||
---
|
||
|
||
## 1. Why Zone Labeling Matters
|
||
|
||
Without zone labeling, extracted text is a raw positional dump. The damage is concrete:
|
||
|
||
- **Running headers interleaved with body paragraphs.** A header reading "Chapter 3: Results" appears between sentences because its y-coordinate places it between two body blocks on the same page.
|
||
- **Page numbers embedded mid-sentence.** A numeric "42" extracted in column order falls between the last word of one paragraph and the first word of the next.
|
||
- **Footnote markers disrupting prose flow.** Superscript `³` extracted inline pulls the following footnote text — located at the bottom of the page — into the paragraph body.
|
||
- **Sidebar text inserted at random positions.** A pull quote in the right margin, if read left-to-right by x-coordinate, bisects the main-column paragraph it is adjacent to.
|
||
|
||
The cost compounds downstream. Language models, search indexers, and screen readers all treat the extracted string as coherent prose. Injected non-body content corrupts sentence boundary detection, keyword density, and logical paragraph structure. Zone labeling is the gate that filters what reaches the output string.
|
||
|
||
---
|
||
|
||
## 2. Page Margin Heuristics
|
||
|
||
The simplest zone signals are geometric: headers and footers live at fixed vertical positions near the page boundary.
|
||
|
||
**Threshold definition.** Define `header_zone_max_y` as the y-coordinate below which a block must start to be considered a candidate header (measuring from the top of the page). A reliable default is 10–15% of page height. Similarly, `footer_zone_min_y` is the y-coordinate above which a block must end to be a footer candidate, measured from the bottom — again, 10–15%.
|
||
|
||
```
|
||
header_zone_max_y = page_height * 0.12
|
||
footer_zone_min_y = page_height - page_height * 0.12
|
||
```
|
||
|
||
Blocks whose bounding box falls entirely within these bands are header/footer candidates, not yet confirmed.
|
||
|
||
**Page number pattern detection.** Within the footer (or header) band, apply regex against the extracted text:
|
||
|
||
```
|
||
^\d+$ // bare number: 42
|
||
^Page\s+\d+(\s+of\s+\d+)?$ // Page 3 of 12
|
||
^[ivxlcdmIVXLCDM]+$ // roman numerals: xiv
|
||
^[-–]\s*\d+\s*[-–]$ // em-dash framing: — 42 —
|
||
```
|
||
|
||
A block matching any of these within the margin band receives label `page_number` at high confidence (≥ 0.90).
|
||
|
||
**Stability filter.** A single page cannot confirm a header or footer — any text can appear near the top by chance. Apply the stability filter (described in section 5) before committing the label.
|
||
|
||
---
|
||
|
||
## 3. Font-Based Classification
|
||
|
||
Font metadata distinguishes heading hierarchy from body text, and body from ancillary text like captions and footnotes.
|
||
|
||
**Build a font inventory.** On first pass over the document, collect `(font_name, font_size, is_bold, is_italic)` tuples from every span. Normalize font sizes to points. Cluster sizes into bins using a simple histogram with a 0.5pt merge tolerance to collapse rounding artifacts. The bin with the highest total character count is the **dominant body size** — call it `body_pt`.
|
||
|
||
**Heading detection.** A block where all spans share a font size `> body_pt * 1.25` and `is_bold == true` is a strong heading candidate. Multiple heading levels are recoverable by ordered font-size clustering: the largest non-body size is `h1`, the next is `h2`, and so on, up to three levels before the signal becomes unreliable.
|
||
|
||
**Caption and footnote detection.** Blocks where the font size is `< body_pt * 0.85` are small-text candidates. Combine with position (bottom-of-page for footnotes, adjacent to a whitespace gap for captions) and font style (often italic for captions) to disambiguate.
|
||
|
||
**Dominance rule.** If a block mixes body-sized and heading-sized spans (e.g., a sentence with a bold lead word), classify by the dominant span — the one covering more than 60% of character width.
|
||
|
||
---
|
||
|
||
## 4. Positional Heuristics
|
||
|
||
**Centred text as heading signal.** Compute the horizontal midpoint of a block's bounding box. If it falls within 5% of page width from the page centre, and the block is a single line, raise the heading confidence. Centring alone is not sufficient — font size must also exceed body size.
|
||
|
||
**Indentation patterns.** Measure the left-edge x-coordinate of the first line vs. subsequent lines in a paragraph block. Standard body paragraphs have a consistent left margin with optional first-line indent (positive or negative). A hanging indent — where the first line starts further left than continuation lines — is a strong footnote or bibliography signal. A large positive indent on every line suggests a block quote.
|
||
|
||
**Column boundary detection.** Collect the left-edge x-coordinates of all body-candidate blocks on a page. Cluster them; two tight clusters indicate a two-column layout, defining column boundaries. Any block whose x-origin falls outside both columns and within the page margin is a marginalia candidate.
|
||
|
||
**Outer margin detection.** For a single-column document, define the body column as the region bounded by the median left and right x-extents of body blocks (±5% tolerance). Text that starts to the right of `body_right + page_width * 0.05` is marginalia.
|
||
|
||
---
|
||
|
||
## 5. Cross-Page Consistency
|
||
|
||
A text block that recurs at the same position across multiple pages is definitionally a running element — header or footer — regardless of whether it triggered the margin-band heuristic.
|
||
|
||
**Position fingerprint.** For each page, record `(y_normalized, height, width)` for every candidate block, where `y_normalized = block_top / page_height`. Two blocks across pages are positionally equivalent if their `y_normalized` values differ by less than 0.01 and their widths are within 5%.
|
||
|
||
**Sliding window.** Process pages in groups of five (or fewer at document boundaries). A block position that appears in at least four of five consecutive pages is a running element. Assign `header` or `footer` based on whether it sits in the top or bottom margin band; if it falls outside both bands but recurs consistently, assign the closer one.
|
||
|
||
**Recto/verso alternation.** Academic and book PDFs often alternate left-aligned headers on even pages (verso) with right-aligned headers on odd pages (recto). Detect this by checking whether positionally equivalent blocks alternate between page-parity groups. When alternation is confirmed, apply the header label to both positions. Text content may differ (e.g., chapter title vs. section title); only position need match.
|
||
|
||
**Recurring text fragments.** Normalize extracted text (trim whitespace, collapse runs) and hash each candidate block. A hash appearing on more than 50% of pages is a strong running-element signal even if position varies slightly (e.g., centred headers on different-width pages).
|
||
|
||
---
|
||
|
||
## 6. Footnote Detection
|
||
|
||
Footnote detection requires matching two artifacts: the inline marker and the footnote body.
|
||
|
||
**Inline markers.** During span assembly, track spans where `font_size < body_pt * 0.75` and the span baseline is raised above the line baseline by more than 2pt. These are superscript candidates. Extract the character: if it is a digit, letter, or standard footnote symbol (∗ † ‡ § ¶), record it as a marker with its position.
|
||
|
||
**Footnote body location.** On the same page, look for blocks in the lower region (below `page_height * 0.65`) that begin with a matching marker character, optionally followed by a period or space. The block's font size is typically `< body_pt * 0.85`.
|
||
|
||
**Separator rule.** Many PDF producers render a short horizontal rule (a thin rectangle path, typically 30–50% of column width, 0.5–1pt thick) immediately above the footnote area. When such a path is detected, all text blocks below it and above the footer band are footnote candidates, raising their confidence.
|
||
|
||
**Overflow footnotes.** A footnote body that begins on page N and continues on page N+1 has no marker on page N+1. Detect this by tracking whether the last footnote block on a page ends mid-sentence (no terminal punctuation followed by whitespace). If so, the first small-font block at the bottom of the next page inherits the footnote label.
|
||
|
||
---
|
||
|
||
## 7. Caption Detection
|
||
|
||
**Proximity to image placeholders.** PDF image XObjects (type `/XObject`, subtype `/Image`) and form XObjects used as figures occupy rectangular regions on the page. After extracting all XObject bounding boxes, identify text blocks whose bounding box top is within `body_pt * 3` of an XObject's bottom (for below-figure captions) or whose bottom is within the same threshold of an XObject's top (for above-figure captions).
|
||
|
||
**Prefix pattern.** Apply regex to the block's first token:
|
||
|
||
```
|
||
^(Figure|Fig\.|Table|Tbl\.|Scheme|Plate|Exhibit|Supplementary\s+Figure)\s+\d+
|
||
```
|
||
|
||
A prefix match raises caption confidence to ≥ 0.85 independent of position.
|
||
|
||
**Short block heuristic.** Captions are rarely longer than three lines. If a block adjacent to an image XObject contains more than three lines, treat only the first three as caption and reclassify the remainder as body.
|
||
|
||
---
|
||
|
||
## 8. Sidebar and Pull Quote Detection
|
||
|
||
**Narrow column detection.** A sidebar occupies a column significantly narrower than the main body column. If body column width is `W`, a block whose bounding box width is `< W * 0.45` and whose x-extent overlaps the body column by at least 10% is a sidebar candidate.
|
||
|
||
**Font differentiation.** Pull quotes are typically set in a larger or italic typeface to distinguish them visually from body text. A block that is bold or italic, centred or right-aligned, and horizontally overlaps the main column is a pull quote candidate. Assign label `sidebar` for narrow-column placement, or remain `body` with reduced confidence if the signal is ambiguous.
|
||
|
||
**Bounding box overlap logic.** Compute the intersection-over-union (IoU) of the candidate block and the main body column rectangle. IoU above 0.3 but below 0.9 indicates a partial overlap consistent with sidebar placement.
|
||
|
||
---
|
||
|
||
## 9. Confidence and Fallback
|
||
|
||
Each block receives a `zone_confidence: f32` in [0.0, 1.0] computed from a weighted sum of signals:
|
||
|
||
| Signal | Weight |
|
||
|---|---|
|
||
| Margin band (geometric) | 0.30 |
|
||
| Font size deviation from body | 0.25 |
|
||
| Cross-page recurrence | 0.25 |
|
||
| Regex / prefix pattern match | 0.15 |
|
||
| Positional heuristic (indent, centre) | 0.05 |
|
||
|
||
Weights are normalized per label. When no label achieves confidence ≥ 0.50, default to `body`. This is the safe fallback: false negatives (unlabeled headers/footers passed through as body) are preferable to false positives (body text discarded as a header).
|
||
|
||
Expose the confidence in output so callers can tune their own threshold. A caller building a full-text search index may accept all blocks regardless of zone. A caller building a clean prose renderer may filter to `zone == body && zone_confidence >= 0.70`.
|
||
|
||
---
|
||
|
||
## 10. Output Representation
|
||
|
||
Every block in the JSON output carries:
|
||
|
||
```json
|
||
{
|
||
"text": "...",
|
||
"zone": "body",
|
||
"zone_confidence": 0.82,
|
||
"bbox": { "x0": 72.0, "y0": 144.0, "x1": 540.0, "y1": 160.5 },
|
||
"page": 3
|
||
}
|
||
```
|
||
|
||
Valid `zone` values:
|
||
|
||
| Value | Description |
|
||
|---|---|
|
||
| `body` | Main prose content |
|
||
| `heading` | Section or chapter heading |
|
||
| `header` | Running page header |
|
||
| `footer` | Running page footer |
|
||
| `footnote` | Footnote body text |
|
||
| `caption` | Figure, table, or scheme caption |
|
||
| `sidebar` | Sidebar or pull quote |
|
||
| `marginalia` | Margin annotation or note |
|
||
| `page_number` | Standalone page number |
|
||
|
||
The `zone` field is always present. `zone_confidence` is always a finite `f32`. Callers that want unfiltered text iterate all blocks; callers that want clean prose filter to `zone == "body"` or `zone in ["body", "heading"]`. Zone information is never used to modify `text` content — it is metadata only.
|