pdftract/docs/research/document-classification-and-zone-labeling.md

# Document Classification and Zone Labeling

## Overview

After raw text extraction, each glyph or span has a position, font reference, and character content — but no semantic role. Zone labeling is the process of assigning a role to each text block: `body`, `heading`, `header`, `footer`, `footnote`, `caption`, `sidebar`, `marginalia`, or `page_number`. This pass runs after block assembly (grouping spans into lines and lines into paragraphs) but before reading-order resolution.

---

## 1. Why Zone Labeling Matters

Without zone labeling, extracted text is a raw positional dump. The damage is concrete:

- **Running headers interleaved with body paragraphs.** A header reading "Chapter 3: Results" appears between sentences because its y-coordinate places it between two body blocks on the same page.
- **Page numbers embedded mid-sentence.** A numeric "42" extracted in column order falls between the last word of one paragraph and the first word of the next.
- **Footnote markers disrupting prose flow.** Superscript `³` extracted inline pulls the following footnote text — located at the bottom of the page — into the paragraph body.
- **Sidebar text inserted at random positions.** A pull quote in the right margin, if read left-to-right by x-coordinate, bisects the main-column paragraph it is adjacent to.

The cost compounds downstream. Language models, search indexers, and screen readers all treat the extracted string as coherent prose. Injected non-body content corrupts sentence boundary detection, keyword density, and logical paragraph structure. Zone labeling is the gate that filters what reaches the output string.

---

## 2. Page Margin Heuristics

The simplest zone signals are geometric: headers and footers live at fixed vertical positions near the page boundary.

**Threshold definition.** Define `header_zone_max_y` as the y-coordinate below which a block must start to be considered a candidate header (measuring from the top of the page). A reliable default is 10–15% of page height. Similarly, `footer_zone_min_y` is the y-coordinate above which a block must end to be a footer candidate, measured from the bottom — again, 10–15%.

```
header_zone_max_y = page_height * 0.12
footer_zone_min_y = page_height - page_height * 0.12
```

Blocks whose bounding box falls entirely within these bands are header/footer candidates, not yet confirmed.

**Page number pattern detection.** Within the footer (or header) band, apply regex against the extracted text:

```
^\d+$                          // bare number: 42
^Page\s+\d+(\s+of\s+\d+)?$    // Page 3 of 12
^[ivxlcdmIVXLCDM]+$            // roman numerals: xiv
^[-–]\s*\d+\s*[-–]$           // em-dash framing: — 42 —
```

A block matching any of these within the margin band receives label `page_number` at high confidence (≥ 0.90).

**Stability filter.** A single page cannot confirm a header or footer — any text can appear near the top by chance. Apply the stability filter (described in section 5) before committing the label.

---

## 3. Font-Based Classification

Font metadata distinguishes heading hierarchy from body text, and body from ancillary text like captions and footnotes.

**Build a font inventory.** On first pass over the document, collect `(font_name, font_size, is_bold, is_italic)` tuples from every span. Normalize font sizes to points. Cluster sizes into bins using a simple histogram with a 0.5pt merge tolerance to collapse rounding artifacts. The bin with the highest total character count is the **dominant body size** — call it `body_pt`.

**Heading detection.** A block where all spans share a font size `> body_pt * 1.25` and `is_bold == true` is a strong heading candidate. Multiple heading levels are recoverable by ordered font-size clustering: the largest non-body size is `h1`, the next is `h2`, and so on, up to three levels before the signal becomes unreliable.

**Caption and footnote detection.** Blocks where the font size is `< body_pt * 0.85` are small-text candidates. Combine with position (bottom-of-page for footnotes, adjacent to a whitespace gap for captions) and font style (often italic for captions) to disambiguate.

**Dominance rule.** If a block mixes body-sized and heading-sized spans (e.g., a sentence with a bold lead word), classify by the dominant span — the one covering more than 60% of character width.

---

## 4. Positional Heuristics

**Centred text as heading signal.** Compute the horizontal midpoint of a block's bounding box. If it falls within 5% of page width from the page centre, and the block is a single line, raise the heading confidence. Centring alone is not sufficient — font size must also exceed body size.

**Indentation patterns.** Measure the left-edge x-coordinate of the first line vs. subsequent lines in a paragraph block. Standard body paragraphs have a consistent left margin with optional first-line indent (positive or negative). A hanging indent — where the first line starts further left than continuation lines — is a strong footnote or bibliography signal. A large positive indent on every line suggests a block quote.

**Column boundary detection.** Collect the left-edge x-coordinates of all body-candidate blocks on a page. Cluster them; two tight clusters indicate a two-column layout, defining column boundaries. Any block whose x-origin falls outside both columns and within the page margin is a marginalia candidate.

**Outer margin detection.** For a single-column document, define the body column as the region bounded by the median left and right x-extents of body blocks (±5% tolerance). Text that starts to the right of `body_right + page_width * 0.05` is marginalia.

---

## 5. Cross-Page Consistency

A text block that recurs at the same position across multiple pages is definitionally a running element — header or footer — regardless of whether it triggered the margin-band heuristic.

**Position fingerprint.** For each page, record `(y_normalized, height, width)` for every candidate block, where `y_normalized = block_top / page_height`. Two blocks across pages are positionally equivalent if their `y_normalized` values differ by less than 0.01 and their widths are within 5%.

**Sliding window.** Process pages in groups of five (or fewer at document boundaries). A block position that appears in at least four of five consecutive pages is a running element. Assign `header` or `footer` based on whether it sits in the top or bottom margin band; if it falls outside both bands but recurs consistently, assign the closer one.

**Recto/verso alternation.** Academic and book PDFs often alternate left-aligned headers on even pages (verso) with right-aligned headers on odd pages (recto). Detect this by checking whether positionally equivalent blocks alternate between page-parity groups. When alternation is confirmed, apply the header label to both positions. Text content may differ (e.g., chapter title vs. section title); only position need match.

**Recurring text fragments.** Normalize extracted text (trim whitespace, collapse runs) and hash each candidate block. A hash appearing on more than 50% of pages is a strong running-element signal even if position varies slightly (e.g., centred headers on different-width pages).

---

## 6. Footnote Detection

Footnote detection requires matching two artifacts: the inline marker and the footnote body.

**Inline markers.** During span assembly, track spans where `font_size < body_pt * 0.75` and the span baseline is raised above the line baseline by more than 2pt. These are superscript candidates. Extract the character: if it is a digit, letter, or standard footnote symbol (∗ † ‡ § ¶), record it as a marker with its position.

**Footnote body location.** On the same page, look for blocks in the lower region (below `page_height * 0.65`) that begin with a matching marker character, optionally followed by a period or space. The block's font size is typically `< body_pt * 0.85`.

**Separator rule.** Many PDF producers render a short horizontal rule (a thin rectangle path, typically 30–50% of column width, 0.5–1pt thick) immediately above the footnote area. When such a path is detected, all text blocks below it and above the footer band are footnote candidates, raising their confidence.

**Overflow footnotes.** A footnote body that begins on page N and continues on page N+1 has no marker on page N+1. Detect this by tracking whether the last footnote block on a page ends mid-sentence (no terminal punctuation followed by whitespace). If so, the first small-font block at the bottom of the next page inherits the footnote label.

---

## 7. Caption Detection

**Proximity to image placeholders.** PDF image XObjects (type `/XObject`, subtype `/Image`) and form XObjects used as figures occupy rectangular regions on the page. After extracting all XObject bounding boxes, identify text blocks whose bounding box top is within `body_pt * 3` of an XObject's bottom (for below-figure captions) or whose bottom is within the same threshold of an XObject's top (for above-figure captions).

**Prefix pattern.** Apply regex to the block's first token:

```
^(Figure|Fig\.|Table|Tbl\.|Scheme|Plate|Exhibit|Supplementary\s+Figure)\s+\d+
```

A prefix match raises caption confidence to ≥ 0.85 independent of position.

**Short block heuristic.** Captions are rarely longer than three lines. If a block adjacent to an image XObject contains more than three lines, treat only the first three as caption and reclassify the remainder as body.

---

## 8. Sidebar and Pull Quote Detection

**Narrow column detection.** A sidebar occupies a column significantly narrower than the main body column. If body column width is `W`, a block whose bounding box width is `< W * 0.45` and whose x-extent overlaps the body column by at least 10% is a sidebar candidate.

**Font differentiation.** Pull quotes are typically set in a larger or italic typeface to distinguish them visually from body text. A block that is bold or italic, centred or right-aligned, and horizontally overlaps the main column is a pull quote candidate. Assign label `sidebar` for narrow-column placement, or remain `body` with reduced confidence if the signal is ambiguous.

**Bounding box overlap logic.** Compute the intersection-over-union (IoU) of the candidate block and the main body column rectangle. IoU above 0.3 but below 0.9 indicates a partial overlap consistent with sidebar placement.

---

## 9. Confidence and Fallback

Each block receives a `zone_confidence: f32` in [0.0, 1.0] computed from a weighted sum of signals:

| Signal | Weight |
|---|---|
| Margin band (geometric) | 0.30 |
| Font size deviation from body | 0.25 |
| Cross-page recurrence | 0.25 |
| Regex / prefix pattern match | 0.15 |
| Positional heuristic (indent, centre) | 0.05 |

Weights are normalized per label. When no label achieves confidence ≥ 0.50, default to `body`. This is the safe fallback: false negatives (unlabeled headers/footers passed through as body) are preferable to false positives (body text discarded as a header).

Expose the confidence in output so callers can tune their own threshold. A caller building a full-text search index may accept all blocks regardless of zone. A caller building a clean prose renderer may filter to `zone == body && zone_confidence >= 0.70`.

---

## 10. Output Representation

Every block in the JSON output carries:

```json
{
  "text": "...",
  "zone": "body",
  "zone_confidence": 0.82,
  "bbox": { "x0": 72.0, "y0": 144.0, "x1": 540.0, "y1": 160.5 },
  "page": 3
}
```

Valid `zone` values:

| Value | Description |
|---|---|
| `body` | Main prose content |
| `heading` | Section or chapter heading |
| `header` | Running page header |
| `footer` | Running page footer |
| `footnote` | Footnote body text |
| `caption` | Figure, table, or scheme caption |
| `sidebar` | Sidebar or pull quote |
| `marginalia` | Margin annotation or note |
| `page_number` | Standalone page number |

The `zone` field is always present. `zone_confidence` is always a finite `f32`. Callers that want unfiltered text iterate all blocks; callers that want clean prose filter to `zone == "body"` or `zone in ["body", "heading"]`. Zone information is never used to modify `text` content — it is metadata only.