pdftract/docs/research/latex-and-scientific-pdf-patterns.md

# LaTeX and Scientific PDF Patterns

Scientific papers represent a large and important class of PDFs requiring text extraction. The vast majority are generated by LaTeX toolchains, which produce PDFs with structural and encoding characteristics that differ fundamentally from word-processor output. This document covers those characteristics in depth, with the goal of informing correct handling in the `pdftract` extraction pipeline.

---

## 1. LaTeX PDF Generators and Their Characteristics

### pdfLaTeX

pdfLaTeX is the dominant engine for academic papers. It reads `.tex` source and emits PDF directly. Key characteristics for extraction:

- Embeds **Type 1 fonts** (PostScript outlines), typically the Computer Modern family.
- Uses **OT1** (default), **T1**, **OMS**, **OML**, and **OMX** encodings — none of which are Unicode or ASCII.
- **Subsets fonts** with a six-letter uppercase prefix: `ABCDEF+CMR10`. The prefix is arbitrary; strip it when matching font names.
- Produces **untagged PDFs** by default — no logical structure tree, no reading-order metadata.
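- `/ToUnicode` CMaps are not guaranteed. pdfTeX writes them only when the document or format enables it (for example via the `cmap` package, or `\pdfgentounicode=1` together with `glyphtounicode.tex`). They are frequently absent or incomplete for math fonts and for older OT1-only documents.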
- `/Producer` in the `/Info` dictionary contains `pdfTeX-1.x.x` or similar.

### XeLaTeX and LuaLaTeX

Both engines accept UTF-8 source directly and produce Unicode output.

- Embed **OpenType** or **TrueType** fonts (including system fonts via `fontspec`).
- Include well-formed `/ToUnicode` CMaps covering the full Unicode range used.
- Math via the `unicode-math` package maps to the Mathematical Alphanumeric Symbols Unicode block (U+1D400–U+1D7FF), which is far more extractable than OML/OMS.
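- `/Producer` contains `LuaTeX` for LuaLaTeX. XeLaTeX writes its PDF through the `xdvipdfmx` driver, so `/Producer` typically contains `xdvipdfmx` and `XeTeX` appears in `/Creator`.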

### dvips → Ghostscript (legacy)

The old `latex` → `dvips` → `ps2pdf`/`Ghostscript` pipeline:

- Produces **Type 1** or **Type 3** fonts.
- `/ToUnicode` CMaps are often absent entirely — Ghostscript does not synthesize them.
- `/Producer` contains `GPL Ghostscript` or `Acrobat Distiller` (when Distiller was used on the PostScript output).
- Extraction quality is substantially lower; glyph-name fallback (see `glyph-recognition-and-unicode-recovery.md`) is the primary recovery path.

**Detection heuristic:** Inspect `/Info` → `/Producer` and `/Creator`. Match `pdfTeX`, `LuaTeX`, `XeTeX` or `xdvipdfmx`, or `Ghostscript` to select the appropriate decoding path.
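
The heuristic can be sketched as an ordered list of patterns over the two `/Info` strings. This is a minimal illustration, not pdftract's actual API: the function name and the returned labels are hypothetical, and matching `xdvipdfmx` and `Distiller` is an assumption added on top of the strings listed above.

```python
import re
from typing import Optional

# Ordered (pattern, label) pairs; the first match wins.
# The labels are illustrative names chosen for this sketch.
_PRODUCER_RULES = [
    (re.compile(r"pdfTeX", re.I), "pdflatex"),
    (re.compile(r"LuaTeX", re.I), "lualatex"),
    (re.compile(r"XeTeX|xdvipdfmx", re.I), "xelatex"),
    (re.compile(r"Ghostscript|Distiller", re.I), "postscript-distilled"),
]


def classify_producer(producer: Optional[str],
                      creator: Optional[str] = None) -> str:
    """Return a coarse engine label from the /Producer and /Creator strings."""
    haystack = " ".join(s for s in (producer, creator) if s)
    for pattern, label in _PRODUCER_RULES:
        if pattern.search(haystack):
            return label
    return "unknown"
```

A Ghostscript or Distiller match only says the file went through PostScript; it does not prove a LaTeX origin, so font-name checks are still needed downstream.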

---

## 2. OT1 Encoding and Computer Modern Fonts

OT1 is a 128-slot encoding invented for TeX. It is **not** ASCII-compatible despite overlapping glyph shapes. Without a `/ToUnicode` CMap, byte values from the content stream must be passed through the OT1-to-Unicode table.

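**Detection:** Font name (after stripping the 6-letter subset prefix, and compared case-insensitively because embedded names are usually uppercase, e.g. `CMR10`) starts with `cmr`, `cmti`, `cmbx`, `cmss`, or `cmtt` and the encoding is not explicitly T1. Note that `cmtt` shares the letter and digit slots but assigns several punctuation slots differently (it has real `<`, `>`, `{`, `}`, and `\` glyphs), so treat its non-alphanumeric positions with care.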

### OT1 → Unicode Mapping (all 128 positions)

| Hex | Unicode | Glyph / Name |
|-----|---------|--------------|
| 00 | U+0393 | Γ (Gamma) |
| 01 | U+0394 | Δ (Delta) |
| 02 | U+0398 | Θ (Theta) |
| 03 | U+039B | Λ (Lambda) |
| 04 | U+039E | Ξ (Xi) |
| 05 | U+03A0 | Π (Pi) |
| 06 | U+03A3 | Σ (Sigma) |
| 07 | U+03A5 | Υ (Upsilon) |
| 08 | U+03A6 | Φ (Phi) |
| 09 | U+03A8 | Ψ (Psi) |
| 0A | U+03A9 | Ω (Omega) |
| 0B | U+FB00 | ff ligature |
| 0C | U+FB01 | fi ligature |
| 0D | U+FB02 | fl ligature |
| 0E | U+FB03 | ffi ligature |
| 0F | U+FB04 | ffl ligature |
| 10 | U+0131 | ı (dotless i) |
| 11 | U+0237 | ȷ (dotless j) |
| 12 | U+0060 | ` (grave) |
| 13 | U+00B4 | ´ (acute) |
| 14 | U+02C7 | ˇ (caron) |
| 15 | U+02D8 | ˘ (breve) |
| 16 | U+00AF | ¯ (macron) |
| 17 | U+02DA | ˚ (ring) |
| 18 | U+00B8 | ¸ (cedilla) |
| 19 | U+00DF | ß (German sharp s) |
| 1A | U+00E6 | æ |
| 1B | U+0153 | œ |
| 1C | U+00F8 | ø |
| 1D | U+00C6 | Æ |
| 1E | U+0152 | Œ |
| 1F | U+00D8 | Ø |
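| 20 | U+0337 | stroke overlay used to build ł/Ł (nearest Unicode: combining short solidus overlay); OT1 has no space glyph |
| 21–2F | U+0021–U+002F | !"#$%&'()*+,-./ (ASCII, except 0x22 and 0x27 below) |
| 22 | U+201D | ” (right double quotation mark, NOT ASCII quote) |
| 27 | U+2019 | ’ (right single quotation mark) |
| 3C | U+00A1 | ¡ (inverted exclamation) |
| 3D | U+003D | = |
| 3E | U+00BF | ¿ (inverted question) |
| 5C | U+201C | “ (left double quotation mark, NOT backslash) |
| 5E | U+02C6 | ˆ (circumflex accent) |
| 5F | U+02D9 | ˙ (dot accent) |
| 60 | U+2018 | ‘ (left single quotation mark, NOT backtick) |
| 7B | U+2013 | – (en dash) |
| 7C | U+2014 | — (em dash) |
| 7D | U+02DD | ˝ (double acute / Hungarian umlaut) |
| 7E | U+02DC | ˜ (tilde accent) |
| 7F | U+00A8 | ¨ (diaeresis) |

Positions 0x30–0x39 are digits 0–9 (ASCII). Positions 0x41–0x5A and 0x61–0x7A are uppercase and lowercase Latin letters (ASCII-identical), as are 0x3A, 0x3B, 0x3F, 0x40, 0x5B, and 0x5D. All other positions follow the table above. Because TeX does not emit word spaces as glyphs, the extractor must infer spaces from horizontal gaps between text runs.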
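
A lookup table like this can be built compactly in code. The sketch below is one possible shape, with hypothetical names (`build_ot1_table`, `decode_ot1`). Slot values follow the Computer Modern roman layout, for example 0x5C as the left double quotation mark and 0x7D as the double acute accent. The mapping chosen for 0x20 is an assumption (nearest Unicode character), and typewriter fonts such as `cmtt` would need their own overrides.

```python
from typing import Dict


def build_ot1_table() -> Dict[int, str]:
    """Map OT1 byte values (0x00-0x7F) to Unicode strings."""
    # Start from ASCII for the printable range, then override the slots
    # where OT1 differs from ASCII.
    table = {code: chr(code) for code in range(0x21, 0x7F)}
    table.update(enumerate("ΓΔΘΛΞΠΣΥΦΨΩ"))  # 0x00-0x0A uppercase Greek
    table.update(enumerate("\uFB00\uFB01\uFB02\uFB03\uFB04", start=0x0B))
    table.update(enumerate(
        "\u0131\u0237\u0060\u00B4\u02C7\u02D8\u00AF\u02DA"   # 0x10-0x17
        "\u00B8\u00DF\u00E6\u0153\u00F8\u00C6\u0152\u00D8",  # 0x18-0x1F
        start=0x10))
    table.update({
        0x20: "\u0337",  # stroke for l-slash; assumption: nearest Unicode
        0x22: "\u201D", 0x27: "\u2019", 0x3C: "\u00A1", 0x3E: "\u00BF",
        0x5C: "\u201C", 0x5E: "\u02C6", 0x5F: "\u02D9", 0x60: "\u2018",
        0x7B: "\u2013", 0x7C: "\u2014", 0x7D: "\u02DD", 0x7E: "\u02DC",
        0x7F: "\u00A8",
    })
    return table


OT1 = build_ot1_table()


def decode_ot1(data: bytes) -> str:
    """Decode content-stream bytes; codes outside OT1 become U+FFFD."""
    return "".join(OT1.get(byte, "\uFFFD") for byte in data)
```

Ligature codepoints are kept as-is here so that expansion happens in one place, in the normalization pass.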

---

## 3. T1 Encoding (Cork Encoding)

T1, also called Cork encoding, is a 256-slot encoding designed for European languages. It is used by the EC (European Computer) font family: `ecrm` (roman), `ecti` (italic), `ecbx` (bold extended), `ectt` (typewriter).

Key characteristics:

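- Positions **0x00–0x1F** contain standalone accents (0x00 = U+0060 grave, 0x01 = U+00B4 acute, 0x02 = U+02C6 circumflex, …), quotation marks and guillemets, the en and em dashes (0x15, 0x16), dotless ı and ȷ, and the f-ligatures (0x1B–0x1F).
- Positions **0x80–0xFF** hold the precomposed accented letters: 0x80–0xBF cover much of Latin Extended-A (Ă, Ą, Ć, …), and 0xC0–0xFF largely mirror ISO 8859-1 (0xE1 = U+00E1 á, 0xE2 = U+00E2 â), with exceptions such as 0xD7 = Œ, 0xF7 = œ, and 0xFF = ß.
- T1 is an improvement over OT1 for multilingual text but still requires the lookup table; it is not Unicode.

**Detection:** Font name starts with `ec` (after stripping subset prefix; Type 1 builds of these fonts from the cm-super package typically embed under names beginning with `SF`, such as `SFRM1000`), or the font's `/Encoding` array contains names like `/agrave`, `/aacute`, `/acircumflex` at positions 0xE0–0xE2.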

When a `/ToUnicode` CMap is present for a T1-encoded font, prefer it — the CMap is authoritative. Fall back to the T1 table only when the CMap is absent or incomplete.
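
The precedence rule can be sketched per character code as follows. The function name and the `T1_EXCERPT` table are hypothetical; the excerpt holds only four T1 slots for illustration and is not a complete table. Treating an empty or U+FFFD entry as "incomplete" is an assumption of this sketch.

```python
from typing import Dict, Optional

# Illustrative excerpt of a T1 fallback table (not complete).
T1_EXCERPT: Dict[int, str] = {
    0xD7: "\u0152",  # OE ligature
    0xE0: "\u00E0",  # a with grave
    0xE9: "\u00E9",  # e with acute
    0xFF: "\u00DF",  # sharp s
}


def decode_with_fallback(code: int,
                         tounicode: Optional[Dict[int, str]],
                         fallback: Dict[int, str]) -> str:
    """Decode one character code, preferring the /ToUnicode mapping."""
    text = tounicode.get(code) if tounicode else None
    # An empty string or U+FFFD is treated as a hole in the CMap.
    if text and "\uFFFD" not in text:
        return text
    return fallback.get(code, "\uFFFD")
```

Returning U+FFFD for codes missing from both sources keeps the failure visible to later stages instead of silently dropping the glyph.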

---

## 4. Math Font Encodings

pdfLaTeX uses three math-specific encodings for which `/ToUnicode` CMaps are almost never generated.

### OML — Math Italic (cmmi fonts)

Used for variables and Greek letters in math mode. Italic uppercase Greek occupies 0x00–0x0A (U+0393–U+03A9) and lowercase Greek occupies 0x0B–0x27 (U+03B1–U+03C9, plus variant forms such as ϑ and ϖ). Italic Latin letters sit at their ASCII positions (0x41–0x5A, 0x61–0x7A) and correspond to the Mathematical Italic range U+1D434–U+1D467 (italic h is U+210E). The `cmmi` font family (e.g., `cmmi10`, `cmmi7`) uses this encoding.

### OMS — Math Symbols (cmsy fonts)

Contains binary operators, relations, and arrows. Notable mappings: 0x00 = U+2212 (minus sign), 0x01 = U+22C5 (dot operator), 0x02 = U+00D7 (multiplication sign), 0x03 = U+2217 (asterisk operator), 0x04 = U+00F7 (division sign), 0x0E = U+2218 (ring operator), 0x0F = U+2219 (bullet operator), 0x31 = U+221E (infinity). The N-ary product ∏ is not in OMS; it is an OMX glyph. The `cmsy` family uses this encoding.

### OMX — Math Extension (cmex fonts)

Large delimiters and extensible constructions: integral signs, summation, large parentheses. The `cmex` family uses this encoding. Many glyphs have no single Unicode equivalent because they are construction pieces; map to the closest Unicode math symbol or discard if purely decorative.

**Detection:** Strip subset prefix; if font name starts with `cmmi` → OML, `cmsy` → OMS, `cmex` → OMX. Math extraction from pdfLaTeX documents without these tables produces sequences of incorrect characters. Even with the tables, reconstructing a readable math expression requires additional semantic analysis beyond character-level decoding.
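
The font-name checks from Sections 2 to 4 can be combined into one lookup. This is a sketch under stated assumptions: the function names are hypothetical, matching is case-insensitive because embedded names are usually uppercase, and the `ec` rule is narrowed to the four EC families named in Section 3 to avoid matching unrelated fonts.

```python
import re
from typing import Optional

_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# (pattern, encoding) pairs checked in order against the lowercased name.
_ENCODING_RULES = [
    (re.compile(r"^cmmi"), "OML"),
    (re.compile(r"^cmsy"), "OMS"),
    (re.compile(r"^cmex"), "OMX"),
    (re.compile(r"^cm(r|ti|bx|ss|tt)"), "OT1"),
    (re.compile(r"^ec(rm|ti|bx|tt)\d"), "T1"),
]


def strip_subset_prefix(base_font: str) -> str:
    """Drop a leading '/' and the six-letter subset tag, if present."""
    return _SUBSET_PREFIX.sub("", base_font.lstrip("/"))


def guess_encoding(base_font: str) -> Optional[str]:
    """Guess the TeX font encoding from a /BaseFont name, or None."""
    name = strip_subset_prefix(base_font).lower()
    for pattern, encoding in _ENCODING_RULES:
        if pattern.match(name):
            return encoding
    return None
```

A `None` result means the name gives no hint; the caller would then rely on `/ToUnicode` or the glyph-name fallback.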

---

## 5. Ligature Handling

pdfLaTeX automatically substitutes ligature glyphs at the font level. In OT1 encoding:

- 0x0B = ff (U+FB00)
- 0x0C = fi (U+FB01)
- 0x0D = fl (U+FB02)
- 0x0E = ffi (U+FB03)
- 0x0F = ffl (U+FB04)

Some font families also produce `st` (U+FB06) and `ct` ligatures. When a `/ToUnicode` CMap is present, ligatures may map to a single Unicode PUA codepoint, to the multi-character sequence ("fi"), or to the Unicode ligature codepoints (U+FB01/U+FB02).

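The normalization pipeline should always **expand ligatures to their constituent characters**, preferably with an explicit lookup table. Unicode compatibility normalization (NFKC/NFKD) also expands them, but it rewrites unrelated characters as well: NFKD splits accented letters into base plus combining mark, and both forms fold Mathematical Alphanumeric symbols to plain letters. Retaining U+FB01 in extracted text breaks word matching, search, and tokenization. Apply expansion after CMap decode and before any further text processing.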
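
A minimal sketch of the explicit-table route, using `str.translate`. The names `LIGATURES` and `expand_ligatures` are hypothetical. A table is used here instead of NFKD so that accented letters and math alphanumerics pass through unchanged; PUA ligature codepoints are font-specific and not covered.

```python
# Latin ligature codepoints U+FB00-U+FB06 and their expansions.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",  # long s + t ligature, folded to "st"
    "\uFB06": "st",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature codepoints with their constituent letters."""
    return text.translate(_LIGATURE_TABLE)
```

For example, `expand_ligatures("e\uFB03cient")` yields `"efficient"`, while a string such as `"café"` is returned untouched.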

---

## 6. Hyperref Package and PDF Bookmarks

Documents using `\usepackage{hyperref}` gain:

- **PDF outline (bookmarks):** The `/Outlines` dictionary contains a tree of `/Title` entries stored as PDF text strings. When hyperref writes Unicode bookmarks, these are **UTF-16BE** byte strings starting with the BOM bytes `FE FF`; without a BOM, the string is PDFDocEncoding. Check for the BOM and decode accordingly to recover the section title text. These titles are high-quality: they reflect the actual section headings and are not subject to font encoding ambiguity.
- **Named destinations:** Cross-reference links point to `/Dest` named destinations (e.g., `/section.2.3`). These are navigation artifacts, not extraction targets, but they confirm logical document structure.
- **/Info dictionary:** `hyperref` populates `/Title`, `/Author`, `/Subject`, `/Keywords` from `\hypersetup{}` or `\title{}/\author{}` commands. This metadata is reliable and should be extracted as document-level metadata. XMP metadata (if present via `hyperxmp`) duplicates this in the `/Metadata` stream.

Detecting hyperref: check for the presence of `/Outlines` in the document catalog and `/Dest` entries in link annotations.
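
Decoding an outline `/Title` can be sketched as below. The function name is hypothetical. Two assumptions are labeled in the code: the UTF-8 branch covers the BOM form allowed by PDF 2.0, and the no-BOM branch uses Latin-1 as an approximation of PDFDocEncoding, which differs in the 0x18–0x1F and 0x80–0xA0 ranges.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string such as an outline /Title entry."""
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return raw[2:].decode("utf-16-be", errors="replace")
    if raw.startswith(codecs.BOM_UTF8):      # bytes EF BB BF (PDF 2.0)
        return raw[3:].decode("utf-8", errors="replace")
    # Assumption: Latin-1 stands in for PDFDocEncoding here. A full
    # implementation needs the PDFDocEncoding table for 0x18-0x1F
    # and 0x80-0xA0.
    return raw.decode("latin-1")
```

The input is expected to be the string bytes after PDF escape sequences and hex-string syntax have already been resolved by the parser.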

---

## 7. Two-Column Academic Paper Layout

The standard LaTeX two-column layout (via `\twocolumn` or the `multicol` package) divides the text area into two equal columns separated by a narrow gutter (typically 10–20 pt). The page structure is:

- **Full-width zones:** title block, author list, abstract, section headings that span the page, and the bibliography in many journals.
- **Two-column zones:** body text, per-column figures and tables.

Column width formula: `col_width = (page_width - left_margin - right_margin - gutter) / 2`.

**Critical extraction problem:** The PDF content stream emits left column text first, then right column text — this is the layout engine's natural order. A naive top-to-bottom Y-sort interleaves the columns, producing unreadable output.

Correct handling: apply the XY-cut algorithm (documented in `complex-layout-reading-order.md`) with a vertical cut at the column midpoint. The cut should be detected geometrically — find the gap in X-coordinate density across all text runs on the page. Classify each text run as left-column or right-column by its left edge X coordinate relative to the cut point. Emit left column runs in reading order, then right column runs.
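
The geometric cut can be illustrated with a simplified single vertical cut (not the full recursive XY-cut). Everything here is a sketch: `Run`, `find_column_cut`, and `order_two_columns` are hypothetical names, and the thresholds (runs wider than 60% of the page are ignored, the cut must lie within 15% of the page centre, minimum gap 8 pt) are assumed values that would need tuning.

```python
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    x0: float   # left edge
    x1: float   # right edge
    y: float    # baseline in PDF coordinates (origin at bottom-left)
    text: str


def find_column_cut(runs: List[Run], page_width: float,
                    min_gap: float = 8.0) -> Optional[float]:
    """Return the X position of the widest central gap, or None."""
    # Ignore full-width runs (title, abstract) that would bridge the gutter.
    spans = sorted((r.x0, r.x1) for r in runs
                   if r.x1 - r.x0 < 0.6 * page_width)
    best_gap, best_cut, right_edge = 0.0, None, None
    for x0, x1 in spans:
        if right_edge is not None and x0 - right_edge > best_gap:
            mid = (x0 + right_edge) / 2
            if abs(mid - page_width / 2) < 0.15 * page_width:
                best_gap, best_cut = x0 - right_edge, mid
        right_edge = x1 if right_edge is None else max(right_edge, x1)
    return best_cut if best_gap >= min_gap else None


def order_two_columns(runs: List[Run], page_width: float) -> List[Run]:
    """Emit left-column runs top-down, then right-column runs."""
    def top_down(r: Run):
        return (-r.y, r.x0)

    cut = find_column_cut(runs, page_width)
    if cut is None:
        return sorted(runs, key=top_down)
    left = [r for r in runs if r.x0 < cut]
    right = [r for r in runs if r.x0 >= cut]
    return sorted(left, key=top_down) + sorted(right, key=top_down)
```

This sketch assigns full-width runs to the left column, which is only correct when they sit at the top of the page; a complete implementation would first split the page into horizontal zones.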

---

## 8. Figure and Table Placement

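LaTeX floats are placed by the layout engine independently of source order. A figure defined after a paragraph may appear at the top of the same page, above text that precedes it in the source, or be deferred to a later page; floats do not move to an earlier page. The caption is almost always spatially adjacent to the float content.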

**Caption detection heuristics:**
- Begins with `Figure N.`, `Fig. N.`, `Table N.`, or `TABLE N.` (some journals use uppercase).
- Font size is smaller than body text (typically 9 pt vs. 10–11 pt body).
- Horizontally positioned within the float bounding box.
- For figures: caption is below the image content. For tables: caption is above in most styles, below in others.

Use the figure/table number sequence as a consistency check on reading order. If figures appear out of numeric sequence in the extracted output, the column-split or float-grouping logic has a bug.
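
The prefix heuristic and the sequence check can be sketched together. The names are hypothetical, and accepting a colon after the number (as in `Figure 1:`) is an assumption beyond the patterns listed above. Font-size and position checks are left out.

```python
import re
from typing import List, Optional, Tuple

CAPTION_RE = re.compile(r"^(Figure|Fig\.|Table|TABLE)\s+(\d+)[.:]")


def parse_caption(line: str) -> Optional[Tuple[str, int]]:
    """Return (kind, number) if the line starts like a caption."""
    match = CAPTION_RE.match(line.strip())
    if not match:
        return None
    kind = "table" if match.group(1).lower().startswith("tab") else "figure"
    return kind, int(match.group(2))


def out_of_sequence(captions: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return captions numbered lower than an earlier one of the same kind."""
    highest = {}
    suspects = []
    for kind, number in captions:
        if number < highest.get(kind, 0):
            suspects.append((kind, number))
        highest[kind] = max(number, highest.get(kind, 0))
    return suspects
```

A non-empty result from `out_of_sequence` is a prompt to inspect the column split or float grouping for that page, not proof of a bug by itself.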

---

## 9. Bibliography and References

The bibliography zone in LaTeX papers has a predictable structure:

- **Section heading:** "References" or "Bibliography" — detect this as a zone boundary.
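- **Entry labels:** `[1]` style (numeric, e.g. `\bibliographystyle{plain}` or `natbib` with the `numbers` option), `[Knu84]` style (alphabetic labels from `\bibliographystyle{alpha}`), or unlabeled author-year entries (e.g. `natbib` with `\bibliographystyle{plainnat}`). Where a label exists it sits at the left margin, with the reference text indented (hanging indent pattern).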
- **Entry structure:** author list, title (often in quotes or small caps), venue/journal name (often italic), volume/issue, year, pages.

**Extraction issues:**
- Em dashes and en dashes: pdfLaTeX encodes `--` (en dash) as glyph 0x7B in OT1, which decodes correctly to U+2013, but some configurations or older Ghostscript pipelines emit a hyphen-minus (U+002D) instead. Apply post-extraction heuristics: in a bibliography context, a hyphen between two page numbers (`pp. 12-34`) or between two spaces is likely an en dash.
- Author name separators use commas and `and`; do not split on these for purposes of word boundary detection.
- DOI and URL strings in references are often set in a monospace font (`cmtt`) and may contain percent-encoded characters.
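
A sketch of two small helpers for reference entries: splitting off a bracketed label and repairing dashes. The names are hypothetical. The page-range rule is deliberately conservative and fires only after `p.` or `pp.`, an assumption made here so that DOIs, dates, and identifiers containing hyphens are left alone.

```python
import re
from typing import Optional, Tuple

_LABEL = re.compile(r"^\[([^\]\s]+)\]\s*")
_PAGES = re.compile(r"\b(pp?\.\s*)(\d+)-(\d+)\b")
_SPACED_HYPHEN = re.compile(r"(?<=\S) - (?=\S)")


def split_label(entry: str) -> Tuple[Optional[str], str]:
    """Split '[12] Author ...' into ('12', 'Author ...')."""
    match = _LABEL.match(entry)
    if match:
        return match.group(1), entry[match.end():]
    return None, entry


def repair_dashes(text: str) -> str:
    """Turn likely en dashes written as hyphen-minus into U+2013."""
    text = _PAGES.sub(
        lambda m: f"{m.group(1)}{m.group(2)}\u2013{m.group(3)}", text)
    return _SPACED_HYPHEN.sub(" \u2013 ", text)
```

Apply `repair_dashes` only inside the detected bibliography zone; in body text a spaced hyphen may be a minus sign or a list marker.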

---

## 10. arXiv and Preprint PDFs

arXiv is the dominant source of scientific papers requiring extraction. arXiv PDFs are produced by their own TeX installation, accepting author-submitted `.tex` source and compiling with pdfLaTeX or XeLaTeX.

**Common issues:**

1. **arXiv watermark:** arXiv stamps the first page with an identifier of the form `arXiv:XXXX.XXXXXvN [cs.XX] 1 Jan 2024`, set as rotated text running along the left margin. This is a separate text layer added by arXiv's post-processing. Detect by pattern match: a rotated text run matching `^arXiv:\d{4}\.\d{4,5}` in the left margin of page 1 (submissions from before April 2007 use identifiers of the form `arXiv:hep-th/9901001`). Apply the watermark suppression pipeline (see `watermark-and-background-separation.md`) to exclude it from extracted body text.

2. **Page numbers in footer:** Standard footer text at a consistent Y position near the bottom margin, typically a single integer or `N of M`. Exclude from body extraction.

3. **Custom packages:** Some arXiv submissions use institutional or journal-specific TeX packages that define custom font encodings or use PUA (Private Use Area) codepoints for special symbols. These are rare but produce unmappable codepoints; flag them for glyph-name fallback.

4. **Version stamps:** The stamp carries a version suffix (`v1`, `v2`, …) directly after the identifier, so the same pattern-match heuristic covers revised submissions.

5. **Two-column journals:** Many arXiv papers target double-column journal formats. Apply the two-column splitting logic described in Section 7.
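
The stamp pattern from item 1 can be extended into a small parser. This is a sketch with hypothetical names. Handling of old-style identifiers (such as `hep-th/9901001`), the version suffix, and the bracketed category are assumptions layered on the basic `^arXiv:\d{4}\.\d{4,5}` match; position and rotation checks are left to the caller.

```python
import re
from typing import Dict, Optional

ARXIV_STAMP = re.compile(
    r"^arXiv:(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})"
    r"(?P<version>v\d+)?"
    r"(?:\s+\[(?P<category>[\w.\-]+)\])?"
)


def parse_arxiv_stamp(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Return id/version/category if the run looks like an arXiv stamp."""
    match = ARXIV_STAMP.match(text.strip())
    return match.groupdict() if match else None
```

The parsed identifier is also useful as document-level metadata, so it is worth recording before the stamp is suppressed from body text.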

**Producer detection for arXiv:** arXiv's TeX Live installation produces `/Producer` values like `pdfTeX-1.40.x` for pdfLaTeX submissions; XeLaTeX output typically reports `xdvipdfmx` in `/Producer` and `XeTeX` in `/Creator`. The presence of a `/Creator` value containing `LaTeX` (set by the `hyperref` package) is a strong signal that the document is LaTeX-generated regardless of engine.

---

## Summary: Decision Tree for LaTeX PDFs

1. Read `/Info` → `/Producer` and `/Creator`.
2. If producer matches `pdfTeX`: assume OT1 by default; check font names for `ec` prefix (T1), `cmmi`/`cmsy`/`cmex` (math encodings).
3. If producer matches `LuaTeX`, or producer/creator matches `xdvipdfmx` or `XeTeX`: trust `/ToUnicode` CMaps; expand ligatures in the normalization pass.
4. If producer matches `Ghostscript` or is absent: assume no ToUnicode; use glyph-name tables as primary decode path.
5. For all engines: detect two-column layout geometrically; apply XY-cut before emitting text.
6. Detect and suppress arXiv watermark and page-number footers.
7. Detect bibliography zone by heading text; apply hanging-indent reference parser.
8. Expand all ligature codepoints (U+FB00–U+FB06) to constituent characters in the normalization pass.