pdftract/docs/research/latex-and-scientific-pdf-patterns.md
jedarden 8f8138a65e Add research: font subsetting, LaTeX patterns, redaction detection
Three new extraction research documents covering subset font Unicode
recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and
proper vs. improper redaction detection with output schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:30:52 -04:00


# LaTeX and Scientific PDF Patterns
Scientific papers represent a large and important class of PDFs requiring text extraction. The vast majority are generated by LaTeX toolchains, which produce PDFs with structural and encoding characteristics that differ fundamentally from word-processor output. This document covers those characteristics in depth, with the goal of informing correct handling in the `pdftract` extraction pipeline.
---
## 1. LaTeX PDF Generators and Their Characteristics
### pdfLaTeX
pdfLaTeX is the dominant engine for academic papers. It reads `.tex` source and emits PDF directly. Key characteristics for extraction:
- Embeds **Type 1 fonts** (PostScript outlines), typically the Computer Modern family.
- Uses **OT1** (default), **T1**, **OMS**, **OML**, and **OMX** encodings — none of which are Unicode or ASCII.
- **Subsets fonts** with a six-letter uppercase prefix: `ABCDEF+CMR10`. The prefix is arbitrary; strip it when matching font names.
- Produces **untagged PDFs** by default — no logical structure tree, no reading-order metadata.
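- `/ToUnicode` CMaps are present only when the document asks for them, for example through the `cmap` package or `\pdfgentounicode=1` with `glyphtounicode.tex` loaded. They are commonly absent or incomplete for math fonts and for older OT1-only documents.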
- `/Producer` in the `/Info` dictionary contains `pdfTeX-1.x.x` or similar.
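Stripping the subset tag before any font-name matching can be sketched as follows. The function name `strip_subset_prefix` is hypothetical and not part of `pdftract`; the sketch assumes the tag is exactly six uppercase ASCII letters followed by `+`, as described above.

```python
import re

# A subset tag: exactly six uppercase ASCII letters followed by '+'.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(font_name: str) -> str:
    """Return the base font name with any subset tag removed.

    A leading '/' (PDF name syntax) is dropped first so that both
    '/ABCDEF+CMR10' and 'ABCDEF+CMR10' are handled.
    """
    return _SUBSET_PREFIX.sub("", font_name.lstrip("/"))


if __name__ == "__main__":
    print(strip_subset_prefix("ABCDEF+CMR10"))
    print(strip_subset_prefix("Helvetica"))
```

Names without a well-formed tag pass through unchanged, so the function is safe to apply to every font in the document.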
### XeLaTeX and LuaLaTeX
Both engines accept UTF-8 source directly and produce Unicode output.
- Embed **OpenType** or **TrueType** fonts (including system fonts via `fontspec`).
- Include well-formed `/ToUnicode` CMaps covering the full Unicode range used.
- Math via the `unicode-math` package maps to the Mathematical Alphanumeric Symbols Unicode block (U+1D400–U+1D7FF), which is far more extractable than OML/OMS.
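- `/Producer` contains `LuaTeX` for LuaLaTeX. XeLaTeX writes its PDF through `xdvipdfmx`, so `/Producer` often names `xdvipdfmx` while `XeTeX` appears in `/Creator`; check both fields.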
### dvips → Ghostscript (legacy)
The old `latex` → `dvips` → `ps2pdf`/`Ghostscript` pipeline:
- Produces **Type 1** or **Type 3** fonts.
- `/ToUnicode` CMaps are often absent entirely — Ghostscript does not synthesize them.
- `/Producer` contains `GPL Ghostscript` or `Acrobat Distiller` (when Distiller was used on the PostScript output).
- Extraction quality is substantially lower; glyph-name fallback (see `glyph-recognition-and-unicode-recovery.md`) is the primary recovery path.
**Detection heuristic:** Inspect `/Info` → `/Producer`. Match `pdfTeX`, `XeTeX`, `LuaTeX`, or `Ghostscript` to select the appropriate decoding path.
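A minimal sketch of this heuristic is shown below. The function name `classify_engine` and its return labels are hypothetical. As an assumption beyond the text above, it also looks at `/Creator` and at the string `xdvipdfmx`, because XeTeX output written through `xdvipdfmx` may not name XeTeX in `/Producer`.

```python
from typing import Optional


def classify_engine(producer: Optional[str], creator: Optional[str] = None) -> str:
    """Map /Info strings to a coarse engine label.

    Returns one of 'pdftex', 'xetex', 'luatex', 'legacy', 'unknown'.
    Matching is case-insensitive substring search over both fields.
    """
    haystack = f"{producer or ''} {creator or ''}".lower()
    if "pdftex" in haystack:
        return "pdftex"
    # Assumption: xdvipdfmx in /Producer indicates XeTeX output.
    if "xetex" in haystack or "xdvipdfmx" in haystack:
        return "xetex"
    if "luatex" in haystack:
        return "luatex"
    # dvips -> Ghostscript or Distiller: the legacy path.
    if "ghostscript" in haystack or "distiller" in haystack:
        return "legacy"
    return "unknown"


if __name__ == "__main__":
    print(classify_engine("pdfTeX-1.40.21"))
    print(classify_engine("GPL Ghostscript 9.53"))
```

An absent or unrecognized producer yields `unknown`, which a caller would probably treat like the legacy path.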
---
## 2. OT1 Encoding and Computer Modern Fonts
OT1 is a 128-slot encoding invented for TeX. It is **not** ASCII-compatible despite overlapping glyph shapes. Without a `/ToUnicode` CMap, byte values from the content stream must be passed through the OT1-to-Unicode table.
**Detection:** Font name (after stripping the 6-letter subset prefix, and compared case-insensitively because pdfTeX embeds names such as `CMR10`) starts with `cmr`, `cmti`, `cmbx`, `cmss`, or `cmtt` and the encoding is not explicitly T1.
### OT1 → Unicode Mapping (all 128 positions)
| Hex | Unicode | Glyph / Name |
|-----|---------|--------------|
| 00 | U+0393 | Γ (Gamma) |
| 01 | U+0394 | Δ (Delta) |
| 02 | U+0398 | Θ (Theta) |
| 03 | U+039B | Λ (Lambda) |
| 04 | U+039E | Ξ (Xi) |
| 05 | U+03A0 | Π (Pi) |
| 06 | U+03A3 | Σ (Sigma) |
| 07 | U+03A5 | Υ (Upsilon) |
| 08 | U+03A6 | Φ (Phi) |
| 09 | U+03A8 | Ψ (Psi) |
| 0A | U+03A9 | Ω (Omega) |
| 0B | U+FB00 | ff ligature |
| 0C | U+FB01 | fi ligature |
| 0D | U+FB02 | fl ligature |
| 0E | U+FB03 | ffi ligature |
| 0F | U+FB04 | ffl ligature |
| 10 | U+0131 | ı (dotless i) |
| 11 | U+0237 | ȷ (dotless j) |
| 12 | U+0060 | ` (grave) |
| 13 | U+00B4 | ´ (acute) |
| 14 | U+02C7 | ˇ (caron) |
| 15 | U+02D8 | ˘ (breve) |
| 16 | U+00AF | ¯ (macron) |
| 17 | U+02DA | ˚ (ring) |
| 18 | U+00B8 | ¸ (cedilla) |
| 19 | U+00DF | ß (German sharp s) |
| 1A | U+00E6 | æ |
| 1B | U+0153 | œ |
| 1C | U+00F8 | ø |
| 1D | U+00C6 | Æ |
| 1E | U+0152 | Œ |
| 1F | U+00D8 | Ø |
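| 20 | U+0337 | Polish-L stroke overlay (glyph name `suppress`, commonly mapped to U+0337); not a space, because TeX sets inter-word gaps by positioning rather than with a glyph |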
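| 21–2F | U+0021–U+002F | !"#$%&'()*+,-./ (ASCII, except 22 and 27 below) |
| 22 | U+201D | ” (right double quotation mark, NOT ASCII quote) |
| 27 | U+2019 | ’ (right single quotation mark) |
| 3C | U+00A1 | ¡ (inverted exclamation) |
| 3D | U+003D | = |
| 3E | U+00BF | ¿ (inverted question) |
| 5C | U+201C | “ (left double quotation mark, NOT backslash) |
| 5E | U+02C6 | ˆ (circumflex accent) |
| 5F | U+02D9 | ˙ (dot accent) |
| 60 | U+2018 | ‘ (left single quotation mark, NOT backtick) |
| 7B | U+2013 | – (en dash) |
| 7C | U+2014 | — (em dash) |
| 7D | U+02DD | ˝ (double acute, Hungarian umlaut) |
| 7E | U+02DC | ˜ (tilde accent) |
| 7F | U+00A8 | ¨ (diaeresis) |
Positions 0x30–0x39 are digits 0–9 (ASCII). Positions 0x41–0x5A and 0x61–0x7A are uppercase and lowercase Latin letters (ASCII-identical). The remaining punctuation (0x3A, 0x3B, 0x3F, 0x40, 0x5B, 0x5D) is also ASCII-identical. All other positions follow the table above. The table describes the roman fonts: the italic fonts (`cmti`) put £ at 0x24, and the typewriter fonts (`cmtt`) keep ASCII `<`, `>`, `\`, `_`, `{`, `|`, `}` and a straight `"`, so apply the table to `cmtt` with care.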
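As an illustration of table-driven decoding, the sketch below maps OT1 bytes to Unicode using only a partial table: uppercase Greek, the f-ligatures, and the quote and dash slots. The names `OT1_PARTIAL` and `decode_ot1` are hypothetical. Any slot not covered here is reported as U+FFFD rather than guessed, so a real decoder would fill in the rest of the table.

```python
# Uppercase Greek at 0x00-0x0A.
_GREEK = [0x0393, 0x0394, 0x0398, 0x039B, 0x039E, 0x03A0,
          0x03A3, 0x03A5, 0x03A6, 0x03A8, 0x03A9]

OT1_PARTIAL = {i: chr(cp) for i, cp in enumerate(_GREEK)}
OT1_PARTIAL.update({
    0x0B: "\uFB00", 0x0C: "\uFB01", 0x0D: "\uFB02",  # ff, fi, fl
    0x0E: "\uFB03", 0x0F: "\uFB04",                  # ffi, ffl
    0x22: "\u201D", 0x27: "\u2019", 0x60: "\u2018",  # curly quotes
    0x3C: "\u00A1", 0x3E: "\u00BF",                  # inverted ! and ?
    0x7B: "\u2013", 0x7C: "\u2014",                  # en dash, em dash
})

# Punctuation whose OT1 slot matches ASCII in the roman fonts.
_ASCII_SAFE = set(b"!#%&()*+,-./:;=?@[]")


def decode_ot1(data: bytes) -> str:
    """Decode content-stream bytes from an OT1 font (partial table)."""
    out = []
    for b in data:
        if b in OT1_PARTIAL:
            out.append(OT1_PARTIAL[b])
        elif b < 0x80 and chr(b).isalnum():
            out.append(chr(b))       # digits and Latin letters
        elif b in _ASCII_SAFE:
            out.append(chr(b))
        else:
            out.append("\uFFFD")     # slot not covered by this sketch
    return "".join(out)


if __name__ == "__main__":
    print(decode_ot1(b"\x0cnd"))     # ligature fi followed by "nd"
    print(decode_ot1(b"10\x7b20"))   # page range with an en dash
```

Ligatures are left as ligature codepoints at this stage; expanding them belongs to the later normalization pass.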
---
## 3. T1 Encoding (Cork Encoding)
T1, also called Cork encoding, is a 256-slot encoding designed for European languages. It is used by the EC (European Computer) font family: `ecrm` (roman), `ecti` (italic), `ecbx` (bold extended), `ectt` (typewriter).
Key characteristics:
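- Positions **0x00–0x1F** hold standalone accents (0x00 = grave, 0x01 = acute, 0x02 = circumflex, and so on), quotation marks and guillemets, the en and em dashes (0x15, 0x16), dotless ı and ȷ, and the f-ligatures ff, fi, fl, ffi, ffl (0x1B–0x1F).
- Positions **0x20–0x7E** largely match ASCII.
- Positions **0x80–0xFF** hold the precomposed accented letters: 0x80–0xBF cover Latin Extended-A letters (U+0100–U+017E), and 0xC0–0xFF mostly follow ISO 8859-1 (e.g., 0xE0 = U+00E0 à, 0xE1 = U+00E1 á, 0xE2 = U+00E2 â).
- T1 is an improvement over OT1 for multilingual text but still requires the lookup table; it is not Unicode.
**Detection:** Font name starts with `ec` (after stripping subset prefix), or the font's `/Encoding` array contains standalone accent names like `/grave`, `/acute`, `/circumflex` at positions 0x00–0x02 and ligature names like `/ff`, `/fi`, `/fl` at 0x1B–0x1D.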
When a `/ToUnicode` CMap is present for a T1-encoded font, prefer it — the CMap is authoritative. Fall back to the T1 table only when the CMap is absent or incomplete.
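The precedence rule can be sketched as a single lookup function. The name `decode_code` is hypothetical, and both tables are passed in as plain dictionaries; how the CMap is parsed and how the full T1 table is built are outside this sketch.

```python
from typing import Mapping, Optional


def decode_code(code: int,
                to_unicode: Optional[Mapping[int, str]],
                fallback: Mapping[int, str]) -> str:
    """Decode one character code, preferring the /ToUnicode CMap.

    to_unicode: parsed CMap entries, or None when the font has no CMap.
    fallback:   encoding table (e.g. T1) used when the CMap is absent
                or has no usable entry for this code.
    """
    if to_unicode is not None:
        mapped = to_unicode.get(code)
        if mapped:                        # skip missing or empty entries
            return mapped
    return fallback.get(code, "\uFFFD")  # unmapped in both tables


if __name__ == "__main__":
    t1_excerpt = {0xE9: "\u00E9"}         # illustrative one-entry table
    print(decode_code(0xE9, None, t1_excerpt))
    print(decode_code(0xE9, {0xE9: "E"}, t1_excerpt))
```

Checking per code rather than per font is what lets an incomplete CMap still contribute the entries it does have.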
---
## 4. Math Font Encodings
pdfLaTeX uses three math-specific encodings for which `/ToUnicode` CMaps are almost never generated.
### OML — Math Italic (cmmi fonts)
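Used for variables and Greek letters in math mode. Uppercase Greek occupies 0x00–0x0A, and lowercase Greek (U+03B1–U+03C9) starts at 0x0B, followed by variant forms such as ϑ and ϕ. Italic Latin letters sit at their ASCII positions (0x41–0x5A, 0x61–0x7A) and correspond to the Mathematical Italic range U+1D434–U+1D467 in the Mathematical Alphanumeric block (italic h is U+210E). The `cmmi` font family (e.g., `cmmi10`, `cmmi7`) uses this encoding.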
### OMS — Math Symbols (cmsy fonts)
Contains binary operators, relations, and arrows. Notable mappings: 0x00 = U+2212 (minus sign), 0x01 = U+22C5 (dot operator), 0x02 = U+00D7 (multiplication sign), 0x03 = U+2217 (asterisk operator), 0x04 = U+00F7 (division sign), 0x0E = U+2218 (ring operator), 0x0F = U+2219 (bullet operator), 0x31 = U+221E (infinity). Large operators such as the N-ary product live in OMX, not OMS. The `cmsy` family uses this encoding.
### OMX — Math Extension (cmex fonts)
Large delimiters and extensible constructions: integral signs, summation, large parentheses. The `cmex` family uses this encoding. Many glyphs have no single Unicode equivalent because they are construction pieces; map to the closest Unicode math symbol or discard if purely decorative.
**Detection:** Strip subset prefix; if font name starts with `cmmi` → OML, `cmsy` → OMS, `cmex` → OMX. Math extraction from pdfLaTeX documents without these tables produces sequences of incorrect characters. Even with the tables, reconstructing a readable math expression requires additional semantic analysis beyond character-level decoding.
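The font-name rules from this and the earlier sections can be combined into one lookup, sketched below. The function name `guess_tex_encoding` is hypothetical. Matching is case-insensitive, on the assumption that embedded names may appear as `CMMI10` rather than `cmmi10`.

```python
import re
from typing import Optional

_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# Checked in order; prefixes are compared in lowercase.
_RULES = [
    ("cmmi", "OML"),
    ("cmsy", "OMS"),
    ("cmex", "OMX"),
    ("cmr", "OT1"), ("cmti", "OT1"), ("cmbx", "OT1"),
    ("cmss", "OT1"), ("cmtt", "OT1"),
    # Caveat: a bare 'ec' prefix can also match unrelated font names.
    ("ec", "T1"),
]


def guess_tex_encoding(font_name: str) -> Optional[str]:
    """Guess a TeX encoding from a (possibly subset-tagged) font name."""
    base = _SUBSET_PREFIX.sub("", font_name.lstrip("/")).lower()
    for prefix, encoding in _RULES:
        if base.startswith(prefix):
            return encoding
    return None


if __name__ == "__main__":
    for name in ("ABCDEF+CMMI10", "GHIJKL+CMSY7", "QWERTY+CMR10", "Helvetica"):
        print(name, guess_tex_encoding(name))
```

A `None` result means the name gives no hint, and the caller should rely on `/ToUnicode` or glyph names instead.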
---
## 5. Ligature Handling
pdfLaTeX automatically substitutes ligature glyphs at the font level. In OT1 encoding:
- 0x0B = ff (U+FB00)
- 0x0C = fi (U+FB01)
- 0x0D = fl (U+FB02)
- 0x0E = ffi (U+FB03)
- 0x0F = ffl (U+FB04)
Some font families also produce `st` (U+FB06) and `ct` ligatures. When a `/ToUnicode` CMap is present, ligatures may map to a single Unicode PUA codepoint, to the multi-character sequence ("fi"), or to the Unicode ligature codepoints (U+FB01/U+FB02).
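The normalization pipeline should always **expand ligatures to their constituent characters**, using an explicit lookup table or Unicode compatibility decomposition (NFKD). Note that NFKD applied to whole strings also decomposes accented letters and folds other compatibility characters, including Mathematical Alphanumeric Symbols, so a ligature-only table is the safer choice when those must survive. Retaining U+FB01 in extracted text breaks word matching, search, and tokenization. Apply expansion after CMap decode and before any further text processing.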
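A minimal sketch of the explicit-table approach follows. The names `LIGATURES` and `expand_ligatures` are hypothetical; the table covers the Latin ligature codepoints U+FB00 through U+FB06 and touches nothing else.

```python
# Latin ligature codepoints and their expansions.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",   # long-s t ligature
    "\uFB06": "st",
}

# str.translate accepts a table mapping single characters to strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature codepoints with their constituent letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("e\uFB03cient work\uFB02ow"))
```

Unlike whole-string NFKD, this leaves precomposed accented letters and math alphanumerics exactly as they were decoded.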
---
## 6. Hyperref Package and PDF Bookmarks
Documents using `\usepackage{hyperref}` gain:
- **PDF outline (bookmarks):** The `/Outlines` dictionary contains a tree of `/Title` entries. Strings that begin with the byte order mark `FE FF` are **UTF-16BE**; decode them as such to recover the section title text. Strings without the mark are PDFDocEncoding. These titles are high-quality: they reflect the actual section headings and are not subject to font encoding ambiguity.
- **Named destinations:** Cross-reference links point to `/Dest` named destinations (e.g., `/section.2.3`). These are navigation artifacts, not extraction targets, but they confirm logical document structure.
- **/Info dictionary:** `hyperref` populates `/Title`, `/Author`, `/Subject`, `/Keywords` from `\hypersetup{}` or `\title{}/\author{}` commands. This metadata is reliable and should be extracted as document-level metadata. XMP metadata (if present via `hyperxmp`) duplicates this in the `/Metadata` stream.
Detecting hyperref: check for the presence of `/Outlines` in the document catalog and `/Dest` entries in link annotations.
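Decoding an outline `/Title` byte string might look like the sketch below. The function name `decode_pdf_text_string` is hypothetical. Python has no built-in PDFDocEncoding codec, so strings without a byte order mark are decoded as Latin-1 here; that is an approximation that agrees for ASCII and most accented Latin letters but not for every byte value.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string such as an outline /Title.

    Strings starting with the BOM FE FF are UTF-16BE. Anything else is
    treated as Latin-1, an approximation of PDFDocEncoding.
    """
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")


if __name__ == "__main__":
    title = b"\xfe\xff" + "2 Methods".encode("utf-16-be")
    print(decode_pdf_text_string(title))
```

The `errors="replace"` setting keeps a truncated or malformed title from aborting extraction of the rest of the outline.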
---
## 7. Two-Column Academic Paper Layout
The standard LaTeX two-column layout (via `\twocolumn` or the `multicol` package) divides the text area into two equal columns separated by a narrow gutter (typically 10–20 pt). The page structure is:
- **Full-width zones:** title block, author list, abstract, section headings that span the page, and the bibliography in many journals.
- **Two-column zones:** body text, per-column figures and tables.
Column width formula: `col_width = (page_width - left_margin - right_margin - gutter) / 2`.
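As a worked instance of the formula, the sketch below computes the column width and the X extents of both columns. The function name `column_geometry` and the sample dimensions (a 612 pt wide page, 54 pt margins, an 18 pt gutter) are illustrative assumptions, not values taken from any particular document class.

```python
from typing import Tuple

Span = Tuple[float, float]


def column_geometry(page_width: float, left_margin: float,
                    right_margin: float, gutter: float) -> Tuple[float, Span, Span]:
    """Return (col_width, left_column_span, right_column_span) in points."""
    col_width = (page_width - left_margin - right_margin - gutter) / 2
    left = (left_margin, left_margin + col_width)
    right_start = left[1] + gutter
    right = (right_start, right_start + col_width)
    return col_width, left, right


if __name__ == "__main__":
    # (612 - 54 - 54 - 18) / 2 = 243 pt per column
    print(column_geometry(612, 54, 54, 18))
```

With these inputs the left column spans 54 to 297 pt and the right column 315 to 558 pt, so the gutter midpoint sits at 306 pt.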
**Critical extraction problem:** The PDF content stream emits left column text first, then right column text — this is the layout engine's natural order. A naive top-to-bottom Y-sort interleaves the columns, producing unreadable output.
Correct handling: apply the XY-cut algorithm (documented in `complex-layout-reading-order.md`) with a vertical cut at the column midpoint. The cut should be detected geometrically — find the gap in X-coordinate density across all text runs on the page. Classify each text run as left-column or right-column by its left edge X coordinate relative to the cut point. Emit left column runs in reading order, then right column runs.
---
## 8. Figure and Table Placement
LaTeX floats are placed by the layout engine independently of source order. A figure defined after a paragraph may appear above it at the top of the same page, or be deferred to a later page. The caption is normally spatially adjacent to the float content.
**Caption detection heuristics:**
- Begins with `Figure N.`, `Fig. N.`, `Table N.`, or `TABLE N.` (some journals use uppercase).
- Font size is smaller than body text (typically 9 pt vs. 10–11 pt body).
- Horizontally positioned within the float bounding box.
- For figures: caption is below the image content. For tables: caption is above in most styles, below in others.
Use the figure/table number sequence as a consistency check on reading order. If figures appear out of numeric sequence in the extracted output, the column-split or float-grouping logic has a bug.
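The caption prefix test and the numbering check can be sketched as follows. The names `CAPTION_RE`, `caption_label`, and `out_of_sequence` are hypothetical, and only the textual part of the heuristics is covered; font size and position checks would need layout data.

```python
import re
from typing import Dict, List, Optional, Tuple

# 'Figure N.', 'Fig. N.', 'Table N.' or 'TABLE N.' at the start of a line.
CAPTION_RE = re.compile(r"^(Figure|Fig\.|Table|TABLE)\s+(\d+)\.")


def caption_label(line: str) -> Optional[Tuple[str, int]]:
    """Return ('figure' | 'table', number) for a caption line, else None."""
    m = CAPTION_RE.match(line.strip())
    if not m:
        return None
    kind = "figure" if m.group(1).lower().startswith("fig") else "table"
    return kind, int(m.group(2))


def out_of_sequence(lines: List[str]) -> List[Tuple[str, int]]:
    """List captions whose number does not exceed an earlier one of the same kind."""
    last: Dict[str, int] = {}
    bad = []
    for line in lines:
        label = caption_label(line)
        if label is None:
            continue
        kind, number = label
        if number <= last.get(kind, 0):
            bad.append(label)
        else:
            last[kind] = number
    return bad


if __name__ == "__main__":
    print(out_of_sequence(["Figure 1. A", "Fig. 3. C", "Figure 2. D"]))
```

A non-empty result points at a reading-order problem upstream rather than at the captions themselves.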
---
## 9. Bibliography and References
The bibliography zone in LaTeX papers has a predictable structure:
- **Section heading:** "References" or "Bibliography" — detect this as a zone boundary.
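- **Numbered entries:** `[1]` style (numeric, e.g. `plain`, `unsrt`, or `natbib` with the `numbers` option) or `[AuthorYY]` style (alphabetic labels such as `[Knu84]`, e.g. `alpha`). Author-year styles such as `plainnat` print no bracketed label in the list. Where a label exists it sits at the left edge, with the reference text indented (hanging indent pattern).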
- **Entry structure:** author list, title (often in quotes or small caps), venue/journal name (often italic), volume/issue, year, pages.
**Extraction issues:**
- Em dashes and en dashes: pdfLaTeX encodes `--` (en dash) as glyph 0x7B in OT1 (U+2013 correctly), but some configurations or older Ghostscript pipelines encode it as a hyphen-minus (U+002D). Apply post-extraction heuristics: a hyphen between two spaces in a bibliography context is likely an en dash.
- Author name separators use commas and `and`; do not split on these for purposes of word boundary detection.
- DOI and URL strings in references are often set in a monospace font (`cmtt`) and may contain percent-encoded characters.
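A rough sketch of label-based entry grouping and the en dash heuristic is given below. The names `LABEL_RE`, `split_entries`, and `restore_en_dashes` are hypothetical. The grouping assumes bracketed labels are present and that lines arrive in reading order; it does not use indentation geometry.

```python
import re
from typing import List

# '[12]' numeric labels or '[Knu84]' alphabetic labels at the start of a line.
LABEL_RE = re.compile(r"^\[(\d+|[A-Za-z+]+\d{2})\]\s")


def split_entries(lines: List[str]) -> List[str]:
    """Group bibliography lines into entries, one per bracketed label."""
    entries: List[str] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if LABEL_RE.match(text) or not entries:
            entries.append(text)          # a label starts a new entry
        else:
            entries[-1] += " " + text     # continuation of hanging indent
    return entries


def restore_en_dashes(entry: str) -> str:
    """Replace a hyphen-minus that stands between two spaces with U+2013."""
    return re.sub(r"(?<= )-(?= )", "\u2013", entry)


if __name__ == "__main__":
    raw = ["[1] A. Author. Title.", "  Journal, 2020.", "[2] B. Writer. Other."]
    print(split_entries(raw))
```

Because the dash rule only fires on space-delimited hyphens, hyphens inside DOIs and URLs are left untouched.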
---
## 10. arXiv and Preprint PDFs
arXiv is the dominant source of scientific papers requiring extraction. arXiv PDFs are produced by their own TeX installation, accepting author-submitted `.tex` source and compiling with pdfLaTeX or XeLaTeX.
**Common issues:**
1. **arXiv watermark:** PDFs compiled by arXiv carry an identifier stamp of the form `arXiv:XXXX.XXXXXvN [cs.XX] 1 Jan 2024`, normally set vertically in the left margin of the first page. This is a separate text layer added by arXiv's post-processing. Detect by pattern match: text matching `^arXiv:\d{4}\.\d{4,5}` in rotated orientation near the left page edge. Apply the watermark suppression pipeline (see `watermark-and-background-separation.md`) to exclude it from extracted body text.
2. **Page numbers in footer:** Standard footer text at a consistent Y position near the bottom margin, typically a single integer or `N of M`. Exclude from body extraction.
3. **Custom packages:** Some arXiv submissions use institutional or journal-specific TeX packages that define custom font encodings or use PUA (Private Use Area) codepoints for special symbols. These are rare but produce unmappable codepoints; flag them for glyph-name fallback.
4. **Version stamps:** arXiv v1/v2 papers may have revision watermarks. The same pattern-match heuristic applies.
5. **Two-column journals:** Many arXiv papers target double-column journal formats. Apply the two-column splitting logic described in Section 7.
**Producer detection for arXiv:** arXiv's TeX Live installation produces `/Producer` values like `pdfTeX-1.40.x` for pdfLaTeX submissions and `XeTeX` for XeLaTeX. The presence of a `/Creator` value containing `LaTeX` (set by the `hyperref` package) is a strong signal that the document is LaTeX-generated regardless of engine.
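The text-pattern side of items 1 and 2 can be sketched as below. The names `ARXIV_STAMP_RE`, `PAGE_NUMBER_RE`, and `drop_furniture` are hypothetical. The sketch checks text only, so a caller would still need to confirm the position of each candidate before discarding it; old-style identifiers such as `hep-th/9901001` are not covered.

```python
import re
from typing import List

# New-style identifier with an optional version suffix, e.g. 2401.01234v2.
ARXIV_STAMP_RE = re.compile(r"^arXiv:\d{4}\.\d{4,5}(v\d+)?\b")

# A bare integer or 'N of M'.
PAGE_NUMBER_RE = re.compile(r"^\d+( of \d+)?$")


def is_arxiv_stamp(text: str) -> bool:
    return bool(ARXIV_STAMP_RE.match(text.strip()))


def is_page_number(text: str) -> bool:
    return bool(PAGE_NUMBER_RE.match(text.strip()))


def drop_furniture(lines: List[str]) -> List[str]:
    """Remove stamp and page-number lines from a list of text lines.

    Text-only check: a real pipeline should also verify that the line
    sits in a margin before dropping it.
    """
    return [t for t in lines if not (is_arxiv_stamp(t) or is_page_number(t))]


if __name__ == "__main__":
    page = ["arXiv:2401.01234v2 [cs.CL] 1 Jan 2024", "Body text", "3 of 12"]
    print(drop_furniture(page))
```

Keeping the two predicates separate makes it straightforward to report suppressed stamps as metadata instead of discarding them.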
---
## Summary: Decision Tree for LaTeX PDFs
1. Read `/Info` → `/Producer` and `/Creator`.
2. If producer matches `pdfTeX`: use `/ToUnicode` CMaps where present; otherwise assume OT1 by default and check font names for an `ec` prefix (T1) or `cmmi`/`cmsy`/`cmex` (math encodings).
3. If producer matches `XeTeX` or `LuaTeX`: trust `/ToUnicode` CMaps; expand ligatures via NFKD.
4. If producer matches `Ghostscript` or is absent: assume no ToUnicode; use glyph-name tables as primary decode path.
5. For all engines: detect two-column layout geometrically; apply XY-cut before emitting text.
6. Detect and suppress arXiv watermark and page-number footers.
7. Detect bibliography zone by heading text; apply hanging-indent reference parser.
8. Expand all ligature codepoints (U+FB00–U+FB06) to constituent characters in the normalization pass.