# LaTeX and Scientific PDF Patterns

Scientific papers represent a large and important class of PDFs requiring text extraction. The vast majority are generated by LaTeX toolchains, which produce PDFs with structural and encoding characteristics that differ fundamentally from word-processor output. This document covers those characteristics in depth, with the goal of informing correct handling in the `pdftract` extraction pipeline.


---

## 1. LaTeX PDF Generators and Their Characteristics

### pdfLaTeX

pdfLaTeX is the dominant engine for academic papers. It reads `.tex` source and emits PDF directly. Key characteristics for extraction:

- Embeds **Type 1 fonts** (PostScript outlines), typically the Computer Modern family.
- Uses **OT1** (default), **T1**, **OMS**, **OML**, and **OMX** encodings, none of which is Unicode or ASCII.
- **Subsets fonts** with a six-letter uppercase prefix: `ABCDEF+CMR10`. The prefix is arbitrary; strip it when matching font names.
- Produces **untagged PDFs** by default: no logical structure tree, no reading-order metadata.
- `/ToUnicode` CMaps are not guaranteed. pdfTeX writes them only when `\pdfgentounicode=1` is set and glyph-to-Unicode mappings are loaded (for example via the `cmap` package or `glyphtounicode.tex`). Expect them to be absent or incomplete for math fonts and for older OT1-only documents.
- `/Producer` in the `/Info` dictionary contains `pdfTeX-1.x.x` or similar.

### XeLaTeX and LuaLaTeX

Both engines accept UTF-8 source directly and produce Unicode output.

- Embed **OpenType** or **TrueType** fonts (including system fonts via `fontspec`).
- Include well-formed `/ToUnicode` CMaps covering the characters actually used.
- Math via the `unicode-math` package maps to the Mathematical Alphanumeric Symbols Unicode block (U+1D400–U+1D7FF), which is far more extractable than OML/OMS.
- `/Producer` contains `LuaTeX` for LuaLaTeX. XeLaTeX writes its PDF through `xdvipdfmx`, so `/Producer` typically names `xdvipdfmx` and `XeTeX` tends to appear in `/Creator` instead.

### dvips → Ghostscript (legacy)

The old `latex` → `dvips` → `ps2pdf`/`Ghostscript` pipeline:

- Produces **Type 1** or **Type 3** fonts.
- `/ToUnicode` CMaps are often absent entirely; Ghostscript does not synthesize them.
- `/Producer` contains `GPL Ghostscript` or `Acrobat Distiller` (when Distiller was used on the PostScript output).
- Extraction quality is substantially lower; glyph-name fallback (see `glyph-recognition-and-unicode-recovery.md`) is the primary recovery path.

**Detection heuristic:** Inspect `/Info` → `/Producer`. Match `pdfTeX`, `XeTeX` (or `xdvipdfmx`, the PDF backend XeTeX uses), `LuaTeX`, or `Ghostscript` to select the appropriate decoding path.

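The detection heuristic can be sketched as an ordered list of patterns. This is a minimal illustration, not `pdftract`'s actual code: `classify_producer`, the engine labels, and the choice to also search `/Creator` are assumptions made for the example.

```python
import re
from typing import Optional

# Ordered (pattern, engine label) pairs; the first match wins.
# The labels are hypothetical names chosen for this sketch.
_PRODUCER_RULES = [
    (re.compile(r"pdfTeX", re.IGNORECASE), "pdftex"),
    (re.compile(r"XeTeX|xdvipdfmx", re.IGNORECASE), "xetex"),
    (re.compile(r"LuaTeX", re.IGNORECASE), "luatex"),
    (re.compile(r"Ghostscript|Distiller", re.IGNORECASE), "legacy"),
]


def classify_producer(producer: Optional[str], creator: Optional[str] = None) -> str:
    """Map /Producer (and optionally /Creator) to a decoding-path label."""
    haystack = " ".join(s for s in (producer, creator) if s)
    for pattern, engine in _PRODUCER_RULES:
        if pattern.search(haystack):
            return engine
    return "unknown"


# Example: classify_producer("pdfTeX-1.40.25") selects the "pdftex" path.
```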

---

## 2. OT1 Encoding and Computer Modern Fonts

OT1 is a 128-slot encoding invented for TeX. It is **not** ASCII-compatible, even though letters and digits sit at their ASCII positions. Without a `/ToUnicode` CMap, byte values from the content stream must be passed through the OT1-to-Unicode table.

**Detection:** Font name (after stripping the 6-letter subset prefix, and compared case-insensitively because embedded names usually appear as `CMR10`) starts with `cmr`, `cmti`, `cmbx`, `cmss`, or `cmtt` and the encoding is not explicitly T1. The table below follows the roman (`cmr`) layout; `cmtt` and `cmti` differ in a few punctuation slots.

### OT1 → Unicode Mapping (all 128 positions)

| Hex | Unicode | Glyph / Name |
|-----|---------|--------------|
| 00 | U+0393 | Γ (Gamma) |
| 01 | U+0394 | Δ (Delta) |
| 02 | U+0398 | Θ (Theta) |
| 03 | U+039B | Λ (Lambda) |
| 04 | U+039E | Ξ (Xi) |
| 05 | U+03A0 | Π (Pi) |
| 06 | U+03A3 | Σ (Sigma) |
| 07 | U+03A5 | Υ (Upsilon) |
| 08 | U+03A6 | Φ (Phi) |
| 09 | U+03A8 | Ψ (Psi) |
| 0A | U+03A9 | Ω (Omega) |
| 0B | U+FB00 | ff ligature |
| 0C | U+FB01 | fi ligature |
| 0D | U+FB02 | fl ligature |
| 0E | U+FB03 | ffi ligature |
| 0F | U+FB04 | ffl ligature |
| 10 | U+0131 | ı (dotless i) |
| 11 | U+0237 | ȷ (dotless j) |
| 12 | U+0060 | ` (grave) |
| 13 | U+00B4 | ´ (acute) |
| 14 | U+02C7 | ˇ (caron) |
| 15 | U+02D8 | ˘ (breve) |
| 16 | U+00AF | ¯ (macron) |
| 17 | U+02DA | ˚ (ring) |
| 18 | U+00B8 | ¸ (cedilla) |
| 19 | U+00DF | ß (German sharp s) |
| 1A | U+00E6 | æ |
| 1B | U+0153 | œ |
| 1C | U+00F8 | ø |
| 1D | U+00C6 | Æ |
| 1E | U+0152 | Œ |
| 1F | U+00D8 | Ø |
| 20 | (none) | stroke overlay used to build ł and Ł (glyph name: suppress); not a space, since TeX fonts carry no space glyph |
| 21 | U+0021 | ! |
| 22 | U+201D | ” (right double quotation mark, not the ASCII quote) |
| 23–26 | U+0023–U+0026 | # $ % & (ASCII) |
| 27 | U+2019 | ’ (right single quotation mark) |
| 28–2F | U+0028–U+002F | ( ) * + , - . / (ASCII) |
| 30–39 | U+0030–U+0039 | digits 0–9 (ASCII) |
| 3A–3B | U+003A–U+003B | : ; (ASCII) |
| 3C | U+00A1 | ¡ (inverted exclamation) |
| 3D | U+003D | = |
| 3E | U+00BF | ¿ (inverted question) |
| 3F–40 | U+003F–U+0040 | ? @ (ASCII) |
| 41–5A | U+0041–U+005A | A–Z (ASCII) |
| 5B | U+005B | [ |
| 5C | U+201C | “ (left double quotation mark, not backslash) |
| 5D | U+005D | ] |
| 5E | U+02C6 | ˆ (circumflex accent) |
| 5F | U+02D9 | ˙ (dot accent) |
| 60 | U+2018 | ‘ (left single quotation mark, not backtick) |
| 61–7A | U+0061–U+007A | a–z (ASCII) |
| 7B | U+2013 | – (en dash) |
| 7C | U+2014 | — (em dash) |
| 7D | U+02DD | ˝ (double acute / Hungarian umlaut) |
| 7E | U+02DC | ˜ (tilde accent) |
| 7F | U+00A8 | ¨ (diaeresis) |

Letters and digits are ASCII-identical: 0x30–0x39 are the digits 0–9, 0x41–0x5A the uppercase and 0x61–0x7A the lowercase Latin letters. Every other position follows the table above; the slots that differ from ASCII are 0x00–0x20, 0x22, 0x27, 0x3C, 0x3E, 0x5C, 0x5E–0x60, and 0x7B–0x7F.

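A lookup of this kind might be built as follows. The sketch assumes the standard roman (`cmr`) OT1 layout, with cedilla at 0x18, the left double quote at 0x5C, and the double acute at 0x7D. It leaves 0x20 unmapped on purpose and emits the f-ligatures already expanded. `build_ot1_table` and `decode_ot1` are hypothetical names.

```python
from typing import Dict


def build_ot1_table() -> Dict[int, str]:
    """Return a byte -> Unicode string map for the roman (cmr) OT1 layout."""
    # Letters, digits and most punctuation sit at their ASCII positions.
    table = {code: chr(code) for code in range(0x21, 0x7B)}
    # 0x00-0x0A: uppercase Greek.
    for offset, ch in enumerate("ΓΔΘΛΞΠΣΥΦΨΩ"):
        table[offset] = ch
    # 0x0B-0x0F: f-ligatures, emitted here as their constituent letters.
    for offset, text in enumerate(["ff", "fi", "fl", "ffi", "ffl"]):
        table[0x0B + offset] = text
    table.update({
        0x10: "\u0131", 0x11: "\u0237",                  # dotless i, dotless j
        0x12: "\u0060", 0x13: "\u00B4", 0x14: "\u02C7",  # grave, acute, caron
        0x15: "\u02D8", 0x16: "\u00AF", 0x17: "\u02DA",  # breve, macron, ring
        0x18: "\u00B8", 0x19: "\u00DF",                  # cedilla, sharp s
        0x1A: "\u00E6", 0x1B: "\u0153", 0x1C: "\u00F8",  # ae, oe, o-slash
        0x1D: "\u00C6", 0x1E: "\u0152", 0x1F: "\u00D8",  # AE, OE, O-slash
        # Slots in the ASCII range where OT1 departs from ASCII.
        0x22: "\u201D", 0x27: "\u2019", 0x3C: "\u00A1", 0x3E: "\u00BF",
        0x5C: "\u201C", 0x5E: "\u02C6", 0x5F: "\u02D9", 0x60: "\u2018",
        0x7B: "\u2013", 0x7C: "\u2014", 0x7D: "\u02DD", 0x7E: "\u02DC",
        0x7F: "\u00A8",
    })
    return table  # 0x20 (stroke overlay) is deliberately left unmapped


OT1_TO_UNICODE = build_ot1_table()


def decode_ot1(data: bytes) -> str:
    """Decode content-stream bytes; unmapped codes become U+FFFD."""
    return "".join(OT1_TO_UNICODE.get(b, "\uFFFD") for b in data)
```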

---

## 3. T1 Encoding (Cork Encoding)

T1, also called Cork encoding, is a 256-slot encoding designed for European languages. It is used by the EC (European Computer Modern) font family: `ecrm` (roman), `ecti` (italic), `ecbx` (bold extended), `ectt` (typewriter).

Key characteristics:

- Positions **0x00–0x1F** hold standalone accents (0x00 = U+0060 grave, 0x01 = U+00B4 acute, 0x02 = U+02C6 circumflex, …), quotation marks and guillemets, the en and em dashes (0x15, 0x16), dotless i and j, and the f-ligatures (0x1B–0x1F).
- Positions **0x80–0xFF** hold the precomposed accented letters: 0x80–0xBF cover most of Latin Extended-A (U+0100–U+017E; for example 0x80 = Ă, 0x81 = Ą), and 0xC0–0xFF largely follow the Latin-1 layout (0xE0 = à, 0xE1 = á, 0xE2 = â).
- T1 is an improvement over OT1 for multilingual text but still requires the lookup table; it is not Unicode.

**Detection:** Font name starts with `ec` (after stripping subset prefix), or the font's `/Encoding` array contains accent names like `/grave`, `/acute`, `/circumflex` in positions 0x00–0x02 together with accented letters such as `/agrave`, `/aacute`, `/acircumflex` at 0xE0–0xE2.

When a `/ToUnicode` CMap is present for a T1-encoded font, prefer it, because the CMap is authoritative. Fall back to the T1 table only when the CMap is absent or incomplete.

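The "prefer the CMap, fall back to the table" rule can be applied per character code, which also covers incomplete CMaps. The sketch below assumes the CMap has already been parsed into a plain mapping. `decode_codes` is a hypothetical helper, and `T1_PARTIAL` is a deliberately tiny excerpt of the Cork table for illustration only.

```python
from typing import Dict, Iterable, Mapping, Optional

# Partial T1 (Cork) table for illustration; a real table has 256 entries.
T1_PARTIAL: Dict[int, str] = {code: chr(code) for code in range(0x61, 0x7B)}  # a-z
T1_PARTIAL.update({
    0x15: "\u2013", 0x16: "\u2014",      # en dash, em dash
    0x1B: "ff", 0x1C: "fi", 0x1D: "fl",  # f-ligatures, already expanded
    0xE1: "\u00E1",                      # a with acute
})


def decode_codes(codes: Iterable[int],
                 to_unicode: Optional[Mapping[int, str]],
                 fallback: Mapping[int, str]) -> str:
    """Decode character codes, preferring ToUnicode entries over the table."""
    out = []
    for code in codes:
        if to_unicode is not None and code in to_unicode:
            out.append(to_unicode[code])              # CMap entry wins
        else:
            out.append(fallback.get(code, "\uFFFD"))  # table, else U+FFFD
    return "".join(out)
```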

---

## 4. Math Font Encodings

pdfLaTeX uses three math-specific encodings for which `/ToUnicode` CMaps are almost never generated.

### OML: Math Italic (cmmi fonts)

Used for variables and Greek letters in math mode. Uppercase Greek occupies 0x00–0x0A, and lowercase Greek (U+03B1–U+03C9, plus variant forms) starts at 0x0B. Latin letters sit at their ASCII positions but are italic glyphs; map them either to plain Latin letters or to the Mathematical Italic range (capitals U+1D434–U+1D44D, small letters U+1D44E–U+1D467, with italic h at U+210E). The `cmmi` font family (e.g., `cmmi10`, `cmmi7`) uses this encoding.

### OMS: Math Symbols (cmsy fonts)

Contains binary operators, relations, and arrows. Notable mappings: 0x00 = U+2212 (minus sign), 0x01 = U+22C5 (dot operator), 0x02 = U+00D7 (multiplication sign), 0x03 = U+2217 (asterisk operator), 0x04 = U+00F7 (division sign), 0x0E = U+2218 (ring operator), 0x0F = U+2219 (bullet operator), 0x31 = U+221E (infinity). The N-ary product and summation signs live in OMX, not OMS. The `cmsy` family uses this encoding.

### OMX: Math Extension (cmex fonts)

Large delimiters and extensible constructions: integral signs, summation, large parentheses. The `cmex` family uses this encoding. Many glyphs have no single Unicode equivalent because they are construction pieces; map to the closest Unicode math symbol or discard if purely decorative.

**Detection:** Strip the subset prefix and compare case-insensitively; if the font name starts with `cmmi` → OML, `cmsy` → OMS, `cmex` → OMX. Math extraction from pdfLaTeX documents without these tables produces sequences of incorrect characters. Even with the tables, reconstructing a readable math expression requires additional semantic analysis beyond character-level decoding.

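The font-name checks from Sections 2 to 4 could be combined into one lookup. This is only a sketch: `strip_subset_prefix` and `guess_encoding` are hypothetical names, and the stricter `ec` pattern (two letters then a digit, as in `ecrm1000`) is an assumption added to avoid matching unrelated fonts whose names merely begin with "ec".

```python
import re
from typing import Optional

_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")
_EC_FONT = re.compile(r"ec[a-z]{2}\d")  # e.g. ecrm1000, ecti1000

# Checked in order; names are compared in lower case.
_PREFIX_TO_ENCODING = [
    ("cmmi", "OML"), ("cmsy", "OMS"), ("cmex", "OMX"),
    ("cmr", "OT1"), ("cmti", "OT1"), ("cmbx", "OT1"),
    ("cmss", "OT1"), ("cmtt", "OT1"),
]


def strip_subset_prefix(font_name: str) -> str:
    """Drop a leading '/' and the six-letter 'ABCDEF+' subset tag, if any."""
    return _SUBSET_PREFIX.sub("", font_name.lstrip("/"))


def guess_encoding(font_name: str) -> Optional[str]:
    """Guess the TeX font encoding from a PDF font name, or return None."""
    base = strip_subset_prefix(font_name).lower()
    for prefix, encoding in _PREFIX_TO_ENCODING:
        if base.startswith(prefix):
            return encoding
    if _EC_FONT.match(base):
        return "T1"
    return None
```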

---

## 5. Ligature Handling

pdfLaTeX automatically substitutes ligature glyphs at the font level. In OT1 encoding:

- 0x0B = ff (U+FB00)
- 0x0C = fi (U+FB01)
- 0x0D = fl (U+FB02)
- 0x0E = ffi (U+FB03)
- 0x0F = ffl (U+FB04)

Some font families also produce `st` (U+FB06) and `ct` ligatures; `ct` has no standard Unicode codepoint. When a `/ToUnicode` CMap is present, a ligature may map to a single Unicode PUA codepoint, to the multi-character sequence ("fi"), or to a Unicode ligature codepoint (U+FB01, U+FB02, and so on).

The normalization pipeline should always **expand ligatures to their constituent characters** using an explicit lookup table or Unicode compatibility decomposition (NFKD). Compatibility normalization also folds unrelated characters such as superscripts and full-width forms, so the lookup table is the narrower tool. Retaining U+FB01 in extracted text breaks word matching, search, and tokenization. Apply expansion after CMap decode and before any further text processing.

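A minimal sketch of the lookup-table variant, using `str.translate` so that only the ligature codepoints are touched. `LIGATURES` and `expand_ligatures` are hypothetical names, and folding U+FB05 (long s + t) to "st" is a choice made for this example.

```python
# Explicit table for the Unicode Latin ligature codepoints U+FB00-U+FB06.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",  # long s + t, folded to plain "st"
    "\uFB06": "st",
}
_LIGATURE_TRANSLATION = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature codepoints with their constituent letters."""
    return text.translate(_LIGATURE_TRANSLATION)


# Unlike NFKD/NFKC, this leaves other compatibility characters
# (for example a superscript two) untouched.
```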

---

## 6. Hyperref Package and PDF Bookmarks

Documents using `\usepackage{hyperref}` gain:

- **PDF outline (bookmarks):** The `/Outlines` dictionary contains a tree of `/Title` entries. With hyperref's Unicode bookmarks these are **UTF-16BE** byte strings that start with the BOM bytes `FE FF`; strings without a BOM are in PDFDocEncoding. Decode accordingly to recover the section title text. These titles are high-quality: they reflect the actual section headings and are not subject to font encoding ambiguity.
- **Named destinations:** Cross-reference links point to named destinations through `/Dest` (e.g., `section.2` or `subsection.2.3`). These are navigation artifacts, not extraction targets, but they confirm logical document structure.
- **/Info dictionary:** `hyperref` populates `/Title`, `/Author`, `/Subject`, `/Keywords` from `\hypersetup{}`, or from `\title{}` and `\author{}` when the `pdfusetitle` option is set. This metadata is generally reliable and should be extracted as document-level metadata. XMP metadata (if present via `hyperxmp`) duplicates this in the `/Metadata` stream.

Detecting hyperref: check for the presence of `/Outlines` in the document catalog and for `/Dest` entries (or `/GoTo` actions) in link annotations.

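Decoding an outline `/Title` might look like the sketch below. It assumes the raw string bytes have already been unescaped by the PDF parser. `decode_pdf_text_string` is a hypothetical helper, Latin-1 is used only as an approximation of PDFDocEncoding (the two differ in a handful of positions), and the UTF-8 branch reflects the BOM-prefixed form allowed by PDF 2.0.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string such as an outline /Title."""
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return raw[2:].decode("utf-16-be", errors="replace")
    if raw.startswith(codecs.BOM_UTF8):      # bytes EF BB BF (PDF 2.0)
        return raw[3:].decode("utf-8", errors="replace")
    # Assumption: Latin-1 as a stand-in for PDFDocEncoding.
    return raw.decode("latin-1")
```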

---

## 7. Two-Column Academic Paper Layout

The standard LaTeX two-column layout (via `\twocolumn` or the `multicol` package) divides the text area into two equal columns separated by a narrow gutter (typically 10–20 pt). The page structure is:

- **Full-width zones:** title block, author list, abstract, section headings that span the page, and the bibliography in many journals.
- **Two-column zones:** body text, per-column figures and tables.

Column width formula: `col_width = (page_width - left_margin - right_margin - gutter) / 2`. For example, a 612 pt wide page with 54 pt margins and an 18 pt gutter gives (612 - 54 - 54 - 18) / 2 = 243 pt per column.

**Critical extraction problem:** The PDF content stream emits left column text first, then right column text, because that is the layout engine's natural order. A naive top-to-bottom Y-sort discards that order and interleaves the columns, producing unreadable output.

Correct handling: apply the XY-cut algorithm (documented in `complex-layout-reading-order.md`) with a vertical cut at the column midpoint. The cut should be detected geometrically: find the gap in X-coordinate density across all text runs on the page. Classify each text run as left-column or right-column by its left edge X coordinate relative to the cut point. Emit left column runs in reading order, then right column runs.

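The geometric cut detection can be illustrated with a simplified stand-in for the full XY-cut: merge the horizontal extents of all runs and take the widest uncovered band as the gutter. The `Run` type, `find_column_cut`, `reading_order`, and the 8 pt `min_gap` threshold are all assumptions made for this sketch. It also assumes full-width zones (title, abstract) were split off beforehand, since a run spanning both columns would hide the gutter.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Run:
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline in PDF user space (larger = higher on the page)
    text: str


def find_column_cut(runs: List[Run], min_gap: float = 8.0) -> Optional[float]:
    """Return the X midpoint of the widest empty vertical band, or None."""
    if not runs:
        return None
    intervals = sorted((r.x0, r.x1) for r in runs)
    best_gap, cut = 0.0, None
    covered_to = intervals[0][1]
    for x0, x1 in intervals[1:]:
        if x0 - covered_to > best_gap:  # uncovered band before this run
            best_gap, cut = x0 - covered_to, (x0 + covered_to) / 2
        covered_to = max(covered_to, x1)
    return cut if best_gap >= min_gap else None


def _top_down(run: Run) -> Tuple[float, float]:
    return (-run.y, run.x0)


def reading_order(runs: List[Run]) -> List[Run]:
    """Left column top to bottom, then right column top to bottom."""
    cut = find_column_cut(runs)
    if cut is None:  # single column: a plain Y-sort is enough
        return sorted(runs, key=_top_down)
    left = sorted((r for r in runs if r.x0 < cut), key=_top_down)
    right = sorted((r for r in runs if r.x0 >= cut), key=_top_down)
    return left + right
```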

---

## 8. Figure and Table Placement

LaTeX floats are placed by the layout engine independently of source order. A figure defined after a paragraph may appear above that paragraph (for example at the top of the same page) or several pages later. The caption is always spatially adjacent to the float content.

**Caption detection heuristics:**
- Begins with `Figure N.`, `Figure N:`, `Fig. N.`, `Table N.`, or `TABLE N.` (some journals use uppercase).
- Font size is smaller than body text (typically 9 pt vs. 10–11 pt body).
- Horizontally positioned within the float bounding box.
- For figures: caption is below the image content. For tables: caption is above in most styles, below in others.

Use the figure/table number sequence as a consistency check on reading order. If figures appear out of numeric sequence in the extracted output, the column-split or float-grouping logic likely has a bug.

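The caption prefix match and the numbering consistency check might be sketched as below. `CAPTION_RE`, `parse_caption`, and `out_of_sequence` are hypothetical names. The pattern handles Arabic numerals only, so Roman-numeral table labels would need an extension, and a body sentence beginning "Figure 3.2 ..." would be a false positive.

```python
import re
from typing import Iterable, List, Optional, Tuple

# "Figure 3:", "Fig. 3.", "Table 2.", "TABLE 2." and similar.
CAPTION_RE = re.compile(r"^(Figure|Fig\.|Table|TABLE)\s+(\d+)\s*[.:]")


def parse_caption(line: str) -> Optional[Tuple[str, int]]:
    """Return (kind, number) if the line looks like a caption start."""
    m = CAPTION_RE.match(line.strip())
    if not m:
        return None
    kind = "table" if m.group(1).lower().startswith("tab") else "figure"
    return kind, int(m.group(2))


def out_of_sequence(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """List captions whose number is lower than one already seen."""
    last = {"figure": 0, "table": 0}
    problems = []
    for line in lines:
        parsed = parse_caption(line)
        if parsed is None:
            continue
        kind, number = parsed
        if number < last[kind]:
            problems.append(parsed)
        last[kind] = max(last[kind], number)
    return problems
```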

---

## 9. Bibliography and References

The bibliography zone in LaTeX papers has a predictable structure:

- **Section heading:** "References" or "Bibliography"; detect this as a zone boundary.
- **Labelled entries:** `[1]` style (numeric, e.g. `\bibliographystyle{plain}` or `natbib` with the `numbers` option) or `[AuthorYY]` style (alphabetic labels, e.g. `\bibliographystyle{alpha}`). The label is in the left margin, with the reference text indented (hanging indent pattern). Author-year styles such as `plainnat` print no label, so there the hanging indent alone marks where an entry starts.
- **Entry structure:** author list, title (often in quotes or small caps), venue/journal name (often italic), volume/issue, year, pages.

**Extraction issues:**
- Em dashes and en dashes: pdfLaTeX encodes `--` (en dash) as glyph 0x7B in OT1, which decodes correctly to U+2013, but some configurations or older Ghostscript pipelines yield a hyphen-minus (U+002D) instead. Apply post-extraction heuristics: in a bibliography context, a hyphen between two page numbers (such as `123-145`) or between two spaces is likely an en dash.
- Author name separators use commas and `and`; do not split on these for purposes of word boundary detection.
- DOI and URL strings in references are often set in a monospace font (`cmtt`) and may contain percent-encoded characters.

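The dash heuristic could be applied token by token so that DOIs, URLs, and ISBNs are left alone. This is a sketch under that assumption. `restore_en_dashes` is a hypothetical name, and treating any bare `digits-digits` token as a range is a simplification.

```python
import re

# A whole token of the form 123-145, optionally followed by punctuation.
_RANGE_TOKEN = re.compile(r"^(\d+)-(\d+)([.,;)]?)$")


def restore_en_dashes(reference: str) -> str:
    """Turn likely en dashes (page ranges, spaced hyphens) into U+2013."""
    fixed = []
    for token in reference.split(" "):
        m = _RANGE_TOKEN.match(token)
        if m:
            fixed.append(f"{m.group(1)}\u2013{m.group(2)}{m.group(3)}")
        elif token == "-":  # hyphen standing alone between two spaces
            fixed.append("\u2013")
        else:
            fixed.append(token)  # DOIs, URLs, ISBNs pass through unchanged
    return " ".join(fixed)
```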

---

## 10. arXiv and Preprint PDFs

arXiv is the dominant source of scientific papers requiring extraction. arXiv PDFs are produced by arXiv's own TeX installation, which compiles author-submitted `.tex` source with pdfLaTeX or XeLaTeX.

**Common issues:**

1. **arXiv watermark:** arXiv stamps the first page with an identifier of the form `arXiv:XXXX.XXXXXvN [cs.XX] 1 Jan 2024`, set as rotated text along the left margin. This is a separate text layer added by arXiv's post-processing. Detect by pattern match: text matching `^arXiv:\d{4}\.\d{4,5}` positioned in the page margin. Papers from before April 2007 use the older `arXiv:archive/NNNNNNN` identifier form. Apply the watermark suppression pipeline (see `watermark-and-background-separation.md`) to exclude it from extracted body text.

2. **Page numbers in footer:** Standard footer text at a consistent Y position near the bottom margin, typically a single integer or `N of M`. Exclude from body extraction.

3. **Custom packages:** Some arXiv submissions use institutional or journal-specific TeX packages that define custom font encodings or use PUA (Private Use Area) codepoints for special symbols. These are rare but produce unmappable codepoints; flag them for glyph-name fallback.

4. **Version stamps:** Revised papers carry the version suffix (`v1`, `v2`, and so on) inside the stamp. The same pattern-match heuristic applies.

5. **Two-column journals:** Many arXiv papers target double-column journal formats. Apply the two-column splitting logic described in Section 7.

**Producer detection for arXiv:** arXiv's TeX Live installation produces `/Producer` values like `pdfTeX-1.40.x` for pdfLaTeX submissions and `XeTeX` or `xdvipdfmx` for XeLaTeX. The presence of a `/Creator` value containing `LaTeX` (set by the `hyperref` package) is a strong signal that the document is LaTeX-generated regardless of engine.

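The stamp pattern can be extended to capture the version and category as well. The sketch below is illustrative only: `ARXIV_STAMP` and `parse_arxiv_stamp` are hypothetical names, and the old-style identifier branch is a best-effort approximation of the pre-2007 format.

```python
import re
from typing import Dict, Optional

ARXIV_STAMP = re.compile(
    # New-style id (2401.01234) or old-style id (hep-th/9901001).
    r"^arXiv:(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})"
    r"(?P<version>v\d+)?"                  # optional version suffix
    r"(?:\s+\[(?P<category>[\w.\-]+)\])?"  # optional primary category
)


def parse_arxiv_stamp(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Return id/version/category if the text run starts with an arXiv stamp."""
    m = ARXIV_STAMP.match(text.strip())
    return m.groupdict() if m else None
```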

---

## Summary: Decision Tree for LaTeX PDFs

1. Read `/Info` → `/Producer` and `/Creator`.
2. If producer matches `pdfTeX`: assume OT1 by default; check font names for `ec` prefix (T1), `cmmi`/`cmsy`/`cmex` (math encodings). Prefer a `/ToUnicode` CMap wherever one is present.
3. If producer matches `XeTeX` (or `xdvipdfmx`) or `LuaTeX`: trust `/ToUnicode` CMaps; expand ligatures via NFKD or a lookup table.
4. If producer matches `Ghostscript` or is absent: assume no ToUnicode; use glyph-name tables as primary decode path.
5. For all engines: detect two-column layout geometrically; apply XY-cut before emitting text.
6. Detect and suppress arXiv watermark and page-number footers.
7. Detect bibliography zone by heading text; apply hanging-indent reference parser.
8. Expand all ligature codepoints (U+FB00–U+FB06) to constituent characters in the normalization pass.