jedarden 8f8138a65e Add research: font subsetting, LaTeX patterns, redaction detection

Three new extraction research documents covering subset font Unicode
recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and
proper vs. improper redaction detection with output schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:30:52 -04:00


LaTeX and Scientific PDF Patterns

Scientific papers represent a large and important class of PDFs requiring text extraction. The vast majority are generated by LaTeX toolchains, which produce PDFs with structural and encoding characteristics that differ fundamentally from word-processor output. This document covers those characteristics in depth, with the goal of informing correct handling in the pdftract extraction pipeline.

1. LaTeX PDF Generators and Their Characteristics

pdfLaTeX

pdfLaTeX is the dominant engine for academic papers. It reads .tex source and emits PDF directly. Key characteristics for extraction:

Embeds Type 1 fonts (PostScript outlines), typically the Computer Modern family.
Uses OT1 (default), T1, OMS, OML, and OMX encodings — none of which are Unicode or ASCII.
Subsets fonts with a six-letter uppercase prefix: ABCDEF+CMR10. The prefix is arbitrary; strip it when matching font names.
Produces untagged PDFs by default — no logical structure tree, no reading-order metadata.
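/ToUnicode CMaps are not guaranteed: pdfTeX writes them only when glyph-to-Unicode generation is enabled (for example by the cmap package or \pdfgentounicode=1 with glyphtounicode.tex). Expect them to be absent or incomplete for math fonts and for older OT1-only documents.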
/Producer in the /Info dictionary contains pdfTeX-1.x.x or similar.

XeLaTeX and LuaLaTeX

Both engines accept UTF-8 source directly and produce Unicode output.

Embed OpenType or TrueType fonts (including system fonts via fontspec).
Include well-formed /ToUnicode CMaps covering the full Unicode range used.
Math via the unicode-math package maps to the Mathematical Alphanumeric Symbols Unicode block (U+1D400–U+1D7FF), which is far more extractable than OML/OMS.
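/Producer contains XeTeX or LuaTeX. XeLaTeX output is usually written by the xdvipdfmx driver, which may report itself in /Producer and leave XeTeX in /Creator, so check both fields.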

dvips → Ghostscript (legacy)

The old latex → dvips → ps2pdf/Ghostscript pipeline:

Produces Type 1 or Type 3 fonts.
/ToUnicode CMaps are often absent entirely — Ghostscript does not synthesize them.
/Producer contains GPL Ghostscript or Acrobat Distiller (when Distiller was used on the PostScript output).
Extraction quality is substantially lower; glyph-name fallback (see glyph-recognition-and-unicode-recovery.md) is the primary recovery path.

Detection heuristic: Inspect /Info → /Producer. Match pdfTeX, XeTeX, LuaTeX, or Ghostscript to select the appropriate decoding path.
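The detection heuristic and the subset-prefix rule from this section can be sketched as follows. This is a minimal illustration, not pdftract's actual code: the function names and the returned labels are hypothetical, and the substrings matched are only the ones named above.

```python
import re

# Six uppercase letters followed by "+" mark a subsetted font, e.g. "ABCDEF+CMR10".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# (substring to look for, label) pairs; the labels are illustrative only.
PRODUCER_HINTS = (
    ("pdftex", "pdftex"),
    ("xetex", "xetex"),
    ("luatex", "luatex"),
    ("ghostscript", "ghostscript"),
)


def strip_subset_prefix(font_name: str) -> str:
    """Remove the arbitrary subset tag so font names can be compared."""
    return SUBSET_PREFIX.sub("", font_name)


def classify_producer(producer) -> str:
    """Map a /Producer string (or None) to a coarse engine label."""
    text = (producer or "").lower()
    for needle, label in PRODUCER_HINTS:
        if needle in text:
            return label
    return "unknown"
```

A caller would read /Producer from the /Info dictionary with whatever PDF library is in use and pass the decoded string in. An "unknown" result is best treated like the Ghostscript case, since no /ToUnicode guarantee exists.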

2. OT1 Encoding and Computer Modern Fonts

OT1 is a 128-slot encoding invented for TeX. It is not ASCII-compatible despite overlapping glyph shapes. Without a /ToUnicode CMap, byte values from the content stream must be passed through the OT1-to-Unicode table.

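Detection: Font name (after stripping the 6-letter subset prefix) starts with cmr, cmti, cmbx, cmss, or cmtt and the encoding is not explicitly T1. Treat cmtt as a variant: the typewriter fonts share most of the layout but differ in several slots (for example, 0x7B–0x7D hold {, |, and } rather than dashes).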

OT1 → Unicode Mapping (all 128 positions)

Hex	Unicode	Glyph / Name
00	U+0393	Γ (Gamma)
01	U+0394	Δ (Delta)
02	U+0398	Θ (Theta)
03	U+039B	Λ (Lambda)
04	U+039E	Ξ (Xi)
05	U+03A0	Π (Pi)
06	U+03A3	Σ (Sigma)
07	U+03A5	Υ (Upsilon)
08	U+03A6	Φ (Phi)
09	U+03A8	Ψ (Psi)
0A	U+03A9	Ω (Omega)
0B	U+FB00	ff ligature
0C	U+FB01	fi ligature
0D	U+FB02	fl ligature
0E	U+FB03	ffi ligature
0F	U+FB04	ffl ligature
10	U+0131	ı (dotless i)
11	U+0237	ȷ (dotless j)
12	U+0060	` (grave)
13	U+00B4	´ (acute)
14	U+02C7	ˇ (caron)
15	U+02D8	˘ (breve)
16	U+00AF	¯ (macron)
17	U+02DA	˚ (ring)
18	U+00B8	¸ (cedilla)
19	U+00DF	ß (German sharp s)
1A	U+00E6	æ
1B	U+0153	œ
1C	U+00F8	ø
1D	U+00C6	Æ
1E	U+0152	Œ
1F	U+00D8	Ø
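20	n/a	ł/Ł stroke overlay in Computer Modern roman, not a space (inter-word gaps are positioning adjustments, not glyphs)
21–2F	U+0021–U+002F	!"#$%&'()*+,-./ (ASCII, except 0x22 and 0x27 below)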
22	U+201D	" (right double quotation mark — NOT ASCII quote)
27	U+2019	' (right single quotation mark)
3C	U+00A1	¡ (inverted exclamation)
3D	U+003D	=
3E	U+00BF	¿ (inverted question)
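5C	U+201C	“ (left double quotation mark; NOT backslash)
5E	U+02C6	ˆ (circumflex accent)
5F	U+02D9	˙ (dot accent)
60	U+2018	‘ (left single quotation mark; NOT backtick)
7B	U+2013	– (en dash)
7C	U+2014	— (em dash)
7D	U+02DD	˝ (double acute, Hungarian umlaut)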
7E	U+02DC	˜ (tilde accent)
7F	U+00A8	¨ (diaeresis)

Positions 0x30–0x39 are digits 0–9 (ASCII). Positions 0x41–0x5A and 0x61–0x7A are uppercase and lowercase Latin letters (ASCII-identical), as are 0x3A, 0x3B, 0x3F, 0x40, 0x5B, and 0x5D. All other positions follow the table above.

3. T1 Encoding (Cork Encoding)

T1, also called Cork encoding, is a 256-slot encoding designed for European languages. It is used by the EC (European Computer Modern) font family: ecrm (roman), ecti (italic), ecbx (bold extended), ectt (typewriter).

Key characteristics:

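Positions 0x00–0x1F contain standalone accents and punctuation rather than letters: 0x00–0x0C are accents (0x00 = grave, 0x01 = acute, 0x02 = circumflex, and so on), 0x0D–0x16 are quotation marks, guillemets, and the en and em dashes, 0x19–0x1A are dotless i and j, and 0x1B–0x1F are the ff, fi, fl, ffi, ffl ligatures.
Positions 0x80–0xFF contain the precomposed accented letters. 0x80–0xBF cover Latin Extended-A characters (e.g., 0x80 = U+0102 Ă, 0x81 = U+0104 Ą), and 0xC0–0xFF largely mirror ISO 8859-1 (e.g., 0xE0 = U+00E0 à, 0xE1 = U+00E1 á, 0xE2 = U+00E2 â), with exceptions such as 0xFF = U+00DF ß.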
T1 is an improvement over OT1 for multilingual text but still requires the lookup table; it is not Unicode.

Detection: Font name starts with ec (after stripping subset prefix), or the font's /Encoding array contains accent names like /grave, /acute, /circumflex in positions 0x00–0x02 and letter names like /agrave, /aacute, /acircumflex in positions 0xE0–0xE2.

When a /ToUnicode CMap is present for a T1-encoded font, prefer it — the CMap is authoritative. Fall back to the T1 table only when the CMap is absent or incomplete.
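The "prefer the CMap, fall back to the table" rule can be applied per character code, which also covers the incomplete-CMap case. The sketch below assumes the CMap has already been parsed into a plain dict from code to string. decode_codes and T1_EXCERPT are hypothetical names, and the table is only a small excerpt of T1, not the full 256 slots.

```python
from typing import Dict, Iterable, Optional

# Excerpt of the T1 table for illustration: three accented slots plus the
# ASCII-identical Latin letters.
T1_EXCERPT: Dict[int, str] = {0xE9: "\u00E9", 0xF6: "\u00F6", 0xFF: "\u00DF"}
T1_EXCERPT.update({code: chr(code) for code in range(0x41, 0x5B)})
T1_EXCERPT.update({code: chr(code) for code in range(0x61, 0x7B)})


def decode_codes(
    codes: Iterable[int],
    to_unicode: Optional[Dict[int, str]] = None,
    fallback: Optional[Dict[int, str]] = None,
) -> str:
    """Decode character codes, trying the CMap first and the table second."""
    out = []
    for code in codes:
        if to_unicode and code in to_unicode:
            out.append(to_unicode[code])   # the CMap is authoritative
        elif fallback and code in fallback:
            out.append(fallback[code])     # CMap absent or missing this code
        else:
            out.append("\uFFFD")           # nothing knows this code
    return "".join(out)
```

Passing a bytes object as codes works directly, since iterating over bytes yields integers.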

4. Math Font Encodings

pdfLaTeX uses three math-specific encodings for which /ToUnicode CMaps are almost never generated.

OML — Math Italic (cmmi fonts)

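Used for variables and Greek letters in math mode. Uppercase Greek occupies 0x00–0x0A (U+0393–U+03A9) and lowercase Greek, including the variant forms, occupies 0x0B–0x27 (U+03B1–U+03C9 plus the variant-letter code points). Italic Latin letters sit at their ASCII positions (0x41–0x5A, 0x61–0x7A) and can map either to plain Latin letters or to the Mathematical Italic range (capitals U+1D434–U+1D44D, small letters U+1D44E–U+1D467, with italic h at U+210E). The cmmi font family (e.g., cmmi10, cmmi7) uses this encoding.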

OMS — Math Symbols (cmsy fonts)

Contains binary operators, relations, and arrows. Notable mappings: 0x00 = U+2212 (minus sign), 0x01 = U+22C5 (dot operator), 0x02 = U+00D7 (multiplication sign), 0x03 = U+2217 (asterisk operator), 0x04 = U+00F7 (division sign), 0x0E = U+2218 (ring operator), 0x0F = U+2219 (bullet operator), 0x31 = U+221E (infinity). The N-ary product and summation signs are not in OMS; they live in the OMX extension font. The cmsy family uses this encoding.

OMX — Math Extension (cmex fonts)

Large delimiters and extensible constructions: integral signs, summation, large parentheses. The cmex family uses this encoding. Many glyphs have no single Unicode equivalent because they are construction pieces; map to the closest Unicode math symbol or discard if purely decorative.

Detection: Strip subset prefix; if font name starts with cmmi → OML, cmsy → OMS, cmex → OMX. Math extraction from pdfLaTeX documents without these tables produces sequences of incorrect characters. Even with the tables, reconstructing a readable math expression requires additional semantic analysis beyond character-level decoding.
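Taken together with the text-font rules from Sections 2 and 3, the name-based detection might look like the sketch below. guess_encoding is a hypothetical helper, and the prefix list contains only the families named in this document. The two-letter ec prefix in particular can match unrelated fonts, so a real implementation would want a stricter pattern.

```python
import re
from typing import Optional

_SUBSET = re.compile(r"^[A-Z]{6}\+")

# Checked in order; math families come first, then OT1 text families, then EC.
_PREFIXES = (
    ("cmmi", "OML"), ("cmsy", "OMS"), ("cmex", "OMX"),
    ("cmr", "OT1"), ("cmti", "OT1"), ("cmbx", "OT1"),
    ("cmss", "OT1"), ("cmtt", "OT1"),
    ("ec", "T1"),  # loose match: any name starting with "ec"
)


def guess_encoding(font_name: str) -> Optional[str]:
    """Guess a TeX font encoding from a PDF font name, or None if unknown."""
    base = _SUBSET.sub("", font_name).lower()
    for prefix, encoding in _PREFIXES:
        if base.startswith(prefix):
            return encoding
    return None
```

A None result means the name gives no hint, and the decoder should rely on /ToUnicode or the font's own /Encoding entry instead.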

5. Ligature Handling

pdfLaTeX automatically substitutes ligature glyphs at the font level. In OT1 encoding:

0x0B = ff (U+FB00)
0x0C = fi (U+FB01)
0x0D = fl (U+FB02)
0x0E = ffi (U+FB03)
0x0F = ffl (U+FB04)

Some font families also produce st (U+FB06) and ct ligatures. When a /ToUnicode CMap is present, ligatures may map to a single Unicode PUA codepoint, to the multi-character sequence ("fi"), or to the Unicode ligature codepoints (U+FB01/U+FB02).

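The normalization pipeline should always expand ligatures to their constituent characters using Unicode compatibility decomposition (NFKD) or an explicit lookup table. Compatibility normalization also rewrites other characters (superscripts, full-width forms, and, under NFKD, precomposed accented letters), so an explicit table is the safer choice when only ligatures should change. Retaining U+FB01 in extracted text breaks word matching, search, and tokenization. Apply expansion after CMap decode and before any further text processing.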
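A minimal sketch of the explicit-lookup variant, using only the standard library. The table covers the Unicode Latin ligature code points U+FB00–U+FB06; the name expand_ligatures is hypothetical.

```python
# Latin ligature code points and their expansions.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",  # long-s + t ligature, folded to plain "st"
    "\uFB06": "st",
}

_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature code points with their constituent letters."""
    return text.translate(_LIGATURE_TABLE)
```

Unlike unicodedata.normalize("NFKD", ...), this leaves every non-ligature character untouched, including precomposed accented letters.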

6. Hyperref Package and PDF Bookmarks

Documents using \usepackage{hyperref} gain:

PDF outline (bookmarks): The /Outlines dictionary contains a tree of items whose /Title entries are PDF text strings: either UTF-16BE with a leading BOM (bytes 0xFE 0xFF) or PDFDocEncoding when no BOM is present. Check for the BOM and decode accordingly to recover the section title text. These titles are high-quality: they reflect the actual section headings and are not subject to font encoding ambiguity.
Named destinations: Cross-reference links point to /Dest named destinations (e.g., /section.2.3). These are navigation artifacts, not extraction targets, but they confirm logical document structure.
/Info dictionary: hyperref populates /Title, /Author, /Subject, /Keywords from \hypersetup{} or \title{}/\author{} commands. This metadata is reliable and should be extracted as document-level metadata. XMP metadata (if present via hyperxmp) duplicates this in the /Metadata stream.

Detecting hyperref: check for the presence of /Outlines in the document catalog and /Dest entries in link annotations.
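Decoding an outline /Title once its raw bytes have been read can be sketched as below. decode_pdf_text_string is a hypothetical helper. The Latin-1 fallback is an approximation of PDFDocEncoding, which differs from Latin-1 in a few code ranges, so a full implementation would use a proper PDFDocEncoding table.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string such as an outline /Title."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # A BOM of FE FF marks UTF-16BE; drop it before decoding.
        return raw[2:].decode("utf-16-be", errors="replace")
    # No BOM: treat as PDFDocEncoding, approximated here by Latin-1.
    return raw.decode("latin-1")
```

Using errors="replace" keeps a truncated or odd-length title from raising, at the cost of a U+FFFD at the damaged position.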

7. Two-Column Academic Paper Layout

The standard LaTeX two-column layout (via \twocolumn or the multicol package) divides the text area into two equal columns separated by a narrow gutter (typically 10–20 pt). The page structure is:

Full-width zones: title block, author list, abstract, section headings that span the page, and the bibliography in many journals.
Two-column zones: body text, per-column figures and tables.

Column width formula: col_width = (page_width - left_margin - right_margin - gutter) / 2.

Critical extraction problem: The PDF content stream emits left column text first, then right column text — this is the layout engine's natural order. A naive top-to-bottom Y-sort interleaves the columns, producing unreadable output.

Correct handling: apply the XY-cut algorithm (documented in complex-layout-reading-order.md) with a vertical cut at the column midpoint. The cut should be detected geometrically — find the gap in X-coordinate density across all text runs on the page. Classify each text run as left-column or right-column by its left edge X coordinate relative to the cut point. Emit left column runs in reading order, then right column runs.
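The geometric cut detection can be illustrated with a single vertical cut, which is the simplest case of XY-cut rather than the full recursive algorithm. Everything here is a sketch under stated assumptions: Run is a hypothetical record type, the 0.6 width filter and the 35–65% centre band are illustrative thresholds, and full-width zones are not segmented separately.

```python
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    x0: float   # left edge
    x1: float   # right edge
    y: float    # baseline in PDF coordinates (larger y is higher on the page)
    text: str


def find_column_cut(runs: List[Run], page_width: float,
                    min_gap: float = 8.0) -> Optional[float]:
    """Return the X coordinate of the widest central gap, or None."""
    # Skip runs that span most of the page (titles, spanning headings).
    spans = sorted((r.x0, r.x1) for r in runs
                   if (r.x1 - r.x0) < 0.6 * page_width)
    best = None    # (gap width, cut position)
    reach = None   # rightmost X covered so far
    for x0, x1 in spans:
        if reach is not None and x0 - reach >= min_gap:
            centre = (reach + x0) / 2
            central = 0.35 * page_width <= centre <= 0.65 * page_width
            if central and (best is None or x0 - reach > best[0]):
                best = (x0 - reach, centre)
        reach = x1 if reach is None else max(reach, x1)
    return best[1] if best else None


def order_two_columns(runs: List[Run], page_width: float) -> List[Run]:
    """Emit left-column runs top to bottom, then right-column runs."""
    def top_down(run: Run):
        return (-run.y, run.x0)

    cut = find_column_cut(runs, page_width)
    if cut is None:
        return sorted(runs, key=top_down)
    left = [r for r in runs if r.x0 < cut]
    right = [r for r in runs if r.x0 >= cut]
    return sorted(left, key=top_down) + sorted(right, key=top_down)
```

For a US Letter page (612 pt wide) with 72 pt margins and a 20 pt gutter, the column width formula gives (612 - 72 - 72 - 20) / 2 = 224 pt, and the cut would be expected near x = 306.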

8. Figure and Table Placement

LaTeX floats are placed by the layout engine independently of source order. A figure defined after a paragraph may appear at the top of the same page, above that paragraph, or be deferred to a later page. The caption is almost always spatially adjacent to the float content.

Caption detection heuristics:

Begins with Figure N., Fig. N., Table N., or TABLE N. (some journals use uppercase).
Font size is smaller than body text (typically 9 pt vs. 10–11 pt body).
Horizontally positioned within the float bounding box.
For figures: caption is below the image content. For tables: caption is above in most styles, below in others.

Use the figure/table number sequence as a consistency check on reading order. If figures appear out of numeric sequence in the extracted output, the column-split or float-grouping logic has a bug.
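The caption prefix match and the numbering consistency check might be sketched like this. parse_caption and out_of_sequence are hypothetical names. Accepting a colon after the number, in addition to the period described above, is an assumption about journal styles.

```python
import re
from typing import Dict, List, Optional, Tuple

# "Figure 3.", "Fig. 3.", "Table 3.", "TABLE 3." (colon accepted as an assumption).
CAPTION_RE = re.compile(r"^(Figure|Fig\.|Table|TABLE)\s+(\d+)[.:]")


def parse_caption(line: str) -> Optional[Tuple[str, int]]:
    """Return ("figure" | "table", number) for a caption line, else None."""
    match = CAPTION_RE.match(line.strip())
    if match is None:
        return None
    kind = "table" if match.group(1).lower() == "table" else "figure"
    return kind, int(match.group(2))


def out_of_sequence(lines: List[str]) -> List[Tuple[str, int]]:
    """List captions whose number is lower than an earlier one of the same kind."""
    last_seen: Dict[str, int] = {}
    flagged = []
    for line in lines:
        parsed = parse_caption(line)
        if parsed is None:
            continue
        kind, number = parsed
        if number < last_seen.get(kind, 0):
            flagged.append(parsed)
        else:
            last_seen[kind] = number
    return flagged
```

A non-empty result from out_of_sequence on the extracted text is a cue to re-examine the column split or float grouping for that page range.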

9. Bibliography and References

The bibliography zone in LaTeX papers has a predictable structure:

Section heading: "References" or "Bibliography" — detect this as a zone boundary.
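Numbered entries: [1] style (numeric, e.g. plain, or natbib with the numbers option) or [Knu84] style (alphabetic labels, as in alpha). natbib author-year styles such as plainnat print entries without a label. When a label is present, it sits in the left margin, with the reference text indented (hanging indent pattern).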
Entry structure: author list, title (often in quotes or small caps), venue/journal name (often italic), volume/issue, year, pages.

Extraction issues:

Em dashes and en dashes: pdfLaTeX encodes -- (en dash) as glyph 0x7B in OT1 (U+2013 correctly), but some configurations or older Ghostscript pipelines encode it as a hyphen-minus (U+002D). Apply post-extraction heuristics: a hyphen between two spaces in a bibliography context is likely an en dash.
Author name separators use commas and and; do not split on these for purposes of word boundary detection.
DOI and URL strings in references are often set in a monospace font (cmtt) and may contain percent-encoded characters.
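The spaced-hyphen heuristic for dashes is small enough to show directly. restore_en_dash is a hypothetical helper and is meant to run only on text already classified as bibliography, since the same pattern elsewhere could be a minus sign or a list marker.

```python
import re

# A hyphen-minus with whitespace on both sides.
_SPACED_HYPHEN = re.compile(r"(?<=\s)-(?=\s)")


def restore_en_dash(text: str) -> str:
    """Replace a free-standing hyphen with an en dash (U+2013)."""
    return _SPACED_HYPHEN.sub("\u2013", text)
```

Hyphens inside words, DOIs, and URLs are left alone because they have no surrounding whitespace.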

10. arXiv and Preprint PDFs

arXiv is the dominant source of scientific papers requiring extraction. arXiv PDFs are produced by their own TeX installation, accepting author-submitted .tex source and compiling with pdfLaTeX or XeLaTeX.

Common issues:

arXiv watermark: arXiv stamps each PDF with an identifier line of the form arXiv:XXXX.XXXXXvN [cs.XX] 1 Jan 2024. The stamp is added by arXiv's post-processing, normally on the first page only and rotated 90° along the left margin, so it arrives as a separate run of vertical text rather than a running header. Detect by pattern match: text matching ^arXiv:\d{4}\.\d{4,5} whose baseline is rotated or whose X position lies inside the left margin. Apply the watermark suppression pipeline (see watermark-and-background-separation.md) to exclude it from extracted body text.
Page numbers in footer: Standard footer text at a consistent Y position near the bottom margin, typically a single integer or N of M. Exclude from body extraction.
Custom packages: Some arXiv submissions use institutional or journal-specific TeX packages that define custom font encodings or use PUA (Private Use Area) codepoints for special symbols. These are rare but produce unmappable codepoints; flag them for glyph-name fallback.
Version stamps: arXiv v1/v2 papers may have revision watermarks. The same pattern-match heuristic applies.
Two-column journals: Many arXiv papers target double-column journal formats. Apply the two-column splitting logic described in Section 7.
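The text side of the stamp and page-number checks can be sketched with two regular expressions. The helper names are hypothetical, the optional version suffix (v1, v2, ...) is an addition to the pattern quoted above, and the positional check is left out because it depends on the layout model.

```python
import re
from typing import List

# New-style arXiv identifier at the start of the text, optional version suffix.
ARXIV_STAMP = re.compile(r"^arXiv:\d{4}\.\d{4,5}(v\d+)?\b")

# A bare integer or "N of M".
PAGE_NUMBER = re.compile(r"^\d+( of \d+)?$")


def is_arxiv_stamp(text: str) -> bool:
    return ARXIV_STAMP.match(text.strip()) is not None


def is_page_number(text: str) -> bool:
    return PAGE_NUMBER.match(text.strip()) is not None


def drop_furniture(lines: List[str]) -> List[str]:
    """Remove stamp and page-number lines from a list of text lines."""
    return [line for line in lines
            if not (is_arxiv_stamp(line) or is_page_number(line))]
```

Old-style identifiers such as arXiv:hep-th/9901001 are not covered by this pattern. In practice the text match should be combined with the position test, since a bare integer can also be legitimate body content.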

Producer detection for arXiv: arXiv's TeX Live installation produces /Producer values like pdfTeX-1.40.x for pdfLaTeX submissions and XeTeX for XeLaTeX. The presence of a /Creator value containing LaTeX (set by the hyperref package) is a strong signal that the document is LaTeX-generated regardless of engine.

Summary: Decision Tree for LaTeX PDFs

Read /Info → /Producer and /Creator.
If producer matches pdfTeX: assume OT1 by default; check font names for ec prefix (T1), cmmi/cmsy/cmex (math encodings).
If producer matches XeTeX or LuaTeX: trust /ToUnicode CMaps; expand ligatures via NFKD.
If producer matches Ghostscript or is absent: assume no ToUnicode; use glyph-name tables as primary decode path.
For all engines: detect two-column layout geometrically; apply XY-cut before emitting text.
Detect and suppress arXiv watermark and page-number footers.
Detect bibliography zone by heading text; apply hanging-indent reference parser.
Expand all ligature codepoints (U+FB00–U+FB06) to constituent characters in the normalization pass.
