Add research: font subsetting, LaTeX patterns, redaction detection

Three new extraction research documents covering subset font Unicode
recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and
proper vs. improper redaction detection with output schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jedarden 2026-05-16 15:30:52 -04:00
parent 04b60a1cf7
commit 8f8138a65e
3 changed files with 619 additions and 0 deletions


@ -0,0 +1,180 @@
# Font Subsetting and Extraction
## Overview
Font subsetting is among the most consequential sources of text extraction failure in practice, yet it receives less attention than encoding tables or CMap parsing. The failure mode is subtle: a font can carry a valid ToUnicode CMap, a well-formed glyph table, and still produce incorrect or missing text because the subset was constructed in a way that breaks the assumptions the extractor relies on. This document covers the mechanics of subsetting, the naming conventions that identify subset fonts, the specific failure modes at each stage of the extraction pipeline, and the recovery strategies a Rust extractor should implement.
---
## 1. What Font Subsetting Is
Embedding an entire font in a PDF is rarely practical. A full OpenType CJK font routinely occupies 15–25 MB. A full Latin font with all OpenType features is 200–800 KB. Most documents use a fraction of those glyphs: a business letter uses roughly 70–120 distinct characters; a CJK document with 500 unique characters may draw on 0.5–2% of a full font's glyph repertoire.
Authoring tools solve this with **subsetting**: only the glyph programs actually referenced in the document's content streams are embedded. The authoring tool collects every character code appearing in `Tj`, `TJ`, and related text operators, resolves each to a glyph index in the source font, then extracts only those glyph programs into the embedded font. Additional glyphs may be included if the font's shaping rules require them — composite glyphs in TrueType (a glyph that references component glyphs via `glyf` entries), or ligature alternates that the layout engine applied during composition.
Subsetting ratios vary widely:
- **CJK, small document:** 0.5–2% of full font (200 CIDs from 20,000+)
- **Latin, typical document:** 15–50% of full font (80–300 glyphs from ~600)
- **Latin, near-exhaustive use:** 70–95% of full font (a typeset book using most of the character set)
The practical consequence for extraction: any glyph not in the subset is inaccessible. Attempting to render or name it yields `.notdef`. Knowing that a font is subsetted tells the extractor that absent glyph entries are not bugs — they are by design — and that only the embedded population can be relied upon.
---
## 2. Subset Font Naming
The PDF specification (ISO 32000-1, §9.6.4; §9.9.2 in ISO 32000-2) requires a specific naming convention for subset fonts. The `/BaseFont` value in the font dictionary and the `/FontName` value in the `/FontDescriptor` dictionary must both begin with a **tag of six uppercase letters followed by a plus sign**, e.g.:
```
ABCDEF+Helvetica
XYZQRT+NotoSansCJK-Regular
TMWVPK+CMR10
```
The six letters are chosen arbitrarily by the authoring tool; they carry no semantic content. They are not reproducible across invocations or tool versions — the same document saved twice may produce different prefixes. Their only function is to distinguish this subset instance from the full font and from other subsets of the same font within the same document.
In a Rust extractor, detecting subset fonts reduces to a pattern match against the `/BaseFont` or `/FontName` name object:
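```rust
/// Returns true when `name` starts with a six-uppercase-letter tag and `+`,
/// followed by at least one more byte of font name.
fn is_subset_font(name: &str) -> bool {
    let bytes = name.as_bytes();
    bytes.len() > 7
        && bytes[6] == b'+'
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase())
}

/// Returns the six-letter tag, or `None` for a non-subset font name.
fn extract_subset_prefix(name: &str) -> Option<&str> {
    // Slicing at byte 6 is safe: the first six bytes were verified as ASCII.
    is_subset_font(name).then(|| &name[..6])
}
```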
When both `/BaseFont` and `/FontDescriptor /FontName` are present, they should carry the same prefix. A mismatch indicates a malformed font dictionary; the extractor should prefer the `/FontDescriptor` value for identification purposes and log a warning.
---
## 3. Glyph Re-encoding in Subsets
Subsetting tools frequently re-assign character codes. In the source font, glyph `A` occupies code point 0x41 in the font's encoding or cmap. In the subset, the tool may compact the code space, assigning the glyphs sequential codes starting at 0x01 or 0x20. This is valid: the content stream uses whatever codes the authoring tool wrote, and the font's encoding machinery maps those codes to glyph indices. The critical link is the **ToUnicode CMap** (§9.10.3): it maps the reassigned in-PDF character codes back to Unicode scalar values. If the ToUnicode CMap is present and covers all codes used in the content stream, re-encoding is fully transparent to the extractor.
If the ToUnicode CMap is absent or incomplete, the extractor cannot recover Unicode values by examining the embedded font's cmap table alone, because that cmap reflects the subset's internal code assignments, not Unicode. The embedded cmap is useful for **cross-validating** ToUnicode entries but cannot substitute for it when codes have been reassigned.
---
## 4. CIDFont Subsetting
Type 0 (composite) fonts wrap a CIDFont. The CIDFont embeds glyph data indexed by **CID** (character identifier). For Identity-H and Identity-V CMaps, the CID equals the two-byte character code in the content stream. For other CMaps, the codespace ranges determine how many bytes form each character code, and the CMap's `cidchar` and `cidrange` mappings translate that code to a CID.
When a CIDFont is subsetted, the embedded font data contains only the CIDs that were used. The `/CIDToGIDMap` stream (when present) maps CIDs to glyph indices within the embedded font file; for a subset, only entries for included CIDs are meaningful. CIDs outside the subset either have no entry in the CIDToGIDMap or map to GID 0 (`.notdef`).
The ToUnicode CMap for a Type 0 font maps CIDs (or character codes) to Unicode. For subsetted CIDFonts, the ToUnicode CMap should cover exactly the CIDs present in the subset. A ToUnicode entry for a CID not in the subset is harmless but unreachable. A CID present in the content stream but absent from both ToUnicode and the embedded font is an unmapped extraction failure.
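The Identity-H/Identity-V case above is simple enough to sketch. The function below is a minimal illustration, not part of any existing extractor API: the name `decode_identity_cids` and the choice to return a stray trailing byte separately are assumptions made for the example.
```rust
/// Split a string operand shown with an Identity-H or Identity-V font into CIDs.
/// Each CID is one big-endian two-byte code. A trailing odd byte means the
/// string is malformed; it is returned separately so the caller can report it.
pub fn decode_identity_cids(bytes: &[u8]) -> (Vec<u16>, Option<u8>) {
    let mut pairs = bytes.chunks_exact(2);
    let cids = pairs
        .by_ref()
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    // Whatever did not fit into a full pair (zero or one byte).
    let leftover = pairs.remainder().first().copied();
    (cids, leftover)
}
```
Each returned CID can then be looked up in the ToUnicode CMap; a CID missing there and absent from the embedded font is the unmapped case described above.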
---
## 5. OpenType CFF and Type 1 Glyph Table Subsetting
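Type 1 and CFF fonts (the latter including OpenType fonts with a `CFF ` table) store glyph programs as **charstrings**. A Type 1 font holds them in a `CharStrings` dictionary keyed by glyph name; a CFF font holds them in a `CharStrings` INDEX ordered by glyph index, with glyph names supplied by the `charset` table. In a subset, only the charstrings for included glyphs are present, and the extractor can enumerate the surviving glyph names from either structure. CID-keyed CFF fonts are the exception: their `charset` lists CIDs rather than names.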
This property is useful: even in a heavily subsetted CFF font, the glyph names remain available (e.g., `A`, `fi`, `uni0041`, `uniE001`). For AGL (Adobe Glyph List) lookup, the glyph name is sufficient to recover Unicode without consulting ToUnicode. For shape fingerprinting (rendering the charstring to an outline and matching against a glyph database), only present charstrings can be rendered — the extractor must skip absent glyphs rather than treating their absence as an error.
---
## 6. TrueType Glyph Table Subsetting
TrueType fonts (embedded as `/FontFile2` streams) store glyph outlines in the `glyf` table, with an index in the `loca` table mapping each GID to its offset and length. After subsetting:
- **`loca` entries for excluded GIDs** point to zero-length regions (the GID is present in the index but has no glyph data).
- **`maxp.numGlyphs`** reflects the total GID range in the subset, not the full font.
- **`cmap` table** may be present and contains character-to-GID mappings for the subsetted characters only; non-subsetted characters either have no entry or map to GID 0.
The subset's `cmap` is a useful validation tool: for each code in the ToUnicode CMap, the extractor can verify that the Unicode scalar maps back to a GID with a non-empty `glyf` entry. Discrepancies surface authoring tool bugs or intentional re-encoding.
Composite TrueType glyphs (those whose `numberOfContours` field in the `glyf` header is negative) reference component GIDs. If a component GID was not included in the subset, the composite glyph's rendering breaks. Well-behaved subsetting tools always include required components, but the extractor should treat a composite glyph with a missing component as a rendering failure, not a parsing error, and fall back to ToUnicode for the character identity.
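The zero-length `loca` behaviour described in this section can be checked with a few lines. This is a sketch under stated assumptions: `populated_gids` is a hypothetical helper, and it expects `loca` offsets that have already been decoded to byte offsets (the short `loca` format stores offsets divided by two).
```rust
/// Return the GIDs whose `glyf` data is non-empty.
/// `loca` holds numGlyphs + 1 byte offsets; glyph `g` spans loca[g]..loca[g + 1],
/// so equal neighbouring offsets mean the glyph has no outline data.
pub fn populated_gids(loca: &[u32]) -> Vec<u16> {
    loca.windows(2)
        .enumerate()
        .filter(|(_, w)| w[1] > w[0])
        .map(|(gid, _)| gid as u16)
        .collect()
}
```
A GID missing from this list is either excluded from the subset or a legitimately blank glyph such as space, so the result is best used for cross-checking rather than as proof of exclusion.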
---
## 7. Incomplete ToUnicode CMaps
The most operationally significant failure mode is a ToUnicode CMap that covers only a subset of the codes actually used in the document. This happens when:
- The authoring tool generates ToUnicode incrementally and stops before covering all codes.
- The document was assembled from multiple sources with inconsistent encoding tables.
- The font was substituted late in the rendering pipeline without regenerating ToUnicode.
From the extractor's perspective, a code not present in the ToUnicode CMap is **unmapped**: the lookup returns nothing. This is distinct from a code that explicitly maps to U+FFFD (replacement character) or U+0000, both of which are valid (if uninformative) mappings. The extractor must distinguish:
1. `lookup(code) == None` — code absent from CMap; attempt fallback
2. `lookup(code) == Some(0xFFFD)` — explicit no-mapping; still attempt fallback
3. `lookup(code) == Some(c)` where `c` is a valid Unicode scalar — accept
For unmapped codes in a subset font, the recovery path is:
1. **AGL glyph name lookup**: if the embedded font (CFF or Type 1) has a glyph name for the GID, look it up in the Adobe Glyph List. Names like `A`, `fi`, `uni0041`, `uniE001` resolve to Unicode directly.
2. **Shape fingerprinting**: render the glyph outline from the embedded font (charstring execution for CFF, `glyf` parsing for TrueType) and match the normalized path against a reference glyph database. This is computationally expensive and reserved for high-value recovery scenarios.
3. **Unextractable**: if both fail, report the span as unextractable with the raw character code preserved for inspection.
AGL lookup and shape fingerprinting are always worth attempting for partially unmapped subset fonts, even if the majority of codes are mapped via ToUnicode. Partial coverage is common enough that implementing the fallback path yields meaningful improvements in extraction completeness.
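The three lookup cases and the recovery order can be sketched as one function. Everything named here is illustrative: `Resolved`, `resolve_code`, and `parse_uni_name` are hypothetical, the two `HashMap`s stand in for a parsed ToUnicode CMap and the Adobe Glyph List, ToUnicode targets are simplified to a single `char`, and shape fingerprinting is left out.
```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum Resolved {
    ToUnicode(char),
    AglGlyphName(char),
    Unextractable { raw_code: u16 },
}

/// Resolve one character code using ToUnicode first, then the glyph name.
pub fn resolve_code(
    code: u16,
    to_unicode: &HashMap<u16, char>,
    glyph_name: Option<&str>,
    agl: &HashMap<&str, char>,
) -> Resolved {
    // Case 3: a usable mapping. Cases 1 (absent) and 2 (U+FFFD) fall through.
    if let Some(&c) = to_unicode.get(&code) {
        if c != '\u{FFFD}' {
            return Resolved::ToUnicode(c);
        }
    }
    // Fallback: glyph list entry first, then the `uniXXXX` naming form.
    let from_name =
        glyph_name.and_then(|name| agl.get(name).copied().or_else(|| parse_uni_name(name)));
    match from_name {
        Some(c) => Resolved::AglGlyphName(c),
        // Keep the raw code so the span can be inspected downstream.
        None => Resolved::Unextractable { raw_code: code },
    }
}

/// Parse `uniXXXX` with exactly four uppercase hex digits (one code point only).
fn parse_uni_name(name: &str) -> Option<char> {
    let hex = name.strip_prefix("uni")?;
    let valid = hex.len() == 4
        && hex.bytes().all(|b| b.is_ascii_digit() || (b'A'..=b'F').contains(&b));
    if !valid {
        return None;
    }
    char::from_u32(u32::from_str_radix(hex, 16).ok()?)
}
```
Returning an enum rather than a bare `char` keeps the mapping source attached to the result, which is what the per-span reporting in Section 10 needs.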
---
## 8. Synthetic Glyphs and Outlines
Some subset tools — particularly those processing DRM-restricted fonts — cannot legally embed glyph outlines. Instead, they substitute **synthetic glyphs**: a charstring or `glyf` entry that traces a blank path or a simple rectangle matching the advance width of the original glyph. The glyph occupies the correct horizontal space and the ToUnicode CMap may map the code to the correct Unicode value, but the outline contains no recoverable character identity.
Detection in CFF charstrings: after executing the charstring, if the path operation list is empty (the glyph is `endchar` with no prior drawing operators), or contains only a single rectangular path (four `lineto` calls forming a closed rectangle), the glyph is synthetic. In TrueType, a `glyf` entry consisting solely of a single contour with four on-curve points arranged as a rectangle at the advance-width boundary is the equivalent indicator.
When a synthetic glyph is detected, the extractor should:
- Use the ToUnicode mapping if present (the Unicode value is likely correct even if the outline is not).
- Flag the span as `synthetic_glyph: true` in the extraction output.
- Report character identity confidence as lower, since the Unicode mapping was placed by the DRM tool and may be incorrect.
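A rough version of the outline test might look like the following. It assumes the charstring or `glyf` entry has already been flattened into closed contours of straight-line vertices in font units, and that any contour containing curve segments has been rejected beforehand; both function names are hypothetical.
```rust
/// True when four vertices form an axis-aligned rectangle: consecutive edges
/// must alternate between horizontal and vertical and have non-zero length.
pub fn is_axis_aligned_rectangle(points: &[(i32, i32)]) -> bool {
    if points.len() != 4 {
        return false;
    }
    (0..4).all(|i| {
        let (x0, y0) = points[i];
        let (x1, y1) = points[(i + 1) % 4];
        let (x2, y2) = points[(i + 2) % 4];
        let first_h = y0 == y1 && x0 != x1;
        let first_v = x0 == x1 && y0 != y1;
        let second_h = y1 == y2 && x1 != x2;
        let second_v = x1 == x2 && y1 != y2;
        (first_h && second_v) || (first_v && second_h)
    })
}

/// A glyph looks synthetic when it has no contours or exactly one rectangle.
pub fn looks_synthetic(contours: &[Vec<(i32, i32)>]) -> bool {
    match contours {
        [] => true,
        [only] => is_axis_aligned_rectangle(only),
        _ => false,
    }
}
```
One caveat: whitespace glyphs legitimately have empty outlines, so a caller would likely skip this test for codes that ToUnicode maps to a space character.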
---
## 9. Re-subsetting and Incremental Update Interactions
A PDF incremental update appends a new body section and cross-reference section without rewriting earlier content. When a page is added that uses an already-subsetted font, the update may extend the subset by writing new glyph data and a revised ToUnicode CMap. If the updated font objects reuse the original object numbers, they supersede the originals during cross-reference resolution: the **last definition of an object in the file wins**. If they are written under new object numbers, the new page's resource dictionary points at them and the originals stay in effect for earlier pages.
The extractor must parse the cross-reference chain from tail to head, ensuring that the most recent font dictionary and ToUnicode CMap are used for each font reference. A common mistake is merging ToUnicode CMaps additively across incremental updates; the correct behavior is to use only the latest CMap for the given font object, which should already incorporate all prior mappings.
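The "latest definition wins" rule can be sketched as a merge over cross-reference sections. This is a simplified model, not a full xref parser: `XrefSection` and `resolve_offsets` are hypothetical names, and free entries, generation numbers, and object streams are ignored.
```rust
use std::collections::HashMap;

/// One cross-reference section: object number -> byte offset of the object.
pub type XrefSection = HashMap<u32, u64>;

/// Merge sections ordered newest first, which is the order reached by
/// following `/Prev` from the final trailer. An entry from a newer section
/// is never overwritten by an older one.
pub fn resolve_offsets(sections_newest_first: &[XrefSection]) -> HashMap<u32, u64> {
    let mut resolved = HashMap::new();
    for section in sections_newest_first {
        for (&obj, &offset) in section {
            // Only fill in objects that no newer section has defined.
            resolved.entry(obj).or_insert(offset);
        }
    }
    resolved
}
```
Because the font dictionary and its ToUnicode stream are each resolved through this table, the extractor picks up the revised CMap as a whole and never needs to merge CMaps across updates.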
---
## 10. Detection and Reporting
Every font processed by the extractor should carry structured subsetting metadata:
```rust
pub struct FontSubsetInfo {
    pub is_subset: bool,
    pub subset_prefix: Option<String>, // e.g. "ABCDEF"
    pub glyphs_embedded: usize,        // from maxp.numGlyphs or CharStrings len
}
```
Every extracted span should reference the font's subsetting state and record the source of its Unicode mapping:
```rust
pub enum UnicodeSource {
    ToUnicode,
    AglGlyphName,
    ShapeFingerprint,
    Unmapped,
    SyntheticGlyph,
}

pub struct SpanMetadata {
    pub font_is_subset: bool,
    pub unicode_source: UnicodeSource,
    // ... other fields
}
```
Confidence assessment:
- `ToUnicode` from a subset font with verified complete CMap coverage: **high confidence**.
- `ToUnicode` from a subset font with partial CMap coverage, this span's codes all mapped: **high confidence**.
- `AglGlyphName`: **medium confidence** (glyph name may be generic or incorrect in the subset).
- `ShapeFingerprint`: **medium confidence** (outline matching has false-positive risk).
- `Unmapped` or `SyntheticGlyph`: **low confidence / unextractable**; preserve raw code for downstream inspection.
Reporting per-span confidence rather than per-font allows the caller to make document-level decisions: a document that is 98% `ToUnicode`-sourced with 2% `Unmapped` spans in footnotes is far more usable than one with 40% unmapped spans in body text.
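As an illustration of how a caller might use per-span sources, the sketch below maps each source to a confidence tier and computes the share of spans at or above a threshold. The `UnicodeSource` enum is repeated so the example compiles on its own; `Confidence`, `confidence_for`, and `share_at_least` are hypothetical names, and the tiers simply follow the list above.
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnicodeSource {
    ToUnicode,
    AglGlyphName,
    ShapeFingerprint,
    Unmapped,
    SyntheticGlyph,
}

/// Ordered so that `Low < Medium < High`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Confidence {
    Low,
    Medium,
    High,
}

pub fn confidence_for(source: UnicodeSource) -> Confidence {
    match source {
        UnicodeSource::ToUnicode => Confidence::High,
        UnicodeSource::AglGlyphName | UnicodeSource::ShapeFingerprint => Confidence::Medium,
        UnicodeSource::Unmapped | UnicodeSource::SyntheticGlyph => Confidence::Low,
    }
}

/// Fraction of spans whose confidence is at least `floor` (0.0 for no spans).
pub fn share_at_least(sources: &[UnicodeSource], floor: Confidence) -> f64 {
    if sources.is_empty() {
        return 0.0;
    }
    let hits = sources.iter().filter(|&&s| confidence_for(s) >= floor).count();
    hits as f64 / sources.len() as f64
}
```
A document-level policy could then be as simple as rejecting documents where `share_at_least(.., Confidence::Medium)` falls below some caller-chosen threshold.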


@ -0,0 +1,239 @@
# LaTeX and Scientific PDF Patterns
Scientific papers represent a large and important class of PDFs requiring text extraction. The vast majority are generated by LaTeX toolchains, which produce PDFs with structural and encoding characteristics that differ fundamentally from word-processor output. This document covers those characteristics in depth, with the goal of informing correct handling in the `pdftract` extraction pipeline.
---
## 1. LaTeX PDF Generators and Their Characteristics
### pdfLaTeX
pdfLaTeX is the dominant engine for academic papers. It reads `.tex` source and emits PDF directly. Key characteristics for extraction:
- Embeds **Type 1 fonts** (PostScript outlines), typically the Computer Modern family.
- Uses **OT1** (default), **T1**, **OMS**, **OML**, and **OMX** encodings — none of which are Unicode or ASCII.
- **Subsets fonts** with a six-letter uppercase prefix: `ABCDEF+CMR10`. The prefix is arbitrary; strip it when matching font names.
- Produces **untagged PDFs** by default — no logical structure tree, no reading-order metadata.
- `/ToUnicode` CMaps are emitted only when generation is enabled, for example through the `cmap` package or `\pdfgentounicode=1` with `glyphtounicode.tex` loaded; they are frequently absent or incomplete for math fonts.
- `/Producer` in the `/Info` dictionary contains `pdfTeX-1.x.x` or similar.
### XeLaTeX and LuaLaTeX
Both engines accept UTF-8 source directly and produce Unicode output.
- Embed **OpenType** or **TrueType** fonts (including system fonts via `fontspec`).
- Include well-formed `/ToUnicode` CMaps covering the full Unicode range used.
- Math via the `unicode-math` package maps to the Mathematical Alphanumeric Symbols Unicode block (U+1D400–U+1D7FF), which is far more extractable than OML/OMS.
- `/Producer` contains `xdvipdfmx` for XeLaTeX (whose default `/Creator` mentions `XeTeX`) or `LuaTeX` for LuaLaTeX.
### dvips → Ghostscript (legacy)
The old `latex` → `dvips` → `ps2pdf`/`Ghostscript` pipeline:
- Produces **Type 1** or **Type 3** fonts.
- `/ToUnicode` CMaps are often absent entirely — Ghostscript does not synthesize them.
- `/Producer` contains `GPL Ghostscript` or `Acrobat Distiller` (when Distiller was used on the PostScript output).
- Extraction quality is substantially lower; glyph-name fallback (see `glyph-recognition-and-unicode-recovery.md`) is the primary recovery path.
**Detection heuristic:** Inspect `/Info` → `/Producer` (and `/Creator`). Match `pdfTeX`, `XeTeX` or `xdvipdfmx`, `LuaTeX`, or `Ghostscript` to select the appropriate decoding path.
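A minimal sketch of that heuristic follows. The `TexEngine` enum and `classify_engine` function are hypothetical, and matching `xdvipdfmx` as a XeTeX signal is an assumption based on XeTeX's usual output driver; real producer strings vary by distribution and version.
```rust
#[derive(Debug, PartialEq, Eq)]
pub enum TexEngine {
    PdfTex,
    XeTex,
    LuaTex,
    Ghostscript,
    Distiller,
    Unknown,
}

/// Classify the generating toolchain from the `/Info` strings, if present.
pub fn classify_engine(producer: Option<&str>, creator: Option<&str>) -> TexEngine {
    // Compare case-insensitively; producer strings are not standardized.
    let producer = producer.unwrap_or("").to_ascii_lowercase();
    let creator = creator.unwrap_or("").to_ascii_lowercase();
    if producer.contains("pdftex") {
        TexEngine::PdfTex
    } else if producer.contains("luatex") {
        TexEngine::LuaTex
    } else if producer.contains("xetex") || producer.contains("xdvipdfmx") || creator.contains("xetex") {
        TexEngine::XeTex
    } else if producer.contains("ghostscript") {
        TexEngine::Ghostscript
    } else if producer.contains("distiller") {
        TexEngine::Distiller
    } else {
        TexEngine::Unknown
    }
}
```
The result should only pick a default decoding path; per-font evidence such as a present `/ToUnicode` CMap still takes priority.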
---
## 2. OT1 Encoding and Computer Modern Fonts
OT1 is a 128-slot encoding invented for TeX. It is **not** ASCII-compatible despite overlapping glyph shapes. Without a `/ToUnicode` CMap, byte values from the content stream must be passed through the OT1-to-Unicode table.
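**Detection:** Font name (after stripping the 6-letter subset prefix, and compared case-insensitively because embedded names are usually uppercase, e.g. `CMR10`) starts with `cmr`, `cmti`, `cmbx`, `cmss`, or `cmtt` and the encoding is not explicitly T1. Note that `cmtt` follows the typewriter variant of the layout, which keeps the ASCII glyphs for `"`, `<`, `>`, `\`, `{`, `|`, and `}` and has no f-ligatures.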
### OT1 → Unicode Mapping (all 128 positions)
| Hex | Unicode | Glyph / Name |
|-----|---------|--------------|
| 00 | U+0393 | Γ (Gamma) |
| 01 | U+0394 | Δ (Delta) |
| 02 | U+0398 | Θ (Theta) |
| 03 | U+039B | Λ (Lambda) |
| 04 | U+039E | Ξ (Xi) |
| 05 | U+03A0 | Π (Pi) |
| 06 | U+03A3 | Σ (Sigma) |
| 07 | U+03A5 | Υ (Upsilon) |
| 08 | U+03A6 | Φ (Phi) |
| 09 | U+03A8 | Ψ (Psi) |
| 0A | U+03A9 | Ω (Omega) |
| 0B | U+FB00 | ff ligature |
| 0C | U+FB01 | fi ligature |
| 0D | U+FB02 | fl ligature |
| 0E | U+FB03 | ffi ligature |
| 0F | U+FB04 | ffl ligature |
| 10 | U+0131 | ı (dotless i) |
| 11 | U+0237 | ȷ (dotless j) |
| 12 | U+0060 | ` (grave) |
| 13 | U+00B4 | ´ (acute) |
| 14 | U+02C7 | ˇ (caron) |
| 15 | U+02D8 | ˘ (breve) |
| 16 | U+00AF | ¯ (macron) |
| 17 | U+02DA | ˚ (ring) |
| 18 | U+00B8 | ¸ (cedilla) |
| 19 | U+00DF | ß (German sharp s) |
| 1A | U+00E6 | æ |
| 1B | U+0153 | œ |
| 1C | U+00F8 | ø |
| 1D | U+00C6 | Æ |
| 1E | U+0152 | Œ |
| 1F | U+00D8 | Ø |
| 20 | U+0337 | cross-stroke used to build ł and Ł (not a space; U+0337 is an approximate mapping) |
| 21–2F | U+0021–U+002F | !"#$%&'()*+,-./ (ASCII, except 0x22 and 0x27 below) |
| 22 | U+201D | ” (right double quotation mark, not the ASCII quote) |
| 27 | U+2019 | ’ (right single quotation mark) |
| 3C | U+00A1 | ¡ (inverted exclamation) |
| 3D | U+003D | = |
| 3E | U+00BF | ¿ (inverted question) |
| 5C | U+201C | “ (left double quotation mark, not backslash) |
| 5E | U+02C6 | ˆ (circumflex accent) |
| 5F | U+02D9 | ˙ (dot accent) |
| 60 | U+2018 | ‘ (left single quotation mark, not backtick) |
| 7B | U+2013 | – (en dash) |
| 7C | U+2014 | — (em dash) |
| 7D | U+02DD | ˝ (double acute, Hungarian umlaut) |
| 7E | U+02DC | ˜ (tilde accent) |
| 7F | U+00A8 | ¨ (diaeresis) |
Positions 0x30–0x39 are digits 0–9 (ASCII). Positions 0x41–0x5A and 0x61–0x7A are uppercase and lowercase Latin letters (ASCII-identical). Positions 0x3A, 0x3B, 0x3F, 0x40, 0x5B, and 0x5D also match ASCII. All other positions follow the table above.
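A table-driven decoder for this encoding could be sketched as below. The function name `ot1_to_unicode` is hypothetical, slot assignments follow the `cmr10` layout of OT1, and the value chosen for 0x20 (U+0337, for the ł/Ł cross-stroke) is an approximation rather than an established mapping.
```rust
/// Map one OT1 code to a Unicode scalar. Codes 0x80 and above are not part
/// of OT1 and yield `None`.
pub fn ot1_to_unicode(code: u8) -> Option<char> {
    // 0x00..=0x1F: Greek capitals, f-ligatures, dotless letters, accents,
    // and the special letters ß æ œ ø Æ Œ Ø.
    const LOW: [char; 32] = [
        '\u{0393}', '\u{0394}', '\u{0398}', '\u{039B}', '\u{039E}', '\u{03A0}', '\u{03A3}', '\u{03A5}',
        '\u{03A6}', '\u{03A8}', '\u{03A9}', '\u{FB00}', '\u{FB01}', '\u{FB02}', '\u{FB03}', '\u{FB04}',
        '\u{0131}', '\u{0237}', '\u{0060}', '\u{00B4}', '\u{02C7}', '\u{02D8}', '\u{00AF}', '\u{02DA}',
        '\u{00B8}', '\u{00DF}', '\u{00E6}', '\u{0153}', '\u{00F8}', '\u{00C6}', '\u{0152}', '\u{00D8}',
    ];
    Some(match code {
        0x00..=0x1F => LOW[code as usize],
        0x20 => '\u{0337}', // cross-stroke for ł/Ł, approximate mapping
        0x22 => '\u{201D}', // right double quotation mark
        0x27 => '\u{2019}', // right single quotation mark
        0x3C => '\u{00A1}', // inverted exclamation
        0x3E => '\u{00BF}', // inverted question
        0x5C => '\u{201C}', // left double quotation mark
        0x5E => '\u{02C6}', // circumflex accent
        0x5F => '\u{02D9}', // dot accent
        0x60 => '\u{2018}', // left single quotation mark
        0x7B => '\u{2013}', // en dash
        0x7C => '\u{2014}', // em dash
        0x7D => '\u{02DD}', // double acute
        0x7E => '\u{02DC}', // tilde accent
        0x7F => '\u{00A8}', // diaeresis
        0x21..=0x7A => code as char, // remaining slots coincide with ASCII
        _ => return None,
    })
}
```
The exception arms are listed before the ASCII range so that they take precedence; ligature code points returned here still need the expansion step described in Section 5.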
---
## 3. T1 Encoding (Cork Encoding)
T1, also called Cork encoding, is a 256-slot encoding designed for European languages. It is used by the EC (European Computer) font family: `ecrm` (roman), `ecti` (italic), `ecbx` (bold extended), `ectt` (typewriter).
Key characteristics:
- Positions **0x00–0x1F** hold standalone accents, quotation marks, guillemets, dashes, dotless letters, and f-ligatures (e.g., 0x00 = U+0060 grave, 0x01 = U+00B4 acute, 0x02 = U+02C6 circumflex, 0x15 = U+2013 en dash, 0x1C = U+FB01 fi).
- Positions **0x80–0xFF** hold precomposed accented letters: 0x80–0xBF draw mostly on Latin Extended-A (U+0100–U+017E), and 0xC0–0xFF largely mirror Latin-1 (U+00C0–U+00FF).
- T1 is an improvement over OT1 for multilingual text but still requires the lookup table; it is not Unicode.
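**Detection:** Font name starts with `ec` (after stripping subset prefix), or the font's `/Encoding` `/Differences` array places accent names such as `/grave`, `/acute`, `/circumflex` at positions 0x00–0x02 and accented-letter names such as `/agrave`, `/aacute`, `/acircumflex` at 0xE0–0xE2.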
When a `/ToUnicode` CMap is present for a T1-encoded font, prefer it — the CMap is authoritative. Fall back to the T1 table only when the CMap is absent or incomplete.
---
## 4. Math Font Encodings
pdfLaTeX uses three math-specific encodings for which `/ToUnicode` CMaps are almost never generated.
### OML — Math Italic (cmmi fonts)
Used for variables and Greek letters in math mode. Uppercase Greek occupies 0x00–0x0A, and lowercase Greek (U+03B1–U+03C9) begins at 0x0B. The Latin letter slots (0x41–0x5A and 0x61–0x7A) hold italic letters, best mapped to the Mathematical Italic range (U+1D434–U+1D467, with italic h at U+210E). The `cmmi` font family (e.g., `cmmi10`, `cmmi7`) uses this encoding.
### OMS — Math Symbols (cmsy fonts)
Contains binary operators, relations, and arrows. Notable mappings: 0x00 = U+2212 (minus sign), 0x01 = U+22C5 (dot operator), 0x02 = U+00D7 (multiplication sign), 0x03 = U+2217 (asterisk operator), 0x04 = U+00F7 (division sign), 0x0E = U+2218 (ring operator), 0x0F = U+2219 (bullet operator), 0x31 = U+221E (infinity). Large operators such as the N-ary product belong to OMX, not OMS. The `cmsy` family uses this encoding.
### OMX — Math Extension (cmex fonts)
Large delimiters and extensible constructions: integral signs, summation, large parentheses. The `cmex` family uses this encoding. Many glyphs have no single Unicode equivalent because they are construction pieces; map to the closest Unicode math symbol or discard if purely decorative.
**Detection:** Strip subset prefix; if font name starts with `cmmi` → OML, `cmsy` → OMS, `cmex` → OMX. Math extraction from pdfLaTeX documents without these tables produces sequences of incorrect characters. Even with the tables, reconstructing a readable math expression requires additional semantic analysis beyond character-level decoding.
---
## 5. Ligature Handling
pdfLaTeX automatically substitutes ligature glyphs at the font level. In OT1 encoding:
- 0x0B = ff (U+FB00)
- 0x0C = fi (U+FB01)
- 0x0D = fl (U+FB02)
- 0x0E = ffi (U+FB03)
- 0x0F = ffl (U+FB04)
Some font families also produce `st` (U+FB06) and `ct` ligatures. When a `/ToUnicode` CMap is present, ligatures may map to a single Unicode PUA codepoint, to the multi-character sequence ("fi"), or to the Unicode ligature codepoints (U+FB01/U+FB02).
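The normalization pipeline should always **expand ligatures to their constituent characters**, preferably with an explicit lookup table over U+FB00–U+FB06. Unicode compatibility decomposition (NFKD) also expands them, but it additionally decomposes accented letters and rewrites other compatibility characters, which may be more than intended. Retaining U+FB01 in extracted text breaks word matching, search, and tokenization. Apply expansion after CMap decode and before any further text processing.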
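The explicit lookup-table approach could look like this. `expand_ligatures` is a hypothetical name; the table covers only the Latin ligatures in U+FB00–U+FB06 and leaves every other character untouched, including PUA code points that some CMaps use for ligatures.
```rust
/// Replace Latin presentation-form ligatures with their constituent letters.
pub fn expand_ligatures(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            // U+FB05 is the long-s + t ligature; it is folded to "st" here.
            '\u{FB05}' | '\u{FB06}' => out.push_str("st"),
            other => out.push(other),
        }
    }
    out
}
```
Because accented letters pass through unchanged, this can run before any separate decision about Unicode normalization form.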
---
## 6. Hyperref Package and PDF Bookmarks
Documents using `\usepackage{hyperref}` gain:
- **PDF outline (bookmarks):** The `/Outlines` dictionary contains a tree of items whose `/Title` entries are PDF text strings: either PDFDocEncoding bytes or **UTF-16BE** byte strings that start with the byte order mark `FE FF` (hyperref's `unicode` option produces the latter). Check for the BOM before decoding. These titles are high-quality: they reflect the actual section headings and are not subject to font encoding ambiguity.
- **Named destinations:** Cross-reference links point to `/Dest` named destinations (e.g., `/section.2.3`). These are navigation artifacts, not extraction targets, but they confirm logical document structure.
- **/Info dictionary:** `hyperref` populates `/Title`, `/Author`, `/Subject`, `/Keywords` from `\hypersetup{}` or `\title{}/\author{}` commands. This metadata is reliable and should be extracted as document-level metadata. XMP metadata (if present via `hyperxmp`) duplicates this in the `/Metadata` stream.
Detecting hyperref: check for the presence of `/Outlines` in the document catalog and `/Dest` entries in link annotations.
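Decoding a BOM-prefixed outline title is a small, self-contained step. The sketch below handles only the UTF-16BE form; `decode_utf16be_text_string` is a hypothetical helper that returns `None` when the BOM is missing so the caller can fall back to PDFDocEncoding.
```rust
/// Decode a PDF text string that starts with the UTF-16BE byte order mark.
/// Returns `None` for strings without the BOM, with an odd number of payload
/// bytes, or with invalid surrogate sequences.
pub fn decode_utf16be_text_string(bytes: &[u8]) -> Option<String> {
    const BOM: [u8; 2] = [0xFE, 0xFF];
    let body = bytes.strip_prefix(&BOM)?;
    if body.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```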
---
## 7. Two-Column Academic Paper Layout
The standard LaTeX two-column layout (via `\twocolumn` or the `multicol` package) divides the text area into two equal columns separated by a narrow gutter (typically 10–20 pt). The page structure is:
- **Full-width zones:** title block, author list, abstract, section headings that span the page, and the bibliography in many journals.
- **Two-column zones:** body text, per-column figures and tables.
Column width formula: `col_width = (page_width - left_margin - right_margin - gutter) / 2`.
**Critical extraction problem:** The PDF content stream emits left column text first, then right column text — this is the layout engine's natural order. A naive top-to-bottom Y-sort interleaves the columns, producing unreadable output.
Correct handling: apply the XY-cut algorithm (documented in `complex-layout-reading-order.md`) with a vertical cut at the column midpoint. The cut should be detected geometrically — find the gap in X-coordinate density across all text runs on the page. Classify each text run as left-column or right-column by its left edge X coordinate relative to the cut point. Emit left column runs in reading order, then right column runs.
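The geometric gap search can be sketched without the full XY-cut machinery. This is a simplified stand-in, not the algorithm from `complex-layout-reading-order.md`: `find_column_cut` is a hypothetical function, each run is reduced to its `(left, right)` X extent in points, and the `max_run_width` and `min_gap` thresholds are assumptions the caller must tune.
```rust
/// Find the widest horizontal gap that no text run covers and return its
/// midpoint. Runs wider than `max_run_width` (titles, full-width headings)
/// are ignored so they do not bridge the gutter.
pub fn find_column_cut(runs: &[(f64, f64)], max_run_width: f64, min_gap: f64) -> Option<f64> {
    let mut spans: Vec<(f64, f64)> = runs
        .iter()
        .copied()
        .filter(|&(left, right)| right - left <= max_run_width)
        .collect();
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut best: Option<(f64, f64)> = None; // (gap width, gap midpoint)
    let mut covered_to = f64::NEG_INFINITY;
    for (left, right) in spans {
        // Distance from the right edge of everything seen so far.
        let gap = left - covered_to;
        if gap.is_finite() && gap >= min_gap && best.map_or(true, |(w, _)| gap > w) {
            best = Some((gap, covered_to + gap / 2.0));
        }
        covered_to = covered_to.max(right);
    }
    best.map(|(_, mid)| mid)
}
```
With a cut in hand, a run whose left edge is below the cut goes to the left column and every other narrow run to the right; a `None` result suggests a single-column page.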
---
## 8. Figure and Table Placement
LaTeX floats are placed by the layout engine independently of source order. A figure defined after a paragraph may appear on the previous page. The caption is always spatially adjacent to the float content.
**Caption detection heuristics:**
- Begins with `Figure N.`, `Fig. N.`, `Table N.`, or `TABLE N.` (some journals use uppercase).
- Font size is smaller than body text (typically 9 pt vs. 10–11 pt body).
- Horizontally positioned within the float bounding box.
- For figures: caption is below the image content. For tables: caption is above in most styles, below in others.
Use the figure/table number sequence as a consistency check on reading order. If figures appear out of numeric sequence in the extracted output, the column-split or float-grouping logic has a bug.
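The textual part of the caption heuristic can be sketched as a small parser. `CaptionKind` and `parse_caption_label` are hypothetical names; accepting `:` as well as `.` after the number is an added assumption, and the font-size and position checks from the list above are not covered.
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaptionKind {
    Figure,
    Table,
}

/// Parse a leading caption label such as "Fig. 3." or "TABLE 12:" and return
/// the float kind and number. In-text mentions like "Figure 3 shows" are
/// rejected because no '.' or ':' follows the number.
pub fn parse_caption_label(line: &str) -> Option<(CaptionKind, u32)> {
    const PREFIXES: [(&str, CaptionKind); 4] = [
        ("Figure", CaptionKind::Figure),
        ("Fig.", CaptionKind::Figure),
        ("Table", CaptionKind::Table),
        ("TABLE", CaptionKind::Table),
    ];
    let line = line.trim_start();
    let (kind, rest) = PREFIXES
        .iter()
        .find_map(|&(prefix, kind)| line.strip_prefix(prefix).map(|rest| (kind, rest)))?;
    let rest = rest.trim_start();
    let digits = rest.bytes().take_while(u8::is_ascii_digit).count();
    let number: u32 = rest[..digits].parse().ok()?;
    match rest[digits..].chars().next() {
        Some('.') | Some(':') => Some((kind, number)),
        _ => None,
    }
}
```
Collecting the returned numbers in emission order gives the sequence needed for the consistency check described above.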
---
## 9. Bibliography and References
The bibliography zone in LaTeX papers has a predictable structure:
- **Section heading:** "References" or "Bibliography" — detect this as a zone boundary.
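- **Labeled entries:** `[1]` style (numeric, e.g. `\bibliographystyle{plain}` or `unsrt`) or `[Aut99]` style (alphabetic labels, e.g. `alpha`). Author-year styles such as `natbib` with `plainnat` typically print no label and start each entry with the author list. Where a label exists it sits in the left margin, with the reference text indented (hanging indent pattern).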
- **Entry structure:** author list, title (often in quotes or small caps), venue/journal name (often italic), volume/issue, year, pages.
**Extraction issues:**
- Em dashes and en dashes: pdfLaTeX encodes `--` (en dash) as glyph 0x7B in OT1 (U+2013 correctly), but some configurations or older Ghostscript pipelines encode it as a hyphen-minus (U+002D). Apply post-extraction heuristics: a hyphen between two spaces in a bibliography context is likely an en dash.
- Author name separators use commas and `and`; do not split on these for purposes of word boundary detection.
- DOI and URL strings in references are often set in a monospace font (`cmtt`) and may contain percent-encoded characters.
---
## 10. arXiv and Preprint PDFs
arXiv is the dominant source of scientific papers requiring extraction. arXiv PDFs are produced by their own TeX installation, accepting author-submitted `.tex` source and compiling with pdfLaTeX or XeLaTeX.
**Common issues:**
1. **arXiv watermark:** arXiv stamps an identifier line of the form `arXiv:XXXX.XXXXXvN [cs.XX] 1 Jan 2024` onto the PDF, usually rotated along the left margin of the first page. This is a separate text layer added by arXiv's post-processing. Detect by pattern match: text matching `^arXiv:\d{4}\.\d{4,5}` positioned in the page margin, often with a rotated text matrix. Apply the watermark suppression pipeline (see `watermark-and-background-separation.md`) to exclude it from extracted body text.
2. **Page numbers in footer:** Standard footer text at a consistent Y position near the bottom margin, typically a single integer or `N of M`. Exclude from body extraction.
3. **Custom packages:** Some arXiv submissions use institutional or journal-specific TeX packages that define custom font encodings or use PUA (Private Use Area) codepoints for special symbols. These are rare but produce unmappable codepoints; flag them for glyph-name fallback.
4. **Version stamps:** arXiv v1/v2 papers may have revision watermarks. The same pattern-match heuristic applies.
5. **Two-column journals:** Many arXiv papers target double-column journal formats. Apply the two-column splitting logic described in Section 7.
**Producer detection for arXiv:** arXiv's TeX Live installation produces `/Producer` values like `pdfTeX-1.40.x` for pdfLaTeX submissions and `xdvipdfmx` for XeLaTeX submissions (where `XeTeX` appears in `/Creator` unless `hyperref` overrides it). The presence of a `/Creator` value containing `LaTeX` (set by the `hyperref` package) is a strong signal that the document is LaTeX-generated regardless of engine.
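The watermark pattern from item 1 can be matched without a regex dependency. `is_arxiv_stamp` is a hypothetical helper implementing `^arXiv:\d{4}\.\d{4,5}` as a prefix test; old-style identifiers such as `arXiv:hep-th/9901001` are deliberately not matched and would need a second pattern.
```rust
/// True when `text` starts with a new-style arXiv identifier:
/// "arXiv:" + four digits + "." + at least four digits.
pub fn is_arxiv_stamp(text: &str) -> bool {
    let rest = match text.strip_prefix("arXiv:") {
        Some(rest) => rest.as_bytes(),
        None => return false,
    };
    // Shortest valid form is "YYMM.NNNN": nine bytes with a dot at index 4.
    if rest.len() < 9 || rest[4] != b'.' {
        return false;
    }
    let year_month_ok = rest[..4].iter().all(u8::is_ascii_digit);
    let serial_digits = rest[5..].iter().take_while(|b| b.is_ascii_digit()).count();
    year_month_ok && serial_digits >= 4
}
```
A positive match is only one signal; the position and orientation of the text run should confirm it before the run is suppressed.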
---
## Summary: Decision Tree for LaTeX PDFs
1. Read `/Info` → `/Producer` and `/Creator`.
2. If producer matches `pdfTeX`: assume OT1 by default; check font names for `ec` prefix (T1), `cmmi`/`cmsy`/`cmex` (math encodings).
3. If producer matches `XeTeX`, `xdvipdfmx`, or `LuaTeX`: trust `/ToUnicode` CMaps; expand ligatures via NFKD or a lookup table.
4. If producer matches `Ghostscript` or is absent: assume no ToUnicode; use glyph-name tables as primary decode path.
5. For all engines: detect two-column layout geometrically; apply XY-cut before emitting text.
6. Detect and suppress arXiv watermark and page-number footers.
7. Detect bibliography zone by heading text; apply hanging-indent reference parser.
8. Expand all ligature codepoints (U+FB00–U+FB06) to constituent characters in the normalization pass.


@ -0,0 +1,200 @@
# Redaction Detection and Recovery
## Overview
PDF redaction is the process of permanently removing sensitive content from a document before publication. The operative word is "permanently" — proper redaction destroys the underlying data. In practice, a significant fraction of published "redacted" documents fail this requirement: the content is visually obscured but remains fully accessible in the content stream. `pdftract` must handle both cases correctly, surfacing recoverable text while accurately representing the extraction state to the caller.
---
## 1. Proper vs. Improper Redaction
**Proper redaction** modifies the content stream itself. The text operators covering the redacted region are removed and replaced with an opaque fill (typically black). The original characters are gone; no amount of content-stream inspection will recover them.
**Improper redaction** leaves the original text operators intact in the content stream and merely paints a covering graphic on top — a black rectangle, a dark raster image, or an opaque layer element. The text is fully present and extractable without any special technique; it simply is not rendered visibly.
The prevalence of improper redaction in government and legal documents is well-documented. Entire classified passages, witness names, and financial figures have been recovered from "redacted" PDFs produced by government agencies, law firms, and courts — the producing party drew a black box in Word or Acrobat without invoking the actual redaction workflow. `pdftract` must distinguish which case it is in, both to recover text where possible and to label that text with appropriate provenance warnings.
---
## 2. PDF Redaction Annotations (`/Redact`)
PDF 1.7 (ISO 32000-1) introduced the `/Redact` annotation subtype. A redaction annotation marks a region for removal and carries metadata about the intended replacement appearance. Key dictionary entries:
| Key | Type | Description |
|---|---|---|
| `/Subtype` | name | Must be `/Redact` |
| `/QuadPoints` | array of numbers | 8 × n numbers; each group of eight gives the four x,y corners of one covered quadrilateral |
| `/IC` | array | Interior fill color (DeviceRGB), typically `[0 0 0]` (black) |
| `/OverlayText` | text string | Replacement text rendered after apply (often empty or `"[REDACTED]"`) |
| `/Repeat` | boolean | Tile overlay text to fill the region |
| `/DA` | string | Default appearance string for overlay text (font, size, color) |
| `/RO` | stream | Form XObject specifying the overlay appearance of the redaction |
The critical distinction for `pdftract` is **applied vs. unapplied**:
- **Unapplied**: The annotation exists in the page `Annots` array but "Apply Redactions" has never been invoked. The content stream is unmodified. The text under `QuadPoints` is fully present and extractable.
- **Applied**: The application consumed the annotation (removed it from `Annots`), deleted the covered text operators from the content stream, and rendered the fill rectangle and overlay text directly into the stream. The annotation no longer exists. The text is genuinely absent.
When a `/Redact` annotation is still present in `Annots`, the document was not properly redacted. This is a detection opportunity.
---
## 3. Detecting Unapplied Redaction Annotations
During page object parsing, after collecting the `Annots` array, filter for entries where `/Subtype` equals `/Redact`. Each such entry represents intended but unapplied redaction.
**Algorithm:**
1. Resolve the `Annots` indirect references for the page.
2. For each annotation dictionary, check `/Subtype /Redact`.
3. Extract the `QuadPoints` array. Each group of eight values `[x1 y1 x2 y2 x3 y3 x4 y4]` defines one quadrilateral in page space (bottom-left origin).
4. Compute the axis-aligned bounding box of each quadrilateral.
5. After content-stream extraction, intersect these bounding boxes with the extracted text spans using the overlap test from Section 4.
6. Collect all spans whose bounding boxes overlap significantly with any redaction quadrilateral.
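Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not `pdftract`'s actual code: the `Rect` type and the `quad_points_to_bboxes` function are hypothetical names, and the input is assumed to be the already-parsed `QuadPoints` numbers. Taking the minimum and maximum of each coordinate keeps the result independent of vertex order, which varies between producers.

```rust
/// Axis-aligned rectangle in page space (points, bottom-left origin).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x_min: f64,
    pub y_min: f64,
    pub x_max: f64,
    pub y_max: f64,
}

/// Splits a flat `QuadPoints` array into groups of eight numbers and returns
/// the axis-aligned bounding box of each quadrilateral.
/// A trailing group with fewer than eight numbers is ignored.
pub fn quad_points_to_bboxes(quad_points: &[f64]) -> Vec<Rect> {
    quad_points
        .chunks_exact(8)
        .map(|q| {
            let xs = [q[0], q[2], q[4], q[6]];
            let ys = [q[1], q[3], q[5], q[7]];
            Rect {
                x_min: xs.iter().copied().fold(f64::INFINITY, f64::min),
                y_min: ys.iter().copied().fold(f64::INFINITY, f64::min),
                x_max: xs.iter().copied().fold(f64::NEG_INFINITY, f64::max),
                y_max: ys.iter().copied().fold(f64::NEG_INFINITY, f64::max),
            }
        })
        .collect()
}
```

Each returned `Rect` is then the input to the overlap test in step 5.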
**Output fields** for each discovered unapplied annotation:
```rust
RedactionEvent {
    event_type: RedactionType::UnappliedAnnotation,
    bbox: Rect,                      // from QuadPoints
    covering_element: None,
    recovered_text: Option<String>,  // text found under the annotation, if any
    redaction_warning: true,         // warning code: "unapplied_redaction_detected"
    annotation_ref: Some(ObjRef),    // indirect reference to the annotation dict
}
```
The recovered text must be included in page output with `zone: "redacted_content"` and `redaction_warning: true`. The caller can suppress it with `include_redacted_content: false`, but the `redaction_events` entry is always emitted regardless of that flag.
---
## 4. Detecting Improper Redaction via Black Rectangle Overlap
The most common improper redaction draws a filled black path over text using the PDF path-painting operators `f`, `F`, `f*` (fill) or `B`, `B*`, `b`, `b*` (fill-and-stroke).
**Detection algorithm:**
1. During graphics state tracking, maintain a list of closed filled paths with their current fill color.
2. A path qualifies as a candidate redaction rectangle when:
- The current fill color, expressed in DeviceGray, DeviceRGB, or DeviceCMYK, resolves to near-black (relative luminance < 0.05 after conversion to linear sRGB).
- The path's axis-aligned bounding box has area > 100 square points (roughly 0.12 cm², for example a 10 pt × 10 pt square, filtering out hairlines and thin rules).
- The path is convex (or is literally a rectangle: four straight segments forming a closed loop).
3. After both path collection and text span extraction are complete, test each text span against each candidate rectangle.
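4. Overlap test: compute the Intersection over Union (IoU) of the span's bounding box and the rectangle's bounding box. An IoU > 0.5 indicates the span is substantially covered. Because IoU also falls when the rectangle is much larger than the span, a rectangle that hides several spans should be compared against the union of those spans' bounding boxes rather than against each span separately.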
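The candidate filter in step 2 can be sketched as below. Several things here are assumptions rather than part of the design above: the `FillColor` enum and function names are hypothetical, device colors are treated as sRGB-encoded, DeviceCMYK is converted with the naive `(1 - c)(1 - k)` formula rather than an ICC profile, and the convexity test is left out.

```rust
/// Fill color as set by the `g`, `rg`, or `k` operators, components in 0.0..=1.0.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
pub enum FillColor {
    Gray(f64),
    Rgb(f64, f64, f64),
    Cmyk(f64, f64, f64, f64),
}

/// sRGB transfer function: encoded component to linear-light component.
fn srgb_to_linear(c: f64) -> f64 {
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// Relative luminance in linear sRGB (Rec. 709 weights).
/// Assumption: device colors are interpreted as sRGB-encoded values.
pub fn relative_luminance(color: FillColor) -> f64 {
    let (r, g, b) = match color {
        FillColor::Gray(v) => (v, v, v),
        FillColor::Rgb(r, g, b) => (r, g, b),
        // Naive CMYK to RGB conversion, no color management.
        FillColor::Cmyk(c, m, y, k) => (
            (1.0 - c) * (1.0 - k),
            (1.0 - m) * (1.0 - k),
            (1.0 - y) * (1.0 - k),
        ),
    };
    0.2126 * srgb_to_linear(r) + 0.7152 * srgb_to_linear(g) + 0.0722 * srgb_to_linear(b)
}

/// Near-black fill and bounding-box area above 100 square points.
pub fn is_candidate_rectangle(color: FillColor, width_pt: f64, height_pt: f64) -> bool {
    relative_luminance(color) < 0.05 && width_pt * height_pt > 100.0
}
```

A 200 pt × 12 pt black bar passes this filter, while a 200 pt × 0.4 pt black rule is rejected on area alone.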
Painting order matters. A black rectangle drawn **after** the text (later in the content stream) visually covers it but leaves the text operators intact. A rectangle drawn **before** the text would be painted over by the text, not covering it. Track the stream position index of each element to enforce the ordering requirement: the covering rectangle must have a higher stream position than the text spans it overlaps.
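A minimal sketch of the overlap test combined with the ordering requirement follows. The `Rect` and `Painted` types and the `covers` function are hypothetical names; `stream_pos` is assumed to be a counter incremented for every painting operator on the page.

```rust
/// Axis-aligned rectangle in page space (points). Hypothetical type.
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x_min: f64,
    pub y_min: f64,
    pub x_max: f64,
    pub y_max: f64,
}

impl Rect {
    pub fn area(&self) -> f64 {
        (self.x_max - self.x_min).max(0.0) * (self.y_max - self.y_min).max(0.0)
    }
}

/// Intersection over Union of two axis-aligned rectangles, in 0.0..=1.0.
pub fn iou(a: &Rect, b: &Rect) -> f64 {
    let w = (a.x_max.min(b.x_max) - a.x_min.max(b.x_min)).max(0.0);
    let h = (a.y_max.min(b.y_max) - a.y_min.max(b.y_min)).max(0.0);
    let inter = w * h;
    let union = a.area() + b.area() - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// A painted element (text span or filled path) with its stream position.
pub struct Painted {
    pub bbox: Rect,
    pub stream_pos: usize,
}

/// True when `rect` was painted after `span` and the two overlap with IoU > 0.5.
pub fn covers(rect: &Painted, span: &Painted) -> bool {
    rect.stream_pos > span.stream_pos && iou(&rect.bbox, &span.bbox) > 0.5
}
```

Swapping the two stream positions makes `covers` return `false` even though the geometry is unchanged, which is the ordering requirement described above.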
**Output:** `RedactionEvent { event_type: RedactionType::CoveringRectangle, covering_element: Some(CoveringElement::Rectangle), ... }`.
---
## 5. Detecting Improper Redaction via Image Overlay
A raster image XObject (placed with the `Do` operator) can serve as a covering black patch. This is common when screen-captured redaction tools export to PDF.
**Detection algorithm:**
1. During content stream processing, when `Do` is encountered with an XObject of `/Subtype /Image`, record the image's position and dimensions in page space (derived from the current transformation matrix at the time of `Do`).
2. Decode the image into grayscale (or convert from its native colorspace). Compute the mean pixel luminance.
3. A covering image candidate satisfies:
- Mean luminance < 30/255 (approximately 12% brightness).
- Rendered area > 100 square points (same threshold as rectangles).
4. Apply the same IoU > 0.5 overlap test against text spans, with the same stream-position ordering requirement (image rendered after text).
For inline images (`BI`/`EI`), apply identical criteria.
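The image criteria can be sketched as below, assuming the image has already been decoded to 8-bit grayscale samples. The function names are hypothetical. The area calculation relies on the fact that an image XObject is painted into the unit square of user space, so its rendered area is the absolute determinant of the CTM in effect at `Do`.

```rust
/// Mean of 8-bit grayscale samples on the 0..=255 scale, or `None` for an empty buffer.
pub fn mean_luminance(gray: &[u8]) -> Option<f64> {
    if gray.is_empty() {
        return None;
    }
    let sum: u64 = gray.iter().map(|&p| u64::from(p)).sum();
    Some(sum as f64 / gray.len() as f64)
}

/// Rendered area in square points of an image placed with CTM `[a b c d e f]`.
/// The image fills the unit square, so the area is |a*d - b*c|.
pub fn image_area_pt2(ctm: [f64; 6]) -> f64 {
    let [a, b, c, d, _, _] = ctm;
    (a * d - b * c).abs()
}

/// Mean luminance below 30/255 and rendered area above 100 square points.
pub fn is_covering_image_candidate(gray: &[u8], ctm: [f64; 6]) -> bool {
    match mean_luminance(gray) {
        Some(mean) => mean < 30.0 && image_area_pt2(ctm) > 100.0,
        None => false,
    }
}
```

A candidate that passes this filter still has to pass the IoU and stream-position checks before an event is emitted.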
**Output:** `covering_element: CoveringElement::Image`.
---
## 6. Layer-Based Redaction
Optional Content Groups (OCGs, PDF 1.5+) can implement redaction by placing covering graphics on a visible layer above a text layer. The default OCG configuration (`/D` dictionary in the `/OCProperties` dictionary) controls which layers are visible on open.
**Detection algorithm:**
1. Parse the `OCProperties` dictionary from the document catalog.
2. Enumerate all OCGs and their default visibility, determined by the `/D` dictionary's `/BaseState` entry (default `/ON`) as adjusted by its `/ON` and `/OFF` arrays.
3. For each content stream element, note its associated OCG (from enclosing `BDC` marked-content sequences with `/OC` property or from `/OC` entries on XObjects).
4. Identify OCGs that consist entirely of near-black filled rectangles or dark images (using the criteria from Sections 4 and 5). Call these "redaction layers."
5. Identify OCGs that contain text spans at the same page positions. Call these "content layers."
6. If a redaction layer is in the default-on set and a content layer at the same position is in the default-on set (both visible simultaneously), the text is covered but present.
Note that text on any layer — regardless of its visibility in the default configuration — is present in the content stream and extractable. The layer's visibility state is a rendering hint, not a data presence indicator.
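Resolving default visibility and applying the check in step 6 might look like the sketch below. The types are hypothetical, OCGs are identified by object number for simplicity, and the `/Unchanged` base state is not modeled. Applying the `/ON` array before the `/OFF` array is an assumption for the case where a producer lists a group in both.

```rust
use std::collections::HashSet;

/// `/BaseState` of the default configuration (`/Unchanged` is not modeled here).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BaseState {
    On,
    Off,
}

/// Simplified view of the `/D` configuration dictionary. Hypothetical type.
pub struct DefaultConfig {
    pub base_state: BaseState,
    pub on: Vec<u32>,  // object numbers listed in /ON
    pub off: Vec<u32>, // object numbers listed in /OFF
}

/// Returns the set of OCGs visible when the document is opened.
pub fn default_visible(all_ocgs: &[u32], cfg: &DefaultConfig) -> HashSet<u32> {
    let mut visible: HashSet<u32> = match cfg.base_state {
        BaseState::On => all_ocgs.iter().copied().collect(),
        BaseState::Off => HashSet::new(),
    };
    // Assumption: /ON is applied first, then /OFF wins on conflict.
    visible.extend(cfg.on.iter().copied());
    for id in &cfg.off {
        visible.remove(id);
    }
    visible
}

/// Step 6: both the redaction layer and the content layer are visible by default.
pub fn text_covered_but_present(
    redaction_layer: u32,
    content_layer: u32,
    visible: &HashSet<u32>,
) -> bool {
    visible.contains(&redaction_layer) && visible.contains(&content_layer)
}
```

Extraction itself does not consult this set; it is only used to classify the event.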
**Output:** `covering_element: CoveringElement::Layer`, plus the OCG name in the event metadata.
---
## 7. Text Under Transparency
A translucent dark rectangle (fill color near black, but painted into an ExtGState with `ca` < 1.0, or using blend mode `Multiply`) obscures text visually but does not remove it from the content stream.
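Detection follows the same bounding-box overlap logic as Section 4, with the additional criterion that the ExtGState's `ca` (non-stroking alpha) is less than 1.0. The luminance threshold is relaxed relative to Section 4's 0.05 because a translucent fill never reaches full black: composited over a white background, the overlay's effective luminance is alpha × fill_luminance + (1 − alpha), so a 50% opaque black rectangle lands at ~0.5. Apply a threshold of effective luminance < 0.3, which for a pure black fill corresponds to alpha > 0.7; lighter overlays may still be intended as concealment, so treat the threshold as tunable.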
The text is fully extractable regardless. Emit the event with `event_type: RedactionType::TransparentOverlay`.
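The effective-luminance check can be sketched as follows. This sketch assumes that "effective luminance" means the source-over composite of the fill against a white backdrop, `alpha * fill_luminance + (1 - alpha) * backdrop`, which reproduces the ~0.5 figure for 50% opaque black. The function names are hypothetical, and the `Multiply` blend mode is not modeled.

```rust
/// Luminance of a fill composited over a backdrop with normal (source-over) blending.
/// All luminance values are on a 0.0..=1.0 scale.
pub fn effective_luminance(fill_luminance: f64, alpha: f64, backdrop_luminance: f64) -> f64 {
    let a = alpha.clamp(0.0, 1.0);
    a * fill_luminance + (1.0 - a) * backdrop_luminance
}

/// Translucent overlay (`ca` < 1.0) whose composite over white is darker than 0.3.
/// Fully opaque fills are left to the rectangle test in Section 4.
pub fn is_transparent_overlay_candidate(fill_luminance: f64, alpha: f64) -> bool {
    alpha < 1.0 && effective_luminance(fill_luminance, alpha, 1.0) < 0.3
}
```

Under these assumptions an 80% opaque black fill qualifies, while a 50% opaque one does not, so the threshold may need tuning against real documents.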
---
## 8. Color-Match Concealment in Redaction Context
White text on a white background (or any text whose fill color matches the page background) is covered in the invisible-text document. In a redaction context it takes on additional significance: when such text lies within or next to the region of an unapplied `/Redact` annotation, or sits on a filled rectangle of the same color, the combination is a deliberate concealment pattern rather than an incidental rendering artifact.
Detect this by noting the position of white-on-white spans and correlating against: (a) nearby unapplied `/Redact` annotations, and (b) same-color background rectangles drawn at the same position. When the correlation fires, emit `event_type: RedactionType::ColorMatchConcealment` in addition to the standard invisible-text warning.
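One way to express the correlation is sketched below. The names, the 0.02 color tolerance, and the 12 pt "nearby" margin are all hypothetical values chosen for illustration; the text above does not fix them.

```rust
/// Axis-aligned rectangle in page space (points). Hypothetical type.
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x_min: f64,
    pub y_min: f64,
    pub x_max: f64,
    pub y_max: f64,
}

impl Rect {
    fn intersects(&self, o: &Rect) -> bool {
        self.x_min < o.x_max && o.x_min < self.x_max
            && self.y_min < o.y_max && o.y_min < self.y_max
    }
    fn expanded(&self, m: f64) -> Rect {
        Rect { x_min: self.x_min - m, y_min: self.y_min - m,
               x_max: self.x_max + m, y_max: self.y_max + m }
    }
}

pub type Rgb = [f64; 3];

const COLOR_TOLERANCE: f64 = 0.02;  // assumed per-component tolerance
const NEARBY_MARGIN_PT: f64 = 12.0; // assumed meaning of "nearby"

pub fn colors_match(a: Rgb, b: Rgb) -> bool {
    a.iter().zip(b.iter()).all(|(x, y)| (x - y).abs() <= COLOR_TOLERANCE)
}

/// True when a span painted in the background color correlates with
/// (a) a nearby unapplied /Redact box or (b) a same-color rectangle at its position.
pub fn is_color_match_concealment(
    span_bbox: Rect,
    span_fill: Rgb,
    background: Rgb,
    redact_boxes: &[Rect],
    filled_rects: &[(Rect, Rgb)],
) -> bool {
    if !colors_match(span_fill, background) {
        return false;
    }
    let near_annotation = redact_boxes
        .iter()
        .any(|r| r.expanded(NEARBY_MARGIN_PT).intersects(&span_bbox));
    let on_matching_rect = filled_rects
        .iter()
        .any(|(r, c)| colors_match(*c, span_fill) && r.intersects(&span_bbox));
    near_annotation || on_matching_rect
}
```

A span that matches the background but correlates with neither signal still receives only the standard invisible-text warning.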
---
## 9. Properly Applied Redaction: What Remains
When redaction is correctly applied, the authoring tool modifies the content stream: text operators in the covered region are deleted, and a filled rectangle (in the redaction color) is inserted in their place. The `/Redact` annotation is consumed and removed from `Annots`. There is no annotation trail remaining in the live document.
Evidence of past redaction may appear in:
- **XMP metadata**: The `xmpMM:History` array may contain `stEvt:action = "saved"` entries whose `stEvt:softwareAgent` names software like "Acrobat Redact," or a `pdfx:Marked` field indicating the document was reviewed.
- **Content stream gaps**: Regions of the page that contain only filled black rectangles with no surrounding text activity, especially when the surrounding text flow suggests missing words.
- **Structural gaps in tagged PDFs**: `/Artifact` tagged elements covering regions with no associated `ActualText` where surrounding structure implies content should be present.
`pdftract` cannot recover properly applied redaction — the data is gone. The extractor will encounter the black fill rectangle (a graphics element, not a covering graphic over text), produce no text spans for that region, and may optionally note the apparent gap as `event_type: RedactionType::AppliedRedaction` when heuristics are confident.
---
## 10. Output and Policy
All redaction events are gathered into a per-page `redaction_events: Vec<RedactionEvent>` field, always populated regardless of `include_redacted_content`.
```rust
pub struct RedactionEvent {
pub event_type: RedactionType,
pub bbox: Rect,
pub covering_element: Option<CoveringElement>,
pub recovered_text: Option<String>,
pub redaction_warning: bool,
pub annotation_ref: Option<ObjRef>,
}
pub enum RedactionType {
UnappliedAnnotation,
CoveringRectangle,
CoveringImage,
LayerBased,
TransparentOverlay,
ColorMatchConcealment,
AppliedRedaction,
}
pub enum CoveringElement {
Rectangle,
Image,
Layer,
}
```
Text spans recovered from improper redaction carry:
- `zone: "redacted_content"` for unapplied `/Redact` annotations.
- `zone: "covered_content"` for rectangle, image, or layer-based improper redaction.
- `redaction_warning: true` on the span.
When `include_redacted_content: false`, these spans are omitted from the text output but their `RedactionEvent` entries remain. This allows callers (e.g., a compliance tool) to detect and report improper redaction without inadvertently re-publishing the content.
The default is `include_redacted_content: true`: `pdftract`'s goal is maximum text recovery, and suppression is an explicit caller decision.
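The suppression policy can be illustrated with a small sketch. The `Span` and `Options` types and the `apply_policy` function are hypothetical stand-ins for whatever `pdftract` uses internally; only the field names `zone`, `redaction_warning`, and `include_redacted_content` come from the text above.

```rust
/// Simplified text span carrying the redaction labels described above.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub zone: Option<String>,
    pub redaction_warning: bool,
}

/// Caller-facing options; recovery is on by default.
pub struct Options {
    pub include_redacted_content: bool,
}

impl Default for Options {
    fn default() -> Self {
        Options { include_redacted_content: true }
    }
}

/// Drops flagged spans from the text output when the caller opts out.
/// Redaction events are stored separately and are never filtered here.
pub fn apply_policy(spans: Vec<Span>, opts: &Options) -> Vec<Span> {
    if opts.include_redacted_content {
        return spans;
    }
    spans.into_iter().filter(|s| !s.redaction_warning).collect()
}
```

Because filtering touches only the span list, a compliance tool can set `include_redacted_content: false` and still inspect every `RedactionEvent`.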