From 4e72c66763a15c09ddc3c6edb85663554cbdd597 Mon Sep 17 00:00:00 2001 From: jedarden Date: Sat, 16 May 2026 16:18:03 -0400 Subject: [PATCH] Add research: Indic scripts, adversarial parser security Two new research documents covering Indic script extraction (abugida structure, ToUnicode CMap failures for shaped glyphs, ActualText fast-path, GSUB lookup reversal, pre-base matra reordering, virama placement, Tesseract fallback with script-specific models) and adversarial input handling (decompression bombs, circular references, malformed stream lengths, path traversal in attachments, content stream loop detection, O(n log n) algorithm requirements, output sanitization). Co-Authored-By: Claude Sonnet 4.6 --- .../adversarial-inputs-and-parser-security.md | 67 ++++++++++++++ docs/research/indic-script-extraction.md | 91 +++++++++++++++++++ 2 files changed, 158 insertions(+) create mode 100644 docs/research/adversarial-inputs-and-parser-security.md create mode 100644 docs/research/indic-script-extraction.md diff --git a/docs/research/adversarial-inputs-and-parser-security.md b/docs/research/adversarial-inputs-and-parser-security.md new file mode 100644 index 0000000..fa330d5 --- /dev/null +++ b/docs/research/adversarial-inputs-and-parser-security.md @@ -0,0 +1,67 @@ +# Adversarial PDF Inputs, Resource Exhaustion, and Parser Security + +PDF is a rich, decades-old format designed for faithful document reproduction. That richness makes it an attractive attack surface: a single file can encode compressed streams, cross-referenced object graphs, recursive content descriptions, and embedded attachments — all with a parser that must handle malformed or hostile inputs without crashing, hanging, or consuming unbounded resources. pdftract is designed to run in production environments where the input is fully untrusted. This document catalogs the concrete threats and the specific defensive techniques required to handle them safely. 
+ +## Decompression Bombs + +FlateDecode (zlib) compressed streams are the most common compression method in PDF. A decompression bomb is a stream where a small compressed payload — sometimes under 1 KB — expands to an enormous output, often multiple gigabytes. The PDF specification places no upper bound on the decompression ratio, so a naive implementation that decompresses the full stream into memory before inspecting it will exhaust available RAM. + +pdftract must enforce limits during incremental decompression. The correct approach is a streaming decompress loop that writes into a fixed-size output buffer, tracking total bytes emitted. If the decompressed size exceeds an absolute ceiling — 512 MB per stream is a reasonable production default — the stream is truncated and an error is recorded on the page or object being parsed. A ratio check provides an earlier warning: if the decompressed output reaches 1,000 times the compressed size before hitting the absolute cap, that stream is suspicious and can be flagged for logging even if it is still within the absolute limit. The key invariant is that the limit is enforced incrementally, not retroactively; pdftract must never allocate a buffer sized to the expected decompressed length before any data has been read. + +## Deeply Nested Object Structures + +PDF dictionaries and arrays can nest arbitrarily. A crafted file can encode an array of dictionaries of arrays thousands of levels deep. A recursive parser descending into this structure will overflow the thread stack before reaching the leaf values. The same hazard appears in object reference chains: object 1 references object 2, which references object 3, continuing to object N, where N can be crafted to exceed the stack depth available. + +pdftract must parse object structures iteratively, maintaining an explicit stack on the heap rather than using the call stack. 
A maximum nesting depth of 512 levels is enforced at parse time; attempting to descend past this limit causes the current container to be returned as-is with its remaining children omitted. This depth limit applies uniformly to dictionaries, arrays, and the implicit nesting created by following indirect object references during value resolution. + +## Circular Object References + +Circular references — where object A references B, B references C, and C references A — create infinite loops during resolution if not detected. PDF's cross-reference mechanism makes these straightforward to construct: any object can reference any other by number, and the specification does not prohibit cycles. + +Detection requires a thread-local resolution stack implemented as a `HashSet` of object numbers currently being resolved. Before following any indirect reference, pdftract inserts the target object number into the set. If the insert fails because the number is already present, a cycle has been detected; the lookup returns a null value immediately and the cycle is broken without recursing. The object number is removed from the set when resolution returns — this is a depth-first visited set, not a permanent memoization table, so different call paths can legitimately visit the same object independently. + +## Enormous Object Counts + +A PDF trailer's `/Size` entry declares the number of objects in the cross-reference table. A hostile file can set this to an extreme value such as 10,000,000, hoping the parser allocates a dense array of that size at startup. At even 8 bytes per slot, that is 80 MB of zero-initialized memory for a file that may contain only a dozen real objects. + +pdftract uses a lazy, sparse object table. During startup, xref entries are recorded in a `HashMap` mapping object number to byte offset — a structure that grows proportionally to actual entries, not to the declared `/Size`. Objects are not loaded or deserialized until they are explicitly requested. 
An LRU object cache bounds the number of simultaneously resident deserialized objects; objects evicted from the cache are re-parsed from their byte offsets on next access. This architecture means that a file with a fraudulent `/Size` of ten million but only a hundred real objects costs only the memory for those hundred cache entries. + +## Malformed Stream Lengths + +A PDF stream dictionary must contain a `/Length` key indicating the number of bytes before the `endstream` marker. Two failure modes exist. First, a `/Length` that is smaller than the actual stream content — perhaps `/Length 0` — leaves a parser that trusts the declared length stopping before the real data ends, potentially treating stream body bytes as top-level syntax and generating confusing parse errors downstream. Second, a `/Length` that is larger than the actual file — perhaps `/Length 100000000` pointing past end-of-file — causes a parser that naively allocates or reads that many bytes to either crash or consume excessive I/O. + +pdftract validates stream lengths by never allocating more than is available from the current file position to EOF. When a declared length would overrun the file, pdftract clamps the read to the remaining bytes and searches forward for the `endstream` token to determine the actual boundary. For under-declared lengths, pdftract scans ahead for `endstream` as well, reading up to a configurable scan limit beyond the declared end. The correct stream boundary is always determined by physical token search, with the `/Length` value treated as a hint for buffer sizing only — never as an authoritative allocation size. + +## Path Traversal in Embedded Filenames + +PDF supports embedded file attachments via file specification dictionaries. These dictionaries include a `/F` (filename) entry that may contain arbitrary string data, including path traversal sequences such as `../../etc/passwd` or absolute paths like `/etc/shadow`. 
A consumer that writes attachment metadata or extracted files using the raw embedded filename without sanitization creates an arbitrary file write primitive. + +pdftract sanitizes all embedded filenames at the point of extraction. Only the final path component is retained, equivalent to `Path::file_name()` in Rust. Because `Path::file_name()` splits only on the host platform's separators, the retained component is then checked again: if it still contains a forward slash, a backslash, or a null byte, the filename is rejected outright rather than having the offending characters stripped. Filenames that resolve to an empty string after sanitization are replaced with a generated placeholder. This sanitization is applied uniformly to both the `/F` and `/UF` (Unicode filename) fields; the Unicode form must be decoded before the separator check is applied. + +## Content Stream Infinite Loops via Form XObjects + +Page content streams can invoke reusable Form XObjects with the `Do` operator. A Form XObject is itself a content stream, and it can invoke other Form XObjects, including itself. A cycle in this graph causes a recursive descent that terminates only when the stack overflows. + +pdftract tracks Form XObject invocations per page render using two complementary mechanisms. A `HashSet` of currently active Form XObject object numbers detects direct and indirect cycles: before invoking a Form XObject's content stream, its object number is inserted; if already present, the invocation is skipped. A separate hard counter on total `Do` operator invocations per page is enforced at 10,000 invocations; beyond this threshold, all further `Do` calls on the page are no-ops. The invocation count is not reset when descending into a Form XObject; it is a global budget for the entire page render. + +## Integer Overflow in Coordinate Calculations + +PDF numeric objects are written as decimal real numbers, and the file syntax itself places no bound on their magnitude. A page may declare a coordinate transformation matrix with values on the order of 1e30, or glyph positions may be expressed as extremely large or small numbers. 
When these values are composed through a chain of transformations and ultimately converted to output coordinates, naive `f32` arithmetic silently overflows to infinity, and even `f64` can produce infinities or NaN values that propagate through a rendering pipeline. + +pdftract uses `f64` throughout all coordinate calculations. After each transformation step, any coordinate that is infinite or NaN is replaced with zero, and the remaining finite values are clamped to a bounded range before they are used in further arithmetic. At the point where coordinates are mapped to output page dimensions, values are clamped to the page bounding box. This ensures that no coordinate value can exceed representable range at any stage, and that extreme input values produce bounded, predictable output rather than silent overflow, NaN propagation, or out-of-range integer conversion. + +## Time-Based Denial of Service + +Quadratic or worse algorithms are a common source of parser DoS. A concrete example is span merging during text extraction: if each character operator on a page produces an individual span, and the merging pass compares each span against all previous spans to find candidates for concatenation, a page with 100,000 character operators requires roughly 5 billion comparisons (n(n - 1)/2 for n = 100,000). Processing time becomes catastrophic on crafted inputs while remaining imperceptible on typical documents. + +pdftract requires that all hot-path algorithms on per-page data run in O(n log n) or better. Span merging uses a sort-then-linear-scan approach: spans are sorted by position once in O(n log n), then a single left-to-right pass merges adjacent spans in O(n) time. Cross-reference table parsing, object graph traversal, and font encoding resolution are each analyzed for worst-case complexity before inclusion. Where complexity cannot be bounded analytically, per-operation counters with configurable ceilings are used to enforce a total work budget. 
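The sort-then-scan merge described above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not pdftract's implementation: the `Span` type, its field names, and the `max_gap` and `baseline_tol` parameters are hypothetical, and baselines are assumed to be already quantized so that spans on one line sort next to each other.

```rust
/// A positioned run of text (hypothetical type, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub baseline_y: f64,
    pub x_start: f64,
    pub x_end: f64,
    pub text: String,
}

/// Merge horizontally adjacent spans that share a baseline.
///
/// Cost is one O(n log n) sort plus one O(n) scan: each span is compared
/// only with the most recently emitted span, never with all earlier ones.
pub fn merge_spans(mut spans: Vec<Span>, max_gap: f64, baseline_tol: f64) -> Vec<Span> {
    // Discard non-finite geometry so hostile values cannot poison the scan.
    spans.retain(|s| {
        s.baseline_y.is_finite() && s.x_start.is_finite() && s.x_end.is_finite()
    });

    // Sort once by (baseline, left edge). `total_cmp` is a total order on f64.
    spans.sort_by(|a, b| {
        a.baseline_y
            .total_cmp(&b.baseline_y)
            .then(a.x_start.total_cmp(&b.x_start))
    });

    let mut merged: Vec<Span> = Vec::with_capacity(spans.len());
    for span in spans {
        if let Some(last) = merged.last_mut() {
            let same_line = (span.baseline_y - last.baseline_y).abs() <= baseline_tol;
            let gap = span.x_start - last.x_end;
            if same_line && gap <= max_gap {
                // Adjacent (or overlapping) on the same line: extend the previous span.
                last.text.push_str(&span.text);
                last.x_end = last.x_end.max(span.x_end);
                continue;
            }
        }
        merged.push(span);
    }
    merged
}
```

The result is ordered by ascending baseline, which is not reading order in PDF user space; arranging lines for output is left to a later pass in this sketch.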
+ +## Output Sanitization + +Text extracted from PDF content streams may contain null bytes, control characters in the C0 and C1 ranges, bidirectional override sequences, or other byte sequences that are benign in PDF context but harmful when passed to downstream consumers. JSON parsers, databases, and logging systems all have different sensitivities; a null byte that terminates a C string in one layer may produce a truncated record that corrupts a lookup in another. + +pdftract treats itself as the last trust boundary before extracted text enters a system. The output post-processing pipeline strips null bytes unconditionally, replaces control characters outside the normal whitespace set (tab, newline, carriage return) with the Unicode replacement character U+FFFD, and normalizes line endings. Unicode bidirectional override characters are logged and stripped by default, with an option to preserve them for applications that handle them explicitly. This is not optional behavior applied only when a caller opts in — it is applied to all text output because the cost of missing a hostile sequence in an edge case is higher than the marginal cost of cleaning safe inputs. + +## Summary + +Safe PDF parsing in a production environment is not primarily a correctness problem — it is a resource and trust boundary problem. The ten threat categories above each have a specific, bounded mitigation: decompression limits enforced during streaming, iterative parsing with depth caps, cycle detection via hash sets, sparse lazy object tables, physical token search for stream boundaries, filename sanitization at extraction time, invocation budgets for content stream recursion, f64 arithmetic with finite clamping, O(n log n) algorithm requirements on hot paths, and unconditional output sanitization. 
pdftract implements all of these as non-optional defaults, ensuring that untrusted PDF input cannot be used to exhaust memory, saturate CPU, traverse the filesystem, or inject hostile byte sequences into downstream systems. diff --git a/docs/research/indic-script-extraction.md b/docs/research/indic-script-extraction.md new file mode 100644 index 0000000..ff4620c --- /dev/null +++ b/docs/research/indic-script-extraction.md @@ -0,0 +1,91 @@ +# Indic Script PDF Extraction: Devanagari, Tamil, Telugu, Bengali, and Related Scripts + +## 1. Indic Script Complexity + +Indic scripts — Devanagari, Tamil, Telugu, Bengali, Kannada, Malayalam, Gujarati, and Gurmukhi — are abugidas, a writing system family distinct from alphabets and syllabaries. In an abugida, each base character represents a consonant with an implicit inherent vowel (typically "a" in most Indic scripts). Explicit vowel sounds are written as combining marks called vowel signs, or matras, that attach to the base consonant. + +This structure creates the first layer of extraction difficulty: a single visual unit on the page may correspond to multiple Unicode code points, and the visual arrangement of those code points does not follow a simple left-to-right spatial order. Matras can appear to the left, right, above, or below the base consonant, and some vowels are split, appearing on both sides simultaneously (e.g., certain Malayalam and Tamil vowels). A naïve extraction that maps glyph positions directly to reading order will produce character sequences in the wrong order. + +Conjunct consonants compound the problem significantly. When two or more consonants appear in sequence without an intervening vowel, the shaping engine merges them into a ligated or stacked conjunct form. A conjunct may be rendered as a single glyph by the font or as a tightly composed cluster of sub-glyphs. 
The Unicode logical encoding of a conjunct is always the sequence of base consonant code points separated by the virama (halant), but the visual glyph or glyph cluster has no visible halant. Extracting such sequences correctly requires knowledge of which glyph or glyph combination corresponds to which phoneme sequence. + +Compared to Latin scripts, where glyph-to-character correspondence is nearly one-to-one, Indic extraction is fundamentally a shaping reversal problem. Compared to CJK, where large glyph inventories are used but each glyph maps to a single Unicode scalar, Indic presents many-to-one and one-to-many mappings at the glyph level. + +## 2. Unicode Encoding of Indic Scripts + +The Unicode Standard assigns dedicated blocks to each major Indic script: Devanagari (U+0900–U+097F), Bengali (U+0980–U+09FF), Gurmukhi (U+0A00–U+0A7F), Gujarati (U+0A80–U+0AFF), Tamil (U+0B80–U+0BFF), Telugu (U+0C00–U+0C7F), Kannada (U+0C80–U+0CFF), and Malayalam (U+0D00–U+0D7F). + +A critical property of Unicode Indic encoding is that the logical order is always phonemic order, not visual order. The Unicode code point sequence for a syllable is: base consonant(s) joined by viramas, followed by any vowel sign, followed by any anusvara or visarga. This ordering is independent of where the matra visually appears on the page. The "i" vowel sign in Devanagari (U+093F) appears to the left of its base consonant visually, but in Unicode it follows the consonant. The Tamil "e" vowel sign (U+0BC6) behaves identically; the Tamil "i" vowel sign (U+0BBF), by contrast, is drawn to the right of its consonant and needs no reordering. + +pdftract must produce output that conforms to this phonemic ordering contract. Any extraction path that assembles text by reading glyph positions from left to right on the page will violate this contract for pre-base matras and split vowels. + +## 3. 
PDF Generator Behavior for Indic Text + +When a PDF is generated from an authoring application with proper Indic language support — such as InDesign with an Indic language pack, Word with Devanagari or Tamil fonts, or a properly configured web-to-PDF renderer — the application invokes an OpenType shaping engine (typically HarfBuzz or the platform's native shaper) before embedding text into the PDF. + +The shaping engine applies GSUB (glyph substitution) and GPOS (glyph positioning) rules from the font's OpenType tables to transform the logical Unicode sequence into a visual glyph sequence suitable for rendering. The output of shaping is a sequence of glyph IDs in the font's internal numbering space. These glyph IDs, not Unicode code points, are what the PDF content stream encodes in its text-showing operators (Tj, TJ, etc.). + +To allow text extraction, the PDF should include a ToUnicode CMap for each embedded font. This CMap maps glyph IDs back to Unicode character sequences. The failure modes are numerous and common: + +- The ToUnicode CMap may be absent entirely. Applications that treat Indic text as purely graphical, or that use embedded bitmaps, produce PDFs with no extraction path. +- The CMap maps each glyph to a single Unicode scalar, but a conjunct glyph represents a multi-codepoint sequence. The extractor then produces a single codepoint where a sequence is required. +- The CMap maps shaped glyphs to the pre-shaping Unicode sequence but records only the first consonant of a conjunct, silently discarding the virama and subsequent consonants. +- Multiple glyphs collectively represent one phoneme sequence (e.g., a split vowel rendered as two separate glyphs for the left and right halves), but each glyph's CMap entry maps to a partial or empty sequence, producing duplicated or missing vowels. +- The CMap entries are in visual order rather than phonemic order, reflecting the order glyphs appear on the rendered line rather than the logical Unicode order. 
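To make the one-glyph-to-many-code-points case concrete, the sketch below decodes the destination of a single ToUnicode mapping, which the CMap stores as a hex string of UTF-16BE code units. The function name `decode_tounicode_dst` is hypothetical and the sketch covers only the destination string, not CMap parsing as a whole.

```rust
/// Decode one ToUnicode destination hex string (UTF-16BE code units,
/// four hex digits each) into a Rust `String`.
///
/// Returns `None` for non-hex input, a digit count that is not a
/// multiple of four, or unpaired surrogates.
pub fn decode_tounicode_dst(hex: &str) -> Option<String> {
    // Whitespace inside the hex string is ignored; anything else must be a hex digit.
    let digits: Vec<u32> = hex
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_digit(16))
        .collect::<Option<Vec<u32>>>()?;

    if digits.is_empty() || digits.len() % 4 != 0 {
        return None;
    }

    // Assemble each group of four hex digits into one 16-bit code unit.
    let units: Vec<u16> = digits
        .chunks(4)
        .map(|d| ((d[0] << 12) | (d[1] << 8) | (d[2] << 4) | d[3]) as u16)
        .collect();

    String::from_utf16(&units).ok()
}
```

A conjunct glyph mapped correctly decodes to several scalars: `0915094D0937` yields ka, virama, ssa. A destination that decodes to a single consonant for a glyph known to be a ligature is one sign of the truncated-mapping failure modes listed above.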
+ +## 4. ToUnicode CMap Problems with Indic Scripts + +The PDF specification allows ToUnicode CMap entries to map a single glyph to a sequence of Unicode code points: a `bfchar` entry may give a multi-code-unit UTF-16BE hex string as its destination, and a `bfrange` entry may list per-code destinations in a bracketed array. This supports the one-to-many case needed for conjuncts. However, the many-to-one case, where a cluster of glyphs collectively encodes one phoneme sequence, is not directly representable; the CMap format is per-glyph. + +In practice, many PDF generators produce ToUnicode CMaps using tools that were designed for Latin or CJK scripts. These tools start from the font's `cmap` table, which maps Unicode code points to glyph IDs, not the reverse. The reversal is often naively implemented: for each Unicode code point in the font's `cmap`, record a CMap entry from the default glyph for that code point to the code point. Shaped glyphs (conjuncts, half-forms, vowel sign glyphs in alternate positions) are not reachable from the Unicode `cmap` and receive no ToUnicode entry. + +pdftract must detect the incomplete-CMap condition for Indic fonts. Indicators include: a font with a large glyph count but ToUnicode entries covering only the base consonant range, ToUnicode entries that produce only base consonants with no vowel sign coverage, or extracted text that contains no virama characters despite the document clearly containing conjunct-heavy text (as detectable from glyph count per word-spacing unit). + +## 5. ActualText as the Reliable Extraction Path + +For PDFs produced with proper accessibility tagging (typically InDesign with tagged PDF export, or XSL-FO renderers with accessibility output), the correct extraction path is ActualText. The PDF specification allows content to carry an ActualText attribute on marked-content sequences (a `BDC`/`EMC` pair tagged `Span`, with ActualText in its property list) or on structure elements in the document's logical structure tree. 
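An ActualText value is a PDF text string, so it has to be decoded before use. The sketch below is a hedged illustration with a hypothetical function name: it handles the UTF-16BE form (byte order mark `FE FF`) and the UTF-8 form introduced in PDF 2.0 (byte order mark `EF BB BF`), and for strings without a byte order mark it accepts only the printable-ASCII subset of PDFDocEncoding. A full PDFDocEncoding table is deliberately out of scope.

```rust
/// Decode a PDF text string such as an ActualText value.
///
/// Assumption for this sketch: strings without a byte order mark are
/// accepted only if they are printable ASCII (plus tab, LF, CR), where
/// PDFDocEncoding and ASCII agree. Anything else returns `None`.
pub fn decode_pdf_text_string(bytes: &[u8]) -> Option<String> {
    // UTF-16BE with a leading FE FF byte order mark.
    if bytes.starts_with(&[0xFE, 0xFF]) {
        let rest = &bytes[2..];
        if rest.len() % 2 != 0 {
            return None; // truncated code unit
        }
        let units: Vec<u16> = rest
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        return String::from_utf16(&units).ok();
    }

    // UTF-8 with a leading EF BB BF byte order mark (PDF 2.0).
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return std::str::from_utf8(&bytes[3..]).ok().map(str::to_owned);
    }

    // No byte order mark: accept only the range shared with ASCII.
    let shared_with_ascii = bytes
        .iter()
        .all(|&b| matches!(b, 0x09 | 0x0A | 0x0D | 0x20..=0x7E));
    if shared_with_ascii {
        return Some(bytes.iter().map(|&b| b as char).collect());
    }
    None
}
```

For Indic content the UTF-16BE branch is the one that matters in practice, since Indic code points cannot be expressed in PDFDocEncoding.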
+ +ActualText provides the Unicode string that the marked content visually represents, encoded by the document author or by the application's accessibility layer, independent of glyph encoding. For Indic text, a properly set ActualText value gives the phonemically ordered Unicode sequence for the entire sequence of shaped glyphs — including conjuncts, half-forms, and reordered matras — as a single string. + +pdftract's extraction pipeline should check for ActualText on every span of content before attempting glyph-level CMap lookup. If ActualText is present and non-empty, it should be used directly as the canonical text for that content unit. The glyph-level extraction pipeline is then a fallback for untagged PDFs. + +## 6. OpenType GSUB Lookup for Shaping Reversal + +When neither ActualText nor a complete ToUnicode CMap is available, pdftract can attempt to reverse-engineer the glyph-to-Unicode mapping using the font's embedded OpenType GSUB tables. + +The Indic shaping specification defines a mandatory sequence of GSUB feature lookups applied in order. For most Indic scripts (using the USE — Universal Shaping Engine — or the older Indic v2 specification), the ordered feature sequence includes: `akhn` (akhand, pre-shaping conjuncts), `rphf` (reph form), `blwf` (below-base form), `half` (half form), `pstf` (post-base form), `vatu` (vattu variant), `cjct` (conjunct form), followed by presentational features `pres`, `abvs`, `blws`, `psts`, and `haln`. Each GSUB lookup is a substitution table mapping input glyph sequences to output glyph sequences or single glyphs. + +To reverse a GSUB substitution, pdftract can build an inverted lookup index: for each single-substitution or ligature-substitution table, record the mapping from output glyph ID to input glyph ID sequence. Given a shaped glyph from the PDF content stream, the inverted index can recover the pre-shaping glyph sequence, which can then be mapped to Unicode via the font's standard cmap table. 
Multi-step shaping chains require following the inversion through multiple lookup levels. + +This approach is computationally feasible for single-substitution and ligature-substitution (LookupType 1 and 4) tables. Contextual substitutions (LookupType 5 and 6) are harder to invert and may require heuristics or exhaustive candidate search. + +## 7. Font-Specific Glyph Naming for Indic Fonts + +Older Indic fonts, particularly those designed for legacy encoding systems before Unicode became prevalent, may use custom PostScript glyph naming conventions. Common patterns include glyph names like `ka`, `kha`, `ga`, encoding phoneme names directly; `uni0915`, `uni0916`, encoding Unicode code points in the standard `uni` prefix format; or purely numeric names like `glyph0032`. + +The Adobe Glyph List (AGL) covers Latin, Greek, Cyrillic, and some other scripts but does not include Indic phoneme names. The `uni` prefix convention is defined in the AGL specification and maps directly to Unicode code points, making it reliable when present. Numeric glyph names convey no semantic information. + +When pdftract encounters unrecognized glyph names in an Indic font with no ToUnicode CMap, it applies the following fallback order: parse `uni`-prefixed names to extract Unicode scalars; attempt lookup in a supplementary Indic glyph name table derived from common font naming conventions; flag glyphs with purely numeric or unrecognized names as unresolvable, emitting placeholder markers in the extraction output rather than silently dropping content or substituting incorrect characters. + +## 8. Vowel Sign Reordering in Extraction + +Pre-base vowel signs present a specific reordering requirement that must be handled explicitly. In Tamil, the vowel signs "e" (U+0BC6), "ee" (U+0BC7), and "ai" (U+0BC8) visually appear to the left of their base consonant. In Devanagari, the "i" vowel sign (U+093F) appears similarly. 
The shaping engine places these glyphs to the left during layout, but Unicode requires them to follow the base consonant in the code point sequence. + +A PDF content stream that records glyphs in visual left-to-right order will therefore encode a pre-base matra before its consonant, violating Unicode phonemic order. pdftract must detect this condition and reorder: after assembling a candidate character sequence from CMap lookup, identify any vowel sign code points that fall in known pre-base ranges for their respective script, and relocate them to their correct position after the base consonant. + +Split vowels — those rendered as two separate glyphs on either side of the consonant — require reassembly: the left glyph and the right glyph must be recognized as halves of a single vowel sign and merged into the correct single Unicode code point (or two-code-point canonical sequence where applicable). + +## 9. Halant (Virama) Handling + +The virama (halant) is the diacritic that suppresses the inherent vowel of a consonant. In Devanagari it is U+094D; each other Indic script has its own corresponding code point. In properly encoded Indic text, the virama appears between consonants of a conjunct cluster, and also appears at the end of a word-final consonant when that consonant has no inherent vowel. + +Several failure modes are common in PDF extraction of viramas. When a conjunct is represented as a ligature glyph, the ToUnicode entry should include the virama between the constituent consonants (e.g., `<0915 094D 0916>` for the ka-halant-kha conjunct), but many generators omit the virama, producing `<0915 0916>` instead. This sequence is phonemically distinct — it implies a vowel between the consonants — and will render differently in word processors, cause incorrect spell-checking, and produce wrong search results. 
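The omitted-virama and misplaced-virama cases can be checked mechanically on the mapping of a single conjunct glyph. The following Rust sketch assumes Devanagari only and treats U+0915 through U+0939 as the consonant range; the enum and function names are hypothetical. It is meaningful only for a destination known to belong to one ligature glyph, because adjacent consonants without a virama are perfectly valid in running text.

```rust
const VIRAMA: char = '\u{094D}';
const NUKTA: char = '\u{093C}';

/// Devanagari base consonants ka (U+0915) through ha (U+0939).
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0915}'..='\u{0939}')
}

#[derive(Debug, PartialEq, Eq)]
pub enum ConjunctCheck {
    Valid,
    /// Two consonants appear back to back, e.g. <0915 0916>.
    MissingVirama,
    /// A virama does not directly follow a consonant (or consonant + nukta).
    MisplacedVirama,
}

/// Validate the decoded ToUnicode destination of ONE conjunct glyph.
pub fn check_conjunct_mapping(dst: &str) -> ConjunctCheck {
    let chars: Vec<char> = dst.chars().collect();

    // Every virama must be preceded by a consonant, optionally with a nukta.
    for (i, &c) in chars.iter().enumerate() {
        if c == VIRAMA {
            let prev_ok = i > 0 && (is_consonant(chars[i - 1]) || chars[i - 1] == NUKTA);
            if !prev_ok {
                return ConjunctCheck::MisplacedVirama;
            }
        }
    }

    // Inside a single ligature, consonant-consonant adjacency means the
    // generator dropped the virama that joins them.
    if chars.windows(2).any(|w| is_consonant(w[0]) && is_consonant(w[1])) {
        return ConjunctCheck::MissingVirama;
    }

    ConjunctCheck::Valid
}
```

Other scripts would need their own consonant ranges and virama code points; this sketch makes no attempt to repair a faulty mapping, only to flag it.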
+ +Conversely, a virama that is encoded in the ToUnicode but placed at the wrong position in the output sequence (e.g., before the first consonant rather than between them) breaks downstream Indic text processing entirely. pdftract's GSUB reversal logic must validate virama placement against the logical structure of each conjunct it decodes. + +## 10. OCR Fallback for Indic Scripts + +When ToUnicode is absent, ActualText is not present, and GSUB reversal fails to produce a coherent Unicode sequence (as indicated by high rates of unresolvable glyph names, invalid code point sequences, or detection of a purely graphical rendering pipeline), pdftract falls back to OCR using Tesseract with script-specific language models. + +Tesseract provides trained models for the major Indic scripts. Devanagari is covered by per-language models such as `hin` (Hindi), `mar` (Marathi), `nep` (Nepali), and `san` (Sanskrit), or by the script-level `script/Devanagari` model; the other scripts use `tam` for Tamil, `tel` for Telugu, `ben` for Bengali, `kan` for Kannada, `mal` for Malayalam, `guj` for Gujarati, and `pan` for Gurmukhi (Punjabi). These models are trained against logically ordered Unicode text and therefore emit output in logical (phonemic) order directly. + +pdftract must render the affected PDF page region to a raster image at sufficient resolution (minimum 300 DPI, preferably 400 DPI for small body text) before passing it to Tesseract. The script of the content must be identified to select the correct model; this can be done heuristically from any glyph name information or from the font name embedded in the PDF. All text extracted via the OCR path is annotated in pdftract's output with a `source: ocr` field and a per-word or per-line confidence score as reported by Tesseract's hOCR output, so consumers can distinguish OCR-derived text from CMap-derived or ActualText-derived text and apply appropriate downstream validation.
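The script-identification step can be sketched from glyph names alone. The Rust example below is a heuristic illustration with hypothetical names: it parses `uniXXXX` glyph names, classifies each code point by the Unicode blocks listed in section 2, and returns the most frequent script. It handles only the four-digit `uni` form and does not consult font names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Devanagari,
    Bengali,
    Gurmukhi,
    Gujarati,
    Tamil,
    Telugu,
    Kannada,
    Malayalam,
}

/// Same order as the enum declaration, so `script as usize` indexes it.
const ALL_SCRIPTS: [Script; 8] = [
    Script::Devanagari,
    Script::Bengali,
    Script::Gurmukhi,
    Script::Gujarati,
    Script::Tamil,
    Script::Telugu,
    Script::Kannada,
    Script::Malayalam,
];

/// Classify a code point by Unicode block.
pub fn script_of(cp: u32) -> Option<Script> {
    match cp {
        0x0900..=0x097F => Some(Script::Devanagari),
        0x0980..=0x09FF => Some(Script::Bengali),
        0x0A00..=0x0A7F => Some(Script::Gurmukhi),
        0x0A80..=0x0AFF => Some(Script::Gujarati),
        0x0B80..=0x0BFF => Some(Script::Tamil),
        0x0C00..=0x0C7F => Some(Script::Telugu),
        0x0C80..=0x0CFF => Some(Script::Kannada),
        0x0D00..=0x0D7F => Some(Script::Malayalam),
        _ => None,
    }
}

/// Parse a glyph name of the form `uni` + exactly four hex digits.
/// Lenient about digit case; longer `uni` sequences are not handled here.
pub fn parse_uni_name(name: &str) -> Option<u32> {
    let hex = name.strip_prefix("uni")?;
    if hex.len() != 4 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u32::from_str_radix(hex, 16).ok()
}

/// Return the script seen most often among resolvable glyph names,
/// or `None` when no name resolves to a known Indic block.
pub fn dominant_script<'a>(glyph_names: impl IntoIterator<Item = &'a str>) -> Option<Script> {
    let mut counts = [0usize; 8];
    for name in glyph_names {
        if let Some(script) = parse_uni_name(name).and_then(script_of) {
            counts[script as usize] += 1;
        }
    }
    let (best, &n) = counts.iter().enumerate().max_by_key(|&(_, &n)| n)?;
    if n == 0 {
        None
    } else {
        Some(ALL_SCRIPTS[best])
    }
}
```

A caller would then map the returned `Script` to whichever Tesseract model it has installed; when the function returns `None`, the font name or a multi-script OCR pass remains as a further fallback.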