Add parallel extraction research and comprehensive research index

New research document covering parallel extraction architecture:
rayon page-level parallelism, Arc<> shared xref/font/object-stream
caches, RwLock font cache design, Tesseract thread-local OCR pool,
semaphore memory budget, ordered NDJSON streaming slot array, and
catch_unwind error isolation per page.

Also adds docs/research-index.md: a 622-line navigable index of all
83 research documents grouped into 9 thematic categories, with a
"Start Here" reading path, per-phase implementation reading tables,
and an alphabetical lookup table covering every document.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jedarden 2026-05-16 16:30:35 -04:00
parent 92e6196ac5
commit 8753630bc3
2 changed files with 716 additions and 0 deletions

docs/research-index.md

@@ -0,0 +1,622 @@
# pdftract Research Index
**82 documents** covering PDF internals, font encoding, OCR, layout reconstruction, specialized document types, security, output schema, and multilingual extraction.
This is a navigable reading guide, not a summary. Use it to find the right document when implementing a specific feature. All links are relative to this file's location (`docs/`).
---
## Start Here
Read these six documents first, in order, before touching any other research:
1. **[research/extraction-pipeline-overview.md](research/extraction-pipeline-overview.md)** — End-to-end architectural blueprint: all 9 pipeline stages, decision points, and data transformations. The canonical integration reference that ties every other document together.
2. **[research/pdf-specification.md](research/pdf-specification.md)** — ISO 32000-1/2 implementation reference: file structure, xref tables, object model, content streams, and encoding. The foundation for Phase 1.
3. **[research/pdf-fonts-and-encoding.md](research/pdf-fonts-and-encoding.md)** — Every font type in PDF, character code → Unicode resolution, and the four-level fallback chain. Essential before any font work.
4. **[research/content-stream-operators.md](research/content-stream-operators.md)** — Complete operator reference for text extraction: text state operators, positioning, rendering modes, and their effect on glyph output.
5. **[research/extraction-output-schema.md](research/extraction-output-schema.md)** — Stable v1.0 JSON schema: document, page, span, block, annotation, and form field structures. Read this before writing any serialization code.
6. **[plan/implementation-plan.md](plan/implementation-plan.md)** — Seven-phase build plan with crate dependencies, critical tests, and milestone targets. The execution roadmap.
---
## Reading Path for Implementation
### Phase 1: Core PDF Parser
Build the lexer, object parser, xref resolution, document model, and stream decoder.
| Document | Why |
|---|---|
| [research/pdf-specification.md](research/pdf-specification.md) | File structure, xref tables, object streams, linearized layout |
| [research/pdf-object-model-and-data-types.md](research/pdf-object-model-and-data-types.md) | The eight PDF object types, reference semantics, generation numbers |
| [research/xref-table-parsing-and-object-lookup.md](research/xref-table-parsing-and-object-lookup.md) | Traditional xref, xref streams, hybrid files, incremental update chains |
| [research/malformed-pdf-repair-and-recovery.md](research/malformed-pdf-repair-and-recovery.md) | Forward scan fallback, truncated file recovery, error taxonomy |
| [research/document-catalog-and-structure.md](research/document-catalog-and-structure.md) | Catalog keys, page tree traversal, inherited attributes |
| [research/page-geometry-and-document-structure.md](research/page-geometry-and-document-structure.md) | MediaBox/CropBox/BleedBox, rotation, coordinate systems |
| [research/image-compression-and-filter-decoding.md](research/image-compression-and-filter-decoding.md) | FlateDecode, LZW, ASCII85, RunLength, DCT, JBIG2, JPX filter chain |
| [research/pdf-encryption-and-security.md](research/pdf-encryption-and-security.md) | Standard handler, RC4/AES decryption, password attempt sequence |
| [research/error-handling-and-robustness.md](research/error-handling-and-robustness.md) | Recoverable error model, diagnostic codes, graceful degradation |
| [research/adversarial-inputs-and-parser-security.md](research/adversarial-inputs-and-parser-security.md) | Decompression bombs, circular references, resource exhaustion limits |
| [research/linearized-pdf-and-streaming.md](research/linearized-pdf-and-streaming.md) | Two-xref-table layout, hint streams, fast-web-view parsing |
| [research/incremental-updates-and-versioning.md](research/incremental-updates-and-versioning.md) | Append-only update model, revision history, redaction detection |
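
The circular-reference guard named in the table above can be illustrated with a small resolver. This is a minimal sketch and not pdftract's actual code: the `Object` enum, the `resolve` function, the `ResolveError` variants, and the `MAX_DEPTH` value are all hypothetical names chosen for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, heavily simplified object model: a value or a reference.
#[derive(Debug, Clone, PartialEq)]
enum Object {
    Integer(i64),
    Reference(u32), // object number only; generation omitted for brevity
}

#[derive(Debug, PartialEq)]
enum ResolveError {
    Cycle(u32),
    TooDeep,
    Missing(u32),
}

/// Assumed limit; a real parser would likely make this configurable.
const MAX_DEPTH: usize = 64;

/// Follow a chain of references until a non-reference object is reached,
/// refusing to loop forever on circular or excessively long chains.
fn resolve(objects: &HashMap<u32, Object>, start: u32) -> Result<Object, ResolveError> {
    let mut seen = HashSet::new();
    let mut current = start;
    loop {
        // `insert` returns false when the object number was already visited.
        if !seen.insert(current) {
            return Err(ResolveError::Cycle(current));
        }
        if seen.len() > MAX_DEPTH {
            return Err(ResolveError::TooDeep);
        }
        match objects.get(&current) {
            None => return Err(ResolveError::Missing(current)),
            Some(Object::Reference(next)) => current = *next,
            Some(other) => return Ok(other.clone()),
        }
    }
}
```

Returning an error value instead of panicking keeps the guard compatible with a recoverable error model, where a bad reference becomes a diagnostic rather than an abort.
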
### Phase 2: Font and Encoding Pipeline
Map every character code to a Unicode scalar value with a confidence score.
| Document | Why |
|---|---|
| [research/pdf-fonts-and-encoding.md](research/pdf-fonts-and-encoding.md) | All font types, encoding vectors, AGL lookup, four-level fallback |
| [research/cmap-format-and-cid-encoding.md](research/cmap-format-and-cid-encoding.md) | ToUnicode CMap syntax: bfchar, bfrange, usecmap, UTF-16BE sequences |
| [research/font-descriptor-and-metrics.md](research/font-descriptor-and-metrics.md) | FontDescriptor keys, width arrays, hmtx metrics, descriptor flags |
| [research/font-subsetting-and-extraction.md](research/font-subsetting-and-extraction.md) | Subset naming convention, glyph table gaps, metric inference |
| [research/glyph-recognition-and-unicode-recovery.md](research/glyph-recognition-and-unicode-recovery.md) | Shape-hash database, perceptual hashing, Level 4 fallback |
| [research/type3-font-extraction.md](research/type3-font-extraction.md) | CharProcs content streams, per-glyph rasterization, shape recognition |
| [research/cjk-and-asian-script-encoding.md](research/cjk-and-asian-script-encoding.md) | CIDFont encoding, predefined CMaps, Shift-JIS/GB18030/Big5/EUC-KR |
| [research/resource-dictionary-and-inheritance.md](research/resource-dictionary-and-inheritance.md) | Font namespace resolution, inherited resource merging, XObject refs |
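
ToUnicode CMap destinations are UTF-16BE byte strings, so a single character code can map to several Unicode scalars (a ligature) or to a surrogate pair. The following sketch shows one way to decode such a destination using only the standard library; the function name `decode_utf16be` is an assumption for illustration, not a pdftract API.

```rust
/// Decode a UTF-16BE byte string (as found in bfchar/bfrange destinations)
/// into a Rust String. Returns None for odd-length input or for unpaired
/// surrogates, so the caller can fall through to the next fallback level.
fn decode_utf16be(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

A destination of `<00660069>` decodes to the two-character string `fi`, which is how a ligature glyph expands to plain text.
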
### Phase 3: Content Stream Processing
Execute content stream operators to produce a raw glyph list with positions.
| Document | Why |
|---|---|
| [research/content-stream-operators.md](research/content-stream-operators.md) | Full operator reference: Tj, TJ, Td, Tm, Tf, Tr, BT/ET, and all text operators |
| [research/graphics-state-tracking.md](research/graphics-state-tracking.md) | CTM, text matrix, q/Q stack, Tr rendering mode, color state |
| [research/text-positioning-and-font-metrics.md](research/text-positioning-and-font-metrics.md) | Text matrix accumulation, leading, Td/TD/T* semantics, rise |
| [research/content-stream-concatenation.md](research/content-stream-concatenation.md) | Multi-stream pages, /Length mismatch handling, resource namespace scoping |
| [research/optional-content-groups.md](research/optional-content-groups.md) | OCG layer state, BDC/EMC marked content sequences, visibility filtering |
| [research/invisible-and-hidden-text.md](research/invisible-and-hidden-text.md) | Tr=3 invisible text, OCR layer patterns, include_invisible_text flag |
| [research/stroke-and-outlined-text.md](research/stroke-and-outlined-text.md) | Rendering modes 1–7, stroke-only glyphs, clipping path text |
| [research/shading-pattern-and-text-visibility.md](research/shading-pattern-and-text-visibility.md) | Luminance estimation for text-on-background visibility filtering |
| [research/word-boundary-reconstruction.md](research/word-boundary-reconstruction.md) | Inter-glyph gap thresholds, TJ kern detection, synthetic space insertion |
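
The q/Q stack and `cm` concatenation from the graphics-state row can be sketched as follows. This is an illustrative model only: `GraphicsState`, `multiply`, and the method names are hypothetical, and a real graphics state carries far more than the CTM. The one fixed rule it demonstrates is that `cm` premultiplies: the new CTM is the operand matrix times the old CTM.

```rust
/// Affine matrix [a b c d e f] as written in PDF content streams.
type Matrix = [f64; 6];

const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Returns m × n in the PDF row-vector convention.
fn multiply(m: &Matrix, n: &Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// Hypothetical minimal graphics state: only the CTM and its save stack.
struct GraphicsState {
    ctm: Matrix,
    stack: Vec<Matrix>,
}

impl GraphicsState {
    fn new() -> Self {
        GraphicsState { ctm: IDENTITY, stack: Vec::new() }
    }

    /// `q`: push a copy of the current state.
    fn save(&mut self) {
        self.stack.push(self.ctm);
    }

    /// `Q`: pop the saved state; an unbalanced Q is ignored here.
    fn restore(&mut self) {
        if let Some(saved) = self.stack.pop() {
            self.ctm = saved;
        }
    }

    /// `cm`: CTM' = operand × CTM.
    fn concat(&mut self, m: &Matrix) {
        self.ctm = multiply(m, &self.ctm);
    }

    /// Map a user-space point through the CTM.
    fn transform(&self, x: f64, y: f64) -> (f64, f64) {
        let [a, b, c, d, e, f] = self.ctm;
        (a * x + c * y + e, b * x + d * y + f)
    }
}
```

Applying a translation by (10, 20) and then a scale by 2 maps the point (1, 1) to (12, 22): the most recently concatenated matrix acts on the point first.
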
### Phase 4: Text Assembly and Layout
Transform raw glyph lists into structured blocks in reading order.
| Document | Why |
|---|---|
| [research/span-merging-and-text-run-assembly.md](research/span-merging-and-text-run-assembly.md) | Span boundary rules, run assembly, line merging, block formation |
| [research/complex-layout-reading-order.md](research/complex-layout-reading-order.md) | XY-cut algorithm, multi-column detection, sidebar/caption disambiguation |
| [research/tagged-pdf-structure-and-reading-order.md](research/tagged-pdf-structure-and-reading-order.md) | Structure tree walk, MCID-to-span mapping, ActualText override |
| [research/document-classification-and-zone-labeling.md](research/document-classification-and-zone-labeling.md) | Zone classification: body/header/footer/caption/margin heuristics |
| [research/watermark-and-background-separation.md](research/watermark-and-background-separation.md) | Watermark detection, repeated-pattern suppression, z-order analysis |
| [research/semantic-text-reconstruction.md](research/semantic-text-reconstruction.md) | Hyphen removal, soft-hyphen handling, ligature expansion, line joining |
| [research/post-extraction-normalization.md](research/post-extraction-normalization.md) | NFC normalization, whitespace cleanup, paragraph boundary detection |
| [research/confidence-scoring-and-aggregation.md](research/confidence-scoring-and-aggregation.md) | Per-glyph confidence, span aggregation, readability score computation |
| [research/text-readability-validation.md](research/text-readability-validation.md) | Printable-character ratio, dictionary word rate, readability threshold |
| [research/page-labels-and-outline-extraction.md](research/page-labels-and-outline-extraction.md) | PageLabels number tree, outline bookmark walk, named destinations |
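
Ligature expansion, listed under semantic text reconstruction above, amounts to replacing Unicode presentation-form ligatures with their component letters. A minimal sketch, assuming the hypothetical function name `expand_ligatures` and covering only the Latin ligatures U+FB00 through U+FB04 and U+FB06:

```rust
/// Replace Latin presentation-form ligatures with their component letters.
/// All other characters pass through unchanged.
fn expand_ligatures(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            '\u{FB06}' => out.push_str("st"),
            other => out.push(other),
        }
    }
    out
}
```

Expanding ligatures before dictionary-based checks matters because a word containing U+FB01 would otherwise fail a lookup for its plain-letter spelling.
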
### Phase 5: OCR Integration
Extract text from scanned pages; improve broken-vector pages via Tesseract.
| Document | Why |
|---|---|
| [research/scanned-vs-vector-page-classification.md](research/scanned-vs-vector-page-classification.md) | Classification signals, PageClass enum, confidence scoring per signal |
| [research/raster-ocr-pipeline.md](research/raster-ocr-pipeline.md) | 300 DPI render, Sauvola binarization, deskew, Tesseract HOCR integration |
| [research/post-ocr-text-correction.md](research/post-ocr-text-correction.md) | Systematic OCR errors, dictionary-based correction, confidence re-scoring |
| [research/image-and-figure-extraction.md](research/image-and-figure-extraction.md) | Image XObject identification, inline images, figure region detection |
| [research/historical-and-degraded-document-extraction.md](research/historical-and-degraded-document-extraction.md) | Microfilm, low-quality scan preprocessing, multi-pass OCR strategies |
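
The 300 DPI render target translates into pixel dimensions through the default PDF user-space unit of 1/72 inch. The sketch below shows the arithmetic; `raster_size_px` is a hypothetical helper, and it assumes no `/UserUnit` override and no page rotation.

```rust
/// Convert a page size in points (1/72 inch) to raster dimensions in pixels
/// at the given resolution. Assumes default user units and no rotation.
fn raster_size_px(width_pt: f64, height_pt: f64, dpi: f64) -> (u32, u32) {
    let scale = dpi / 72.0;
    (
        (width_pt * scale).round() as u32,
        (height_pt * scale).round() as u32,
    )
}
```

A US Letter page (612 × 792 pt) comes out at 2550 × 3300 pixels, which is 8,415,000 bytes as an 8-bit grayscale buffer. That per-page cost is the kind of figure a memory budget for concurrent OCR would need to account for.
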
### Phase 6: Output and API
Full JSON schema, PyO3 bindings, HTTP serve mode, NDJSON streaming.
| Document | Why |
|---|---|
| [research/extraction-output-schema.md](research/extraction-output-schema.md) | Complete v1.0 schema: all fields, types, and serialization constraints |
| [research/performance-and-streaming-architecture.md](research/performance-and-streaming-architecture.md) | mmap I/O, rayon page parallelism, BufWriter NDJSON, LRU object cache |
| [research/hyperlinks-and-named-destinations.md](research/hyperlinks-and-named-destinations.md) | URI annotations, GoTo actions, cross-document links, named dest resolution |
| [research/xmp-and-document-metadata.md](research/xmp-and-document-metadata.md) | XMP RDF/XML parsing, /Info dict fallback, Dublin Core fields, conflict resolution |
| [research/chunking-for-llm-consumption.md](research/chunking-for-llm-consumption.md) | Chunk size strategy, block-boundary splitting, overlap, token estimation |
| [research/benchmark-and-test-methodology.md](research/benchmark-and-test-methodology.md) | Test corpus design, accuracy metrics, performance benchmarks, regression suite |
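
When pages are processed in parallel they finish out of order, yet NDJSON output is expected in page order. One way to reconcile the two is a slot array: each finished page is parked in its slot, and the writer releases the longest ready prefix. The sketch below is an assumption about how such a structure could look, not the design from the research documents; `OrderedSlots` and `complete` are hypothetical names, and the locking or channel that would feed it from worker threads is omitted.

```rust
/// Buffers out-of-order page results and releases them in page order.
struct OrderedSlots {
    slots: Vec<Option<String>>,
    next: usize, // index of the next page that must be written
}

impl OrderedSlots {
    fn new(page_count: usize) -> Self {
        OrderedSlots { slots: vec![None; page_count], next: 0 }
    }

    /// Store the NDJSON line for `page` and return every line that is now
    /// ready to be written, in page order. Out-of-range pages are ignored.
    fn complete(&mut self, page: usize, line: String) -> Vec<String> {
        let mut ready = Vec::new();
        if page >= self.slots.len() {
            return ready;
        }
        self.slots[page] = Some(line);
        // Drain the contiguous run of finished pages starting at `next`.
        while self.next < self.slots.len() {
            match self.slots[self.next].take() {
                Some(done) => {
                    ready.push(done);
                    self.next += 1;
                }
                None => break,
            }
        }
        ready
    }
}
```

The trade-off is memory: a slow early page holds back every later page that has already finished, so the number of buffered slots needs some bound in practice.
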
### Phase 7: Advanced Features
StructTree exploitation, table detection, AcroForm/XFA, attachments, signatures.
| Document | Why |
|---|---|
| [research/accessibility-and-tagged-pdf-deep-dive.md](research/accessibility-and-tagged-pdf-deep-dive.md) | StructTree walk, role mapping, artifact suppression, ActualText semantics |
| [research/tagged-pdf-structure-and-reading-order.md](research/tagged-pdf-structure-and-reading-order.md) | ParentTree, MCID resolution, Suspects flag fallback |
| [research/table-structure-reconstruction.md](research/table-structure-reconstruction.md) | Ruling-line detection, borderless heuristics, cell assignment, merged cells |
| [research/form-fields-and-annotations.md](research/form-fields-and-annotations.md) | AcroForm field walk, XFA extraction, annotation types and values |
| [research/digital-signatures-and-certification.md](research/digital-signatures-and-certification.md) | Sig field metadata, ByteRange, SubFilter, validation_status reporting |
| [research/embedded-files-and-portfolios.md](research/embedded-files-and-portfolios.md) | EmbeddedFiles name tree, portfolio navigation, attachment metadata |
| [research/pdf-portfolio-and-attachments.md](research/pdf-portfolio-and-attachments.md) | PDF Portfolio collections, navigator schema, attachment content access |
| [research/article-threads-and-reading-order.md](research/article-threads-and-reading-order.md) | /Threads bead chains, multi-page article flow, reading order override |
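
A signature's `/ByteRange` is normally four integers: two (offset, length) pairs describing the signed bytes on either side of the gap that holds the signature value. A basic coverage check asks whether those segments start at byte 0 and end at the last byte of the file. The helper below is a hypothetical sketch of that check only; it performs no cryptographic validation.

```rust
/// Returns true when the two signed segments begin at offset 0, do not
/// overlap, and end exactly at the end of the file. Bytes appended after
/// signing (for example by an incremental update) make this return false.
fn byte_range_covers_file(byte_range: [u64; 4], file_len: u64) -> bool {
    let [start1, len1, start2, len2] = byte_range;
    // checked_add guards against overflow from hostile values.
    let end1 = match start1.checked_add(len1) {
        Some(v) => v,
        None => return false,
    };
    let end2 = match start2.checked_add(len2) {
        Some(v) => v,
        None => return false,
    };
    start1 == 0 && end1 <= start2 && end2 == file_len
}
```
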
---
## Document Categories
### PDF File Format and Parsing
Core specification, file structure, object model, and xref resolution.
- **[research/pdf-specification.md](research/pdf-specification.md)** — ISO 32000-1/2 implementation reference covering file structure, xref tables, object streams, and linearized layout
- File header, body, xref, trailer structure
- Xref streams (PDF 1.5+), object streams, incremental updates
- Content stream grammar and operator categorization
- **[research/pdf-object-model-and-data-types.md](research/pdf-object-model-and-data-types.md)** — The eight fundamental PDF object types with serialization details and reference semantics
- Boolean, numeric (integer and real), string, name, array, dictionary, stream, null
- Indirect object references, generation numbers, object identity
- **[research/xref-table-parsing-and-object-lookup.md](research/xref-table-parsing-and-object-lookup.md)** — Cross-reference parsing strategies: traditional table, xref streams, hybrid files, and incremental chains
- Traditional 20-byte xref entry format and subsection merging
- Type-0/1/2 xref stream entries, `/W` field widths
- `/Prev` chain traversal for incremental updates
- **[research/document-catalog-and-structure.md](research/document-catalog-and-structure.md)** — Document catalog keys, page tree traversal, and inherited attribute resolution
- `/Root`, `/Pages`, `/Outlines`, `/AcroForm`, `/MarkInfo`, `/OCProperties`
- Page tree flattening, per-key inheritance walk, attribute override rules
- **[research/page-geometry-and-document-structure.md](research/page-geometry-and-document-structure.md)** — Page boxes, coordinate systems, rotation, and media/crop/bleed/trim/art box semantics
- MediaBox, CropBox, BleedBox, TrimBox, ArtBox inheritance
- User space to device space transformation, rotation matrix application
- **[research/linearized-pdf-and-streaming.md](research/linearized-pdf-and-streaming.md)** — Linearized ("fast web view") PDF layout: two-xref structure, hint streams, and first-page parsing
- Linearization dictionary keys and their parsing implications
- Hint table decoding, page offset table, shared object table
- **[research/incremental-updates-and-versioning.md](research/incremental-updates-and-versioning.md)** — Non-destructive append-only PDF modification: revision history, object shadowing, and redaction forensics
- Incremental save mechanics, object number reuse across revisions
- Detecting "soft redaction" where content is hidden but not deleted
- **[research/malformed-pdf-repair-and-recovery.md](research/malformed-pdf-repair-and-recovery.md)** — Forward scan fallback, truncation recovery, and the full taxonomy of real-world PDF corruption
- Forward object scan from `obj`/`endobj` markers when xref fails
- Per-corruption diagnostic codes and recovery strategies
- **[research/pdf-generator-quirks.md](research/pdf-generator-quirks.md)** — Per-generator fingerprinting and known deviations from spec for Word, LibreOffice, Chrome, LaTeX, and others
- Generator detection via `/Producer` field patterns
- Quirk-specific workarounds keyed to generator fingerprint
- **[research/resource-dictionary-and-inheritance.md](research/resource-dictionary-and-inheritance.md)** — Font, XObject, ExtGState, and pattern namespace resolution with inheritance from ancestor page nodes
- Multi-level resource merging, last-write-wins semantics at page level
- Namespace isolation between content streams and Form XObjects
- **[research/image-compression-and-filter-decoding.md](research/image-compression-and-filter-decoding.md)** — All PDF stream filters: FlateDecode predictors, LZW, ASCII85, RunLength, DCT, JBIG2, JPX, CCITT
- Filter pipeline chaining, `/DecodeParms` alignment
- Partial decode recovery on zlib truncation errors
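
The traditional 20-byte xref entry mentioned above has a fixed layout: a 10-digit field, a space, a 5-digit generation number, a space, the keyword `n` or `f`, and a 2-byte end-of-line. A minimal parsing sketch follows; the `XrefEntry` enum and function names are hypothetical, and the end-of-line bytes are deliberately not validated because real files vary there.

```rust
#[derive(Debug, PartialEq)]
enum XrefEntry {
    /// In-use object: byte offset of the object in the file.
    InUse { offset: u64, generation: u32 },
    /// Free object: object number of the next free object.
    Free { next_free: u64, generation: u32 },
}

/// Parse a field that must consist solely of ASCII digits.
fn parse_digits(field: &[u8]) -> Option<u64> {
    if field.is_empty() || !field.iter().all(|b| b.is_ascii_digit()) {
        return None;
    }
    std::str::from_utf8(field).ok()?.parse().ok()
}

/// Parse one fixed-width entry: "nnnnnnnnnn ggggg n" plus a 2-byte EOL.
fn parse_xref_entry(entry: &[u8; 20]) -> Option<XrefEntry> {
    if entry[10] != b' ' || entry[16] != b' ' {
        return None;
    }
    let first = parse_digits(&entry[0..10])?;
    // Five decimal digits always fit in u32.
    let generation = parse_digits(&entry[11..16])? as u32;
    match entry[17] {
        b'n' => Some(XrefEntry::InUse { offset: first, generation }),
        b'f' => Some(XrefEntry::Free { next_free: first, generation }),
        _ => None,
    }
}
```

Returning `None` on any deviation lets the caller decide whether to attempt a lenient re-parse or fall back to a forward object scan.
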
---
### Font and Encoding
Everything needed to map a character code to a Unicode codepoint.
- **[research/pdf-fonts-and-encoding.md](research/pdf-fonts-and-encoding.md)** — Complete font type reference and the four-level Unicode resolution fallback chain
- Type1, TrueType, Type0/CID, Type3, OpenType font loading strategies
- ToUnicode → AGL → fingerprint → shape recognition fallback
- **[research/cmap-format-and-cid-encoding.md](research/cmap-format-and-cid-encoding.md)** — ToUnicode CMap program syntax and CID encoding for composite fonts
- `beginbfchar`/`beginbfrange`, `usecmap`, UTF-16BE ligature expansion
- Predefined CMap names for CJK scripts, CID-to-GID mapping
- **[research/font-descriptor-and-metrics.md](research/font-descriptor-and-metrics.md)** — FontDescriptor dictionary keys, width arrays, hmtx table access, and descriptor flags
- `/Widths`, `/FirstChar`, `/LastChar`, `/MissingWidth` for Type1/TrueType
- `/DW`, `/W` sparse width encoding for CIDFonts
- **[research/font-subsetting-and-extraction.md](research/font-subsetting-and-extraction.md)** — Subset font naming convention, glyph table gaps, and metrics inference for missing glyphs
- Six-uppercase-letter prefix stripping for Standard 14 lookup
- Identifying and handling subsetting-induced ToUnicode gaps
- **[research/glyph-recognition-and-unicode-recovery.md](research/glyph-recognition-and-unicode-recovery.md)** — Shape-hash database construction and perceptual hashing for Level 4 Unicode recovery
- 32×32 bitmap rendering, perceptual hash lookup database
- Shape-match confidence scoring and known-font fingerprint cache
- **[research/type3-font-extraction.md](research/type3-font-extraction.md)** — Type 3 fonts: glyph shapes as content stream fragments requiring per-glyph rasterization
- `/CharProcs` dictionary parsing, glyph content stream execution
- Color/grayscale Type 3 glyph rendering for shape recognition
- **[research/cjk-and-asian-script-encoding.md](research/cjk-and-asian-script-encoding.md)** — CJK CIDFont encoding, predefined CMaps, and multi-byte character code parsing
- Shift-JIS, GB18030, Big5, EUC-KR codepage decoding via `encoding_rs`
- Vertical writing mode detection and glyph substitution
- **[research/opentype-math-and-formula-extraction.md](research/opentype-math-and-formula-extraction.md)** — OpenType MATH table layout and text-extraction strategy for mathematical formulas
- MathVariants, MathKern, script/superscript glyph positioning
- Linearized formula extraction vs. MathML reconstruction tradeoffs
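
Subset fonts carry a tag of six uppercase letters followed by `+` in front of the real font name, for example `ABCDEF+Helvetica`. Stripping that tag is what makes a Standard 14 name lookup possible. A minimal sketch, with `strip_subset_prefix` as a hypothetical helper name:

```rust
/// Remove a subset tag ("ABCDEF+") from a font name when present.
/// Names without a well-formed tag are returned unchanged.
fn strip_subset_prefix(font_name: &str) -> &str {
    let bytes = font_name.as_bytes();
    let has_tag = bytes.len() > 7
        && bytes[6] == b'+'
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase());
    if has_tag {
        // Safe to slice at byte 7: the first seven bytes are ASCII.
        &font_name[7..]
    } else {
        font_name
    }
}
```
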
---
### Content Stream Processing
Executing the PDF painting model to produce glyphs with positions.
- **[research/content-stream-operators.md](research/content-stream-operators.md)** — Full PDF operator reference for text extraction: every text, positioning, and state operator
- Tj, TJ, ', " operators; Td, TD, Tm, T* positioning
- BT/ET block semantics, Tf font selection, Tr rendering mode
- **[research/graphics-state-tracking.md](research/graphics-state-tracking.md)** — Complete graphics state machine: CTM, text matrix stack, color state, and rendering mode
- q/Q push/pop, CTM concatenation via `cm` operator
- Tr modes 0–7 and their visibility implications for extraction
- **[research/text-positioning-and-font-metrics.md](research/text-positioning-and-font-metrics.md)** — Text state scalar accumulation, leading, Td/TD/T* semantics, and rise
- Text matrix vs. text line matrix distinction
- Character spacing (Tc), word spacing (Tw), horizontal scaling (Tz)
- **[research/content-stream-concatenation.md](research/content-stream-concatenation.md)** — Multi-stream page assembly, `/Length` mismatch handling, and resource namespace scoping
- Pages with array `/Contents`, stream boundary handling
- Form XObject execution and resource inheritance within sub-streams
- **[research/optional-content-groups.md](research/optional-content-groups.md)** — OCG layer state tracking, BDC/EMC marked content, and visibility-based glyph suppression
- `/OCProperties` catalog entry, OCG on/off state resolution
- OCMD (optional content membership dictionary) logic operators
- **[research/invisible-and-hidden-text.md](research/invisible-and-hidden-text.md)** — Tr=3 invisible text, PDF/A OCR layer patterns, and the `include_invisible_text` flag
- Scanned-PDF OCR text layer architecture (visible image + hidden text)
- White-on-white and zero-font-size hidden text detection
- **[research/stroke-and-outlined-text.md](research/stroke-and-outlined-text.md)** — Text rendering modes 1–7, stroke-only glyphs, and clipping path text handling
- Mode 1 (stroke), mode 2 (fill+stroke), mode 7 (clip only, invisible)
- Outlined text in logos and headings where fill is absent
- **[research/word-boundary-reconstruction.md](research/word-boundary-reconstruction.md)** — Inter-glyph gap thresholds, TJ kern detection, and synthetic space character insertion
- TeX/LaTeX character-per-glyph patterns, missing inter-word spaces
- Horizontal gap normalized by font size as word-boundary signal
- **[research/shading-pattern-and-text-visibility.md](research/shading-pattern-and-text-visibility.md)** — Color space luminance estimation for text-on-background visibility and watermark filtering
- ICC profile color space normalization to approximate luminance
- Pattern and shading fills as background suppression candidates
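
The word-boundary signal described above, horizontal gap normalized by font size, can be sketched as a single pass over the glyphs of one line. Everything here is illustrative: the `Glyph` struct, the `assemble_line` function, and especially the `GAP_RATIO` value of 0.25 are assumptions, not thresholds taken from the research documents. The sketch also assumes the glyphs are already sorted left to right on one horizontal baseline.

```rust
/// Hypothetical positioned glyph on a horizontal baseline.
struct Glyph {
    x_start: f64,
    x_end: f64,
    font_size: f64,
    ch: char,
}

/// Assumed threshold: a gap wider than this fraction of the font size
/// is treated as a word boundary.
const GAP_RATIO: f64 = 0.25;

/// Join glyphs into text, inserting a synthetic space wherever the gap
/// between neighbours exceeds the normalized threshold.
fn assemble_line(glyphs: &[Glyph]) -> String {
    let mut out = String::new();
    let mut prev: Option<&Glyph> = None;
    for g in glyphs {
        if let Some(p) = prev {
            let gap = g.x_start - p.x_end;
            let wide = p.font_size > 0.0 && gap / p.font_size > GAP_RATIO;
            // Avoid doubling a space that is already present in the stream.
            if wide && p.ch != ' ' && g.ch != ' ' {
                out.push(' ');
            }
        }
        out.push(g.ch);
        prev = Some(g);
    }
    out
}
```

Normalizing by font size keeps the same threshold usable for footnotes and headlines alike, which a fixed gap in points would not.
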
---
### Text Assembly and Layout Reconstruction
Transforming raw glyph lists into ordered, structured text.
- **[research/span-merging-and-text-run-assembly.md](research/span-merging-and-text-run-assembly.md)** — Span boundary detection, text run assembly, line merging, and block formation pipeline
- Font/size/color/mode change triggers for span splits
- Ascending y-position sort, baseline alignment for line grouping
- **[research/complex-layout-reading-order.md](research/complex-layout-reading-order.md)** — XY-cut recursive page partitioning for multi-column, sidebar, and mixed-layout documents
- Whitespace gap detection as column separator, minimum gap thresholds
- Caption-to-figure association, margin note classification
- **[research/tagged-pdf-structure-and-reading-order.md](research/tagged-pdf-structure-and-reading-order.md)** — Structure tree as authoritative reading order source for tagged documents
- StructElem type mapping to block kinds, ParentTree MCID lookup
- Suspects flag validation and XY-cut fallback conditions
- **[research/document-classification-and-zone-labeling.md](research/document-classification-and-zone-labeling.md)** — Page zone classification: body text, header, footer, caption, margin, and running head
- Spatial heuristics: y-position thresholds, font size ratios, repetition
- Zone label influence on reading order and block kind assignment
- **[research/watermark-and-background-separation.md](research/watermark-and-background-separation.md)** — Watermark detection, repeated-pattern suppression, and z-order layering analysis
- Low-opacity text, diagonal text, large-font centered text as watermark signals
- Suppression vs. flagging strategies, `is_artifact` span flag
- **[research/semantic-text-reconstruction.md](research/semantic-text-reconstruction.md)** — Hyphen removal, soft-hyphen handling, ligature expansion, and line-end join heuristics
- End-of-line hyphen detection and word reconstitution
- Ligature Unicode expansion (fi, fl, ffi, ffl, st)
- **[research/post-extraction-normalization.md](research/post-extraction-normalization.md)** — NFC normalization, whitespace cleanup, and paragraph boundary detection
- Unicode combining character normalization pipeline
- Trailing/leading whitespace removal, control character stripping
- **[research/confidence-scoring-and-aggregation.md](research/confidence-scoring-and-aggregation.md)** — Per-glyph confidence scoring, span aggregation, and page-level readability score computation
- Source-weighted confidence: to_unicode=1.0, agl=0.9, fingerprint=0.85, shape_match=0.7
- Span minimum confidence, page aggregate, extraction_quality rollup
- **[research/text-readability-validation.md](research/text-readability-validation.md)** — Printable-character ratio, dictionary word rate, and readability threshold for OCR fallback triggering
- Character validity checks: printable Unicode, non-sentinel codepoints
- Readability score thresholds for `ocr_fallback_threshold` comparison
- **[research/page-labels-and-outline-extraction.md](research/page-labels-and-outline-extraction.md)** — PageLabels number tree parsing, outline bookmark walk, and named destination resolution
- Roman/Arabic/alphabetic label prefix and style encoding
- `/Outlines` recursive walk, `/Dests` name tree lookup
- **[research/table-structure-reconstruction.md](research/table-structure-reconstruction.md)** — Ruling-line grid detection, borderless column alignment, cell assignment, and merged cell inference
- Path segment clustering, intersection point detection, grid construction
- Colspan/rowspan inference from missing interior edges
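
The source-weighted confidence values and the span-minimum rule listed under confidence scoring can be expressed in a few lines. The numeric weights below are the ones quoted in this index; the `UnicodeSource` enum and the function names are hypothetical, and the sketch covers only the per-span minimum, not the page aggregate.

```rust
/// Which fallback level produced the Unicode value for a glyph.
#[derive(Debug, Clone, Copy, PartialEq)]
enum UnicodeSource {
    ToUnicode,
    Agl,
    Fingerprint,
    ShapeMatch,
}

/// Per-glyph confidence keyed by resolution source.
fn source_confidence(source: UnicodeSource) -> f32 {
    match source {
        UnicodeSource::ToUnicode => 1.0,
        UnicodeSource::Agl => 0.9,
        UnicodeSource::Fingerprint => 0.85,
        UnicodeSource::ShapeMatch => 0.7,
    }
}

/// Span confidence as the minimum over its glyphs; None for an empty span.
fn span_confidence(sources: &[UnicodeSource]) -> Option<f32> {
    sources
        .iter()
        .map(|s| source_confidence(*s))
        .reduce(f32::min)
}
```

Taking the minimum means a single shape-matched glyph caps the confidence of an otherwise fully mapped span, which errs on the side of flagging doubtful text.
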
---
### OCR and Image Processing
Handling scanned and raster pages.
- **[research/scanned-vs-vector-page-classification.md](research/scanned-vs-vector-page-classification.md)** — Classification signals and the PageClass enum (Vector/Scanned/Hybrid/BrokenVector)
- Image coverage fraction threshold, character validity rate signals
- 8×8 grid cell per-region classification for Hybrid detection
- **[research/raster-ocr-pipeline.md](research/raster-ocr-pipeline.md)** — Full OCR pipeline: 300 DPI rendering, Sauvola binarization, Hough deskew, Tesseract HOCR
- leptonica-plumbing preprocessing, HOCR confidence parsing
- BrokenVector assisted-OCR mode: bounding box seeding from vector positions
- **[research/post-ocr-text-correction.md](research/post-ocr-text-correction.md)** — Systematic OCR error patterns, dictionary-based correction, and post-OCR confidence re-scoring
- Common substitution errors (l/1/I, O/0, rn/m), context correction
- Language model probability scoring for candidate correction ranking
- **[research/image-and-figure-extraction.md](research/image-and-figure-extraction.md)** — Image XObject identification, inline image parsing, and figure region demarcation
- XObject type detection, image placement matrix, DPI computation
- Figure caption association and alt-text extraction from StructTree
- **[research/historical-and-degraded-document-extraction.md](research/historical-and-degraded-document-extraction.md)** — Preprocessing for microfilm, low-quality photocopies, and physically degraded originals
- Multi-pass binarization strategies, fold/crease artifact suppression
- Tesseract PSM mode selection for degraded layout recognition
- **[research/color-management-and-icc-profiles.md](research/color-management-and-icc-profiles.md)** — ICC profile color spaces and luminance estimation for text visibility determination
- CMYK/Lab/ICCBased color space normalization to approximate RGB
- Spot color handling, DeviceGray/DeviceRGB identity conversion
---
### Specialized Document Types
Documents with structural patterns that require targeted handling.
- **[research/latex-and-scientific-pdf-patterns.md](research/latex-and-scientific-pdf-patterns.md)** — LaTeX toolchain patterns: pdflatex, XeLaTeX, LuaLaTeX, and their encoding behaviors
- Type1/OTF font stacks, microtype spacing, missing ToUnicode patterns
- Figure/table float placement, bibliography link detection
- **[research/medical-and-scientific-pdf-patterns.md](research/medical-and-scientific-pdf-patterns.md)** — Dense mixed-content scientific documents: figures, tables, equations, citations, footnotes
- Multi-column layout with equation regions, journal template patterns
- Citation/reference block detection and DOI link extraction
- **[research/mathematical-expression-handling.md](research/mathematical-expression-handling.md)** — Mathematical notation extraction strategies across encoding schemes
- Symbol font mapping (Symbol, STIX, XITS, Computer Modern math)
- Subscript/superscript detection, operator precedence linearization
- **[research/legal-and-financial-pdf-patterns.md](research/legal-and-financial-pdf-patterns.md)** — Legal briefs, contracts, financial filings: line numbers, Bates stamps, footnote styles
- Court filing format patterns (federal, state), header/footer extraction
- Financial table dense-number extraction, currency symbol handling
- **[research/government-form-pdf-patterns.md](research/government-form-pdf-patterns.md)** — Government form PDFs: IRS, regulatory filings, mixed AcroForm/print-field layouts
- Form field label-to-value association across non-AcroForm "flat" forms
- Instruction text vs. fillable field disambiguation
- **[research/book-and-publishing-pdf-patterns.md](research/book-and-publishing-pdf-patterns.md)** — Book PDF structural complexity: running headers, footnotes, sidebars, indices, TOC
- Chapter/section boundary detection, page number extraction
- Index entry reconstruction, cross-reference link resolution
- **[research/engineering-document-extraction.md](research/engineering-document-extraction.md)** — PDF/E-1 engineering documents: CAD-exported PDFs, technical drawing annotation extraction
- Dimension annotation text, title block field extraction
- Revision table parsing, BOM (bill of materials) table detection
- **[research/presentation-and-spreadsheet-pdfs.md](research/presentation-and-spreadsheet-pdfs.md)** — PowerPoint and Excel PDFs: slide structure, speaker notes, sheet grids, frozen headers
- Slide bounding box as implicit zone boundary, note text association
- Spreadsheet cell grid reconstruction from absolute-positioned text
- **[research/pdfa-compliance-and-extraction.md](research/pdfa-compliance-and-extraction.md)** — PDF/A conformance levels and their extraction guarantees and fast-path optimizations
- PDF/A-1a/1b, 2a/2b/2u, 3a/3b/3u conformance constraints
- ToUnicode guarantee in PDF/A-1a, mandatory tagging in PDF/A-2a
- **[research/pdfa-archival-extraction-guarantees.md](research/pdfa-archival-extraction-guarantees.md)** — Specific extraction guarantees derivable from PDF/A conformance, enabling fast-path skips
- Level-specific ToUnicode presence guarantees, XMP metadata mandates
- Conformance-driven fallback skipping to improve throughput
- **[research/pdfx-prepress-extraction.md](research/pdfx-prepress-extraction.md)** — PDF/X print production formats: spot colors, bleed marks, output intent profiles
- PDF/X-1a, X-3, X-4, X-6 conformance constraints
- OutputIntent ICC profile, TrimBox/BleedBox as canonical page boundaries
- **[research/pdfua2-and-accessibility-standards.md](research/pdfua2-and-accessibility-standards.md)** — PDF/UA-2 (ISO 14289-2) built on PDF 2.0: updated structure requirements and WCAG alignment
- Namespace-qualified structure types, artifact classification changes
- Associated file attachment for MathML, pronunciation dictionaries
- **[research/pdfvt-variable-transactional-printing.md](research/pdfvt-variable-transactional-printing.md)** — PDF/VT variable and transactional printing: DPart tree, record boundary, reusable content
- DPart metadata extraction, record-per-recipient text variation
- Reusable content stream (RCS) handling, page piece dictionary
---
### Security and Robustness
Handling adversarial inputs, encryption, redaction, and JavaScript.
- **[research/adversarial-inputs-and-parser-security.md](research/adversarial-inputs-and-parser-security.md)** — Concrete attack classes and defensive techniques for production PDF parsing
- Decompression bombs: stream size limits, inflation ratio caps
- Circular reference guards, stack depth limits, object count caps
- **[research/pdf-encryption-and-security.md](research/pdf-encryption-and-security.md)** — Standard security handler, RC4 and AES decryption, certificate handlers, and password resolution
- `/V`, `/R`, `/KeyLength`, `/CF`/`/StmF`/`/StrF` handler fields
- Empty-password-first attempt sequence, unsupported handler error path
- **[research/error-handling-and-robustness.md](research/error-handling-and-robustness.md)** — Recoverable error model, diagnostic code taxonomy, and graceful degradation across all stages
- No-panic guarantee in library code, per-error diagnostic entries
- Stage-level error isolation: one page failure does not abort others
- **[research/redaction-detection-and-recovery.md](research/redaction-detection-and-recovery.md)** — Distinguishing true redaction from soft redaction; detecting content beneath covering rectangles
- Black rectangle over text detection, opacity-0 text identification
- /Redact annotation type, incremental update soft-redaction forensics
- **[research/javascript-and-interactive-pdf-extraction.md](research/javascript-and-interactive-pdf-extraction.md)** — JavaScript detection, dynamic content identification, and extraction strategy for interactive PDFs
- `/JS` action detection, `contains_javascript` metadata flag
- XFA dynamic form extraction vs. static snapshot fallback
- **[research/digital-signatures-and-certification.md](research/digital-signatures-and-certification.md)** — Digital signature field metadata extraction and ByteRange coverage reporting
- Sig field walk, `/ByteRange`, `/SubFilter` format identification
- Certification vs. approval signature distinction, validation_status field
---
### Output, API, and Metadata
Schema, serialization, and document-level metadata extraction.
- **[research/extraction-output-schema.md](research/extraction-output-schema.md)** — Stable v1.0 JSON schema: full field inventory for document, page, span, block, form, and annotation output
- Document-level metadata, outline, page array, extraction_quality
- Span and block structs, confidence sources, block kind enum
- **[research/xmp-and-document-metadata.md](research/xmp-and-document-metadata.md)** — XMP RDF/XML parsing, /Info dict fallback, Dublin Core fields, and XMP-vs-Info conflict resolution
- `pdfaid:conformance`, `dc:title`, `pdf:Producer` namespace fields
- XMP priority over /Info in PDF 1.4+ documents
- **[research/hyperlinks-and-named-destinations.md](research/hyperlinks-and-named-destinations.md)** — URI annotations, GoTo actions, named destination resolution, and internal navigation link extraction
- `/Annots` Link annotation type, `/A` action dictionary
- `/Dests` name tree, `/Names` catalog entry, cross-document GoToR
- **[research/page-labels-and-outline-extraction.md](research/page-labels-and-outline-extraction.md)** — PageLabels number tree, outline bookmark traversal, destination types
- `/S` label style (D/r/R/A/a), `/P` prefix, `/St` start value
- `/Outlines` `/First`/`/Next`/`/Last` linked list walk
- **[research/form-fields-and-annotations.md](research/form-fields-and-annotations.md)** — AcroForm field hierarchy, XFA extraction, and annotation text (highlights, stamps, notes)
- `/Fields` array walk, field type detection (Tx/Btn/Ch/Sig)
- Annotation subtypes: Highlight, StrikeOut, FreeText, Stamp, Link
- **[research/embedded-files-and-portfolios.md](research/embedded-files-and-portfolios.md)** — EmbeddedFiles name tree navigation, attachment metadata, and portfolio structure
- `/EmbeddedFiles` name tree, `/EF` dictionary, file stream access
- Portfolio `/Collection` schema, navigator sort order
- **[research/pdf-portfolio-and-attachments.md](research/pdf-portfolio-and-attachments.md)** — PDF Portfolio collections: navigator schema, attachment content access, and sub-document extraction
- `/Collection` fields array, sort key extraction
- Recursive extraction of PDF attachments within portfolios
- **[research/performance-and-streaming-architecture.md](research/performance-and-streaming-architecture.md)** — Memory-mapped I/O, rayon page parallelism, NDJSON streaming, and LRU object cache design
- mmap + `madvise(MADV_SEQUENTIAL)` on content streams
- `BufWriter<Stdout>` NDJSON, page-level rayon scatter/gather
- **[research/chunking-for-llm-consumption.md](research/chunking-for-llm-consumption.md)** — Block-boundary chunk splitting, overlap strategy, and token count estimation for RAG ingestion
- Heading-aware chunk boundaries, table/figure keep-together rules
- Overlap window sizing, chunk metadata (page_index, block_ids)
- **[research/benchmark-and-test-methodology.md](research/benchmark-and-test-methodology.md)** — Test corpus design, extraction accuracy metrics, performance benchmarks, and regression suite
- Ground-truth corpus construction, character error rate (CER) metric
- Performance targets: throughput pages/sec, memory ceiling per process
---
### Languages, Scripts, and Multilingual Documents
Non-Latin script handling, bidirectional text, and language detection.
- **[research/multilingual-document-extraction.md](research/multilingual-document-extraction.md)** — Mixed-script documents combining Latin with Arabic, Hebrew, CJK, and other scripts
- Per-span language detection, BCP-47 tag assignment
- Bidi paragraph detection and RTL reading order handling
- **[research/language-detection-and-script-handling.md](research/language-detection-and-script-handling.md)** — Unicode script identification, `whichlang` integration, and language tag propagation
- Script block ranges for Latin/Arabic/Hebrew/CJK/Devanagari/Thai
- Language tag inheritance from StructTree `/Lang` attribute
- **[research/cjk-and-asian-script-encoding.md](research/cjk-and-asian-script-encoding.md)** — CJK font encoding, multi-byte character code parsing, and vertical writing mode
- Shift-JIS, GB18030 (GBK), Big5, EUC-KR code page decoding
- Vertical glyph substitution, column-major reading order
- **[research/indic-script-extraction.md](research/indic-script-extraction.md)** — Devanagari, Tamil, Telugu, Bengali, and related abugida script extraction
- Akhand/matra glyph cluster reconstruction, halant handling
- Visual-order to logical-order reordering for Indic scripts
- **[research/southeast-asian-script-extraction.md](research/southeast-asian-script-extraction.md)** — Thai, Lao, Khmer, Burmese: scripts without inter-word spaces requiring segmentation
- Dictionary-based word segmentation for Thai/Lao/Khmer
- Stacked consonant cluster handling in Burmese/Khmer
- **[research/ruby-text-and-east-asian-typography.md](research/ruby-text-and-east-asian-typography.md)** — Japanese ruby (furigana) annotation extraction and East Asian typography conventions
- Ruby base/annotation text pair reconstruction
- Tate-chu-yoko (horizontal-in-vertical) mixed direction handling
- **[research/unicode-normalization-and-text-cleanup.md](research/unicode-normalization-and-text-cleanup.md)** — NFC normalization pipeline, combining character handling, and post-extraction cleanup
- Canonical decomposition + canonical composition (NFC) via `unicode-normalization`
- Zero-width joiner/non-joiner, byte order mark stripping
---
### Accessibility and Tagged PDF
Structure tree exploitation and accessibility standard compliance.
- **[research/accessibility-and-tagged-pdf-deep-dive.md](research/accessibility-and-tagged-pdf-deep-dive.md)** — PDF/UA-1 deep dive: structure tree contract, reading order derivation, and artifact suppression
- StructTreeRoot walk, RoleMap normalization, ActualText semantics
- Artifact classification (/Pagination, /Layout, /Background)
- **[research/tagged-pdf-structure-and-reading-order.md](research/tagged-pdf-structure-and-reading-order.md)** — Tagged PDF structure tree as authoritative reading order with MCID-to-span mapping
- ParentTree reverse lookup, StructElem type-to-block-kind mapping
- Suspects flag: when to fall back to XY-cut for coverage gaps
- **[research/pdfua2-and-accessibility-standards.md](research/pdfua2-and-accessibility-standards.md)** — PDF/UA-2 standard built on PDF 2.0 with updated structure requirements and WCAG alignment
- New artifact classification rules, associated file for MathML
- Namespace-qualified structure element types in PDF 2.0
- **[research/article-threads-and-reading-order.md](research/article-threads-and-reading-order.md)** — PDF article thread bead chains as multi-page reading order override for magazine layouts
- `/Threads` array, bead rect chains across non-contiguous pages
- Priority relative to structure tree and XY-cut ordering
---
## Full Document List (Alphabetical)
| Document | Category | One-Line Description |
|---|---|---|
| [accessibility-and-tagged-pdf-deep-dive.md](research/accessibility-and-tagged-pdf-deep-dive.md) | Accessibility | PDF/UA-1 structure tree contract, artifact suppression, ActualText semantics |
| [adversarial-inputs-and-parser-security.md](research/adversarial-inputs-and-parser-security.md) | Security | Decompression bombs, circular refs, resource exhaustion defense |
| [article-threads-and-reading-order.md](research/article-threads-and-reading-order.md) | Accessibility | Article thread bead chains as multi-page reading order override |
| [benchmark-and-test-methodology.md](research/benchmark-and-test-methodology.md) | Output & API | Test corpus design, CER metric, performance targets |
| [book-and-publishing-pdf-patterns.md](research/book-and-publishing-pdf-patterns.md) | Specialized Docs | Book PDFs: running headers, footnotes, sidebars, indices, TOC |
| [chunking-for-llm-consumption.md](research/chunking-for-llm-consumption.md) | Output & API | Block-boundary chunk splitting and overlap for RAG ingestion |
| [cjk-and-asian-script-encoding.md](research/cjk-and-asian-script-encoding.md) | Languages | CJK CIDFont encoding, multi-byte codes, vertical writing mode |
| [cmap-format-and-cid-encoding.md](research/cmap-format-and-cid-encoding.md) | Font & Encoding | ToUnicode CMap syntax: bfchar, bfrange, usecmap, ligature expansion |
| [color-management-and-icc-profiles.md](research/color-management-and-icc-profiles.md) | OCR & Image | ICC profile normalization for text visibility luminance estimation |
| [complex-layout-reading-order.md](research/complex-layout-reading-order.md) | Text Assembly | XY-cut algorithm for multi-column and mixed-layout documents |
| [confidence-scoring-and-aggregation.md](research/confidence-scoring-and-aggregation.md) | Text Assembly | Per-glyph confidence, span aggregation, readability score |
| [content-stream-concatenation.md](research/content-stream-concatenation.md) | Content Stream | Multi-stream pages, /Length mismatches, Form XObject sub-streams |
| [content-stream-operators.md](research/content-stream-operators.md) | Content Stream | Complete text operator reference: Tj, TJ, Td, Tm, Tf, Tr, BT/ET |
| [digital-signatures-and-certification.md](research/digital-signatures-and-certification.md) | Security | Sig field metadata, ByteRange, SubFilter, validation_status |
| [document-catalog-and-structure.md](research/document-catalog-and-structure.md) | File Format | Catalog keys, page tree traversal, inherited attribute resolution |
| [document-classification-and-zone-labeling.md](research/document-classification-and-zone-labeling.md) | Text Assembly | Body/header/footer/caption/margin zone heuristics |
| [embedded-files-and-portfolios.md](research/embedded-files-and-portfolios.md) | Output & API | EmbeddedFiles name tree, attachment metadata, portfolio structure |
| [engineering-document-extraction.md](research/engineering-document-extraction.md) | Specialized Docs | PDF/E-1 CAD exports: dimension annotations, title blocks, BOM tables |
| [error-handling-and-robustness.md](research/error-handling-and-robustness.md) | Security | Recoverable error model, diagnostic taxonomy, stage isolation |
| [extraction-output-schema.md](research/extraction-output-schema.md) | Output & API | Stable v1.0 JSON schema for all output fields |
| [extraction-pipeline-overview.md](research/extraction-pipeline-overview.md) | Start Here | End-to-end 9-stage architectural blueprint |
| [font-descriptor-and-metrics.md](research/font-descriptor-and-metrics.md) | Font & Encoding | FontDescriptor keys, Widths arrays, hmtx metrics |
| [font-subsetting-and-extraction.md](research/font-subsetting-and-extraction.md) | Font & Encoding | Subset naming, glyph table gaps, Standard 14 prefix stripping |
| [form-fields-and-annotations.md](research/form-fields-and-annotations.md) | Output & API | AcroForm field walk, XFA, annotation text types |
| [glyph-recognition-and-unicode-recovery.md](research/glyph-recognition-and-unicode-recovery.md) | Font & Encoding | Shape-hash Level 4 fallback, perceptual hash database |
| [government-form-pdf-patterns.md](research/government-form-pdf-patterns.md) | Specialized Docs | IRS/regulatory forms: flat print fields vs. AcroForm disambiguation |
| [graphics-state-tracking.md](research/graphics-state-tracking.md) | Content Stream | CTM, text matrix, q/Q stack, rendering mode state machine |
| [historical-and-degraded-document-extraction.md](research/historical-and-degraded-document-extraction.md) | OCR & Image | Microfilm/photocopy preprocessing, multi-pass OCR strategies |
| [hyperlinks-and-named-destinations.md](research/hyperlinks-and-named-destinations.md) | Output & API | URI annotations, GoTo actions, named destination resolution |
| [image-and-figure-extraction.md](research/image-and-figure-extraction.md) | OCR & Image | Image XObject identification, inline images, figure regions |
| [image-compression-and-filter-decoding.md](research/image-compression-and-filter-decoding.md) | File Format | All PDF filters: FlateDecode, LZW, ASCII85, DCT, JBIG2, JPX |
| [incremental-updates-and-versioning.md](research/incremental-updates-and-versioning.md) | File Format | Non-destructive PDF modification, revision history, soft redaction |
| [indic-script-extraction.md](research/indic-script-extraction.md) | Languages | Devanagari/Tamil/Bengali: cluster reconstruction, logical reordering |
| [invisible-and-hidden-text.md](research/invisible-and-hidden-text.md) | Content Stream | Tr=3 text, OCR layer patterns, include_invisible_text flag |
| [javascript-and-interactive-pdf-extraction.md](research/javascript-and-interactive-pdf-extraction.md) | Security | JavaScript detection, XFA dynamic content, contains_javascript flag |
| [language-detection-and-script-handling.md](research/language-detection-and-script-handling.md) | Languages | Unicode script identification, whichlang, BCP-47 tag assignment |
| [latex-and-scientific-pdf-patterns.md](research/latex-and-scientific-pdf-patterns.md) | Specialized Docs | LaTeX toolchain patterns: font stacks, microtype, missing ToUnicode |
| [legal-and-financial-pdf-patterns.md](research/legal-and-financial-pdf-patterns.md) | Specialized Docs | Legal briefs/contracts/filings: line numbers, Bates stamps, footnotes |
| [linearized-pdf-and-streaming.md](research/linearized-pdf-and-streaming.md) | File Format | Fast-web-view layout, two-xref structure, hint stream decoding |
| [malformed-pdf-repair-and-recovery.md](research/malformed-pdf-repair-and-recovery.md) | File Format | Forward scan fallback, truncation recovery, corruption taxonomy |
| [mathematical-expression-handling.md](research/mathematical-expression-handling.md) | Specialized Docs | Symbol font mapping, subscript/superscript, formula linearization |
| [medical-and-scientific-pdf-patterns.md](research/medical-and-scientific-pdf-patterns.md) | Specialized Docs | Dense scientific docs: equations, citations, footnotes, journal layouts |
| [multilingual-document-extraction.md](research/multilingual-document-extraction.md) | Languages | Mixed-script documents, per-span language, RTL reading order |
| [opentype-math-and-formula-extraction.md](research/opentype-math-and-formula-extraction.md) | Font & Encoding | OpenType MATH table, MathVariants, script glyph positioning |
| [optional-content-groups.md](research/optional-content-groups.md) | Content Stream | OCG layer state, BDC/EMC marked content, visibility filtering |
| [page-geometry-and-document-structure.md](research/page-geometry-and-document-structure.md) | File Format | Page boxes, coordinate systems, rotation matrix |
| [page-labels-and-outline-extraction.md](research/page-labels-and-outline-extraction.md) | Text Assembly | PageLabels number tree, outline walk, destination resolution |
| [pdfa-archival-extraction-guarantees.md](research/pdfa-archival-extraction-guarantees.md) | Specialized Docs | PDF/A conformance-derived fast-path guarantees and skips |
| [pdfa-compliance-and-extraction.md](research/pdfa-compliance-and-extraction.md) | Specialized Docs | PDF/A-1/2/3 conformance levels and their extraction implications |
| [pdf-encryption-and-security.md](research/pdf-encryption-and-security.md) | Security | Standard handler, RC4/AES decryption, password attempt sequence |
| [pdf-fonts-and-encoding.md](research/pdf-fonts-and-encoding.md) | Font & Encoding | All font types, encoding vectors, AGL, four-level fallback chain |
| [pdf-generator-quirks.md](research/pdf-generator-quirks.md) | File Format | Per-generator fingerprinting and spec-deviation workarounds |
| [pdf-object-model-and-data-types.md](research/pdf-object-model-and-data-types.md) | File Format | Eight PDF object types, reference semantics, generation numbers |
| [pdf-portfolio-and-attachments.md](research/pdf-portfolio-and-attachments.md) | Output & API | Portfolio collection schema, navigator sort, sub-document extraction |
| [pdf-specification.md](research/pdf-specification.md) | File Format | ISO 32000-1/2 file structure, xref, object streams, linearization |
| [pdfua2-and-accessibility-standards.md](research/pdfua2-and-accessibility-standards.md) | Accessibility | PDF/UA-2 on PDF 2.0: updated structure rules, WCAG alignment |
| [pdfvt-variable-transactional-printing.md](research/pdfvt-variable-transactional-printing.md) | Specialized Docs | PDF/VT DPart tree, record boundaries, reusable content streams |
| [pdfx-prepress-extraction.md](research/pdfx-prepress-extraction.md) | Specialized Docs | PDF/X print formats: spot colors, OutputIntent, TrimBox boundaries |
| [performance-and-streaming-architecture.md](research/performance-and-streaming-architecture.md) | Output & API | mmap I/O, rayon parallelism, NDJSON streaming, LRU object cache |
| [post-extraction-normalization.md](research/post-extraction-normalization.md) | Text Assembly | NFC normalization, whitespace cleanup, paragraph boundaries |
| [post-ocr-text-correction.md](research/post-ocr-text-correction.md) | OCR & Image | Systematic OCR error correction, dictionary validation, re-scoring |
| [presentation-and-spreadsheet-pdfs.md](research/presentation-and-spreadsheet-pdfs.md) | Specialized Docs | PowerPoint/Excel PDFs: slide structure, speaker notes, cell grids |
| [raster-ocr-pipeline.md](research/raster-ocr-pipeline.md) | OCR & Image | 300 DPI render, Sauvola, deskew, Tesseract HOCR integration |
| [redaction-detection-and-recovery.md](research/redaction-detection-and-recovery.md) | Security | True vs. soft redaction detection, black rectangle over text |
| [resource-dictionary-and-inheritance.md](research/resource-dictionary-and-inheritance.md) | Font & Encoding | Font/XObject namespace resolution, multi-level resource merging |
| [ruby-text-and-east-asian-typography.md](research/ruby-text-and-east-asian-typography.md) | Languages | Japanese ruby/furigana extraction, tate-chu-yoko mixed direction |
| [scanned-vs-vector-page-classification.md](research/scanned-vs-vector-page-classification.md) | OCR & Image | PageClass signals: image coverage, validity rate, Hybrid detection |
| [semantic-text-reconstruction.md](research/semantic-text-reconstruction.md) | Text Assembly | Hyphen removal, ligature expansion, line-end join heuristics |
| [shading-pattern-and-text-visibility.md](research/shading-pattern-and-text-visibility.md) | Content Stream | Luminance estimation for text-on-background visibility |
| [southeast-asian-script-extraction.md](research/southeast-asian-script-extraction.md) | Languages | Thai/Lao/Khmer/Burmese: word segmentation, stacked consonants |
| [span-merging-and-text-run-assembly.md](research/span-merging-and-text-run-assembly.md) | Text Assembly | Span boundary triggers, text run assembly, line merging pipeline |
| [stroke-and-outlined-text.md](research/stroke-and-outlined-text.md) | Content Stream | Rendering modes 1–7, stroke-only glyphs, clip path text |
| [table-structure-reconstruction.md](research/table-structure-reconstruction.md) | Text Assembly | Ruling-line grid, borderless alignment, cell assignment, merged cells |
| [tagged-pdf-structure-and-reading-order.md](research/tagged-pdf-structure-and-reading-order.md) | Accessibility | StructTree reading order, MCID mapping, Suspects fallback |
| [text-positioning-and-font-metrics.md](research/text-positioning-and-font-metrics.md) | Content Stream | Text state scalars: Tc, Tw, Tz, TL, Ts, and matrix accumulation |
| [text-readability-validation.md](research/text-readability-validation.md) | Text Assembly | Printable-char ratio, dictionary word rate, OCR fallback threshold |
| [type3-font-extraction.md](research/type3-font-extraction.md) | Font & Encoding | Type 3 CharProcs streams, per-glyph rasterization, shape recognition |
| [unicode-normalization-and-text-cleanup.md](research/unicode-normalization-and-text-cleanup.md) | Languages | NFC normalization, combining characters, ZWJ/ZWNJ, BOM stripping |
| [watermark-and-background-separation.md](research/watermark-and-background-separation.md) | Text Assembly | Watermark detection, repeated-pattern suppression, is_artifact flag |
| [word-boundary-reconstruction.md](research/word-boundary-reconstruction.md) | Content Stream | Inter-glyph gap thresholds, TJ kern detection, space insertion |
| [xmp-and-document-metadata.md](research/xmp-and-document-metadata.md) | Output & API | XMP RDF/XML parsing, /Info fallback, Dublin Core, conflict resolution |
| [xref-table-parsing-and-object-lookup.md](research/xref-table-parsing-and-object-lookup.md) | File Format | Traditional xref, xref streams, hybrid files, /Prev chain traversal |

@ -0,0 +1,94 @@
# Parallel Extraction Architecture, Thread Safety, and Shared Cache Design
## Overview
PDF text extraction is an embarrassingly parallel problem at the page level, but the parallelism opportunities are unevenly distributed. Exploiting multi-core hardware correctly requires careful separation of read-only shared state from mutable per-page state, a coherent font cache strategy, and explicit handling of components—like Tesseract OCR—that are not thread-safe by default. This document describes the architecture pdftract must implement to achieve correct, bounded-memory parallel extraction.
## Parallelism Opportunities
PDF pages are structurally independent once the cross-reference table (xref) has been parsed. The xref maps object numbers to byte offsets within the file, and object resolution (reading the raw bytes for a given object number) requires only a shared reference to the file buffer and the xref index—both read-only after parsing. This means page-level parallelism is safe as soon as the document header, xref, and document catalog have been loaded.
Within a single page, additional parallelism exists but is narrower. Font resolution involves looking up font dictionaries by resource name, decoding ToUnicode CMaps, and loading advance-width arrays. If a page references ten distinct fonts and none has been cached yet, those ten font initializations can in principle proceed in parallel. In practice, the overhead of spawning subtasks per font initialization is only worthwhile for fonts with expensive CMap decoding; simpler fonts are faster to initialize sequentially than to dispatch.
Image decoding for OCR is fully parallel at the image level. Each inline or XObject image on a page is an independent stream; decompression and rasterization of one image does not touch any state needed by another. However, as discussed below, Tesseract instances introduce a thread-safety constraint that moves OCR work out of the inline page-worker path entirely.
What cannot be parallelized: cross-page operations. The document outline (bookmarks), page label sequences, article threads, form field trees, and named destinations all reference multiple pages by page index and must be assembled after all per-page extraction is complete. These are extracted from the document catalog as a single sequential pass once the parallel page workers have finished.
## Shared Read-Only State
The xref index, raw file buffer (or memory-mapped file view), embedded font binaries, and ICC profiles are logically read-only once the document is open; decoded font data becomes read-only as soon as it has been constructed. pdftract wraps these in `Arc<T>` and clones the `Arc` into each page worker. No locking is needed for reads: `Arc<T>` hands out only shared `&T` references, so as long as `T` has no interior mutability, no thread holding a reference can mutate the data.
The document's cross-reference table is represented as `Arc<XrefTable>`, where `XrefTable` holds either a flat `Vec<XrefEntry>` for traditional xref sections or a compressed xref stream decoded at open time. Object bytes are resolved by indexing into the xref and reading the corresponding byte range from the shared file buffer, which is held as `Arc<Mmap>` (a read-only memory mapping). Because the OS page cache handles the actual I/O, concurrent reads to different byte ranges are safe and efficient.
ICC profile data and embedded font binaries (CFF, TrueType, Type1) are similarly wrapped in `Arc<Vec<u8>>` after first decode and shared across all page workers without additional synchronization.
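As a rough illustration of this sharing pattern, the sketch below clones an `Arc` into several threads that read different byte ranges concurrently without any lock. It is a simplified stand-in rather than pdftract's actual types: `Vec<u8>` replaces the memory map, the `len` field on `XrefEntry` and the `object_bytes` and `read_concurrently` helpers are hypothetical, and plain `std::thread` is used instead of Rayon to keep the example dependency-free.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical xref entry: byte offset plus a length (the length is an
/// assumption of this sketch; a real parser finds the object's end itself).
pub struct XrefEntry {
    pub offset: usize,
    pub len: usize,
}

/// Read-only state shared by every page worker.
pub struct DocumentShared {
    pub file: Arc<Vec<u8>>, // stands in for Arc<Mmap>
    pub xref: Arc<Vec<XrefEntry>>,
}

impl DocumentShared {
    /// Resolve an object number to its raw bytes; no lock is taken.
    pub fn object_bytes(&self, obj_num: usize) -> Option<&[u8]> {
        let entry = self.xref.get(obj_num)?;
        self.file.get(entry.offset..entry.offset + entry.len)
    }
}

/// Spawn one thread per requested object; each thread holds its own Arc clone.
pub fn read_concurrently(shared: Arc<DocumentShared>, obj_nums: Vec<usize>) -> Vec<Vec<u8>> {
    let handles: Vec<_> = obj_nums
        .into_iter()
        .map(|n| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                shared.object_bytes(n).map(|b| b.to_vec()).unwrap_or_default()
            })
        })
        .collect();
    // Joining in spawn order keeps the output aligned with the request order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Cloning the `Arc` only increments a reference count, so the per-worker cost is independent of the file size.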
## Mutable Per-Page State
Each page worker owns its own mutable state and never shares it. The PDF graphics state machine maintains a current transformation matrix (CTM), a graphics state stack (pushed and popped by the `q`/`Q` operators), a text matrix (`Tm`), a text line matrix (`Tlm`), and the active text state parameters: font and size (`Tf`, `Tfs`), character spacing (`Tc`), word spacing (`Tw`), horizontal scaling (`Th`), leading (`Tl`), and rendering mode (`Tmode`). These are the parameter names used in ISO 32000; the operators that set them are `Tf`, `Tc`, `Tw`, `Tz`, `TL`, and `Tr`. None of this state can be shared across pages: each page starts from a fresh initial graphics state, and the state evolves as operators are interpreted, so sharing it would require locking on every operator.
The MCID-to-glyph map—which maps marked content identifiers to the sequence of glyphs extracted within each marked content span—is built fresh per page and discarded after the page result is serialized. Similarly, the list of extracted text runs, bounding boxes, and font references is per-page and allocated on the page worker's stack or heap without contention.
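A minimal sketch of worker-owned state might look like the following. The struct and field names are illustrative assumptions, and only a few of the parameters listed above are modeled; the point is that `q` and `Q` operate on a stack that belongs to exactly one page worker, so no synchronization is involved.

```rust
/// Subset of the graphics state tracked while interpreting a content stream.
#[derive(Clone, Debug, PartialEq)]
pub struct GraphicsState {
    pub ctm: [f64; 6], // current transformation matrix [a b c d e f]
    pub char_spacing: f64,
    pub word_spacing: f64,
    pub font_size: f64,
}

impl Default for GraphicsState {
    fn default() -> Self {
        GraphicsState {
            ctm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], // identity matrix
            char_spacing: 0.0,
            word_spacing: 0.0,
            font_size: 0.0,
        }
    }
}

/// Mutable state owned by a single page worker.
#[derive(Default)]
pub struct PageState {
    pub current: GraphicsState,
    stack: Vec<GraphicsState>,
}

impl PageState {
    /// `q` operator: push a copy of the current state.
    pub fn save(&mut self) {
        self.stack.push(self.current.clone());
    }

    /// `Q` operator: pop the saved state. Returns false on an unbalanced `Q`
    /// so the caller can record a diagnostic instead of panicking.
    pub fn restore(&mut self) -> bool {
        match self.stack.pop() {
            Some(saved) => {
                self.current = saved;
                true
            }
            None => false,
        }
    }
}
```

Because a fresh `PageState` is created for every page, nothing one page does to its stack can leak into another page.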
## Rayon-Based Page Parallelism
pdftract uses Rayon's work-stealing thread pool for page parallelism. The extraction pipeline calls `(0..page_count).into_par_iter()` and maps each index to a `PageResult`:
```rust
use rayon::prelude::*;

// `collect()` on an indexed parallel iterator keeps results in page order.
let page_results: Vec<PageResult> = (0..page_count)
    .into_par_iter()
    .map(|page_index| extract_page(page_index, &shared))
    .collect();
```
`shared` is an `Arc<DocumentShared>` that bundles the xref table, file mapping, font cache, and objstm cache. Rayon returns results in index order because `collect()` on an indexed parallel iterator preserves the iterator's order; page workers may finish out of order internally, but the output `Vec<PageResult>` is always index-ordered. After `collect()` returns, the main thread extracts document-level structure from the catalog and assembles the final `ExtractionResult`.
## Font Cache Design
Fonts in PDF are referenced by resource name within a page's resource dictionary, which resolves to an indirect object reference—effectively a `u32` object number. The font cache is keyed by object number and shared across all page workers:
```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct FontCache {
    entries: RwLock<HashMap<u32, Arc<FontData>>>,
}
```
When a page worker encounters a font reference, it first acquires a read lock and checks the map. On a hit, it clones the `Arc<FontData>` and releases the read lock immediately, so the common case involves no write contention. On a miss, it drops the read lock, decodes the font (ToUnicode CMap, advance-width array, font metrics, embedding type), then acquires the write lock to insert the result; `std::sync::RwLock` has no atomic read-to-write upgrade, so the two acquisitions are separate. Because two workers might race to initialize the same font, the inserting worker must check the map again under the write lock and keep the existing entry if one is already present (double-checked insertion). The losing worker's decode is wasted work but harmless, since both decodes produce equivalent data.
`FontData` contains the pre-decoded Unicode mapping table, a compact `Vec<u16>` of advance widths indexed by glyph ID, and the font's bounding box and ascent/descent metrics. All fields are read-only once constructed, so the `Arc<FontData>` can be shared without further locking.
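The lookup sequence described in this section can be sketched as follows. This is one possible shape under stated assumptions, not the confirmed implementation: the `get_or_load` method name, the `decode` callback, and the single-field `FontData` are hypothetical simplifications.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Simplified stand-in for the decoded font (real data has more fields).
#[derive(Debug)]
pub struct FontData {
    pub widths: Vec<u16>,
}

pub struct FontCache {
    entries: RwLock<HashMap<u32, Arc<FontData>>>,
}

impl FontCache {
    pub fn new() -> Self {
        FontCache { entries: RwLock::new(HashMap::new()) }
    }

    /// Return the cached font for `obj_num`, decoding it on a miss.
    pub fn get_or_load<F>(&self, obj_num: u32, decode: F) -> Arc<FontData>
    where
        F: FnOnce() -> FontData,
    {
        // Fast path: shared read lock, released at the end of this block.
        {
            let map = self.entries.read().unwrap();
            if let Some(hit) = map.get(&obj_num) {
                return Arc::clone(hit);
            }
        }
        // Slow path: decode while holding no lock so other workers keep reading.
        let decoded = Arc::new(decode());
        // Re-check under the write lock: `entry` keeps an existing value if
        // another worker inserted the same font in the meantime.
        let mut map = self.entries.write().unwrap();
        Arc::clone(map.entry(obj_num).or_insert(decoded))
    }
}
```

Decoding outside the lock trades occasional duplicate work for shorter write-lock hold times, which matters when many pages share a handful of fonts.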
## OCR Task Queue
A Tesseract 5.x `TessBaseAPI` instance holds mutable recognition state and is not safe to share across threads; separate instances can run concurrently as long as each is confined to one thread. Rather than attempting to serialize Tesseract calls within the parallel page iterator, pdftract separates OCR into a second pass. During the first Rayon pass, pages identified as raster-only (no extractable text stream) produce a `PendingOcr` entry containing the rasterized image buffer. These entries are collected into a `Vec<PendingOcr>` after the first pass completes.
The OCR pass processes pending entries through a bounded thread pool where each worker thread owns exactly one `TessBaseAPI` instance stored in a `thread_local!` cell. The pool size is configurable (default: number of physical cores, not logical, since Tesseract is CPU-bound and hyperthreading yields diminishing returns for it). Each worker picks a `PendingOcr` from the queue, feeds the image to its thread-local Tesseract instance, and writes the resulting `OcrPageResult` to an output channel. This design avoids both contention on a shared Tesseract instance and the overhead of constructing and destroying an API handle per page.
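The worker-pool shape described above can be sketched with the standard library alone. `FakeOcrEngine` is a placeholder for a real Tesseract handle, and `run_ocr_pass`, `ENGINES_CREATED`, and the struct fields are hypothetical names used only for illustration; no actual Tesseract binding is called here.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Mutex};
use std::thread;

/// Counts engine constructions so the one-per-thread property is observable.
pub static ENGINES_CREATED: AtomicUsize = AtomicUsize::new(0);

/// Placeholder for a non-thread-safe OCR handle.
struct FakeOcrEngine;

impl FakeOcrEngine {
    fn new() -> Self {
        ENGINES_CREATED.fetch_add(1, Ordering::SeqCst);
        FakeOcrEngine
    }
    fn recognize(&mut self, image: &[u8]) -> String {
        format!("{} bytes recognized", image.len())
    }
}

thread_local! {
    // Lazily built the first time a worker thread touches it.
    static ENGINE: RefCell<FakeOcrEngine> = RefCell::new(FakeOcrEngine::new());
}

pub struct PendingOcr {
    pub page_index: usize,
    pub image: Vec<u8>,
}

pub struct OcrPageResult {
    pub page_index: usize,
    pub text: String,
}

pub fn run_ocr_pass(pending: Vec<PendingOcr>, workers: usize) -> Vec<OcrPageResult> {
    let queue = Mutex::new(VecDeque::from(pending));
    let (tx, rx) = mpsc::channel();
    thread::scope(|s| {
        for _ in 0..workers {
            let tx = tx.clone();
            let queue = &queue;
            s.spawn(move || loop {
                // Hold the queue lock only long enough to pop one job.
                let job = queue.lock().unwrap().pop_front();
                let Some(job) = job else { break };
                let text = ENGINE.with(|e| e.borrow_mut().recognize(&job.image));
                tx.send(OcrPageResult { page_index: job.page_index, text }).unwrap();
            });
        }
    });
    drop(tx); // close the channel so the receiver iterator terminates
    let mut results: Vec<OcrPageResult> = rx.into_iter().collect();
    results.sort_by_key(|r| r.page_index);
    results
}
```

The engine is constructed at most once per worker thread regardless of how many pages that thread processes, which is the property the `thread_local!` cell is meant to provide.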
## Object Stream Caching
PDF 1.5 and later allow objects to be packed into compressed object streams (`ObjStm`). A single `ObjStm` may contain objects referenced by dozens of pages. Decompressing and parsing an `ObjStm` is expensive—it involves flate decompression of potentially hundreds of kilobytes followed by PDF tokenization of the embedded objects.
pdftract caches decoded `ObjStm` content using `once_cell::sync::OnceCell<Arc<ObjStream>>` within the xref table entry for each object stream object. The first page worker to resolve an object whose xref entry points into an `ObjStm` decompresses the stream, parses all embedded objects into a `HashMap<u32, ParsedObject>`, and stores the result in the `OnceCell`. Every worker goes through `get_or_try_init()`, which either returns the already-initialized value or blocks until the initializing thread completes. Because the `sync` variant of `OnceCell` is thread-safe and stores a value at most once regardless of concurrent callers, no additional locking is needed around `ObjStm` access. If initialization fails, the cell stays empty and a later caller may retry the decode.
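The once-only decode can be demonstrated with a small sketch. To stay dependency-free it uses `std::sync::OnceLock` and its infallible `get_or_init` as a stand-in for the `once_cell` type named above; the `ObjStmEntry` wrapper, the `decode_count` counter, and the hard-coded object are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Decoded object stream: object number -> parsed object (a String here).
pub struct ObjStream {
    pub objects: HashMap<u32, String>,
}

/// Per-ObjStm cache slot that would live in the xref table.
pub struct ObjStmEntry {
    cell: OnceLock<Arc<ObjStream>>,
    pub decode_count: AtomicUsize,
}

impl ObjStmEntry {
    pub fn new() -> Self {
        ObjStmEntry { cell: OnceLock::new(), decode_count: AtomicUsize::new(0) }
    }

    /// Return the decoded stream, decoding it on first use only.
    pub fn get(&self) -> Arc<ObjStream> {
        Arc::clone(self.cell.get_or_init(|| {
            // Stand-in for flate decompression plus tokenization.
            self.decode_count.fetch_add(1, Ordering::SeqCst);
            let mut objects = HashMap::new();
            objects.insert(12, "<< /Type /Font >>".to_string());
            Arc::new(ObjStream { objects })
        }))
    }
}

/// Hit the same entry from several threads and report how often it decoded.
pub fn resolve_from_many_threads(entry: &ObjStmEntry, workers: usize) -> usize {
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| {
                let _ = entry.get();
            });
        }
    });
    entry.decode_count.load(Ordering::SeqCst)
}
```

Threads that arrive while the first decode is still running wait for it and then share the same `Arc`, so the expensive work happens once per object stream.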
## Memory Budget Across Parallel Pages
With N pages in flight simultaneously, peak memory is proportional to N times the per-page working set. For dense pages with large embedded images, the per-page working set can reach tens of megabytes. Unbounded parallelism on a large PDF would exhaust available memory before any results are emitted.
pdftract bounds this with a `tokio::sync::Semaphore` (or an equivalent counting semaphore) sized to `max_parallel_pages`. Each page worker acquires one permit before allocating its working buffers and releases the permit when its `PageResult` has been produced. The default value for `max_parallel_pages` is derived at startup by dividing a configurable `memory_budget_mb` by an estimated per-page cost (measured empirically and stored as a build constant, adjustable via environment variable). Users running in memory-constrained environments can set `PDFTRACT_MAX_PARALLEL_PAGES=4` to override. One caveat applies to the interaction with Rayon: a worker that blocks waiting for a permit parks its OS thread, and work stealing cannot schedule other tasks onto a parked thread. The semaphore therefore caps the number of in-flight pages but does not free the blocked thread for other work, so sizing the pool far above `max_parallel_pages` gains little.
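The budget arithmetic and the permit discipline can be sketched as below. Everything here is an illustrative assumption: the function and type names are hypothetical, the semaphore is a small blocking one built from `Mutex` and `Condvar` rather than the Tokio type, and the override is passed in as a string instead of being read from the environment.

```rust
use std::sync::{Condvar, Mutex};

/// Derive the page limit: an explicit positive override wins, otherwise
/// budget divided by the estimated per-page cost, never less than 1.
pub fn max_parallel_pages(budget_mb: u64, page_cost_mb: u64, env_override: Option<&str>) -> usize {
    if let Some(n) = env_override.and_then(|v| v.trim().parse::<usize>().ok()) {
        if n > 0 {
            return n;
        }
    }
    (budget_mb / page_cost_mb.max(1)).max(1) as usize
}

/// Minimal blocking counting semaphore.
pub struct PageSemaphore {
    permits: Mutex<usize>,
    freed: Condvar,
}

/// RAII permit: dropping it returns the slot to the semaphore.
pub struct Permit<'a> {
    sem: &'a PageSemaphore,
}

impl PageSemaphore {
    pub fn new(permits: usize) -> Self {
        PageSemaphore { permits: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Block until a permit is available, then take it.
    pub fn acquire(&self) -> Permit<'_> {
        let mut available = self.permits.lock().unwrap();
        while *available == 0 {
            available = self.freed.wait(available).unwrap();
        }
        *available -= 1;
        Permit { sem: self }
    }

    pub fn available(&self) -> usize {
        *self.permits.lock().unwrap()
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.freed.notify_one();
    }
}
```

With a 2048 MB budget and a 64 MB per-page estimate this yields 32 pages in flight. Tying the release to `Drop` means a permit is returned even if the page worker exits early through an error path.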
## Streaming Output with Parallel Extraction
NDJSON streaming mode emits one JSON object per page as soon as that page's result is available, but pages must be emitted in document order. Rayon's `collect()` waits for all pages before returning, which is incompatible with streaming.
For streaming mode, pdftract uses an ordered slot channel. Before launching the parallel iterator, an array of `N` `Option<PageResult>` slots is allocated (wrapped in a `Mutex<Vec<Slot>>`). Each page worker writes its result to `slots[page_index]`. A dedicated ordering thread runs concurrently, advancing a cursor from 0 upward: when `slots[cursor]` is `Some`, it takes the result, serializes it to stdout, and increments the cursor. This produces in-order output with little buffering in the common case, where only a few completed pages sit ahead of the cursor; if the page at the cursor is unusually slow, later results accumulate in their slots until it finishes. The ordering thread blocks on a condvar when the next slot is not yet ready, and page workers notify the condvar after writing their slot.
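A compact sketch of the slot-and-cursor mechanism follows. The `OrderedSlots` type and `emit_in_order` driver are hypothetical names, scoped `std` threads stand in for Rayon workers, and the output is collected into a `Vec<String>` instead of being written to stdout so the ordering is easy to inspect.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// One slot per page; workers fill slots, a consumer drains them in order.
pub struct OrderedSlots<T> {
    slots: Mutex<Vec<Option<T>>>,
    ready: Condvar,
}

impl<T> OrderedSlots<T> {
    pub fn new(n: usize) -> Self {
        OrderedSlots {
            slots: Mutex::new((0..n).map(|_| None).collect()),
            ready: Condvar::new(),
        }
    }

    /// Called by a page worker when its result is complete.
    pub fn put(&self, index: usize, value: T) {
        let mut slots = self.slots.lock().unwrap();
        slots[index] = Some(value);
        drop(slots);
        self.ready.notify_all();
    }

    /// Block until the slot at `cursor` is filled, then take its value.
    pub fn take_next(&self, cursor: usize) -> T {
        let mut slots = self.slots.lock().unwrap();
        loop {
            if let Some(value) = slots[cursor].take() {
                return value;
            }
            slots = self.ready.wait(slots).unwrap();
        }
    }
}

/// Run workers that finish in arbitrary order; return lines in page order.
pub fn emit_in_order(page_count: usize) -> Vec<String> {
    let slots = OrderedSlots::new(page_count);
    let mut emitted = Vec::with_capacity(page_count);
    thread::scope(|s| {
        // Spawn in reverse so completion order tends to differ from page order.
        for page_index in (0..page_count).rev() {
            let slots = &slots;
            s.spawn(move || slots.put(page_index, format!("{{\"page_index\":{page_index}}}")));
        }
        // The ordering loop runs on the current thread.
        for cursor in 0..page_count {
            emitted.push(slots.take_next(cursor));
        }
    });
    emitted
}
```

Taking the value out of the slot with `take()` frees the page's memory as soon as it has been emitted.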
## Error Isolation
A panic or extraction failure in one page worker must not abort other in-flight pages. pdftract wraps the body of each page worker in `std::panic::catch_unwind`:
```rust
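use std::panic::{catch_unwind, AssertUnwindSafe};

// AssertUnwindSafe asserts that shared state stays usable if this page panics.
let result = catch_unwind(AssertUnwindSafe(|| extract_page(page_index, &shared)));
match result {
    Ok(page_result) => page_result,
    Err(_) => PageResult::error(page_index, "internal extraction panic"),
}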
```
Non-panic errors (malformed content streams, unsupported filter types) are propagated as `Result` within `extract_page` and converted to `PageResult::Error` entries before returning. Failed pages emit a JSON error object in the output stream at their correct position—downstream consumers can detect and log them without breaking the overall parse. The document-level `extraction_quality` field in the final output reflects the fraction of pages that completed successfully, giving callers a machine-readable signal when partial extraction has occurred.
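The two failure paths and the resulting quality figure can be sketched together. This is an illustration under stated assumptions: the `PageResult` enum shape, the `extract_isolated` and `extraction_quality` helpers, and the toy `extract_page` that fails on fixed page indices are all hypothetical, and the quality value is computed simply as successful pages divided by total pages.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

#[derive(Debug, PartialEq)]
pub enum PageResult {
    Ok { page_index: usize, text: String },
    Error { page_index: usize, message: String },
}

/// Toy extractor: page 1 returns an error, page 2 panics, others succeed.
fn extract_page(page_index: usize) -> Result<String, String> {
    match page_index {
        1 => Err("unsupported filter".to_string()),
        2 => panic!("simulated bug in page worker"),
        _ => Ok(format!("text of page {page_index}")),
    }
}

/// Convert both ordinary errors and panics into a per-page error entry.
pub fn extract_isolated(page_index: usize) -> PageResult {
    // The wrapper is not strictly needed for this capture-free closure, but
    // it usually is once the closure borrows shared caches.
    match catch_unwind(AssertUnwindSafe(|| extract_page(page_index))) {
        Ok(Ok(text)) => PageResult::Ok { page_index, text },
        Ok(Err(message)) => PageResult::Error { page_index, message },
        Err(_) => PageResult::Error {
            page_index,
            message: "internal extraction panic".to_string(),
        },
    }
}

/// Fraction of pages that completed successfully (1.0 for an empty input,
/// which is an arbitrary choice made for this sketch).
pub fn extraction_quality(results: &[PageResult]) -> f64 {
    if results.is_empty() {
        return 1.0;
    }
    let ok = results
        .iter()
        .filter(|r| matches!(r, PageResult::Ok { .. }))
        .count();
    ok as f64 / results.len() as f64
}
```

Note that `catch_unwind` only intercepts unwinding panics; a build configured with `panic = "abort"` terminates the process instead, so this isolation depends on the default unwind strategy.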