pdftract/docs/research-index.md
jedarden 8753630bc3 Add parallel extraction research and comprehensive research index
New research document covering parallel extraction architecture:
rayon page-level parallelism, Arc<> shared xref/font/object-stream
caches, RwLock font cache design, Tesseract thread-local OCR pool,
semaphore memory budget, ordered NDJSON streaming slot array, and
catch_unwind error isolation per page.

Also adds docs/research-index.md: a 622-line navigable index of all
83 research documents grouped into 9 thematic categories, with a
"Start Here" reading path, per-phase implementation reading tables,
and an alphabetical lookup table covering every document.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:30:35 -04:00


pdftract Research Index

82 documents covering PDF internals, font encoding, OCR, layout reconstruction, specialized document types, security, output schema, and multilingual extraction.

This is a navigable reading guide, not a summary. Use it to find the right document when implementing a specific feature. All links are relative to this file's location (docs/).


Start Here

Read these six documents first, in order, before touching any other research:

  1. research/extraction-pipeline-overview.md — End-to-end architectural blueprint: all 9 pipeline stages, decision points, and data transformations. The canonical integration reference that ties every other document together.
  2. research/pdf-specification.md — ISO 32000-1/2 implementation reference: file structure, xref tables, object model, content streams, and encoding. The foundation for Phase 1.
  3. research/pdf-fonts-and-encoding.md — Every font type in PDF, character code → Unicode resolution, and the four-level fallback chain. Essential before any font work.
  4. research/content-stream-operators.md — Complete operator reference for text extraction: text state operators, positioning, rendering modes, and their effect on glyph output.
  5. research/extraction-output-schema.md — Stable v1.0 JSON schema: document, page, span, block, annotation, and form field structures. Read this before writing any serialization code.
  6. plan/implementation-plan.md — Seven-phase build plan with crate dependencies, critical tests, and milestone targets. The execution roadmap.

Reading Path for Implementation

Phase 1: Core PDF Parser

Build the lexer, object parser, xref resolution, document model, and stream decoder.

| Document | Why |
| --- | --- |
| research/pdf-specification.md | File structure, xref tables, object streams, linearized layout |
| research/pdf-object-model-and-data-types.md | The eight PDF object types, reference semantics, generation numbers |
| research/xref-table-parsing-and-object-lookup.md | Traditional xref, xref streams, hybrid files, incremental update chains |
| research/malformed-pdf-repair-and-recovery.md | Forward scan fallback, truncated file recovery, error taxonomy |
| research/document-catalog-and-structure.md | Catalog keys, page tree traversal, inherited attributes |
| research/page-geometry-and-document-structure.md | MediaBox/CropBox/BleedBox, rotation, coordinate systems |
| research/image-compression-and-filter-decoding.md | FlateDecode, LZW, ASCII85, RunLength, DCT, JBIG2, JPX filter chain |
| research/pdf-encryption-and-security.md | Standard handler, RC4/AES decryption, password attempt sequence |
| research/error-handling-and-robustness.md | Recoverable error model, diagnostic codes, graceful degradation |
| research/adversarial-inputs-and-parser-security.md | Decompression bombs, circular references, resource exhaustion limits |
| research/linearized-pdf-and-streaming.md | Two-xref-table layout, hint streams, fast-web-view parsing |
| research/incremental-updates-and-versioning.md | Append-only update model, revision history, redaction detection |

Phase 2: Font and Encoding Pipeline

Map every character code to a Unicode scalar value with a confidence score.

| Document | Why |
| --- | --- |
| research/pdf-fonts-and-encoding.md | All font types, encoding vectors, AGL lookup, four-level fallback |
| research/cmap-format-and-cid-encoding.md | ToUnicode CMap syntax: bfchar, bfrange, usecmap, UTF-16BE sequences |
| research/font-descriptor-and-metrics.md | FontDescriptor keys, width arrays, hmtx metrics, descriptor flags |
| research/font-subsetting-and-extraction.md | Subset naming convention, glyph table gaps, metric inference |
| research/glyph-recognition-and-unicode-recovery.md | Shape-hash database, perceptual hashing, Level 4 fallback |
| research/type3-font-extraction.md | CharProcs content streams, per-glyph rasterization, shape recognition |
| research/cjk-and-asian-script-encoding.md | CIDFont encoding, predefined CMaps, Shift-JIS/GB18030/Big5/EUC-KR |
| research/resource-dictionary-and-inheritance.md | Font namespace resolution, inherited resource merging, XObject refs |

Phase 3: Content Stream Processing

Execute content stream operators to produce a raw glyph list with positions.

| Document | Why |
| --- | --- |
| research/content-stream-operators.md | Full operator reference: Tj, TJ, Td, Tm, Tf, Tr, BT/ET, and all text operators |
| research/graphics-state-tracking.md | CTM, text matrix, q/Q stack, Tr rendering mode, color state |
| research/text-positioning-and-font-metrics.md | Text matrix accumulation, leading, Td/TD/T* semantics, rise |
| research/content-stream-concatenation.md | Multi-stream pages, /Length mismatch handling, resource namespace scoping |
| research/optional-content-groups.md | OCG layer state, BDC/EMC marked content sequences, visibility filtering |
| research/invisible-and-hidden-text.md | Tr=3 invisible text, OCR layer patterns, include_invisible_text flag |
| research/stroke-and-outlined-text.md | Rendering modes 1–7, stroke-only glyphs, clipping path text |
| research/shading-pattern-and-text-visibility.md | Luminance estimation for text-on-background visibility filtering |
| research/word-boundary-reconstruction.md | Inter-glyph gap thresholds, TJ kern detection, synthetic space insertion |

Phase 4: Text Assembly and Layout

Transform raw glyph lists into structured blocks in reading order.

| Document | Why |
| --- | --- |
| research/span-merging-and-text-run-assembly.md | Span boundary rules, run assembly, line merging, block formation |
| research/complex-layout-reading-order.md | XY-cut algorithm, multi-column detection, sidebar/caption disambiguation |
| research/tagged-pdf-structure-and-reading-order.md | Structure tree walk, MCID-to-span mapping, ActualText override |
| research/document-classification-and-zone-labeling.md | Zone classification: body/header/footer/caption/margin heuristics |
| research/watermark-and-background-separation.md | Watermark detection, repeated-pattern suppression, z-order analysis |
| research/semantic-text-reconstruction.md | Hyphen removal, soft-hyphen handling, ligature expansion, line joining |
| research/post-extraction-normalization.md | NFC normalization, whitespace cleanup, paragraph boundary detection |
| research/confidence-scoring-and-aggregation.md | Per-glyph confidence, span aggregation, readability score computation |
| research/text-readability-validation.md | Printable-character ratio, dictionary word rate, readability threshold |
| research/page-labels-and-outline-extraction.md | PageLabels number tree, outline bookmark walk, named destinations |

Phase 5: OCR Integration

Extract text from scanned pages; improve broken-vector pages via Tesseract.

| Document | Why |
| --- | --- |
| research/scanned-vs-vector-page-classification.md | Classification signals, PageClass enum, confidence scoring per signal |
| research/raster-ocr-pipeline.md | 300 DPI render, Sauvola binarization, deskew, Tesseract HOCR integration |
| research/post-ocr-text-correction.md | Systematic OCR errors, dictionary-based correction, confidence re-scoring |
| research/image-and-figure-extraction.md | Image XObject identification, inline images, figure region detection |
| research/historical-and-degraded-document-extraction.md | Microfilm, low-quality scan preprocessing, multi-pass OCR strategies |

Phase 6: Output and API

Full JSON schema, PyO3 bindings, HTTP serve mode, NDJSON streaming.

| Document | Why |
| --- | --- |
| research/extraction-output-schema.md | Complete v1.0 schema: all fields, types, and serialization constraints |
| research/performance-and-streaming-architecture.md | mmap I/O, rayon page parallelism, BufWriter NDJSON, LRU object cache |
| research/hyperlinks-and-named-destinations.md | URI annotations, GoTo actions, cross-document links, named dest resolution |
| research/xmp-and-document-metadata.md | XMP RDF/XML parsing, /Info dict fallback, Dublin Core fields, conflict resolution |
| research/chunking-for-llm-consumption.md | Chunk size strategy, block-boundary splitting, overlap, token estimation |
| research/benchmark-and-test-methodology.md | Test corpus design, accuracy metrics, performance benchmarks, regression suite |

Phase 7: Advanced Features

StructTree exploitation, table detection, AcroForm/XFA, attachments, signatures.

| Document | Why |
| --- | --- |
| research/accessibility-and-tagged-pdf-deep-dive.md | StructTree walk, role mapping, artifact suppression, ActualText semantics |
| research/tagged-pdf-structure-and-reading-order.md | ParentTree, MCID resolution, Suspects flag fallback |
| research/table-structure-reconstruction.md | Ruling-line detection, borderless heuristics, cell assignment, merged cells |
| research/form-fields-and-annotations.md | AcroForm field walk, XFA extraction, annotation types and values |
| research/digital-signatures-and-certification.md | Sig field metadata, ByteRange, SubFilter, validation_status reporting |
| research/embedded-files-and-portfolios.md | EmbeddedFiles name tree, portfolio navigation, attachment metadata |
| research/pdf-portfolio-and-attachments.md | PDF Portfolio collections, navigator schema, attachment content access |
| research/article-threads-and-reading-order.md | /Threads bead chains, multi-page article flow, reading order override |

Document Categories

PDF File Format and Parsing

Core specification, file structure, object model, and xref resolution.

  • research/pdf-specification.md — ISO 32000-1/2 implementation reference covering file structure, xref tables, object streams, and linearized layout

    • File header, body, xref, trailer structure
    • Xref streams (PDF 1.5+), object streams, incremental updates
    • Content stream grammar and operator categorization
  • research/pdf-object-model-and-data-types.md — The eight fundamental PDF object types with serialization details and reference semantics

    • Boolean, integer, real, string, name, array, dictionary, stream
    • Indirect object references, generation numbers, object identity
  • research/xref-table-parsing-and-object-lookup.md — Cross-reference parsing strategies: traditional table, xref streams, hybrid files, and incremental chains

    • Traditional 20-byte xref entry format and subsection merging
    • Type-0/1/2 xref stream entries, /W field widths
    • /Prev chain traversal for incremental updates
  • research/document-catalog-and-structure.md — Document catalog keys, page tree traversal, and inherited attribute resolution

    • /Root, /Pages, /Outlines, /AcroForm, /MarkInfo, /OCProperties
    • Page tree flattening, per-key inheritance walk, attribute override rules
  • research/page-geometry-and-document-structure.md — Page boxes, coordinate systems, rotation, and media/crop/bleed/trim/art box semantics

    • MediaBox, CropBox, BleedBox, TrimBox, ArtBox inheritance
    • User space to device space transformation, rotation matrix application
  • research/linearized-pdf-and-streaming.md — Linearized ("fast web view") PDF layout: two-xref structure, hint streams, and first-page parsing

    • Linearization dictionary keys and their parsing implications
    • Hint table decoding, page offset table, shared object table
  • research/incremental-updates-and-versioning.md — Non-destructive append-only PDF modification: revision history, object shadowing, and redaction forensics

    • Incremental save mechanics, object number reuse across revisions
    • Detecting "soft redaction" where content is hidden but not deleted
  • research/malformed-pdf-repair-and-recovery.md — Forward scan fallback, truncation recovery, and the full taxonomy of real-world PDF corruption

    • Forward object scan from obj/endobj markers when xref fails
    • Per-corruption diagnostic codes and recovery strategies
  • research/pdf-generator-quirks.md — Per-generator fingerprinting and known deviations from spec for Word, LibreOffice, Chrome, LaTeX, and others

    • Generator detection via /Producer field patterns
    • Quirk-specific workarounds keyed to generator fingerprint
  • research/resource-dictionary-and-inheritance.md — Font, XObject, ExtGState, and pattern namespace resolution with inheritance from ancestor page nodes

    • Multi-level resource merging, last-write-wins semantics at page level
    • Namespace isolation between content streams and Form XObjects
  • research/image-compression-and-filter-decoding.md — All PDF stream filters: FlateDecode predictors, LZW, ASCII85, RunLength, DCT, JBIG2, JPX, CCITT

    • Filter pipeline chaining, /DecodeParms alignment
    • Partial decode recovery on zlib truncation errors
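
As a rough illustration of the traditional 20-byte xref entry format listed under xref-table-parsing-and-object-lookup.md above, the sketch below parses a single entry. The names `XrefEntry` and `parse_xref_entry` are hypothetical and chosen for this example only; they are not taken from pdftract. The sketch assumes a strictly well-formed entry (10-digit offset, space, 5-digit generation, space, `n` or `f`, two end-of-line bytes) and rejects anything else, leaving repair to a separate recovery path.

```rust
/// One entry of a traditional cross-reference table (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum XrefEntry {
    /// Object is stored at `offset` bytes from the start of the file.
    InUse { offset: u64, generation: u16 },
    /// Object is free; `next_free` is the object number of the next free object.
    Free { next_free: u64, generation: u16 },
}

/// Parse exactly one 20-byte entry: "nnnnnnnnnn ggggg n" plus a 2-byte EOL.
/// Returns `None` for anything that is not strictly well-formed.
pub fn parse_xref_entry(raw: &[u8]) -> Option<XrefEntry> {
    if raw.len() != 20 || raw[10] != b' ' || raw[16] != b' ' {
        return None;
    }
    let first = parse_digits(&raw[0..10])?;
    // Five digits can exceed u16::MAX; the conversion rejects those values.
    let generation = u16::try_from(parse_digits(&raw[11..16])?).ok()?;
    // The two trailing bytes must be whitespace: SP LF, SP CR, or CR LF.
    if !raw[18..20].iter().all(|b| matches!(b, b' ' | b'\r' | b'\n')) {
        return None;
    }
    match raw[17] {
        b'n' => Some(XrefEntry::InUse { offset: first, generation }),
        b'f' => Some(XrefEntry::Free { next_free: first, generation }),
        _ => None,
    }
}

/// Convert a run of ASCII digits to an integer without allocating.
fn parse_digits(bytes: &[u8]) -> Option<u64> {
    let mut value: u64 = 0;
    for &b in bytes {
        if !b.is_ascii_digit() {
            return None;
        }
        value = value * 10 + u64::from(b - b'0');
    }
    Some(value)
}
```

Returning `None` instead of guessing keeps the strict parser simple; a lenient variant for malformed files would belong with the forward-scan fallback rather than here.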

Font and Encoding

Everything needed to map a character code to a Unicode codepoint.

  • research/pdf-fonts-and-encoding.md — Complete font type reference and the four-level Unicode resolution fallback chain

    • Type1, TrueType, Type0/CID, Type3, OpenType font loading strategies
    • ToUnicode → AGL → fingerprint → shape recognition fallback
  • research/cmap-format-and-cid-encoding.md — ToUnicode CMap program syntax and CID encoding for composite fonts

    • beginbfchar/beginbfrange, usecmap, UTF-16BE ligature expansion
    • Predefined CMap names for CJK scripts, CID-to-GID mapping
  • research/font-descriptor-and-metrics.md — FontDescriptor dictionary keys, width arrays, hmtx table access, and descriptor flags

    • /Widths, /FirstChar, /LastChar, /MissingWidth for Type1/TrueType
    • /DW, /W sparse width encoding for CIDFonts
  • research/font-subsetting-and-extraction.md — Subset font naming convention, glyph table gaps, and metrics inference for missing glyphs

    • Six-uppercase-letter prefix stripping for Standard 14 lookup
    • Identifying and handling subsetting-induced ToUnicode gaps
  • research/glyph-recognition-and-unicode-recovery.md — Shape-hash database construction and perceptual hashing for Level 4 Unicode recovery

    • 32×32 bitmap rendering, perceptual hash lookup database
    • Shape-match confidence scoring and known-font fingerprint cache
  • research/type3-font-extraction.md — Type 3 fonts: glyph shapes as content stream fragments requiring per-glyph rasterization

    • /CharProcs dictionary parsing, glyph content stream execution
    • Color/grayscale Type 3 glyph rendering for shape recognition
  • research/cjk-and-asian-script-encoding.md — CJK CIDFont encoding, predefined CMaps, and multi-byte character code parsing

    • Shift-JIS, GB18030, Big5, EUC-KR codepage decoding via encoding_rs
    • Vertical writing mode detection and glyph substitution
  • research/opentype-math-and-formula-extraction.md — OpenType MATH table layout and text-extraction strategy for mathematical formulas

    • MathVariants, MathKern, script/superscript glyph positioning
    • Linearized formula extraction vs. MathML reconstruction tradeoffs
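
The six-uppercase-letter prefix stripping mentioned under font-subsetting-and-extraction.md above can be sketched as follows. The function name `strip_subset_prefix` is hypothetical. The sketch assumes the common subset tag shape of six ASCII uppercase letters followed by `+` (for example `ABCDEF+Helvetica`) and returns the name unchanged when that shape is not present.

```rust
/// Remove a subset tag such as "ABCDEF+" from a BaseFont name so the
/// remainder can be looked up against a table of known font names.
/// Names without a well-formed tag are returned unchanged.
pub fn strip_subset_prefix(base_font: &str) -> &str {
    let bytes = base_font.as_bytes();
    let has_tag = bytes.len() > 7
        && bytes[6] == b'+'
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase());
    if has_tag {
        // The first seven bytes are ASCII, so index 7 is a char boundary.
        &base_font[7..]
    } else {
        base_font
    }
}
```

Borrowing from the input avoids an allocation per font, which matters little here but keeps the helper cheap to call from a font cache.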

Content Stream Processing

Executing the PDF painting model to produce glyphs with positions.

  • research/content-stream-operators.md — Full PDF operator reference for text extraction: every text, positioning, and state operator

    • Tj, TJ, ', " operators; Td, TD, Tm, T* positioning
    • BT/ET block semantics, Tf font selection, Tr rendering mode
  • research/graphics-state-tracking.md — Complete graphics state machine: CTM, text matrix stack, color state, and rendering mode

    • q/Q push/pop, CTM concatenation via cm operator
    • Tr modes 0–7 and their visibility implications for extraction
  • research/text-positioning-and-font-metrics.md — Text state scalar accumulation, leading, Td/TD/T* semantics, and rise

    • Text matrix vs. text line matrix distinction
    • Character spacing (Tc), word spacing (Tw), horizontal scaling (Tz)
  • research/content-stream-concatenation.md — Multi-stream page assembly, /Length mismatch handling, and resource namespace scoping

    • Pages with array /Contents, stream boundary handling
    • Form XObject execution and resource inheritance within sub-streams
  • research/optional-content-groups.md — OCG layer state tracking, BDC/EMC marked content, and visibility-based glyph suppression

    • /OCProperties catalog entry, OCG on/off state resolution
    • OCMD (optional content membership dictionary) logic operators
  • research/invisible-and-hidden-text.md — Tr=3 invisible text, PDF/A OCR layer patterns, and the include_invisible_text flag

    • Scanned-PDF OCR text layer architecture (visible image + hidden text)
    • White-on-white and zero-font-size hidden text detection
  • research/stroke-and-outlined-text.md — Text rendering modes 1–7, stroke-only glyphs, and clipping path text handling

    • Mode 1 (stroke), mode 2 (fill+stroke), mode 7 (clip only, nothing painted)
    • Outlined text in logos and headings where fill is absent
  • research/word-boundary-reconstruction.md — Inter-glyph gap thresholds, TJ kern detection, and synthetic space character insertion

    • TeX/LaTeX character-per-glyph patterns, missing inter-word spaces
    • Horizontal gap normalized by font size as word-boundary signal
  • research/shading-pattern-and-text-visibility.md — Color space luminance estimation for text-on-background visibility and watermark filtering

    • ICC profile color space normalization to approximate luminance
    • Pattern and shading fills as background suppression candidates
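
The word-boundary signal described under word-boundary-reconstruction.md above (horizontal gap normalized by font size) might look like the sketch below. The `Glyph` struct, the `join_with_spaces` function, and the threshold passed as `gap_ratio` are all assumptions made for this example; the research document defines the actual thresholds. The sketch also assumes glyphs on a single horizontal line, already sorted left to right.

```rust
/// A positioned glyph on one text line (hypothetical, simplified type).
#[derive(Debug, Clone, Copy)]
pub struct Glyph {
    pub ch: char,
    /// Left edge of the glyph in user-space units.
    pub x: f32,
    /// Horizontal advance of the glyph in user-space units.
    pub advance: f32,
    pub font_size: f32,
}

/// Join glyphs into a string, inserting a synthetic space whenever the gap
/// between two neighbours exceeds `gap_ratio` times the font size.
pub fn join_with_spaces(glyphs: &[Glyph], gap_ratio: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Glyph> = None;
    for glyph in glyphs {
        if let Some(p) = prev {
            // Gap between the end of the previous glyph and the start of this one.
            let gap = glyph.x - (p.x + p.advance);
            let wide = p.font_size > 0.0 && gap / p.font_size > gap_ratio;
            // Do not double up when the stream already contains a real space.
            if wide && p.ch != ' ' && glyph.ch != ' ' {
                out.push(' ');
            }
        }
        out.push(glyph.ch);
        prev = Some(glyph);
    }
    out
}
```

Dividing by font size makes one threshold usable across headings and body text; a fixed absolute gap would over-split large type and under-split small type.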

Text Assembly and Layout Reconstruction

Transforming raw glyph lists into ordered, structured text.

  • research/span-merging-and-text-run-assembly.md — Span boundary detection, text run assembly, line merging, and block formation pipeline

    • Font/size/color/mode change triggers for span splits
    • Ascending y-position sort, baseline alignment for line grouping
  • research/complex-layout-reading-order.md — XY-cut recursive page partitioning for multi-column, sidebar, and mixed-layout documents

    • Whitespace gap detection as column separator, minimum gap thresholds
    • Caption-to-figure association, margin note classification
  • research/tagged-pdf-structure-and-reading-order.md — Structure tree as authoritative reading order source for tagged documents

    • StructElem type mapping to block kinds, ParentTree MCID lookup
    • Suspects flag validation and XY-cut fallback conditions
  • research/document-classification-and-zone-labeling.md — Page zone classification: body text, header, footer, caption, margin, and running head

    • Spatial heuristics: y-position thresholds, font size ratios, repetition
    • Zone label influence on reading order and block kind assignment
  • research/watermark-and-background-separation.md — Watermark detection, repeated-pattern suppression, and z-order layering analysis

    • Low-opacity text, diagonal text, large-font centered text as watermark signals
    • Suppression vs. flagging strategies, is_artifact span flag
  • research/semantic-text-reconstruction.md — Hyphen removal, soft-hyphen handling, ligature expansion, and line-end join heuristics

    • End-of-line hyphen detection and word reconstitution
    • Ligature Unicode expansion (fi, fl, ffi, ffl, st)
  • research/post-extraction-normalization.md — NFC normalization, whitespace cleanup, and paragraph boundary detection

    • Unicode combining character normalization pipeline
    • Trailing/leading whitespace removal, control character stripping
  • research/confidence-scoring-and-aggregation.md — Per-glyph confidence scoring, span aggregation, and page-level readability score computation

    • Source-weighted confidence: to_unicode=1.0, agl=0.9, fingerprint=0.85, shape_match=0.7
    • Span minimum confidence, page aggregate, extraction_quality rollup
  • research/text-readability-validation.md — Printable-character ratio, dictionary word rate, and readability threshold for OCR fallback triggering

    • Character validity checks: printable Unicode, non-sentinel codepoints
    • Readability score thresholds for ocr_fallback_threshold comparison
  • research/page-labels-and-outline-extraction.md — PageLabels number tree parsing, outline bookmark walk, and named destination resolution

    • Roman/Arabic/alphabetic label prefix and style encoding
    • /Outlines recursive walk, /Dests name tree lookup
  • research/table-structure-reconstruction.md — Ruling-line grid detection, borderless column alignment, cell assignment, and merged cell inference

    • Path segment clustering, intersection point detection, grid construction
    • Colspan/rowspan inference from missing interior edges
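
The source weights and span-minimum rule listed under confidence-scoring-and-aggregation.md above can be sketched like this. The weights come from the list; the type and function names (`ConfidenceSource`, `span_confidence`) are hypothetical, and treating an empty span as "no score" is an assumption of this example.

```rust
/// Where a glyph's Unicode value came from (hypothetical enum; the weights
/// are the ones quoted in the index entry above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceSource {
    ToUnicode,
    Agl,
    Fingerprint,
    ShapeMatch,
}

impl ConfidenceSource {
    /// Confidence assigned to a glyph resolved through this source.
    pub fn weight(self) -> f32 {
        match self {
            ConfidenceSource::ToUnicode => 1.0,
            ConfidenceSource::Agl => 0.9,
            ConfidenceSource::Fingerprint => 0.85,
            ConfidenceSource::ShapeMatch => 0.7,
        }
    }
}

/// Span confidence as the minimum over its glyphs, so a single weakly
/// resolved glyph lowers the score of the whole span.
/// Returns `None` for an empty span (an assumption of this sketch).
pub fn span_confidence(glyph_sources: &[ConfidenceSource]) -> Option<f32> {
    glyph_sources.iter().map(|s| s.weight()).reduce(f32::min)
}
```

Using the minimum rather than the mean is deliberately pessimistic: a span of twenty ToUnicode glyphs and one shape-matched glyph scores 0.7, which flags it for review instead of hiding the weak glyph in an average.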

OCR and Image Processing

Handling scanned and raster pages.

  • research/scanned-vs-vector-page-classification.md — Classification signals and the PageClass enum (Vector/Scanned/Hybrid/BrokenVector)

    • Image coverage fraction threshold, character validity rate signals
    • 8×8 grid cell per-region classification for Hybrid detection
  • research/raster-ocr-pipeline.md — Full OCR pipeline: 300 DPI rendering, Sauvola binarization, Hough deskew, Tesseract HOCR

    • leptonica-plumbing preprocessing, HOCR confidence parsing
    • BrokenVector assisted-OCR mode: bounding box seeding from vector positions
  • research/post-ocr-text-correction.md — Systematic OCR error patterns, dictionary-based correction, and post-OCR confidence re-scoring

    • Common substitution errors (l/1/I, O/0, rn/m), context correction
    • Language model probability scoring for candidate correction ranking
  • research/image-and-figure-extraction.md — Image XObject identification, inline image parsing, and figure region demarcation

    • XObject type detection, image placement matrix, DPI computation
    • Figure caption association and alt-text extraction from StructTree
  • research/historical-and-degraded-document-extraction.md — Preprocessing for microfilm, low-quality photocopies, and physically degraded originals

    • Multi-pass binarization strategies, fold/crease artifact suppression
    • Tesseract PSM mode selection for degraded layout recognition
  • research/color-management-and-icc-profiles.md — ICC profile color spaces and luminance estimation for text visibility determination

    • CMYK/Lab/ICCBased color space normalization to approximate RGB
    • Spot color handling, DeviceGray/DeviceRGB identity conversion

Specialized Document Types

Documents with structural patterns that require targeted handling.

  • research/latex-and-scientific-pdf-patterns.md — LaTeX toolchain patterns: pdflatex, XeLaTeX, LuaLaTeX, and their encoding behaviors

    • Type1/OTF font stacks, microtype spacing, missing ToUnicode patterns
    • Figure/table float placement, bibliography link detection
  • research/medical-and-scientific-pdf-patterns.md — Dense mixed-content scientific documents: figures, tables, equations, citations, footnotes

    • Multi-column layout with equation regions, journal template patterns
    • Citation/reference block detection and DOI link extraction
  • research/mathematical-expression-handling.md — Mathematical notation extraction strategies across encoding schemes

    • Symbol font mapping (Symbol, STIX, XITS, Computer Modern math)
    • Subscript/superscript detection, operator precedence linearization
  • research/legal-and-financial-pdf-patterns.md — Legal briefs, contracts, financial filings: line numbers, Bates stamps, footnote styles

    • Court filing format patterns (federal, state), header/footer extraction
    • Financial table dense-number extraction, currency symbol handling
  • research/government-form-pdf-patterns.md — Government form PDFs: IRS, regulatory filings, mixed AcroForm/print-field layouts

    • Form field label-to-value association across non-AcroForm "flat" forms
    • Instruction text vs. fillable field disambiguation
  • research/book-and-publishing-pdf-patterns.md — Book PDF structural complexity: running headers, footnotes, sidebars, indices, TOC

    • Chapter/section boundary detection, page number extraction
    • Index entry reconstruction, cross-reference link resolution
  • research/engineering-document-extraction.md — PDF/E-1 engineering documents: CAD-exported PDFs, technical drawing annotation extraction

    • Dimension annotation text, title block field extraction
    • Revision table parsing, BOM (bill of materials) table detection
  • research/presentation-and-spreadsheet-pdfs.md — PowerPoint and Excel PDFs: slide structure, speaker notes, sheet grids, frozen headers

    • Slide bounding box as implicit zone boundary, note text association
    • Spreadsheet cell grid reconstruction from absolute-positioned text
  • research/pdfa-compliance-and-extraction.md — PDF/A conformance levels and their extraction guarantees and fast-path optimizations

    • PDF/A-1a/1b, 2a/2b/2u, 3a/3b/3u conformance constraints
    • ToUnicode guarantee in PDF/A-1a, mandatory tagging in PDF/A-2a
  • research/pdfa-archival-extraction-guarantees.md — Specific extraction guarantees derivable from PDF/A conformance, enabling fast-path skips

    • Level-specific ToUnicode presence guarantees, XMP metadata mandates
    • Conformance-driven fallback skipping to improve throughput
  • research/pdfx-prepress-extraction.md — PDF/X print production formats: spot colors, bleed marks, output intent profiles

    • PDF/X-1a, X-3, X-4, X-6 conformance constraints
    • OutputIntent ICC profile, TrimBox/BleedBox as canonical page boundaries
  • research/pdfua2-and-accessibility-standards.md — PDF/UA-2 (ISO 14289-2) built on PDF 2.0: updated structure requirements and WCAG alignment

    • Namespace-qualified structure types, artifact classification changes
    • Associated file attachment for MathML, pronunciation dictionaries
  • research/pdfvt-variable-transactional-printing.md — PDF/VT variable and transactional printing: DPart tree, record boundary, reusable content

    • DPart metadata extraction, record-per-recipient text variation
    • Reusable content stream (RCS) handling, page piece dictionary

Security and Robustness

Handling adversarial inputs, encryption, redaction, and JavaScript.

  • research/adversarial-inputs-and-parser-security.md — Concrete attack classes and defensive techniques for production PDF parsing

    • Decompression bombs: stream size limits, inflation ratio caps
    • Circular reference guards, stack depth limits, object count caps
  • research/pdf-encryption-and-security.md — Standard security handler, RC4 and AES decryption, certificate handlers, and password resolution

    • /V, /R, /Length (key length in bits), /CF, /StmF, /StrF handler fields
    • Empty-password-first attempt sequence, unsupported handler error path
  • research/error-handling-and-robustness.md — Recoverable error model, diagnostic code taxonomy, and graceful degradation across all stages

    • No-panic guarantee in library code, per-error diagnostic entries
    • Stage-level error isolation: one page failure does not abort others
  • research/redaction-detection-and-recovery.md — Distinguishing true redaction from soft redaction; detecting content beneath covering rectangles

    • Black rectangle over text detection, opacity-0 text identification
    • /Redact annotation type, incremental update soft-redaction forensics
  • research/javascript-and-interactive-pdf-extraction.md — JavaScript detection, dynamic content identification, and extraction strategy for interactive PDFs

    • /JS action detection, contains_javascript metadata flag
    • XFA dynamic form extraction vs. static snapshot fallback
  • research/digital-signatures-and-certification.md — Digital signature field metadata extraction and ByteRange coverage reporting

    • Sig field walk, /ByteRange, /SubFilter format identification
    • Certification vs. approval signature distinction, validation_status field
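
The decompression-bomb defenses listed under adversarial-inputs-and-parser-security.md above (stream size limits and inflation ratio caps) can be illustrated with a small budget tracker. Everything here is an assumption for the sake of the example: the names `InflateBudget` and `BudgetError`, the idea of calling `record` once per decoded chunk, and the specific limits a caller would pass in.

```rust
/// Why decoding was stopped (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub enum BudgetError {
    /// Decoded output exceeded the absolute size limit.
    OutputTooLarge,
    /// Decoded output exceeded `max_ratio` times the compressed size.
    RatioExceeded,
}

/// Tracks decoded bytes for one stream against two independent caps.
pub struct InflateBudget {
    compressed_len: u64,
    max_output: u64,
    max_ratio: u64,
    produced: u64,
}

impl InflateBudget {
    pub fn new(compressed_len: u64, max_output: u64, max_ratio: u64) -> Self {
        InflateBudget { compressed_len, max_output, max_ratio, produced: 0 }
    }

    /// Call after each decoded chunk; an `Err` means the decoder should stop.
    pub fn record(&mut self, chunk_len: usize) -> Result<(), BudgetError> {
        self.produced = self.produced.saturating_add(chunk_len as u64);
        // Treat an empty compressed stream as length 1 so the ratio cap
        // still permits a small amount of output.
        let ratio_limit = self.compressed_len.max(1).saturating_mul(self.max_ratio);
        if self.produced > self.max_output {
            Err(BudgetError::OutputTooLarge)
        } else if self.produced > ratio_limit {
            Err(BudgetError::RatioExceeded)
        } else {
            Ok(())
        }
    }
}
```

Checking per chunk rather than after the fact means the decoder can abort before the oversized output is ever fully materialized in memory.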

Output, API, and Metadata

Schema, serialization, and document-level metadata extraction.

  • research/extraction-output-schema.md — Stable v1.0 JSON schema: full field inventory for document, page, span, block, form, and annotation output

    • Document-level metadata, outline, page array, extraction_quality
    • Span and block structs, confidence sources, block kind enum
  • research/xmp-and-document-metadata.md — XMP RDF/XML parsing, /Info dict fallback, Dublin Core fields, and XMP-vs-Info conflict resolution

    • pdfaid:conformance, dc:title, pdf:Producer namespace fields
    • XMP priority over /Info in PDF 1.4+ documents
  • research/hyperlinks-and-named-destinations.md — URI annotations, GoTo actions, named destination resolution, and internal navigation link extraction

    • /Annots Link annotation type, /A action dictionary
    • /Dests name tree, /Names catalog entry, cross-document GoToR
  • research/page-labels-and-outline-extraction.md — PageLabels number tree, outline bookmark traversal, destination types

    • /S label style (D/r/R/A/a), /P prefix, /St start value
    • /Outlines /First, /Next, /Last linked-list walk
  • research/form-fields-and-annotations.md — AcroForm field hierarchy, XFA extraction, and annotation text (highlights, stamps, notes)

    • /Fields array walk, field type detection (Tx/Btn/Ch/Sig)
    • Annotation subtypes: Highlight, StrikeOut, FreeText, Stamp, Link
  • research/embedded-files-and-portfolios.md — EmbeddedFiles name tree navigation, attachment metadata, and portfolio structure

    • /EmbeddedFiles name tree, /EF dictionary, file stream access
    • Portfolio /Collection schema, navigator sort order
  • research/pdf-portfolio-and-attachments.md — PDF Portfolio collections: navigator schema, attachment content access, and sub-document extraction

    • /Collection fields array, sort key extraction
    • Recursive extraction of PDF attachments within portfolios
  • research/performance-and-streaming-architecture.md — Memory-mapped I/O, rayon page parallelism, NDJSON streaming, and LRU object cache design

    • mmap + madvise(MADV_SEQUENTIAL) on content streams
    • BufWriter<Stdout> NDJSON, page-level rayon scatter/gather
  • research/chunking-for-llm-consumption.md — Block-boundary chunk splitting, overlap strategy, and token count estimation for RAG ingestion

    • Heading-aware chunk boundaries, table/figure keep-together rules
    • Overlap window sizing, chunk metadata (page_index, block_ids)
  • research/benchmark-and-test-methodology.md — Test corpus design, extraction accuracy metrics, performance benchmarks, and regression suite

    • Ground-truth corpus construction, character error rate (CER) metric
    • Performance targets: throughput pages/sec, memory ceiling per process
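
The page label fields listed under page-labels-and-outline-extraction.md above (/S style, /P prefix, /St start value) combine into a label roughly as sketched below. The names `LabelStyle` and `format_label` are hypothetical. The sketch assumes the caller has already computed the numeric value for the page (the range's /St plus the page's offset within the range) and that this value is at least 1.

```rust
/// Numbering style from the /S entry of a page label dictionary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LabelStyle {
    Decimal,
    UpperRoman,
    LowerRoman,
    UpperAlpha,
    LowerAlpha,
}

impl LabelStyle {
    /// Map the /S name (without the leading slash) to a style.
    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "D" => Some(LabelStyle::Decimal),
            "R" => Some(LabelStyle::UpperRoman),
            "r" => Some(LabelStyle::LowerRoman),
            "A" => Some(LabelStyle::UpperAlpha),
            "a" => Some(LabelStyle::LowerAlpha),
            _ => None,
        }
    }
}

/// Build a label from prefix and number. With no style, the label is the
/// prefix alone.
pub fn format_label(style: Option<LabelStyle>, prefix: &str, number: u32) -> String {
    let numeric = match style {
        None => String::new(),
        Some(LabelStyle::Decimal) => number.to_string(),
        Some(LabelStyle::UpperRoman) => roman(number),
        Some(LabelStyle::LowerRoman) => roman(number).to_lowercase(),
        Some(LabelStyle::UpperAlpha) => alpha(number, b'A'),
        Some(LabelStyle::LowerAlpha) => alpha(number, b'a'),
    };
    format!("{}{}", prefix, numeric)
}

fn roman(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
        (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
        (5, "V"), (4, "IV"), (1, "I"),
    ];
    let mut out = String::new();
    for (value, digits) in TABLE {
        while n >= value {
            out.push_str(digits);
            n -= value;
        }
    }
    out
}

/// A..Z for 1..=26, then AA..ZZ for 27..=52, and so on (letters repeat).
fn alpha(n: u32, base: u8) -> String {
    if n == 0 {
        return String::new();
    }
    let letter = (base + ((n - 1) % 26) as u8) as char;
    let repeats = ((n - 1) / 26 + 1) as usize;
    std::iter::repeat(letter).take(repeats).collect()
}
```

Note that the alphabetic style repeats the same letter (page 28 is `BB`, not `AB`), which differs from spreadsheet-style column naming.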

Languages, Scripts, and Multilingual Documents

Non-Latin script handling, bidirectional text, and language detection.

  • research/multilingual-document-extraction.md — Mixed-script documents combining Latin with Arabic, Hebrew, CJK, and other scripts

    • Per-span language detection, BCP-47 tag assignment
    • Bidi paragraph detection and RTL reading order handling
  • research/language-detection-and-script-handling.md — Unicode script identification, whichlang integration, and language tag propagation

    • Script block ranges for Latin/Arabic/Hebrew/CJK/Devanagari/Thai
    • Language tag inheritance from StructTree /Lang attribute
  • research/cjk-and-asian-script-encoding.md — CJK font encoding, multi-byte character code parsing, and vertical writing mode

    • Shift-JIS, GB18030 (GBK), Big5, EUC-KR code page decoding
    • Vertical glyph substitution, column-major reading order
  • research/indic-script-extraction.md — Devanagari, Tamil, Telugu, Bengali, and related abugida script extraction

    • Akhand/matra glyph cluster reconstruction, halant handling
    • Visual-order to logical-order reordering for Indic scripts
  • research/southeast-asian-script-extraction.md — Thai, Lao, Khmer, Burmese: scripts without inter-word spaces requiring segmentation

    • Dictionary-based word segmentation for Thai/Lao/Khmer
    • Stacked consonant cluster handling in Burmese/Khmer
  • research/ruby-text-and-east-asian-typography.md — Japanese ruby (furigana) annotation extraction and East Asian typography conventions

    • Ruby base/annotation text pair reconstruction
    • Tate-chu-yoko (horizontal-in-vertical) mixed direction handling
  • research/unicode-normalization-and-text-cleanup.md — NFC normalization pipeline, combining character handling, and post-extraction cleanup

    • Canonical decomposition + canonical composition (NFC) via unicode-normalization
    • Zero-width joiner/non-joiner, byte order mark stripping
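
The script block ranges mentioned under language-detection-and-script-handling.md can be sketched roughly as follows. This is a simplified, block-level approximation that covers only the primary Unicode block of each script; the `Script` enum and the `script_of` and `dominant_script` functions are hypothetical names chosen for illustration, and the research document may use different ranges or rely on whichlang instead.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Latin,
    Arabic,
    Hebrew,
    Devanagari,
    Thai,
    Cjk,
    Other,
}

/// Classify a character by its primary Unicode block (approximation:
/// extension blocks and a few Common-script symbols are not separated out).
pub fn script_of(c: char) -> Script {
    match c as u32 {
        // ASCII letters plus Latin-1 Supplement letters through Latin Extended-B
        0x0041..=0x005A | 0x0061..=0x007A | 0x00C0..=0x024F => Script::Latin,
        0x0590..=0x05FF => Script::Hebrew,
        0x0600..=0x06FF => Script::Arabic,
        0x0900..=0x097F => Script::Devanagari,
        0x0E00..=0x0E7F => Script::Thai,
        // Hiragana, Katakana, CJK Unified Ideographs, Hangul Syllables
        0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7AF => Script::Cjk,
        _ => Script::Other,
    }
}

/// Return the most frequent script in `text`, ignoring digits, punctuation,
/// and anything else classified as Other. Returns Other if nothing matches.
pub fn dominant_script(text: &str) -> Script {
    const ORDER: [Script; 6] = [
        Script::Latin,
        Script::Arabic,
        Script::Hebrew,
        Script::Devanagari,
        Script::Thai,
        Script::Cjk,
    ];
    let mut counts = [0usize; 6];
    for c in text.chars() {
        let s = script_of(c);
        if let Some(i) = ORDER.iter().position(|o| *o == s) {
            counts[i] += 1;
        }
    }
    let mut best = Script::Other;
    let mut best_count = 0;
    for (i, &n) in counts.iter().enumerate() {
        if n > best_count {
            best = ORDER[i];
            best_count = n;
        }
    }
    best
}
```

A per-span call such as `dominant_script(span_text)` could feed a later language detector, on the assumption that script is resolved before a BCP-47 tag is assigned.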

Accessibility and Tagged PDF

Structure tree exploitation and accessibility standard compliance.

  • research/accessibility-and-tagged-pdf-deep-dive.md — PDF/UA-1 deep dive: structure tree contract, reading order derivation, and artifact suppression

    • StructTreeRoot walk, RoleMap normalization, ActualText semantics
    • Artifact classification (/Pagination, /Layout, /Background)
  • research/tagged-pdf-structure-and-reading-order.md — Tagged PDF structure tree as authoritative reading order with MCID-to-span mapping

    • ParentTree reverse lookup, StructElem type-to-block-kind mapping
    • Suspects flag: when to fall back to XY-cut for coverage gaps
  • research/pdfua2-and-accessibility-standards.md — PDF/UA-2 standard built on PDF 2.0 with updated structure requirements and WCAG alignment

    • New artifact classification rules, associated file for MathML
    • Namespace-qualified structure element types in PDF 2.0
  • research/article-threads-and-reading-order.md — PDF article thread bead chains as multi-page reading order override for magazine layouts

    • /Threads array, bead rect chains across non-contiguous pages
    • Priority relative to structure tree and XY-cut ordering
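
The RoleMap normalization listed under accessibility-and-tagged-pdf-deep-dive.md can be illustrated with a small sketch. It assumes the RoleMap has already been parsed into a `HashMap<String, String>` and follows chained mappings until a standard structure type is reached. The `resolve_role` function and the abbreviated `STANDARD_TYPES` list are hypothetical; the real standard set is larger, and this sketch treats standard types as terminal, which is a simplification.

```rust
use std::collections::{HashMap, HashSet};

/// Abbreviated subset of the standard structure types (illustrative only).
const STANDARD_TYPES: &[&str] = &[
    "Document", "Part", "Sect", "Div", "P", "H", "H1", "H2", "H3", "H4", "H5", "H6",
    "L", "LI", "Lbl", "LBody", "Table", "TR", "TH", "TD", "Span", "Figure", "Formula",
    "Note", "Caption",
];

/// Follow RoleMap entries from `tag` until a standard type is found.
/// Returns None if the chain ends at an unmapped custom type or loops.
pub fn resolve_role<'a>(role_map: &'a HashMap<String, String>, tag: &'a str) -> Option<&'a str> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut current = tag;
    loop {
        if STANDARD_TYPES.contains(&current) {
            return Some(current);
        }
        // A repeated name means the RoleMap contains a cycle; stop rather than spin.
        if !seen.insert(current) {
            return None;
        }
        // An unmapped custom type ends the chain with no result.
        current = role_map.get(current)?.as_str();
    }
}
```

The cycle guard matters for malformed files, where a RoleMap may map two custom types to each other. A caller could treat a `None` result as an unknown block kind rather than failing the page.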

Full Document List (Alphabetical)

| Document | Category | One-Line Description |
| --- | --- | --- |
| accessibility-and-tagged-pdf-deep-dive.md | Accessibility | PDF/UA-1 structure tree contract, artifact suppression, ActualText semantics |
| adversarial-inputs-and-parser-security.md | Security | Decompression bombs, circular refs, resource exhaustion defense |
| article-threads-and-reading-order.md | Accessibility | Article thread bead chains as multi-page reading order override |
| benchmark-and-test-methodology.md | Output & API | Test corpus design, CER metric, performance targets |
| book-and-publishing-pdf-patterns.md | Specialized Docs | Book PDFs: running headers, footnotes, sidebars, indices, TOC |
| chunking-for-llm-consumption.md | Output & API | Block-boundary chunk splitting and overlap for RAG ingestion |
| cjk-and-asian-script-encoding.md | Languages | CJK CIDFont encoding, multi-byte codes, vertical writing mode |
| cmap-format-and-cid-encoding.md | Font & Encoding | ToUnicode CMap syntax: bfchar, bfrange, usecmap, ligature expansion |
| color-management-and-icc-profiles.md | OCR & Image | ICC profile normalization for text visibility luminance estimation |
| complex-layout-reading-order.md | Text Assembly | XY-cut algorithm for multi-column and mixed-layout documents |
| confidence-scoring-and-aggregation.md | Text Assembly | Per-glyph confidence, span aggregation, readability score |
| content-stream-concatenation.md | Content Stream | Multi-stream pages, /Length mismatches, Form XObject sub-streams |
| content-stream-operators.md | Content Stream | Complete text operator reference: Tj, TJ, Td, Tm, Tf, Tr, BT/ET |
| digital-signatures-and-certification.md | Security | Sig field metadata, ByteRange, SubFilter, validation_status |
| document-catalog-and-structure.md | File Format | Catalog keys, page tree traversal, inherited attribute resolution |
| document-classification-and-zone-labeling.md | Text Assembly | Body/header/footer/caption/margin zone heuristics |
| embedded-files-and-portfolios.md | Output & API | EmbeddedFiles name tree, attachment metadata, portfolio structure |
| engineering-document-extraction.md | Specialized Docs | PDF/E-1 CAD exports: dimension annotations, title blocks, BOM tables |
| error-handling-and-robustness.md | Security | Recoverable error model, diagnostic taxonomy, stage isolation |
| extraction-output-schema.md | Output & API | Stable v1.0 JSON schema for all output fields |
| extraction-pipeline-overview.md | Start Here | End-to-end 9-stage architectural blueprint |
| font-descriptor-and-metrics.md | Font & Encoding | FontDescriptor keys, Widths arrays, hmtx metrics |
| font-subsetting-and-extraction.md | Font & Encoding | Subset naming, glyph table gaps, Standard 14 prefix stripping |
| form-fields-and-annotations.md | Output & API | AcroForm field walk, XFA, annotation text types |
| glyph-recognition-and-unicode-recovery.md | Font & Encoding | Shape-hash Level 4 fallback, perceptual hash database |
| government-form-pdf-patterns.md | Specialized Docs | IRS/regulatory forms: flat print fields vs. AcroForm disambiguation |
| graphics-state-tracking.md | Content Stream | CTM, text matrix, q/Q stack, rendering mode state machine |
| historical-and-degraded-document-extraction.md | OCR & Image | Microfilm/photocopy preprocessing, multi-pass OCR strategies |
| hyperlinks-and-named-destinations.md | Output & API | URI annotations, GoTo actions, named destination resolution |
| image-and-figure-extraction.md | OCR & Image | Image XObject identification, inline images, figure regions |
| image-compression-and-filter-decoding.md | File Format | All PDF filters: FlateDecode, LZW, ASCII85, DCT, JBIG2, JPX |
| incremental-updates-and-versioning.md | File Format | Non-destructive PDF modification, revision history, soft redaction |
| indic-script-extraction.md | Languages | Devanagari/Tamil/Bengali: cluster reconstruction, logical reordering |
| invisible-and-hidden-text.md | Content Stream | Tr=3 text, OCR layer patterns, include_invisible_text flag |
| javascript-and-interactive-pdf-extraction.md | Security | JavaScript detection, XFA dynamic content, contains_javascript flag |
| language-detection-and-script-handling.md | Languages | Unicode script identification, whichlang, BCP-47 tag assignment |
| latex-and-scientific-pdf-patterns.md | Specialized Docs | LaTeX toolchain patterns: font stacks, microtype, missing ToUnicode |
| legal-and-financial-pdf-patterns.md | Specialized Docs | Legal briefs/contracts/filings: line numbers, Bates stamps, footnotes |
| linearized-pdf-and-streaming.md | File Format | Fast-web-view layout, two-xref structure, hint stream decoding |
| malformed-pdf-repair-and-recovery.md | File Format | Forward scan fallback, truncation recovery, corruption taxonomy |
| mathematical-expression-handling.md | Specialized Docs | Symbol font mapping, subscript/superscript, formula linearization |
| medical-and-scientific-pdf-patterns.md | Specialized Docs | Dense scientific docs: equations, citations, footnotes, journal layouts |
| multilingual-document-extraction.md | Languages | Mixed-script documents, per-span language, RTL reading order |
| opentype-math-and-formula-extraction.md | Font & Encoding | OpenType MATH table, MathVariants, script glyph positioning |
| optional-content-groups.md | Content Stream | OCG layer state, BDC/EMC marked content, visibility filtering |
| page-geometry-and-document-structure.md | File Format | Page boxes, coordinate systems, rotation matrix |
| page-labels-and-outline-extraction.md | Text Assembly | PageLabels number tree, outline walk, destination resolution |
| pdfa-archival-extraction-guarantees.md | Specialized Docs | PDF/A conformance-derived fast-path guarantees and skips |
| pdfa-compliance-and-extraction.md | Specialized Docs | PDF/A-1/2/3 conformance levels and their extraction implications |
| pdf-encryption-and-security.md | Security | Standard handler, RC4/AES decryption, password attempt sequence |
| pdf-fonts-and-encoding.md | Font & Encoding | All font types, encoding vectors, AGL, four-level fallback chain |
| pdf-generator-quirks.md | File Format | Per-generator fingerprinting and spec-deviation workarounds |
| pdf-object-model-and-data-types.md | File Format | Eight PDF object types, reference semantics, generation numbers |
| pdf-portfolio-and-attachments.md | Output & API | Portfolio collection schema, navigator sort, sub-document extraction |
| pdf-specification.md | File Format | ISO 32000-1/2 file structure, xref, object streams, linearization |
| pdfua2-and-accessibility-standards.md | Accessibility | PDF/UA-2 on PDF 2.0: updated structure rules, WCAG alignment |
| pdfvt-variable-transactional-printing.md | Specialized Docs | PDF/VT DPart tree, record boundaries, reusable content streams |
| pdfx-prepress-extraction.md | Specialized Docs | PDF/X print formats: spot colors, OutputIntent, TrimBox boundaries |
| performance-and-streaming-architecture.md | Output & API | mmap I/O, rayon parallelism, NDJSON streaming, LRU object cache |
| post-extraction-normalization.md | Text Assembly | NFC normalization, whitespace cleanup, paragraph boundaries |
| post-ocr-text-correction.md | OCR & Image | Systematic OCR error correction, dictionary validation, re-scoring |
| presentation-and-spreadsheet-pdfs.md | Specialized Docs | PowerPoint/Excel PDFs: slide structure, speaker notes, cell grids |
| raster-ocr-pipeline.md | OCR & Image | 300 DPI render, Sauvola, deskew, Tesseract HOCR integration |
| redaction-detection-and-recovery.md | Security | True vs. soft redaction detection, black rectangle over text |
| resource-dictionary-and-inheritance.md | Font & Encoding | Font/XObject namespace resolution, multi-level resource merging |
| ruby-text-and-east-asian-typography.md | Languages | Japanese ruby/furigana extraction, tate-chu-yoko mixed direction |
| scanned-vs-vector-page-classification.md | OCR & Image | PageClass signals: image coverage, validity rate, Hybrid detection |
| semantic-text-reconstruction.md | Text Assembly | Hyphen removal, ligature expansion, line-end join heuristics |
| shading-pattern-and-text-visibility.md | Content Stream | Luminance estimation for text-on-background visibility |
| southeast-asian-script-extraction.md | Languages | Thai/Lao/Khmer/Burmese: word segmentation, stacked consonants |
| span-merging-and-text-run-assembly.md | Text Assembly | Span boundary triggers, text run assembly, line merging pipeline |
| stroke-and-outlined-text.md | Content Stream | Rendering modes 1–7, stroke-only glyphs, clip path text |
| table-structure-reconstruction.md | Text Assembly | Ruling-line grid, borderless alignment, cell assignment, merged cells |
| tagged-pdf-structure-and-reading-order.md | Accessibility | StructTree reading order, MCID mapping, Suspects fallback |
| text-positioning-and-font-metrics.md | Content Stream | Text state scalars: Tc, Tw, Tz, TL, Ts, and matrix accumulation |
| text-readability-validation.md | Text Assembly | Printable-char ratio, dictionary word rate, OCR fallback threshold |
| type3-font-extraction.md | Font & Encoding | Type 3 CharProcs streams, per-glyph rasterization, shape recognition |
| unicode-normalization-and-text-cleanup.md | Languages | NFC normalization, combining characters, ZWJ/ZWNJ, BOM stripping |
| watermark-and-background-separation.md | Text Assembly | Watermark detection, repeated-pattern suppression, is_artifact flag |
| word-boundary-reconstruction.md | Content Stream | Inter-glyph gap thresholds, TJ kern detection, space insertion |
| xmp-and-document-metadata.md | Output & API | XMP RDF/XML parsing, /Info fallback, Dublin Core, conflict resolution |
| xref-table-parsing-and-object-lookup.md | File Format | Traditional xref, xref streams, hybrid files, /Prev chain traversal |