jedarden/pdftract

Author	SHA1	Message	Date
jedarden	bcccc98fd7	docs(plan): fix 30 gaps from Round 1 gap review CRITICAL fixes: - Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix) - Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed - Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch - Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2) - Add aes + rc4 crates under new decrypt feature; document crypto dependency HIGH fixes: - Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source) - Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition - Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching - Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation - Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params - Add JavaScript detection spec to Phase 1.4 (all four JS action locations) - Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives) - Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB - Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach - Add ConfidenceSource enum → schema string mapping table in Phase 4.1 MEDIUM fixes: - Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable - Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection - Define Color enum with CSS hex conversion rules in Phase 3.1 - Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>) - Specify NDJSON buffer Condvar blocking behavior at window saturation - Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets - Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition - Add code and formula block kind detection heuristics to Phase 4.4 - Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS) - Add linearized PDF detection and dual-xref merge to Phase 1.3 - Add HTTP 413 to error table with custom JSON rejection handler - Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate) LOW fixes: - Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5 - Reorder preprocessing pipeline: contrast normalization before binarization (was after) - Add CIDToGIDMap stream form: 2-byte big-endian GID array Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:45:04 -04:00
jedarden	d161d109b3	docs(plan): revise plan to center accuracy/speed/weight as hard targets - Add Primary Objectives section with CI-gated measurable targets: accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s, 10x vs pdfminer), weight (<4MB default binary, <20 default deps) - Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional; default build is core extraction + CLI only - Add Phase 4.7: text readability validation and correction pipeline (ligature repair, hyphenation, mojibake detection, readability scoring) - Make pdfium-render explicitly optional (full-render feature) vs. the always-present direct image compositing path - Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber) - Remove jpeg-decoder and whichlang from dependency matrix (unnecessary) - Rename implementation-plan.md → plan.md (matches CLAUDE.md reference) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:07:48 -04:00
jedarden	8753630bc3	Add parallel extraction research and comprehensive research index New research document covering parallel extraction architecture: rayon page-level parallelism, Arc<> shared xref/font/object-stream caches, RwLock font cache design, Tesseract thread-local OCR pool, semaphore memory budget, ordered NDJSON streaming slot array, and catch_unwind error isolation per page. Also adds docs/research-index.md: a 622-line navigable index of all 83 research documents grouped into 9 thematic categories, with a "Start Here" reading path, per-phase implementation reading tables, and an alphabetical lookup table covering every document. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:30:35 -04:00
jedarden	92e6196ac5	Add research: Ruby/furigana typography, PDF/VT variable printing Two new research documents covering Japanese Ruby text and East Asian typography (tagged/untagged furigana extraction, Kinsoku Shori spacing, full-width normalization, tate-chu-yoko, CJK/Latin boundary detection, ruby_text output field) and PDF/VT variable and transactional printing (DPart hierarchy traversal, per-record extraction model, DPM metadata, variable vs. static content classification, postal address extraction, records array output schema). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:24:21 -04:00
jedarden	e3b72efc83	Add research: Southeast Asian scripts, OpenType MATH formula extraction Two new research documents covering Southeast Asian script extraction (Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table exploitation for formula extraction (MathConstants for fraction/ subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML output generation, GlyphAssembly reconstruction, alternative text and MathJax XMP source recovery). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:21:48 -04:00
jedarden	4e72c66763	Add research: Indic scripts, adversarial parser security Two new research documents covering Indic script extraction (abugida structure, ToUnicode CMap failures for shaped glyphs, ActualText fast-path, GSUB lookup reversal, pre-base matra reordering, virama placement, Tesseract fallback with script-specific models) and adversarial input handling (decompression bombs, circular references, malformed stream lengths, path traversal in attachments, content stream loop detection, O(n log n) algorithm requirements, output sanitization). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:18:03 -04:00
jedarden	12fad41596	Add research: span merging, Unicode normalization, implementation plan Two new research documents covering the glyph-to-span-to-block assembly pipeline (inter-operator merging, adaptive word gap threshold, column detection, ligature bbox splitting, multi-granularity output) and Unicode post-processing (NFC normalization, selective NFKC decomposition for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ handling, combining character reordering). Also adds docs/plan/implementation-plan.md: the full 7-phase Rust implementation roadmap covering core parser, font/encoding pipeline, content stream processing, text assembly, OCR integration, API surface, and advanced features — with crate selections, complexity ratings, test strategy, and v0.1–v1.0 release milestones. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:15:14 -04:00
jedarden	6b96d8d637	Add research: error handling, PDF/A guarantees, output schema, generator quirks Four new extraction research documents covering permissive error handling with extraction quality signaling (five error classes, circular reference detection, memory limits), PDF/A conformance level guarantees and fast-path optimization (Level A skips OCR and layout heuristics), the complete extraction output schema (span/block/table/NDJSON streaming/ versioning), and per-generator extraction quirks (Word/LibreOffice/ InDesign/LaTeX/Chrome/Ghostscript/scanners). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:07:13 -04:00
jedarden	a89fef64fc	Add research: article threads, resource dictionaries, catalog, hyperlinks Four new extraction research documents covering PDF article thread traversal for multi-flow magazine layouts, resource dictionary inheritance and ResourceStack semantics for nested Form XObjects, document catalog and page tree structure (UserUnit, Contents array, page inheritance), and hyperlink/named destination extraction with QuadPoints anchor text and link density classification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:04:00 -04:00
jedarden	16cb1bd61d	Add research: xref parsing, object model, font descriptors, PDF/UA-2 Four new extraction research documents covering cross-reference table and xref stream parsing with error recovery, PDF object model and lexer correctness (all 8 types, string escapes, stream /Length recovery), FontDescriptor fields and embedded font data (Type1/TrueType/CFF/OT), and PDF/UA-2 / PDF 2.0 structure changes (MathML, NFC normalization, new structure types, artifact classification improvements). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:01:34 -04:00
jedarden	6c6ec6a4ca	Add research: color management, text metrics, PDF/X, content stream operators Four new extraction research documents covering ICC profile and color space luminance estimation for text visibility, precise text state tracking and bounding box computation (Tc/Tw/Tz/TL, font units, TJ kerning, baseline clustering), PDF/X prepress handling (OutputIntent, TrimBox, spot colors, article threading), and a complete content stream operator reference (BT/ET, Tj/TJ/'/", BI/ID/EI, BX/EX, marked content). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:59:02 -04:00
jedarden	516ca154aa	Add research: page labels, government forms, book publishing, filter decoding Four new extraction research documents covering page label/PageLabels number tree and outline/bookmark tree extraction, government form PDF patterns (IRS, USCIS, court filings, classification markings), book and publishing PDF structure (running heads, footnotes, index extraction), and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global segments, CCITTFax, JPX, error boundaries). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:55:08 -04:00
jedarden	5ff918b178	Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms Four new extraction research documents covering PDF portfolio and attachment enumeration (ZUGFeRD, PDF/A-3 AFRelationship), incremental update structure and xref chaining, PDF/UA tagged PDF deep dive with all 36 structure types and MCID mechanics, and JavaScript/AcroForm/XFA field extraction without script execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:45:59 -04:00
jedarden	006dfb286c	Add research: color visibility, medical/scientific, multilingual, digital signatures Four new extraction research documents covering color space and contrast analysis for text visibility, medical/scientific document structure (ICH E3, IMRaD, FDA labeling, eCTD), multilingual mixed-script extraction with UBA bidi handling and CJK vertical text, and digital signature metadata extraction with DocMDP integrity context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:41:43 -04:00
jedarden	eac3235291	Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs Four new extraction research documents covering text rendering modes (Tr 0-7 including invisible OCR layers), legal/financial document extraction patterns, character-level confidence aggregation with output schema, and PDF/E engineering document handling (CAD, GD&T, schematics). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:35:48 -04:00
jedarden	8f8138a65e	Add research: font subsetting, LaTeX patterns, redaction detection Three new extraction research documents covering subset font Unicode recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and proper vs. improper redaction detection with output schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:30:52 -04:00
jedarden	04b60a1cf7	Add three research documents: CJK encoding, pipeline synthesis, linearization - cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0 composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1, Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via Adobe CID tables, full-width normalization, vertical text detection - extraction-pipeline-overview: end-to-end 9-stage synthesis referencing all 36 research documents; stages: file open, metadata, page classification, content extraction (4 sub-paths), font pipeline, span assembly, normalization and quality, supplementary content, output serialization; ASCII data-flow diagram - linearized-pdf-and-streaming: linearization dict keys, hint stream bitfield tables, first-page xref lazy parsing, HTTP range request pattern, staleness validation, incremental update interaction, NDJSON streaming, partial file extraction, lazy PageIter API with rayon par_bridge Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:26:36 -04:00
jedarden	116db89c95	Add three research documents on routing and text reconstruction - word-boundary-reconstruction: expected position formula with Tc/Tw/Tz, TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold strategies including adaptive histogram, multi-column gap discrimination - scanned-vs-vector-page-classification: four-category taxonomy, fast pre-checks, image coverage AABB computation, character density ratio, validity rate, glyph bbox plausibility, region routing map, confidence scoring with cost-aware OCR threshold - pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP pdfaid detection, Level B/U/A guarantee implications for extraction, font embedding requirements, artifact tagging, PDF/A-3 embedded files, PdfaLevel enum with per-level fast-path branching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:22:08 -04:00
jedarden	9420964b73	Add three research documents on parser correctness fundamentals - graphics-state-tracking: full q/Q stack, text state operators, color space tracking, ExtGState keys, clip path management, CTM concatenation, blend mode/soft mask visibility, Form XObject isolation, GraphicsState Rust struct with is_text_visible implementation - cmap-format-and-cid-encoding: CMap file structure, codespace range scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap inheritance with predefined CJK CMap inventory, mixed-length parsing state machine, ToUnicode defect handling, Rust CMap struct design - content-stream-concatenation: multi-stream concatenation with 0x0A injection, continuous graphics state across boundaries, resource inheritance page-tree walk, Form XObject and Type 3 resource isolation, ResourceStack design, EI disambiguation in binary data, lazy decompression Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:16:41 -04:00
jedarden	f805e52fa3	Add four research documents focused on readable text production - type3-font-extraction: CharProcs stream parsing, TeX/dvips naming conventions, dHash shape fingerprinting, nested font stacks, OCR fallback - watermark-and-background-separation: five PDF watermark mechanisms, transparency tracking, cross-page repetition, WCAG contrast detection, raster inpainting, diagonal watermark removal pipeline - historical-and-degraded-document-extraction: eight degradation categories, bleed-through removal, illumination correction, Sauvola binarization, stroke reconstruction, Fraktur/long-s handling, confidence-gated output - complex-layout-reading-order: baseline clustering, XY-cut, Docstrum, RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering, perplexity-based confidence with natural_order fallback Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:13:10 -04:00
jedarden	31e715633d	Add four research documents on text quality and document-type handling - text-readability-validation: character/word/entropy/perplexity checks, symbol font detection, remediation decision tree, span quality metadata - post-ocr-text-correction: error taxonomy, confusable tables, noisy channel n-gram model, regex patterns, hyphenation, layout-based correction pipeline - presentation-and-spreadsheet-pdfs: detection heuristics, slide structure, bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries, cell type inference, Rust output schema - semantic-text-reconstruction: beam search n-gram reconstruction, NER validation, domain lexicons, cross-span consistency, abbreviation expansion, citation repair, coherence scoring, ReconstructedSpan output schema Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:07:30 -04:00
jedarden	a7673c906f	Add 12 research documents covering full PDF extraction surface Infrastructure and parsing: - raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration, assisted OCR, HOCR alignment, multi-language, performance - image-and-figure-extraction: XObjects, inline images, filter decoding, color spaces, geometry, form XObjects, transparency, figure detection - form-fields-and-annotations: AcroForm types, XFA, widget appearance streams, rich text, annotation text, output schema - pdf-encryption-and-security: R2-R6 key derivation, object-level decryption, permission flags, RustCrypto implementation approach - page-geometry-and-document-structure: page tree, all five page boxes, rotation, coordinate inversion, page labels, outlines, named destinations - optional-content-groups: OCG/OCMD visibility, usage dictionary, default state resolution, content stream marking, multilingual layer patterns - invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern, white-on-white, zero-opacity, clipped text, color tracking - malformed-pdf-repair-and-recovery: xref recovery, stream length repair, syntax tolerance, partial extraction, structured warnings Quality and metadata: - xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML parsing, conflict resolution, encrypted metadata, thumbnails - embedded-files-and-portfolios: EmbeddedFile streams, Filespec, AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security - performance-and-streaming-architecture: mmap, lazy loading, NDJSON streaming, rayon parallelism, font caching, axum HTTP server - benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus categories, reading order scoring, regression CI, public datasets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:05:42 -04:00
jedarden	b805593973	Add six research documents covering output-side extraction topics - table-structure-reconstruction: line detection, gap analysis, Hough transform, graph-based cell reconstruction, merged cells, multi-page tables - mathematical-expression-handling: five encoding cases, OpenType MATH table, symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers - language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi, CJK vertical text, ligature normalization, whatlang/lingua integration - document-classification-and-zone-labeling: margin heuristics, font clustering, cross-page recurrence, footnote/caption/sidebar detection - post-extraction-normalization: hyphen handling, ligature expansion, paragraph reconstruction, Unicode normalization, pipeline ordering - chunking-for-llm-consumption: semantic snapping, heading hierarchy, sliding window overlap, table chunking strategies, token budget, late chunking Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:56:25 -04:00
jedarden	ef9c03095d	Add SDK architecture notes covering top 10 languages Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI options, and implementation sequencing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:51:25 -04:00
jedarden	c2870e6640	Add research docs and SDK invocation notes Four research documents covering PDF spec fundamentals, font types and encoding, glyph Unicode recovery, and tagged PDF structure/reading order. SDK invocation notes with subprocess and HTTP examples for Python, Node.js, Go, Ruby, Java, Rust, and Bash. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:33:34 -04:00
jedarden	4ae798c8b1	Initial repo scaffold with README and docs structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:26:16 -04:00

26 commits