jedarden/pdftract

Author	SHA1	Message	Date
jedarden	9b5fbc9b5e	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction - Add decode_page_content_streams() function for per-page lazy decode - Update extract_page_from_dict() to support lazy stream decoding - Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding - Fix borrow checker issue in LazyPageIter::next() This ensures content streams are decoded lazily per page and dropped immediately after processing, keeping peak RSS flat across page count. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 12:30:26 -04:00
jedarden	9fca24c77a	docs(plan): SDKs are monorepo members, not separate repos Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/ in this monorepo (single source of truth), generated via pdftract sdk codegen and published to language registries from here. Retire the legacy standalone repos. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:21:45 -04:00
jedarden	2251f8a9c0	docs(plan): make bounded peak-RSS a CI-gated target; default max_decompress_bytes 2GB->512MB Add a Memory targets table as a first-class acceptance criterion alongside Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB (root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc under rayon page parallelism). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 23:25:50 -04:00
jedarden	bb5346b305	docs(pdftract-58kz): add security policy documentation Add comprehensive SECURITY.md covering: - Supported versions policy - Private vulnerability reporting (email + GitHub) - 90-day disclosure window with timelines - CVE assignment via GitHub Security Advisories - In-scope and out-of-scope vulnerability classes - Safe harbor policy for good-faith researchers Add security issue template redirecting users to private reporting. Add Security section to CONTRIBUTING.md and README.md with links to SECURITY.md. Add docs/security/pgp-public-key.asc placeholder with generation instructions. References: bead pdftract-58kz, plan line 3433 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 19:39:24 -04:00
jedarden	9456d8e231	feat(pdftract-5omc): implement per-language conformance test runner pattern Implements the conformance test runner pattern for all 10 SDKs as specified in the plan (line 3547). Each SDK now has a dedicated conformance test runner. Created: - tests/sdk-conformance/report-schema.json: JSON schema for conformance reports - docs/notes/sdk-conformance-runner.md: Pattern documentation and reference - crates/pdftract-cli/tests/conformance.rs: Rust cargo test target - tests/conformance/test_conformance.py: Python pytest harness - tests/conformance/conformance.test.ts: Node.js vitest runner - tests/conformance/conformance_test.go: Go go test runner - tests/conformance/ConformanceTest.java: Java JUnit 5 runner - tests/conformance/ConformanceTests.cs: .NET xUnit runner - tests/conformance/conformance.c: C standalone binary - tests/conformance/conformance_test.rb: Ruby minitest runner - tests/conformance/ConformanceTest.php: PHP PHPUnit runner - tests/conformance/ConformanceTests.swift: Swift XCTest runner All runners implement: - Loading of tests/sdk-conformance/cases.json - Execution of test cases with language-native method invocations - Comparison of results against expected values with numeric tolerances - Emission of machine-readable conformance-report.json - Non-zero exit on failures/errors for CI gating Acceptance criteria: - PASS: All 10 SDKs have language-specific runners - PASS: Runners consume shared cases.json - PASS: Runners emit JSON reports matching schema - PASS: Runners exit non-zero on failure - WARN: README integration pending SDK repo creation - WARN: Stub implementations return placeholder results References: - Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner" - Plan line 3589: "Conformance suite results published as Argo artifact" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-5omc	2026-05-18 01:32:24 -04:00
jedarden	857f928732	feat(pdftract-5omc): implement SDK conformance test runner pattern Implement the conformance test runner pattern that every SDK will implement to validate against the shared test suite. - Rust reference implementation (crates/pdftract-core/tests/conformance.rs) * Full test suite loader and executor * Comparison engine with min/max, string constraints, tolerances * Skip logic for unsupported features and schema versions * Report generation in JSON format - CLI compare subcommand (crates/pdftract-cli/src/main.rs) * pdftract compare - Compare actual vs expected with tolerances * Cross-language comparison tool to avoid reimplementations - Documentation (docs/conformance/sdk-contract.md) * Complete pattern specification with pseudocode * Per-language runner locations * CI integration requirements - Python reference stub (tests/python-conformance/test_conformance.py) * Full pytest-based implementation following the pattern Closes: pdftract-5omc	2026-05-18 01:22:23 -04:00
jedarden	a34f9c18d0	docs(pdftract-1g87): create mdBook scaffolding for user documentation - book.toml with title, authors, build directory, edit-url-template - src/SUMMARY.md with complete TOC for all planned sections - src/introduction.md: what pdftract does and doesn't do (Non-Goals) - src/installation.md: cargo, pip, Homebrew, Docker; KU-12 caveat verbatim - src/quickstart.md: five-minute walkthrough with executable commands - 39 draft placeholder files for CLI reference, schema, profiles, SDKs, advanced topics, troubleshooting, FAQ mdbook build completes cleanly with zero warnings (linkcheck optional). See notes/pdftract-1g87.md for verification details. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 00:38:51 -04:00
jedarden	5e66846288	docs(pdftract-147a): author SDK contract specification Add comprehensive SDK contract specification at docs/notes/sdk-contract.md. This document serves as the constitutional specification for all pdftract SDK implementations across all languages. The contract defines: - Method surface (9 methods mirroring CLI/MCP tools) - Error mapping (CLI exit codes → native exceptions) - Versioning compatibility rules (MAJOR lock, MINOR flexibility) - Option-naming conventions (CLI flag → language-native case) - Native type-mapping requirements (Document, Page, Span, Block, Match, Fingerprint, Classification) - Async conventions per language - Conformance enforcement (100% pass required) - Change policy (ADR required for contract changes) Verification note: notes/pdftract-147a.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:13:55 -04:00
jedarden	9f27d16f25	docs(phase-0.1): verify pdftract-ci scaffolding complete Verified the pdftract-ci WorkflowTemplate exists in declarative-config and is correctly synced to the iad-ci cluster. All scaffolding requirements met for Phase 0.1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 03:24:36 -04:00
jedarden	7035706068	docs(plan): fix 3 HIGH gaps + 3 LOW items from Round 5 gap review HIGH: - Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE) - Specify base64 encoding for attachment data field in Phase 7.5 - Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only); add max_decompress_gb to CLI/Python/HTTP API surfaces LOW: - Split log+env_logger into two dep matrix rows for accurate crate count - Add full_render to Python keyword args and HTTP form fields (with no-op note) - Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:30:02 -04:00
jedarden	2ba51a8a73	docs(plan): fix 4 gaps from Round 4 gap review - Fix quick-xml feature gate: move from ocr to default (XMP conformance detection) - Make page_number schema update an explicit Phase 6.1 deliverable - Add PageClass → page_type mapping table; define broken_vector as valid output value - Fix CI test matrix: musl target excludes ocr/python features; glibc runs --all-features Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:24:12 -04:00
jedarden	2d194a4b1b	docs(plan): fix 15 gaps from Round 3 gap review HIGH: - Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer) - Remove num_cpus reference (rayon default pool sizing is sufficient) - Update dep count target to < 30 direct crates (< 20 was violated by plan's own list) - Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7 - Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains) MEDIUM: - Document header/footer streaming mode limitation: first 3 pages emit as paragraph - Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature - Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3 - Specify /Contents array concatenation in Phase 1.4 page tree - Add page rotation un-rotation step after Phase 3 glyph bbox computation - Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg - Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor - Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine) - Add wordlist-bloom to Feature flags bullet list LOW: - Clarify extract_stream() yields page dicts only, not header/footer frames Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:18:33 -04:00
jedarden	eb799c0956	docs(plan): fix 21 gaps from Round 2 gap review CRITICAL: - Fix deskew step: pixDeskew operates on grayscale, not binarized image HIGH: - Add sha2 crate to dep matrix (needed for font fingerprint hashing) - Fix bloomfilter feature: wordlist-bloom (optional), not default conditional - Add build-dependencies subsection (phf_codegen, serde_json) - Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic - Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent - Add strsim crate for Levenshtein in header/footer deduplication - Add tokio::task::spawn_blocking bridge for axum→rayon hand-off - Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX/CCITT_UNSUPPORTED diagnostics - Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS) MEDIUM: - Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic - Add Standard-14 font skip for Level 3 fingerprinting (no embedded program) - Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep) - Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list - Add ocg_present to Phase 6.1 metadata field list - Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields - Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields - Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7) - Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology) - Remove frame-index notation from NDJSON streaming critical test - Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:05:26 -04:00
jedarden	bcccc98fd7	docs(plan): fix 30 gaps from Round 1 gap review CRITICAL fixes: - Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix) - Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed - Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch - Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2) - Add aes + rc4 crates under new decrypt feature; document crypto dependency HIGH fixes: - Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source) - Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition - Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching - Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation - Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params - Add JavaScript detection spec to Phase 1.4 (all four JS action locations) - Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives) - Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB - Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach - Add ConfidenceSource enum → schema string mapping table in Phase 4.1 MEDIUM fixes: - Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable - Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection - Define Color enum with CSS hex conversion rules in Phase 3.1 - Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>) - Specify NDJSON buffer Condvar blocking behavior at window saturation - Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets - Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition - Add code and formula block kind detection heuristics to Phase 4.4 - Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS) - Add linearized PDF detection and dual-xref merge to Phase 1.3 - Add HTTP 413 to error table with custom JSON rejection handler - Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate) LOW fixes: - Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5 - Reorder preprocessing pipeline: contrast normalization before binarization (was after) - Add CIDToGIDMap stream form: 2-byte big-endian GID array Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:45:04 -04:00
jedarden	d161d109b3	docs(plan): revise plan to center accuracy/speed/weight as hard targets - Add Primary Objectives section with CI-gated measurable targets: accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s, 10x vs pdfminer), weight (<4MB default binary, <20 default deps) - Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional; default build is core extraction + CLI only - Add Phase 4.7: text readability validation and correction pipeline (ligature repair, hyphenation, mojibake detection, readability scoring) - Make pdfium-render explicitly optional (full-render feature) vs. the always-present direct image compositing path - Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber) - Remove jpeg-decoder and whichlang from dependency matrix (unnecessary) - Rename implementation-plan.md → plan.md (matches CLAUDE.md reference) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:07:48 -04:00
jedarden	8753630bc3	Add parallel extraction research and comprehensive research index New research document covering parallel extraction architecture: rayon page-level parallelism, Arc<> shared xref/font/object-stream caches, RwLock font cache design, Tesseract thread-local OCR pool, semaphore memory budget, ordered NDJSON streaming slot array, and catch_unwind error isolation per page. Also adds docs/research-index.md: a 622-line navigable index of all 83 research documents grouped into 9 thematic categories, with a "Start Here" reading path, per-phase implementation reading tables, and an alphabetical lookup table covering every document. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:30:35 -04:00
jedarden	92e6196ac5	Add research: Ruby/furigana typography, PDF/VT variable printing Two new research documents covering Japanese Ruby text and East Asian typography (tagged/untagged furigana extraction, Kinsoku Shori spacing, full-width normalization, tate-chu-yoko, CJK/Latin boundary detection, ruby_text output field) and PDF/VT variable and transactional printing (DPart hierarchy traversal, per-record extraction model, DPM metadata, variable vs. static content classification, postal address extraction, records array output schema). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:24:21 -04:00
jedarden	e3b72efc83	Add research: Southeast Asian scripts, OpenType MATH formula extraction Two new research documents covering Southeast Asian script extraction (Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table exploitation for formula extraction (MathConstants for fraction/ subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML output generation, GlyphAssembly reconstruction, alternative text and MathJax XMP source recovery). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:21:48 -04:00
jedarden	4e72c66763	Add research: Indic scripts, adversarial parser security Two new research documents covering Indic script extraction (abugida structure, ToUnicode CMap failures for shaped glyphs, ActualText fast-path, GSUB lookup reversal, pre-base matra reordering, virama placement, Tesseract fallback with script-specific models) and adversarial input handling (decompression bombs, circular references, malformed stream lengths, path traversal in attachments, content stream loop detection, O(n log n) algorithm requirements, output sanitization). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:18:03 -04:00
jedarden	12fad41596	Add research: span merging, Unicode normalization, implementation plan Two new research documents covering the glyph-to-span-to-block assembly pipeline (inter-operator merging, adaptive word gap threshold, column detection, ligature bbox splitting, multi-granularity output) and Unicode post-processing (NFC normalization, selective NFKC decomposition for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ handling, combining character reordering). Also adds docs/plan/implementation-plan.md: the full 7-phase Rust implementation roadmap covering core parser, font/encoding pipeline, content stream processing, text assembly, OCR integration, API surface, and advanced features — with crate selections, complexity ratings, test strategy, and v0.1–v1.0 release milestones. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:15:14 -04:00
jedarden	6b96d8d637	Add research: error handling, PDF/A guarantees, output schema, generator quirks Four new extraction research documents covering permissive error handling with extraction quality signaling (five error classes, circular reference detection, memory limits), PDF/A conformance level guarantees and fast-path optimization (Level A skips OCR and layout heuristics), the complete extraction output schema (span/block/table/NDJSON streaming/ versioning), and per-generator extraction quirks (Word/LibreOffice/ InDesign/LaTeX/Chrome/Ghostscript/scanners). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:07:13 -04:00
jedarden	a89fef64fc	Add research: article threads, resource dictionaries, catalog, hyperlinks Four new extraction research documents covering PDF article thread traversal for multi-flow magazine layouts, resource dictionary inheritance and ResourceStack semantics for nested Form XObjects, document catalog and page tree structure (UserUnit, Contents array, page inheritance), and hyperlink/named destination extraction with QuadPoints anchor text and link density classification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:04:00 -04:00
jedarden	16cb1bd61d	Add research: xref parsing, object model, font descriptors, PDF/UA-2 Four new extraction research documents covering cross-reference table and xref stream parsing with error recovery, PDF object model and lexer correctness (all 8 types, string escapes, stream /Length recovery), FontDescriptor fields and embedded font data (Type1/TrueType/CFF/OT), and PDF/UA-2 / PDF 2.0 structure changes (MathML, NFC normalization, new structure types, artifact classification improvements). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:01:34 -04:00
jedarden	6c6ec6a4ca	Add research: color management, text metrics, PDF/X, content stream operators Four new extraction research documents covering ICC profile and color space luminance estimation for text visibility, precise text state tracking and bounding box computation (Tc/Tw/Tz/TL, font units, TJ kerning, baseline clustering), PDF/X prepress handling (OutputIntent, TrimBox, spot colors, article threading), and a complete content stream operator reference (BT/ET, Tj/TJ/'/", BI/ID/EI, BX/EX, marked content). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:59:02 -04:00
jedarden	516ca154aa	Add research: page labels, government forms, book publishing, filter decoding Four new extraction research documents covering page label/PageLabels number tree and outline/bookmark tree extraction, government form PDF patterns (IRS, USCIS, court filings, classification markings), book and publishing PDF structure (running heads, footnotes, index extraction), and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global segments, CCITTFax, JPX, error boundaries). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:55:08 -04:00
jedarden	5ff918b178	Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms Four new extraction research documents covering PDF portfolio and attachment enumeration (ZUGFeRD, PDF/A-3 AFRelationship), incremental update structure and xref chaining, PDF/UA tagged PDF deep dive with all 36 structure types and MCID mechanics, and JavaScript/AcroForm/XFA field extraction without script execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:45:59 -04:00
jedarden	006dfb286c	Add research: color visibility, medical/scientific, multilingual, digital signatures Four new extraction research documents covering color space and contrast analysis for text visibility, medical/scientific document structure (ICH E3, IMRaD, FDA labeling, eCTD), multilingual mixed-script extraction with UBA bidi handling and CJK vertical text, and digital signature metadata extraction with DocMDP integrity context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:41:43 -04:00
jedarden	eac3235291	Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs Four new extraction research documents covering text rendering modes (Tr 0-7 including invisible OCR layers), legal/financial document extraction patterns, character-level confidence aggregation with output schema, and PDF/E engineering document handling (CAD, GD&T, schematics). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:35:48 -04:00
jedarden	8f8138a65e	Add research: font subsetting, LaTeX patterns, redaction detection Three new extraction research documents covering subset font Unicode recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and proper vs. improper redaction detection with output schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:30:52 -04:00
jedarden	04b60a1cf7	Add three research documents: CJK encoding, pipeline synthesis, linearization - cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0 composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1, Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via Adobe CID tables, full-width normalization, vertical text detection - extraction-pipeline-overview: end-to-end 9-stage synthesis referencing all 36 research documents; stages: file open, metadata, page classification, content extraction (4 sub-paths), font pipeline, span assembly, normalization and quality, supplementary content, output serialization; ASCII data-flow diagram - linearized-pdf-and-streaming: linearization dict keys, hint stream bitfield tables, first-page xref lazy parsing, HTTP range request pattern, staleness validation, incremental update interaction, NDJSON streaming, partial file extraction, lazy PageIter API with rayon par_bridge Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:26:36 -04:00
jedarden	116db89c95	Add three research documents on routing and text reconstruction - word-boundary-reconstruction: expected position formula with Tc/Tw/Tz, TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold strategies including adaptive histogram, multi-column gap discrimination - scanned-vs-vector-page-classification: four-category taxonomy, fast pre-checks, image coverage AABB computation, character density ratio, validity rate, glyph bbox plausibility, region routing map, confidence scoring with cost-aware OCR threshold - pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP pdfaid detection, Level B/U/A guarantee implications for extraction, font embedding requirements, artifact tagging, PDF/A-3 embedded files, PdfaLevel enum with per-level fast-path branching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:22:08 -04:00
jedarden	9420964b73	Add three research documents on parser correctness fundamentals - graphics-state-tracking: full q/Q stack, text state operators, color space tracking, ExtGState keys, clip path management, CTM concatenation, blend mode/soft mask visibility, Form XObject isolation, GraphicsState Rust struct with is_text_visible implementation - cmap-format-and-cid-encoding: CMap file structure, codespace range scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap inheritance with predefined CJK CMap inventory, mixed-length parsing state machine, ToUnicode defect handling, Rust CMap struct design - content-stream-concatenation: multi-stream concatenation with 0x0A injection, continuous graphics state across boundaries, resource inheritance page-tree walk, Form XObject and Type 3 resource isolation, ResourceStack design, EI disambiguation in binary data, lazy decompression Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:16:41 -04:00
jedarden	f805e52fa3	Add four research documents focused on readable text production - type3-font-extraction: CharProcs stream parsing, TeX/dvips naming conventions, dHash shape fingerprinting, nested font stacks, OCR fallback - watermark-and-background-separation: five PDF watermark mechanisms, transparency tracking, cross-page repetition, WCAG contrast detection, raster inpainting, diagonal watermark removal pipeline - historical-and-degraded-document-extraction: eight degradation categories, bleed-through removal, illumination correction, Sauvola binarization, stroke reconstruction, Fraktur/long-s handling, confidence-gated output - complex-layout-reading-order: baseline clustering, XY-cut, Docstrum, RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering, perplexity-based confidence with natural_order fallback Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:13:10 -04:00
jedarden	31e715633d	Add four research documents on text quality and document-type handling - text-readability-validation: character/word/entropy/perplexity checks, symbol font detection, remediation decision tree, span quality metadata - post-ocr-text-correction: error taxonomy, confusable tables, noisy channel n-gram model, regex patterns, hyphenation, layout-based correction pipeline - presentation-and-spreadsheet-pdfs: detection heuristics, slide structure, bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries, cell type inference, Rust output schema - semantic-text-reconstruction: beam search n-gram reconstruction, NER validation, domain lexicons, cross-span consistency, abbreviation expansion, citation repair, coherence scoring, ReconstructedSpan output schema Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:07:30 -04:00
jedarden	a7673c906f	Add 12 research documents covering full PDF extraction surface Infrastructure and parsing: - raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration, assisted OCR, HOCR alignment, multi-language, performance - image-and-figure-extraction: XObjects, inline images, filter decoding, color spaces, geometry, form XObjects, transparency, figure detection - form-fields-and-annotations: AcroForm types, XFA, widget appearance streams, rich text, annotation text, output schema - pdf-encryption-and-security: R2-R6 key derivation, object-level decryption, permission flags, RustCrypto implementation approach - page-geometry-and-document-structure: page tree, all five page boxes, rotation, coordinate inversion, page labels, outlines, named destinations - optional-content-groups: OCG/OCMD visibility, usage dictionary, default state resolution, content stream marking, multilingual layer patterns - invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern, white-on-white, zero-opacity, clipped text, color tracking - malformed-pdf-repair-and-recovery: xref recovery, stream length repair, syntax tolerance, partial extraction, structured warnings Quality and metadata: - xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML parsing, conflict resolution, encrypted metadata, thumbnails - embedded-files-and-portfolios: EmbeddedFile streams, Filespec, AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security - performance-and-streaming-architecture: mmap, lazy loading, NDJSON streaming, rayon parallelism, font caching, axum HTTP server - benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus categories, reading order scoring, regression CI, public datasets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:05:42 -04:00
jedarden	b805593973	Add six research documents covering output-side extraction topics - table-structure-reconstruction: line detection, gap analysis, Hough transform, graph-based cell reconstruction, merged cells, multi-page tables - mathematical-expression-handling: five encoding cases, OpenType MATH table, symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers - language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi, CJK vertical text, ligature normalization, whatlang/lingua integration - document-classification-and-zone-labeling: margin heuristics, font clustering, cross-page recurrence, footnote/caption/sidebar detection - post-extraction-normalization: hyphen handling, ligature expansion, paragraph reconstruction, Unicode normalization, pipeline ordering - chunking-for-llm-consumption: semantic snapping, heading hierarchy, sliding window overlap, table chunking strategies, token budget, late chunking Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:56:25 -04:00
jedarden	ef9c03095d	Add SDK architecture notes covering top 10 languages Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI options, and implementation sequencing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:51:25 -04:00
jedarden	c2870e6640	Add research docs and SDK invocation notes Four research documents covering PDF spec fundamentals, font types and encoding, glyph Unicode recovery, and tagged PDF structure/reading order. SDK invocation notes with subprocess and HTTP examples for Python, Node.js, Go, Ruby, Java, Rust, and Bash. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:33:34 -04:00
jedarden	4ae798c8b1	Initial repo scaffold with README and docs structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:26:16 -04:00

39 commits