jedarden/pdftract

Author	SHA1	Message	Date
jedarden	1791bb6d80	docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment - Add workspace layout section documenting pdftract-core as the only direct dependency, with pdftract-cli, pdftract-py, and pdftract-inspector-ui as siblings - Update binary distribution table with correct target triples (musl not gnu for Linux) - Add KU-12 cross-platform test limitation section with verbatim wording from plan: "Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release" - Add Argo CI templates section (pdftract-cargo-build, pdftract-maturin-build) - Add feature flag composition section with tiers, dependencies, and binary size budgets - Add cross-references to sdk-invocation.md, sdk-contract.md, ocr-language-packs.md - Fix clippy warnings in build.rs files (expect_fun_call, get_first, manual_strip, unused imports) Closes: pdftract-32y9 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 06:38:23 -04:00
jedarden	67b3fde4d6	feat(pdftract-j6yd): implement signatures array output + validation_status enum + schema integration Add document-level /signatures array output per Phase 7.3 of the plan. Changes: - Add SignatureJson struct to schema module with all signature metadata fields - Update ExtractionResult to include signatures: Vec<SignatureJson> - Integrate signature extraction into extract_pdf() pipeline - Update result_to_json() to include signatures in JSON output - Update JSON schema with signatures array and SignatureJson definition - Add markdown sink signatures footer when signatures are present - Add comprehensive tests for signature JSON serialization and validation Acceptance criteria: - Schema tests: 5/5 signature JSON tests pass - Markdown sink emits Signatures footer when count > 0 - PyO3 binding automatically handles Vec<SignatureJson> via serde - docs/schema/v1.0/pdftract.schema.json updated with signatures shape Verification note: notes/pdftract-j6yd.md Closes: pdftract-j6yd	2026-05-24 04:05:34 -04:00
jedarden	d174725241	docs(pdftract-5vhp): bring word-boundary-reconstruction.md to v1.0 final-pass Complete documentation of the adaptive word-boundary algorithm including: - Initial threshold = 0.25 * font_size - 20-glyph median adjustment - 1.5x median formula - Full Tc/Tw/Tz (character-spacing, word-spacing, horizontal-scaling) corrections Expanded from 202 lines to 899 lines with: - Section 3.1: Tc/Tw/Tz formula with explicit parameter table - Section 3.2: Text-space vs. device-space comparison per plan line 1550 - Section 4: Adaptive algorithm specification (20-glyph window, 1.5× median, outlier exclusion) - Section 11: Complete pseudo-code (data structures, main loop, detection, threshold computation) - Section 12: Edge cases (ZWJ, combining marks, CJK, justified text, monospaced, RTL, ligatures, soft hyphens, tabs) - Section 13: Validation methodology (corpus at tests/fixtures/word-boundary-corpus/, 141 PDFs, 8 categories) - Section 14: Implementation checklist and references Closes: pdftract-5vhp	2026-05-24 03:55:43 -04:00
jedarden	28c31ba0a1	feat(pdftract-vk0gc): implement markdown anchors with parser regex Add --md-anchors flag that emits HTML comment markers before each block in Markdown output, allowing downstream tools to map excerpts back to precise PDF locations. Changes: - Add markdown module with Anchor struct and parse_anchors() function - Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=[([\d.,]+)] kind=(\w+) --> - Add markdown_anchors: bool to ExtractionOptions - Add --md-anchors CLI flag - Implement block_to_markdown() and page_to_markdown() functions - Add comprehensive documentation in docs/integrations/markdown-anchors.md - 16 unit tests pass, including roundtrip test Closes: pdftract-vk0gc	2026-05-24 02:49:16 -04:00
jedarden	cf8f04e3ec	docs(pdftract-26r8): finalize glyph recognition research note v1.0 - Reorganize around the four-level Unicode recovery cascade from plan - Document all cascade levels with confidence scores: - Level 1: ToUnicode CMap (1.0) - Level 2: Encoding + AGL (0.9) - Level 3: Font fingerprint cache (0.85) - Level 4: Glyph shape recognition (0.7) - Add shape database design (pHash algorithm, query, format) - Document pHash collision tie-break rules (frequency-based) - Add Type 3 font handling section - Cross-reference Phase 2.2, 2.4, 2.5 and OQ-02 File grows from 112 to 210 lines. Covers all acceptance criteria. Closes: pdftract-26r8	2026-05-24 02:10:06 -04:00
jedarden	92e90af0b0	feat(pdftract-zy2jx): generate JSON Schema from Rust output types - Add schemars dependency to pdftract-core (v1.2) - Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode) - Create xtask/src/bin/gen_schema.rs for schema generation - Add gen-schema command to xtask main.rs - Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12 Schema includes: - $schema: "https://json-schema.org/draft/2020-12/schema" - $defs with all output type definitions - Proper type annotations for all fields Closes: pdftract-zy2jx	2026-05-24 01:29:14 -04:00
jedarden	bf37f0f05f	docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields This commit brings docs/research/extraction-output-schema.md to v1.0 final-pass specification, aligning with Phase 6.1 deliverables and plan requirements. Key additions: - page_number field documented with page_index relationship (1-based vs 0-based) - page_type enum expanded with all six values: text, scanned, mixed, broken_vector, blank, figure_only — with broken_vector cross-referenced to Phase 5.5 - Block kind enum fully documented: paragraph, heading, list, table, figure, caption, code, formula, watermark, header, footer - Attachments schema with base64 contentEncoding and 50MB truncation rule - Profile-based classification fields (document_type, document_type_confidence, document_type_reasons, profile_name, profile_version, profile_fields) - Schema Version Compatibility section with additive-evolution rules - JSON Schema cross-reference throughout Format changes: - Restructured with ATX headings (## for sections) - Added explicit field tables for each major schema section - Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json - Grew from 81 lines to 304 lines per acceptance criteria Plan references: - Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659 - INV-9 page_type taxonomy stability Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>	2026-05-24 00:59:23 -04:00
jedarden	d14ec92fcb	feat(pdftract-3zhf): add unified TableDetector::detect entry point Add unified detect() method to TableDetector that combines both line-based and borderless table detection pipelines. This completes the coordinator bead for Phase 7.2: Table Detection and Structure Reconstruction. All child beads (7.2.1-7.2.6) are closed: - 7.2.1: Line-based detection (path segment clustering) - 7.2.2: Borderless detection (x0 alignment heuristic) - 7.2.3: Span-to-cell assignment (centroid containment) - 7.2.4: Header row detection (bold + StructTree TH) - 7.2.5: Merged cell detection (missing interior edges) - 7.2.6: Table JSON output schema integration Critical tests pass: - 5x3 bordered table (15 cells extracted) - Merged header cell colspan=3 - Borderless 3-column table detection - Two-page table continuation detection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 00:51:59 -04:00
jedarden	33372c23ae	fix(pdftract-3c4i): export detect_merged_cells from table module The detect_merged_cells function was implemented but not exported from the table module, making it inaccessible to library users. This commit adds the function to the public API exports. Also adds a verification note documenting the complete implementation and the export fix. Acceptance criteria status: - All 6 merged cell detection tests pass - Public Cell.rowspan/colspan fields exist with default 1 - Absorbed cells are excluded from output - Bbox of merged cell covers absorbed cells - Borderless tables NO-OP with diagnostic Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 00:23:14 -04:00
jedarden	26bdd255c8	feat(pdftract-ilen): implement header row detection with bold+TH support Implement header row detection for tables using two signals: 1. Bold font detection (fully implemented) 2. StructTree TH detection (stub pending MCID tracking) Bold detection: - is_bold_font(): detects bold fonts from PostScript name patterns - is_cell_bold(): checks if all non-whitespace content in a cell is bold - is_bold_header_row(): validates rows with >=2 bold cells - count_header_rows(): counts contiguous bold headers from top - Cell::mark_header_rows(): sets is_header_row flag on cells TH detection (stub): - is_th_header_row(): placeholder for StructTree TH detection Requires MCID tracking on TableSpan (future work) Will use ParentTree to map MCIDs to StructElems Will verify TR > TH chain structure Combined detection: - is_header_row(): combines bold and TH signals - Bold wins on conflict per body data design principle Documentation: - Updated table-structure-reconstruction.md with full header detection spec - Documented implemented vs pending signals - Added implementation notes for TH detection Tests: - 45 tests covering all bold detection scenarios - Tests for multi-row headers (contiguous from top) - Tests for single-cell row exclusion - Tests for empty/whitespace cell handling - Placeholder tests for TH detection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:32:54 -04:00
jedarden	9b5fbc9b5e	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction - Add decode_page_content_streams() function for per-page lazy decode - Update extract_page_from_dict() to support lazy stream decoding - Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding - Fix borrow checker issue in LazyPageIter::next() This ensures content streams are decoded lazily per page and dropped immediately after processing, keeping peak RSS flat across page count. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 12:30:26 -04:00
jedarden	9fca24c77a	docs(plan): SDKs are monorepo members, not separate repos Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/ in this monorepo (single source of truth), generated via pdftract sdk codegen and published to language registries from here. Retire the legacy standalone repos. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:21:45 -04:00
jedarden	2251f8a9c0	docs(plan): make bounded peak-RSS a CI-gated target; default max_decompress_bytes 2GB->512MB Add a Memory targets table as a first-class acceptance criterion alongside Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB (root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc under rayon page parallelism). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 23:25:50 -04:00
jedarden	bb5346b305	docs(pdftract-58kz): add security policy documentation Add comprehensive SECURITY.md covering: - Supported versions policy - Private vulnerability reporting (email + GitHub) - 90-day disclosure window with timelines - CVE assignment via GitHub Security Advisories - In-scope and out-of-scope vulnerability classes - Safe harbor policy for good-faith researchers Add security issue template redirecting users to private reporting. Add Security section to CONTRIBUTING.md and README.md with links to SECURITY.md. Add docs/security/pgp-public-key.asc placeholder with generation instructions. References: bead pdftract-58kz, plan line 3433 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 19:39:24 -04:00
jedarden	9456d8e231	feat(pdftract-5omc): implement per-language conformance test runner pattern Implements the conformance test runner pattern for all 10 SDKs as specified in the plan (line 3547). Each SDK now has a dedicated conformance test runner. Created: - tests/sdk-conformance/report-schema.json: JSON schema for conformance reports - docs/notes/sdk-conformance-runner.md: Pattern documentation and reference - crates/pdftract-cli/tests/conformance.rs: Rust cargo test target - tests/conformance/test_conformance.py: Python pytest harness - tests/conformance/conformance.test.ts: Node.js vitest runner - tests/conformance/conformance_test.go: Go go test runner - tests/conformance/ConformanceTest.java: Java JUnit 5 runner - tests/conformance/ConformanceTests.cs: .NET xUnit runner - tests/conformance/conformance.c: C standalone binary - tests/conformance/conformance_test.rb: Ruby minitest runner - tests/conformance/ConformanceTest.php: PHP PHPUnit runner - tests/conformance/ConformanceTests.swift: Swift XCTest runner All runners implement: - Loading of tests/sdk-conformance/cases.json - Execution of test cases with language-native method invocations - Comparison of results against expected values with numeric tolerances - Emission of machine-readable conformance-report.json - Non-zero exit on failures/errors for CI gating Acceptance criteria: - PASS: All 10 SDKs have language-specific runners - PASS: Runners consume shared cases.json - PASS: Runners emit JSON reports matching schema - PASS: Runners exit non-zero on failure - WARN: README integration pending SDK repo creation - WARN: Stub implementations return placeholder results References: - Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner" - Plan line 3589: "Conformance suite results published as Argo artifact" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-5omc	2026-05-18 01:32:24 -04:00
jedarden	857f928732	feat(pdftract-5omc): implement SDK conformance test runner pattern Implement the conformance test runner pattern that every SDK will implement to validate against the shared test suite. - Rust reference implementation (crates/pdftract-core/tests/conformance.rs) * Full test suite loader and executor * Comparison engine with min/max, string constraints, tolerances * Skip logic for unsupported features and schema versions * Report generation in JSON format - CLI compare subcommand (crates/pdftract-cli/src/main.rs) * pdftract compare - Compare actual vs expected with tolerances * Cross-language comparison tool to avoid reimplementations - Documentation (docs/conformance/sdk-contract.md) * Complete pattern specification with pseudocode * Per-language runner locations * CI integration requirements - Python reference stub (tests/python-conformance/test_conformance.py) * Full pytest-based implementation following the pattern Closes: pdftract-5omc	2026-05-18 01:22:23 -04:00
jedarden	a34f9c18d0	docs(pdftract-1g87): create mdBook scaffolding for user documentation - book.toml with title, authors, build directory, edit-url-template - src/SUMMARY.md with complete TOC for all planned sections - src/introduction.md: what pdftract does and doesn't do (Non-Goals) - src/installation.md: cargo, pip, Homebrew, Docker; KU-12 caveat verbatim - src/quickstart.md: five-minute walkthrough with executable commands - 39 draft placeholder files for CLI reference, schema, profiles, SDKs, advanced topics, troubleshooting, FAQ mdbook build completes cleanly with zero warnings (linkcheck optional). See notes/pdftract-1g87.md for verification details. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 00:38:51 -04:00
jedarden	5e66846288	docs(pdftract-147a): author SDK contract specification Add comprehensive SDK contract specification at docs/notes/sdk-contract.md. This document serves as the constitutional specification for all pdftract SDK implementations across all languages. The contract defines: - Method surface (9 methods mirroring CLI/MCP tools) - Error mapping (CLI exit codes → native exceptions) - Versioning compatibility rules (MAJOR lock, MINOR flexibility) - Option-naming conventions (CLI flag → language-native case) - Native type-mapping requirements (Document, Page, Span, Block, Match, Fingerprint, Classification) - Async conventions per language - Conformance enforcement (100% pass required) - Change policy (ADR required for contract changes) Verification note: notes/pdftract-147a.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:13:55 -04:00
jedarden	9f27d16f25	docs(phase-0.1): verify pdftract-ci scaffolding complete Verified the pdftract-ci WorkflowTemplate exists in declarative-config and is correctly synced to the iad-ci cluster. All scaffolding requirements met for Phase 0.1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 03:24:36 -04:00
jedarden	7035706068	docs(plan): fix 3 HIGH gaps + 3 LOW items from Round 5 gap review HIGH: - Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE) - Specify base64 encoding for attachment data field in Phase 7.5 - Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only); add max_decompress_gb to CLI/Python/HTTP API surfaces LOW: - Split log+env_logger into two dep matrix rows for accurate crate count - Add full_render to Python keyword args and HTTP form fields (with no-op note) - Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:30:02 -04:00
jedarden	2ba51a8a73	docs(plan): fix 4 gaps from Round 4 gap review - Fix quick-xml feature gate: move from ocr to default (XMP conformance detection) - Make page_number schema update an explicit Phase 6.1 deliverable - Add PageClass → page_type mapping table; define broken_vector as valid output value - Fix CI test matrix: musl target excludes ocr/python features; glibc runs --all-features Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:24:12 -04:00
jedarden	2d194a4b1b	docs(plan): fix 15 gaps from Round 3 gap review HIGH: - Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer) - Remove num_cpus reference (rayon default pool sizing is sufficient) - Update dep count target to < 30 direct crates (< 20 was violated by plan's own list) - Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7 - Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains) MEDIUM: - Document header/footer streaming mode limitation: first 3 pages emit as paragraph - Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature - Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3 - Specify /Contents array concatenation in Phase 1.4 page tree - Add page rotation un-rotation step after Phase 3 glyph bbox computation - Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg - Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor - Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine) - Add wordlist-bloom to Feature flags bullet list LOW: - Clarify extract_stream() yields page dicts only, not header/footer frames Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:18:33 -04:00
jedarden	eb799c0956	docs(plan): fix 21 gaps from Round 2 gap review CRITICAL: - Fix deskew step: pixDeskew operates on grayscale, not binarized image HIGH: - Add sha2 crate to dep matrix (needed for font fingerprint hashing) - Fix bloomfilter feature: wordlist-bloom (optional), not default conditional - Add build-dependencies subsection (phf_codegen, serde_json) - Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic - Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent - Add strsim crate for Levenshtein in header/footer deduplication - Add tokio::task::spawn_blocking bridge for axum→rayon hand-off - Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX/CCITT_UNSUPPORTED diagnostics - Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS) MEDIUM: - Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic - Add Standard-14 font skip for Level 3 fingerprinting (no embedded program) - Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep) - Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list - Add ocg_present to Phase 6.1 metadata field list - Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields - Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields - Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7) - Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology) - Remove frame-index notation from NDJSON streaming critical test - Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:05:26 -04:00
jedarden	bcccc98fd7	docs(plan): fix 30 gaps from Round 1 gap review CRITICAL fixes: - Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix) - Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed - Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch - Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2) - Add aes + rc4 crates under new decrypt feature; document crypto dependency HIGH fixes: - Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source) - Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition - Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching - Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation - Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params - Add JavaScript detection spec to Phase 1.4 (all four JS action locations) - Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives) - Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB - Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach - Add ConfidenceSource enum → schema string mapping table in Phase 4.1 MEDIUM fixes: - Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable - Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection - Define Color enum with CSS hex conversion rules in Phase 3.1 - Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>) - Specify NDJSON buffer Condvar blocking behavior at window saturation - Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets - Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition - Add code and formula block kind detection heuristics to Phase 4.4 - Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS) - Add linearized PDF detection and dual-xref merge to Phase 1.3 - Add HTTP 413 to error table with custom JSON rejection handler - Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate) LOW fixes: - Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5 - Reorder preprocessing pipeline: contrast normalization before binarization (was after) - Add CIDToGIDMap stream form: 2-byte big-endian GID array Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:45:04 -04:00
jedarden	d161d109b3	docs(plan): revise plan to center accuracy/speed/weight as hard targets - Add Primary Objectives section with CI-gated measurable targets: accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s, 10x vs pdfminer), weight (<4MB default binary, <20 default deps) - Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional; default build is core extraction + CLI only - Add Phase 4.7: text readability validation and correction pipeline (ligature repair, hyphenation, mojibake detection, readability scoring) - Make pdfium-render explicitly optional (full-render feature) vs. the always-present direct image compositing path - Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber) - Remove jpeg-decoder and whichlang from dependency matrix (unnecessary) - Rename implementation-plan.md → plan.md (matches CLAUDE.md reference) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:07:48 -04:00
jedarden	8753630bc3	Add parallel extraction research and comprehensive research index New research document covering parallel extraction architecture: rayon page-level parallelism, Arc<> shared xref/font/object-stream caches, RwLock font cache design, Tesseract thread-local OCR pool, semaphore memory budget, ordered NDJSON streaming slot array, and catch_unwind error isolation per page. Also adds docs/research-index.md: a 622-line navigable index of all 83 research documents grouped into 9 thematic categories, with a "Start Here" reading path, per-phase implementation reading tables, and an alphabetical lookup table covering every document. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:30:35 -04:00
jedarden	92e6196ac5	Add research: Ruby/furigana typography, PDF/VT variable printing Two new research documents covering Japanese Ruby text and East Asian typography (tagged/untagged furigana extraction, Kinsoku Shori spacing, full-width normalization, tate-chu-yoko, CJK/Latin boundary detection, ruby_text output field) and PDF/VT variable and transactional printing (DPart hierarchy traversal, per-record extraction model, DPM metadata, variable vs. static content classification, postal address extraction, records array output schema). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:24:21 -04:00
jedarden	e3b72efc83	Add research: Southeast Asian scripts, OpenType MATH formula extraction Two new research documents covering Southeast Asian script extraction (Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table exploitation for formula extraction (MathConstants for fraction/ subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML output generation, GlyphAssembly reconstruction, alternative text and MathJax XMP source recovery). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:21:48 -04:00
jedarden	4e72c66763	Add research: Indic scripts, adversarial parser security Two new research documents covering Indic script extraction (abugida structure, ToUnicode CMap failures for shaped glyphs, ActualText fast-path, GSUB lookup reversal, pre-base matra reordering, virama placement, Tesseract fallback with script-specific models) and adversarial input handling (decompression bombs, circular references, malformed stream lengths, path traversal in attachments, content stream loop detection, O(n log n) algorithm requirements, output sanitization). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:18:03 -04:00
jedarden	12fad41596	Add research: span merging, Unicode normalization, implementation plan Two new research documents covering the glyph-to-span-to-block assembly pipeline (inter-operator merging, adaptive word gap threshold, column detection, ligature bbox splitting, multi-granularity output) and Unicode post-processing (NFC normalization, selective NFKC decomposition for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ handling, combining character reordering). Also adds docs/plan/implementation-plan.md: the full 7-phase Rust implementation roadmap covering core parser, font/encoding pipeline, content stream processing, text assembly, OCR integration, API surface, and advanced features — with crate selections, complexity ratings, test strategy, and v0.1–v1.0 release milestones. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:15:14 -04:00
jedarden	6b96d8d637	Add research: error handling, PDF/A guarantees, output schema, generator quirks Four new extraction research documents covering permissive error handling with extraction quality signaling (five error classes, circular reference detection, memory limits), PDF/A conformance level guarantees and fast-path optimization (Level A skips OCR and layout heuristics), the complete extraction output schema (span/block/table/NDJSON streaming/ versioning), and per-generator extraction quirks (Word/LibreOffice/ InDesign/LaTeX/Chrome/Ghostscript/scanners). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:07:13 -04:00
jedarden	a89fef64fc	Add research: article threads, resource dictionaries, catalog, hyperlinks Four new extraction research documents covering PDF article thread traversal for multi-flow magazine layouts, resource dictionary inheritance and ResourceStack semantics for nested Form XObjects, document catalog and page tree structure (UserUnit, Contents array, page inheritance), and hyperlink/named destination extraction with QuadPoints anchor text and link density classification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:04:00 -04:00
jedarden	16cb1bd61d	Add research: xref parsing, object model, font descriptors, PDF/UA-2 Four new extraction research documents covering cross-reference table and xref stream parsing with error recovery, PDF object model and lexer correctness (all 8 types, string escapes, stream /Length recovery), FontDescriptor fields and embedded font data (Type1/TrueType/CFF/OT), and PDF/UA-2 / PDF 2.0 structure changes (MathML, NFC normalization, new structure types, artifact classification improvements). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:01:34 -04:00
jedarden	6c6ec6a4ca	Add research: color management, text metrics, PDF/X, content stream operators Four new extraction research documents covering ICC profile and color space luminance estimation for text visibility, precise text state tracking and bounding box computation (Tc/Tw/Tz/TL, font units, TJ kerning, baseline clustering), PDF/X prepress handling (OutputIntent, TrimBox, spot colors, article threading), and a complete content stream operator reference (BT/ET, Tj/TJ/'/", BI/ID/EI, BX/EX, marked content). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:59:02 -04:00
jedarden	516ca154aa	Add research: page labels, government forms, book publishing, filter decoding Four new extraction research documents covering page label/PageLabels number tree and outline/bookmark tree extraction, government form PDF patterns (IRS, USCIS, court filings, classification markings), book and publishing PDF structure (running heads, footnotes, index extraction), and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global segments, CCITTFax, JPX, error boundaries). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:55:08 -04:00
jedarden	5ff918b178	Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms Four new extraction research documents covering PDF portfolio and attachment enumeration (ZUGFeRD, PDF/A-3 AFRelationship), incremental update structure and xref chaining, PDF/UA tagged PDF deep dive with all 36 structure types and MCID mechanics, and JavaScript/AcroForm/XFA field extraction without script execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:45:59 -04:00
jedarden	006dfb286c	Add research: color visibility, medical/scientific, multilingual, digital signatures Four new extraction research documents covering color space and contrast analysis for text visibility, medical/scientific document structure (ICH E3, IMRaD, FDA labeling, eCTD), multilingual mixed-script extraction with UBA bidi handling and CJK vertical text, and digital signature metadata extraction with DocMDP integrity context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:41:43 -04:00
jedarden	eac3235291	Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs Four new extraction research documents covering text rendering modes (Tr 0-7 including invisible OCR layers), legal/financial document extraction patterns, character-level confidence aggregation with output schema, and PDF/E engineering document handling (CAD, GD&T, schematics). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:35:48 -04:00
jedarden	8f8138a65e	Add research: font subsetting, LaTeX patterns, redaction detection Three new extraction research documents covering subset font Unicode recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and proper vs. improper redaction detection with output schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:30:52 -04:00
jedarden	04b60a1cf7	Add three research documents: CJK encoding, pipeline synthesis, linearization - cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0 composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1, Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via Adobe CID tables, full-width normalization, vertical text detection - extraction-pipeline-overview: end-to-end 9-stage synthesis referencing all 36 research documents; stages: file open, metadata, page classification, content extraction (4 sub-paths), font pipeline, span assembly, normalization and quality, supplementary content, output serialization; ASCII data-flow diagram - linearized-pdf-and-streaming: linearization dict keys, hint stream bitfield tables, first-page xref lazy parsing, HTTP range request pattern, staleness validation, incremental update interaction, NDJSON streaming, partial file extraction, lazy PageIter API with rayon par_bridge Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:26:36 -04:00
jedarden	116db89c95	Add three research documents on routing and text reconstruction - word-boundary-reconstruction: expected position formula with Tc/Tw/Tz, TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold strategies including adaptive histogram, multi-column gap discrimination - scanned-vs-vector-page-classification: four-category taxonomy, fast pre-checks, image coverage AABB computation, character density ratio, validity rate, glyph bbox plausibility, region routing map, confidence scoring with cost-aware OCR threshold - pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP pdfaid detection, Level B/U/A guarantee implications for extraction, font embedding requirements, artifact tagging, PDF/A-3 embedded files, PdfaLevel enum with per-level fast-path branching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:22:08 -04:00
jedarden	9420964b73	Add three research documents on parser correctness fundamentals - graphics-state-tracking: full q/Q stack, text state operators, color space tracking, ExtGState keys, clip path management, CTM concatenation, blend mode/soft mask visibility, Form XObject isolation, GraphicsState Rust struct with is_text_visible implementation - cmap-format-and-cid-encoding: CMap file structure, codespace range scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap inheritance with predefined CJK CMap inventory, mixed-length parsing state machine, ToUnicode defect handling, Rust CMap struct design - content-stream-concatenation: multi-stream concatenation with 0x0A injection, continuous graphics state across boundaries, resource inheritance page-tree walk, Form XObject and Type 3 resource isolation, ResourceStack design, EI disambiguation in binary data, lazy decompression Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:16:41 -04:00
jedarden	f805e52fa3	Add four research documents focused on readable text production - type3-font-extraction: CharProcs stream parsing, TeX/dvips naming conventions, dHash shape fingerprinting, nested font stacks, OCR fallback - watermark-and-background-separation: five PDF watermark mechanisms, transparency tracking, cross-page repetition, WCAG contrast detection, raster inpainting, diagonal watermark removal pipeline - historical-and-degraded-document-extraction: eight degradation categories, bleed-through removal, illumination correction, Sauvola binarization, stroke reconstruction, Fraktur/long-s handling, confidence-gated output - complex-layout-reading-order: baseline clustering, XY-cut, Docstrum, RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering, perplexity-based confidence with natural_order fallback Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:13:10 -04:00
jedarden	31e715633d	Add four research documents on text quality and document-type handling - text-readability-validation: character/word/entropy/perplexity checks, symbol font detection, remediation decision tree, span quality metadata - post-ocr-text-correction: error taxonomy, confusable tables, noisy channel n-gram model, regex patterns, hyphenation, layout-based correction pipeline - presentation-and-spreadsheet-pdfs: detection heuristics, slide structure, bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries, cell type inference, Rust output schema - semantic-text-reconstruction: beam search n-gram reconstruction, NER validation, domain lexicons, cross-span consistency, abbreviation expansion, citation repair, coherence scoring, ReconstructedSpan output schema Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:07:30 -04:00
jedarden	a7673c906f	Add 12 research documents covering full PDF extraction surface Infrastructure and parsing: - raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration, assisted OCR, HOCR alignment, multi-language, performance - image-and-figure-extraction: XObjects, inline images, filter decoding, color spaces, geometry, form XObjects, transparency, figure detection - form-fields-and-annotations: AcroForm types, XFA, widget appearance streams, rich text, annotation text, output schema - pdf-encryption-and-security: R2-R6 key derivation, object-level decryption, permission flags, RustCrypto implementation approach - page-geometry-and-document-structure: page tree, all five page boxes, rotation, coordinate inversion, page labels, outlines, named destinations - optional-content-groups: OCG/OCMD visibility, usage dictionary, default state resolution, content stream marking, multilingual layer patterns - invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern, white-on-white, zero-opacity, clipped text, color tracking - malformed-pdf-repair-and-recovery: xref recovery, stream length repair, syntax tolerance, partial extraction, structured warnings Quality and metadata: - xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML parsing, conflict resolution, encrypted metadata, thumbnails - embedded-files-and-portfolios: EmbeddedFile streams, Filespec, AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security - performance-and-streaming-architecture: mmap, lazy loading, NDJSON streaming, rayon parallelism, font caching, axum HTTP server - benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus categories, reading order scoring, regression CI, public datasets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:05:42 -04:00
jedarden	b805593973	Add six research documents covering output-side extraction topics - table-structure-reconstruction: line detection, gap analysis, Hough transform, graph-based cell reconstruction, merged cells, multi-page tables - mathematical-expression-handling: five encoding cases, OpenType MATH table, symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers - language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi, CJK vertical text, ligature normalization, whatlang/lingua integration - document-classification-and-zone-labeling: margin heuristics, font clustering, cross-page recurrence, footnote/caption/sidebar detection - post-extraction-normalization: hyphen handling, ligature expansion, paragraph reconstruction, Unicode normalization, pipeline ordering - chunking-for-llm-consumption: semantic snapping, heading hierarchy, sliding window overlap, table chunking strategies, token budget, late chunking Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:56:25 -04:00
jedarden	ef9c03095d	Add SDK architecture notes covering top 10 languages Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI options, and implementation sequencing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:51:25 -04:00
jedarden	c2870e6640	Add research docs and SDK invocation notes Four research documents covering PDF spec fundamentals, font types and encoding, glyph Unicode recovery, and tagged PDF structure/reading order. SDK invocation notes with subprocess and HTTP examples for Python, Node.js, Go, Ruby, Java, Rust, and Bash. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:33:34 -04:00
jedarden	4ae798c8b1	Initial repo scaffold with README and docs structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:26:16 -04:00

49 commits