jedarden/pdftract

Author	SHA1	Message	Date
jedarden	9fca24c77a	docs(plan): SDKs are monorepo members, not separate repos Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/ in this monorepo (single source of truth), generated via pdftract sdk codegen and published to language registries from here. Retire the legacy standalone repos. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:21:45 -04:00
jedarden	2251f8a9c0	docs(plan): make bounded peak-RSS a CI-gated target; default max_decompress_bytes 2GB->512MB Add a Memory targets table as a first-class acceptance criterion alongside Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB (root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc under rayon page parallelism). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 23:25:50 -04:00
jedarden	9f27d16f25	docs(phase-0.1): verify pdftract-ci scaffolding complete Verified the pdftract-ci WorkflowTemplate exists in declarative-config and is correctly synced to the iad-ci cluster. All scaffolding requirements met for Phase 0.1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 03:24:36 -04:00
jedarden	7035706068	docs(plan): fix 3 HIGH gaps + 3 LOW items from Round 5 gap review HIGH: - Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE) - Specify base64 encoding for attachment data field in Phase 7.5 - Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only); add max_decompress_gb to CLI/Python/HTTP API surfaces LOW: - Split log+env_logger into two dep matrix rows for accurate crate count - Add full_render to Python keyword args and HTTP form fields (with no-op note) - Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:30:02 -04:00
jedarden	2ba51a8a73	docs(plan): fix 4 gaps from Round 4 gap review - Fix quick-xml feature gate: move from ocr to default (XMP conformance detection) - Make page_number schema update an explicit Phase 6.1 deliverable - Add PageClass → page_type mapping table; define broken_vector as valid output value - Fix CI test matrix: musl target excludes ocr/python features; glibc runs --all-features Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:24:12 -04:00
jedarden	2d194a4b1b	docs(plan): fix 15 gaps from Round 3 gap review HIGH: - Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer) - Remove num_cpus reference (rayon default pool sizing is sufficient) - Update dep count target to < 30 direct crates (< 20 was violated by plan's own list) - Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7 - Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains) MEDIUM: - Document header/footer streaming mode limitation: first 3 pages emit as paragraph - Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature - Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3 - Specify /Contents array concatenation in Phase 1.4 page tree - Add page rotation un-rotation step after Phase 3 glyph bbox computation - Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg - Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor - Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine) - Add wordlist-bloom to Feature flags bullet list LOW: - Clarify extract_stream() yields page dicts only, not header/footer frames Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:18:33 -04:00
jedarden	eb799c0956	docs(plan): fix 21 gaps from Round 2 gap review CRITICAL: - Fix deskew step: pixDeskew operates on grayscale, not binarized image HIGH: - Add sha2 crate to dep matrix (needed for font fingerprint hashing) - Fix bloomfilter feature: wordlist-bloom (optional), not default conditional - Add build-dependencies subsection (phf_codegen, serde_json) - Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic - Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent - Add strsim crate for Levenshtein in header/footer deduplication - Add tokio::task::spawn_blocking bridge for axum→rayon hand-off - Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX/CCITT_UNSUPPORTED diagnostics - Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS) MEDIUM: - Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic - Add Standard-14 font skip for Level 3 fingerprinting (no embedded program) - Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep) - Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list - Add ocg_present to Phase 6.1 metadata field list - Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields - Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields - Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7) - Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology) - Remove frame-index notation from NDJSON streaming critical test - Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:05:26 -04:00
jedarden	bcccc98fd7	docs(plan): fix 30 gaps from Round 1 gap review CRITICAL fixes: - Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix) - Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed - Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch - Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2) - Add aes + rc4 crates under new decrypt feature; document crypto dependency HIGH fixes: - Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source) - Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition - Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching - Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation - Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params - Add JavaScript detection spec to Phase 1.4 (all four JS action locations) - Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives) - Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB - Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach - Add ConfidenceSource enum → schema string mapping table in Phase 4.1 MEDIUM fixes: - Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable - Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection - Define Color enum with CSS hex conversion rules in Phase 3.1 - Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>) - Specify NDJSON buffer Condvar blocking behavior at window saturation - Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets - Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition - Add code and formula block kind detection heuristics to Phase 4.4 - Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS) - Add linearized PDF detection and dual-xref merge to Phase 1.3 - Add HTTP 413 to error table with custom JSON rejection handler - Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate) LOW fixes: - Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5 - Reorder preprocessing pipeline: contrast normalization before binarization (was after) - Add CIDToGIDMap stream form: 2-byte big-endian GID array Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:45:04 -04:00
jedarden	d161d109b3	docs(plan): revise plan to center accuracy/speed/weight as hard targets - Add Primary Objectives section with CI-gated measurable targets: accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s, 10x vs pdfminer), weight (<4MB default binary, <20 default deps) - Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional; default build is core extraction + CLI only - Add Phase 4.7: text readability validation and correction pipeline (ligature repair, hyphenation, mojibake detection, readability scoring) - Make pdfium-render explicitly optional (full-render feature) vs. the always-present direct image compositing path - Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber) - Remove jpeg-decoder and whichlang from dependency matrix (unnecessary) - Rename implementation-plan.md → plan.md (matches CLAUDE.md reference) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:07:48 -04:00
jedarden	12fad41596	Add research: span merging, Unicode normalization, implementation plan Two new research documents covering the glyph-to-span-to-block assembly pipeline (inter-operator merging, adaptive word gap threshold, column detection, ligature bbox splitting, multi-granularity output) and Unicode post-processing (NFC normalization, selective NFKC decomposition for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ handling, combining character reordering). Also adds docs/plan/implementation-plan.md: the full 7-phase Rust implementation roadmap covering core parser, font/encoding pipeline, content stream processing, text assembly, OCR integration, API surface, and advanced features — with crate selections, complexity ratings, test strategy, and v0.1–v1.0 release milestones. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:15:14 -04:00
jedarden	4ae798c8b1	Initial repo scaffold with README and docs structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:26:16 -04:00

11 commits