jedarden
9fca24c77a
docs(plan): SDKs are monorepo members, not separate repos
Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/
in this monorepo (single source of truth), generated via pdftract sdk codegen and
published to language registries from here. Retire the legacy standalone repos.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:21:45 -04:00
jedarden
2251f8a9c0
docs(plan): make bounded peak-RSS a CI-gated target; default max_decompress_bytes 2GB->512MB
Add a Memory targets table as a first-class acceptance criterion alongside
Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not
scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile
the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB
(root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc
under rayon page parallelism).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:25:50 -04:00
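The `max_decompress_bytes` ceiling described in the commit above can be illustrated with a bounded read. This is a minimal sketch using only the standard library, not the project's actual implementation: the function name `read_bounded` and the constant name are hypothetical, and the 512 MB value is taken from the commit message.

```rust
use std::io::{self, Read};

/// Hypothetical constant mirroring the 512 MB default named in the commit.
pub const DEFAULT_MAX_DECOMPRESS_BYTES: u64 = 512 * 1024 * 1024;

/// Reads everything from `reader`, failing if more than `max_bytes` arrive.
/// `reader` stands in for any decompressing stream (an assumption here).
pub fn read_bounded<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one byte past the limit so an overflow can be detected
    // without ever buffering more than max_bytes + 1 bytes.
    let mut limited = reader.take(max_bytes.saturating_add(1));
    limited.read_to_end(&mut out)?;
    if out.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decompressed size exceeds max_decompress_bytes",
        ));
    }
    Ok(out)
}
```

Because the buffer grows only as data arrives, peak memory is bounded by the limit rather than by a size declared in the input, which is the property the commit asks CI to gate on.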
jedarden
9f27d16f25
docs(phase-0.1): verify pdftract-ci scaffolding complete
Verified the pdftract-ci WorkflowTemplate exists in declarative-config
and is correctly synced to the iad-ci cluster. All scaffolding
requirements met for Phase 0.1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 03:24:36 -04:00
jedarden
7035706068
docs(plan): fix 3 HIGH gaps + 3 LOW items from Round 5 gap review
HIGH:
- Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE)
- Specify base64 encoding for attachment data field in Phase 7.5
- Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only);
add max_decompress_gb to CLI/Python/HTTP API surfaces
LOW:
- Split log+env_logger into two dep matrix rows for accurate crate count
- Add full_render to Python keyword args and HTTP form fields (with no-op note)
- Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:30:02 -04:00
jedarden
2ba51a8a73
docs(plan): fix 4 gaps from Round 4 gap review
- Fix quick-xml feature gate: move from ocr to default (XMP conformance detection)
- Make page_number schema update an explicit Phase 6.1 deliverable
- Add PageClass → page_type mapping table; define broken_vector as valid output value
- Fix CI test matrix: musl target excludes ocr/python features; glibc runs --all-features
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:24:12 -04:00
jedarden
2d194a4b1b
docs(plan): fix 15 gaps from Round 3 gap review
HIGH:
- Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer)
- Remove num_cpus reference (rayon default pool sizing is sufficient)
- Update dep count target to < 30 direct crates (< 20 was violated by plan's own list)
- Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7
- Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains)
MEDIUM:
- Document header/footer streaming mode limitation: first 3 pages emit as paragraph
- Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature
- Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3
- Specify /Contents array concatenation in Phase 1.4 page tree
- Add page rotation un-rotation step after Phase 3 glyph bbox computation
- Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg
- Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor
- Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine)
- Add wordlist-bloom to Feature flags bullet list
LOW:
- Clarify extract_stream() yields page dicts only, not header/footer frames
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:18:33 -04:00
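The glyph shape DB change in the commit above (a sorted `&'static [(u64, char)]` slice searched by Hamming distance) could look roughly like the sketch below. The function name, the `max_distance` parameter, and the exact-match fast path are assumptions for illustration; the commit only names the slice type and the nearest-neighbor requirement.

```rust
/// Finds the character whose 64-bit shape hash is closest to `query`
/// in Hamming distance. `db` is assumed to be sorted by hash.
pub fn nearest_glyph(
    db: &[(u64, char)],
    query: u64,
    max_distance: u32,
) -> Option<(char, u32)> {
    // Fast path: a sorted slice allows an exact match by binary search.
    if let Ok(i) = db.binary_search_by_key(&query, |&(hash, _)| hash) {
        return Some((db[i].1, 0));
    }
    // Otherwise scan linearly: XOR marks the differing bits and
    // count_ones() counts them, giving the Hamming distance.
    db.iter()
        .map(|&(hash, ch)| (ch, (hash ^ query).count_ones()))
        .filter(|&(_, dist)| dist <= max_distance)
        .min_by_key(|&(_, dist)| dist)
}
```

A hash map such as `phf::Map` supports only exact-key lookup, so it cannot answer "closest entry" queries; a plain slice can be scanned for the minimum distance while still supporting exact lookup through binary search.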
jedarden
eb799c0956
docs(plan): fix 21 gaps from Round 2 gap review
CRITICAL:
- Fix deskew step: pixDeskew operates on grayscale, not binarized image
HIGH:
- Add sha2 crate to dep matrix (needed for font fingerprint hashing)
- Fix bloomfilter feature: wordlist-bloom (optional), not default conditional
- Add build-dependencies subsection (phf_codegen, serde_json)
- Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic
- Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent
- Add strsim crate for Levenshtein in header/footer deduplication
- Add tokio::task::spawn_blocking bridge for axum→rayon hand-off
- Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX/CCITT_UNSUPPORTED diagnostics
- Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS)
MEDIUM:
- Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic
- Add Standard-14 font skip for Level 3 fingerprinting (no embedded program)
- Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep)
- Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list
- Add ocg_present to Phase 6.1 metadata field list
- Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields
- Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields
- Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7)
- Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology)
- Remove frame-index notation from NDJSON streaming critical test
- Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:05:26 -04:00
jedarden
bcccc98fd7
docs(plan): fix 30 gaps from Round 1 gap review
CRITICAL fixes:
- Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix)
- Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed
- Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch
- Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2)
- Add aes + rc4 crates under new decrypt feature; document crypto dependency
HIGH fixes:
- Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source)
- Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition
- Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching
- Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation
- Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params
- Add JavaScript detection spec to Phase 1.4 (all four JS action locations)
- Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives)
- Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB
- Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach
- Add ConfidenceSource enum → schema string mapping table in Phase 4.1
MEDIUM fixes:
- Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable
- Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection
- Define Color enum with CSS hex conversion rules in Phase 3.1
- Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>)
- Specify NDJSON buffer Condvar blocking behavior at window saturation
- Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets
- Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition
- Add code and formula block kind detection heuristics to Phase 4.4
- Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS)
- Add linearized PDF detection and dual-xref merge to Phase 1.3
- Add HTTP 413 to error table with custom JSON rejection handler
- Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate)
LOW fixes:
- Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5
- Reorder preprocessing pipeline: contrast normalization before binarization (was after)
- Add CIDToGIDMap stream form: 2-byte big-endian GID array
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 17:45:04 -04:00
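The CIDToGIDMap stream form listed in the commit above stores one 2-byte big-endian glyph ID per CID, with the entry for CID `c` at byte offset `2 * c`. A minimal sketch of decoding it follows; the function names are hypothetical, and dropping a trailing odd byte is a choice made here rather than something the commit specifies.

```rust
/// Decodes an already-decompressed CIDToGIDMap stream into a GID table
/// indexed by CID. A trailing odd byte, if present, is ignored.
pub fn parse_cid_to_gid_map(stream: &[u8]) -> Vec<u16> {
    stream
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}

/// Looks up the glyph ID for `cid`, or None if the CID lies past the
/// end of the table.
pub fn gid_for_cid(map: &[u16], cid: u16) -> Option<u16> {
    map.get(cid as usize).copied()
}
```

For example, the bytes `00 00 00 05 01 02` decode to a table in which CID 1 maps to GID 5 and CID 2 maps to GID 258.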
jedarden
d161d109b3
docs(plan): revise plan to center accuracy/speed/weight as hard targets
- Add Primary Objectives section with CI-gated measurable targets:
accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s,
10x vs pdfminer), weight (<4MB default binary, <20 default deps)
- Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional;
default build is core extraction + CLI only
- Add Phase 4.7: text readability validation and correction pipeline
(ligature repair, hyphenation, mojibake detection, readability scoring)
- Make pdfium-render explicitly optional (full-render feature) vs. the
always-present direct image compositing path
- Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber)
- Remove jpeg-decoder and whichlang from dependency matrix (unnecessary)
- Rename implementation-plan.md → plan.md (matches CLAUDE.md reference)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 17:07:48 -04:00
jedarden
12fad41596
Add research: span merging, Unicode normalization, implementation plan
Two new research documents covering the glyph-to-span-to-block assembly
pipeline (inter-operator merging, adaptive word gap threshold, column
detection, ligature bbox splitting, multi-granularity output) and
Unicode post-processing (NFC normalization, selective NFKC decomposition
for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ
handling, combining character reordering).
Also adds docs/plan/implementation-plan.md: the full 7-phase Rust
implementation roadmap covering core parser, font/encoding pipeline,
content stream processing, text assembly, OCR integration, API surface,
and advanced features — with crate selections, complexity ratings,
test strategy, and v0.1–v1.0 release milestones.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:15:14 -04:00
jedarden
4ae798c8b1
Initial repo scaffold with README and docs structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:26:16 -04:00