Commit graph

3 commits

Author SHA1 Message Date
jedarden
d161d109b3 docs(plan): revise plan to center accuracy/speed/weight as hard targets
- Add Primary Objectives section with CI-gated measurable targets:
  accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s,
  10x vs pdfminer), weight (<4MB default binary, <20 default deps)
- Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional;
  default build is core extraction + CLI only
- Add Phase 4.7: text readability validation and correction pipeline
  (ligature repair, hyphenation, mojibake detection, readability scoring)
- Make pdfium-render explicitly optional (full-render feature) vs. the
  always-present direct image compositing path
- Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber)
- Remove jpeg-decoder and whichlang from dependency matrix (unnecessary)
- Rename implementation-plan.md → plan.md (matches CLAUDE.md reference)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 17:07:48 -04:00
jedarden
12fad41596 Add research: span merging, Unicode normalization, implementation plan
Two new research documents covering the glyph-to-span-to-block assembly
pipeline (inter-operator merging, adaptive word gap threshold, column
detection, ligature bbox splitting, multi-granularity output) and
Unicode post-processing (NFC normalization, selective NFKC decomposition
for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ
handling, combining character reordering).

Also adds docs/plan/implementation-plan.md: the full 7-phase Rust
implementation roadmap covering core parser, font/encoding pipeline,
content stream processing, text assembly, OCR integration, API surface,
and advanced features — with crate selections, complexity ratings,
test strategy, and v0.1–v1.0 release milestones.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:15:14 -04:00
jedarden
4ae798c8b1 Initial repo scaffold with README and docs structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:26:16 -04:00