pdftract/docs
jedarden eb799c0956 docs(plan): fix 21 gaps from Round 2 gap review
CRITICAL:
- Fix deskew step: pixDeskew operates on grayscale, not binarized image

HIGH:
- Add sha2 crate to dep matrix (needed for font fingerprint hashing)
- Fix bloomfilter feature: wordlist-bloom (optional), not default conditional
- Add build-dependencies subsection (phf_codegen, serde_json)
- Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic
- Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent
- Add strsim crate for Levenshtein in header/footer deduplication
- Add tokio::task::spawn_blocking bridge for axum→rayon hand-off
- Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX_UNSUPPORTED and OCR_CCITT_UNSUPPORTED diagnostics
- Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS)
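The OCG fix above can be illustrated with a minimal sketch of resolving a group's default visibility from /D/BaseState, /D/ON and /D/OFF. The type and function names (`BaseState`, `OcgDefaultConfig`, `is_visible`) and the use of plain `u32` object numbers are assumptions for illustration, not taken from the pdftract plan. Letting /OFF win when a group appears in both arrays under `Unchanged` is also an assumption.

```rust
use std::collections::HashSet;

/// Value of /D/BaseState in the default optional-content configuration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BaseState {
    On,
    Off,
    Unchanged,
}

/// Hypothetical in-memory view of the /D dictionary; OCGs are identified
/// here by object number purely for illustration.
struct OcgDefaultConfig {
    base_state: BaseState,
    on: HashSet<u32>,  // groups listed in /D/ON
    off: HashSet<u32>, // groups listed in /D/OFF
}

impl OcgDefaultConfig {
    /// Resolve the default visibility of one group. `prior` is the state
    /// to keep when BaseState is Unchanged.
    fn is_visible(&self, ocg: u32, prior: bool) -> bool {
        match self.base_state {
            // Everything starts on; only /OFF can hide a group.
            BaseState::On => !self.off.contains(&ocg),
            // Everything starts off; only /ON can show a group.
            BaseState::Off => self.on.contains(&ocg),
            // Keep the prior state, then apply /ON, then /OFF (assumed order).
            BaseState::Unchanged => {
                let mut visible = prior;
                if self.on.contains(&ocg) {
                    visible = true;
                }
                if self.off.contains(&ocg) {
                    visible = false;
                }
                visible
            }
        }
    }
}
```

With `base_state: BaseState::Off` and only group 7 listed in `on`, `is_visible(7, false)` is true and every other group stays hidden.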

MEDIUM:
- Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic
- Add Standard-14 font skip for Level 3 fingerprinting (no embedded program)
- Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep)
- Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list
- Add ocg_present to Phase 6.1 metadata field list
- Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields
- Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields
- Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7)
- Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology)
- Remove frame-index notation from NDJSON streaming critical test
- Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap)
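For the flags change listed above, a `u8` bitmask replacing `EnumSet<SpanFlag>` might look roughly like the sketch below. The wrapper type `SpanFlags` and the flag names `BOLD`, `ITALIC` and `INVISIBLE` are hypothetical; the commit does not list the actual flags or their bit positions.

```rust
/// Hypothetical span flags packed into one byte (names are illustrative only).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct SpanFlags(u8);

impl SpanFlags {
    const BOLD: u8 = 1 << 0;
    const ITALIC: u8 = 1 << 1;
    const INVISIBLE: u8 = 1 << 2;

    /// Return a copy with `flag` set.
    fn with(self, flag: u8) -> Self {
        SpanFlags(self.0 | flag)
    }

    /// Return a copy with `flag` cleared.
    fn without(self, flag: u8) -> Self {
        SpanFlags(self.0 & !flag)
    }

    /// True when every bit of `flag` is set.
    fn contains(self, flag: u8) -> bool {
        (self.0 & flag) == flag
    }
}
```

A plain byte serializes as a single integer and needs no extra crate, which matches the stated goal of removing the undocumented enumset dependency.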

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:05:26 -04:00
notes              Add SDK architecture notes covering top 10 languages               2026-05-16 14:51:25 -04:00
plan               docs(plan): fix 21 gaps from Round 2 gap review                    2026-05-16 18:05:26 -04:00
research           Add parallel extraction research and comprehensive research index  2026-05-16 16:30:35 -04:00
research-index.md  Add parallel extraction research and comprehensive research index  2026-05-16 16:30:35 -04:00