pdftract

.gitkeep	Initial repo scaffold with README and docs structure	2026-05-16 14:26:16 -04:00
plan.md	docs(plan): fix 15 gaps from Round 3 gap review	2026-05-16 18:18:33 -04:00

jedarden 2d194a4b1b docs(plan): fix 15 gaps from Round 3 gap review

HIGH:
- Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer)
- Remove num_cpus reference (rayon default pool sizing is sufficient)
- Update dep count target to < 30 direct crates (< 20 was violated by plan's own list)
- Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7
- Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains)

MEDIUM:
- Document header/footer streaming mode limitation: first 3 pages emit as paragraph
- Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature
- Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3
- Specify /Contents array concatenation in Phase 1.4 page tree
- Add page rotation un-rotation step after Phase 3 glyph bbox computation
- Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg
- Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor
- Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine)
- Add wordlist-bloom to Feature flags bullet list

LOW:
- Clarify extract_stream() yields page dicts only, not header/footer frames

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 18:18:33 -04:00
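
The glyph shape DB bullet above swaps `phf::Map` for a sorted `&'static [(u64,char)]` slice. A perfect-hash map can only answer exact-key lookups, while a Hamming nearest-neighbor query has to compare the probe against candidate keys. The following is a minimal sketch of how such a lookup could work, not the plan's actual implementation: the table contents, the function name `nearest_glyph`, and the `max_distance` cutoff are all hypothetical.

```rust
/// Hypothetical glyph-shape table: (64-bit shape hash, character),
/// kept sorted by hash so exact matches can use binary search.
static GLYPH_SHAPES: &[(u64, char)] = &[
    (0x0000_0000_0000_00FF, 'a'),
    (0x0000_0000_FFFF_0000, 'b'),
    (0xFFFF_0000_0000_0000, 'c'),
];

/// Returns the character whose hash is closest to `query` in Hamming
/// distance, or `None` if every entry is farther than `max_distance` bits.
fn nearest_glyph(table: &[(u64, char)], query: u64, max_distance: u32) -> Option<char> {
    // Fast path: the sort order allows an exact-match binary search.
    if let Ok(i) = table.binary_search_by_key(&query, |&(hash, _)| hash) {
        return Some(table[i].1);
    }
    // Slow path: linear scan. XOR marks the differing bits and
    // `count_ones` counts them, which is the Hamming distance.
    table
        .iter()
        .map(|&(hash, ch)| ((hash ^ query).count_ones(), ch))
        .filter(|&(dist, _)| dist <= max_distance)
        .min_by_key(|&(dist, _)| dist)
        .map(|(_, ch)| ch)
}
```

With this toy table, `nearest_glyph(GLYPH_SHAPES, 0xFE, 4)` differs from the first entry by one bit and so resolves to `'a'`, while a probe of `0` is at least eight bits from every entry and yields `None`. The sort order only speeds up the exact-match case; the fallback is still a full scan, which is an assumption about acceptable table size rather than something the commit states.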
