pdftract/crates/pdftract-core/src
jedarden 24f5af8fc5 feat(pdftract-47zt): implement thread-local Tesseract instance management
Implement Phase 5.4 Tesseract integration with thread-local caching.
Each rayon worker thread holds one TessBaseAPI in a thread_local! RefCell,
with lazy initialization on first use and reinitialization only when OCR
configuration changes (language or tessdata path).

- Add TessOpts with PartialEq for cache comparison
- Add TessState wrapping TessBaseAPI + last opts
- Implement thread_local! TESS with RefCell<Option<TessState>>
- Implement borrow_or_init() helper with caching strategy
- Add tessdata path resolution: opts.tessdata_path > TESSDATA_PREFIX > default
- Add INIT_COUNT atomic for testing initialization behavior
- Implement all acceptance criteria tests (cache reuse, diff-opts, multithreaded)
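
The caching strategy in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in ocr.rs: `StubEngine` is a hypothetical stand-in for `TessBaseAPI` so the sketch builds without the Tesseract system libraries, and the fields of `TessOpts` and the signature of `borrow_or_init` are assumptions based only on the names in this commit message.

```rust
use std::cell::RefCell;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts engine initializations across all threads (used by tests).
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// OCR options; PartialEq lets the cache detect a configuration change.
/// The field set is an assumption for this sketch.
#[derive(Clone, Debug, PartialEq)]
struct TessOpts {
    language: String,
    tessdata_path: Option<PathBuf>,
}

/// Hypothetical stand-in for TessBaseAPI (not the real tesseract crate type).
struct StubEngine {
    language: String,
}

impl StubEngine {
    fn new(opts: &TessOpts) -> Self {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        StubEngine { language: opts.language.clone() }
    }

    /// Pretend OCR: reports the language and the input size.
    fn recognize(&mut self, image: &[u8]) -> String {
        format!("{}:{} bytes", self.language, image.len())
    }
}

/// The cached engine plus the options it was built with.
struct TessState {
    engine: StubEngine,
    last_opts: TessOpts,
}

thread_local! {
    /// One lazily created engine per thread.
    static TESS: RefCell<Option<TessState>> = RefCell::new(None);
}

/// Runs `f` with this thread's engine, creating it on first use and
/// rebuilding it only when `opts` differs from the cached options.
fn borrow_or_init<R>(opts: &TessOpts, f: impl FnOnce(&mut StubEngine) -> R) -> R {
    TESS.with(|cell| {
        let mut slot = cell.borrow_mut();
        let stale = match slot.as_ref() {
            Some(state) => state.last_opts != *opts,
            None => true,
        };
        if stale {
            *slot = Some(TessState {
                engine: StubEngine::new(opts),
                last_opts: opts.clone(),
            });
        }
        let state = slot.as_mut().expect("slot was filled above");
        f(&mut state.engine)
    })
}
```

Under these assumptions, repeated calls with equal options on one thread reuse the same engine, while a call with a different language replaces it. Passing a closure keeps the `RefCell` borrow scoped to the call; calling `borrow_or_init` again from inside that closure would panic on the double borrow.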

Dependencies:
- Add tesseract 0.15 crate (optional, ocr feature)

Tests:
- test_microbenchmark_cache_reuse: 100 calls → 1 init + 99 reuses ✓
- test_diff_opts_reinit: alternating languages → 2 inits ✓
- test_multithreaded_inits: 4 workers → at most 8 inits ✓
- test_resolve_tessdata_path_*: path resolution priority ✓
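
The path resolution priority exercised by these tests (opts.tessdata_path > TESSDATA_PREFIX > default) might look like the sketch below. The helper `resolve_from`, the signature of `resolve_tessdata_path`, the `/usr/share/tessdata` default, and the choice to treat an empty `TESSDATA_PREFIX` as unset are all assumptions for illustration; the commit message does not state them.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Placeholder default directory; the real default is not given in the commit.
const DEFAULT_TESSDATA: &str = "/usr/share/tessdata";

/// Pure resolution logic: explicit option first, then the environment value,
/// then the default. Taking the environment value as a parameter keeps this
/// function testable without mutating process state.
fn resolve_from(explicit: Option<&Path>, env_prefix: Option<&str>) -> PathBuf {
    if let Some(path) = explicit {
        return path.to_path_buf();
    }
    match env_prefix {
        // Assumption: an empty variable counts as unset.
        Some(prefix) if !prefix.is_empty() => PathBuf::from(prefix),
        _ => PathBuf::from(DEFAULT_TESSDATA),
    }
}

/// Convenience wrapper that reads TESSDATA_PREFIX from the environment.
fn resolve_tessdata_path(explicit: Option<&Path>) -> PathBuf {
    let env_prefix = env::var("TESSDATA_PREFIX").ok();
    resolve_from(explicit, env_prefix.as_deref())
}
```

Splitting the environment lookup from the priority logic means each priority level can be tested with plain arguments, which avoids races when tests run in parallel.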

Note: Full compilation requires the libleptonica-dev and libtesseract-dev
system packages. The Rust code is syntactically correct. The memory leak
test is marked WARN because it requires valgrind or a sanitizer on a
system with the OCR dependencies installed.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 23:04:59 -04:00
cache feat(pdftract-2i6rt): implement cache CLI subcommand and HTTP integration 2026-05-23 06:33:43 -04:00
fingerprint feat(pdftract-154mz): fix canonicalization module compilation 2026-05-20 19:24:38 -04:00
font feat(pdftract-43ry): implement predefined CMap registry 2026-05-23 23:00:59 -04:00
parser feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback 2026-05-23 20:53:25 -04:00
receipts feat(pdftract-36wlt): implement verify-receipt subcommand + verifier protocol 2026-05-23 04:00:15 -04:00
render feat(pdftract-4my): implement pdfium-render path behind full-render feature 2026-05-23 16:28:08 -04:00
schema feat(pdftract-sg6): implement DPI selection logic for OCR rendering 2026-05-23 17:37:40 -04:00
table feat(pdftract-2oqh): implement span-to-cell assignment by centroid containment 2026-05-23 22:50:42 -04:00
classify.rs feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate 2026-05-23 15:04:05 -04:00
diagnostics.rs feat(pdftract-3wku): implement deskew via pixFindSkewAndDeskew 2026-05-23 21:20:02 -04:00
document.rs feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback 2026-05-23 20:53:25 -04:00
dpi.rs feat(pdftract-sg6): implement DPI selection logic for OCR rendering 2026-05-23 17:37:40 -04:00
extract.rs feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback 2026-05-23 20:53:25 -04:00
graphics_state.rs feat(pdftract-byq): implement direct image compositing path (Phase 5.2.1) 2026-05-23 15:46:38 -04:00
hybrid.rs feat(pdftract-4y9l): implement hybrid page routing with bbox merge rule 2026-05-23 17:48:00 -04:00
lib.rs feat(pdftract-47zt): implement thread-local Tesseract instance management 2026-05-23 23:04:59 -04:00
ocr.rs feat(pdftract-47zt): implement thread-local Tesseract instance management 2026-05-23 23:04:59 -04:00
options.rs feat(pdftract-sg6): implement DPI selection logic for OCR rendering 2026-05-23 17:37:40 -04:00
preprocess.rs fix(pdftract-27n3): remove duplicate import in preprocess module 2026-05-23 21:55:11 -04:00
render.rs feat(pdftract-4my): implement pdfium-render path behind full-render feature 2026-05-23 16:28:08 -04:00
semaphore.rs fix(pdftract-bf-5mry9): fix compilation bugs in rayon parallel extraction 2026-05-23 12:02:54 -04:00