pdftract/crates
jedarden 24f5af8fc5 feat(pdftract-47zt): implement thread-local Tesseract instance management
Implement Phase 5.4 Tesseract integration with thread-local caching.
Each rayon worker thread holds one TessBaseAPI in a thread_local! RefCell,
with lazy initialization on first use and reinitialization only when OCR
configuration changes (language or tessdata path).

- Add TessOpts with PartialEq for cache comparison
- Add TessState wrapping TessBaseAPI + last opts
- Implement thread_local! TESS with RefCell<Option<TessState>>
- Implement borrow_or_init() helper with caching strategy
- Add tessdata path resolution: opts.tessdata_path > TESSDATA_PREFIX > default
- Add INIT_COUNT atomic for testing initialization behavior
- Implement all acceptance criteria tests (cache reuse, diff-opts, multithreaded)

Dependencies:
- Add tesseract 0.15 crate (optional, ocr feature)

Tests:
- test_microbenchmark_cache_reuse: 100 calls → 1 init + 99 reuses ✓
- test_diff_opts_reinit: alternating languages → 2 inits ✓
- test_multithreaded_inits: 4 workers → at most 8 inits ✓
- test_resolve_tessdata_path_*: path resolution priority ✓

Note: Full compilation requires libleptonica-dev and libtesseract-dev
system packages. Rust code is syntactically correct; WARN for memory
leak test (requires valgrind/sanitizer on system with OCR deps).

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 23:04:59 -04:00
..
pdftract-cer-diff docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
pdftract-cli feat(pdftract-27n3): implement border padding, pipeline orchestration, and fixtures 2026-05-23 21:55:11 -04:00
pdftract-core feat(pdftract-47zt): implement thread-local Tesseract instance management 2026-05-23 23:04:59 -04:00
pdftract-libpdftract feat(pdftract-juc): implement Standard 14 font metrics registry 2026-05-23 14:04:02 -04:00
pdftract-py docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00