pdftract/notes/pdftract-29gu.md


Verification Note: pdftract-29gu

Bead: 5.5.3: Region-level confidence policy (>0.7 keep, <0.3 fallback to pure OCR) + PSM_SPARSE_TEXT wiring

Summary

Implemented Phase 5.5.3 region-level confidence policy and PSM_SPARSE_TEXT wiring for assisted OCR.

Changes Made

1. Added OcrFallback variant to SpanSource enum (hybrid.rs)

  • Added new variant SpanSource::OcrFallback for OCR fallback spans
  • Added constructor method Span::ocr_fallback() for creating fallback spans
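
A minimal sketch of what this change might look like. Only `SpanSource::OcrFallback`, `SpanSource::OcrAssisted`, and `Span::ocr_fallback()` are named in this note; the `Native` variant, the `Span` fields, and the constructor's parameters are assumptions made for illustration and may not match hybrid.rs.

```rust
/// Where a span's text came from.
/// `Native` is an assumed placeholder; the note only names
/// `OcrAssisted` and the new `OcrFallback` variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanSource {
    Native,
    OcrAssisted,
    OcrFallback,
}

/// Hypothetical span shape: the real struct likely carries more fields
/// (bounding box, font data, and so on).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub confidence: f32,
    pub source: SpanSource,
}

impl Span {
    /// Builds a span tagged as produced by the pure-OCR fallback path.
    /// The parameter list is an assumption.
    pub fn ocr_fallback(text: impl Into<String>, confidence: f32) -> Self {
        Span {
            text: text.into(),
            confidence,
            source: SpanSource::OcrFallback,
        }
    }
}
```

Keeping the tag inside a dedicated constructor means callers cannot forget to set the source when they emit fallback spans.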

2. Added page_seg_mode to TessOpts (ocr.rs)

  • Added page_seg_mode: Option<PageSegMode> field to TessOpts
  • Added TessOpts::with_page_seg_mode() constructor
  • Updated TessState::new() to call api.set_page_seg_mode() when specified
  • Updated all tests to include the new field
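
The wiring described above can be sketched roughly as follows. `PageSegMode` and `FakeApi` here are local stand-ins, not the real Tesseract binding types, and `init_state` is a hypothetical substitute for `TessState::new()`; the `lang` field is likewise assumed. The point is only that the mode is optional and is applied to the API handle solely when set.

```rust
/// Stand-in for the binding's page segmentation enum (assumption:
/// only two modes are shown here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageSegMode {
    PsmAuto,
    PsmSparseText,
}

/// Sketch of the options struct; `lang` is a hypothetical extra field.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct TessOpts {
    pub lang: String,
    pub page_seg_mode: Option<PageSegMode>,
}

impl TessOpts {
    /// Default options with an explicit page segmentation mode.
    pub fn with_page_seg_mode(mode: PageSegMode) -> Self {
        TessOpts {
            page_seg_mode: Some(mode),
            ..TessOpts::default()
        }
    }
}

/// Fake API handle that records the mode it was given.
#[derive(Debug, Default)]
pub struct FakeApi {
    pub applied_mode: Option<PageSegMode>,
}

impl FakeApi {
    pub fn set_page_seg_mode(&mut self, mode: PageSegMode) {
        self.applied_mode = Some(mode);
    }
}

/// Hypothetical initialisation step: only touch the engine's mode
/// when the caller asked for one, so the engine default is preserved.
pub fn init_state(api: &mut FakeApi, opts: &TessOpts) {
    if let Some(mode) = opts.page_seg_mode {
        api.set_page_seg_mode(mode);
    }
}
```

Using `Option` rather than a default mode leaves existing callers untouched: options built without a mode never call `set_page_seg_mode()`.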

3. Added threshold constants (ocr.rs)

  • ASSISTED_OCR_KEEP_THRESH = 0.7 - threshold for keeping high-confidence regions
  • ASSISTED_OCR_FALLBACK_THRESH = 0.3 - threshold for triggering fallback
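
As an illustration of how the two thresholds split the confidence range into three bands, here is a small sketch. The constant names and values come from this note; the `f32` type, the `RegionDecision` enum, and `classify_region()` are hypothetical helpers introduced only for the example.

```rust
pub const ASSISTED_OCR_KEEP_THRESH: f32 = 0.7;
pub const ASSISTED_OCR_FALLBACK_THRESH: f32 = 0.3;

/// Hypothetical three-way outcome for a region.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegionDecision {
    /// mean > 0.7: keep the assisted-OCR result.
    Keep,
    /// 0.3 <= mean <= 0.7: leave the region as-is.
    KeepAsIs,
    /// mean < 0.3: hand the words back for a pure-OCR rerun.
    Fallback,
}

/// Both comparisons are strict, so a mean of exactly 0.7 or 0.3
/// lands in the middle band, matching the acceptance criteria.
pub fn classify_region(mean_confidence: f32) -> RegionDecision {
    if mean_confidence > ASSISTED_OCR_KEEP_THRESH {
        RegionDecision::Keep
    } else if mean_confidence < ASSISTED_OCR_FALLBACK_THRESH {
        RegionDecision::Fallback
    } else {
        RegionDecision::KeepAsIs
    }
}
```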

4. Implemented region-level confidence policy (ocr.rs)

  • Added apply_region_level_confidence_policy() function that:
    • Groups OCR words into regions by baseline proximity (within 12pt)
    • Computes mean confidence for each region
    • Returns spans with appropriate source + list of words needing fallback
  • Added group_words_by_region() helper function
  • Added OcrRegion struct to hold region data
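
The grouping and policy steps above could be sketched as follows. This is not the code in ocr.rs: the field layout of `OcrWord`, `OcrRegion`, and `Span`, the tolerance constant name, and the choice to compare each word against the first word of the current region are all assumptions. Regions in the middle and high bands are both emitted with `OcrAssisted` here, on the assumption that "as-is" means the incoming source is left unchanged.

```rust
const ASSISTED_OCR_FALLBACK_THRESH: f32 = 0.3;
/// Hypothetical name for the 12pt grouping tolerance.
const REGION_BASELINE_TOLERANCE_PT: f32 = 12.0;

#[derive(Debug, Clone, PartialEq)]
pub struct OcrWord {
    pub text: String,
    pub y0: f32,
    pub height: f32,
    pub confidence: f32,
}

impl OcrWord {
    /// Baseline formula quoted in the technical notes.
    fn baseline(&self) -> f32 {
        self.y0 + self.height * 0.2
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct OcrRegion {
    pub words: Vec<OcrWord>,
}

impl OcrRegion {
    /// Regions are never empty, so the division is safe.
    pub fn mean_confidence(&self) -> f32 {
        let sum: f32 = self.words.iter().map(|w| w.confidence).sum();
        sum / self.words.len() as f32
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanSource {
    OcrAssisted,
    OcrFallback,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub source: SpanSource,
}

/// Sort by baseline, then start a new region whenever a word sits more
/// than the tolerance away from the region's first word.
pub fn group_words_by_region(mut words: Vec<OcrWord>) -> Vec<OcrRegion> {
    words.sort_by(|a, b| a.baseline().total_cmp(&b.baseline()));
    let mut regions: Vec<OcrRegion> = Vec::new();
    for word in words {
        let starts_new = match regions.last() {
            Some(r) => {
                (word.baseline() - r.words[0].baseline()).abs() > REGION_BASELINE_TOLERANCE_PT
            }
            None => true,
        };
        if starts_new {
            regions.push(OcrRegion { words: vec![word] });
        } else if let Some(r) = regions.last_mut() {
            r.words.push(word);
        }
    }
    regions
}

/// Returns (kept_spans, fallback_words).
pub fn apply_region_level_confidence_policy(words: Vec<OcrWord>) -> (Vec<Span>, Vec<OcrWord>) {
    let mut kept = Vec::new();
    let mut fallback = Vec::new();
    for region in group_words_by_region(words) {
        if region.mean_confidence() < ASSISTED_OCR_FALLBACK_THRESH {
            fallback.extend(region.words);
        } else {
            kept.extend(region.words.into_iter().map(|w| Span {
                text: w.text,
                source: SpanSource::OcrAssisted,
            }));
        }
    }
    (kept, fallback)
}
```

Because the decision uses the region mean, one low-confidence word on an otherwise clean line does not by itself send that line to fallback.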

5. Added JSON schema TODO (schema/mod.rs)

  • Documented that Phase 6.1 should add "ocr-fallback" to confidence_source enum
  • Added TODO comment linking to plan lines 363, 1662

Acceptance Criteria

PASS

  • ASSISTED_OCR_KEEP_THRESH = 0.7 constant defined
  • ASSISTED_OCR_FALLBACK_THRESH = 0.3 constant defined
  • SpanSource::OcrFallback variant added to enum
  • TessOpts has page_seg_mode: Option<PageSegMode> field
  • apply_region_level_confidence_policy() function groups words by baseline
  • Region with mean confidence > 0.7 keeps OcrAssisted source
  • Region with mean confidence < 0.3 returns words for fallback
  • Region with 0.3 <= mean <= 0.7 is kept as-is
  • Code compiles: cargo check --package pdftract-core --lib succeeds
  • Code formatted: cargo fmt applied

WARN

  • [~] PSM_SPARSE_TEXT verified via trace on BrokenVector page
    • Reason: Requires OCR feature with system dependencies (pkg-config, leptonica) not available in this environment
    • The TessOpts::with_page_seg_mode(PageSegMode::PsmSparseText) API is available for use when OCR is enabled
  • [~] confidence_source values present in 6.1 Schema enum
    • Reason: Phase 6.1 schema not yet implemented; TODO comment added to schema/mod.rs documenting the requirement

FAIL

  • None

Test Results

Added tests in ocr.rs:

  • test_region_level_policy_high_confidence_region - verifies regions with mean > 0.7 are kept
  • test_region_level_policy_low_confidence_region - verifies regions with mean < 0.3 trigger fallback
  • test_region_level_policy_medium_confidence_region - verifies 0.3 <= mean <= 0.7 regions kept as-is
  • test_region_level_policy_multiple_regions - verifies multiple regions with different confidence levels
  • test_group_words_by_region_empty - edge case: empty word list
  • test_group_words_by_region_single_word - edge case: single word

Note: These tests require the ocr feature (system dependencies: pkg-config, leptonica) and are skipped when the feature is not available.

Technical Notes

  1. Region grouping algorithm: Words are grouped by baseline proximity, within a 12pt tolerance. This matches the Phase 4.2 line-formation logic.

  2. Fallback mechanism: The apply_region_level_confidence_policy() function returns a tuple of (kept_spans, fallback_words). The caller is responsible for re-running Tesseract on fallback_words without the validation filter.

  3. No re-preprocessing: As noted in the bead, the fallback rerun should reuse the cell image already in memory without re-running Phase 5.3 preprocessing.

  4. Baseline computation: Uses the same formula as Phase 4.2: baseline = y0 + (bbox_height * 0.2)
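
Notes 2 and 3 describe the caller's side of the contract. A rough sketch of that caller, under heavy assumptions: `CellImage`, `BBox`, `FallbackWord`, `FallbackSpan`, and `rerun_fallback()` are all hypothetical, and the `pure_ocr` closure stands in for the actual Tesseract rerun without the validation filter. The sketch only shows that the already-preprocessed image is borrowed and reused rather than rebuilt.

```rust
/// Hypothetical in-memory cell image, already preprocessed by Phase 5.3.
pub struct CellImage {
    pub pixels: Vec<u8>,
    pub width: u32,
    pub height: u32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Hypothetical shape of a word returned for fallback.
pub struct FallbackWord {
    pub bbox: BBox,
}

/// Hypothetical result of the rerun; real code would build an
/// OcrFallback-tagged span instead.
#[derive(Debug, Clone, PartialEq)]
pub struct FallbackSpan {
    pub text: String,
    pub bbox: BBox,
}

/// Re-reads each fallback word from the same borrowed image.
/// No preprocessing happens here: the image is passed through untouched.
pub fn rerun_fallback<F>(
    image: &CellImage,
    fallback_words: &[FallbackWord],
    mut pure_ocr: F,
) -> Vec<FallbackSpan>
where
    F: FnMut(&CellImage, BBox) -> String,
{
    fallback_words
        .iter()
        .map(|w| FallbackSpan {
            text: pure_ocr(image, w.bbox),
            bbox: w.bbox,
        })
        .collect()
}
```

Taking the image by shared reference makes the "no re-preprocessing" rule visible in the signature: the function cannot replace or mutate the image it was handed.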

Git Commits

  • Commit: feat(pdftract-29gu): implement Phase 5.5.3 region-level confidence policy

References

  • Plan section: Phase 5.5 step 5 (line 1937)
  • Plan section: Phase 5.4 PSM modes (line 1934)
  • INV-7 confidence_source (plan lines 363, 1662)