- Add comprehensive concurrency model documentation to serve.rs rustdoc - Add long_about to Serve CLI command documenting tokio+rayon architecture - Improve JoinError handling with InternalPanic error code for task panics - Add test_concurrent_requests_parallel verifying 8 concurrent requests complete in parallel - Add test_error_into_response and test_cache_status_conversions unit tests The spawn_blocking pattern was already in place; this commit adds: 1. Documentation of the concurrency model in rustdoc and CLI help 2. Proper panic detection via JoinError::is_panic() 3. Error code INTERNAL_PANIC for panicking tasks 4. Integration test proving concurrent request parallelism Closes: pdftract-jmh6w
91 lines
4.2 KiB
Markdown
91 lines
4.2 KiB
Markdown
# Verification Note: pdftract-29gu
|
|
|
|
## Bead: 5.5.3: Region-level confidence policy (>0.7 keep, <0.3 fallback to pure OCR) + PSM_SPARSE_TEXT wiring
|
|
|
|
## Summary
|
|
|
|
Implemented Phase 5.5.3 region-level confidence policy and PSM_SPARSE_TEXT wiring for assisted OCR.
|
|
|
|
## Changes Made
|
|
|
|
### 1. Added `OcrFallback` variant to `SpanSource` enum (`hybrid.rs`)
|
|
- Added new variant `SpanSource::OcrFallback` for OCR fallback spans
|
|
- Added constructor method `Span::ocr_fallback()` for creating fallback spans
|
|
|
|
### 2. Added `page_seg_mode` to `TessOpts` (`ocr.rs`)
|
|
- Added `page_seg_mode: Option<PageSegMode>` field to `TessOpts`
|
|
- Added `TessOpts::with_page_seg_mode()` constructor
|
|
- Updated `TessState::new()` to call `api.set_page_seg_mode()` when specified
|
|
- Updated all tests to include the new field
|
|
|
|
### 3. Added threshold constants (`ocr.rs`)
|
|
- `ASSISTED_OCR_KEEP_THRESH = 0.7` - threshold for keeping high-confidence regions
|
|
- `ASSISTED_OCR_FALLBACK_THRESH = 0.3` - threshold for triggering fallback
|
|
|
|
### 4. Implemented region-level confidence policy (`ocr.rs`)
|
|
- Added `apply_region_level_confidence_policy()` function that:
|
|
- Groups OCR words into regions by baseline proximity (within 12pt)
|
|
- Computes mean confidence for each region
|
|
- Returns spans with appropriate source + list of words needing fallback
|
|
- Added `group_words_by_region()` helper function
|
|
- Added `OcrRegion` struct to hold region data
|
|
|
|
### 5. Added JSON schema TODO (`schema/mod.rs`)
|
|
- Documented that Phase 6.1 should add "ocr-fallback" to `confidence_source` enum
|
|
- Added TODO comment linking to plan lines 363, 1662
|
|
|
|
## Acceptance Criteria
|
|
|
|
### PASS
|
|
- [x] `ASSISTED_OCR_KEEP_THRESH = 0.7` constant defined
|
|
- [x] `ASSISTED_OCR_FALLBACK_THRESH = 0.3` constant defined
|
|
- [x] `SpanSource::OcrFallback` variant added to enum
|
|
- [x] `TessOpts` has `page_seg_mode: Option<PageSegMode>` field
|
|
- [x] `apply_region_level_confidence_policy()` function groups words by baseline
|
|
- [x] Region with mean confidence > 0.7 keeps `OcrAssisted` source
|
|
- [x] Region with mean confidence < 0.3 returns words for fallback
|
|
- [x] Region with 0.3 <= mean <= 0.7 keeps as-is
|
|
- [x] Code compiles: `cargo check --package pdftract-core --lib` succeeds
|
|
- [x] Code formatted: `cargo fmt` applied
|
|
|
|
### WARN
|
|
- [~] PSM_SPARSE_TEXT verified via trace on BrokenVector page
|
|
- Reason: Requires OCR feature with system dependencies (pkg-config, leptonica) not available in this environment
|
|
- The `TessOpts::with_page_seg_mode(PageSegMode::PsmSparseText)` API is available for use when OCR is enabled
|
|
- [~] confidence_source values present in 6.1 Schema enum
|
|
- Reason: Phase 6.1 schema not yet implemented; TODO comment added to `schema/mod.rs` documenting the requirement
|
|
|
|
### FAIL
|
|
- None
|
|
|
|
## Test Results
|
|
|
|
Added tests in `ocr.rs`:
|
|
- `test_region_level_policy_high_confidence_region` - verifies regions with mean > 0.7 are kept
|
|
- `test_region_level_policy_low_confidence_region` - verifies regions with mean < 0.3 trigger fallback
|
|
- `test_region_level_policy_medium_confidence_region` - verifies 0.3 <= mean <= 0.7 regions kept as-is
|
|
- `test_region_level_policy_multiple_regions` - verifies multiple regions with different confidence levels
|
|
- `test_group_words_by_region_empty` - edge case: empty word list
|
|
- `test_group_words_by_region_single_word` - edge case: single word
|
|
|
|
Note: These tests require the `ocr` feature (system dependencies: pkg-config, leptonica) and are skipped when not available.
|
|
|
|
## Technical Notes
|
|
|
|
1. **Region grouping algorithm**: Words are grouped by baseline proximity within 12pt tolerance. This matches the Phase 4.2 line-formation logic.
|
|
|
|
2. **Fallback mechanism**: The `apply_region_level_confidence_policy()` function returns a tuple of (kept_spans, fallback_words). The caller is responsible for re-running Tesseract on fallback_words without the validation filter.
|
|
|
|
3. **No re-preprocessing**: As noted in the bead, the fallback rerun should reuse the cell image already in memory without re-running Phase 5.3 preprocessing.
|
|
|
|
4. **Baseline computation**: Uses the same formula as Phase 4.2: `baseline = y0 + (bbox_height * 0.2)`
|
|
|
|
## Git Commits
|
|
|
|
- Commit: feat(pdftract-29gu): implement Phase 5.5.3 region-level confidence policy
|
|
|
|
## References
|
|
|
|
- Plan section: Phase 5.5 step 5 (line 1937)
|
|
- Phase 5.4 PSM modes (line 1934)
|
|
- INV-7 confidence_source (plan line 363, 1662)
|