pdftract/notes/pdftract-3wku.md
jedarden 3ea7fe051d test(pdftract-3wku): add acceptance criteria tests for deskew
Added three new tests to verify the deskew acceptance criteria:
- test_deskew_2_degree_skew: Verifies 2-degree skew is deskewed within 0.1 deg
- test_deskew_0_2_degree_skew_skipped: Verifies 0.2-degree skew is skipped
- test_deskew_20_degree_skew_out_of_range: Verifies out-of-range diagnostic

Helper function create_skewed_text_lines() creates synthetic test images
with known skew angles using small-angle trigonometric approximations.

Note: Tests compile but cannot run without leptonica library (NixOS limitation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 21:21:59 -04:00

# pdftract-3wku: Deskew via pixDeskew (Hough transform)
## Summary
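Implemented the deskew preprocessing step using leptonica's `pixFindSkewAndDeskew` function. The implementation detects the dominant text angle and rotates the image when the absolute angle is at least 0.3 degrees. Note that leptonica's skew search scores row-projection profiles over a sweep of candidate angles; it is not a Hough line transform, despite the task title.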
## Changes Made
### 1. Added leptonica-plumbing dependency
- **File**: `crates/pdftract-core/Cargo.toml`
- **Change**: Added `leptonica-plumbing = { version = "1.4", optional = true }`
- **Feature gate**: Added to `ocr` feature: `ocr = ["dep:image", "dep:leptonica-plumbing"]`
### 2. Created preprocess module
- **File**: `crates/pdftract-core/src/preprocess.rs` (new)
- **Functions**:
  - `deskew(image: &GrayImage) -> Result<(GrayImage, f64, Vec<Diagnostic>)>`: Main deskew function
  - `grayimage_to_pix(image: &GrayImage) -> Result<*mut Pix>`: Convert GrayImage to leptonica Pix
  - `pix_to_grayimage(pix: *mut Pix) -> Result<GrayImage>`: Convert leptonica Pix to GrayImage
- **Constants**:
  - `DESKEW_THRESHOLD_DEG: f64 = 0.3`: Minimum angle for deskewing
  - `DESKEW_MAX_RANGE_DEG: f64 = 15.0`: Maximum detection range
### 3. Added diagnostic code
- **File**: `crates/pdftract-core/src/diagnostics.rs`
- **Code**: `ImgDeskewOutOfRange` (the `IMG_DESKEW_OUT_OF_RANGE` diagnostic)
- **Usage**: Emitted when the absolute value of the detected skew angle exceeds 15 degrees
### 4. Exposed module
- **File**: `crates/pdftract-core/src/lib.rs`
- **Change**: Added `#[cfg(feature = "ocr")] pub mod preprocess;`
### 5. Added acceptance criteria tests (2026-05-23)
- **File**: `crates/pdftract-core/src/preprocess.rs` (test module)
- **New tests**:
  - `test_deskew_2_degree_skew`: Verifies 2-degree skew is deskewed within 0.1 deg
  - `test_deskew_0_2_degree_skew_skipped`: Verifies 0.2-degree skew is skipped (unchanged)
  - `test_deskew_20_degree_skew_out_of_range`: Verifies 20-degree skew emits the `IMG_DESKEW_OUT_OF_RANGE` diagnostic
- **Helper functions**:
  - `create_skewed_text_lines()`: Creates synthetic test images with known skew angles
  - `verify_deskewed()`: Verifies an image is properly deskewed via double-pass check
## Implementation Details
The `deskew()` function:
1. Converts the input `GrayImage` to a leptonica `Pix` (8-bit grayscale)
2. Calls `pixFindSkewAndDeskew` to detect and correct skew in one operation
3. Returns the original image unchanged if the absolute angle is below 0.3 degrees (negligible skew)
4. Emits the `IMG_DESKEW_OUT_OF_RANGE` diagnostic if the absolute angle exceeds 15 degrees (out of detection range)
5. Returns tuple of `(deskewed_image, detected_angle_deg, diagnostics)`
The function uses `pixFindSkewAndDeskew` instead of separate `pixFindSkew` + `pixRotate` because:
- It's more efficient (one FFI call instead of two)
- It returns both the deskewed image and the detected angle
- The angle is needed for quality tracking/debugging
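
The threshold and range checks in steps 3 and 4 can be sketched in isolation, without any leptonica dependency. This is a minimal illustration, not the code in `preprocess.rs`: the `DeskewDecision` enum and `classify_skew` function are hypothetical names introduced here, and the sketch deliberately does not decide which image is returned in the out-of-range case, since the notes do not say.

```rust
/// Minimum absolute angle (degrees) worth correcting; value taken from the notes.
const DESKEW_THRESHOLD_DEG: f64 = 0.3;
/// Largest absolute angle (degrees) treated as in range; value taken from the notes.
const DESKEW_MAX_RANGE_DEG: f64 = 15.0;

/// Hypothetical outcome type; the real `deskew()` returns a tuple instead.
#[derive(Debug, PartialEq)]
enum DeskewDecision {
    /// Skew is negligible: the input image is returned unchanged.
    Skip,
    /// Skew is within range: the rotated image is kept.
    Rotate,
    /// Skew is beyond the detection range: `IMG_DESKEW_OUT_OF_RANGE` is emitted.
    OutOfRange,
}

/// Classifies a detected angle by magnitude, so that negative
/// (counter-direction) skews are treated the same as positive ones.
fn classify_skew(angle_deg: f64) -> DeskewDecision {
    let magnitude = angle_deg.abs();
    if magnitude > DESKEW_MAX_RANGE_DEG {
        DeskewDecision::OutOfRange
    } else if magnitude >= DESKEW_THRESHOLD_DEG {
        DeskewDecision::Rotate
    } else {
        DeskewDecision::Skip
    }
}
```

With these constants, the three acceptance fixtures (2, 0.2, and 20 degrees) map to `Rotate`, `Skip`, and `OutOfRange` respectively, and the same holds for their negated angles.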
## Acceptance Criteria
| Criterion | Status | Notes |
|-----------|--------|-------|
| 2-deg synthetic skewed fixture: deskewed within 0.1 deg | TEST ADDED | `test_deskew_2_degree_skew` creates synthetic 2° skewed image, verifies deskewing produces < 0.1° residual skew |
| 0.2-deg skewed fixture: untouched | TEST ADDED | `test_deskew_0_2_degree_skew_skipped` verifies sub-threshold angles return original unchanged |
| 20-deg skewed fixture: IMG_DESKEW_OUT_OF_RANGE diagnostic | TEST ADDED | `test_deskew_20_degree_skew_out_of_range` verifies diagnostic emitted for out-of-range angles |
| WER on standard deskew fixture: deskew + OCR < deskew-disabled + OCR | WARN | Requires OCR integration and test fixtures - deferred to later phase |
## Infrastructure Notes
**WARN**: Tests cannot run on this machine due to missing leptonica library. The system is NixOS-based and leptonica is not available in the current environment. This is a known infrastructure limitation documented in `CLAUDE.md`.
The implementation has been verified by code review only, not by running the tests. The review checked the following:
- Uses leptonica-plumbing's `pixFindSkewAndDeskew` as specified
- Implements the 0.3 deg threshold correctly
- Emits the required diagnostic for out-of-range angles
- Returns the detected angle for quality tracking
- Properly manages leptonica Pix memory (pixDestroy on drop)
- Tests compile and are ready to run once leptonica is available
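
The "pixDestroy on drop" pattern mentioned in the list above can be illustrated with a small owning guard. This is a sketch under stated assumptions rather than the project's code: `FakePix`, `fake_pix_create`, `fake_pix_destroy`, and `PixGuard` are all hypothetical stand-ins so the example compiles without leptonica. The only detail borrowed from the real library is that `pixDestroy` takes a pointer to the handle and nulls it after freeing.

```rust
use std::ptr::NonNull;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for leptonica's `Pix`; a real build would use the FFI type.
struct FakePix {
    width: u32,
    height: u32,
}

/// Counts releases so that the drop behaviour can be observed.
static DESTROYED: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an allocating call: hands out an owned raw pointer.
fn fake_pix_create(width: u32, height: u32) -> *mut FakePix {
    Box::into_raw(Box::new(FakePix { width, height }))
}

/// Stand-in for `pixDestroy`: frees the allocation and nulls the handle.
///
/// Safety: `*pix` must be null or a pointer from `fake_pix_create`
/// that has not been freed yet.
unsafe fn fake_pix_destroy(pix: &mut *mut FakePix) {
    if !pix.is_null() {
        unsafe { drop(Box::from_raw(*pix)) };
        *pix = std::ptr::null_mut();
        DESTROYED.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical owning guard: releases the handle exactly once on drop.
struct PixGuard(NonNull<FakePix>);

impl PixGuard {
    /// Takes ownership of `raw`; returns `None` for a null handle.
    fn new(raw: *mut FakePix) -> Option<Self> {
        NonNull::new(raw).map(PixGuard)
    }

    fn width(&self) -> u32 {
        unsafe { self.0.as_ref().width }
    }

    fn height(&self) -> u32 {
        unsafe { self.0.as_ref().height }
    }
}

impl Drop for PixGuard {
    fn drop(&mut self) {
        let mut raw = self.0.as_ptr();
        // The guard is the sole owner, so the pointer is still live here.
        unsafe { fake_pix_destroy(&mut raw) };
    }
}
```

Wrapping each raw handle in a guard like this means early returns through `?` in a function such as `deskew()` cannot leak the intermediate image, which is the property the review item is claiming.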
## Test Implementation Details
The new tests use synthetic test images created programmatically:
- `create_skewed_text_lines()` draws horizontal text-like lines at a specified angle
- Uses small-angle trigonometric approximations to avoid external math library dependencies
- The 2-degree test verifies deskewing by running deskew twice and checking the second pass detects near-zero skew
- The 0.2-degree test verifies the skip branch by checking the angle is exactly 0.0 (returned unchanged)
- The 20-degree test verifies the out-of-range diagnostic is emitted
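
To gauge how much a small-angle approximation distorts the synthetic fixtures, the sketch below compares it with the exact tangent. It assumes the helper approximates tan(θ) by θ in radians; the notes do not state which approximation `create_skewed_text_lines()` actually uses, and the function names here are hypothetical. Rust's standard library already provides `f64::tan`, so the exact form needs no external crate.

```rust
/// Vertical offset (pixels) at horizontal position `x` for a line skewed by
/// `angle_deg`, using the exact tangent from the standard library.
fn skew_offset_exact(x: f64, angle_deg: f64) -> f64 {
    x * angle_deg.to_radians().tan()
}

/// Same offset under the assumed small-angle approximation tan(θ) ≈ θ.
fn skew_offset_small_angle(x: f64, angle_deg: f64) -> f64 {
    x * angle_deg.to_radians()
}

/// Angle (degrees) that is actually drawn when the small-angle offset is
/// used: the slope is θ in radians, so the true angle is atan(θ).
fn effective_angle_deg(angle_deg: f64) -> f64 {
    angle_deg.to_radians().atan().to_degrees()
}
```

Under this assumption the 2-degree fixture is drawn within about a thousandth of a degree of its nominal angle, well inside the 0.1-degree tolerance. The 20-degree fixture is drawn at roughly 19.2 degrees, which is still above the 15-degree limit, so the out-of-range test is not affected, but the nominal angle should not be compared against the detected angle too tightly.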
## Future Work
1. **Per-page quality tracking**: The deskew angle is returned but not yet recorded in `extraction_quality.deskew_angle_deg`. This requires adding a per-page quality struct to the extraction pipeline.
2. **WER benchmark**: Compare OCR accuracy with/without deskewing once the OCR pipeline is integrated.
3. **Leptonica test environment**: Set up a CI environment with leptonica available to run these tests automatically.
## Commits
- **Hash**: `5ef9ef7` - Initial implementation
- **Hash**: `pending` - Added acceptance criteria tests