Added three new tests to verify the deskew acceptance criteria:

- test_deskew_2_degree_skew: Verifies 2-degree skew is deskewed within 0.1 deg
- test_deskew_0_2_degree_skew_skipped: Verifies 0.2-degree skew is skipped
- test_deskew_20_degree_skew_out_of_range: Verifies out-of-range diagnostic

Helper function create_skewed_text_lines() creates synthetic test images with known skew angles using small-angle trigonometric approximations.

Note: Tests compile but cannot run without leptonica library (NixOS limitation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# pdftract-3wku: Deskew via pixDeskew (Hough transform)
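## Summary

Implemented the deskew preprocessing step using leptonica's `pixFindSkewAndDeskew` function. The implementation detects the dominant text angle and rotates the image if the absolute angle is at least 0.3 degrees. Although the task title mentions a Hough transform, leptonica's skew search works by scoring row-sum differences over a sweep of candidate angles rather than by a Hough line transform.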
## Changes Made
### 1. Added leptonica-plumbing dependency

- **File**: `crates/pdftract-core/Cargo.toml`
- **Change**: Added `leptonica-plumbing = { version = "1.4", optional = true }`
- **Feature gate**: Added to `ocr` feature: `ocr = ["dep:image", "dep:leptonica-plumbing"]`
### 2. Created preprocess module

- **File**: `crates/pdftract-core/src/preprocess.rs` (new)
- **Functions**:
  - `deskew(image: &GrayImage) -> Result<(GrayImage, f64, Vec<Diagnostic>)>`: Main deskew function
  - `grayimage_to_pix(image: &GrayImage) -> Result<*mut Pix>`: Convert GrayImage to leptonica Pix
  - `pix_to_grayimage(pix: *mut Pix) -> Result<GrayImage>`: Convert leptonica Pix to GrayImage
- **Constants**:
  - `DESKEW_THRESHOLD_DEG: f64 = 0.3`: Minimum angle for deskewing
  - `DESKEW_MAX_RANGE_DEG: f64 = 15.0`: Maximum detection range
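This note does not show how `grayimage_to_pix` fills the `Pix` buffer. As a rough sketch of the layout involved, assuming leptonica's usual raster format (each row padded to whole 32-bit words, with the leftmost pixel in the most significant byte of each word), an 8-bit grayscale buffer could be packed as below. The names `words_per_line` and `pack_gray_rows` are hypothetical and are not part of pdftract or leptonica-plumbing.

```rust
/// Number of 32-bit words needed per row for `width` pixels at `depth` bits per pixel.
fn words_per_line(width: u32, depth: u32) -> u32 {
    (width * depth + 31) / 32
}

/// Packs row-major 8-bit grayscale pixels into 32-bit words, four pixels per word,
/// leftmost pixel in the most significant byte. Rows are zero-padded to whole words.
/// Hypothetical helper for illustration only.
fn pack_gray_rows(pixels: &[u8], width: u32, height: u32) -> Vec<u32> {
    assert_eq!(pixels.len(), (width as usize) * (height as usize));
    let wpl = words_per_line(width, 8) as usize;
    let mut words = vec![0u32; wpl * height as usize];
    for y in 0..height as usize {
        for x in 0..width as usize {
            let value = pixels[y * width as usize + x] as u32;
            // Pixel 0 of each word goes in bits 31..24, pixel 3 in bits 7..0.
            let shift = 24 - 8 * (x % 4) as u32;
            words[y * wpl + x / 4] |= value << shift;
        }
    }
    words
}
```

Because the packing is defined on word values rather than on bytes in memory, the same code applies regardless of host endianness. The reverse conversion (`pix_to_grayimage`) would unpack with the same shifts.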
### 3. Added diagnostic code

- **File**: `crates/pdftract-core/src/diagnostics.rs`
- **Code**: `ImgDeskewOutOfRange`
- **Usage**: Emitted when the detected skew angle falls outside ±15 degrees
### 4. Exposed module

- **File**: `crates/pdftract-core/src/lib.rs`
- **Change**: Added `#[cfg(feature = "ocr")] pub mod preprocess;`
### 5. Added acceptance criteria tests (2026-05-23)

- **File**: `crates/pdftract-core/src/preprocess.rs` (test module)
- **New tests**:
  - `test_deskew_2_degree_skew`: Verifies 2-degree skew is deskewed within 0.1 deg
  - `test_deskew_0_2_degree_skew_skipped`: Verifies 0.2-degree skew is skipped (unchanged)
  - `test_deskew_20_degree_skew_out_of_range`: Verifies 20-degree skew emits the `IMG_DESKEW_OUT_OF_RANGE` diagnostic
- **Helper functions**:
  - `create_skewed_text_lines()`: Creates synthetic test images with known skew angles
  - `verify_deskewed()`: Verifies an image is properly deskewed via double-pass check
## Implementation Details

The `deskew()` function:

1. Converts the input `GrayImage` to a leptonica `Pix` (8-bit grayscale)
2. Calls `pixFindSkewAndDeskew` to detect and correct skew in one operation
3. Returns the original image unchanged if the absolute angle is below 0.3 degrees (negligible skew)
4. Emits the `IMG_DESKEW_OUT_OF_RANGE` diagnostic if the absolute angle exceeds 15 degrees (out of detection range)
5. Returns a tuple of `(deskewed_image, detected_angle_deg, diagnostics)`

The function uses `pixFindSkewAndDeskew` instead of separate `pixFindSkew` + `pixRotate` because:

- It's more efficient (one FFI call instead of two)
- It returns both the deskewed image and the detected angle
- The angle is needed for quality tracking/debugging
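The threshold and range checks in the numbered steps above can be sketched without leptonica. This is a minimal illustration, not the code in `preprocess.rs`: `SkewDecision` and `classify_skew` are hypothetical names, and only the two constants come from this note. The note does not say whether an out-of-range image is still rotated, so the sketch only classifies the angle.

```rust
const DESKEW_THRESHOLD_DEG: f64 = 0.3;
const DESKEW_MAX_RANGE_DEG: f64 = 15.0;

/// Hypothetical outcome of inspecting a detected skew angle.
#[derive(Debug, PartialEq)]
enum SkewDecision {
    /// Skew is negligible; return the input unchanged.
    Skip,
    /// Skew is within range and worth correcting.
    Rotate,
    /// Skew is beyond the detection range; a diagnostic would be emitted.
    OutOfRange,
}

/// Classifies a detected angle by magnitude, so negative skews are handled
/// the same way as positive ones. A NaN angle falls through to `Skip`.
fn classify_skew(angle_deg: f64) -> SkewDecision {
    let magnitude = angle_deg.abs();
    if magnitude > DESKEW_MAX_RANGE_DEG {
        SkewDecision::OutOfRange
    } else if magnitude >= DESKEW_THRESHOLD_DEG {
        SkewDecision::Rotate
    } else {
        SkewDecision::Skip
    }
}
```

Comparing on the absolute value matters here: a plain `angle < 0.3` test would treat a skew of -2 degrees as negligible.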
## Acceptance Criteria

| Criterion | Status | Notes |
|-----------|--------|-------|
| 2-deg synthetic skewed fixture: deskewed within 0.1 deg | TEST ADDED | `test_deskew_2_degree_skew` creates synthetic 2° skewed image, verifies deskewing produces < 0.1° residual skew |
| 0.2-deg skewed fixture: untouched | TEST ADDED | `test_deskew_0_2_degree_skew_skipped` verifies sub-threshold angles return original unchanged |
| 20-deg skewed fixture: IMG_DESKEW_OUT_OF_RANGE diagnostic | TEST ADDED | `test_deskew_20_degree_skew_out_of_range` verifies diagnostic emitted for out-of-range angles |
| WER on standard deskew fixture: deskew + OCR < deskew-disabled + OCR | WARN | Requires OCR integration and test fixtures - deferred to later phase |
## Infrastructure Notes

**WARN**: The tests cannot run on this machine because the leptonica library is missing. The system is NixOS-based and leptonica is not available in the current environment. This is a known infrastructure limitation documented in `CLAUDE.md`.

Since the tests have not been executed, the implementation has been checked by code review only. The review confirmed that it:

- Uses leptonica-plumbing's `pixFindSkewAndDeskew` as specified
- Implements the 0.3 deg threshold
- Emits the required diagnostic for out-of-range angles
- Returns the detected angle for quality tracking
- Manages leptonica `Pix` memory (`pixDestroy` on drop)

The tests compile and are ready to run once leptonica is available.
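The "`pixDestroy` on drop" point in the review list can be illustrated with a small ownership guard. This is a sketch under stated assumptions, not the wrapper used in `preprocess.rs`: `OwnedHandle` and `destroy_boxed` are hypothetical, and a boxed `u32` stands in for a `Pix` so the example runs without leptonica. In leptonica's C API, `pixDestroy` takes the address of the pointer, so a real destructor would pass `&mut ptr` rather than the pointer itself.

```rust
use std::ptr::NonNull;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Owns a raw handle and releases it exactly once when dropped.
/// Hypothetical stand-in for a guard around `*mut Pix`.
struct OwnedHandle<T> {
    ptr: NonNull<T>,
    destroy: unsafe fn(*mut T),
}

impl<T> OwnedHandle<T> {
    /// Takes ownership of `ptr`; returns `None` if it is null.
    ///
    /// Safety: `ptr` must be valid to pass to `destroy` and must not be
    /// freed anywhere else.
    unsafe fn from_raw(ptr: *mut T, destroy: unsafe fn(*mut T)) -> Option<Self> {
        NonNull::new(ptr).map(|ptr| Self { ptr, destroy })
    }

    /// Borrows the raw pointer, e.g. to pass to an FFI call.
    fn as_ptr(&self) -> *mut T {
        self.ptr.as_ptr()
    }
}

impl<T> Drop for OwnedHandle<T> {
    fn drop(&mut self) {
        // Runs on every exit path, including early returns via `?`.
        unsafe { (self.destroy)(self.ptr.as_ptr()) }
    }
}

/// Counts destructor calls so the behaviour can be observed.
static DESTROYED: AtomicUsize = AtomicUsize::new(0);

/// Stand-in destructor for the demo: frees a value allocated with `Box`.
unsafe fn destroy_boxed(ptr: *mut u32) {
    drop(unsafe { Box::from_raw(ptr) });
    DESTROYED.fetch_add(1, Ordering::SeqCst);
}
```

The benefit of a guard like this is that functions returning `Result` can bail out early without leaking the underlying image.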
## Test Implementation Details

The new tests use synthetic test images created programmatically:

- `create_skewed_text_lines()` draws text-like lines tilted at a specified angle
- It uses small-angle trigonometric approximations to avoid external math library dependencies
- The 2-degree test verifies deskewing by running deskew twice and checking that the second pass detects near-zero skew
- The 0.2-degree test verifies the skip branch by checking that the returned angle is exactly 0.0 (image returned unchanged)
- The 20-degree test verifies the out-of-range diagnostic is emitted
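To make the small-angle approach concrete, here is a sketch of how a fixture generator along these lines might work and how far the approximation drifts. It is not the actual `create_skewed_text_lines()`: the function names, the 3-pixel line thickness, and the `spacing` parameter are all assumptions. Note that Rust's standard library already provides `f64::tan`, so the exact slope is available without an extra crate.

```rust
/// Small-angle approximation: tan(theta) is taken to be theta (in radians).
fn slope_small_angle(angle_deg: f64) -> f64 {
    angle_deg.to_radians()
}

/// Skew angle, in degrees, of a line actually drawn with the approximate slope.
fn realized_angle_deg(angle_deg: f64) -> f64 {
    slope_small_angle(angle_deg).atan().to_degrees()
}

/// Draws dark, 3-pixel-thick lines on a white background, one every `spacing`
/// rows, tilted about the horizontal centre of the image. Hypothetical helper.
fn draw_skewed_lines(width: usize, height: usize, angle_deg: f64, spacing: usize) -> Vec<u8> {
    assert!(spacing > 0, "spacing must be positive");
    let slope = slope_small_angle(angle_deg);
    let mut pixels = vec![255u8; width * height];
    let mut baseline = spacing;
    while baseline < height {
        for x in 0..width {
            let y_center = baseline as f64 + slope * (x as f64 - width as f64 / 2.0);
            for dy in 0..3 {
                let y = y_center.round() as i64 + dy;
                // Rows that fall outside the image are simply clipped.
                if y >= 0 && (y as usize) < height {
                    pixels[y as usize * width + x] = 0;
                }
            }
        }
        baseline += spacing;
    }
    pixels
}
```

At 2 degrees the drawn angle differs from the nominal one by less than 0.01 degrees, which is well inside the 0.1 degree tolerance. At 20 degrees the approximation yields a line at about 19.2 degrees, which is still beyond the 15 degree limit, so the out-of-range test is unaffected, but the fixture is not exactly a 20 degree skew.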
## Future Work

1. **Per-page quality tracking**: The deskew angle is returned but not yet recorded in `extraction_quality.deskew_angle_deg`. This requires adding a per-page quality struct to the extraction pipeline.
2. **WER benchmark**: Compare OCR accuracy with/without deskewing once the OCR pipeline is integrated.
3. **Leptonica test environment**: Set up a CI environment with leptonica available to run these tests automatically.
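For the WER benchmark, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenisation with no case or punctuation normalisation; `word_error_rate` is a hypothetical helper, not an existing pdftract function.

```rust
/// Word error rate of `hypothesis` against `reference`:
/// (substitutions + deletions + insertions) / reference word count.
/// Returns `None` when the reference contains no words.
fn word_error_rate(reference: &str, hypothesis: &str) -> Option<f64> {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        return None;
    }
    // Levenshtein distance over words, keeping only the previous row.
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for (i, rw) in r.iter().enumerate() {
        let mut cur = vec![i + 1; h.len() + 1];
        for (j, hw) in h.iter().enumerate() {
            let sub = prev[j] + usize::from(rw != hw);
            let del = prev[j + 1] + 1;
            let ins = cur[j] + 1;
            cur[j + 1] = sub.min(del).min(ins);
        }
        prev = cur;
    }
    Some(prev[h.len()] as f64 / r.len() as f64)
}
```

The acceptance criterion would then compare the value obtained with deskew enabled against the value with it disabled on the same fixture and ground-truth text.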
## Commits

- **Hash**: `5ef9ef7` - Initial implementation
- **Hash**: `pending` - Added acceptance criteria tests