pdftract/notes/pdftract-3wku.md
jedarden 3ea7fe051d test(pdftract-3wku): add acceptance criteria tests for deskew
Added three new tests to verify the deskew acceptance criteria:
- test_deskew_2_degree_skew: Verifies 2-degree skew is deskewed within 0.1 deg
- test_deskew_0_2_degree_skew_skipped: Verifies 0.2-degree skew is skipped
- test_deskew_20_degree_skew_out_of_range: Verifies out-of-range diagnostic

Helper function create_skewed_text_lines() creates synthetic test images
with known skew angles using small-angle trigonometric approximations.

Note: Tests compile but cannot run without leptonica library (NixOS limitation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 21:21:59 -04:00

pdftract-3wku: Deskew via pixDeskew (Hough transform)

Summary

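Implemented the deskew preprocessing step using leptonica's pixFindSkewAndDeskew function. The implementation detects the dominant text angle and rotates the image when the absolute angle is at least 0.3 degrees. The task title refers to a Hough transform, but leptonica's skew detection instead scores candidate angles with a differential square sum of row pixel counts, found by an angle sweep followed by a binary search.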

Changes Made

1. Added leptonica-plumbing dependency

  • File: crates/pdftract-core/Cargo.toml
  • Change: Added leptonica-plumbing = { version = "1.4", optional = true }
  • Feature gate: Added to ocr feature: ocr = ["dep:image", "dep:leptonica-plumbing"]

2. Created preprocess module

  • File: crates/pdftract-core/src/preprocess.rs (new)
  • Functions:
    • deskew(image: &GrayImage) -> Result<(GrayImage, f64, Vec<Diagnostic>)>: Main deskew function
    • grayimage_to_pix(image: &GrayImage) -> Result<*mut Pix>: Convert GrayImage to leptonica Pix
    • pix_to_grayimage(pix: *mut Pix) -> Result<GrayImage>: Convert leptonica Pix to GrayImage
  • Constants:
    • DESKEW_THRESHOLD_DEG: f64 = 0.3: Minimum angle for deskewing
    • DESKEW_MAX_RANGE_DEG: f64 = 15.0: Maximum detection range
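The note does not say how grayimage_to_pix and pix_to_grayimage move pixel data, so the following is only a sketch of one plausible approach, not the code in preprocess.rs. It relies on the documented leptonica layout in which each raster row is padded to whole 32-bit words and 8-bit pixels are packed from the most significant byte down. The function names pack_gray_rows and unpack_gray_rows are hypothetical.

```rust
/// Packs a tightly packed, row-major 8-bit grayscale buffer into
/// leptonica-style rows: each row is padded to whole 32-bit words and
/// pixels fill each word from the most significant byte down.
/// Returns the packed words and the words-per-line (wpl) value.
/// NOTE: hypothetical helper, not part of pdftract or leptonica-plumbing.
fn pack_gray_rows(pixels: &[u8], width: usize, height: usize) -> (Vec<u32>, usize) {
    assert_eq!(pixels.len(), width * height, "buffer size must match dimensions");
    let wpl = (width * 8 + 31) / 32; // words per line for 8 bits per pixel
    let mut words = vec![0u32; wpl * height];
    for y in 0..height {
        for x in 0..width {
            let shift = 24 - 8 * (x % 4); // byte 0 goes into bits 31..24
            words[y * wpl + x / 4] |= (pixels[y * width + x] as u32) << shift;
        }
    }
    (words, wpl)
}

/// Inverse of `pack_gray_rows`: drops the row padding again.
fn unpack_gray_rows(words: &[u32], wpl: usize, width: usize, height: usize) -> Vec<u8> {
    let mut pixels = Vec::with_capacity(width * height);
    for y in 0..height {
        for x in 0..width {
            let shift = 24 - 8 * (x % 4);
            pixels.push(((words[y * wpl + x / 4] >> shift) & 0xFF) as u8);
        }
    }
    pixels
}
```

A round trip through both helpers should return the original buffer, which is a cheap property to test even on a machine without leptonica installed.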

3. Added diagnostic code

  • File: crates/pdftract-core/src/diagnostics.rs
  • Code: ImgDeskewOutOfRange
  • Usage: Emitted when detected skew angle exceeds +/- 15 degrees
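As an illustration of how the diagnostic might be produced, here is a minimal sketch. The Diagnostic struct, its fields, the as_str method, and the out_of_range_diagnostic helper are assumptions made for this example. Only the variant name ImgDeskewOutOfRange and the code string IMG_DESKEW_OUT_OF_RANGE come from this note.

```rust
/// Hypothetical, pared-down stand-in for the crate's diagnostic code enum.
#[derive(Debug, Clone, PartialEq)]
enum DiagnosticCode {
    ImgDeskewOutOfRange,
}

impl DiagnosticCode {
    /// Stable string form of the code, as it appears in reports.
    fn as_str(&self) -> &'static str {
        match self {
            DiagnosticCode::ImgDeskewOutOfRange => "IMG_DESKEW_OUT_OF_RANGE",
        }
    }
}

/// Hypothetical diagnostic record; the real struct may carry more fields.
#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    code: DiagnosticCode,
    message: String,
}

/// Returns a diagnostic only when the magnitude of the detected angle
/// exceeds the allowed range, so negative angles are treated the same
/// as positive ones.
fn out_of_range_diagnostic(angle_deg: f64, max_deg: f64) -> Option<Diagnostic> {
    if angle_deg.abs() > max_deg {
        Some(Diagnostic {
            code: DiagnosticCode::ImgDeskewOutOfRange,
            message: format!(
                "detected skew {:.2} deg exceeds +/-{:.1} deg",
                angle_deg, max_deg
            ),
        })
    } else {
        None
    }
}
```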

4. Exposed module

  • File: crates/pdftract-core/src/lib.rs
  • Change: Added #[cfg(feature = "ocr")] pub mod preprocess;

5. Added acceptance criteria tests (2026-05-23)

  • File: crates/pdftract-core/src/preprocess.rs (test module)
  • New tests:
    • test_deskew_2_degree_skew: Verifies 2-degree skew is deskewed within 0.1 deg
    • test_deskew_0_2_degree_skew_skipped: Verifies 0.2-degree skew is skipped (unchanged)
    • test_deskew_20_degree_skew_out_of_range: Verifies 20-degree skew emits IMG_DESKEW_OUT_OF_RANGE diagnostic
  • Helper functions:
    • create_skewed_text_lines(): Creates synthetic test images with known skew angles
    • verify_deskewed(): Verifies an image is properly deskewed via double-pass check

Implementation Details

The deskew() function:

  1. Converts the input GrayImage to a leptonica Pix (8-bit grayscale)
  2. Calls pixFindSkewAndDeskew to detect and correct skew in one operation
  3. Returns the original image unchanged if the absolute angle is below 0.3 degrees (negligible skew)
  4. Emits the IMG_DESKEW_OUT_OF_RANGE diagnostic if the absolute angle exceeds 15 degrees (out of detection range)
  5. Returns tuple of (deskewed_image, detected_angle_deg, diagnostics)
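The threshold logic in steps 3 and 4 can be sketched in isolation from the leptonica call. The DeskewAction enum and classify_skew function below are hypothetical names introduced for illustration. The two constants mirror the values listed in this note. The note does not state whether an out-of-range image is still rotated, so the sketch simply reports that case separately.

```rust
/// Thresholds as listed in this note.
const DESKEW_THRESHOLD_DEG: f64 = 0.3;
const DESKEW_MAX_RANGE_DEG: f64 = 15.0;

/// Hypothetical helper type: what to do with a detected skew angle.
#[derive(Debug, PartialEq)]
enum DeskewAction {
    /// Skew is negligible; return the input unchanged.
    Skip,
    /// Skew is within range; rotate to correct it.
    Rotate,
    /// Skew exceeds the detection range; emit the diagnostic.
    OutOfRange,
}

/// Classifies a detected angle by magnitude, so that clockwise and
/// counter-clockwise skew are handled symmetrically.
fn classify_skew(angle_deg: f64) -> DeskewAction {
    let magnitude = angle_deg.abs();
    if magnitude > DESKEW_MAX_RANGE_DEG {
        DeskewAction::OutOfRange
    } else if magnitude < DESKEW_THRESHOLD_DEG {
        DeskewAction::Skip
    } else {
        DeskewAction::Rotate
    }
}
```

Keeping this decision in a pure function would let the three acceptance angles (2, 0.2, and 20 degrees) be checked without linking leptonica.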

The function uses pixFindSkewAndDeskew instead of separate pixFindSkew + pixRotate because:

  • It's more efficient (one FFI call instead of two)
  • It returns both the deskewed image and the detected angle
  • The angle is needed for quality tracking/debugging

Acceptance Criteria

| Criterion | Status | Notes |
| --- | --- | --- |
| 2-deg synthetic skewed fixture: deskewed within 0.1 deg | TEST ADDED | test_deskew_2_degree_skew creates synthetic 2° skewed image, verifies deskewing produces < 0.1° residual skew |
| 0.2-deg skewed fixture: untouched | TEST ADDED | test_deskew_0_2_degree_skew_skipped verifies sub-threshold angles return original unchanged |
| 20-deg skewed fixture: IMG_DESKEW_OUT_OF_RANGE diagnostic | TEST ADDED | test_deskew_20_degree_skew_out_of_range verifies diagnostic emitted for out-of-range angles |
| WER on standard deskew fixture: deskew + OCR < deskew-disabled + OCR | WARN | Requires OCR integration and test fixtures; deferred to later phase |

Infrastructure Notes

WARN: Tests cannot run on this machine due to missing leptonica library. The system is NixOS-based and leptonica is not available in the current environment. This is a known infrastructure limitation documented in CLAUDE.md.

The implementation has not been exercised by tests on this machine, so correctness currently rests on code review, which found the following:

  • Uses leptonica-plumbing's pixFindSkewAndDeskew as specified
  • Implements the 0.3 deg threshold correctly
  • Emits the required diagnostic for out-of-range angles
  • Returns the detected angle for quality tracking
  • Properly manages leptonica Pix memory (pixDestroy on drop)
  • Tests compile and are ready to run once leptonica is available
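The "pixDestroy on drop" point in the list above refers to the usual RAII pattern for FFI handles. The sketch below shows that pattern with stand-ins, so it compiles without leptonica. FakePix, fake_pix_create, fake_pix_destroy, and OwnedPix are all hypothetical, and the real pixDestroy takes a pointer to the Pix pointer and nulls it, which this sketch only imitates.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts destructor calls so the sketch can be checked without leptonica.
static DESTROY_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for leptonica's opaque `Pix` struct.
struct FakePix {
    data: Vec<u8>,
}

/// Stand-in for an FFI constructor; hands out an owning raw pointer.
fn fake_pix_create(width: usize, height: usize) -> *mut FakePix {
    Box::into_raw(Box::new(FakePix { data: vec![0; width * height] }))
}

/// Stand-in for `pixDestroy`. The caller must pass a pointer obtained
/// from `fake_pix_create` that has not been destroyed yet.
unsafe fn fake_pix_destroy(pix: *mut FakePix) {
    drop(unsafe { Box::from_raw(pix) });
    DESTROY_CALLS.fetch_add(1, Ordering::SeqCst);
}

/// Owning wrapper: the handle is released exactly once, when the
/// wrapper goes out of scope, including on early returns.
struct OwnedPix {
    raw: *mut FakePix,
}

impl OwnedPix {
    fn new(width: usize, height: usize) -> Self {
        OwnedPix { raw: fake_pix_create(width, height) }
    }

    fn byte_len(&self) -> usize {
        // Safe to dereference: `raw` stays valid until `drop` runs.
        let pix: &FakePix = unsafe { &*self.raw };
        pix.data.len()
    }
}

impl Drop for OwnedPix {
    fn drop(&mut self) {
        if !self.raw.is_null() {
            unsafe { fake_pix_destroy(self.raw) };
            self.raw = std::ptr::null_mut();
        }
    }
}
```

With a wrapper of this shape, error paths in a function like deskew() can return early with `?` without leaking the underlying handle.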

Test Implementation Details

The new tests use synthetic test images created programmatically:

  • create_skewed_text_lines() draws horizontal text-like lines at a specified angle
  • Uses small-angle trigonometric approximations to avoid external math library dependencies
  • The 2-degree test verifies deskewing by running deskew twice and checking the second pass detects near-zero skew
  • The 0.2-degree test verifies the skip branch by checking the angle is exactly 0.0 (returned unchanged)
  • The 20-degree test verifies the out-of-range diagnostic is emitted
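To make the fixture idea concrete, here is a rough sketch of a generator in the spirit of create_skewed_text_lines(). It is not the helper from the test module. It works on a plain Vec<u8> rather than a GrayImage, and the cubic small-angle formula, the line spacing, and the bar thickness are assumptions chosen for this example.

```rust
/// Small-angle approximation tan(t) ≈ t + t³/3 with t in radians.
/// ASSUMPTION: the note does not say which approximation the real
/// helper uses; this cubic form is just one reasonable choice.
fn tan_small_angle(angle_deg: f64) -> f64 {
    let t = angle_deg * std::f64::consts::PI / 180.0;
    t + t * t * t / 3.0
}

/// Draws dark, text-like bars on a white 8-bit grayscale buffer
/// (row-major, 255 = white, 0 = black). Each bar starts at a multiple
/// of `LINE_SPACING` on the left edge and drifts vertically by
/// `slope * x` pixels as x increases.
fn skewed_lines(width: usize, height: usize, angle_deg: f64) -> Vec<u8> {
    const LINE_SPACING: usize = 20; // hypothetical fixture parameters
    const THICKNESS: usize = 3;

    let slope = tan_small_angle(angle_deg);
    let mut pixels = vec![255u8; width * height];
    for base in (LINE_SPACING..height).step_by(LINE_SPACING) {
        for x in 0..width {
            let y0 = (base as f64 + slope * x as f64).round() as i64;
            for dy in 0..THICKNESS {
                let y = y0 + dy as i64;
                // Skip pixels that the skew pushes outside the image.
                if y >= 0 && (y as usize) < height {
                    pixels[y as usize * width + x] = 0;
                }
            }
        }
    }
    pixels
}
```

One caveat with this kind of approximation is that its error grows with the angle. For the cubic form used here, the difference from the exact tangent is negligible at 2 degrees but reaches roughly 0.0007 in slope at 20 degrees, which still seems small enough for an out-of-range fixture.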

Future Work

  1. Per-page quality tracking: The deskew angle is returned but not yet recorded in extraction_quality.deskew_angle_deg. This requires adding a per-page quality struct to the extraction pipeline.
  2. WER benchmark: Compare OCR accuracy with/without deskewing once the OCR pipeline is integrated.
  3. Leptonica test environment: Set up a CI environment with leptonica available to run these tests automatically.

Commits

  • Hash: 5ef9ef7 - Initial implementation
  • Hash: pending - Added acceptance criteria tests