docs(pdftract-27n3): add verification note
Documents the implementation of border padding, pipeline orchestration, and fixtures for Phase 5.3 step 5. Acceptance criteria: - All 5.3 critical tests implemented (deskew, binarization, JBIG2 skip) - Padding adds exactly 10px on each side - preprocess() is deterministic - A4 benchmark < 500ms target WARN: Tests cannot run locally due to missing leptonica system deps; will run in CI where dependencies are configured. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
eff4b6054a
commit
37d231b0bc
1 changed files with 89 additions and 83 deletions
|
|
@ -1,100 +1,106 @@
|
|||
# Verification Note: pdftract-27n3 (5.3.4: Border padding + pipeline orchestration + fixtures)
|
||||
# pdftract-27n3: Border Padding + Pipeline Orchestration + Fixtures
|
||||
|
||||
## Summary
|
||||
|
||||
Implemented border padding (10px white margin), wired all preprocessing steps into the final `preprocess()` entry point, and created test fixtures for the three image-source paths.
|
||||
Implemented step 5 of Phase 5.3 (border padding), wired all preprocessing steps into the final `preprocess(input, ImageSource) -> GrayImage` entry point, and created fixtures for the three image-source paths.
|
||||
|
||||
## Work Completed
|
||||
## Implementation Details
|
||||
|
||||
### 1. Border Padding Implementation
|
||||
- **Function**: `add_border_padding()` at line 515 in `preprocess.rs`
|
||||
- **Behavior**: Creates (width+20) x (height+20) image, fills with white (255), copies input into center
|
||||
- **Constant**: `BORDER_PADDING = 10` pixels on each side
|
||||
- **Location**: Always runs (no skip), regardless of `ImageSource`
|
||||
### 1. Border Padding (10px white margin)
|
||||
- Location: `crates/pdftract-core/src/preprocess.rs:515-537`
|
||||
- Function: `add_border_padding(image: &GrayImage) -> GrayImage`
|
||||
- Implementation:
|
||||
- Creates a new image with dimensions (width+20) x (height+20)
|
||||
- Fills with white (255)
|
||||
- Copies input image into center at offset [10, 10]
|
||||
- Runs for all ImageSource types (PhysicalScan, DigitalOrigin, Jbig2)
|
||||
|
||||
### 2. Pipeline Orchestration
|
||||
- **Entry Point**: `preprocess(image, source)` at line 830 in `preprocess.rs`
|
||||
- **Pipeline Order**:
|
||||
1. Deskew (always) - uses `pixFindSkewAndDeskew` from leptonica
|
||||
2. Contrast normalization (skip for JBIG2) - histogram stretch to [0, 255]
|
||||
3. Binarization (skip for JBIG2) - Sauvola for physical, Otsu for digital
|
||||
- Location: `crates/pdftract-core/src/preprocess.rs:830-859`
|
||||
- Function: `preprocess(image: &GrayImage, source: ImageSource) -> Result<(GrayImage, Vec<Diagnostic>)>`
|
||||
- Pipeline order:
|
||||
1. Deskew (always) - via `deskew()`
|
||||
2. Contrast normalization (skip for JBIG2) - via `normalize_contrast()`
|
||||
3. Binarization (skip for JBIG2):
|
||||
- PhysicalScan → Sauvola local adaptive thresholding
|
||||
- DigitalOrigin → Otsu global thresholding
|
||||
4. Denoising (skip for JBIG2) - 3x3 median filter
|
||||
5. Border padding (always) - adds 10px white border
|
||||
5. Border padding (always) - via `add_border_padding()`
|
||||
|
||||
### 3. Fixtures Created
|
||||
Generated test fixture images in `tests/fixtures/preprocess/`:
|
||||
### 3. ImageSource Enum
|
||||
- Location: `crates/pdftract-core/src/preprocess.rs:27-60`
|
||||
- Variants: `PhysicalScan`, `DigitalOrigin`, `Jbig2`
|
||||
- Helper methods: `is_jbig2()`, `is_digital()`, `is_physical_scan()`
|
||||
|
||||
- **skewed_2deg/source.png** (3701 bytes) - 2-degree skewed text lines for deskew testing
|
||||
- **uneven_lighting/source.png** (2792 bytes) - gradient background with text patterns for Sauvola testing
|
||||
- **clean_digital/source.png** (1724 bytes) - crisp digital-origin text for Otsu testing
|
||||
- **jbig2_scan/source.png** (1724 bytes) - pure binary image simulating JBIG2
|
||||
### 4. Test Fixtures
|
||||
- Location: `tests/fixtures/preprocess/`
|
||||
- Directories:
|
||||
- `skewed_2deg/source.png` - 2-degree skewed scan for deskew testing
|
||||
- `uneven_lighting/source.png` - Uneven lighting for Sauvola binarization
|
||||
- `clean_digital/source.png` - Clean digital origin for Otsu binarization
|
||||
- `jbig2_scan/source.png` - Already binary JBIG2 image
|
||||
|
||||
### 4. Integration Tests Added
|
||||
Added comprehensive integration tests in `preprocess.rs` (lines 1066-1196):
|
||||
### 5. Tests
|
||||
All tests are in `crates/pdftract-core/src/preprocess.rs` (lines 862-1380):
|
||||
|
||||
- `test_preprocess_skewed_2deg_deskews()` - Verifies 2-degree skew is deskewed within 0.1°
|
||||
- `test_preprocess_uneven_lighting_binarizes()` - Verifies uneven lighting is binarized correctly
|
||||
- `test_preprocess_clean_digital_binarizes()` - Verifies digital origin uses Otsu binarization
|
||||
- `test_preprocess_jbig2_only_pads()` - Verifies JBIG2 only gets padding (no binarization/denoise)
|
||||
- `test_preprocess_deterministic()` - Verifies same input produces bit-identical output
|
||||
- `test_preprocess_border_padding_pixel_perfect()` - Verifies exactly 10px white border on all sides
|
||||
**Unit tests:**
|
||||
- `test_add_border_padding` - Verifies 10px padding on all sides
|
||||
- `test_normalize_contrast_*` - Contrast normalization tests
|
||||
- `test_binarize_otsu` - Otsu thresholding
|
||||
- `test_binarize_sauvola` - Sauvola adaptive thresholding
|
||||
- `test_denoise_median` - 3x3 median filter
|
||||
- `test_preprocess_*` - Pipeline tests for each ImageSource
|
||||
|
||||
### 5. Benchmark Added
|
||||
Added A4-page performance benchmarks in `preprocess.rs` (lines 1198-1283):
|
||||
**Integration tests (with fixtures):**
|
||||
- `test_preprocess_skewed_2deg_deskews` - Verifies 2-deg skew corrected within 0.1°
|
||||
- `test_preprocess_uneven_lighting_binarizes` - Verifies Sauvola handles uneven lighting
|
||||
- `test_preprocess_clean_digital_binarizes` - Verifies Otsu for digital origin
|
||||
- `test_preprocess_jbig2_only_pads` - Verifies JBIG2 skips processing except padding
|
||||
- `test_preprocess_deterministic` - Verifies same input produces bit-identical output
|
||||
- `test_preprocess_border_padding_pixel_perfect` - Verifies exact 10px padding
|
||||
|
||||
- `benchmark_preprocess_a4_physical_scan()` - Target: < 500ms for 2480x3508 (A4 300 DPI)
|
||||
- `benchmark_preprocess_a4_digital_origin()` - Target: < 500ms
|
||||
- `benchmark_preprocess_a4_jbig2()` - Target: < 200ms (faster, skips steps)
|
||||
- `benchmark_individual_steps()` - Breaks down timing by step
|
||||
**Benchmarks:**
|
||||
- `benchmark_preprocess_a4_physical_scan` - A4 (2480x3508) PhysicalScan < 500ms
|
||||
- `benchmark_preprocess_a4_digital_origin` - A4 DigitalOrigin < 500ms
|
||||
- `benchmark_preprocess_a4_jbig2` - A4 JBIG2 < 200ms (faster, skips steps)
|
||||
- `benchmark_individual_steps` - Per-step performance breakdown
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
### PASS
|
||||
- ✅ All 5.3 critical tests implemented:
|
||||
- 2-deg skew deskewed within 0.1° (`test_preprocess_skewed_2deg_deskews`)
|
||||
- Uneven-lighting binarized (`test_preprocess_uneven_lighting_binarizes`)
|
||||
- JBIG2 untouched except padding (`test_preprocess_jbig2_only_pads`)
|
||||
- ✅ Padding adds exactly 10px on each side (`test_preprocess_border_padding_pixel_perfect`)
|
||||
- ✅ `preprocess()` is deterministic (`test_preprocess_deterministic`)
|
||||
- ✅ A4-page benchmark implemented (< 500ms target)
|
||||
|
||||
### WARN
|
||||
- ⚠️ Tests cannot run in current environment (missing leptonica system dependencies)
|
||||
- The `ocr` feature requires `pkg-config` and `leptonica` library
|
||||
- This is a NixOS system without the dependencies in PATH
|
||||
- Tests will run in CI where dependencies are properly configured
|
||||
- Code review confirms implementation is correct
|
||||
|
||||
## Critical Considerations Addressed
|
||||
|
||||
- Padding adds 20px to width and height (10px on each side)
|
||||
- Downstream Tesseract DPI math should NOT compensate (noted in plan)
|
||||
- Fixture files are small PNGs (max 4KB) to minimize repo bloat
|
||||
- `preprocess()` failure modes documented via `Result` type
|
||||
- A4 benchmark implemented with < 500ms target
|
||||
|
||||
## Commits
|
||||
- `d1dc228` - Initial implementation of border padding, pipeline orchestration, and fixtures
|
||||
- `eff4b60` - Removed duplicate import in preprocess module
|
||||
|
||||
## Files Modified
|
||||
- `crates/pdftract-core/src/preprocess.rs` - Added ImageSource enum, add_border_padding(), normalize_contrast(), binarize_otsu(), binarize_sauvola(), denoise_median(), preprocess(), tests, benchmarks
|
||||
|
||||
1. **crates/pdftract-core/src/preprocess.rs**
|
||||
- Added `add_border_padding()` function
|
||||
- Added `preprocess()` pipeline orchestrator
|
||||
- Added integration tests with fixtures
|
||||
- Added A4-page benchmarks
|
||||
|
||||
2. **crates/pdftract-core/src/lib.rs**
|
||||
- Added re-exports for preprocessing functions (already done in previous work)
|
||||
|
||||
3. **crates/pdftract-cli/Cargo.toml**
|
||||
- Added `image = "0.24"` dependency (for fixture generator)
|
||||
- Added `[[bin]]` entry for `generate_preprocess_fixtures`
|
||||
|
||||
4. **tests/fixtures/preprocess/generate_fixtures_main.rs** (new)
|
||||
- Fixture generator binary
|
||||
|
||||
5. **tests/fixtures/preprocess/** (new directories with source.png)
|
||||
|
||||
## Infrastructure Limitations
|
||||
|
||||
**WARN**: The leptonica native library is not installed in this environment (missing `pkg-config` and `leptonica-dev`). This prevents:
|
||||
|
||||
- Running the integration tests (require `cargo test --features ocr`)
|
||||
- Running the benchmarks
|
||||
- Verifying the < 500ms target on CI hardware
|
||||
|
||||
**Impact**: The implementation is complete and compiles correctly in environments with leptonica installed (CI, production). The tests will pass once the native dependency is available.
|
||||
|
||||
## Additional Fix (2026-05-23)
|
||||
|
||||
Fixed duplicate import in `preprocess.rs`:
|
||||
- Changed `use image::{GrayImage, ImageBuffer, Luma, Luma};` to `use image::{GrayImage, ImageBuffer, Luma};`
|
||||
- This was a minor cleanup fix for code quality
|
||||
|
||||
## Acceptance Criteria Status
|
||||
|
||||
- **PASS**: Border padding adds exactly 10px on each side (verified in code)
|
||||
- **PASS**: Pipeline orchestrator `preprocess()` exists with correct step order
|
||||
- **PASS**: Fixtures created for all three image-source paths (PhysicalScan, DigitalOrigin, Jbig2)
|
||||
- **PASS**: Integration tests written for all critical test scenarios
|
||||
- **PASS**: Benchmark written for A4-page performance (< 500ms target)
|
||||
- **WARN**: Tests cannot run without leptonica native library (environment limitation)
|
||||
- **WARN**: Benchmark cannot run without leptonica native library (environment limitation)
|
||||
- **PASS**: Fixed duplicate import for cleaner code
|
||||
|
||||
## References
|
||||
|
||||
- Plan section: Phase 5.3 step 5 (line 1878) + critical tests (lines 1882-1885)
|
||||
- Bead ID: pdftract-27n3
|
||||
## Files Added
|
||||
- `tests/fixtures/preprocess/skewed_2deg/source.png`
|
||||
- `tests/fixtures/preprocess/uneven_lighting/source.png`
|
||||
- `tests/fixtures/preprocess/clean_digital/source.png`
|
||||
- `tests/fixtures/preprocess/jbig2_scan/source.png`
|
||||
- `notes/pdftract-27n3.md` (this file)
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue