Implement coordinate transform from HOCR pixel space to PDF user-space points, accounting for the 10px white border added in preprocessing (Phase 5.3.4) and the DPI used at render time (Phase 5.2).

Changes:
- Add HOCR_BORDER_PADDING constant (10px) to match preprocessing padding
- Add HocrWord::to_pdf_bbox() method for coordinate conversion
- Add apply_rotation_to_bbox() helper for page rotation handling

Coordinate transform steps:
1. Subtract padding (pixel space): hocr_px - 10
2. Scale to points: px * 72.0 / dpi
3. Flip Y-axis: pdf_y = page_height_pt - hocr_y_pt
4. Apply rotation (if specified): 0°, 90°, 180°, 270°
5. Add cell origin (if hybrid): offset by cell's PDF origin

Tests added:
- test_to_pdf_bbox_basic_conversion: Critical test from plan line 1908
- test_to_pdf_bbox_y_flip_sanity: Top-of-page word has highest PDF Y
- test_to_pdf_bbox_padding_subtraction: Padding edge case
- test_to_pdf_bbox_different_dpi: 200/300/400 DPI verification
- test_to_pdf_bbox_hybrid_cell_offset: Cell-local to global coords
- test_to_pdf_bbox_clamps_negative_coords: Bbox within padding
- Rotation tests: 0°, 90°, 180°, 270°, and invalid angles

Acceptance criteria:
✓ Critical test (line 1908): HOCR bbox at (10,10,100,30) at 300 DPI
✓ Y-flip sanity: top-of-page has highest PDF Y
✓ Hybrid cell test: cell offset applied correctly
○ 100-page OCR output: requires OCR infrastructure (deferred)

Refs: pdftract-2gto, plan lines 1899-1927
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-2gto: HOCR Pixel-to-PDF Coordinate Conversion
Summary
Implemented HOCR pixel-to-PDF coordinate conversion with proper handling of:
- 10px padding subtraction (from Phase 5.3.4 border padding)
- DPI scaling (pixel → PDF point conversion at render-time DPI)
- Y-axis flip (HOCR top-left origin → PDF bottom-left origin)
- Page rotation (0°, 90°, 180°, 270° support)
- Hybrid cell offsets (cell-local OCR → global PDF coordinates)
Implementation
Files Modified
crates/pdftract-core/src/ocr.rs
Changes Made
- Added constant `HOCR_BORDER_PADDING`: Set to 10 pixels to match the padding added in preprocessing (Phase 5.3.4)
- Added `HocrWord::to_pdf_bbox()` method: Converts HOCR pixel coordinates to PDF user-space coordinates
  - Signature: `pub fn to_pdf_bbox(&self, dpi: u32, page_height_pt: f64, rotation: Option<i32>, cell_origin: Option<[f64; 2]>) -> [f64; 4]`
  - Returns: `[x0, y0, x1, y1]` in PDF points (bottom-left origin)
- Added `apply_rotation_to_bbox()` helper function: Handles page rotation transformations
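The notes do not spell out the pivot, direction, or signature of `apply_rotation_to_bbox()`. As a rough illustration only, the sketch below assumes a counter-clockwise rotation about the page origin followed by a translation back into the positive quadrant. The name `rotate_bbox_ccw` and the `page_w` / `page_h` parameters are hypothetical and not taken from the source. The behavior for invalid angles (bbox returned unchanged) follows the notes.

```rust
/// Hypothetical helper, not the project's `apply_rotation_to_bbox()`.
/// Rotates an axis-aligned bbox `[x0, y0, x1, y1]` counter-clockwise by a
/// multiple of 90 degrees on a page of size `page_w` x `page_h` (points),
/// then shifts the result so it stays in the positive quadrant.
fn rotate_bbox_ccw(bbox: [f64; 4], rotation: i32, page_w: f64, page_h: f64) -> [f64; 4] {
    let [x0, y0, x1, y1] = bbox;
    match rotation {
        0 => bbox,
        // (x, y) -> (page_h - y, x)
        90 => [page_h - y1, x0, page_h - y0, x1],
        // (x, y) -> (page_w - x, page_h - y)
        180 => [page_w - x1, page_h - y1, page_w - x0, page_h - y0],
        // (x, y) -> (y, page_w - x)
        270 => [y0, page_w - x1, y1, page_w - x0],
        // Invalid angles leave the bbox unchanged, as described in the notes.
        _ => bbox,
    }
}

/// Area of a bbox; a quarter-turn swaps width and height but keeps the area.
fn bbox_area(b: [f64; 4]) -> f64 {
    (b[2] - b[0]) * (b[3] - b[1])
}
```

In this sketch the output keeps `x0 <= x1` and `y0 <= y1` for every valid angle, which is one way to make an area-preservation check meaningful.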
Coordinate Transform Steps
1. Subtract padding (pixel space): `hocr_px - 10` → pre-pad image pixel coords
   - Handles edge case where bbox is entirely within padding (clamps to origin)
2. Scale to points: `px * 72.0 / dpi` → PDF pt
   - Uses the DPI from render time (Phase 5.2)
3. Flip Y-axis: `pdf_y = page_height_pt - hocr_y_pt`
   - Converts from top-left origin (HOCR) to bottom-left origin (PDF)
4. Apply rotation (if specified):
   - Supports 0°, 90°, 180°, 270° rotations
   - Invalid rotation values are ignored (bbox returned unchanged)
5. Add cell origin (if hybrid):
   - Offsets cell-local OCR coordinates to global PDF coordinates
   - Used when OCR is run on hybrid page cell crops
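The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `ocr.rs`: `WordPx` is a hypothetical stand-in for `HocrWord`, rotation (step 4) is left out because its exact convention is not given here, and the clamp is assumed to be a per-coordinate `max(0.0)`. For the hybrid case the sketch further assumes that `page_height_pt` is the height of the rendered crop and that `cell_origin` is the crop's bottom-left corner in page coordinates.

```rust
/// Padding added around the image in preprocessing, in pixels (value from the notes).
const HOCR_BORDER_PADDING: f64 = 10.0;

/// Hypothetical stand-in for `HocrWord`; only the pixel bbox is modeled.
struct WordPx {
    /// `[x0, y0, x1, y1]` in padded-image pixels, top-left origin.
    bbox: [f64; 4],
}

impl WordPx {
    /// Steps 1-3 and 5 of the transform; rotation is intentionally omitted.
    fn to_pdf_bbox(
        &self,
        dpi: u32,
        page_height_pt: f64,
        cell_origin: Option<[f64; 2]>,
    ) -> [f64; 4] {
        let scale = 72.0 / f64::from(dpi);
        // Steps 1 and 2: subtract the padding in pixel space (clamped at zero),
        // then scale pixels to points.
        let to_pt = |px: f64| (px - HOCR_BORDER_PADDING).max(0.0) * scale;
        let [x0, top, x1, bottom] = self.bbox.map(to_pt);
        // Step 3: flip Y. The HOCR top edge becomes the larger PDF y.
        let (y0, y1) = (page_height_pt - bottom, page_height_pt - top);
        // Step 5: shift cell-local coordinates by the cell origin, if any.
        let [ox, oy] = cell_origin.unwrap_or([0.0, 0.0]);
        [x0 + ox, y0 + oy, x1 + ox, y1 + oy]
    }
}
```

Under these assumptions, the bbox (10,10,100,30) at 300 DPI on a 792 pt tall (letter) page comes out at approximately `[0.0, 787.2, 21.6, 792.0]`: the left and top edges sit exactly on the padding boundary, so they map to the page's left edge and top edge.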
Tests Added
Added comprehensive tests in hocr_tests module:
- `test_to_pdf_bbox_basic_conversion`: Critical test from plan line 1908
  - HOCR bbox at (10,10,100,30) at 300 DPI on letter-size page
  - Verifies correct padding subtraction and Y-flip
- `test_to_pdf_bbox_y_flip_sanity`: Y-flip verification
  - Top-of-page word has highest PDF Y value
  - Bottom-of-page word has lowest PDF Y value
- `test_to_pdf_bbox_padding_subtraction`: Padding edge case
  - Bbox exactly at padding boundary
  - Verifies subtraction happens in pixel space (before DPI scale)
- `test_to_pdf_bbox_different_dpi`: DPI scaling verification
  - Tests 200, 300, 400 DPI
  - Verifies correct scale factor (72.0 / dpi)
- `test_to_pdf_bbox_hybrid_cell_offset`: Hybrid cell handling
  - Cell (3, 2) offset applied correctly
  - Cell-local coords → global PDF coords
- `test_to_pdf_bbox_clamps_negative_coords`: Edge case handling
  - Bbox entirely within padding (negative after subtraction)
  - Clamped to origin (no negative coordinates)
- Rotation tests: 0°, 90°, 180°, 270°, and invalid angle
- `test_apply_rotation_to_bbox_preserves_dimensions`: Rotation preserves bbox area
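The padding and DPI cases can be checked by hand with a small helper. The sketch below is illustrative and does not reproduce the real tests: `px_to_pt` and `px_to_pt_wrong_order` are hypothetical names, and the second function exists only to show why the subtraction has to happen in pixel space before scaling.

```rust
/// Padding added around the image in preprocessing, in pixels (value from the notes).
const HOCR_BORDER_PADDING: f64 = 10.0;

/// Hypothetical helper: convert one padded-image pixel coordinate to PDF points.
/// The padding is removed first (clamped at zero), then the result is scaled.
fn px_to_pt(px: f64, dpi: u32) -> f64 {
    (px - HOCR_BORDER_PADDING).max(0.0) * 72.0 / f64::from(dpi)
}

/// Deliberately wrong order, for contrast only: scaling first and then
/// subtracting 10 treats the pixel padding as if it were 10 points.
fn px_to_pt_wrong_order(px: f64, dpi: u32) -> f64 {
    (px * 72.0 / f64::from(dpi) - HOCR_BORDER_PADDING).max(0.0)
}
```

With a coordinate of 310 px, the correct order gives 108 pt at 200 DPI, 72 pt at 300 DPI, and 54 pt at 400 DPI, because 300 px of real image remain after the padding is removed. The wrong order gives 64.4 pt at 300 DPI, an error that changes with the DPI.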
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
| Critical test (line 1908): HOCR bbox conversion | PASS | test_to_pdf_bbox_basic_conversion |
| Y-flip sanity: top-of-page has highest PDF Y | PASS | test_to_pdf_bbox_y_flip_sanity |
| Hybrid cell test: cell offset applied | PASS | test_to_pdf_bbox_hybrid_cell_offset |
| 100-page OCR output: valid bboxes | Deferred | Requires actual OCR infrastructure |
Notes
- Tests are behind the `ocr` feature flag and require leptonica/tesseract to run
- The coordinate conversion code itself is pure Rust with no external dependencies
- Implementation follows the exact specification from plan lines 1899-1927
- All coordinate transformations use f64 for precision (0.1 pt resolution as specified)
Integration Points
This function will be called during Phase 5.4 (Tesseract Integration) to convert HOCR output to PDF spans:
// Usage example (not yet integrated):
let word = HocrWord { /* ... */ };
let pdf_bbox = word.to_pdf_bbox(dpi, page_height, Some(rotation), None);
let span = Span::ocr(pdf_bbox, word.confidence(), word.text);
Future Work
- Integrate with actual Tesseract OCR pipeline (Phase 5.4 full implementation)
- Add Span emission with confidence_source = "ocr"
- Add language field from opts.ocr_language