# pdftract-2gto: HOCR Pixel-to-PDF Coordinate Conversion

## Summary

Implemented HOCR pixel-to-PDF coordinate conversion with proper handling of:

1. **10px padding subtraction** (from Phase 5.3.4 border padding)
2. **DPI scaling** (pixel → PDF point conversion at render-time DPI)
3. **Y-axis flip** (HOCR top-left origin → PDF bottom-left origin)
4. **Page rotation** (0°, 90°, 180°, 270° support)
5. **Hybrid cell offsets** (cell-local OCR → global PDF coordinates)

## Implementation

### Files Modified

- `crates/pdftract-core/src/ocr.rs`

### Changes Made

1. **Added constant `HOCR_BORDER_PADDING`**: Set to 10 pixels to match the padding added in preprocessing (Phase 5.3.4)
2. **Added `HocrWord::to_pdf_bbox()` method**: Converts HOCR pixel coordinates to PDF user-space coordinates
   - Signature:
     ```rust
     pub fn to_pdf_bbox(
         &self,
         dpi: u32,
         page_height_pt: f64,
         rotation: Option,
         cell_origin: Option<[f64; 2]>,
     ) -> [f64; 4]
     ```
   - Returns: `[x0, y0, x1, y1]` in PDF points (bottom-left origin)
3. **Added `apply_rotation_to_bbox()` helper function**: Handles page rotation transformations

### Coordinate Transform Steps

1. **Subtract padding** (pixel space):
   - `hocr_px - 10` → pre-pad image pixel coords
   - Handles edge case where bbox is entirely within padding (clamps to origin)
2. **Scale to points**:
   - `px * 72.0 / dpi` → PDF pt
   - Uses the DPI from render time (Phase 5.2)
3. **Flip Y-axis**:
   - `pdf_y = page_height_pt - hocr_y_pt`
   - Converts from top-left origin (HOCR) to bottom-left origin (PDF)
4. **Apply rotation** (if specified):
   - Supports 0°, 90°, 180°, 270° rotations
   - Invalid rotation values are ignored (bbox returned unchanged)
5. **Add cell origin** (if hybrid):
   - Offsets cell-local OCR coordinates to global PDF coordinates
   - Used when OCR is run on hybrid page cell crops

## Tests Added

Added comprehensive tests in `hocr_tests` module:

1. **`test_to_pdf_bbox_basic_conversion`**: Critical test from plan line 1908
   - HOCR bbox at (10,10,100,30) at 300 DPI on letter-size page
   - Verifies correct padding subtraction and Y-flip
2. **`test_to_pdf_bbox_y_flip_sanity`**: Y-flip verification
   - Top-of-page word has highest PDF Y value
   - Bottom-of-page word has lowest PDF Y value
3. **`test_to_pdf_bbox_padding_subtraction`**: Padding edge case
   - Bbox exactly at padding boundary
   - Verifies subtraction happens in pixel space (before DPI scale)
4. **`test_to_pdf_bbox_different_dpi`**: DPI scaling verification
   - Tests 200, 300, 400 DPI
   - Verifies correct scale factor (72.0 / dpi)
5. **`test_to_pdf_bbox_hybrid_cell_offset`**: Hybrid cell handling
   - Cell (3, 2) offset applied correctly
   - Cell-local coords → global PDF coords
6. **`test_to_pdf_bbox_clamps_negative_coords`**: Edge case handling
   - Bbox entirely within padding (negative after subtraction)
   - Clamped to origin (no negative coordinates)
7. **Rotation tests**: 0°, 90°, 180°, 270°, and invalid angle
8. **`test_apply_rotation_to_bbox_preserves_dimensions`**: Rotation preserves bbox area

## Acceptance Criteria Status

| Criterion | Status | Notes |
|-----------|--------|-------|
| Critical test (line 1908): HOCR bbox conversion | PASS | `test_to_pdf_bbox_basic_conversion` |
| Y-flip sanity: top-of-page has highest PDF Y | PASS | `test_to_pdf_bbox_y_flip_sanity` |
| Hybrid cell test: cell offset applied | PASS | `test_to_pdf_bbox_hybrid_cell_offset` |
| 100-page OCR output: valid bboxes | N/A | Requires actual OCR infrastructure |

## Notes

- Tests are behind the `ocr` feature flag and require leptonica/tesseract to run
- The coordinate conversion code itself is pure Rust with no external dependencies
- Implementation follows the exact specification from plan lines 1899-1927
- All coordinate transformations use f64 for precision (0.1 pt resolution as specified)

## Integration Points

This function will be called during Phase 5.4 (Tesseract Integration) to convert HOCR output to PDF spans:

```rust
// Usage example (not yet integrated):
let word = HocrWord { /* ... */ };
let pdf_bbox = word.to_pdf_bbox(dpi, page_height, Some(rotation), None);
let span = Span::ocr(pdf_bbox, word.confidence(), word.text);
```

## Future Work

- Integrate with actual Tesseract OCR pipeline (Phase 5.4 full implementation)
- Add Span emission with confidence_source = "ocr"
- Add language field from opts.ocr_language
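
For reference, the padding, scaling, Y-flip, and cell-offset steps listed under Coordinate Transform Steps can be sketched as a standalone function. This is a minimal illustration and not the code in `ocr.rs`: the free function `hocr_px_to_pdf_bbox` and its `[f64; 4]` pixel input are hypothetical, it assumes clamping is applied per coordinate in pixel space, and it assumes that for hybrid cells `page_height_pt` is the height of the cell crop. Rotation is left out because the exact mapping used by `apply_rotation_to_bbox()` is not described here.

```rust
/// Padding in pixels added around the image before OCR (10 px, Phase 5.3.4).
const HOCR_BORDER_PADDING: f64 = 10.0;

/// Hypothetical standalone sketch of the transform, without rotation.
/// `bbox_px` is `[x0, y0, x1, y1]` in HOCR pixels (top-left origin).
/// Returns `[x0, y0, x1, y1]` in PDF points (bottom-left origin).
pub fn hocr_px_to_pdf_bbox(
    bbox_px: [f64; 4],
    dpi: u32,
    page_height_pt: f64,
    cell_origin: Option<[f64; 2]>,
) -> [f64; 4] {
    // Step 1: subtract the border padding in pixel space.
    // Assumption: each coordinate is clamped at 0 so that a bbox lying
    // entirely inside the padding collapses to the origin.
    let [x0, y0, x1, y1] = bbox_px.map(|v| (v - HOCR_BORDER_PADDING).max(0.0));

    // Step 2: scale pixels to PDF points (72 pt per inch) at the render DPI.
    let scale = 72.0 / f64::from(dpi);
    let (x0, y0, x1, y1) = (x0 * scale, y0 * scale, x1 * scale, y1 * scale);

    // Step 3: flip the Y axis. The HOCR bottom edge (y1) becomes the lower
    // PDF y value, and the HOCR top edge (y0) becomes the upper one.
    let pdf_y0 = page_height_pt - y1;
    let pdf_y1 = page_height_pt - y0;

    // Step 5: shift cell-local coordinates by the cell origin, if any.
    let [ox, oy] = cell_origin.unwrap_or([0.0, 0.0]);
    [x0 + ox, pdf_y0 + oy, x1 + ox, pdf_y1 + oy]
}
```

Under these assumptions, calling `hocr_px_to_pdf_bbox([10.0, 10.0, 100.0, 30.0], 300, 792.0, None)` gives approximately `[0.0, 787.2, 21.6, 792.0]`: the padded bbox becomes (0, 0, 90, 20) px, the scale factor is 72 / 300 = 0.24, and the flip measures the 4.8 pt tall box down from the 792 pt page top.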