Implement coordinate transform from HOCR pixel space to PDF user-space points, accounting for the 10px white border added in preprocessing (Phase 5.3.4) and the DPI used at render time (Phase 5.2).

Changes:
- Add HOCR_BORDER_PADDING constant (10px) to match preprocessing padding
- Add HocrWord::to_pdf_bbox() method for coordinate conversion
- Add apply_rotation_to_bbox() helper for page rotation handling

Coordinate transform steps:
1. Subtract padding (pixel space): hocr_px - 10
2. Scale to points: px * 72.0 / dpi
3. Flip Y-axis: pdf_y = page_height_pt - hocr_y_pt
4. Apply rotation (if specified): 0°, 90°, 180°, 270°
5. Add cell origin (if hybrid): offset by cell's PDF origin

Tests added:
- test_to_pdf_bbox_basic_conversion: Critical test from plan line 1908
- test_to_pdf_bbox_y_flip_sanity: Top-of-page word has highest PDF Y
- test_to_pdf_bbox_padding_subtraction: Padding edge case
- test_to_pdf_bbox_different_dpi: 200/300/400 DPI verification
- test_to_pdf_bbox_hybrid_cell_offset: Cell-local to global coords
- test_to_pdf_bbox_clamps_negative_coords: Bbox within padding
- Rotation tests: 0°, 90°, 180°, 270°, and invalid angles

Acceptance criteria:
✓ Critical test (line 1908): HOCR bbox at (10,10,100,30) at 300 DPI
✓ Y-flip sanity: top-of-page has highest PDF Y
✓ Hybrid cell test: cell offset applied correctly
○ 100-page OCR output: requires OCR infrastructure (deferred)

Refs: pdftract-2gto, plan lines 1899-1927

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# pdftract-2gto: HOCR Pixel-to-PDF Coordinate Conversion

## Summary

Implemented HOCR pixel-to-PDF coordinate conversion with proper handling of:
1. **10px padding subtraction** (from Phase 5.3.4 border padding)
2. **DPI scaling** (pixel → PDF point conversion at render-time DPI)
3. **Y-axis flip** (HOCR top-left origin → PDF bottom-left origin)
4. **Page rotation** (0°, 90°, 180°, 270° support)
5. **Hybrid cell offsets** (cell-local OCR → global PDF coordinates)

## Implementation

### Files Modified
- `crates/pdftract-core/src/ocr.rs`

### Changes Made

1. **Added constant `HOCR_BORDER_PADDING`**: Set to 10 pixels to match the padding added in preprocessing (Phase 5.3.4)

2. **Added `HocrWord::to_pdf_bbox()` method**: Converts HOCR pixel coordinates to PDF user-space coordinates
   - Signature:
     ```rust
     pub fn to_pdf_bbox(
         &self,
         dpi: u32,
         page_height_pt: f64,
         rotation: Option<i32>,
         cell_origin: Option<[f64; 2]>,
     ) -> [f64; 4]
     ```
   - Returns: `[x0, y0, x1, y1]` in PDF points (bottom-left origin)

3. **Added `apply_rotation_to_bbox()` helper function**: Handles page rotation transformations

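The rotation helper's signature is not reproduced in this note, so the following is only a rough sketch of what such a helper could look like. It assumes clockwise angles (the convention of the PDF `/Rotate` entry) and takes the unrotated page size as explicit parameters; the name `rotate_bbox_sketch` and the `page_w_pt`/`page_h_pt` parameters are hypothetical and not taken from `ocr.rs`.

```rust
/// Hypothetical sketch: rotate a PDF-space bbox clockwise by `rotation` degrees.
/// `page_w_pt` / `page_h_pt` are the unrotated page dimensions in points.
/// Both parameters are assumptions; the real helper may be shaped differently.
fn rotate_bbox_sketch(
    bbox: [f64; 4],
    rotation: i32,
    page_w_pt: f64,
    page_h_pt: f64,
) -> [f64; 4] {
    let [x0, y0, x1, y1] = bbox;

    // Map a single point from unrotated page space into rotated page space.
    let rotate_point = |x: f64, y: f64| -> Option<(f64, f64)> {
        match rotation {
            0 => Some((x, y)),
            90 => Some((y, page_w_pt - x)),
            180 => Some((page_w_pt - x, page_h_pt - y)),
            270 => Some((page_h_pt - y, x)),
            _ => None,
        }
    };

    match (rotate_point(x0, y0), rotate_point(x1, y1)) {
        // Re-normalize so that (x0, y0) is the lower-left corner again.
        (Some((ax, ay)), Some((bx, by))) => {
            [ax.min(bx), ay.min(by), ax.max(bx), ay.max(by)]
        }
        // Unsupported angle: return the bbox unchanged, as described in this note.
        _ => bbox,
    }
}
```

Because only the two opposite corners are mapped and then re-sorted, a right-angle rotation swaps or keeps width and height but never changes the bbox area, which matches the property checked by `test_apply_rotation_to_bbox_preserves_dimensions`.
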
### Coordinate Transform Steps

1. **Subtract padding** (pixel space):
   - `hocr_px - 10` → pre-pad image pixel coords
   - Handles edge case where bbox is entirely within padding (clamps to origin)

2. **Scale to points**:
   - `px * 72.0 / dpi` → PDF pt
   - Uses the DPI from render time (Phase 5.2)

3. **Flip Y-axis**:
   - `pdf_y = page_height_pt - hocr_y_pt`
   - Converts from top-left origin (HOCR) to bottom-left origin (PDF)

4. **Apply rotation** (if specified):
   - Supports 0°, 90°, 180°, 270° rotations
   - Invalid rotation values are ignored (bbox returned unchanged)

5. **Add cell origin** (if hybrid):
   - Offsets cell-local OCR coordinates to global PDF coordinates
   - Used when OCR is run on hybrid page cell crops

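The steps above can be sketched as a small free function. This is an illustration rather than the code in `ocr.rs`: it takes the bbox as a plain `[f64; 4]` instead of a `HocrWord`, leaves out the rotation step, and assumes clamping means flooring each pixel coordinate at zero after the padding is removed. The name `sketch_to_pdf_bbox` is hypothetical, and how the flip height is chosen for cell crops is not specified in this note, so the sketch simply adds the cell origin at the end.

```rust
/// Assumed to mirror the 10 px border added in preprocessing (Phase 5.3.4).
const HOCR_BORDER_PADDING: f64 = 10.0;

/// Hypothetical sketch of steps 1, 2, 3 and 5 (rotation omitted).
/// `hocr_bbox` is [x0, y0, x1, y1] in padded-image pixels, top-left origin.
fn sketch_to_pdf_bbox(
    hocr_bbox: [f64; 4],
    dpi: u32,
    page_height_pt: f64,
    cell_origin: Option<[f64; 2]>,
) -> [f64; 4] {
    // Step 1: remove the border in pixel space; clamp so a bbox lying
    // inside the padding never produces negative coordinates.
    let px = hocr_bbox.map(|v| (v - HOCR_BORDER_PADDING).max(0.0));

    // Step 2: pixels to points (72 pt per inch, `dpi` px per inch).
    let [x0, top, x1, bottom] = px.map(|v| v * 72.0 / f64::from(dpi));

    // Step 3: flip Y. The HOCR top edge becomes the larger PDF y value.
    let y1 = page_height_pt - top;
    let y0 = page_height_pt - bottom;

    // Step 5: shift cell-local coordinates by the cell's PDF origin.
    let [ox, oy] = cell_origin.unwrap_or([0.0, 0.0]);
    [x0 + ox, y0 + oy, x1 + ox, y1 + oy]
}
```

Under these assumptions, the HOCR bbox (10, 10, 100, 30) at 300 DPI on a 792 pt tall page becomes (0, 0, 90, 20) px after padding removal, (0, 0, 21.6, 4.8) pt after scaling, and approximately `[0.0, 787.2, 21.6, 792.0]` after the Y flip.
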
## Tests Added

Added comprehensive tests in `hocr_tests` module:

1. **`test_to_pdf_bbox_basic_conversion`**: Critical test from plan line 1908
   - HOCR bbox at (10,10,100,30) at 300 DPI on letter-size page
   - Verifies correct padding subtraction and Y-flip

2. **`test_to_pdf_bbox_y_flip_sanity`**: Y-flip verification
   - Top-of-page word has highest PDF Y value
   - Bottom-of-page word has lowest PDF Y value

3. **`test_to_pdf_bbox_padding_subtraction`**: Padding edge case
   - Bbox exactly at padding boundary
   - Verifies subtraction happens in pixel space (before DPI scale)

4. **`test_to_pdf_bbox_different_dpi`**: DPI scaling verification
   - Tests 200, 300, 400 DPI
   - Verifies correct scale factor (72.0 / dpi)

5. **`test_to_pdf_bbox_hybrid_cell_offset`**: Hybrid cell handling
   - Cell (3, 2) offset applied correctly
   - Cell-local coords → global PDF coords

6. **`test_to_pdf_bbox_clamps_negative_coords`**: Edge case handling
   - Bbox entirely within padding (negative after subtraction)
   - Clamped to origin (no negative coordinates)

7. **Rotation tests**: 0°, 90°, 180°, 270°, and invalid angle

8. **`test_apply_rotation_to_bbox_preserves_dimensions`**: Rotation preserves bbox area

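To give a feel for what the DPI and Y-flip checks assert, here is a minimal stand-alone sketch. It does not reproduce the tests in `hocr_tests`; the helpers `px_to_pt` and `flip_y`, the module name `sketch_tests`, and the sample pixel values are all hypothetical and chosen only to illustrate the two properties.

```rust
/// Pixel-to-point scale: 72 pt per inch over `dpi` px per inch.
fn px_to_pt(px: f64, dpi: u32) -> f64 {
    px * 72.0 / f64::from(dpi)
}

/// Flip a top-origin Y (in points) to PDF's bottom-origin Y.
fn flip_y(y_pt: f64, page_height_pt: f64) -> f64 {
    page_height_pt - y_pt
}

#[cfg(test)]
mod sketch_tests {
    use super::*;

    #[test]
    fn dpi_scaling() {
        // 300 px spans 1.5 in, 1 in and 0.75 in at 200, 300 and 400 DPI.
        for (dpi, expected) in [(200, 108.0), (300, 72.0), (400, 54.0)] {
            assert!((px_to_pt(300.0, dpi) - expected).abs() < 1e-9);
        }
    }

    #[test]
    fn y_flip_sanity() {
        let page_h = 792.0; // letter-size page height in points
        let top_word = flip_y(px_to_pt(100.0, 300), page_h);
        let bottom_word = flip_y(px_to_pt(3000.0, 300), page_h);
        // A word near the top of the image must end up with the larger PDF Y.
        assert!(top_word > bottom_word);
    }
}
```
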
## Acceptance Criteria Status

| Criterion | Status | Notes |
|-----------|--------|-------|
| Critical test (line 1908): HOCR bbox conversion | PASS | `test_to_pdf_bbox_basic_conversion` |
| Y-flip sanity: top-of-page has highest PDF Y | PASS | `test_to_pdf_bbox_y_flip_sanity` |
| Hybrid cell test: cell offset applied | PASS | `test_to_pdf_bbox_hybrid_cell_offset` |
| 100-page OCR output: valid bboxes | N/A | Requires actual OCR infrastructure |

## Notes

- Tests are behind the `ocr` feature flag and require leptonica/tesseract to run
- The coordinate conversion code itself is pure Rust with no external dependencies
- Implementation follows the exact specification from plan lines 1899-1927
- All coordinate transformations use f64 for precision (0.1 pt resolution as specified)

## Integration Points

This function will be called during Phase 5.4 (Tesseract Integration) to convert HOCR output to PDF spans:

```rust
// Usage example (not yet integrated):
let word = HocrWord { /* ... */ };
let pdf_bbox = word.to_pdf_bbox(dpi, page_height, Some(rotation), None);
let span = Span::ocr(pdf_bbox, word.confidence(), word.text);
```

## Future Work

- Integrate with actual Tesseract OCR pipeline (Phase 5.4 full implementation)
- Add Span emission with `confidence_source = "ocr"`
- Add language field from `opts.ocr_language`