pdftract/notes/pdftract-2gto.md
jedarden 3b91b340aa feat(pdftract-2gto): implement HOCR pixel-to-PDF coordinate conversion
Implement coordinate transform from HOCR pixel space to PDF user-space
points, accounting for the 10px white border added in preprocessing
(Phase 5.3.4) and the DPI used at render time (Phase 5.2).

Changes:
- Add HOCR_BORDER_PADDING constant (10px) to match preprocessing padding
- Add HocrWord::to_pdf_bbox() method for coordinate conversion
- Add apply_rotation_to_bbox() helper for page rotation handling

Coordinate transform steps:
1. Subtract padding (pixel space): hocr_px - 10
2. Scale to points: px * 72.0 / dpi
3. Flip Y-axis: pdf_y = page_height_pt - hocr_y_pt
4. Apply rotation (if specified): 0°, 90°, 180°, 270°
5. Add cell origin (if hybrid): offset by cell's PDF origin

Tests added:
- test_to_pdf_bbox_basic_conversion: Critical test from plan line 1908
- test_to_pdf_bbox_y_flip_sanity: Top-of-page word has highest PDF Y
- test_to_pdf_bbox_padding_subtraction: Padding edge case
- test_to_pdf_bbox_different_dpi: 200/300/400 DPI verification
- test_to_pdf_bbox_hybrid_cell_offset: Cell-local to global coords
- test_to_pdf_bbox_clamps_negative_coords: Bbox within padding
- Rotation tests: 0°, 90°, 180°, 270°, and invalid angles

Acceptance criteria:
✓ Critical test (line 1908): HOCR bbox at (10,10,100,30) at 300 DPI
✓ Y-flip sanity: top-of-page has highest PDF Y
✓ Hybrid cell test: cell offset applied correctly
○ 100-page OCR output: requires OCR infrastructure (deferred)

Refs: pdftract-2gto, plan lines 1899-1927

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:56:41 -04:00


# pdftract-2gto: HOCR Pixel-to-PDF Coordinate Conversion
## Summary
Implemented HOCR pixel-to-PDF coordinate conversion with proper handling of:
1. **10px padding subtraction** (from Phase 5.3.4 border padding)
2. **DPI scaling** (pixel → PDF point conversion at render-time DPI)
3. **Y-axis flip** (HOCR top-left origin → PDF bottom-left origin)
4. **Page rotation** (0°, 90°, 180°, 270° support)
5. **Hybrid cell offsets** (cell-local OCR → global PDF coordinates)
## Implementation
### Files Modified
- `crates/pdftract-core/src/ocr.rs`
### Changes Made
1. **Added constant `HOCR_BORDER_PADDING`**: Set to 10 pixels to match the padding added in preprocessing (Phase 5.3.4)
2. **Added `HocrWord::to_pdf_bbox()` method**: Converts HOCR pixel coordinates to PDF user-space coordinates
- Signature:
```rust
pub fn to_pdf_bbox(
&self,
dpi: u32,
page_height_pt: f64,
rotation: Option<i32>,
cell_origin: Option<[f64; 2]>,
) -> [f64; 4]
```
- Returns: `[x0, y0, x1, y1]` in PDF points (bottom-left origin)
3. **Added `apply_rotation_to_bbox()` helper function**: Handles page rotation transformations
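The note does not show the body or signature of `apply_rotation_to_bbox()`. As a rough illustration only, here is one way a right-angle bbox rotation could look. The function name `rotate_bbox_sketch`, the `page_w`/`page_h` parameters, and the clockwise direction convention are all assumptions made for this sketch, not the crate's actual API.

```rust
/// Hypothetical sketch, not the crate's `apply_rotation_to_bbox`.
/// Rotates a `[x0, y0, x1, y1]` bbox (PDF points, bottom-left origin)
/// clockwise by `rotation` degrees inside an unrotated page of size
/// `page_w` x `page_h`. The clockwise convention is an assumption.
fn rotate_bbox_sketch(bbox: [f64; 4], rotation: i32, page_w: f64, page_h: f64) -> [f64; 4] {
    let [x0, y0, x1, y1] = bbox;
    // Map both corners through the rotation.
    let (ax, ay, bx, by) = match rotation {
        0 => (x0, y0, x1, y1),
        // Clockwise 90: (x, y) -> (y, page_w - x)
        90 => (y0, page_w - x0, y1, page_w - x1),
        // 180: (x, y) -> (page_w - x, page_h - y)
        180 => (page_w - x0, page_h - y0, page_w - x1, page_h - y1),
        // Clockwise 270: (x, y) -> (page_h - y, x)
        270 => (page_h - y0, x0, page_h - y1, x1),
        // Invalid angles: return the bbox unchanged, as the note describes.
        _ => return bbox,
    };
    // Re-sort the corners so that x0 <= x1 and y0 <= y1 still hold.
    [ax.min(bx), ay.min(by), ax.max(bx), ay.max(by)]
}
```

Under this convention a 90° or 270° rotation swaps the bbox width and height, so the area is preserved, which is the property `test_apply_rotation_to_bbox_preserves_dimensions` is described as checking.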
### Coordinate Transform Steps
1. **Subtract padding** (pixel space):
- `hocr_px - 10` → pre-pad image pixel coords
- If the bbox lies entirely within the padding, the negative result is clamped to the origin
2. **Scale to points**:
- `px * 72.0 / dpi` → PDF pt
- Uses the DPI from render time (Phase 5.2)
3. **Flip Y-axis**:
- `pdf_y = page_height_pt - hocr_y_pt`
- Converts from top-left origin (HOCR) to bottom-left origin (PDF)
4. **Apply rotation** (if specified):
- Supports 0°, 90°, 180°, 270° rotations
- Invalid rotation values are ignored (bbox returned unchanged)
5. **Add cell origin** (if hybrid):
- Offsets cell-local OCR coordinates to global PDF coordinates
- Used when OCR is run on hybrid page cell crops
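Steps 1, 2, 3, and 5 above can be sketched as a stand-alone function. This is a minimal illustration under stated assumptions, not the code in `ocr.rs`: the name `hocr_px_to_pdf_sketch`, the constant `BORDER_PADDING_PX`, the `[x0, y0, x1, y1]` input layout, and the way the cell origin is added after the Y-flip are all hypothetical. Rotation (step 4) is left out.

```rust
/// Assumed to mirror `HOCR_BORDER_PADDING` from the note (10 px).
const BORDER_PADDING_PX: f64 = 10.0;

/// Hypothetical sketch of steps 1, 2, 3, and 5 (rotation omitted).
/// `hocr_px` is `[x0, y0, x1, y1]` in padded-image pixels, top-left origin.
/// Returns `[x0, y0, x1, y1]` in PDF points, bottom-left origin.
fn hocr_px_to_pdf_sketch(
    hocr_px: [f64; 4],
    dpi: u32,
    page_height_pt: f64,
    cell_origin: Option<[f64; 2]>,
) -> [f64; 4] {
    let scale = 72.0 / f64::from(dpi);
    // Steps 1 and 2: remove the border in pixel space (clamping at 0),
    // then scale pixels to points.
    let to_pt = |px: f64| (px - BORDER_PADDING_PX).max(0.0) * scale;
    let [left, top, right, bottom] = hocr_px.map(to_pt);
    // Step 3: flip Y. The HOCR top edge becomes the larger PDF y,
    // so the two y values swap roles.
    let (y0, y1) = (page_height_pt - bottom, page_height_pt - top);
    // Step 5: shift by the cell's PDF origin, if any (assumed additive).
    let [ox, oy] = cell_origin.unwrap_or([0.0, 0.0]);
    [left + ox, y0 + oy, right + ox, y1 + oy]
}
```

For the HOCR bbox (10, 10, 100, 30) at 300 DPI on a 792 pt page with no cell origin, this sketch yields approximately `[0.0, 787.2, 21.6, 792.0]`: the padded pixels become (0, 0, 90, 20), the scale factor is 0.24 pt/px, and the word's top edge ends up at the top of the page.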
## Tests Added
The following tests were added to the `hocr_tests` module:
1. **`test_to_pdf_bbox_basic_conversion`**: Critical test from plan line 1908
- HOCR bbox at (10,10,100,30) at 300 DPI on letter-size page
- Verifies correct padding subtraction and Y-flip
2. **`test_to_pdf_bbox_y_flip_sanity`**: Y-flip verification
- Top-of-page word has highest PDF Y value
- Bottom-of-page word has lowest PDF Y value
3. **`test_to_pdf_bbox_padding_subtraction`**: Padding edge case
- Bbox exactly at padding boundary
- Verifies subtraction happens in pixel space (before DPI scale)
4. **`test_to_pdf_bbox_different_dpi`**: DPI scaling verification
- Tests 200, 300, 400 DPI
- Verifies correct scale factor (72.0 / dpi)
5. **`test_to_pdf_bbox_hybrid_cell_offset`**: Hybrid cell handling
- Cell (3, 2) offset applied correctly
- Cell-local coords → global PDF coords
6. **`test_to_pdf_bbox_clamps_negative_coords`**: Edge case handling
- Bbox entirely within padding (negative after subtraction)
- Clamped to origin (no negative coordinates)
7. **Rotation tests**: 0°, 90°, 180°, 270°, and invalid angle
8. **`test_apply_rotation_to_bbox_preserves_dimensions`**: Rotation preserves bbox area
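To see why the padding test insists on subtracting in pixel space, it may help to compare the two orderings side by side. The helpers below are hypothetical and exist only for this illustration; `PADDING_PX` is assumed to equal `HOCR_BORDER_PADDING`.

```rust
/// Assumed to equal `HOCR_BORDER_PADDING` (10 px).
const PADDING_PX: f64 = 10.0;

/// Order described in the note: subtract the padding in pixels, then scale.
fn pad_then_scale(px: f64, dpi: u32) -> f64 {
    (px - PADDING_PX) * 72.0 / f64::from(dpi)
}

/// Deliberately wrong order, for contrast: scales first, then subtracts
/// 10 as if the padding were measured in points.
fn scale_then_pad_wrong(px: f64, dpi: u32) -> f64 {
    px * 72.0 / f64::from(dpi) - PADDING_PX
}
```

A coordinate exactly on the padding boundary (10 px) maps to 0 pt with the first function at any DPI, while the second gives 2.4 - 10 = -7.6 pt at 300 DPI. For a coordinate at 110 px, the correct ordering gives 36 pt, 24 pt, and 18 pt at 200, 300, and 400 DPI respectively.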
## Acceptance Criteria Status
| Criterion | Status | Notes |
|-----------|--------|-------|
| Critical test (line 1908): HOCR bbox conversion | PASS | `test_to_pdf_bbox_basic_conversion` |
| Y-flip sanity: top-of-page has highest PDF Y | PASS | `test_to_pdf_bbox_y_flip_sanity` |
| Hybrid cell test: cell offset applied | PASS | `test_to_pdf_bbox_hybrid_cell_offset` |
| 100-page OCR output: valid bboxes | Deferred | Requires actual OCR infrastructure |
## Notes
- Tests are behind the `ocr` feature flag and require leptonica/tesseract to run
- The coordinate conversion code itself is pure Rust with no external dependencies
- Implementation follows the exact specification from plan lines 1899-1927
- All coordinate transformations use `f64`, which is more than sufficient for the 0.1 pt resolution the plan specifies
## Integration Points
`HocrWord::to_pdf_bbox()` will be called during Phase 5.4 (Tesseract Integration) to convert HOCR output to PDF spans:
```rust
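// Usage example (not yet integrated):
let word = HocrWord { /* ... */ };
// `None` for cell_origin: no hybrid cell offset is applied to this word.
let pdf_bbox = word.to_pdf_bbox(dpi, page_height_pt, Some(rotation), None);
// `confidence()` is read before `word.text` is moved into the span.
let span = Span::ocr(pdf_bbox, word.confidence(), word.text);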
```
## Future Work
- Integrate with actual Tesseract OCR pipeline (Phase 5.4 full implementation)
- Add Span emission with confidence_source = "ocr"
- Add language field from opts.ocr_language