jedarden
|
e3a149fbf8
|
feat(pdftract-sg6): implement DPI selection logic for OCR rendering

Implement Phase 5.2.3 DPI selection that picks per-page DPI based on
image filter signals (JBIG2 detection) and font size signals from Phase 4.
- Add select_dpi() function implementing the DPI selection table:
  * JBIG2Decode filter present -> 200 DPI (already binary)
  * Median font_size < 7.0 pt -> 400 DPI (fine print)
  * Median font_size >= 7.0 pt -> 300 DPI (standard)
  * Default -> 300 DPI for scanned pages
- Add Pdf1Filter enum for PDF 1.x filter name parsing
- Add FontSizeSpan struct for Phase 4 font size data
- Add ocr_dpi_override option to ExtractionOptions
- Export ExtractionQuality from schema module for DPI tracking
- Add 19 unit tests covering select_dpi() (all passing)
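
The selection table above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names select_dpi, Pdf1Filter, FontSizeSpan and ocr_dpi_override come from the message, but the enum variants, the struct field, the function signature, and the rule that an override wins over every other signal are all assumptions.

```rust
/// Hypothetical subset of PDF 1.x stream filters (the variant list is assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pdf1Filter {
    Jbig2Decode,
    DctDecode,
    FlateDecode,
}

/// Hypothetical shape of the Phase 4 data: one text span and its size in points.
#[derive(Debug, Clone, Copy)]
pub struct FontSizeSpan {
    pub font_size: f32,
}

/// Median font size over all spans, or None when the page has no text spans.
fn median_font_size(spans: &[FontSizeSpan]) -> Option<f32> {
    if spans.is_empty() {
        return None;
    }
    let mut sizes: Vec<f32> = spans.iter().map(|s| s.font_size).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    let mid = sizes.len() / 2;
    Some(if sizes.len() % 2 == 0 {
        (sizes[mid - 1] + sizes[mid]) / 2.0
    } else {
        sizes[mid]
    })
}

/// Pick a per-page render DPI following the table in the commit message.
/// Assumed precedence: explicit override, then JBIG2, then median font size.
pub fn select_dpi(
    filters: &[Pdf1Filter],
    spans: &[FontSizeSpan],
    ocr_dpi_override: Option<u32>,
) -> u32 {
    if let Some(dpi) = ocr_dpi_override {
        return dpi; // assumption: the override bypasses all other signals
    }
    if filters.contains(&Pdf1Filter::Jbig2Decode) {
        return 200; // already binary
    }
    match median_font_size(spans) {
        Some(size) if size < 7.0 => 400, // fine print
        Some(_) => 300,                  // standard text
        None => 300,                     // default, e.g. scanned pages
    }
}
```

In this sketch a page with no font size spans falls through to the 300 DPI default, which is how the "scanned pages" row is read here.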

Acceptance criteria:
- Unit tests: each branch tested with synthetic inputs
- Integration: legal-document -> 400 DPI, textbook -> 300 DPI, JBIG2 -> 200 DPI
- DPI override option works correctly
- extraction_quality.dpi_used schema field ready

Co-Authored-By: Claude Code <claude-code@anthropic.com>
|
2026-05-23 17:37:40 -04:00 |
|