Commit graph

1 commit

Author SHA1 Message Date
jedarden
e3a149fbf8 feat(pdftract-sg6): implement DPI selection logic for OCR rendering
Implement Phase 5.2.3 DPI selection that picks per-page DPI based on
image filter signals (JBIG2 detection) and font size signals from Phase 4.

- Add select_dpi() function implementing the DPI selection table:
  * JBIG2Decode filter present -> 200 DPI (already binary)
  * Median font_size < 7.0 pt -> 400 DPI (fine print)
  * Median font_size >= 7.0 pt -> 300 DPI (standard)
  * Default -> 300 DPI for scanned pages
- Add Pdf1Filter enum for PDF 1.x filter name parsing
- Add FontSizeSpan struct for Phase 4 font size data
- Add ocr_dpi_override option to ExtractionOptions
- Export ExtractionQuality from schema module for DPI tracking
- Add comprehensive unit tests (19 tests, all passing)

Acceptance criteria:
- Unit tests: each branch tested with synthetic inputs
- Integration: legal-document -> 400 DPI, textbook -> 300 DPI, JBIG2 -> 200 DPI
- DPI override option works correctly
- extraction_quality.dpi_used schema field ready

Co-Authored-By: Claude Code <claude-code@anthropic.com>
2026-05-23 17:37:40 -04:00