jedarden
|
2018d684ce
|
feat(pdftract-22p): implement signal evaluators for page classification
Implement five signal evaluators (six threshold rules, since char_validity_rate has two) that feed PageClassifier::classify:
- text_operator_presence: 0 text ops + has images -> Scanned 0.95
- all_tr3_with_full_page_image: all text Tr=3 + image coverage >= 95% -> BrokenVector 0.99 (EC-12)
- image_coverage_fraction > 0.85 -> Scanned 0.85
- char_validity_rate < 0.4 -> BrokenVector 0.80
- char_validity_rate > 0.85 -> Vector 0.90
- char_density_ratio < 0.03 chars/in^2 -> Scanned 0.65
All thresholds centralized in SignalsConfig struct.
PageContext includes all required fields for evaluation.
Short-circuit classification at strength >= 0.95.
Comprehensive unit tests for each evaluator.
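Illustrative sketch of the rules above, not the code in this commit. Only the names SignalsConfig, PageContext, the evaluator names, and the thresholds and strengths come from this message. The field names, the PageClass and Signal types, the evaluation order, the free function classify (standing in for PageClassifier::classify), and the "strongest signal wins" fallback are all assumptions made for illustration.

```rust
// Hypothetical types: names other than SignalsConfig/PageContext are assumed.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageClass { Scanned, BrokenVector, Vector }

#[derive(Debug, Clone, Copy, PartialEq)]
struct Signal { class: PageClass, strength: f64 }

// All thresholds live in one place, as the commit message describes.
struct SignalsConfig {
    full_page_image_min: f64,
    image_coverage_min: f64,
    validity_broken_max: f64,
    validity_vector_min: f64,
    density_scanned_max: f64,
    short_circuit_min: f64,
}

impl Default for SignalsConfig {
    fn default() -> Self {
        SignalsConfig {
            full_page_image_min: 0.95,
            image_coverage_min: 0.85,
            validity_broken_max: 0.4,
            validity_vector_min: 0.85,
            density_scanned_max: 0.03,
            short_circuit_min: 0.95,
        }
    }
}

// Assumed field set; the real PageContext may differ.
struct PageContext {
    text_op_count: usize,
    has_images: bool,
    all_text_tr3: bool,      // assumed false when the page has no text ops
    image_coverage: f64,     // fraction of page area covered by images
    char_validity_rate: f64, // fraction of decoded chars that look valid
    chars_per_sq_inch: f64,
}

fn sig(class: PageClass, strength: f64) -> Option<Signal> {
    Some(Signal { class, strength })
}

fn text_operator_presence(c: &PageContext, _: &SignalsConfig) -> Option<Signal> {
    if c.text_op_count == 0 && c.has_images { sig(PageClass::Scanned, 0.95) } else { None }
}

fn all_tr3_with_full_page_image(c: &PageContext, k: &SignalsConfig) -> Option<Signal> {
    if c.all_text_tr3 && c.image_coverage >= k.full_page_image_min {
        sig(PageClass::BrokenVector, 0.99)
    } else { None }
}

fn image_coverage_fraction(c: &PageContext, k: &SignalsConfig) -> Option<Signal> {
    if c.image_coverage > k.image_coverage_min { sig(PageClass::Scanned, 0.85) } else { None }
}

// One evaluator, two rules: low validity and high validity.
fn char_validity_rate(c: &PageContext, k: &SignalsConfig) -> Option<Signal> {
    if c.text_op_count == 0 { return None; } // assumption: no text, no validity signal
    if c.char_validity_rate < k.validity_broken_max { sig(PageClass::BrokenVector, 0.80) }
    else if c.char_validity_rate > k.validity_vector_min { sig(PageClass::Vector, 0.90) }
    else { None }
}

fn char_density_ratio(c: &PageContext, k: &SignalsConfig) -> Option<Signal> {
    if c.chars_per_sq_inch < k.density_scanned_max { sig(PageClass::Scanned, 0.65) } else { None }
}

type Evaluator = fn(&PageContext, &SignalsConfig) -> Option<Signal>;

const EVALUATORS: [Evaluator; 5] = [
    text_operator_presence,
    all_tr3_with_full_page_image,
    image_coverage_fraction,
    char_validity_rate,
    char_density_ratio,
];

// Runs evaluators in order, returns early on a signal at or above the
// short-circuit strength, otherwise keeps the strongest signal seen (assumed policy).
fn classify(ctx: &PageContext, cfg: &SignalsConfig) -> Option<Signal> {
    let mut best: Option<Signal> = None;
    for eval in EVALUATORS.iter() {
        if let Some(s) = eval(ctx, cfg) {
            if s.strength >= cfg.short_circuit_min { return Some(s); }
            if best.map_or(true, |b| s.strength > b.strength) { best = Some(s); }
        }
    }
    best
}
```

In this sketch a page with no text ops and at least one image returns Scanned 0.95 from the first evaluator without running the rest, and a page that trips no rule returns None.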
Closes: pdftract-22p
|
2026-05-31 23:56:17 -04:00 |
|