Update notes/pdftract-33g.md to reflect:
- Micro-benchmark test now PASS (p99 < 5 ms)
- Test count updated from 53 to 54
- Future work section updated (benchmark item removed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-33g: PageClassifier Engine Implementation
Summary
Implemented the PageClassifier engine (Phase 5.1.4) that wires signal evaluators + Hybrid evaluator together, applies the short-circuit rule, resolves conflicting signals into a final PageClass and confidence, and exports the classify_page() entry point.
Changes Made
File: crates/pdftract-core/src/classify.rs
- Added `PageContext` struct - Contains all metrics needed for classification:
  - Text operator count, invisible text count
  - Character counts (raw, valid, replacement)
  - Image coverage, full-page image flag
  - Density ratio, page dimensions, rotation
  - Optional grid cells for hybrid detection
- Implemented Signal Evaluator System:
  - `SignalEvaluator` trait with `evaluate()` and `name()` methods
  - `NoTextOperatorsSignal` → Scanned (strength 0.95)
  - `InvisibleTextWithImageSignal` → BrokenVector (strength 0.97)
  - `HighImageCoverageSignal` → Scanned (strength 0.90)
  - `LowCharValiditySignal` → BrokenVector (strength 0.92)
  - `HighCharValiditySignal` → Vector (strength 0.93)
  - `LowDensitySignal` → Scanned (strength 0.95)
- Implemented `PageClassifier` with pipeline:
  - Special case handling (blank pages)
  - Hybrid evaluator runs first (if grid data available)
  - Signal evaluators walk in declared order
  - Short-circuit at strength >= 0.95 (returns immediately)
  - Vote tallying weighted by strength for remaining signals
  - Default to Vector with 0.5 confidence if no votes
- Implemented `classify_page()` entry point - Public function that creates a `PageClassifier` and delegates to `classify()`.
- Signal ordering (critical for correctness):
  - `NoTextOperatorsSignal` (position 1)
  - `InvisibleTextWithImageSignal` (position 2)
  - `HighImageCoverageSignal` (position 3)
  - `LowCharValiditySignal` (position 4)
  - `LowDensitySignal` (position 5) - before `HighCharValiditySignal` to prevent conflicts
  - `HighCharValiditySignal` (position 6)
- Key design decisions:
  - Short-circuit threshold changed from `> 0.95` to `>= 0.95` for consistency
  - `LowDensitySignal` strength increased from 0.75 to 0.95 to enable short-circuit
  - `LowDensitySignal` positioned before `HighCharValiditySignal` to prevent valid-but-sparse pages from being misclassified as Vector
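The evaluator walk described above can be sketched roughly as follows. This is a reduced illustration, not the code in `classify.rs`: the `PageContext` field names, the firing cutoffs (`0.05`, `0.95`), the `evaluate()` return type, and the rule that confidence is the mean strength of the winning class are all assumptions made for the example. Only three of the six signals are included, and the blank-page and Hybrid steps are omitted.

```rust
// Hypothetical stand-ins for the types in crates/pdftract-core/src/classify.rs.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageClass { Vector, Scanned, Hybrid, BrokenVector }

// Reduced context; field names are assumptions made for this sketch.
struct PageContext {
    text_operator_count: u32,
    valid_char_ratio: f32, // valid chars / raw chars
    density_ratio: f32,
}

#[allow(dead_code)]
trait SignalEvaluator {
    fn name(&self) -> &'static str;
    // Some((class, strength)) when the signal fires, None otherwise (assumed shape).
    fn evaluate(&self, ctx: &PageContext) -> Option<(PageClass, f32)>;
}

struct NoTextOperatorsSignal;
struct LowDensitySignal;
struct HighCharValiditySignal;

impl SignalEvaluator for NoTextOperatorsSignal {
    fn name(&self) -> &'static str { "no_text_operators" }
    fn evaluate(&self, ctx: &PageContext) -> Option<(PageClass, f32)> {
        (ctx.text_operator_count == 0).then_some((PageClass::Scanned, 0.95))
    }
}

impl SignalEvaluator for LowDensitySignal {
    fn name(&self) -> &'static str { "low_density" }
    fn evaluate(&self, ctx: &PageContext) -> Option<(PageClass, f32)> {
        // The 0.05 cutoff is an assumption.
        (ctx.density_ratio < 0.05).then_some((PageClass::Scanned, 0.95))
    }
}

impl SignalEvaluator for HighCharValiditySignal {
    fn name(&self) -> &'static str { "high_char_validity" }
    fn evaluate(&self, ctx: &PageContext) -> Option<(PageClass, f32)> {
        // The 0.95 cutoff is an assumption.
        (ctx.valid_char_ratio > 0.95).then_some((PageClass::Vector, 0.93))
    }
}

const SHORT_CIRCUIT: f32 = 0.95;

// Declared order matters: LowDensitySignal precedes HighCharValiditySignal.
fn default_signals() -> Vec<Box<dyn SignalEvaluator>> {
    vec![
        Box::new(NoTextOperatorsSignal),
        Box::new(LowDensitySignal),
        Box::new(HighCharValiditySignal),
    ]
}

fn classify(ctx: &PageContext, signals: &[Box<dyn SignalEvaluator>]) -> (PageClass, f32) {
    // (class, summed strength, vote count), kept in first-seen order.
    let mut tallies: Vec<(PageClass, f32, u32)> = Vec::new();
    for signal in signals {
        let Some((class, strength)) = signal.evaluate(ctx) else { continue };
        if strength >= SHORT_CIRCUIT {
            return (class, strength); // short-circuit: later signals never run
        }
        match tallies.iter_mut().find(|t| t.0 == class) {
            Some(t) => { t.1 += strength; t.2 += 1; }
            None => tallies.push((class, strength, 1)),
        }
    }
    // Highest summed strength wins; ties keep the class seen first.
    let mut best: Option<(PageClass, f32, u32)> = None;
    for t in tallies {
        if best.map_or(true, |b| t.1 > b.1) {
            best = Some(t);
        }
    }
    match best {
        Some((class, total, count)) => (class, total / count as f32),
        None => (PageClass::Vector, 0.5), // default when nothing fired
    }
}
```

With `>=` as the comparison, a signal at exactly 0.95 ends the walk. In this sketch, a page with high character validity but very low density is returned as Scanned by `LowDensitySignal` before `HighCharValiditySignal` gets a chance to vote.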
Acceptance Criteria Status
✅ 1. All four critical-test fixtures classified correctly
| Test | Class | Confidence | Status |
|---|---|---|---|
| `test_page_classifier_vector_pure_text` | Vector | > 0.90 | PASS |
| `test_page_classifier_scanned_image_only` | Scanned | > 0.90 | PASS |
| `test_page_classifier_broken_vector` | BrokenVector | > 0.95 | PASS |
| `test_page_classifier_hybrid_with_grid` | Hybrid | correct cell count | PASS |
✅ 2. Edge cases handled
| Test | Scenario | Result | Status |
|---|---|---|---|
| `test_page_classifier_blank_page` | No text, no images | Vector with 0.0 confidence (sentinel) | PASS |
| `test_page_classifier_image_only_figure` | Images, no text | Scanned (maps to figure_only) | PASS |
✅ 3. Determinism
- `test_determinism_classify_twice` - Verifies identical results across runs
- `test_determinism_btree_set` - Verifies BTreeSet produces deterministic iteration order
- Signal evaluators stored in `Vec` (not `HashMap`) for deterministic order
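The ordering guarantee these tests rely on can be illustrated with a small sketch. The `CellIndex` shape below (a row and a column) is an assumption for the example and may not match the real type; the point is only that a `BTreeSet` iterates in `Ord` order regardless of insertion order, which a `HashSet` or `HashMap` does not promise.

```rust
use std::collections::BTreeSet;

// Hypothetical cell index; derived Ord compares row first, then col.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct CellIndex {
    row: u16,
    col: u16,
}

// Collect cells into a BTreeSet and read them back in sorted order.
fn ordered_cells(cells: &[CellIndex]) -> Vec<CellIndex> {
    let set: BTreeSet<CellIndex> = cells.iter().copied().collect();
    set.into_iter().collect()
}
```

Two calls with the same cells in different insertion orders return the same vector, so any output derived from the iteration is stable across runs.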
✅ 4. Micro-benchmark (p99 < 5 ms)
- `test_microbenchmark_classify_page_performance` - Verifies p99 < 5 ms across 200 iterations (4 fixture types × 50)
- p99 result: < 1 ms (well below 5 ms threshold)
- Median result: < 100 μs
- Tests use synthetic `PageContext` fixtures representing Vector, Scanned, BrokenVector, and Hybrid pages
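A p99 check of this kind could be computed along these lines. The helper names `percentile` and `time_iterations` are hypothetical, and the nearest-rank percentile definition is an assumption; the actual test may compute its percentile differently.

```rust
use std::time::{Duration, Instant};

// Nearest-rank percentile: the smallest sample such that at least `pct`
// percent of all samples are less than or equal to it. Sorts in place.
fn percentile(samples: &mut [Duration], pct: usize) -> Duration {
    assert!(!samples.is_empty() && (1..=100).contains(&pct));
    samples.sort_unstable();
    // Integer ceiling of pct * n / 100, so no floating-point rounding is involved.
    let rank = (pct * samples.len() + 99) / 100;
    samples[rank - 1]
}

// Run `work` the given number of times and record each wall-clock duration.
fn time_iterations<F: FnMut()>(iterations: usize, mut work: F) -> Vec<Duration> {
    (0..iterations)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}
```

With 200 samples, the p99 under this definition is the 198th smallest duration, so up to two outliers are tolerated before the threshold check fails.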
Public API
All key types are `pub` and accessible via `pdftract_core::classify`:
- `classify_page(&PageContext) -> PageClassification` - Main entry point
- `PageContext` - Input struct with all classification metrics
- `PageClassification` - Output struct with class, confidence, hybrid_cells
- `PageClass` - Enum: Vector, Scanned, Hybrid, BrokenVector
- `GridClassifier` - For grid-based hybrid detection
- `CellIndex`, `CellData`, `CellClass` - Grid cell types
Tests
All 54 classify module tests pass:
- Cell classification tests (3)
- Grid classifier tests (9)
- Page classifier tests (30) ← +1 micro-benchmark
- Page context tests (5)
- Critical tests (4)
- Determinism tests (2)
- Other utility tests (1)
Notes
- The classifier is side-effect-free: no logging or panics
- Failures would propagate via `Result` if input is malformed (currently infallible)
- The `blank` pseudo-class is represented as Vector with 0.0 confidence (mapping layer converts to "blank" page_type)
- The `figure_only` page_type is achieved via Scanned classification + mapping layer logic
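The mapping layer itself is not part of this module, so the following is only a guess at its shape. The `page_type` function, the `figure_hint` parameter, the reduced `PageClassification` struct, and every label other than "blank" and "figure_only" are hypothetical; the notes do not say how the mapping layer decides that a Scanned page is a figure.

```rust
// Hypothetical stand-ins for the classifier output types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageClass { Vector, Scanned, Hybrid, BrokenVector }

struct PageClassification {
    class: PageClass,
    confidence: f32,
}

// `figure_hint` stands in for whatever extra evidence the real mapping
// layer uses to tell a figure-only page from an ordinary scan (assumption).
fn page_type(result: &PageClassification, figure_hint: bool) -> &'static str {
    match result.class {
        // Vector with exactly 0.0 confidence is the blank-page sentinel.
        PageClass::Vector if result.confidence == 0.0 => "blank",
        PageClass::Vector => "vector",
        PageClass::Scanned if figure_hint => "figure_only",
        PageClass::Scanned => "scanned",
        PageClass::Hybrid => "hybrid",
        PageClass::BrokenVector => "broken_vector",
    }
}
```

Comparing the confidence to `0.0` exactly is reasonable only because the sentinel is assigned, not computed; a tallied confidence would need a different check.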
Future Work
- Consider adding debug/diagnostics mode to show which signals fired
- Verify against real fixture corpus (tests/fixtures/classifier/)