pdftract/notes/pdftract-33g.md
jedarden 6ff825a23f docs(pdftract-33g): update verification note with micro-benchmark PASS
Update notes/pdftract-33g.md to reflect:
- Micro-benchmark test now PASS (p99 < 5 ms)
- Test count updated from 53 to 54
- Future work section updated (benchmark item removed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 14:16:19 -04:00


pdftract-33g: PageClassifier Engine Implementation

Summary

Implemented the PageClassifier engine (Phase 5.1.4). It wires the signal evaluators and the Hybrid evaluator together, applies the short-circuit rule, resolves conflicting signals into a final PageClass and confidence, and exports the classify_page() entry point.

Changes Made

File: crates/pdftract-core/src/classify.rs

  1. Added PageContext struct - Contains all metrics needed for classification:

    • Text operators count, invisible text count
    • Character counts (raw, valid, replacement)
    • Image coverage, full-page image flag
    • Density ratio, page dimensions, rotation
    • Optional grid cells for hybrid detection
  2. Implemented Signal Evaluator System:

    • SignalEvaluator trait with evaluate() and name() methods
    • NoTextOperatorsSignal → Scanned (strength 0.95)
    • InvisibleTextWithImageSignal → BrokenVector (strength 0.97)
    • HighImageCoverageSignal → Scanned (strength 0.90)
    • LowCharValiditySignal → BrokenVector (strength 0.92)
    • HighCharValiditySignal → Vector (strength 0.93)
    • LowDensitySignal → Scanned (strength 0.95)
  3. Implemented PageClassifier with pipeline:

    • Special case handling (blank pages)
    • Hybrid evaluator runs first (if grid data available)
    • Signal evaluators walk in declared order
    • Short-circuit at strength >= 0.95 (returns immediately)
    • Vote tallying weighted by strength for remaining signals
    • Default to Vector with 0.5 confidence if no votes
  4. Implemented classify_page() entry point - Public function that creates a PageClassifier and delegates to classify().

  5. Signal ordering (critical for correctness):

    • NoTextOperatorsSignal (position 1)
    • InvisibleTextWithImageSignal (position 2)
    • HighImageCoverageSignal (position 3)
    • LowCharValiditySignal (position 4)
    • LowDensitySignal (position 5) - before HighCharValiditySignal to prevent conflicts
    • HighCharValiditySignal (position 6)
  6. Key design decisions:

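    • Short-circuit threshold changed from > 0.95 to >= 0.95 for consistency: signals with a strength of exactly 0.95 (NoTextOperatorsSignal, LowDensitySignal) now short-circuit as well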
    • LowDensitySignal strength increased from 0.75 to 0.95 to enable short-circuit
    • LowDensitySignal positioned before HighCharValiditySignal to prevent valid-but-sparse pages from being misclassified as Vector
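
The pipeline described above (ordered evaluators, short-circuit at >= 0.95, strength-weighted voting, Vector/0.5 fallback) can be sketched roughly as follows. This is a minimal illustration and not the code in classify.rs: `MiniContext`, `PredicateSignal`, `sig`, `default_signals`, every field name, and every firing threshold are assumptions made for the example, and so is the confidence formula (winning weight divided by total weight), which the notes do not specify. Blank-page handling and the Hybrid evaluator are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PageClass { Vector, Scanned, Hybrid, BrokenVector }

/// Hypothetical, trimmed-down stand-in for PageContext; field names are assumptions.
pub struct MiniContext {
    pub text_operators: u32,
    pub invisible_text: u32,
    pub full_page_image: bool,
    pub image_coverage: f64, // fraction of the page covered by images
    pub char_validity: f64,  // valid characters / raw characters
    pub density_ratio: f64,  // text density relative to page area
}

pub struct Vote { pub class: PageClass, pub strength: f64 }

pub trait SignalEvaluator {
    fn name(&self) -> &'static str;
    fn evaluate(&self, ctx: &MiniContext) -> Option<Vote>;
}

/// A signal backed by a plain predicate (hypothetical helper type).
pub struct PredicateSignal {
    name: &'static str,
    class: PageClass,
    strength: f64,
    fires: fn(&MiniContext) -> bool,
}

impl SignalEvaluator for PredicateSignal {
    fn name(&self) -> &'static str { self.name }
    fn evaluate(&self, ctx: &MiniContext) -> Option<Vote> {
        (self.fires)(ctx).then(|| Vote { class: self.class, strength: self.strength })
    }
}

fn sig(name: &'static str, class: PageClass, strength: f64,
       fires: fn(&MiniContext) -> bool) -> Box<dyn SignalEvaluator> {
    Box::new(PredicateSignal { name, class, strength, fires })
}

/// Signals in the declared order from the notes. A Vec keeps that order stable.
/// Strengths come from the notes; the firing thresholds are assumed.
pub fn default_signals() -> Vec<Box<dyn SignalEvaluator>> {
    use PageClass::*;
    vec![
        sig("no_text_operators", Scanned, 0.95, |c| c.text_operators == 0),
        sig("invisible_text_with_image", BrokenVector, 0.97,
            |c| c.invisible_text > 0 && c.full_page_image),
        sig("high_image_coverage", Scanned, 0.90, |c| c.image_coverage > 0.8),
        sig("low_char_validity", BrokenVector, 0.92, |c| c.char_validity < 0.5),
        sig("low_density", Scanned, 0.95, |c| c.density_ratio < 0.05),
        sig("high_char_validity", Vector, 0.93, |c| c.char_validity > 0.9),
    ]
}

const SHORT_CIRCUIT: f64 = 0.95;

pub fn classify(ctx: &MiniContext, signals: &[Box<dyn SignalEvaluator>]) -> (PageClass, f64) {
    let mut tally: Vec<(PageClass, f64)> = Vec::new();
    let mut total = 0.0;
    for signal in signals {
        if let Some(vote) = signal.evaluate(ctx) {
            // A sufficiently strong signal decides the page immediately.
            if vote.strength >= SHORT_CIRCUIT {
                return (vote.class, vote.strength);
            }
            total += vote.strength;
            match tally.iter_mut().find(|(class, _)| *class == vote.class) {
                Some(entry) => entry.1 += vote.strength,
                None => tally.push((vote.class, vote.strength)),
            }
        }
    }
    // Highest summed strength wins; on a tie the earlier class is kept.
    let mut best: Option<(PageClass, f64)> = None;
    for (class, weight) in tally {
        if best.map_or(true, |(_, best_weight)| weight > best_weight) {
            best = Some((class, weight));
        }
    }
    match best {
        Some((class, weight)) => (class, weight / total), // assumed confidence formula
        None => (PageClass::Vector, 0.5),                 // no votes: documented default
    }
}
```

Under these assumed thresholds, a page with high character validity but very low density reaches the low-density signal first and short-circuits to Scanned, which is the ordering concern raised in the design decisions.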

Acceptance Criteria Status

1. All four critical-test fixtures classified correctly

| Test | Class | Confidence | Status |
|---|---|---|---|
| test_page_classifier_vector_pure_text | Vector | > 0.90 | PASS |
| test_page_classifier_scanned_image_only | Scanned | > 0.90 | PASS |
| test_page_classifier_broken_vector | BrokenVector | > 0.95 | PASS |
| test_page_classifier_hybrid_with_grid | Hybrid | correct cell count | PASS |

2. Edge cases handled

| Test | Scenario | Result | Status |
|---|---|---|---|
| test_page_classifier_blank_page | No text, no images | Vector with 0.0 confidence (sentinel) | PASS |
| test_page_classifier_image_only_figure | Images, no text | Scanned (maps to figure_only) | PASS |

3. Determinism

  • test_determinism_classify_twice - Verifies identical results across runs
  • test_determinism_btree_set - Verifies BTreeSet produces deterministic iteration order
  • Signal evaluators stored in Vec (not HashMap) for deterministic order
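
The BTreeSet point can be illustrated with a small sketch. It assumes, purely for the example, that a cell index is a (row, col) pair with a derived `Ord`; the real CellIndex layout is not shown in these notes, and `ordered_cells` is a hypothetical helper.

```rust
use std::collections::BTreeSet;

/// Hypothetical cell index; the derived Ord sorts by row, then by column.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CellIndex {
    pub row: u8,
    pub col: u8,
}

/// Collects cells into a BTreeSet and returns them in iteration order.
/// The result depends only on the set's contents, never on insertion order.
pub fn ordered_cells(cells: &[CellIndex]) -> Vec<CellIndex> {
    let set: BTreeSet<CellIndex> = cells.iter().copied().collect();
    set.into_iter().collect()
}
```

A HashSet or HashMap gives no such guarantee, because the standard library's default hasher is randomly seeded and iteration order can differ between runs.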

4. Micro-benchmark (p99 < 5 ms)

  • test_microbenchmark_classify_page_performance - Verifies p99 < 5 ms across 200 iterations (4 fixture types × 50)
  • p99 result: < 1 ms (well below 5 ms threshold)
  • Median result: < 100 μs
  • Tests use synthetic PageContext fixtures representing Vector, Scanned, BrokenVector, and Hybrid pages
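
A p99 check of this kind can be sketched as below. The helper names `time_iterations` and `percentile` are hypothetical, and the nearest-rank percentile definition is an assumption; the notes do not say how the test computes p99.

```rust
use std::time::{Duration, Instant};

/// Times `iterations` calls of `work` and returns one sample per call.
pub fn time_iterations<F: FnMut()>(iterations: usize, mut work: F) -> Vec<Duration> {
    (0..iterations)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Nearest-rank percentile: the sample at 1-based rank ceil(p / 100 * n).
/// Sorts the slice in place; `p` is expected to be in 1..=100.
pub fn percentile(samples: &mut [Duration], p: usize) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    let rank = (p * samples.len() + 99) / 100; // integer ceiling division
    samples[rank.clamp(1, samples.len()) - 1]
}
```

With 200 samples the p99 rank is 198, so the assertion tolerates at most two outliers above the threshold. A test would call something like `time_iterations(50, || { classify_page(&ctx); })` once per fixture, merge the samples, and compare `percentile(&mut samples, 99)` against `Duration::from_millis(5)`.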

Public API

All key types are pub and accessible via pdftract_core::classify:

  • classify_page(&PageContext) -> PageClassification - Main entry point
  • PageContext - Input struct with all classification metrics
  • PageClassification - Output struct with class, confidence, hybrid_cells
  • PageClass - Enum: Vector, Scanned, Hybrid, BrokenVector
  • GridClassifier - For grid-based hybrid detection
  • CellIndex, CellData, CellClass - Grid cell types

Tests

All 54 classify module tests pass:

  • Cell classification tests (3)
  • Grid classifier tests (9)
  • Page classifier tests (30, including the new micro-benchmark)
  • Page context tests (5)
  • Critical tests (4)
  • Determinism tests (2)
  • Other utility tests (1)

Notes

  • The classifier is side-effect-free: it does not log and does not panic
  • classify_page() is currently infallible; if malformed input ever has to be rejected, failures would propagate via Result
  • The blank pseudo-class is represented as Vector with 0.0 confidence (mapping layer converts to "blank" page_type)
  • The figure_only page_type is achieved via Scanned classification + mapping layer logic
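
The blank sentinel described above could be detected in a mapping layer along these lines. This is only a sketch: `page_type_override` is a hypothetical helper, the struct is a trimmed stand-in for PageClassification (hybrid_cells omitted), and the figure_only rule is not shown because these notes do not spell out its condition.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PageClass { Vector, Scanned, Hybrid, BrokenVector }

/// Trimmed stand-in for PageClassification; hybrid_cells is omitted.
pub struct PageClassification {
    pub class: PageClass,
    pub confidence: f64,
}

/// Hypothetical mapping-layer helper: returns Some("blank") for the sentinel
/// and None when the regular class-to-page_type mapping should apply.
pub fn page_type_override(c: &PageClassification) -> Option<&'static str> {
    // Exact float comparison is intentional here: the sentinel is the literal 0.0,
    // which is distinct from the 0.5 used by the "no votes" default.
    if c.class == PageClass::Vector && c.confidence == 0.0 {
        Some("blank")
    } else {
        None
    }
}
```

Keeping the sentinel at exactly 0.0 separates a blank page from the no-votes default (Vector with 0.5 confidence), so the two cases do not collide in the mapping layer.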

Future Work

  • Consider adding debug/diagnostics mode to show which signals fired
  • Verify against real fixture corpus (tests/fixtures/classifier/)