pdftract/notes/pdftract-2hag2.md

Verification Note: pdftract-2hag2

Bead: Signal evaluator: all_tr3_with_full_page_image (Tr=3 + image >= 95% page -> BrokenVector definitive)

Summary

The all_tr3_with_full_page_image signal evaluator was implemented in commit 39ca6a3 as part of bead pdftract-2b7ff (image_coverage_fraction). The implementation is correct and all acceptance criteria are met.

Implementation Location

  • File: crates/pdftract-core/src/classify.rs
  • Function: pub fn all_tr3_with_full_page_image(ctx: &PageContext) -> Option<Vote> (lines 339-357)
  • Tests: Lines 2459-2695

Acceptance Criteria Verification

1. text_op_count=10, tr3_op_count=10, full_page_image=true → Some(Vote { 0.99, BrokenVector })

  • Test: test_all_tr3_with_full_page_image_exact_match (line 2462)
  • Status: PASS

2. text_op_count=10, tr3_op_count=5 → None (mix of Tr=3 and visible)

  • Test: test_all_tr3_with_full_page_image_mixed_tr3 (line 2516)
  • Status: PASS

3. text_op_count=0 → None (no text)

  • Test: test_all_tr3_with_full_page_image_no_text (line 2532)
  • Status: PASS

4. full_page_image=false → None

  • Test: test_all_tr3_with_full_page_image_no_full_page_image (line 2548)
  • Status: PASS

5. Unit tests

  • Total: 10 tests in the range above, including the four acceptance-criteria tests
  • Additional tests:
    • Exactly 95% coverage (threshold edge case)
    • Just below 95% threshold
    • Multiple images (one large enough)
    • Zero page area (division by zero guard)
    • Empty image areas
    • Definitive short-circuit verification
    • Standard US Letter and A4 page sizes
  • Status: PASS
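
The coverage edge cases listed above can be reproduced in isolation. The sketch below is not taken from classify.rs: has_full_page_image is a hypothetical helper that extracts only the coverage predicate so that the threshold behaviour can be checked on its own. The 100 x 100 page is chosen so that 9500 / 10000 compares equal to the 0.95 literal exactly in f64.

```rust
/// Hypothetical helper (not part of pdftract-core): true when at least one
/// image area covers >= 95% of the page, false for a non-positive page area.
fn has_full_page_image(width: f64, height: f64, image_areas: &[f64]) -> bool {
    let page_area = width * height;
    if page_area > 0.0 {
        image_areas.iter().any(|&area| area / page_area >= 0.95)
    } else {
        false
    }
}

/// Walks through the edge cases named in the note; panics if any fails.
fn check_coverage_edge_cases() {
    // Exactly 95% coverage meets the threshold.
    assert!(has_full_page_image(100.0, 100.0, &[9_500.0]));
    // Just below 95% does not.
    assert!(!has_full_page_image(100.0, 100.0, &[9_499.0]));
    // Multiple images: one that is large enough suffices.
    assert!(has_full_page_image(100.0, 100.0, &[100.0, 9_800.0]));
    // Areas are checked per image, never summed.
    assert!(!has_full_page_image(100.0, 100.0, &[5_000.0, 5_000.0]));
    // Zero page area: the guard returns false instead of dividing by zero.
    assert!(!has_full_page_image(0.0, 792.0, &[1_000.0]));
    // No images at all.
    assert!(!has_full_page_image(100.0, 100.0, &[]));
    // US Letter (612 x 792 pt) and A4 (595 x 842 pt) with a full-page image.
    assert!(has_full_page_image(612.0, 792.0, &[612.0 * 792.0]));
    assert!(has_full_page_image(595.0, 842.0, &[595.0 * 842.0]));
}
```

Calling check_coverage_edge_cases() returns without panicking when every case behaves as described.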

Implementation Review

Signature

pub fn all_tr3_with_full_page_image(ctx: &PageContext) -> Option<Vote>

Status: ✓ Correct

Logic

let all_tr3 = ctx.text_op_count > 0 && ctx.tr3_op_count == ctx.text_op_count;
let page_area = ctx.width * ctx.height;
let full_page_image = if page_area > 0.0 {
    ctx.image_xobject_areas.iter().any(|&area| area / page_area >= 0.95)
} else {
    false
};
if all_tr3 && full_page_image {
    return Some(Vote::broken_vector(0.99));
}
None

Status: ✓ Correct

Key Features

  • Every text operator must use Tr=3, and there must be at least one - enforced by text_op_count > 0 && tr3_op_count == text_op_count
  • At least one image XObject must cover >= 95% of the page area on its own - iter().any() tests each image separately, so areas are not summed across images
  • Definitive strength 0.99 for short-circuit behavior in PageClassifier
  • Division-by-zero guard: full_page_image is false unless page_area > 0.0
  • Returns None for all non-matching cases
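
To try the logic outside the crate, the excerpt can be wrapped in a minimal sketch such as the one below. The function body is the one quoted under Logic. Everything else is an assumption made for illustration: the PageKind enum, the Vote field names, the PageContext layout (only the fields the evaluator reads), and the sample_page helper. The real definitions in pdftract-core may differ.

```rust
// Hypothetical stand-ins for the pdftract-core types; only what the
// evaluator touches is modelled here.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PageKind {
    BrokenVector,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vote {
    pub strength: f64,
    pub kind: PageKind,
}

impl Vote {
    pub fn broken_vector(strength: f64) -> Self {
        Vote { strength, kind: PageKind::BrokenVector }
    }
}

#[derive(Debug, Clone, Default)]
pub struct PageContext {
    pub width: f64,
    pub height: f64,
    pub text_op_count: usize,
    pub tr3_op_count: usize,
    pub image_xobject_areas: Vec<f64>,
}

// Body as quoted in the note.
pub fn all_tr3_with_full_page_image(ctx: &PageContext) -> Option<Vote> {
    let all_tr3 = ctx.text_op_count > 0 && ctx.tr3_op_count == ctx.text_op_count;
    let page_area = ctx.width * ctx.height;
    let full_page_image = if page_area > 0.0 {
        ctx.image_xobject_areas.iter().any(|&area| area / page_area >= 0.95)
    } else {
        false
    };
    if all_tr3 && full_page_image {
        return Some(Vote::broken_vector(0.99));
    }
    None
}

/// Hypothetical helper: a 100 x 100 page carrying a single image.
pub fn sample_page(text_ops: usize, tr3_ops: usize, image_area: f64) -> PageContext {
    PageContext {
        width: 100.0,
        height: 100.0,
        text_op_count: text_ops,
        tr3_op_count: tr3_ops,
        image_xobject_areas: vec![image_area],
    }
}
```

With these stand-ins, sample_page(10, 10, 10_000.0) produces the 0.99 BrokenVector vote, while sample_page(10, 5, 10_000.0), sample_page(0, 0, 10_000.0), and sample_page(10, 10, 5_000.0) all produce None, mirroring acceptance criteria 1 to 4.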

Integration

The signal is integrated into the PageClassifier via InvisibleTextWithImageSignal (lines 195-207):

struct InvisibleTextWithImageSignal;

impl SignalEvaluator for InvisibleTextWithImageSignal {
    fn evaluate(&self, ctx: &PageContext) -> Option<Vote> {
        all_tr3_with_full_page_image(ctx)
    }

    fn name(&self) -> &'static str {
        "all_tr3_with_full_page_image"
    }
}
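
How a definitive vote might short-circuit the classifier can be sketched as follows. This is an illustration only: the note does not show PageClassifier, so the classify function, the DEFINITIVE threshold, the reduced PageContext, the Vote fields, and FallbackSignal are all hypothetical. Only the SignalEvaluator method signatures and the InvisibleTextWithImageSignal name come from the excerpt above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Vote {
    strength: f64,
    label: &'static str,
}

// Reduced, hypothetical context: the two conditions are precomputed.
struct PageContext {
    all_tr3: bool,
    full_page_image: bool,
}

trait SignalEvaluator {
    fn evaluate(&self, ctx: &PageContext) -> Option<Vote>;
    fn name(&self) -> &'static str;
}

// Assumed cut-off at which a vote ends evaluation early.
const DEFINITIVE: f64 = 0.99;

struct InvisibleTextWithImageSignal;

impl SignalEvaluator for InvisibleTextWithImageSignal {
    fn evaluate(&self, ctx: &PageContext) -> Option<Vote> {
        if ctx.all_tr3 && ctx.full_page_image {
            Some(Vote { strength: 0.99, label: "BrokenVector" })
        } else {
            None
        }
    }

    fn name(&self) -> &'static str {
        "all_tr3_with_full_page_image"
    }
}

// Hypothetical weaker signal that always casts a 0.5 vote.
struct FallbackSignal;

impl SignalEvaluator for FallbackSignal {
    fn evaluate(&self, _ctx: &PageContext) -> Option<Vote> {
        Some(Vote { strength: 0.5, label: "Vector" })
    }

    fn name(&self) -> &'static str {
        "fallback"
    }
}

/// Runs the signals in order and returns the winning vote together with the
/// names of the evaluators that were consulted. A vote at or above
/// DEFINITIVE stops the loop; otherwise the strongest vote wins.
fn classify(
    signals: &[Box<dyn SignalEvaluator>],
    ctx: &PageContext,
) -> (Option<Vote>, Vec<&'static str>) {
    let mut consulted = Vec::new();
    let mut best: Option<Vote> = None;
    for signal in signals {
        consulted.push(signal.name());
        if let Some(vote) = signal.evaluate(ctx) {
            if vote.strength >= DEFINITIVE {
                return (Some(vote), consulted);
            }
            if best.map_or(true, |b| vote.strength > b.strength) {
                best = Some(vote);
            }
        }
    }
    (best, consulted)
}
```

In this sketch, a page that satisfies both conditions is decided by the first evaluator and FallbackSignal is never consulted. On any other page the loop runs to the end and the strongest remaining vote is returned.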

References

  • EC-12: Tr=3 + full-page-image is a deterministic BrokenVector signal
  • Plan section: Phase 5.1.2
  • Commit: 39ca6a3 (feat(pdftract-2b7ff): implement image_coverage_fraction signal evaluator)

Conclusion

PASS: All acceptance criteria are met. The implementation is correct, complete, and fully tested.