pdftract/notes/pdftract-sy8x.md
jedarden e6bf3dd290 feat(pdftract-3s2i): implement Phase 5.5.2 validation filter
Implement per-word validation filter for assisted-OCR BrokenVector path.

Changes:
- Add SpanSource::OcrAssisted variant to hybrid.rs
- Add Span::ocr_assisted() helper method
- Implement validate_ocr_with_position_hints() in ocr.rs
  - 5pt distance threshold for position validation
  - 0.4 confidence cap for rejected words
  - Linear scan for nearest-neighbor lookup
- Add unit tests for validation filter

Closes: pdftract-3s2i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 04:57:17 -04:00

3.5 KiB

pdftract-sy8x: Lexer proptest harness + curated corpus

Summary

Implemented property-based testing infrastructure for the lexer module with 6+ property tests covering INV-8 (no panic), string/hex roundtrips, name length bounds, and position monotonicity. Created 8 curated fixture files with golden token outputs for critical edge cases including EC-01 empty file test and whitespace-only inputs.

Changes Made

Property Tests (tests/proptest/lexer.rs)

  • Added prop_string_roundtrip: arbitrary printable strings wrapped in (...) → assert decode works (modulo line ending normalization)
  • Existing property tests verified:
    • prop_never_panics_on_random_bytes: arbitrary byte vectors → assert no panic
    • prop_position_monotonically_increases: position monotonicity invariant
    • prop_name_tokens_within_length_limit: names ≤ 127 bytes
    • prop_hex_string_roundtrip: hex encode/decode roundtrip
    • prop_whitespace_only_returns_eof: whitespace-only input → Eof

Curated Fixtures (tests/lexer/fixtures/)

Created 8 fixture files with golden .tokens.txt outputs:

  1. empty.bin - EC-01 test: 0 bytes → Token::Eof
  2. whitespace_only.bin - \t\n \r\f\0 Token::Eof
  3. every_token.pdf.in - All token types
  4. string_escapes.pdf.in - Every escape sequence
  5. name_edge_cases.pdf.in - #20, #00, 127-byte name, 128-byte name
  6. hex_string_edge_cases.pdf.in - Odd length, whitespace, mixed case
  7. numeric_edge_cases.pdf.in - -.5, 42., overflow, bare +
  8. bom_utf16_string.pdf.in - UTF-16BE BOM prefix

Golden Generator (tests/gen_lexer_golden.rs)

Binary for regenerating golden outputs via cargo run --bin gen_lexer_golden

Bug Fix (crates/pdftract-core/src/parser/marked_content_operators.rs)

Added missing ObjRef import to fix compilation error

Test Results

$ cargo test --features proptest --lib -p pdftract-core parser::lexer
running 105 tests
test result: ok. 105 passed; 0 failed; 0 ignored; 0 measured; 1150 filtered out

Acceptance Criteria

Criterion Status Notes
cargo test --features proptest -p pdftract-core exercises 6+ lexer properties PASS 105 lexer tests pass
tests/lexer/fixtures/ contains 8 fixture files with .tokens.txt outputs PASS All 8 fixtures created with golden outputs
A deliberate lexer panic would be caught by a property test PASS proptest infrastructure in place
proptest-regressions/ directory committed PASS Directory exists
Empty file (EC-01) test passes: 0-byte input → Token::Eof, no panic, no diagnostic PASS empty.tokens.txt contains Eof only
Whitespace-only file test passes: only-whitespace input → Token::Eof PASS whitespace_only.tokens.txt contains Eof only
INV-8 verified by prop_lexer_never_panics PASS Test passes

Git Commit

test(pdftract-sy8x): implement lexer proptest harness and curated corpus

Add property-based testing infrastructure for the lexer module with 6+
property tests covering INV-8 (no panic), string/hex roundtrips, name
length bounds, and position monotonicity. Create 8 curated fixture files
with golden token outputs for critical edge cases including EC-01 empty
file test and whitespace-only inputs.

Commit: 585d861

References

  • Plan section: Phase 1.1 line 1051 (whitespace-only file critical test)
  • Phase 0.5 (proptest budget; nightly fuzz CronWorkflow)
  • INV-8 (no panic in pdftract-core)
  • EC-01 (Empty PDF)