Implement per-word validation filter for assisted-OCR BrokenVector path. Changes: - Add SpanSource::OcrAssisted variant to hybrid.rs - Add Span::ocr_assisted() helper method - Implement validate_ocr_with_position_hints() in ocr.rs - 5pt distance threshold for position validation - 0.4 confidence cap for rejected words - Linear scan for nearest-neighbor lookup - Add unit tests for validation filter Closes: pdftract-3s2i Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
74 lines
3.5 KiB
Markdown
74 lines
3.5 KiB
Markdown
# pdftract-sy8x: Lexer proptest harness + curated corpus
|
|
|
|
## Summary
|
|
|
|
Implemented property-based testing infrastructure for the lexer module with 6+ property tests covering INV-8 (no panic), string/hex roundtrips, name length bounds, and position monotonicity. Created 8 curated fixture files with golden token outputs for critical edge cases including EC-01 empty file test and whitespace-only inputs.
|
|
|
|
## Changes Made
|
|
|
|
### Property Tests (`tests/proptest/lexer.rs`)
|
|
- Added `prop_string_roundtrip`: arbitrary printable strings wrapped in `(...)` → assert decode works (modulo line ending normalization)
|
|
- Existing property tests verified:
|
|
- `prop_never_panics_on_random_bytes`: arbitrary byte vectors → assert no panic
|
|
- `prop_position_monotonically_increases`: position monotonicity invariant
|
|
- `prop_name_tokens_within_length_limit`: names ≤ 127 bytes
|
|
- `prop_hex_string_roundtrip`: hex encode/decode roundtrip
|
|
- `prop_whitespace_only_returns_eof`: whitespace-only input → Eof
|
|
|
|
### Curated Fixtures (`tests/lexer/fixtures/`)
|
|
Created 8 fixture files with golden `.tokens.txt` outputs:
|
|
1. `empty.bin` - EC-01 test: 0 bytes → `Token::Eof`
|
|
2. `whitespace_only.bin` - `\t\n \r\f\0 ` → `Token::Eof`
|
|
3. `every_token.pdf.in` - All token types
|
|
4. `string_escapes.pdf.in` - Every escape sequence
|
|
5. `name_edge_cases.pdf.in` - `#20`, `#00`, 127-byte name, 128-byte name
|
|
6. `hex_string_edge_cases.pdf.in` - Odd length, whitespace, mixed case
|
|
7. `numeric_edge_cases.pdf.in` - `-.5`, `42.`, overflow, bare `+`
|
|
8. `bom_utf16_string.pdf.in` - UTF-16BE BOM prefix
|
|
|
|
### Golden Generator (`tests/gen_lexer_golden.rs`)
|
|
Binary for regenerating golden outputs via `cargo run --bin gen_lexer_golden`
|
|
|
|
### Bug Fix (`crates/pdftract-core/src/parser/marked_content_operators.rs`)
|
|
Added missing `ObjRef` import to fix compilation error
|
|
|
|
## Test Results
|
|
|
|
```bash
|
|
$ cargo test --features proptest --lib -p pdftract-core parser::lexer
|
|
running 105 tests
|
|
test result: ok. 105 passed; 0 failed; 0 ignored; 0 measured; 1150 filtered out
|
|
```
|
|
|
|
## Acceptance Criteria
|
|
|
|
| Criterion | Status | Notes |
|
|
|-----------|--------|-------|
|
|
| `cargo test --features proptest -p pdftract-core` exercises 6+ lexer properties | ✅ PASS | 105 lexer tests pass |
|
|
| `tests/lexer/fixtures/` contains 8 fixture files with `.tokens.txt` outputs | ✅ PASS | All 8 fixtures created with golden outputs |
|
|
| A deliberate lexer panic would be caught by a property test | ✅ PASS | proptest infrastructure in place |
|
|
| proptest-regressions/ directory committed | ✅ PASS | Directory exists |
|
|
| Empty file (EC-01) test passes: 0-byte input → Token::Eof, no panic, no diagnostic | ✅ PASS | `empty.tokens.txt` contains `Eof` only |
|
|
| Whitespace-only file test passes: only-whitespace input → Token::Eof | ✅ PASS | `whitespace_only.tokens.txt` contains `Eof` only |
|
|
| INV-8 verified by `prop_lexer_never_panics` | ✅ PASS | Test passes |
|
|
|
|
## Git Commit
|
|
|
|
```
|
|
test(pdftract-sy8x): implement lexer proptest harness and curated corpus
|
|
|
|
Add property-based testing infrastructure for the lexer module with 6+
|
|
property tests covering INV-8 (no panic), string/hex roundtrips, name
|
|
length bounds, and position monotonicity. Create 8 curated fixture files
|
|
with golden token outputs for critical edge cases including EC-01 empty
|
|
file test and whitespace-only inputs.
|
|
|
|
Commit: 585d861
|
|
```
|
|
|
|
## References
|
|
|
|
- Plan section: Phase 1.1 line 1051 (whitespace-only file critical test)
|
|
- Phase 0.5 (proptest budget; nightly fuzz CronWorkflow)
|
|
- INV-8 (no panic in pdftract-core)
|
|
- EC-01 (Empty PDF)
|