pdftract/tests
jedarden 3d795a2d11 feat(bf-2he4t): assemble scanned fixtures corpus with ground-truth transcripts
Created tests/fixtures/scanned/ directory structure for WER gate testing:

- README.md: Corpus overview and WER targets (<3% on clean 300-DPI scans)
- GEN_MANIFEST.md: Fixture specifications and generation checklist
- receipt/receipt-300dpi.txt: Ground truth for AS-02 test scenario (37 lines)
- documents/invoice-300dpi.txt: Business invoice ground truth (55 lines)
- documents/form-300dpi.txt: Employment application form (78 lines)
- multi-page/doc-10page-300dpi.txt: Performance fixture (255 lines, 10 pages)

Generation tools:
- generate_scanned_fixtures.py: Python script for PDF generation
- generate_scanned_fixtures.rs: Rust alternative for fixture metadata
- calculate_wer.py: WER/CER calculation utility for OCR validation
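
For orientation, a minimal sketch of the standard WER/CER definition that a utility like calculate_wer.py would typically compute. The contents of that script are not shown here, so the function names (edit_distance, wer, cer, passes_gate) and the whitespace normalization are assumptions for illustration only. The 3% threshold is the target stated in README.md.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate on whitespace-normalized text (an assumption)."""
    ref_chars = list(" ".join(reference.split()))
    if not ref_chars:
        raise ValueError("reference transcript is empty")
    hyp_chars = list(" ".join(hypothesis.split()))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def passes_gate(reference: str, hypothesis: str, threshold: float = 0.03) -> bool:
    """Hypothetical gate check: True when WER is strictly below the threshold."""
    return wer(reference, hypothesis) < threshold
```

A gate test would read a ground-truth .txt file as the reference, run OCR on the matching PDF to get the hypothesis, and call passes_gate on the pair.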

Test stub:
- wer_gate_stub.rs: Placeholder for WER gate tests (marked #[ignore])

Total ground-truth content: 425 lines across 4 fixtures

Next steps:
1. Generate PDFs from the ground-truth transcripts using generate_scanned_fixtures.py
2. Verify WER < 3% on generated fixtures
3. Enable WER gate tests

Closes bf-2he4t
2026-06-01 09:25:53 -04:00
c-client feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
conformance feat(pdftract-5omc): implement per-language conformance test runner pattern 2026-05-18 01:32:24 -04:00
document_model wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
error_recovery/fixtures test(pdftract-4w0v4): implement adversarial test corpus + integration harness 2026-05-25 14:30:24 -04:00
fingerprint/fixtures fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
fixtures feat(bf-2he4t): assemble scanned fixtures corpus with ground-truth transcripts 2026-06-01 09:25:53 -04:00
lexer/fixtures test(pdftract-sy8x): implement lexer proptest harness and curated corpus 2026-05-24 02:36:37 -04:00
proptest feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
proptest-regressions docs(pdftract-49f8): establish Cargo.lock policy and documentation 2026-05-20 18:13:14 -04:00
python-conformance feat(pdftract-5omc): implement SDK conformance test runner pattern 2026-05-18 01:22:23 -04:00
remote fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
sdk-conformance wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
stream_decoder/fixtures wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
xref/fixtures feat(pdftract-1s2uj): add xref test fixture corpus and integration test runner 2026-05-24 08:20:04 -04:00
conformance.c feat(pdftract-1eaxm): implement libpdftract C FFI library 2026-05-23 08:55:12 -04:00
conformance_fixed feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
conformance_fixed.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
conformance_run feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
conformance_test feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
conformance_test_simple feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
conformance_test_simple.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
debug_a85_filter.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
debug_content_fingerprint.rs fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
debug_content_streams.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
debug_filter_array.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
debug_fingerprint_content.rs wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
debug_lzw.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
debug_missing_mediabox.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
debug_page_count.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
debug_parse.rs feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
debug_parse_simple.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
debug_stream feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
debug_stream.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
doctor_runbook_coverage.rs docs(pdftract-653ah): add runbook integration for pdftract doctor 2026-05-24 13:26:31 -04:00
document_model.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
fingerprint_fixtures.rs wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
fingerprint_reproducibility.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
fingerprint_test_single_one.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
gen_lexer_golden.rs test(pdftract-sy8x): implement lexer proptest harness and curated corpus 2026-05-24 02:36:37 -04:00
json_schema.rs fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
log_secret_fuzz.rs fix(pdftract-4pnmd): build.rs doc comment format string parsing 2026-05-28 14:36:45 -04:00
proptest-panic-verification.rs feat(pdftract-33v): add property tests and nightly fuzz job 2026-05-20 19:18:03 -04:00
stream_decoder_fixtures.rs feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
test_api_basic feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_api_basic.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_api_null feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_api_null.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_api_real feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_api_real.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_api_valid feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_api_valid.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_atomic_writer.rs feat(pdftract-68wfa): implement AtomicFileWriter for atomic file writes 2026-05-24 13:02:37 -04:00
test_bomb_limit.rs wip: AcroForm improvements, debug tooling, test corpus, and fixture updates 2026-05-30 09:48:14 -04:00
test_cycle_detection.rs fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
test_debug feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_debug.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_fingerprint_debug.rs fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00
test_parse_fixture.rs feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_simple.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_simple_run feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_stream feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_stream.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_valid.c feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
test_valid_run feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00