pdftract/tests/fixtures/page_class
jedarden 1e10692fd3 feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate
This commit completes bead pdftract-2zw by adding:
- 4 page classification fixtures in tests/fixtures/page_class/
  - vector_pure: Pure text PDF (born-digital)
  - scanned_single: Image-only PDF (scanned)
  - brokenvector_pdfa: PDF/A with invisible text over image
  - hybrid_header_body: Text header + scanned body (hybrid)
- Expected classification JSON files for each fixture
- Integration tests in crates/pdftract-core/tests/page_classification.rs
  - test_page_classification_fixtures: validates classification correctness
  - test_page_classification_reproducibility: verifies that re-classification yields byte-identical JSON
  - test_fixture_files_exist_and_size: validates fixture size < 1 MB
  - test_expected_json_validity: validates JSON schema
- Fixture generator: tests/fixtures/generate_page_class_fixtures.rs
- Updated PROVENANCE.md with new SHA256 hashes
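
A minimal sketch of what a byte-identical reproducibility check could look like, using only the standard library. Everything here is an assumption for illustration: `canonical_json`, `classify_fixture`, `is_reproducible`, and the class labels are hypothetical stand-ins, not the actual pdftract-core API or its expected JSON format.

```rust
use std::collections::BTreeMap;

/// Serialize string fields as a JSON object with keys in sorted order.
/// Hypothetical helper: values are assumed to need no JSON escaping.
fn canonical_json(fields: &BTreeMap<&str, &str>) -> String {
    let body: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", k, v))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Hypothetical stand-in for classifying one fixture and emitting JSON.
/// The class labels are made up; the real classifier inspects the PDF.
fn classify_fixture(name: &str) -> String {
    let class = match name {
        "vector_pure" => "vector",
        "scanned_single" => "scanned",
        "brokenvector_pdfa" => "broken_vector",
        "hybrid_header_body" => "hybrid",
        _ => "unknown",
    };
    let mut fields = BTreeMap::new();
    fields.insert("fixture", name);
    fields.insert("class", class);
    canonical_json(&fields)
}

/// Reproducibility gate: two independent runs must match byte for byte.
fn is_reproducible(name: &str) -> bool {
    classify_fixture(name).as_bytes() == classify_fixture(name).as_bytes()
}
```

The sketch uses `BTreeMap` because it iterates in sorted key order, whereas the iteration order of `HashMap` is unspecified. A serializer that walks a hash map directly could therefore emit the same fields in a different order between runs and fail a byte-level comparison.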

Acceptance criteria PASS:
- 4 fixtures present
- cargo test page_classification passes (4/4 tests)
- Fixtures total 2927 bytes (< 1 MB)
- Reproducibility gate implemented
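
The size criterion above could be checked along these lines. This is a sketch under stated assumptions: `total_size`, `within_budget`, and `MAX_FIXTURE_BYTES` are hypothetical names, the limit is assumed to apply to the combined size of the fixture directory, and 1 MB is taken to mean 1 MiB.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumed budget: 1 MB interpreted as 1 MiB for the whole fixture tree.
const MAX_FIXTURE_BYTES: u64 = 1024 * 1024;

/// Recursively sum the sizes of all regular files under `dir`.
fn total_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            // Descend into per-fixture subdirectories such as vector_pure/.
            total += total_size(&entry.path())?;
        } else {
            total += meta.len();
        }
    }
    Ok(total)
}

/// True when the combined fixture size stays under the assumed budget.
fn within_budget(dir: &Path) -> io::Result<bool> {
    Ok(total_size(dir)? < MAX_FIXTURE_BYTES)
}
```

A test could call `within_budget(Path::new("tests/fixtures/page_class"))` and assert that the result is `Ok(true)`; the path is relative to wherever the test runs, so it may need adjusting.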

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 15:04:05 -04:00
brokenvector_pdfa feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate 2026-05-23 15:04:05 -04:00
hybrid_header_body feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate 2026-05-23 15:04:05 -04:00
scanned_single feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate 2026-05-23 15:04:05 -04:00
vector_pure feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate 2026-05-23 15:04:05 -04:00