Commit graph

2 commits

Author SHA1 Message Date
jedarden
1e10692fd3 feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate
This commit completes bead pdftract-2zw by adding:
- 4 page classification fixtures in tests/fixtures/page_class/
  - vector_pure: Pure text PDF (born-digital)
  - scanned_single: Image-only PDF (scanned)
  - brokenvector_pdfa: PDF/A with invisible text over image
  - hybrid_header_body: Text header + scanned body (hybrid)
- Expected classification JSON files for each fixture
- Integration tests in crates/pdftract-core/tests/page_classification.rs
  - test_page_classification_fixtures: validates classification correctness
  - test_page_classification_reproducibility: byte-identical JSON on re-classification
  - test_fixture_files_exist_and_size: validates fixture size < 1 MB
  - test_expected_json_validity: validates JSON schema
- Fixture generator: tests/fixtures/generate_page_class_fixtures.rs
- Updated PROVENANCE.md with new SHA256 hashes

Acceptance criteria PASS:
- 4 fixtures present 
- cargo test page_classification passes  (4/4 tests)
- Fixtures total 2927 bytes (< 1 MB) 
- Reproducibility gate implemented 

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 15:04:05 -04:00
jedarden
9215892f95 feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate
Implement page classification test fixtures, integration tests, and
reproducibility CI gate for Phase 5.1.5.

Fixtures (4 total, 3.6 KB):
- vector_pure: Pure text PDF (born-digital)
- scanned_single: Image-only PDF (scanned)
- brokenvector_pdfa: Invisible text + image
- hybrid_header_body: Text header + scanned body

Integration tests (crates/pdftract-core/tests/page_classification.rs):
- test_page_classification_fixtures: Validates classification correctness
- test_page_classification_reproducibility: CI gate for byte-identical JSON
- test_fixture_files_exist_and_size: Infrastructure validation
- test_expected_json_validity: JSON schema validation

Acceptance criteria:
-  4 fixtures present in tests/fixtures/page_class/
-  cargo test page_classification passes (4/4 tests)
-  Reproducibility gate fails on perturbation
-  Fixtures total < 1 MB (3.6 KB)

Refs: pdftract-2zw, plan.md lines 1840-1844

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 15:04:05 -04:00