Commit graph

5 commits

Author SHA1 Message Date
jedarden
857f928732 feat(pdftract-5omc): implement SDK conformance test runner pattern
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.

- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
  * Full test suite loader and executor
  * Comparison engine with min/max, string constraints, tolerances
  * Skip logic for unsupported features and schema versions
  * Report generation in JSON format

- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
  * pdftract compare - Compare actual vs expected with tolerances
  * Cross-language comparison tool to avoid reimplementations

- Documentation (docs/conformance/sdk-contract.md)
  * Complete pattern specification with pseudocode
  * Per-language runner locations
  * CI integration requirements

- Python reference stub (tests/python-conformance/test_conformance.py)
  * Full pytest-based implementation following the pattern

Closes: pdftract-5omc
2026-05-18 01:22:23 -04:00
jedarden
a3178a3960 test(pdftract-1527): add shared SDK conformance suite with 32 test cases
Add tests/sdk-conformance/ containing the shared, language-neutral test
specification for all pdftract SDKs. The suite includes 32 cases covering
all 9 contract methods (extract, extract_text, extract_markdown,
extract_stream, search, get_metadata, hash, classify, verify_receipt)
across vector, scanned, encrypted, fillable-form, mixed, large, broken,
and remote PDFs.

- cases.json: 32 test cases with id, fixture, method, options, expected,
  tolerances, feature tags, and min_schema_version
- schema.json: JSON Schema v7 draft for validating test case structure
- validate_suite.py: Validation script that checks structure and fixture
  existence
- fixtures/: Test PDFs organized by category (symlinks to classifier
  fixtures for shared files)

See notes/pdftract-1527.md for verification details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:17:42 -04:00
jedarden
b4fac0932f fix(pdftract-5z5d8): add pre-commit hook for provenance validation
Add pre-commit hook that runs check-provenance.sh before each commit
to ensure fixture files always have valid provenance entries. Update
PROVENANCE.md with validation section documenting the hook usage.

Acceptance criteria:
- PROVENANCE.md exists with one row per fixture file ✓
- Every fixture file enumerated; no orphans ✓
- License column populated; only approved licenses ✓
- SHA256 column populated; matches actual content ✓
- check-provenance.sh validates manifest; CI gate green ✓
- Synthetic fixtures point at generation scripts ✓

Refs: pdftract-5z5d8

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-17 23:50:28 -04:00
jedarden
3af009440e fix(pdftract-5z5d8): fix provenance validation script
Fixed scripts/check-provenance.sh to properly validate PROVENANCE.md
against actual fixture files. The script was failing silently due to
subshell EXIT trap removing temp files before parent could read them,
and arithmetic expansion returning exit code 1 on zero value.

Changes:
- Replaced subshell pipes with process substitution
- Moved temp file cleanup to after reading
- Added validated variable initialization
- Added || true to prevent exit on zero arithmetic

All 200 classifier corpus fixtures have valid provenance entries
with matching SHA256 hashes. PROVENANCE.md already existed with
complete documentation.

Refs: pdftract-5z5d8
Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-17 23:43:37 -04:00
jedarden
633eba61b1 test(classifier): add 200-document labeled corpus for Phase 5.6
- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
  - 50 invoices with bill-to/ship-to, item tables, totals
  - 50 scientific papers with abstracts, sections, references
  - 50 contracts with clauses, legal terminology, signatures
  - 50 misc documents (8 receipts, 8 forms, 7 bank statements,
    7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)

- Add MANIFEST.tsv mapping each document to its expected type
  with source URL and license (all MIT-0 synthetic data)

- Add scripts/generate_test_corpus.py to regenerate the corpus
  using reportlab for PDF generation

- Add tests/test_classifier_corpus.rs with validation harness:
  - test_corpus_manifest_validity: verifies manifest structure
    and file existence (PASSES)
  - test_classifier_corpus_accuracy: will validate precision/
    recall/F1 when classifier is implemented (SKIP for now)
  - test_classifier_reproducibility: will verify deterministic
    classification (SKIP for now)

- Add tests/fixtures/classifier/README.md documenting corpus
  structure, generation process, and acceptance criteria

Total corpus size: ~0.4 MB (each PDF < 5 KB)

Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 07:16:02 -04:00