pdftract/tests/fixtures/classifier/invoice
jedarden 633eba61b1 test(classifier): add 200-document labeled corpus for Phase 5.6
- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
  - 50 invoices with bill-to/ship-to, item tables, totals
  - 50 scientific papers with abstracts, sections, references
  - 50 contracts with clauses, legal terminology, signatures
  - 50 misc documents (8 receipts, 8 forms, 7 bank statements,
    7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)

- Add MANIFEST.tsv mapping each document to its expected type
  with source URL and license (all MIT-0 synthetic data)

- Add scripts/generate_test_corpus.py to regenerate the corpus
  using reportlab for PDF generation

- Add tests/test_classifier_corpus.rs with validation harness:
  - test_corpus_manifest_validity: verifies manifest structure
    and file existence (PASSES)
  - test_classifier_corpus_accuracy: will validate precision,
    recall, and F1 once the classifier is implemented (SKIP for now)
  - test_classifier_reproducibility: will verify deterministic
    classification (SKIP for now)

- Add tests/fixtures/classifier/README.md documenting corpus
  structure, generation process, and acceptance criteria
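
As a rough illustration of what the manifest check above involves, here is a minimal Python sketch. It is not the Rust harness from tests/test_classifier_corpus.rs. The column names (path, expected_type, source_url, license) and the validate_manifest helper are assumptions made for this example; the real MANIFEST.tsv header may differ.

```python
import csv
import io
from pathlib import Path

# Assumed column names; the real MANIFEST.tsv header may differ.
REQUIRED_COLUMNS = ("path", "expected_type", "source_url", "license")


def validate_manifest(tsv_text, corpus_root=None):
    """Return a list of problems found in the manifest text (empty if none).

    If corpus_root is given, each listed path must also exist under it.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        if corpus_root is not None and row["path"]:
            if not (Path(corpus_root) / row["path"]).is_file():
                problems.append(f"line {line_no}: file not found: {row['path']}")
    return problems
```

Passing corpus_root=None checks structure only, which keeps the sketch usable without the PDFs on disk.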

Total corpus size: ~0.4 MB (each PDF < 5 KB)

Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for the same document across runs
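
The thresholds above could be checked along the following lines. This is a sketch in Python rather than the project's Rust harness; per_class_metrics and meets_acceptance are hypothetical helper names, and only the 0.85 and 0.88 thresholds come from the criteria listed here.

```python
from collections import Counter


def per_class_metrics(expected, predicted):
    """Map each label to (precision, recall, f1) from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fn[exp] += 1   # the expected class was missed
            fp[pred] += 1  # the predicted class gained a false hit

    metrics = {}
    for label in sorted(set(expected) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def meets_acceptance(metrics, min_pr=0.85, min_macro_f1=0.88):
    """Apply the per-class precision/recall and macro-F1 thresholds."""
    if not metrics:
        return False
    # Macro-F1 is the unweighted mean of the per-class F1 scores.
    macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
    per_class_ok = all(p >= min_pr and r >= min_pr for p, r, _ in metrics.values())
    return per_class_ok and macro_f1 >= min_macro_f1
```

Because macro-F1 weights every class equally, a weak result on one 50-document class cannot be hidden by strong results on the others.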

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 07:16:02 -04:00
01.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
02.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
03.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
04.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
05.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
06.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
07.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
08.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
09.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
10.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
11.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
12.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
13.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
14.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
15.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
16.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
17.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
18.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
19.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
20.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
21.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
22.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
23.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
24.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
25.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
26.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
27.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
28.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
29.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
30.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
31.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
32.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
33.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
34.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
35.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
36.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
37.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
38.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
39.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
40.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
41.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
42.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
43.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
44.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
45.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
46.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
47.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
48.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
49.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00
50.pdf test(classifier): add 200-document labeled corpus for Phase 5.6 2026-05-17 07:16:02 -04:00