pdftract/tests/fixtures/classifier/MANIFEST.tsv
jedarden 633eba61b1 test(classifier): add 200-document labeled corpus for Phase 5.6
- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
  - 50 invoices with bill-to/ship-to, item tables, totals
  - 50 scientific papers with abstracts, sections, references
  - 50 contracts with clauses, legal terminology, signatures
  - 50 misc documents (8 receipts, 8 forms, 7 bank statements,
    7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)

- Add MANIFEST.tsv mapping each document to its expected type,
  with source_url (a provenance note, since the corpus is
  synthetic) and license (MIT-0 for every row)

- Add scripts/generate_test_corpus.py to regenerate the corpus
  using reportlab for PDF generation

- Add tests/test_classifier_corpus.rs with validation harness:
  - test_corpus_manifest_validity: verifies manifest structure
    and file existence (PASSES)
  - test_classifier_corpus_accuracy: will validate precision/
    recall/F1 when classifier is implemented (SKIP for now)
  - test_classifier_reproducibility: will verify deterministic
    classification (SKIP for now)

- Add tests/fixtures/classifier/README.md documenting corpus
  structure, generation process, and acceptance criteria

Total corpus size: ~0.4 MB (each PDF < 5 KB)

Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document
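
The thresholds above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not the harness in tests/test_classifier_corpus.rs: the function names (`per_class_metrics`, `macro_f1`, `meets_acceptance`) are hypothetical, and it assumes every label that appears in either the expected or the predicted list counts as its own class.

```python
from collections import Counter


def per_class_metrics(expected, predicted):
    """Return {label: (precision, recall, f1)} for every label seen in either list."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            tp[want] += 1
        else:
            fn[want] += 1  # the true class was missed
            fp[got] += 1   # the predicted class was wrongly assigned
    metrics = {}
    for label in sorted(set(expected) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


def meets_acceptance(metrics, min_pr=0.85, min_macro_f1=0.88):
    """Apply the Phase 5.6 thresholds: per-class P/R and macro-F1."""
    per_class_ok = all(p >= min_pr and r >= min_pr for p, r, _ in metrics.values())
    return per_class_ok and macro_f1(metrics) >= min_macro_f1
```

Macro-F1 averages the per-class scores without weighting by class size, so under this assumption a 6-document class such as book_excerpt would influence the result as much as a 50-document class such as invoice.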

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 07:16:02 -04:00

19 KiB

1pathexpected_document_typesource_urllicense
2invoice/01.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
3invoice/02.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
4invoice/03.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
5invoice/04.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
6invoice/05.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
7invoice/06.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
8invoice/07.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
9invoice/08.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
10invoice/09.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
11invoice/10.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
12invoice/11.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
13invoice/12.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
14invoice/13.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
15invoice/14.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
16invoice/15.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
17invoice/16.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
18invoice/17.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
19invoice/18.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
20invoice/19.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
21invoice/20.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
22invoice/21.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
23invoice/22.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
24invoice/23.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
25invoice/24.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
26invoice/25.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
27invoice/26.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
28invoice/27.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
29invoice/28.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
30invoice/29.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
31invoice/30.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
32invoice/31.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
33invoice/32.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
34invoice/33.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
35invoice/34.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
36invoice/35.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
37invoice/36.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
38invoice/37.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
39invoice/38.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
40invoice/39.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
41invoice/40.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
42invoice/41.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
43invoice/42.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
44invoice/43.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
45invoice/44.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
46invoice/45.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
47invoice/46.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
48invoice/47.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
49invoice/48.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
50invoice/49.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
51invoice/50.pdfinvoiceSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
52scientific_paper/01.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
53scientific_paper/02.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
54scientific_paper/03.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
55scientific_paper/04.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
56scientific_paper/05.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
57scientific_paper/06.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
58scientific_paper/07.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
59scientific_paper/08.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
60scientific_paper/09.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
61scientific_paper/10.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
62scientific_paper/11.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
63scientific_paper/12.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
64scientific_paper/13.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
65scientific_paper/14.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
66scientific_paper/15.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
67scientific_paper/16.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
68scientific_paper/17.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
69scientific_paper/18.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
70scientific_paper/19.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
71scientific_paper/20.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
72scientific_paper/21.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
73scientific_paper/22.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
74scientific_paper/23.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
75scientific_paper/24.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
76scientific_paper/25.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
77scientific_paper/26.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
78scientific_paper/27.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
79scientific_paper/28.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
80scientific_paper/29.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
81scientific_paper/30.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
82scientific_paper/31.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
83scientific_paper/32.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
84scientific_paper/33.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
85scientific_paper/34.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
86scientific_paper/35.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
87scientific_paper/36.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
88scientific_paper/37.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
89scientific_paper/38.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
90scientific_paper/39.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
91scientific_paper/40.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
92scientific_paper/41.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
93scientific_paper/42.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
94scientific_paper/43.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
95scientific_paper/44.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
96scientific_paper/45.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
97scientific_paper/46.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
98scientific_paper/47.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
99scientific_paper/48.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
100scientific_paper/49.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
101scientific_paper/50.pdfscientific_paperSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
102contract/01.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
103contract/02.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
104contract/03.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
105contract/04.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
106contract/05.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
107contract/06.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
108contract/07.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
109contract/08.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
110contract/09.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
111contract/10.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
112contract/11.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
113contract/12.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
114contract/13.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
115contract/14.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
116contract/15.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
117contract/16.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
118contract/17.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
119contract/18.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
120contract/19.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
121contract/20.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
122contract/21.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
123contract/22.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
124contract/23.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
125contract/24.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
126contract/25.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
127contract/26.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
128contract/27.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
129contract/28.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
130contract/29.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
131contract/30.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
132contract/31.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
133contract/32.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
134contract/33.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
135contract/34.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
136contract/35.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
137contract/36.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
138contract/37.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
139contract/38.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
140contract/39.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
141contract/40.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
142contract/41.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
143contract/42.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
144contract/43.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
145contract/44.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
146contract/45.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
147contract/46.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
148contract/47.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
149contract/48.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
150contract/49.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
151contract/50.pdfcontractSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
152misc/01.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
153misc/02.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
154misc/03.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
155misc/04.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
156misc/05.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
157misc/06.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
158misc/07.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
159misc/08.pdfreceiptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
160misc/09.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
161misc/10.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
162misc/11.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
163misc/12.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
164misc/13.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
165misc/14.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
166misc/15.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
167misc/16.pdfformSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
168misc/17.pdfbank_statementSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
169misc/18.pdfbank_statementSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
170misc/19.pdfbank_statementSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
171misc/20.pdfbank_statementSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
172misc/21.pdfbank_statementSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
173misc/22.pdfbank_statementSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
174misc/23.pdfbank_statementSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
175misc/24.pdfslide_deckSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
176misc/25.pdfslide_deckSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
177misc/26.pdfslide_deckSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
178misc/27.pdfslide_deckSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
179misc/28.pdfslide_deckSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
180misc/29.pdfslide_deckSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
181misc/30.pdfslide_deckSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
182misc/31.pdflegal_filingSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
183misc/32.pdflegal_filingSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
184misc/33.pdflegal_filingSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
185misc/34.pdflegal_filingSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
186misc/35.pdflegal_filingSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
187misc/36.pdflegal_filingSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
188misc/37.pdflegal_filingSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
189misc/38.pdfbook_excerptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
190misc/39.pdfbook_excerptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
191misc/40.pdfbook_excerptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
192misc/41.pdfbook_excerptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
193misc/42.pdfbook_excerptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
194misc/43.pdfbook_excerptSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
195misc/44.pdfmagazineSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
196misc/45.pdfmagazineSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
197misc/46.pdfmagazineSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
198misc/47.pdfmagazineSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
199misc/48.pdfmagazineSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
200misc/49.pdfmagazineSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
201misc/50.pdfmagazineSynthetic test data generated by scripts/generate_test_corpus.pyMIT-0
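
For readers who want to sanity-check the manifest outside the Rust harness, the sketch below parses it with Python's standard `csv` module and tallies the labels. It assumes the file on disk is tab-separated with the four header columns `path`, `expected_document_type`, `source_url`, and `license`; the helper names `load_manifest` and `label_counts` are hypothetical and are not part of the repository.

```python
import csv
import io
from collections import Counter

EXPECTED_HEADER = ["path", "expected_document_type", "source_url", "license"]


def load_manifest(text):
    """Parse manifest text into a list of row dicts, validating the shape."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if not fields:
            continue  # tolerate blank lines, e.g. at the end of the file
        if len(fields) != len(EXPECTED_HEADER):
            raise ValueError(
                f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, "
                f"got {len(fields)}"
            )
        rows.append(dict(zip(EXPECTED_HEADER, fields)))
    return rows


def label_counts(rows):
    """Count how many documents carry each expected_document_type."""
    return Counter(row["expected_document_type"] for row in rows)
```

Run against the rows listed above, `label_counts` should report 50 each for invoice, scientific_paper, and contract, plus 8 receipt, 8 form, 7 bank_statement, 7 slide_deck, 7 legal_filing, 6 book_excerpt, and 7 magazine, for 200 documents in total.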