- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
  - 50 invoices with bill-to/ship-to, item tables, totals
  - 50 scientific papers with abstracts, sections, references
  - 50 contracts with clauses, legal terminology, signatures
  - 50 misc documents (8 receipts, 8 forms, 7 bank statements,
    7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)
- Add MANIFEST.tsv mapping each document to its expected type
  with source URL and license (all MIT-0 synthetic data)
- Add scripts/generate_test_corpus.py to regenerate the corpus
  using reportlab for PDF generation
- Add tests/test_classifier_corpus.rs with validation harness:
  - test_corpus_manifest_validity: verifies manifest structure
    and file existence (PASSES)
  - test_classifier_corpus_accuracy: will validate
    precision/recall/F1 when classifier is implemented (SKIP for now)
  - test_classifier_reproducibility: will verify deterministic
    classification (SKIP for now)
- Add tests/fixtures/classifier/README.md documenting corpus
  structure, generation process, and acceptance criteria

Total corpus size: ~0.4 MB (each PDF < 5 KB)

Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Classifier Corpus

A 200-document labeled corpus for validating the pdftract document type classifier.

## Structure

```
classifier/
├── invoice/              # 50 invoice PDFs
├── scientific_paper/     # 50 scientific paper PDFs
├── contract/             # 50 contract PDFs
├── misc/                 # 50 misc PDFs
│   ├── receipt/          # 8 receipts (01-08)
│   ├── form/             # 8 forms (09-16)
│   ├── bank_statement/   # 7 bank statements (17-23)
│   ├── slide_deck/       # 7 slide decks (24-30)
│   ├── legal_filing/     # 7 legal filings (31-37)
│   ├── book_excerpt/     # 6 book excerpts (38-43)
│   └── magazine/         # 7 magazines (44-50)
└── MANIFEST.tsv          # Expected classifications and metadata
```

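The counts in the tree above can be checked mechanically. The following is a minimal sketch, not part of the repository: `EXPECTED_COUNTS` and `count_mismatches` are hypothetical names, and it assumes every PDF sits directly inside its class directory with a `.pdf` extension.

```python
from pathlib import Path

# Expected PDF counts per directory, copied from the tree above.
EXPECTED_COUNTS = {
    "invoice": 50,
    "scientific_paper": 50,
    "contract": 50,
    "misc/receipt": 8,
    "misc/form": 8,
    "misc/bank_statement": 7,
    "misc/slide_deck": 7,
    "misc/legal_filing": 7,
    "misc/book_excerpt": 6,
    "misc/magazine": 7,
}


def count_mismatches(corpus_root):
    """Return {directory: (expected, found)} for each directory whose PDF count is off."""
    root = Path(corpus_root)
    mismatches = {}
    for rel_dir, expected in EXPECTED_COUNTS.items():
        # A missing directory simply yields zero matches.
        found = len(list((root / rel_dir).glob("*.pdf")))
        if found != expected:
            mismatches[rel_dir] = (expected, found)
    return mismatches
```

Calling `count_mismatches("tests/fixtures/classifier")` from the repository root should return an empty dictionary when the layout matches the tree.
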
## MANIFEST.tsv Format

```
path expected_document_type source_url license
invoice/01.pdf invoice Synthetic test data... MIT-0
```

- `path`: Relative path to the PDF file
- `expected_document_type`: The correct classification for the document
- `source_url`: Origin of the document (synthetic generation script)
- `license`: Document license (MIT-0 for all synthetic test data)

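A manifest in this format can be checked with a few lines of Python. This sketch only illustrates a structure and file-existence check; it is not the Rust harness. It assumes the file is tab-separated with the header row shown above, and `validate_manifest` is a hypothetical helper name.

```python
import csv
from pathlib import Path

# Column names as shown in the MANIFEST.tsv header above.
EXPECTED_COLUMNS = ["path", "expected_document_type", "source_url", "license"]


def validate_manifest(corpus_root):
    """Return a list of problems; an empty list means the manifest looks valid."""
    root = Path(corpus_root)
    problems = []
    with open(root / "MANIFEST.tsv", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != EXPECTED_COLUMNS:
            return [f"unexpected header: {reader.fieldnames}"]
        for row in reader:
            line_no = reader.line_num
            # DictReader stores extra cells under the None key and
            # fills missing cells with None.
            if None in row or any(value in (None, "") for value in row.values()):
                problems.append(f"line {line_no}: missing or extra field")
                continue
            if not (root / row["path"]).is_file():
                problems.append(f"line {line_no}: file not found: {row['path']}")
    return problems
```
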
## Generating the Corpus

The corpus is generated by `scripts/generate_test_corpus.py`, which uses reportlab to render the PDFs:

```bash
python3 scripts/generate_test_corpus.py
```

This creates:
- 200 synthetic PDF documents with appropriate content for each type
- `MANIFEST.tsv` with expected classifications

## Validation

The corpus is validated by `tests/test_classifier_corpus.rs`:

```bash
cargo test --test test_classifier_corpus
```

### Acceptance Criteria

From plan.md Phase 5.6:

- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: classifying the same document twice produces identical output

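The criteria above can be expressed as a short check. This is a sketch under stated assumptions, not the harness's actual code: macro-F1 is taken to be the unweighted mean of per-class F1 over the expected labels, and the function names (`per_class_metrics`, `meets_acceptance`, `is_reproducible`) are hypothetical.

```python
from collections import Counter

PRECISION_RECALL_MIN = 0.85  # per-class threshold from the criteria above
MACRO_F1_MIN = 0.88          # macro-F1 threshold from the criteria above


def per_class_metrics(expected, predicted):
    """Return {label: (precision, recall, f1)} for every label in `expected`."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            true_pos[want] += 1
        else:
            false_neg[want] += 1
            false_pos[got] += 1
    metrics = {}
    for label in sorted(set(expected)):
        tp, fp, fn = true_pos[label], false_pos[label], false_neg[label]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def meets_acceptance(expected, predicted):
    """Check the per-class precision/recall and macro-F1 thresholds."""
    metrics = per_class_metrics(expected, predicted)
    if not metrics:
        return False  # nothing to evaluate
    macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
    per_class_ok = all(
        p >= PRECISION_RECALL_MIN and r >= PRECISION_RECALL_MIN
        for p, r, _ in metrics.values()
    )
    return per_class_ok and macro_f1 >= MACRO_F1_MIN


def is_reproducible(classify, document):
    """Classify the same document twice and compare the two outputs."""
    return classify(document) == classify(document)
```

Here `expected` would come from the `expected_document_type` column of the manifest and `predicted` from the classifier's output for the same documents, in the same order.
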
## Document Types

### invoice
Commercial invoices with:
- "INVOICE" header
- Bill-to/ship-to sections
- Item tables with quantity, unit price, amount
- Subtotal, tax, total due

### scientific_paper
Academic papers with:
- Title and abstract
- Section headers (Introduction, Methods, Results, Discussion, Conclusion)
- References section
- arXiv-style identifiers

### contract
Legal agreements with:
- "AGREEMENT" header
- Parties clause
- Numbered sections/clauses
- Legal terminology (shall, whereby, hereby)
- Signature blocks

### misc subtypes

#### receipt (01-08)
Simple receipts with:
- "RECEIPT" header
- Received from, amount, date
- Payment method
- Authorization signature

#### form (09-16)
Application forms with:
- "FORM" or "APPLICATION" header
- Field labels with underline blanks
- Signature line

#### bank_statement (17-23)
Monthly statements with:
- "STATEMENT" header
- Account number, statement period
- Transaction table (date, description, withdrawal, deposit, balance)

#### slide_deck (24-30)
Presentation slides with:
- Title slide
- Agenda/outline
- Bullet points

#### legal_filing (31-37)
Court documents with:
- Court header
- Case number
- Plaintiff/defendant
- Numbered counts/claims

#### book_excerpt (38-43)
Book chapters with:
- Chapter header
- Narrative text with paragraphs

#### magazine (44-50)
Magazine articles with:
- Issue/volume header
- Feature story
- Table of contents

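The marker phrases listed for the three main types hint at how simple rules could separate them. The sketch below is illustrative only and is not pdftract's classifier: the function name, the rule order, and the fallback to `misc` are all assumptions, and it operates on text that has already been extracted from the PDF.

```python
def classify_by_markers(text):
    """Toy rule-based guess at a document type (hypothetical, not pdftract's rules)."""
    upper = text.upper()
    # Header markers documented above for the invoice and contract types.
    if "INVOICE" in upper:
        return "invoice"
    if "AGREEMENT" in upper:
        return "contract"
    # Scientific papers are described as having an abstract and a references section.
    if "ABSTRACT" in upper and "REFERENCES" in upper:
        return "scientific_paper"
    # Assumption: anything unmatched falls back to the catch-all type.
    return "misc"
```

A real rule set would need more than substring checks, since a contract can mention an invoice and a receipt can mention an agreement; the corpus exists to measure exactly that kind of confusion.
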
## Provenance

All documents are synthetic test data generated by `scripts/generate_test_corpus.py`.
No personally-identifiable information (PII) is included.
All data is licensed under MIT-0 (no attribution required).

## Notes

- The corpus is for **validation only**: the classifier is rule-based, and the corpus should not be used as training data for ML-based classifiers
- Each document is < 5 KB to keep the repo size manageable
- Total corpus size: ~0.4 MB