pdftract/tests/fixtures/classifier/README.md

# Classifier Corpus

A 200-document labeled corpus for validating the pdftract document type classifier.

## Structure

```
classifier/
├── invoice/           # 50 invoice PDFs
├── scientific_paper/  # 50 scientific paper PDFs
├── contract/          # 50 contract PDFs
├── misc/              # 50 misc PDFs
│   ├── receipt/          # 8 receipts (01-08)
│   ├── form/             # 8 forms (09-16)
│   ├── bank_statement/   # 7 bank statements (17-23)
│   ├── slide_deck/       # 7 slide decks (24-30)
│   ├── legal_filing/     # 7 legal filings (31-37)
│   ├── book_excerpt/     # 6 book excerpts (38-43)
│   └── magazine/         # 7 magazines (44-50)
└── MANIFEST.tsv       # Expected classifications and metadata
```
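The numeric ranges in the `misc/` tree above can be written as a small lookup table, which also makes it easy to confirm that the seven subtypes add up to 50 documents. This is only an illustrative sketch: `MISC_RANGES` and `misc_subtype` are hypothetical names, not part of pdftract or the generation script.

```python
# Hypothetical helper mirroring the ranges listed in the tree above.
# Each entry is (subtype, first index, last index), both ends inclusive.
MISC_RANGES = [
    ("receipt", 1, 8),
    ("form", 9, 16),
    ("bank_statement", 17, 23),
    ("slide_deck", 24, 30),
    ("legal_filing", 31, 37),
    ("book_excerpt", 38, 43),
    ("magazine", 44, 50),
]


def misc_subtype(index: int) -> str:
    """Return the misc subtype whose inclusive range contains index."""
    for name, first, last in MISC_RANGES:
        if first <= index <= last:
            return name
    raise ValueError(f"misc index out of range: {index}")


# Number of documents per subtype, derived from the ranges.
counts = {name: last - first + 1 for name, first, last in MISC_RANGES}

# The seven ranges should account for all 50 misc documents.
assert sum(counts.values()) == 50
```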

## MANIFEST.tsv Format

```
path                    expected_document_type    source_url                                  license
invoice/01.pdf          invoice                   Synthetic test data...                      MIT-0
```

- `path`: Relative path to the PDF file
- `expected_document_type`: The correct classification for the document
- `source_url`: Origin of the document. All current entries are synthetic, so this holds a note identifying the generation script rather than a URL
- `license`: Document license (MIT-0 for all synthetic test data)
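A minimal sketch of reading the manifest, assuming the file is tab-separated with a single header row (the `.tsv` extension suggests this, but the delimiter is not stated explicitly here). `read_manifest` is a hypothetical helper name, and the sample row is the one shown above, inlined so the snippet runs without the corpus on disk.

```python
import csv
import io


def read_manifest(stream):
    """Parse manifest rows into dicts keyed by the header names.

    Assumes tab-separated columns and a header row.
    """
    return list(csv.DictReader(stream, delimiter="\t"))


# Inline sample mirroring the example row above.
sample = io.StringIO(
    "path\texpected_document_type\tsource_url\tlicense\n"
    "invoice/01.pdf\tinvoice\tSynthetic test data...\tMIT-0\n"
)
rows = read_manifest(sample)
print(rows[0]["path"], rows[0]["expected_document_type"])
```

To read the real file, pass an open file handle instead, for example `open("MANIFEST.tsv", newline="", encoding="utf-8")`.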

## Generating the Corpus

The corpus is generated by `scripts/generate_test_corpus.py`:

```bash
python3 scripts/generate_test_corpus.py
```

This creates:
- 200 synthetic PDF documents with appropriate content for each type
- MANIFEST.tsv with expected classifications

## Validation

The classifier is validated against the corpus by `tests/classifier/test_corpus.rs`:

```bash
cargo test --test classifier
```

### Acceptance Criteria

From `plan.md`, Phase 5.6:

- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: classifying the same document twice produces identical output
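The precision, recall, and macro-F1 thresholds above can be checked with a few lines of arithmetic. The sketch below is written in Python for illustration and is not the Rust test itself. It assumes macro-F1 is the unweighted mean of per-class F1 scores and that the set of classes is the set of expected labels. All function names are hypothetical.

```python
from collections import Counter


def per_class_metrics(pairs):
    """Compute (precision, recall, F1) per class.

    pairs is a list of (expected, predicted) label tuples.
    Classes are taken from the expected labels (an assumption).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fn[expected] += 1   # missed for the true class
            fp[predicted] += 1  # false alarm for the predicted class
    metrics = {}
    for label in sorted({e for e, _ in pairs}):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


def meets_criteria(pairs, min_pr=0.85, min_macro_f1=0.88):
    """True if every class clears min_pr and macro-F1 clears min_macro_f1."""
    metrics = per_class_metrics(pairs)
    per_class_ok = all(p >= min_pr and r >= min_pr for p, r, _ in metrics.values())
    return per_class_ok and macro_f1(metrics) >= min_macro_f1
```

Note that a single weak class can fail the per-class check even when macro-F1 is comfortably above 0.88, which is why the two criteria are evaluated separately.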

## Document Types

### invoice
Commercial invoices with:
- "INVOICE" header
- Bill-to/ship-to sections
- Item tables with quantity, unit price, amount
- Subtotal, tax, total due

### scientific_paper
Academic papers with:
- Title and abstract
- Section headers (Introduction, Methods, Results, Discussion, Conclusion)
- References section
- arXiv-style identifiers

### contract
Legal agreements with:
- "AGREEMENT" header
- Parties clause
- Numbered sections/clauses
- Legal terminology (shall, whereby, hereby)
- Signature blocks

### misc subtypes

#### receipt (01-08)
Simple receipts with:
- "RECEIPT" header
- Received from, amount, date
- Payment method
- Authorization signature

#### form (09-16)
Application forms with:
- "FORM" or "APPLICATION" header
- Field labels with underline blanks
- Signature line

#### bank_statement (17-23)
Monthly statements with:
- "STATEMENT" header
- Account number, statement period
- Transaction table (date, description, withdrawal, deposit, balance)

#### slide_deck (24-30)
Presentation slides with:
- Title slide
- Agenda/outline
- Bullet points

#### legal_filing (31-37)
Court documents with:
- Court header
- Case number
- Plaintiff/defendant
- Numbered counts/claims

#### book_excerpt (38-43)
Book chapters with:
- Chapter header
- Narrative text with paragraphs

#### magazine (44-50)
Magazine articles with:
- Issue/volume header
- Feature story
- Table of contents

## Provenance

All documents are synthetic test data generated by `scripts/generate_test_corpus.py`.
No personally-identifiable information (PII) is included.
All data is licensed under MIT-0 (no attribution required).

## Notes

- The corpus is for **validation only**. The classifier is rule-based, so there is no training step, and the documents should not be used as training data for ML-based classifiers either
- Each document is under 5 KB to keep the repo size manageable
- Total corpus size: ~0.4 MB
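The size limits in these notes could be checked with a short script like the one below. It is a sketch under assumptions: it takes 1 KB to mean 1024 bytes, treats every `*.pdf` under the corpus root as a corpus document, and uses hypothetical helper names.

```python
from pathlib import Path

# "< 5 KB" per document, taking 1 KB = 1024 bytes (assumption).
MAX_DOC_BYTES = 5 * 1024


def oversized_pdfs(corpus_root):
    """Return sorted (relative path, size) for PDFs at or above the limit."""
    root = Path(corpus_root)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*.pdf")
        if p.stat().st_size >= MAX_DOC_BYTES
    )


def total_size_bytes(corpus_root):
    """Sum the sizes of all PDFs under corpus_root."""
    return sum(p.stat().st_size for p in Path(corpus_root).rglob("*.pdf"))
```

An empty list from `oversized_pdfs` means every document is within the per-file limit, and `total_size_bytes` can be compared against the ~0.4 MB figure.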