# Classifier Corpus

A 200-document labeled corpus for validating the pdftract document type classifier.

## Structure

```
classifier/
├── invoice/              # 50 invoice PDFs
├── scientific_paper/     # 50 scientific paper PDFs
├── contract/             # 50 contract PDFs
├── misc/                 # 50 misc PDFs
│   ├── receipt/          # 8 receipts (01-08)
│   ├── form/             # 8 forms (09-16)
│   ├── bank_statement/   # 7 bank statements (17-23)
│   ├── slide_deck/       # 7 slide decks (24-30)
│   ├── legal_filing/     # 7 legal filings (31-37)
│   ├── book_excerpt/     # 6 book excerpts (38-43)
│   └── magazine/         # 7 magazines (44-50)
└── MANIFEST.tsv          # Expected classifications and metadata
```

## MANIFEST.tsv Format

```
path            expected_document_type  source_url              license
invoice/01.pdf  invoice                 Synthetic test data...  MIT-0
```

- `path`: Relative path to the PDF file
- `expected_document_type`: The correct classification for the document
- `source_url`: Origin of the document (synthetic generation script)
- `license`: Document license (MIT-0 for all synthetic test data)

## Generating the Corpus

The corpus is generated by `scripts/generate_test_corpus.py`:

```bash
python3 scripts/generate_test_corpus.py
```

This creates:

- 200 synthetic PDF documents with appropriate content for each type
- MANIFEST.tsv with expected classifications

## Validation

The corpus is validated by `tests/classifier/test_corpus.rs`:

```bash
cargo test --test classifier
```

### Acceptance Criteria

From plan.md Phase 5.6:

- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: classifying the same document twice produces identical output

## Document Types

### invoice

Commercial invoices with:

- "INVOICE" header
- Bill-to/ship-to sections
- Item tables with quantity, unit price, amount
- Subtotal, tax, total due

### scientific_paper

Academic papers with:

- Title and abstract
- Section headers (Introduction, Methods, Results, Discussion, Conclusion)
- References section
- arXiv-style identifiers

### contract

Legal agreements with:

- "AGREEMENT" header
- Parties clause
- Numbered sections/clauses
- Legal terminology (shall, whereby, hereby)
- Signature blocks

### misc subtypes

#### receipt (01-08)

Simple receipts with:

- "RECEIPT" header
- Received from, amount, date
- Payment method
- Authorization signature

#### form (09-16)

Application forms with:

- "FORM" or "APPLICATION" header
- Field labels with underline blanks
- Signature line

#### bank_statement (17-23)

Monthly statements with:

- "STATEMENT" header
- Account number, statement period
- Transaction table (date, description, withdrawal, deposit, balance)

#### slide_deck (24-30)

Presentation slides with:

- Title slide
- Agenda/outline
- Bullet points

#### legal_filing (31-37)

Court documents with:

- Court header
- Case number
- Plaintiff/defendant
- Numbered counts/claims

#### book_excerpt (38-43)

Book chapters with:

- Chapter header
- Narrative text with paragraphs

#### magazine (44-50)

Magazine articles with:

- Issue/volume header
- Feature story
- Table of contents

## Provenance

All documents are synthetic test data generated by `scripts/generate_test_corpus.py`. No personally-identifiable information (PII) is included. All data is licensed under MIT-0 (no attribution required).

## Notes

- The corpus is for **validation only**, not training (the classifier is rule-based)
- Each document is < 5 KB to keep the repo size manageable
- Total corpus size: ~0.4 MB
- The corpus should not be used as training data for ML-based classifiers
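
## Example: Checking the Acceptance Criteria

As an illustration of the acceptance criteria listed under Validation, the sketch below reads a tab-separated manifest and computes per-class precision and recall plus macro-F1 using the standard definitions. It is not the code in `tests/classifier/test_corpus.rs`. The function names (`read_manifest`, `per_class_metrics`, `macro_f1`, `meets_criteria`), the in-memory manifest, and the `predictions` mapping are all hypothetical, and treating an undefined ratio as 0.0 is an assumption.

```python
import csv
import io
from collections import Counter


def read_manifest(handle):
    """Yield (path, expected_document_type) pairs from a tab-separated manifest."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row["path"], row["expected_document_type"]


def per_class_metrics(expected, predicted):
    """Return {label: (precision, recall, f1)} for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            tp[want] += 1
        else:
            fp[got] += 1   # predicted label gained a false positive
            fn[want] += 1  # expected label gained a false negative
    metrics = {}
    for label in sorted(set(expected) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        # Assumption: an undefined ratio (zero denominator) counts as 0.0.
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


def meets_criteria(metrics, min_pr=0.85, min_macro_f1=0.88):
    """Apply the thresholds quoted from plan.md Phase 5.6."""
    per_class_ok = all(p >= min_pr and r >= min_pr for p, r, _ in metrics.values())
    return per_class_ok and macro_f1(metrics) >= min_macro_f1


if __name__ == "__main__":
    # Hypothetical two-row manifest; the real file has one row per corpus document.
    manifest = io.StringIO(
        "path\texpected_document_type\tsource_url\tlicense\n"
        "invoice/01.pdf\tinvoice\tsynthetic\tMIT-0\n"
        "contract/01.pdf\tcontract\tsynthetic\tMIT-0\n"
    )
    rows = list(read_manifest(manifest))
    # Hypothetical classifier output keyed by path.
    predictions = {"invoice/01.pdf": "invoice", "contract/01.pdf": "contract"}
    expected = [label for _, label in rows]
    predicted = [predictions[path] for path, _ in rows]
    metrics = per_class_metrics(expected, predicted)
    print(metrics, macro_f1(metrics), meets_criteria(metrics))
```

Macro-F1 averages the per-class F1 scores without weighting by class size, so a weak class lowers the score by the same amount however many documents it contains.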