
Classifier Corpus

A 200-document labeled corpus for validating the pdftract document type classifier.

Structure

classifier/
├── invoice/           # 50 invoice PDFs
├── scientific_paper/  # 50 scientific paper PDFs
├── contract/          # 50 contract PDFs
├── misc/              # 50 misc PDFs
│   ├── receipt/          # 8 receipts (01-08)
│   ├── form/             # 8 forms (09-16)
│   ├── bank_statement/   # 7 bank statements (17-23)
│   ├── slide_deck/       # 7 slide decks (24-30)
│   ├── legal_filing/     # 7 legal filings (31-37)
│   ├── book_excerpt/     # 6 book excerpts (38-43)
│   └── magazine/         # 7 magazines (44-50)
└── MANIFEST.tsv       # Expected classifications and metadata
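
The number ranges in the tree above can be expressed as a small lookup table. The following is only an illustrative sketch: the `misc_subtype` helper and the `MISC_SUBTYPE_RANGES` name are hypothetical and are not taken from the repository. It shows how a document number maps to its misc subtype and confirms that the ranges add up to 50 documents.

```python
# Inclusive number ranges copied from the directory tree above.
# The names below are hypothetical helpers, not part of pdftract.
MISC_SUBTYPE_RANGES = [
    ("receipt", 1, 8),
    ("form", 9, 16),
    ("bank_statement", 17, 23),
    ("slide_deck", 24, 30),
    ("legal_filing", 31, 37),
    ("book_excerpt", 38, 43),
    ("magazine", 44, 50),
]


def misc_subtype(number):
    """Return the misc subtype whose range contains the document number."""
    for name, first, last in MISC_SUBTYPE_RANGES:
        if first <= number <= last:
            return name
    raise ValueError(f"document number out of range: {number}")


# Number of documents per subtype, derived from the inclusive ranges.
counts = {name: last - first + 1 for name, first, last in MISC_SUBTYPE_RANGES}
```

Deriving the counts from the ranges rather than hard-coding them makes it easy to spot a mismatch if a range is ever edited.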

MANIFEST.tsv Format

path                    expected_document_type    source_url                                  license
invoice/01.pdf          invoice                   Synthetic test data...                      MIT-0
  • path: Relative path to the PDF file
  • expected_document_type: The correct classification for the document
  • source_url: Origin of the document (synthetic generation script)
  • license: Document license (MIT-0 for all synthetic test data)
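
A minimal sketch of reading the manifest, assuming the file is tab-separated and starts with the four-column header row shown above. The `load_manifest` helper is hypothetical and is not part of the repository, and the `source_url` values in the sample are placeholders.

```python
import csv
import io

EXPECTED_COLUMNS = ["path", "expected_document_type", "source_url", "license"]


def load_manifest(stream):
    """Parse a MANIFEST.tsv-style stream into a list of row dicts.

    Assumes tab-separated values and a header row naming the four columns.
    """
    reader = csv.DictReader(stream, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


# Illustrative two-row manifest held in memory instead of read from disk.
sample = io.StringIO(
    "path\texpected_document_type\tsource_url\tlicense\n"
    "invoice/01.pdf\tinvoice\tplaceholder\tMIT-0\n"
    "contract/01.pdf\tcontract\tplaceholder\tMIT-0\n"
)
rows = load_manifest(sample)
```

To read the real file, pass an open file handle instead, for example `open("MANIFEST.tsv", newline="", encoding="utf-8")`.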

Generating the Corpus

The corpus is generated by scripts/generate_test_corpus.py:

python3 scripts/generate_test_corpus.py

This creates:

  • 200 synthetic PDF documents with appropriate content for each type
  • MANIFEST.tsv with expected classifications

Validation

The corpus is validated by tests/classifier/test_corpus.rs:

cargo test --test classifier

Acceptance Criteria

From plan.md Phase 5.6:

  • Per-class precision and recall >= 0.85
  • Macro-F1 >= 0.88
  • Reproducibility: classifying the same document twice produces identical output
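
The thresholds above can be checked with a short calculation. This sketch assumes macro-F1 means the unweighted mean of the per-class F1 scores, which is the common definition; plan.md may define it differently. The function names are hypothetical and do not mirror the Rust test in the repository.

```python
from collections import Counter


def per_class_metrics(expected, predicted):
    """Return {label: (precision, recall, f1)} for paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            tp[want] += 1
        else:
            fp[got] += 1   # predicted label gains a false positive
            fn[want] += 1  # expected label gains a false negative
    metrics = {}
    for label in sorted(set(expected) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of per-class F1 (assumed definition)."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


def meets_criteria(expected, predicted, min_pr=0.85, min_macro_f1=0.88):
    """Check the per-class precision/recall and macro-F1 thresholds."""
    metrics = per_class_metrics(expected, predicted)
    per_class_ok = all(
        p >= min_pr and r >= min_pr for p, r, _ in metrics.values()
    )
    return per_class_ok and macro_f1(metrics) >= min_macro_f1
```

Here `expected` would hold the `expected_document_type` values from MANIFEST.tsv and `predicted` the classifier output for the same documents, in the same order. The reproducibility criterion can be checked separately by classifying each document twice and comparing the two outputs for equality.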

Document Types

invoice

Commercial invoices with:

  • "INVOICE" header
  • Bill-to/ship-to sections
  • Item tables with quantity, unit price, amount
  • Subtotal, tax, total due

scientific_paper

Academic papers with:

  • Title and abstract
  • Section headers (Introduction, Methods, Results, Discussion, Conclusion)
  • References section
  • arXiv-style identifiers

contract

Legal agreements with:

  • "AGREEMENT" header
  • Parties clause
  • Numbered sections/clauses
  • Legal terminology (shall, whereby, hereby)
  • Signature blocks

misc subtypes

receipt (01-08)

Simple receipts with:

  • "RECEIPT" header
  • Received from, amount, date
  • Payment method
  • Authorization signature

form (09-16)

Application forms with:

  • "FORM" or "APPLICATION" header
  • Field labels with underline blanks
  • Signature line

bank_statement (17-23)

Monthly statements with:

  • "STATEMENT" header
  • Account number, statement period
  • Transaction table (date, description, withdrawal, deposit, balance)

slide_deck (24-30)

Presentation slides with:

  • Title slide
  • Agenda/outline
  • Bullet points

legal_filing (31-37)

Court documents with:

  • Court header
  • Case number
  • Plaintiff/defendant
  • Numbered counts/claims

book_excerpt (38-43)

Book chapters with:

  • Chapter header
  • Narrative text with paragraphs

magazine (44-50)

Magazine articles with:

  • Issue/volume header
  • Feature story
  • Table of contents

Provenance

All documents are synthetic test data generated by scripts/generate_test_corpus.py. No personally identifiable information (PII) is included. All data is licensed under MIT-0 (no attribution required).

Notes

  • The corpus is for validation only. The pdftract classifier is rule-based and needs no training data, and the corpus should not be used to train ML-based classifiers either
  • Each document is < 5 KB to keep the repo size manageable
  • Total corpus size: ~0.4 MB