Classifier Corpus
A 200-document labeled corpus for validating the pdftract document type classifier.
Structure
classifier/
├── invoice/ # 50 invoice PDFs
├── scientific_paper/ # 50 scientific paper PDFs
├── contract/ # 50 contract PDFs
├── misc/ # 50 misc PDFs
│ ├── receipt/ # 8 receipts (01-08)
│ ├── form/ # 8 forms (09-16)
│ ├── bank_statement/ # 7 bank statements (17-23)
│ ├── slide_deck/ # 7 slide decks (24-30)
│ ├── legal_filing/ # 7 legal filings (31-37)
│ ├── book_excerpt/ # 6 book excerpts (38-43)
│ └── magazine/ # 7 magazines (44-50)
└── MANIFEST.tsv # Expected classifications and metadata
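The numbering ranges in the tree above can be written down as data, which makes it easy to confirm that the misc subtypes are contiguous and add up to 50. This is a minimal sketch: the ranges are copied from the tree, while the names `MISC_RANGES`, `misc_subtype`, and `subtype_counts` are hypothetical and not part of pdftract or the generator script.

```python
# Inclusive document-number ranges per misc subtype, copied from the tree above.
# The helper names are illustrative only; they do not exist in the repository.
MISC_RANGES = {
    "receipt": (1, 8),
    "form": (9, 16),
    "bank_statement": (17, 23),
    "slide_deck": (24, 30),
    "legal_filing": (31, 37),
    "book_excerpt": (38, 43),
    "magazine": (44, 50),
}


def misc_subtype(number):
    """Return the misc subtype whose range contains the document number."""
    for subtype, (first, last) in MISC_RANGES.items():
        if first <= number <= last:
            return subtype
    raise ValueError(f"no misc subtype covers document number {number}")


def subtype_counts():
    """Return how many documents each misc subtype contributes."""
    return {name: last - first + 1 for name, (first, last) in MISC_RANGES.items()}


if __name__ == "__main__":
    print(misc_subtype(17))               # subtype owning document 17
    print(sum(subtype_counts().values())) # total misc documents
```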
MANIFEST.tsv Format
path expected_document_type source_url license
invoice/01.pdf invoice Synthetic test data... MIT-0
- path: Relative path to the PDF file
- expected_document_type: The correct classification for the document
- source_url: Origin of the document (synthetic generation script)
- license: Document license (MIT-0 for all synthetic test data)
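As an illustration of how the four columns might be consumed, the sketch below parses a manifest with Python's standard `csv` module. It assumes the file is tab-separated and starts with the header row shown above. The function names are hypothetical, and the sample rows use placeholder values rather than real manifest entries.

```python
import csv
import io
from collections import Counter

# Column names as listed in this README.
COLUMNS = ["path", "expected_document_type", "source_url", "license"]


def read_manifest(stream):
    """Parse a tab-separated manifest into a list of dicts, one per document."""
    reader = csv.DictReader(stream, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected manifest header: {reader.fieldnames}")
    return list(reader)


def count_by_type(rows):
    """Count documents per expected_document_type."""
    return Counter(row["expected_document_type"] for row in rows)


# In-memory sample; source_url values are placeholders, not real entries.
SAMPLE = (
    "path\texpected_document_type\tsource_url\tlicense\n"
    "invoice/01.pdf\tinvoice\tplaceholder\tMIT-0\n"
    "contract/01.pdf\tcontract\tplaceholder\tMIT-0\n"
)

rows = read_manifest(io.StringIO(SAMPLE))
counts = count_by_type(rows)
```

For a file on disk, the same function can be given a handle opened with `open(manifest_path, newline="", encoding="utf-8")`. On the full corpus, `count_by_type` should report 50 documents for each top-level type if the manifest matches the structure described above.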
Generating the Corpus
The corpus is generated by scripts/generate_test_corpus.py:
python3 scripts/generate_test_corpus.py
This creates:
- 200 synthetic PDF documents with appropriate content for each type
- MANIFEST.tsv with expected classifications
Validation
The corpus is validated by tests/classifier/test_corpus.rs:
cargo test --test classifier
Acceptance Criteria
From plan.md Phase 5.6:
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: classifying the same document twice produces identical output
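The thresholds above can be checked with a few lines of arithmetic. The sketch below uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean, and macro-F1 as the unweighted mean of per-class F1. It is not the logic in tests/classifier/test_corpus.rs. The function names are hypothetical, and `classify` stands in for whatever callable runs the classifier. Which label set is scored (the four top-level types or the misc subtypes) is not stated here, so the sketch scores whatever labels it is given.

```python
from collections import Counter


def per_class_metrics(expected, predicted):
    """Return {label: (precision, recall, f1)} for two paired label lists."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fp[pred] += 1  # predicted label gains a false positive
            fn[exp] += 1   # expected label gains a false negative
    metrics = {}
    for label in sorted(set(expected) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


def meets_criteria(metrics, min_pr=0.85, min_macro_f1=0.88):
    """Apply the per-class precision/recall and macro-F1 thresholds."""
    per_class_ok = all(p >= min_pr and r >= min_pr for p, r, _ in metrics.values())
    return per_class_ok and macro_f1(metrics) >= min_macro_f1


def is_reproducible(classify, document):
    """Classify the same document twice and compare the two outputs."""
    return classify(document) == classify(document)
```

For example, with expected labels `["invoice", "invoice", "contract", "contract"]` and predictions `["invoice", "contract", "contract", "contract"]`, invoice recall is 0.5, so `meets_criteria` returns `False` even though invoice precision is 1.0.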
Document Types
invoice
Commercial invoices with:
- "INVOICE" header
- Bill-to/ship-to sections
- Item tables with quantity, unit price, amount
- Subtotal, tax, total due
scientific_paper
Academic papers with:
- Title and abstract
- Section headers (Introduction, Methods, Results, Discussion, Conclusion)
- References section
- arXiv-style identifiers
contract
Legal agreements with:
- "AGREEMENT" header
- Parties clause
- Numbered sections/clauses
- Legal terminology (shall, whereby, hereby)
- Signature blocks
misc subtypes
receipt (01-08)
Simple receipts with:
- "RECEIPT" header
- Received from, amount, date
- Payment method
- Authorization signature
form (09-16)
Application forms with:
- "FORM" or "APPLICATION" header
- Field labels with underline blanks
- Signature line
bank_statement (17-23)
Monthly statements with:
- "STATEMENT" header
- Account number, statement period
- Transaction table (date, description, withdrawal, deposit, balance)
slide_deck (24-30)
Presentation slides with:
- Title slide
- Agenda/outline
- Bullet points
legal_filing (31-37)
Court documents with:
- Court header
- Case number
- Plaintiff/defendant
- Numbered counts/claims
book_excerpt (38-43)
Book chapters with:
- Chapter header
- Narrative text with paragraphs
magazine (44-50)
Magazine articles with:
- Issue/volume header
- Feature story
- Table of contents
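Several of the types above are described by a literal header string, which allows a coarse sanity check on generated documents. The sketch below only covers types for which this README quotes a header, and it works on already-extracted text. It is an illustration of the listed cues, not pdftract's classification rules, and `HEADER_CUES` and `has_expected_header` are hypothetical names.

```python
# Literal header strings quoted in the type descriptions above.
# Types without a quoted header (e.g. slide_deck, magazine) are omitted.
HEADER_CUES = {
    "invoice": ["INVOICE"],
    "contract": ["AGREEMENT"],
    "receipt": ["RECEIPT"],
    "form": ["FORM", "APPLICATION"],
    "bank_statement": ["STATEMENT"],
}


def has_expected_header(doc_type, text):
    """Return True/False if a header cue is known for doc_type, else None.

    Matching is a case-sensitive substring test, so it is deliberately
    coarse: "FORM" would also match inside a word such as "PERFORMANCE".
    """
    cues = HEADER_CUES.get(doc_type)
    if cues is None:
        return None
    return any(cue in text for cue in cues)
```

A check like this says only that a generated document carries the header its type calls for. It says nothing about whether the classifier would label it correctly.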
Provenance
All documents are synthetic test data generated by scripts/generate_test_corpus.py.
No personally-identifiable information (PII) is included.
All data is licensed under MIT-0 (no attribution required).
Notes
- The corpus is for validation only, not training: the classifier is rule-based, and the corpus should not be used as training data for ML-based classifiers
- Each document is < 5 KB to keep the repo size manageable
- Total corpus size: ~0.4 MB