pdftract/src/lib.rs at 8cfbe70ab75ddb81fcd1660c999eef2ccae90dea - jedarden/pdftract - Forgejo


jedarden 633eba61b1 test(classifier): add 200-document labeled corpus for Phase 5.6

- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
  - 50 invoices with bill-to/ship-to, item tables, totals
  - 50 scientific papers with abstracts, sections, references
  - 50 contracts with clauses, legal terminology, signatures
  - 50 misc documents (8 receipts, 8 forms, 7 bank statements,
    7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)

- Add MANIFEST.tsv mapping each document to its expected type
  with source URL and license (all MIT-0 synthetic data)

- Add scripts/generate_test_corpus.py to regenerate the corpus
  using reportlab for PDF generation

- Add tests/test_classifier_corpus.rs with validation harness:
  - test_corpus_manifest_validity: verifies manifest structure
    and file existence (PASSES)
  - test_classifier_corpus_accuracy: will validate precision/
    recall/F1 when classifier is implemented (SKIP for now)
  - test_classifier_reproducibility: will verify deterministic
    classification (SKIP for now)

- Add tests/fixtures/classifier/README.md documenting corpus
  structure, generation process, and acceptance criteria
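
As an illustration of the manifest structure check described above, here is a minimal sketch of a MANIFEST.tsv parser. The column order (file, expected type, source URL, license), the absence of a header row, and the names `ManifestEntry` and `parse_manifest` are assumptions made for this example; the commit does not show the actual layout or harness code.

```rust
/// One row of the manifest. The field order is an assumption, not taken
/// from the repository.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ManifestEntry {
    pub file: String,
    pub doc_type: String,
    pub source_url: String,
    pub license: String,
}

/// Parses tab-separated manifest text (hypothetical helper).
/// Blank lines and lines starting with '#' are skipped.
pub fn parse_manifest(text: &str) -> Result<Vec<ManifestEntry>, String> {
    let mut entries = Vec::new();
    for (idx, line) in text.lines().enumerate() {
        if line.trim().is_empty() || line.starts_with('#') {
            continue;
        }
        let fields: Vec<&str> = line.split('\t').collect();
        if fields.len() != 4 {
            return Err(format!(
                "line {}: expected 4 tab-separated fields, found {}",
                idx + 1,
                fields.len()
            ));
        }
        // Reject rows where any column is empty or whitespace only.
        if fields.iter().any(|f| f.trim().is_empty()) {
            return Err(format!("line {}: empty field", idx + 1));
        }
        entries.push(ManifestEntry {
            file: fields[0].to_string(),
            doc_type: fields[1].to_string(),
            source_url: fields[2].to_string(),
            license: fields[3].to_string(),
        });
    }
    Ok(entries)
}
```

A real validity test would additionally check that each `file` exists under tests/fixtures/classifier/, which this sketch leaves out so that it stays free of filesystem access.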

Total corpus size: ~0.4 MB (each PDF < 5 KB)

Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document
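
The precision, recall, and macro-F1 thresholds above could be checked along the following lines. This is a sketch under stated assumptions, not the project's harness: `ClassMetrics`, `per_class_metrics`, `macro_f1`, and `meets_acceptance` are hypothetical names, and labels are plain strings. Macro-F1 is taken here as the unweighted mean of the per-class F1 scores.

```rust
use std::collections::BTreeMap;

/// Precision, recall and F1 for a single class.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ClassMetrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Computes per-class metrics from (expected, predicted) label pairs.
pub fn per_class_metrics(pairs: &[(&str, &str)]) -> BTreeMap<String, ClassMetrics> {
    // label -> (true positives, false positives, false negatives)
    let mut counts: BTreeMap<String, (u32, u32, u32)> = BTreeMap::new();
    for &(expected, predicted) in pairs {
        if expected == predicted {
            counts.entry(expected.to_string()).or_default().0 += 1;
        } else {
            // A miss is a false positive for the predicted class
            // and a false negative for the expected class.
            counts.entry(predicted.to_string()).or_default().1 += 1;
            counts.entry(expected.to_string()).or_default().2 += 1;
        }
    }
    counts
        .into_iter()
        .map(|(label, (tp, fp, fneg))| {
            // Treat 0/0 as 0.0 so classes with no predictions score zero.
            let ratio = |num: u32, den: u32| {
                if den == 0 { 0.0 } else { f64::from(num) / f64::from(den) }
            };
            let precision = ratio(tp, tp + fp);
            let recall = ratio(tp, tp + fneg);
            let f1 = if precision + recall == 0.0 {
                0.0
            } else {
                2.0 * precision * recall / (precision + recall)
            };
            (label, ClassMetrics { precision, recall, f1 })
        })
        .collect()
}

/// Unweighted mean of per-class F1 scores; 0.0 for an empty map.
pub fn macro_f1(metrics: &BTreeMap<String, ClassMetrics>) -> f64 {
    if metrics.is_empty() {
        return 0.0;
    }
    metrics.values().map(|m| m.f1).sum::<f64>() / metrics.len() as f64
}

/// Applies the Phase 5.6 thresholds quoted in the commit message.
pub fn meets_acceptance(metrics: &BTreeMap<String, ClassMetrics>) -> bool {
    !metrics.is_empty()
        && metrics.values().all(|m| m.precision >= 0.85 && m.recall >= 0.85)
        && macro_f1(metrics) >= 0.88
}
```

The reproducibility criterion is not covered here; it would amount to classifying each document twice and asserting that the two outputs are equal.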

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-17 07:16:02 -04:00

15 lines

368 B

Rust


//! pdftract: PDF text extraction library that gets the hard parts right.
//!
//! This library provides correct reading order, font encoding recovery,
//! structure tree extraction, and per-page hybrid routing.

pub mod graphics_state;

// Re-export the graphics-state types at the crate root for convenience.
pub use graphics_state::{
    Color,
    Diagnostic,
    GraphicsState,
    GraphicsStateStack,
    Matrix3x3,
    Severity,
};