Commit graph

2 commits

Author SHA1 Message Date
jedarden
b535638104 feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.

Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields

Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences

All 27 tests pass including proptests for fuzzing robustness (INV-8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:45:45 -04:00
jedarden
633eba61b1 test(classifier): add 200-document labeled corpus for Phase 5.6
- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
  - 50 invoices with bill-to/ship-to, item tables, totals
  - 50 scientific papers with abstracts, sections, references
  - 50 contracts with clauses, legal terminology, signatures
  - 50 misc documents (8 receipts, 8 forms, 7 bank statements,
    7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)

- Add MANIFEST.tsv mapping each document to its expected type
  with source URL and license (all MIT-0 synthetic data)

- Add scripts/generate_test_corpus.py to regenerate the corpus
  using reportlab for PDF generation

- Add tests/test_classifier_corpus.rs with validation harness:
  - test_corpus_manifest_validity: verifies manifest structure
    and file existence (PASSES)
  - test_classifier_corpus_accuracy: will validate precision/
    recall/F1 when classifier is implemented (SKIP for now)
  - test_classifier_reproducibility: will verify deterministic
    classification (SKIP for now)

- Add tests/fixtures/classifier/README.md documenting corpus
  structure, generation process, and acceptance criteria

Total corpus size: ~0.4 MB (each PDF < 5 KB)

Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 07:16:02 -04:00