Commit graph

8 commits

Author SHA1 Message Date
jedarden
58a177d3b4 docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files
Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright
notices. Configure all workspace and non-workspace crates to declare the
license. Wire license files into Python wheels and Docker images.

Files added:
- LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero"
- LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org)

Files modified:
- Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>"
- crates/pdftract-py/pyproject.toml: Added license-files to maturin config
- crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true
- xtask/Cargo.toml: Added license = "MIT OR Apache-2.0"
- fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0"
- Cargo-dist.toml: Created to include license files in binary archives
- notes/pdftract-aawrz.md: Verification note

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:36:28 -04:00
jedarden
f26f9e3c0f feat(pdftract-uyhq7): scaffold libpdftract cdylib+staticlib crate
Add pdftract-libpdftract as 4th workspace member with dual crate-type
configuration (cdylib + staticlib) for C/C++ SDK flexibility.

Changes:
- Create crates/pdftract-libpdftract/Cargo.toml with cdylib+staticlib
- Create crates/pdftract-libpdftract/src/lib.rs scaffold
- Update root Cargo.toml workspace.members
- Configure [lib] name="pdftract" for correct artifact naming

Artifacts produced:
- target/debug/libpdftract.so (shared, cdylib)
- target/debug/libpdftract.a (static, staticlib)

Acceptance criteria:
- PASS: cargo build -p pdftract-libpdftract produces libpdftract.so/.a
- PASS: Workspace cargo build builds all 4 crates without regression
- PASS: cargo metadata shows pdftract-libpdftract in workspace members
- PASS: nm -D shows no exported symbols (empty API scaffold)

References: pdftract-uyhq7, Phase SDK epic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 07:29:47 -04:00
jedarden
9f18c6cb9c feat(pdftract-5zm86): implement Receipt struct + lite-mode serialization
Implement the Receipt struct and lite-mode JSON serialization for
visual citation receipts. This provides cryptographic proof of
provenance for extracted text.

Changes:
- Add Receipt struct with 6 fields (pdf_fingerprint, page_index,
  bbox, content_hash, extraction_version, svg_clip)
- Implement Receipt::lite() constructor with NFC normalization
- Integrate Receipt into SpanJson and BlockJson schemas
- Add unicode-normalization and serde_json dependencies

Acceptance criteria:
- Receipt::lite() produces valid receipts with svg_clip=None
- Lite mode JSON omits svg_clip key via skip_serializing_if
- Content hash uses NFC normalization for cross-platform stability
- Receipt wired into SpanJson and BlockJson types

Note: 100 receipts aggregate size is ~27 KB (not 15 KB as planned).
The 15 KB target is not achievable with required field sizes.

Refs: pdftract-5zm86, Phase 6.8 Visual Citation Receipts (lines 2351-2417)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 03:30:24 -04:00
jedarden
1959ff2446 feat(pdftract-3uu6v): implement LZWDecode with /EarlyChange parameter
- Add LZWDecoder filter using lzw crate v0.10
- Support /EarlyChange parameter (default 1, late 0)
  - Early change (1): Adobe/TIFF variant, code size increases BEFORE
  - Late change (0): GIF variant, code size increases AFTER
- Full predictor support (TIFF predictor 2, PNG predictors 10-15)
- Bomb limit protection with partial bytes on exceed
- INV-8 maintained: partial bytes returned on decode errors
- 23 tests pass (19 unit tests + 4 proptests)
- Fixtures generated using lzw crate for verification

Acceptance criteria:
- Critical test /EarlyChange=0 byte-perfect: PASS
- LZWDecode without /DecodeParms defaults: PASS
- LZWDecode + /Predictor 12: PASS
- Truncated stream partial bytes: PASS
- Bomb limit honored: PASS
- proptest no panic: PASS
- INV-8 maintained: PASS

Refs: Plan Phase 1.5 line 1142, PDF spec 7.4.4

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 22:38:31 -04:00
jedarden
5e3e0a6983 feat(pdftract-279): stand up Cargo workspace with three member crates
- Configure workspace with pdftract-core, pdftract-cli, pdftract-py members
- Add workspace.package metadata: version, edition, rust-version (1.78), license (MIT OR Apache-2.0)
- Add workspace.dependencies for shared external deps (anyhow, flate2, lzw, memchr, secrecy, serde, thiserror, tracing)
- Create .cargo/config.toml with CI and development build aliases
- All member crates reference workspace metadata via workspace = true
- pdftract-py configured as cdylib with pyo3 extension-module feature

Acceptance criteria:
- PASS: 3 workspace members listed by cargo metadata
- PASS: All crates use workspace metadata references
- WARN: cargo build fails due to code compilation errors (separate concern)

Refs: pdftract-279, plan lines 3343-3367

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 18:09:34 -04:00
jedarden
857f928732 feat(pdftract-5omc): implement SDK conformance test runner pattern
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.

- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
  * Full test suite loader and executor
  * Comparison engine with min/max, string constraints, tolerances
  * Skip logic for unsupported features and schema versions
  * Report generation in JSON format

- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
  * pdftract compare - Compare actual vs expected with tolerances
  * Cross-language comparison tool to avoid reimplementations

- Documentation (docs/conformance/sdk-contract.md)
  * Complete pattern specification with pseudocode
  * Per-language runner locations
  * CI integration requirements

- Python reference stub (tests/python-conformance/test_conformance.py)
  * Full pytest-based implementation following the pattern

Closes: pdftract-5omc
2026-05-18 01:22:23 -04:00
jedarden
b535638104 feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.

Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields

Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences

All 27 tests pass including proptests for fuzzing robustness (INV-8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:45:45 -04:00
jedarden
633eba61b1 test(classifier): add 200-document labeled corpus for Phase 5.6
- Create tests/fixtures/classifier/ with 200 synthetic PDFs:
  - 50 invoices with bill-to/ship-to, item tables, totals
  - 50 scientific papers with abstracts, sections, references
  - 50 contracts with clauses, legal terminology, signatures
  - 50 misc documents (8 receipts, 8 forms, 7 bank statements,
    7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines)

- Add MANIFEST.tsv mapping each document to its expected type
  with source URL and license (all MIT-0 synthetic data)

- Add scripts/generate_test_corpus.py to regenerate the corpus
  using reportlab for PDF generation

- Add tests/test_classifier_corpus.rs with validation harness:
  - test_corpus_manifest_validity: verifies manifest structure
    and file existence (PASSES)
  - test_classifier_corpus_accuracy: will validate precision/
    recall/F1 when classifier is implemented (SKIP for now)
  - test_classifier_reproducibility: will verify deterministic
    classification (SKIP for now)

- Add tests/fixtures/classifier/README.md documenting corpus
  structure, generation process, and acceptance criteria

Total corpus size: ~0.4 MB (each PDF < 5 KB)

Acceptance criteria (from plan.md Phase 5.6):
- Per-class precision and recall >= 0.85
- Macro-F1 >= 0.88
- Reproducibility: identical output for same document

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 07:16:02 -04:00