jedarden/pdftract

Author	SHA1	Message	Date
jedarden	dd2d3502c6	feat(glyph-shape): implement font corpus fetch script and shape DB generation Implemented scripts/fetch-shape-corpus.sh for downloading open-licensed font corpus and generating glyph shape database for L4 recognition. - Script downloads fonts from build/shape-corpus-manifest.txt - Copies LICENSE files to build/font-licenses/ for compliance - Idempotent: skips already-present fonts - Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32) Generated build/glyph-shapes.json with 9,141 glyphs (> 4500 target): - DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic) - Roboto: 2,392 glyphs (Latin Basic, extended) - JetBrains Mono: 1,176 glyphs (monospace) - Source Code Pro: 1,124 glyphs (monospace) build/font-licenses/COMPLIANCE.md documents OFL derivative-work analysis for pHash data redistribution. Closes: pdftract-1i8n	2026-05-24 09:48:29 -04:00
jedarden	f08369bbf0	feat(xtask): implement gen-shape-db subcommand for glyph pHash database Add cargo xtask gen-shape-db command that walks font directories, rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs build/glyph-shapes.json. Implementation details: - Fontdue integration for TrueType/OpenType font loading - 32x32 bitmap rasterization with centering - DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold) - Character frequency data for collision resolution - Deduplication by (phash, char) pairs - Cross-character collision handling (keep higher-frequency char) - Sorted output by pHash ascending Artifacts: - build/frequency.json: Character frequency rankings - build/README.md: Command documentation and usage Acceptance criteria: - ✅ cargo xtask gen-shape-db --fonts <dir> produces valid JSON - ✅ Deterministic output (byte-identical on same inputs) - ✅ Fontdue integration and 32x32 rasterization - ✅ pHash computation via DCT - ⚠️ No system fonts for full integration test (documented) Closes: pdftract-2aq0	2026-05-24 05:40:44 -04:00
jedarden	e6bf3dd290	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter Implement per-word validation filter for assisted-OCR BrokenVector path. Changes: - Add SpanSource::OcrAssisted variant to hybrid.rs - Add Span::ocr_assisted() helper method - Implement validate_ocr_with_position_hints() in ocr.rs - 5pt distance threshold for position validation - 0.4 confidence cap for rejected words - Linear scan for nearest-neighbor lookup - Add unit tests for validation filter Closes: pdftract-3s2i Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 04:57:17 -04:00
jedarden	92e90af0b0	feat(pdftract-zy2jx): generate JSON Schema from Rust output types - Add schemars dependency to pdftract-core (v1.2) - Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode) - Create xtask/src/bin/gen_schema.rs for schema generation - Add gen-schema command to xtask main.rs - Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12 Schema includes: - $schema: "https://json-schema.org/draft/2020-12/schema" - $defs with all output type definitions - Proper type annotations for all fields Closes: pdftract-zy2jx	2026-05-24 01:29:14 -04:00
jedarden	9215892f95	feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate Implement page classification test fixtures, integration tests, and reproducibility CI gate for Phase 5.1.5. Fixtures (4 total, 3.6 KB): - vector_pure: Pure text PDF (born-digital) - scanned_single: Image-only PDF (scanned) - brokenvector_pdfa: Invisible text + image - hybrid_header_body: Text header + scanned body Integration tests (crates/pdftract-core/tests/page_classification.rs): - test_page_classification_fixtures: Validates classification correctness - test_page_classification_reproducibility: CI gate for byte-identical JSON - test_fixture_files_exist_and_size: Infrastructure validation - test_expected_json_validity: JSON schema validation Acceptance criteria: - ✅ 4 fixtures present in tests/fixtures/page_class/ - ✅ cargo test page_classification passes (4/4 tests) - ✅ Reproducibility gate fails on perturbation - ✅ Fixtures total < 1 MB (3.6 KB) Refs: pdftract-2zw, plan.md lines 1840-1844 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 15:04:05 -04:00
jedarden	c621947686	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. Changes: - CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB) - CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB) - CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow - xtask: Implement memory-ceiling command with peak RSS sampling - Add perf fixtures (100-page, 10k-page) for memory testing - Add run-fuzz-with-limits.sh for local fuzz testing with memory caps - Register perf fixtures in PROVENANCE.md Memory budgets enforced: - Buffered 100-page PDF: < 512 MB - Streaming mode: < 256 MB (constant in page count) - Adversarial fixtures: < 1 GB hard ceiling Closes bf-1g1fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 13:22:55 -04:00
jedarden	58a177d3b4	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright notices. Configure all workspace and non-workspace crates to declare the license. Wire license files into Python wheels and Docker images. Files added: - LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero" - LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org) Files modified: - Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>" - crates/pdftract-py/pyproject.toml: Added license-files to maturin config - crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true - xtask/Cargo.toml: Added license = "MIT OR Apache-2.0" - fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0" - Cargo-dist.toml: Created to include license files in binary archives - notes/pdftract-aawrz.md: Verification note Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:36:28 -04:00
jedarden	25ddcba641	docs(pdftract-4iier): complete per-profile README documentation Complete the per-profile README documentation for all 9 built-in profiles: - slide_deck: Add Known Limitations section - form: Add Match Criteria Summary and Known Limitations - bank_statement: Add Match Criteria Summary and Known Limitations - legal_filing: Add Match Criteria Summary and Known Limitations - book_chapter: Add Match Criteria Summary and Known Limitations The xtask doc-profile skeleton generator already existed and provides automated README generation from profile.yaml files. All READMEs now follow the consistent 6-section structure: 1. Title and description 2. Match Criteria Summary (prose description) 3. Extracted Fields (table with field details) 4. Known Limitations (document-specific edge cases) 5. Sample Input Pointer (fixture references) 6. Configuration Tips (override instructions) Acceptance criteria: - All nine README files exist at profiles/builtin/<type>/README.md - Each follows the consistent 6-section structure - Extracted Fields tables match the corresponding profile YAML - Known Limitations is non-empty and document-specific - Sample Input Pointer links to actual fixtures - xtask doc-profile skeleton generator exists Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-05-18 00:19:44 -04:00
jedarden	b535638104	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree Implement the document catalog parser (/Root traversal) for PDF documents. The catalog parser extracts all key entries from the document catalog including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names, Metadata, PageLabels, OCProperties, OpenAction, AA, and Version. Key structures: - MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects - PageLabelStyle: enum for all label styles (D, R, r, A, a) - PageLabel: single page label with style, prefix, and start value - PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support - OcProperties: stub for OCG implementation (delegated to dedicated bead) - Catalog: main catalog struct with all required and optional fields Number tree implementation: - Parses /Nums arrays (leaf nodes with alternating key-value pairs) - Supports /Kids arrays (internal nodes for recursive tree traversal) - Provides get_label_with_start() and get_label() methods for lookup - Correctly formats roman numerals (uppercase/lowercase) and letter sequences All 27 tests pass including proptests for fuzzing robustness (INV-8). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:45:45 -04:00
jedarden	17f581897f	fix(pdftract-4iier): correct typo in scientific_paper README and fix xtask path handling - Fix typo: "scific_paper" -> "scientific_paper" in fixture path - Fix xtask path resolution: use relative path ".." to access workspace root - Fix xtask format string: remove unused profile_name placeholder - Add workspace exclusion to xtask/Cargo.toml for standalone build These are minor improvements to the existing per-profile README documentation that was already created in commit `8b5dd4f`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:22:39 -04:00
jedarden	8b5dd4febb	docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles This commit creates user-facing documentation for each built-in profile: - Profile YAML files defining match criteria, priority, and extracted fields - Per-profile READMEs with match criteria summary, extracted fields table, known limitations, sample input pointers, and configuration tips - xtask skeleton generator for automated README generation Profiles documented: - invoice: Commercial invoices with line items, vendor/customer, totals - receipt: POS receipts with items, payment method - contract: Legal contracts with parties, effective date, term, signatures - scientific_paper: Academic papers with title, authors, abstract, DOI, references - slide_deck: Presentation slides with title, presenter, date, slide titles - form: Fillable forms (degenerate case: uses Phase 7.4 form_fields) - bank_statement: Bank statements with account info, period, balances, transactions - legal_filing: Court filings with case number, court, parties, filing date, docket - book_chapter: Book chapters with title, chapter number, author, section headings Closes: pdftract-4iier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:19:00 -04:00

11 commits