jedarden/pdftract

Author	SHA1	Message	Date
jedarden	ad29d9dadc	fix(pdftract-1j0f8): prevent newline accumulation in CLI reference generator The gen-cli-reference binary was accumulating extra blank lines after the <!-- AUTOGEN END --> marker on each regeneration because it preserved all content after the marker (including leading whitespace) and then added its own newlines. Fix: Trim leading whitespace from hand-curated content before appending. Also regenerated cli-reference.md to remove accumulated blank lines. Closes pdftract-1j0f8	2026-06-08 16:00:28 -04:00
jedarden	d0f52751ce	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs The indent trigger was using .abs() which fired on both increased indent (non-indented → indented) AND decreased indent (indented → non-indented). This caused drop-cap style paragraphs (indented first line, flush-left continuation) to incorrectly split into two blocks. Per plan Phase 4.4 heuristic #2, indent change should only trigger when the current line is MORE indented (to the right, larger x0) than the block average - i.e., a new paragraph starting after non-indented text. It should NOT trigger for decreased indent (first line indented, rest flush-left). Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold. Tests: - test_indented_first_line_new_block: PASS (non-indented → indented splits) - test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together) - All 179 line module tests: PASS	2026-06-07 13:43:19 -04:00
jedarden	dd2cb0b8c9	feat(pdftract-5lvpu): implement Swift SDK subprocess templates - Add Pdftract.swift.tera for main public API with type aliases - Update Methods.swift.tera with async throws functions and AsyncThrowingStream for streaming - Update Errors.swift.tera with 8 error types implementing LocalizedError - Update Types.swift.tera with Source enum, Options structs, and all Codable types - Update ConformanceTests.swift.tera with XCTest-based conformance suite - Update README.md.tera with full documentation (install, usage, error handling) - Update Package.swift.tera with macOS(.v13) and Linux platform support Closes pdftract-5lvpu	2026-06-01 10:47:20 -04:00
jedarden	246befd8d1	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing - Add jedarden/pdftract Composer package (sdk/php/) - Implement Client.php with proc_open subprocess execution - Add PSR-3 LoggerInterface integration (defaults to NullLogger) - Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt - Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt - Add exception classes: PdftractException base + 8 subclasses - Add PHPUnit conformance test suite - Add phpunit.xml configuration - Add composer.json with jedarden/pdftract package name - Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags) Also includes Ruby SDK scaffold from parallel workflow. Closes pdftract-2m3gl	2026-06-01 10:27:03 -04:00
jedarden	1c6f26ecaa	fix(bf-4mkhv): clean up unused imports in hash.rs The bead description mentioned compile errors in hash.rs from API drift, but those errors were either already fixed or misattributed. The API usage was already correct: - compute_fingerprint already takes 3 arguments with source - len() already propagates Result with ? - read_at method already used correctly - Catalog fields accessed via trailer correctly Only cleanup: removed unused std::fs::File and std::io imports. Verification: notes/bf-4mkhv.md	2026-06-01 09:43:48 -04:00
jedarden	3f8daba449	feat(bf-2he4t): assemble scanned fixtures corpus with ground-truth transcripts Complete scanned PDF fixtures corpus for OCR testing at 300 DPI with paired ground-truth transcripts. Corpus includes: - receipt-300dpi: Single-page receipt for AS-02 scenario - invoice-300dpi: Business invoice document - form-300dpi: Employment application form - doc-10page-300dpi: 10-page document for performance testing Each fixture has: - Vector PDF source (clean text rendering) - Rasterized scanned PDF (simulated 300 DPI scan) - Ground-truth transcript for WER verification Files: - tests/fixtures/scanned/receipt/receipt-300dpi{-scanned,.pdf,.txt} - tests/fixtures/scanned/documents/{invoice,form}-300dpi{-scanned,.pdf,.txt} - tests/fixtures/scanned/multi-page/doc-10page-300dpi{-scanned,.pdf,.txt} Also added native Rust generator (xtask/src/bin/gen_scanned_fixtures.rs) and updated generation script. Verification: notes/bf-2he4t.md Acceptance Criteria: - [x] Corpus assembled with 4 fixture types - [x] All fixtures at 300 DPI - [x] Ground truth transcripts paired with each fixture - [x] Files verified present and valid - [ ] WER < 3% verified with pdftract OCR pipeline (WARN: blocked by compilation errors) Closes bf-2he4t	2026-06-01 09:35:02 -04:00
jedarden	895f1ce43d	fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs Fix two compilation errors at lines 584 and 658 where code was calling .code on &String diagnostics. Replaced d.code.to_string() with direct Vec<String> clone since diagnostics is already Vec<String>. Acceptance criteria: - cargo check -p pdftract-cli emits no 'no field code' errors - serve.rs compiles cleanly	2026-06-01 04:14:05 -04:00
jedarden	804524a983	fix(pdftract-1wy98): box closure in MigrationRegistry to fix compilation - Add explicit type annotation to migrations HashMap - Box the identity closure to match Box<dyn Fn> signature - All 9 unit tests pass - CLI identity migration and error handling verified Verification: notes/pdftract-1wy98.md	2026-06-01 03:15:08 -04:00
jedarden	62a36ea756	docs(pdftract-3eohy): add rustdoc examples to Glyph and Span types - Add worked example to Glyph struct showing all 11 fields - Add worked example to Span struct showing all 10 fields - Examples use rust,no_run for internal dependencies - cargo doc passes with docs.rs feature set - Verification note added at notes/pdftract-3eohy.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 01:16:24 -04:00
jedarden	432514d350	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates Collects in-progress work across forms (Ch/Tx field handling, value_text edge cases), layout corrections, stream parser fixes, conformance test expansion, security audit test (TH-08), stream-decoder bomb fixture, debug examples reorganization under examples/debug/, sdk module scaffold, xtask CLI enhancements, and provenance entries for new fixtures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 09:48:14 -04:00
jedarden	225f96c241	fix(pyo3): correct extract_text_fn call in extract_markdown stub The extract_markdown stub was calling extract_text instead of extract_text_fn, causing a compilation error. This fixes the function name to match the exported function from extract_text.rs. This completes the extract_text PyO3 entry point implementation, which was already present in extract_text.rs and lib.rs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:28:25 -04:00
jedarden	84981f7c9b	fix(pdftract-25igv): fix emit! macro usage in codespace parser The emit! macro expects diagnostic codes without the DiagCode:: prefix. Changed three occurrences in codespace.rs: - Line 281: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace - Line 290: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace - Line 412: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace This fixes compilation errors that prevented the codebase from building. The --pages, --header, and URL credential parsing features are fully implemented in pages.rs, header.rs, and url.rs modules with comprehensive tests and integration in main.rs, grep/mod.rs, and hash.rs. References: pdftract-25igv, notes/pdftract-25igv.md	2026-05-28 07:29:33 -04:00
jedarden	23322f79d1	feat(pdftract-2qw5j): add explicit enum constraints to JSON Schema Add explicit enum constraints to page_type, severity, and confidence_source fields in the generated JSON Schema for better validation. Changes: - Modified xtask/src/bin/gen_schema.rs to add explicit enum constraints during schema generation via add_enum_constraints() function - page_type enum: ["text", "scanned", "mixed", "broken_vector", "blank", "figure_only"] - severity enum: ["info", "warning", "error", "fatal"] - confidence_source enum: ["native", "heuristic", "ocr"] - Regenerated docs/schema/v1.0/pdftract.schema.json with enum constraints - Added .github/workflows/schema-gen.yml CI workflow for schema validation The CI workflow validates: 1. Generated schema matches committed file (fails on diff) 2. JSON syntax is valid 3. Schema structure is correct ($id, $schema, title, $defs) 4. Enum constraints are present and have correct values This ensures schema changes are reviewable in PRs and forces developers to commit the updated schema when type definitions change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 02:47:54 -04:00
jedarden	016c738188	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via the schemars crate. Changes: - Add stable key sorting (sort_keys_recursive) for byte-identical output - Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json - Set title to "pdftract Output v1.0" - Add cargo alias `gen-schema` for convenient invocation - Emit schema to docs/schema/v1.0/pdftract.schema.json The schema is generated from the Rust types with schemars derives, ensuring the JSON schema is always in sync with the source types. Acceptance criteria: - cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json - Generated schema validates against JSON Schema Draft 2020-12 - Schema $id is the stable URL - Title is "pdftract Output v1.0" - Stable ordering: regenerating twice produces byte-identical output - All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.) Note: page_type and confidence_source enums are not yet implemented in the Rust types (marked as TODO in schema/mod.rs). These will be added by sibling beads pdftract-1ob and pdftract-1f8we respectively. Closes: pdftract-5nv9h	2026-05-24 17:31:16 -04:00
jedarden	05be70d36f	feat(pdftract-48ea): implement BrokenVector fixtures + WER delta CI gate Add two PDF/A fixtures for testing assisted-OCR (BrokenVector path): - Aligned fixture with correctly-positioned invisible text layer - Misaligned fixture with text layer offset by (10pt, 5pt) Extend ci/wer-gate.sh with WER validation for BrokenVector fixtures. Acceptance criteria: - Two BrokenVector fixtures committed (both 1.5 KB, well under 200 KB limit) - ci/wer-gate.sh extended with new fixture invocations - WER delta tests will skip gracefully when OCR environment unavailable Closes: pdftract-48ea Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 10:52:41 -04:00
jedarden	dd2d3502c6	feat(glyph-shape): implement font corpus fetch script and shape DB generation Implemented scripts/fetch-shape-corpus.sh for downloading open-licensed font corpus and generating glyph shape database for L4 recognition. - Script downloads fonts from build/shape-corpus-manifest.txt - Copies LICENSE files to build/font-licenses/ for compliance - Idempotent: skips already-present fonts - Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32) Generated build/glyph-shapes.json with 9,141 glyphs (> 4500 target): - DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic) - Roboto: 2,392 glyphs (Latin Basic, extended) - JetBrains Mono: 1,176 glyphs (monospace) - Source Code Pro: 1,124 glyphs (monospace) build/font-licenses/COMPLIANCE.md documents OFL derivative-work analysis for pHash data redistribution. Closes: pdftract-1i8n	2026-05-24 09:48:29 -04:00
jedarden	f08369bbf0	feat(xtask): implement gen-shape-db subcommand for glyph pHash database Add cargo xtask gen-shape-db command that walks font directories, rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs build/glyph-shapes.json. Implementation details: - Fontdue integration for TrueType/OpenType font loading - 32x32 bitmap rasterization with centering - DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold) - Character frequency data for collision resolution - Deduplication by (phash, char) pairs - Cross-character collision handling (keep higher-frequency char) - Sorted output by pHash ascending Artifacts: - build/frequency.json: Character frequency rankings - build/README.md: Command documentation and usage Acceptance criteria: - ✅ cargo xtask gen-shape-db --fonts <dir> produces valid JSON - ✅ Deterministic output (byte-identical on same inputs) - ✅ Fontdue integration and 32x32 rasterization - ✅ pHash computation via DCT - ⚠️ No system fonts for full integration test (documented) Closes: pdftract-2aq0	2026-05-24 05:40:44 -04:00
jedarden	e6bf3dd290	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter Implement per-word validation filter for assisted-OCR BrokenVector path. Changes: - Add SpanSource::OcrAssisted variant to hybrid.rs - Add Span::ocr_assisted() helper method - Implement validate_ocr_with_position_hints() in ocr.rs - 5pt distance threshold for position validation - 0.4 confidence cap for rejected words - Linear scan for nearest-neighbor lookup - Add unit tests for validation filter Closes: pdftract-3s2i Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 04:57:17 -04:00
jedarden	92e90af0b0	feat(pdftract-zy2jx): generate JSON Schema from Rust output types - Add schemars dependency to pdftract-core (v1.2) - Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode) - Create xtask/src/bin/gen_schema.rs for schema generation - Add gen-schema command to xtask main.rs - Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12 Schema includes: - $schema: "https://json-schema.org/draft/2020-12/schema" - $defs with all output type definitions - Proper type annotations for all fields Closes: pdftract-zy2jx	2026-05-24 01:29:14 -04:00
jedarden	9215892f95	feat(pdftract-2zw): page classification fixtures + integration tests + reproducibility gate Implement page classification test fixtures, integration tests, and reproducibility CI gate for Phase 5.1.5. Fixtures (4 total, 3.6 KB): - vector_pure: Pure text PDF (born-digital) - scanned_single: Image-only PDF (scanned) - brokenvector_pdfa: Invisible text + image - hybrid_header_body: Text header + scanned body Integration tests (crates/pdftract-core/tests/page_classification.rs): - test_page_classification_fixtures: Validates classification correctness - test_page_classification_reproducibility: CI gate for byte-identical JSON - test_fixture_files_exist_and_size: Infrastructure validation - test_expected_json_validity: JSON schema validation Acceptance criteria: - ✅ 4 fixtures present in tests/fixtures/page_class/ - ✅ cargo test page_classification passes (4/4 tests) - ✅ Reproducibility gate fails on perturbation - ✅ Fixtures total < 1 MB (3.6 KB) Refs: pdftract-2zw, plan.md lines 1840-1844 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 15:04:05 -04:00
jedarden	c621947686	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. Changes: - CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB) - CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB) - CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow - xtask: Implement memory-ceiling command with peak RSS sampling - Add perf fixtures (100-page, 10k-page) for memory testing - Add run-fuzz-with-limits.sh for local fuzz testing with memory caps - Register perf fixtures in PROVENANCE.md Memory budgets enforced: - Buffered 100-page PDF: < 512 MB - Streaming mode: < 256 MB (constant in page count) - Adversarial fixtures: < 1 GB hard ceiling Closes bf-1g1fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 13:22:55 -04:00
jedarden	58a177d3b4	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files Add dual MIT OR Apache-2.0 licensing at repo root with proper copyright notices. Configure all workspace and non-workspace crates to declare the license. Wire license files into Python wheels and Docker images. Files added: - LICENSE-MIT: MIT License with "Copyright (c) 2026 Jed Cabanero" - LICENSE-APACHE: Apache License 2.0 (verbatim from apache.org) Files modified: - Cargo.toml: Updated authors to "Jed Cabanero <me@jedcabanero.com>" - crates/pdftract-py/pyproject.toml: Added license-files to maturin config - crates/pdftract-cer-diff/Cargo.toml: Added license.workspace = true - xtask/Cargo.toml: Added license = "MIT OR Apache-2.0" - fuzz/Cargo.toml: Added license = "MIT OR Apache-2.0" - Cargo-dist.toml: Created to include license files in binary archives - notes/pdftract-aawrz.md: Verification note Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:36:28 -04:00
jedarden	25ddcba641	docs(pdftract-4iier): complete per-profile README documentation Complete the per-profile README documentation for all 9 built-in profiles: - slide_deck: Add Known Limitations section - form: Add Match Criteria Summary and Known Limitations - bank_statement: Add Match Criteria Summary and Known Limitations - legal_filing: Add Match Criteria Summary and Known Limitations - book_chapter: Add Match Criteria Summary and Known Limitations The xtask doc-profile skeleton generator already existed and provides automated README generation from profile.yaml files. All READMEs now follow the consistent 6-section structure: 1. Title and description 2. Match Criteria Summary (prose description) 3. Extracted Fields (table with field details) 4. Known Limitations (document-specific edge cases) 5. Sample Input Pointer (fixture references) 6. Configuration Tips (override instructions) Acceptance criteria: - All nine README files exist at profiles/builtin/<type>/README.md - Each follows the consistent 6-section structure - Extracted Fields tables match the corresponding profile YAML - Known Limitations is non-empty and document-specific - Sample Input Pointer links to actual fixtures - xtask doc-profile skeleton generator exists Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-05-18 00:19:44 -04:00
jedarden	b535638104	feat(pdftract-2bsfc): implement document catalog parser with PageLabels number tree Implement the document catalog parser (/Root traversal) for PDF documents. The catalog parser extracts all key entries from the document catalog including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names, Metadata, PageLabels, OCProperties, OpenAction, AA, and Version. Key structures: - MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects - PageLabelStyle: enum for all label styles (D, R, r, A, a) - PageLabel: single page label with style, prefix, and start value - PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support - OcProperties: stub for OCG implementation (delegated to dedicated bead) - Catalog: main catalog struct with all required and optional fields Number tree implementation: - Parses /Nums arrays (leaf nodes with alternating key-value pairs) - Supports /Kids arrays (internal nodes for recursive tree traversal) - Provides get_label_with_start() and get_label() methods for lookup - Correctly formats roman numerals (uppercase/lowercase) and letter sequences All 27 tests pass including proptests for fuzzing robustness (INV-8). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:45:45 -04:00
jedarden	17f581897f	fix(pdftract-4iier): correct typo in scientific_paper README and fix xtask path handling - Fix typo: "scific_paper" -> "scientific_paper" in fixture path - Fix xtask path resolution: use relative path ".." to access workspace root - Fix xtask format string: remove unused profile_name placeholder - Add workspace exclusion to xtask/Cargo.toml for standalone build These are minor improvements to the existing per-profile README documentation that was already created in commit `8b5dd4f`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:22:39 -04:00
jedarden	8b5dd4febb	docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles This commit creates user-facing documentation for each built-in profile: - Profile YAML files defining match criteria, priority, and extracted fields - Per-profile READMEs with match criteria summary, extracted fields table, known limitations, sample input pointers, and configuration tips - xtask skeleton generator for automated README generation Profiles documented: - invoice: Commercial invoices with line items, vendor/customer, totals - receipt: POS receipts with items, payment method - contract: Legal contracts with parties, effective date, term, signatures - scientific_paper: Academic papers with title, authors, abstract, DOI, references - slide_deck: Presentation slides with title, presenter, date, slide titles - form: Fillable forms (degenerate case: uses Phase 7.4 form_fields) - bank_statement: Bank statements with account info, period, balances, transactions - legal_filing: Court filings with case number, court, parties, filing date, docket - book_chapter: Book chapters with title, chapter number, author, section headings Closes: pdftract-4iier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:19:00 -04:00
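
Commit d0f52751ce describes changing the indent trigger from an absolute difference to a signed one. The sketch below illustrates that idea only; it is not the repository's code. The function names and the 5.0 pt threshold used in the usage note are hypothetical, while `line_x0` and `block_avg_x0` follow the wording of the commit message.

```rust
/// Hypothetical helper (not the pdftract API): returns true when a line
/// should start a new block because it begins further to the right than
/// the block's average left edge by more than `threshold`.
fn indent_starts_new_block(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    // Signed difference: positive only when the line is MORE indented.
    line_x0 - block_avg_x0 > threshold
}

/// The behaviour the commit message says was removed, kept for comparison:
/// the absolute value fires on both increased and decreased indent.
fn indent_trigger_with_abs(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    (line_x0 - block_avg_x0).abs() > threshold
}
```

With an assumed threshold of 5.0, a line at x0 = 90.0 following a block averaging 72.0 triggers a split under both versions. A flush-left continuation at 72.0 after an indented first line at 90.0 triggers only the `.abs()` version, which is the drop-cap split the commit reports fixing.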

26 commits
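
Commit b535638104 mentions formatting roman numerals and letter sequences for page labels. As a rough illustration of what such formatting can look like, here is a minimal sketch. The function names `to_roman` and `to_letters` are hypothetical and not taken from the repository. The letter scheme assumes the PDF convention that pages 1 to 26 are labelled A to Z, 27 to 52 are AA to ZZ, and so on.

```rust
/// Hypothetical helper: format `n` as a roman numeral (styles /R and /r).
/// Returns an empty string for 0, which has no roman representation.
fn to_roman(mut n: u32, uppercase: bool) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ];
    let mut out = String::new();
    for &(value, symbol) in TABLE.iter() {
        // Greedily subtract the largest value that still fits.
        while n >= value {
            out.push_str(symbol);
            n -= value;
        }
    }
    if uppercase { out } else { out.to_lowercase() }
}

/// Hypothetical helper: format `n` as a letter label (styles /A and /a).
/// 1..=26 map to A..Z, 27..=52 to AA..ZZ, and so on; 0 yields "".
fn to_letters(n: u32, uppercase: bool) -> String {
    if n == 0 {
        return String::new();
    }
    let index = ((n - 1) % 26) as u8; // which letter of the alphabet
    let repeat = ((n - 1) / 26 + 1) as usize; // how many times it repeats
    let base = if uppercase { b'A' } else { b'a' };
    let ch = (base + index) as char;
    std::iter::repeat(ch).take(repeat).collect()
}
```

For example, `to_roman(4, true)` yields "IV" and `to_letters(27, true)` yields "AA". Prefix and start-value handling, which the commit also mentions, are left out of this sketch.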