jedarden/pdftract

Author	SHA1	Message	Date
jedarden	d0f52751ce	fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs The indent trigger was using .abs() which fired on both increased indent (non-indented → indented) AND decreased indent (indented → non-indented). This caused drop-cap style paragraphs (indented first line, flush-left continuation) to incorrectly split into two blocks. Per plan Phase 4.4 heuristic #2, indent change should only trigger when the current line is MORE indented (to the right, larger x0) than the block average - i.e., a new paragraph starting after non-indented text. It should NOT trigger for decreased indent (first line indented, rest flush-left). Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold. Tests: - test_indented_first_line_new_block: PASS (non-indented → indented splits) - test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together) - All 179 line module tests: PASS	2026-06-07 13:43:19 -04:00
jedarden	246befd8d1	feat(pdftract-2m3gl): implement PHP SDK with Packagist publishing - Add jedarden/pdftract Composer package (sdk/php/) - Implement Client.php with proc_open subprocess execution - Add PSR-3 LoggerInterface integration (defaults to NullLogger) - Add 9 contract methods: extract, extractText, extractMarkdown, extractStream, search, getMetadata, hash, classify, verifyReceipt - Add readonly model classes: Document, Page, Metadata, Fingerprint, Classification, Match, Receipt - Add exception classes: PdftractException base + 8 subclasses - Add PHPUnit conformance test suite - Add phpunit.xml configuration - Add composer.json with jedarden/pdftract package name - Add .ci/argo-workflows/pdftract-php-publish.yaml (Packagist auto-discovery from git tags) Also includes Ruby SDK scaffold from parallel workflow. Closes pdftract-2m3gl	2026-06-01 10:27:03 -04:00
jedarden	1c6f26ecaa	fix(bf-4mkhv): clean up unused imports in hash.rs The bead description mentioned compile errors in hash.rs from API drift, but those errors were either already fixed or misattributed. The API usage was already correct: - compute_fingerprint already takes 3 arguments with source - len() already propagates Result with ? - read_at method already used correctly - Catalog fields accessed via trailer correctly Only cleanup: removed unused std::fs::File and std::io imports. Verification: notes/bf-4mkhv.md	2026-06-01 09:43:48 -04:00
jedarden	38d1deb57c	wip: intermediate state from previous work	2026-05-29 08:25:23 -04:00
jedarden	bb7146cffe	fix(pdftract-2uk9z): wrap native module results in typed Python objects The native PyO3 module returns raw dicts via pythonize, but the Python SDK API expects typed dataclass objects (Document, Page, Metadata, etc.) to be consistent with the subprocess fallback and test expectations. Updated wrapper functions in __init__.py to convert native results: - extract(): wraps dict in Document.from_dict() - extract_stream(): wraps yielded page dicts in Page.from_dict() - get_metadata(): wraps dict in Metadata() - hash(): wraps string in Fingerprint.from_string() - classify(): wraps dict in Classification() - search(): wraps yielded match dicts in Match The native PyO3 entry points (extract, extract_text, extract_stream) were already implemented with: - extract: uses extract_pdf + pythonize for PyDict conversion - extract_text: uses extract_text for plain String return - extract_stream: uses extract_pdf_streaming with custom StreamIterator All kwargs parsing with strict validation (unknown kwargs raise TypeError) was already in place. Acceptance criteria: - pdftract.extract() returns Document object with pages/metadata - pdftract.extract_text() returns plain text string - pdftract.extract_stream() yields Page objects - Unknown kwarg raises TypeError	2026-05-28 21:18:38 -04:00
jedarden	225f96c241	fix(pyo3): correct extract_text_fn call in extract_markdown stub The extract_markdown stub was calling extract_text instead of extract_text_fn, causing a compilation error. This fixes the function name to match the exported function from extract_text.rs. This completes the extract_text PyO3 entry point implementation, which was already present in extract_text.rs and lib.rs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:28:25 -04:00
jedarden	f85e5149dd	feat(pdftract-91e1i): HTTP fetch sequence implementation Implement orchestration layer connecting HttpRangeSource to Phase 1.3 xref resolver and Phase 1.4 document model for remote PDF access: - Document::open_remote() public API for remote PDF loading - Progressive tail fetch (16 KB → 1 MB) for startxref location - Xref forward-scan disabled for remote sources (via is_remote check) - Page-by-page on-demand fetch via HttpRangeSource caching - Resource lazy load through XrefResolver cache - HEAD probe with 405 fallback, no Content-Length handling Acceptance criteria: ✅ open_remote(url) returns Document with correct page count ✅ HEAD failure modes (405, no Content-Length, 401) handled ✅ xref forward-scan disabled for remote (is_remote check) ✅ Page-by-page on-demand fetch (HttpRangeSource LRU cache) ✅ INV-8 maintained (all errors return Result) Files modified: - crates/pdftract-core/src/document.rs (Document::open_remote, from_source) - crates/pdftract-core/src/remote.rs (progressive tail fetch) - crates/pdftract-core/src/lib.rs (re-exports) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 13:17:00 -04:00
jedarden	dd2d3502c6	feat(glyph-shape): implement font corpus fetch script and shape DB generation Implemented scripts/fetch-shape-corpus.sh for downloading open-licensed font corpus and generating glyph shape database for L4 recognition. - Script downloads fonts from build/shape-corpus-manifest.txt - Copies LICENSE files to build/font-licenses/ for compliance - Idempotent: skips already-present fonts - Fixed xtask center_bitmap_32x32 overflow bug (width/height > 32) Generated build/glyph-shapes.json with 9,141 glyphs (> 4500 target): - DejaVu Sans: 4,459 glyphs (Latin Extended, Greek, Cyrillic) - Roboto: 2,392 glyphs (Latin Basic, extended) - JetBrains Mono: 1,176 glyphs (monospace) - Source Code Pro: 1,124 glyphs (monospace) build/font-licenses/COMPLIANCE.md documents OFL derivative-work analysis for pHash data redistribution. Closes: pdftract-1i8n	2026-05-24 09:48:29 -04:00
jedarden	61babb0991	test(bf-5dnh1): add memory ceiling enforcement for proptests Add scripts/run-proptest-with-limits.sh to run property tests under cgroup MemoryMax, ensuring pathological cases fail fast with allocation errors instead of OOMing the host. Coordinated with bf-1g1fd (CI memory-ceiling gate) to provide local development parity with CI enforcement. Changes: - Add scripts/run-proptest-with-limits.sh (cgroup v2/v1 wrapper) - Add scripts/README.md documenting memory ceiling enforcement Memory limits: - Proptests: 2048 MB cgroup MemoryMax (local) - Fuzz tests: 1536 MB cgroup + 1024 MB libfuzzer RSS (existing) Proptest input size caps (already in place): - Lexer/object parser: up to 10 KB inputs - Xref/stream parsers: up to 100 KB inputs - Nested structures: depth-limited Refs: bf-5dnh1, bf-1g1fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 13:39:04 -04:00
jedarden	c621947686	feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF extraction, analogous to cargo-bloat for binary size. Changes: - CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB) - CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB) - CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow - xtask: Implement memory-ceiling command with peak RSS sampling - Add perf fixtures (100-page, 10k-page) for memory testing - Add run-fuzz-with-limits.sh for local fuzz testing with memory caps - Register perf fixtures in PROVENANCE.md Memory budgets enforced: - Buffered 100-page PDF: < 512 MB - Streaming mode: < 256 MB (constant in page count) - Adversarial fixtures: < 1 GB hard ceiling Closes bf-1g1fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 13:22:55 -04:00
jedarden	660a9401ef	feat(pdftract-59zz): implement MCP bearer token ingress channels and TH-03 enforcement Implements secure MCP bearer-token ingress channels and TH-03 startup abort enforcement per plan lines 874, 915-921, 922-924. ## Changes - Add `--auth-token-file PATH` flag (RECOMMENDED channel) - Add `PDFTRACT_MCP_TOKEN` env var support - Reject `--auth-token VALUE` unless `PDFTRACT_INSECURE_CLI_TOKEN=1` - Enforce TH-03: require token for non-loopback bind addresses (exit 78) - Loopback exemption for 127.0.0.0/8 and ::1/128 ## Files - crates/pdftract-cli/src/mcp/auth.rs: Token resolution with priority order - crates/pdftract-cli/src/mcp/bind.rs: TH-03 bind security check - crates/pdftract-cli/src/mcp/server.rs: MCP server entry point - crates/pdftract-cli/src/mcp/mod.rs: Module exports - crates/pdftract-cli/src/main.rs: CLI arguments - crates/pdftract-cli/Cargo.toml: Add secrecy, tempfile dependencies ## Acceptance Criteria - ✅ --auth-token-file PATH flag implemented - ✅ PDFTRACT_MCP_TOKEN env var resolved - ✅ --auth-token VALUE rejected (exit 64) unless PDFTRACT_INSECURE_CLI_TOKEN=1 - ✅ mcp --bind ADDR with non-loopback ADDR and no token: aborts with exit 78 - ✅ mcp --bind ADDR with loopback ADDR and no token: succeeds - ✅ mcp --bind ADDR with token: succeeds regardless of address - ⏸️ Inspector token: Phase 7.9 (not yet implemented) - ⏸️ TH-03 test: separate bead Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 02:47:54 -04:00
jedarden	7fed5a0a6f	docs(pdftract-5l9m): add CI validation script and verification note Add CI validation script for checking unauthorized expose_secret() call sites. The script validates that all uses of expose_secret() are in approved locations (SecretFingerprint and test code). Also add verification note summarizing the bead completion status. Per pdftract-5l9m acceptance criteria: - CI grep guard rejects unauthorized expose_secret() call sites - Verification documents existing SecretString wrapping status Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-05-18 01:05:33 -04:00
jedarden	3af009440e	fix(pdftract-5z5d8): fix provenance validation script Fixed scripts/check-provenance.sh to properly validate PROVENANCE.md against actual fixture files. The script was failing silently due to subshell EXIT trap removing temp files before parent could read them, and arithmetic expansion returning exit code 1 on zero value. Changes: - Replaced subshell pipes with process substitution - Moved temp file cleanup to after reading - Added validated variable initialization - Added || true to prevent exit on zero arithmetic All 200 classifier corpus fixtures have valid provenance entries with matching SHA256 hashes. PROVENANCE.md already existed with complete documentation. Refs: pdftract-5z5d8 Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-05-17 23:43:37 -04:00
jedarden	633eba61b1	test(classifier): add 200-document labeled corpus for Phase 5.6 - Create tests/fixtures/classifier/ with 200 synthetic PDFs: - 50 invoices with bill-to/ship-to, item tables, totals - 50 scientific papers with abstracts, sections, references - 50 contracts with clauses, legal terminology, signatures - 50 misc documents (8 receipts, 8 forms, 7 bank statements, 7 slide decks, 7 legal filings, 6 book excerpts, 7 magazines) - Add MANIFEST.tsv mapping each document to its expected type with source URL and license (all MIT-0 synthetic data) - Add scripts/generate_test_corpus.py to regenerate the corpus using reportlab for PDF generation - Add tests/test_classifier_corpus.rs with validation harness: - test_corpus_manifest_validity: verifies manifest structure and file existence (PASSES) - test_classifier_corpus_accuracy: will validate precision/ recall/F1 when classifier is implemented (SKIP for now) - test_classifier_reproducibility: will verify deterministic classification (SKIP for now) - Add tests/fixtures/classifier/README.md documenting corpus structure, generation process, and acceptance criteria Total corpus size: ~0.4 MB (each PDF < 5 KB) Acceptance criteria (from plan.md Phase 5.6): - Per-class precision and recall >= 0.85 - Macro-F1 >= 0.88 - Reproducibility: identical output for same document Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 07:16:02 -04:00
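
Note on commit 633eba61b1: the Phase 5.6 acceptance criteria it quotes (per-class precision and recall >= 0.85, macro-F1 >= 0.88) can be checked from per-class confusion counts. The following is a minimal sketch of that arithmetic only, not the project's harness; `ClassCounts`, `macro_f1`, and `meets_phase_5_6` are hypothetical names that do not appear in the log.

```rust
/// Per-class confusion counts. Hypothetical type, for illustration only.
#[derive(Clone, Copy)]
struct ClassCounts {
    tp: u32,   // true positives
    fp: u32,   // false positives
    fneg: u32, // false negatives
}

/// Precision = tp / (tp + fp); defined as 0.0 when the class was never predicted.
fn precision(c: ClassCounts) -> f64 {
    let denom = c.tp + c.fp;
    if denom == 0 { 0.0 } else { c.tp as f64 / denom as f64 }
}

/// Recall = tp / (tp + fneg); defined as 0.0 when the class has no examples.
fn recall(c: ClassCounts) -> f64 {
    let denom = c.tp + c.fneg;
    if denom == 0 { 0.0 } else { c.tp as f64 / denom as f64 }
}

/// F1 is the harmonic mean of precision and recall.
fn f1(c: ClassCounts) -> f64 {
    let (p, r) = (precision(c), recall(c));
    if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
}

/// Macro-F1 is the unweighted mean of the per-class F1 scores.
fn macro_f1(classes: &[ClassCounts]) -> f64 {
    if classes.is_empty() {
        return 0.0;
    }
    classes.iter().map(|&c| f1(c)).sum::<f64>() / classes.len() as f64
}

/// Applies the thresholds quoted in the commit message:
/// every class needs precision and recall >= 0.85, and macro-F1 must be >= 0.88.
fn meets_phase_5_6(classes: &[ClassCounts]) -> bool {
    classes
        .iter()
        .all(|&c| precision(c) >= 0.85 && recall(c) >= 0.85)
        && macro_f1(classes) >= 0.88
}
```

Because macro-F1 weights every class equally, the 50-document "misc" class counts as much as each of the other three classes, so one weak class can fail the gate even when overall accuracy is high.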

14 commits
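
Note on commit d0f52751ce: the fix it describes replaces an absolute-value comparison with a signed one, so that only a line indented further right than the block average starts a new block. A minimal sketch of the two predicates follows, assuming plain `f32` coordinates. The function names are hypothetical; only `line_x0`, `block_avg_x0`, and the threshold comparison come from the commit message, and the actual threshold value is not stated in the log.

```rust
/// Post-fix behaviour as described in the commit message: split only when the
/// current line is MORE indented (larger x0) than the block's average left edge.
fn indent_starts_new_block(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    // Signed difference: positive means the line sits further to the right.
    line_x0 - block_avg_x0 > threshold
}

/// Pre-fix behaviour, kept here only for comparison: `.abs()` makes the trigger
/// fire on decreased indent as well, which split paragraphs whose first line is
/// indented and whose continuation lines are flush-left.
fn indent_starts_new_block_abs(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    (line_x0 - block_avg_x0).abs() > threshold
}
```

With an illustrative threshold of 5.0, a line at x0 = 90.0 following a block averaging 72.0 splits under both versions. A flush-left line at 72.0 following an indented first line at 90.0 splits only under the `.abs()` version, which is the regression the commit removes.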