jedarden/pdftract

Author	SHA1	Message	Date
jedarden	47df769e4b	feat(pdftract-5ls35): implement JSON-Lines output sink for grep Implement the --json output sink for pdftract grep with JSON-Lines format (one match per line). Includes MatchEvent, FileOnlyEvent, CountEvent structs and JsonSink line-buffered writer. Key features: - MatchEvent with all fields (path, page_index, bbox, match_text, span_text, span_confidence, pdf_fingerprint, crosses_spans) - crosses_spans omitted when false via skip_serializing_if - NaN/Infinity in span_confidence replaced with null - page_index is 0-based (machine convention) - FileOnlyEvent for -l mode, CountEvent for -c mode - Line-buffered writes with immediate flush - JSON schema at docs/schema/v1.0/grep-jsonl.schema.json Closes: pdftract-5ls35	2026-05-25 02:05:17 -04:00
jedarden	2ccdaecda1	docs(pdftract-5nare): add comprehensive FAQ with 24 questions Added docs/user-docs/src/faq.md with 24 FAQ entries covering: - General questions (what is pdftract, extract vs extract_text, JS execution) - Installation and setup (proxy, system requirements) - Usage (broken_vector, OCR speed, page ranges, images, batch processing) - Configuration (custom profiles, OCR accuracy, confidence scores) - Output formats (Markdown, tables, metadata, passwords) - Troubleshooting (errors, empty output, debugging, memory usage) Each answer is 1-3 paragraphs with cross-links to fuller docs. mdBook builds successfully. Acceptance criteria: - PASS: docs/user-docs/src/faq.md exists - PASS: 24 questions covered (target: 15-25) - PASS: Each answer is 1-3 paragraphs - PASS: Cross-links work - PASS: mdBook renders cleanly Closes: pdftract-5nare	2026-05-25 00:22:48 -04:00
jedarden	016c738188	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via the schemars crate. Changes: - Add stable key sorting (sort_keys_recursive) for byte-identical output - Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json - Set title to "pdftract Output v1.0" - Add cargo alias `gen-schema` for convenient invocation - Emit schema to docs/schema/v1.0/pdftract.schema.json The schema is generated from the Rust types with schemars derives, ensuring the JSON schema is always in sync with the source types. Acceptance criteria: - cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json - Generated schema validates against JSON Schema Draft 2020-12 - Schema $id is the stable URL - Title is "pdftract Output v1.0" - Stable ordering: regenerating twice produces byte-identical output - All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.) Note: page_type and confidence_source enums are not yet implemented in the Rust types (marked as TODO in schema/mod.rs). These will be added by sibling beads pdftract-1ob and pdftract-1f8we respectively. Closes: pdftract-5nv9h	2026-05-24 17:31:16 -04:00
jedarden	84b4448648	feat(pdftract-5qca): implement form_fields JSON output + schema integration Phase 7.4.5 implementation: Wire combined Vec<(String, FormFieldValue)> from combiner into document-level /form_fields JSON output with tagged union schema. - Add FormFieldJson, FormFieldTypeJson, FormFieldValueJson, ChoiceValueJson to schema - Add form_fields: Vec<FormFieldJson> to ExtractionResult (always emitted, empty when none) - Implement acro_field_to_value() converter for Phase 7.4.2 type-specific extraction - Wire form field extraction in extract_pdf(): walk AcroForm, extract XFA, combine with XFA-wins - Add convert_form_field_to_json() helper for FormFieldValue → FormFieldJson conversion - Update docs/schema/v1.0/pdftract.schema.json with form_fields $defs and required field - Add form_fields_to_markdown() to markdown module for Form Fields footer table Schema shape: /form_fields is array of {name, type, value, default?, page_index?, rect?, required, read_only, multiline?, max_length?, options?, multi_select?, selected?, state_name?, pushbutton?, radio?}. Type field is tagged enum: "text", "button", "choice", "signature". Value field varies by type (string|boolean|string|array|uint|null). Closes: pdftract-5qca Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 14:36:03 -04:00
jedarden	d9d21df157	docs(pdftract-653ah): add runbook integration for pdftract doctor - Created docs/operations/manual-platform-smoke.md with comprehensive smoke test runbook for KU-12 quarterly manual platform testing - Added troubleshooting table covering all 14 doctor checks - Cross-referenced runbook from installation.md and quickstart.md - Added CI gate test (doctor_runbook_coverage.rs) to verify troubleshooting table completeness Acceptance criteria: ✓ Step 1: pdftract doctor as first section in runbook ✓ Troubleshooting table covers all FAIL-capable checks ✓ installation.md mentions pdftract doctor with runbook link ✓ quickstart.md uses pdftract doctor as first example command ✓ CI gate parses runbook and asserts all checks are present ✓ mdBook build succeeds ✓ No broken internal links Closes: pdftract-653ah	2026-05-24 13:26:31 -04:00
jedarden	b6b9ed74a2	docs(pdftract-3om3): add MCP client configuration guide Add docs/integrations/mcp-clients.md with copy-paste-ready configuration snippets for Claude Desktop, Cursor, Continue, and a custom SDK template. Each section includes: - Per-OS config file locations - JSON/YAML snippets - Validation steps - Minimum client version verified Also includes: - Multi-client HTTP mode setup - TH-03 compliance note (auth required for public binds) - Troubleshooting for common failure modes - Cross-references to sdk-invocation.md, KU-5, OQ-07 Closes: pdftract-3om3 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 13:10:33 -04:00
jedarden	eb025f7b1a	docs(pdftract-3wrx): add release signing strategy note Resolves OQ-10: document v1.0.0 stance on binary signing. - Linux: GPG-signed (implemented) - macOS: Deferred to v1.1+ ($99/yr Apple Developer Program) - Windows: Deferred to v1.1+ ($200-400/yr Authenticode cert) - All platforms: SLSA Level 2 attestation (already committed) Closes: pdftract-3wrx Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 11:12:56 -04:00
jedarden	94b02dedfe	docs(pdftract-1tjn): finalize OpenType MATH and formula extraction research note v1.0 - Add Section 11: Formula-Region Detection Algorithm with pseudo-code - Add Section 12: Inline vs Display Formula Classification rules - Add Section 13: LaTeX-Like Reconstruction (Best-Effort) with feature-flag guidance - Add Section 14: Profile Classifier Signal `structural.has_math` definition - Add Section 15: Validation Methodology with arXiv fixture corpus strategy File grows from 168 to 426 lines. All acceptance criteria PASS. Closes: pdftract-1tjn	2026-05-24 10:41:39 -04:00
jedarden	8d6a1a07df	docs(pdftract-372e): finalize watermark and background separation research note v1.0 - Added Section 2: Combined Watermark Scoring Algorithm with signal definitions, pseudo-code, threshold tuning, and weight overrides - Added Section 4: Font-Based Signals (font size, color, weight/family) - Added Section 11: Text Output Mode behavior (pre/post Phase 7) - Added Section 12: Edge Cases (stamps vs watermarks, raster watermarks, form profile override, reading-order interaction) - Added Section 13: Validation Corpus with empirical baseline results - Expanded Section 10 with WatermarkSignals struct containing individual signal scores - File grows from 198 to 546 lines Closes: pdftract-372e Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 10:33:37 -04:00
jedarden	e25a4fc78d	docs(pdftract-10cf): finalize table structure reconstruction research note v1.0 Added complete pseudo-code listings for: - Line-based grid reconstruction algorithm (path segment collection, collinear merging, intersection finding, cell synthesis) - Borderless table detection via vertical projection profiles and column separator inference - Cell content assignment via centroid containment Also added version history section documenting v0.9 -> v1.0 changes. Closes: pdftract-10cf	2026-05-24 09:58:03 -04:00
jedarden	57df42f478	docs(pdftract-3b1x): finalize sdk-invocation.md with subprocess contract and TH-07 compliance Add comprehensive "Subprocess Contract" section documenting: - argv layout with canonical form - stdin discipline (password ingress, PDF bytes from stdin) - stdout/stderr discipline (what goes where, what never gets logged) - Exit code taxonomy (0, 64-78) with TH-03 (exit 78) and TH-07 (exit 64) refs - Environment variable pass-through (PDFTRACT_PASSWORD, PDFTRACT_MCP_TOKEN, etc.) - --progress-json event schema (ndjson format, all event types) - --capture-diagnostics archive layout (zip/tar, contained files, scrubbing rules) Update all language examples (Python, Node.js, Go, Ruby, Java, Rust) with TH-07-compliant password handling: - Pass password via PDFTRACT_PASSWORD env var (subprocess) - Pass password via multipart form field (HTTP) - Never use --password VALUE flag (rejected unless opt-in) Add progress JSON parsing examples for Python, Node.js, and Rust showing real-world event-driven progress tracking. File grows from 1100 to 1837 lines (+737 lines, ~67%). Closes: pdftract-3b1x	2026-05-24 07:48:09 -04:00
jedarden	1791bb6d80	docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment - Add workspace layout section documenting pdftract-core as the only direct dependency, with pdftract-cli, pdftract-py, and pdftract-inspector-ui as siblings - Update binary distribution table with correct target triples (musl not gnu for Linux) - Add KU-12 cross-platform test limitation section with verbatim wording from plan: "Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release" - Add Argo CI templates section (pdftract-cargo-build, pdftract-maturin-build) - Add feature flag composition section with tiers, dependencies, and binary size budgets - Add cross-references to sdk-invocation.md, sdk-contract.md, ocr-language-packs.md - Fix clippy warnings in build.rs files (expect_fun_call, get_first, manual_strip, unused imports) Closes: pdftract-32y9 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 06:38:23 -04:00
jedarden	67b3fde4d6	feat(pdftract-j6yd): implement signatures array output + validation_status enum + schema integration Add document-level /signatures array output per Phase 7.3 of the plan. Changes: - Add SignatureJson struct to schema module with all signature metadata fields - Update ExtractionResult to include signatures: Vec<SignatureJson> - Integrate signature extraction into extract_pdf() pipeline - Update result_to_json() to include signatures in JSON output - Update JSON schema with signatures array and SignatureJson definition - Add markdown sink signatures footer when signatures are present - Add comprehensive tests for signature JSON serialization and validation Acceptance criteria: - Schema tests: 5/5 signature JSON tests pass - Markdown sink emits Signatures footer when count > 0 - PyO3 binding automatically handles Vec<SignatureJson> via serde - docs/schema/v1.0/pdftract.schema.json updated with signatures shape Verification note: notes/pdftract-j6yd.md Closes: pdftract-j6yd	2026-05-24 04:05:34 -04:00
jedarden	d174725241	docs(pdftract-5vhp): bring word-boundary-reconstruction.md to v1.0 final-pass Complete documentation of the adaptive word-boundary algorithm including: - Initial threshold = 0.25 * font_size - 20-glyph median adjustment - 1.5x median formula - Full Tc/Tw/Tz (character-spacing, word-spacing, horizontal-scaling) corrections Expanded from 202 lines to 899 lines with: - Section 3.1: Tc/Tw/Tz formula with explicit parameter table - Section 3.2: Text-space vs. device-space comparison per plan line 1550 - Section 4: Adaptive algorithm specification (20-glyph window, 1.5× median, outlier exclusion) - Section 11: Complete pseudo-code (data structures, main loop, detection, threshold computation) - Section 12: Edge cases (ZWJ, combining marks, CJK, justified text, monospaced, RTL, ligatures, soft hyphens, tabs) - Section 13: Validation methodology (corpus at tests/fixtures/word-boundary-corpus/, 141 PDFs, 8 categories) - Section 14: Implementation checklist and references Closes: pdftract-5vhp	2026-05-24 03:55:43 -04:00
jedarden	28c31ba0a1	feat(pdftract-vk0gc): implement markdown anchors with parser regex Add --md-anchors flag that emits HTML comment markers before each block in Markdown output, allowing downstream tools to map excerpts back to precise PDF locations. Changes: - Add markdown module with Anchor struct and parse_anchors() function - Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=[([\d.,]+)] kind=(\w+) --> - Add markdown_anchors: bool to ExtractionOptions - Add --md-anchors CLI flag - Implement block_to_markdown() and page_to_markdown() functions - Add comprehensive documentation in docs/integrations/markdown-anchors.md - 16 unit tests pass, including roundtrip test Closes: pdftract-vk0gc	2026-05-24 02:49:16 -04:00
jedarden	cf8f04e3ec	docs(pdftract-26r8): finalize glyph recognition research note v1.0 - Reorganize around the four-level Unicode recovery cascade from plan - Document all cascade levels with confidence scores: - Level 1: ToUnicode CMap (1.0) - Level 2: Encoding + AGL (0.9) - Level 3: Font fingerprint cache (0.85) - Level 4: Glyph shape recognition (0.7) - Add shape database design (pHash algorithm, query, format) - Document pHash collision tie-break rules (frequency-based) - Add Type 3 font handling section - Cross-reference Phase 2.2, 2.4, 2.5 and OQ-02 File grows from 112 to 210 lines. Covers all acceptance criteria. Closes: pdftract-26r8	2026-05-24 02:10:06 -04:00
jedarden	92e90af0b0	feat(pdftract-zy2jx): generate JSON Schema from Rust output types - Add schemars dependency to pdftract-core (v1.2) - Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode) - Create xtask/src/bin/gen_schema.rs for schema generation - Add gen-schema command to xtask main.rs - Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12 Schema includes: - $schema: "https://json-schema.org/draft/2020-12/schema" - $defs with all output type definitions - Proper type annotations for all fields Closes: pdftract-zy2jx	2026-05-24 01:29:14 -04:00
jedarden	bf37f0f05f	docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields This commit brings docs/research/extraction-output-schema.md to v1.0 final-pass specification, aligning with Phase 6.1 deliverables and plan requirements. Key additions: - page_number field documented with page_index relationship (1-based vs 0-based) - page_type enum expanded with all six values: text, scanned, mixed, broken_vector, blank, figure_only — with broken_vector cross-referenced to Phase 5.5 - Block kind enum fully documented: paragraph, heading, list, table, figure, caption, code, formula, watermark, header, footer - Attachments schema with base64 contentEncoding and 50MB truncation rule - Profile-based classification fields (document_type, document_type_confidence, document_type_reasons, profile_name, profile_version, profile_fields) - Schema Version Compatibility section with additive-evolution rules - JSON Schema cross-reference throughout Format changes: - Restructured with ATX headings (## for sections) - Added explicit field tables for each major schema section - Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json - Grew from 81 lines to 304 lines per acceptance criteria Plan references: - Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659 - INV-9 page_type taxonomy stability Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>	2026-05-24 00:59:23 -04:00
jedarden	d14ec92fcb	feat(pdftract-3zhf): add unified TableDetector::detect entry point Add unified detect() method to TableDetector that combines both line-based and borderless table detection pipelines. This completes the coordinator bead for Phase 7.2: Table Detection and Structure Reconstruction. All child beads (7.2.1-7.2.6) are closed: - 7.2.1: Line-based detection (path segment clustering) - 7.2.2: Borderless detection (x0 alignment heuristic) - 7.2.3: Span-to-cell assignment (centroid containment) - 7.2.4: Header row detection (bold + StructTree TH) - 7.2.5: Merged cell detection (missing interior edges) - 7.2.6: Table JSON output schema integration Critical tests pass: - 5x3 bordered table (15 cells extracted) - Merged header cell colspan=3 - Borderless 3-column table detection - Two-page table continuation detection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 00:51:59 -04:00
jedarden	33372c23ae	fix(pdftract-3c4i): export detect_merged_cells from table module The detect_merged_cells function was implemented but not exported from the table module, making it inaccessible to library users. This commit adds the function to the public API exports. Also adds a verification note documenting the complete implementation and the export fix. Acceptance criteria status: - All 6 merged cell detection tests pass - Public Cell.rowspan/colspan fields exist with default 1 - Absorbed cells are excluded from output - Bbox of merged cell covers absorbed cells - Borderless tables NO-OP with diagnostic Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 00:23:14 -04:00
jedarden	26bdd255c8	feat(pdftract-ilen): implement header row detection with bold+TH support Implement header row detection for tables using two signals: 1. Bold font detection (fully implemented) 2. StructTree TH detection (stub pending MCID tracking) Bold detection: - is_bold_font(): detects bold fonts from PostScript name patterns - is_cell_bold(): checks if all non-whitespace content in a cell is bold - is_bold_header_row(): validates rows with >=2 bold cells - count_header_rows(): counts contiguous bold headers from top - Cell::mark_header_rows(): sets is_header_row flag on cells TH detection (stub): - is_th_header_row(): placeholder for StructTree TH detection Requires MCID tracking on TableSpan (future work) Will use ParentTree to map MCIDs to StructElems Will verify TR > TH chain structure Combined detection: - is_header_row(): combines bold and TH signals - Bold wins on conflict per body data design principle Documentation: - Updated table-structure-reconstruction.md with full header detection spec - Documented implemented vs pending signals - Added implementation notes for TH detection Tests: - 45 tests covering all bold detection scenarios - Tests for multi-row headers (contiguous from top) - Tests for single-cell row exclusion - Tests for empty/whitespace cell handling - Placeholder tests for TH detection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:32:54 -04:00
jedarden	9b5fbc9b5e	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction - Add decode_page_content_streams() function for per-page lazy decode - Update extract_page_from_dict() to support lazy stream decoding - Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding - Fix borrow checker issue in LazyPageIter::next() This ensures content streams are decoded lazily per page and dropped immediately after processing, keeping peak RSS flat across page count. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 12:30:26 -04:00
jedarden	9fca24c77a	docs(plan): SDKs are monorepo members, not separate repos Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/ in this monorepo (single source of truth), generated via pdftract sdk codegen and published to language registries from here. Retire the legacy standalone repos. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:21:45 -04:00
jedarden	2251f8a9c0	docs(plan): make bounded peak-RSS a CI-gated target; default max_decompress_bytes 2GB->512MB Add a Memory targets table as a first-class acceptance criterion alongside Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB (root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc under rayon page parallelism). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 23:25:50 -04:00
jedarden	bb5346b305	docs(pdftract-58kz): add security policy documentation Add comprehensive SECURITY.md covering: - Supported versions policy - Private vulnerability reporting (email + GitHub) - 90-day disclosure window with timelines - CVE assignment via GitHub Security Advisories - In-scope and out-of-scope vulnerability classes - Safe harbor policy for good-faith researchers Add security issue template redirecting users to private reporting. Add Security section to CONTRIBUTING.md and README.md with links to SECURITY.md. Add docs/security/pgp-public-key.asc placeholder with generation instructions. References: bead pdftract-58kz, plan line 3433 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 19:39:24 -04:00
jedarden	9456d8e231	feat(pdftract-5omc): implement per-language conformance test runner pattern Implements the conformance test runner pattern for all 10 SDKs as specified in the plan (line 3547). Each SDK now has a dedicated conformance test runner. Created: - tests/sdk-conformance/report-schema.json: JSON schema for conformance reports - docs/notes/sdk-conformance-runner.md: Pattern documentation and reference - crates/pdftract-cli/tests/conformance.rs: Rust cargo test target - tests/conformance/test_conformance.py: Python pytest harness - tests/conformance/conformance.test.ts: Node.js vitest runner - tests/conformance/conformance_test.go: Go go test runner - tests/conformance/ConformanceTest.java: Java JUnit 5 runner - tests/conformance/ConformanceTests.cs: .NET xUnit runner - tests/conformance/conformance.c: C standalone binary - tests/conformance/conformance_test.rb: Ruby minitest runner - tests/conformance/ConformanceTest.php: PHP PHPUnit runner - tests/conformance/ConformanceTests.swift: Swift XCTest runner All runners implement: - Loading of tests/sdk-conformance/cases.json - Execution of test cases with language-native method invocations - Comparison of results against expected values with numeric tolerances - Emission of machine-readable conformance-report.json - Non-zero exit on failures/errors for CI gating Acceptance criteria: - PASS: All 10 SDKs have language-specific runners - PASS: Runners consume shared cases.json - PASS: Runners emit JSON reports matching schema - PASS: Runners exit non-zero on failure - WARN: README integration pending SDK repo creation - WARN: Stub implementations return placeholder results References: - Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner" - Plan line 3589: "Conformance suite results published as Argo artifact" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-5omc	2026-05-18 01:32:24 -04:00
jedarden	857f928732	feat(pdftract-5omc): implement SDK conformance test runner pattern Implement the conformance test runner pattern that every SDK will implement to validate against the shared test suite. - Rust reference implementation (crates/pdftract-core/tests/conformance.rs) * Full test suite loader and executor * Comparison engine with min/max, string constraints, tolerances * Skip logic for unsupported features and schema versions * Report generation in JSON format - CLI compare subcommand (crates/pdftract-cli/src/main.rs) * pdftract compare - Compare actual vs expected with tolerances * Cross-language comparison tool to avoid reimplementations - Documentation (docs/conformance/sdk-contract.md) * Complete pattern specification with pseudocode * Per-language runner locations * CI integration requirements - Python reference stub (tests/python-conformance/test_conformance.py) * Full pytest-based implementation following the pattern Closes: pdftract-5omc	2026-05-18 01:22:23 -04:00
jedarden	a34f9c18d0	docs(pdftract-1g87): create mdBook scaffolding for user documentation - book.toml with title, authors, build directory, edit-url-template - src/SUMMARY.md with complete TOC for all planned sections - src/introduction.md: what pdftract does and doesn't do (Non-Goals) - src/installation.md: cargo, pip, Homebrew, Docker; KU-12 caveat verbatim - src/quickstart.md: five-minute walkthrough with executable commands - 39 draft placeholder files for CLI reference, schema, profiles, SDKs, advanced topics, troubleshooting, FAQ mdbook build completes cleanly with zero warnings (linkcheck optional). See notes/pdftract-1g87.md for verification details. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 00:38:51 -04:00
jedarden	5e66846288	docs(pdftract-147a): author SDK contract specification Add comprehensive SDK contract specification at docs/notes/sdk-contract.md. This document serves as the constitutional specification for all pdftract SDK implementations across all languages. The contract defines: - Method surface (9 methods mirroring CLI/MCP tools) - Error mapping (CLI exit codes → native exceptions) - Versioning compatibility rules (MAJOR lock, MINOR flexibility) - Option-naming conventions (CLI flag → language-native case) - Native type-mapping requirements (Document, Page, Span, Block, Match, Fingerprint, Classification) - Async conventions per language - Conformance enforcement (100% pass required) - Change policy (ADR required for contract changes) Verification note: notes/pdftract-147a.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:13:55 -04:00
jedarden	9f27d16f25	docs(phase-0.1): verify pdftract-ci scaffolding complete Verified the pdftract-ci WorkflowTemplate exists in declarative-config and is correctly synced to the iad-ci cluster. All scaffolding requirements met for Phase 0.1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 03:24:36 -04:00
jedarden	7035706068	docs(plan): fix 3 HIGH gaps + 3 LOW items from Round 5 gap review HIGH: - Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE) - Specify base64 encoding for attachment data field in Phase 7.5 - Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only); add max_decompress_gb to CLI/Python/HTTP API surfaces LOW: - Split log+env_logger into two dep matrix rows for accurate crate count - Add full_render to Python keyword args and HTTP form fields (with no-op note) - Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:30:02 -04:00
jedarden	2ba51a8a73	docs(plan): fix 4 gaps from Round 4 gap review - Fix quick-xml feature gate: move from ocr to default (XMP conformance detection) - Make page_number schema update an explicit Phase 6.1 deliverable - Add PageClass → page_type mapping table; define broken_vector as valid output value - Fix CI test matrix: musl target excludes ocr/python features; glibc runs --all-features Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:24:12 -04:00
jedarden	2d194a4b1b	docs(plan): fix 15 gaps from Round 3 gap review HIGH: - Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer) - Remove num_cpus reference (rayon default pool sizing is sufficient) - Update dep count target to < 30 direct crates (< 20 was violated by plan's own list) - Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7 - Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains) MEDIUM: - Document header/footer streaming mode limitation: first 3 pages emit as paragraph - Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature - Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3 - Specify /Contents array concatenation in Phase 1.4 page tree - Add page rotation un-rotation step after Phase 3 glyph bbox computation - Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg - Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor - Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine) - Add wordlist-bloom to Feature flags bullet list LOW: - Clarify extract_stream() yields page dicts only, not header/footer frames Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:18:33 -04:00
jedarden	eb799c0956	docs(plan): fix 21 gaps from Round 2 gap review CRITICAL: - Fix deskew step: pixDeskew operates on grayscale, not binarized image HIGH: - Add sha2 crate to dep matrix (needed for font fingerprint hashing) - Fix bloomfilter feature: wordlist-bloom (optional), not default conditional - Add build-dependencies subsection (phf_codegen, serde_json) - Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic - Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent - Add strsim crate for Levenshtein in header/footer deduplication - Add tokio::task::spawn_blocking bridge for axum→rayon hand-off - Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX/CCITT_UNSUPPORTED diagnostics - Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS) MEDIUM: - Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic - Add Standard-14 font skip for Level 3 fingerprinting (no embedded program) - Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep) - Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list - Add ocg_present to Phase 6.1 metadata field list - Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields - Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields - Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7) - Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology) - Remove frame-index notation from NDJSON streaming critical test - Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 18:05:26 -04:00
jedarden	bcccc98fd7	docs(plan): fix 30 gaps from Round 1 gap review CRITICAL fixes: - Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix) - Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed - Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch - Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2) - Add aes + rc4 crates under new decrypt feature; document crypto dependency HIGH fixes: - Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source) - Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition - Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching - Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation - Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params - Add JavaScript detection spec to Phase 1.4 (all four JS action locations) - Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives) - Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB - Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach - Add ConfidenceSource enum → schema string mapping table in Phase 4.1 MEDIUM fixes: - Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable - Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection - Define Color enum with CSS hex conversion rules in Phase 3.1 - Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>) - Specify NDJSON buffer Condvar blocking behavior at window saturation - Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets - Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition - Add code and formula block kind detection heuristics to Phase 4.4 - Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS) - Add linearized PDF detection and dual-xref merge to Phase 1.3 - Add HTTP 413 to error table with custom JSON rejection handler - Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate) LOW fixes: - Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5 - Reorder preprocessing pipeline: contrast normalization before binarization (was after) - Add CIDToGIDMap stream form: 2-byte big-endian GID array Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:45:04 -04:00
jedarden	d161d109b3	docs(plan): revise plan to center accuracy/speed/weight as hard targets - Add Primary Objectives section with CI-gated measurable targets: accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s, 10x vs pdfminer), weight (<4MB default binary, <20 default deps) - Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional; default build is core extraction + CLI only - Add Phase 4.7: text readability validation and correction pipeline (ligature repair, hyphenation, mojibake detection, readability scoring) - Make pdfium-render explicitly optional (full-render feature) vs. the always-present direct image compositing path - Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber) - Remove jpeg-decoder and whichlang from dependency matrix (unnecessary) - Rename implementation-plan.md → plan.md (matches CLAUDE.md reference) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 17:07:48 -04:00
jedarden	8753630bc3	Add parallel extraction research and comprehensive research index New research document covering parallel extraction architecture: rayon page-level parallelism, Arc<> shared xref/font/object-stream caches, RwLock font cache design, Tesseract thread-local OCR pool, semaphore memory budget, ordered NDJSON streaming slot array, and catch_unwind error isolation per page. Also adds docs/research-index.md: a 622-line navigable index of all 83 research documents grouped into 9 thematic categories, with a "Start Here" reading path, per-phase implementation reading tables, and an alphabetical lookup table covering every document. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:30:35 -04:00
jedarden	92e6196ac5	Add research: Ruby/furigana typography, PDF/VT variable printing Two new research documents covering Japanese Ruby text and East Asian typography (tagged/untagged furigana extraction, Kinsoku Shori spacing, full-width normalization, tate-chu-yoko, CJK/Latin boundary detection, ruby_text output field) and PDF/VT variable and transactional printing (DPart hierarchy traversal, per-record extraction model, DPM metadata, variable vs. static content classification, postal address extraction, records array output schema). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:24:21 -04:00
jedarden	e3b72efc83	Add research: Southeast Asian scripts, OpenType MATH formula extraction Two new research documents covering Southeast Asian script extraction (Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table exploitation for formula extraction (MathConstants for fraction/ subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML output generation, GlyphAssembly reconstruction, alternative text and MathJax XMP source recovery). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:21:48 -04:00
jedarden	4e72c66763	Add research: Indic scripts, adversarial parser security Two new research documents covering Indic script extraction (abugida structure, ToUnicode CMap failures for shaped glyphs, ActualText fast-path, GSUB lookup reversal, pre-base matra reordering, virama placement, Tesseract fallback with script-specific models) and adversarial input handling (decompression bombs, circular references, malformed stream lengths, path traversal in attachments, content stream loop detection, O(n log n) algorithm requirements, output sanitization). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:18:03 -04:00
jedarden	12fad41596	Add research: span merging, Unicode normalization, implementation plan Two new research documents covering the glyph-to-span-to-block assembly pipeline (inter-operator merging, adaptive word gap threshold, column detection, ligature bbox splitting, multi-granularity output) and Unicode post-processing (NFC normalization, selective NFKC decomposition for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ handling, combining character reordering). Also adds docs/plan/implementation-plan.md: the full 7-phase Rust implementation roadmap covering core parser, font/encoding pipeline, content stream processing, text assembly, OCR integration, API surface, and advanced features — with crate selections, complexity ratings, test strategy, and v0.1–v1.0 release milestones. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:15:14 -04:00
jedarden	6b96d8d637	Add research: error handling, PDF/A guarantees, output schema, generator quirks Four new extraction research documents covering permissive error handling with extraction quality signaling (five error classes, circular reference detection, memory limits), PDF/A conformance level guarantees and fast-path optimization (Level A skips OCR and layout heuristics), the complete extraction output schema (span/block/table/NDJSON streaming/ versioning), and per-generator extraction quirks (Word/LibreOffice/ InDesign/LaTeX/Chrome/Ghostscript/scanners). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:07:13 -04:00
jedarden	a89fef64fc	Add research: article threads, resource dictionaries, catalog, hyperlinks Four new extraction research documents covering PDF article thread traversal for multi-flow magazine layouts, resource dictionary inheritance and ResourceStack semantics for nested Form XObjects, document catalog and page tree structure (UserUnit, Contents array, page inheritance), and hyperlink/named destination extraction with QuadPoints anchor text and link density classification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:04:00 -04:00
jedarden	16cb1bd61d	Add research: xref parsing, object model, font descriptors, PDF/UA-2 Four new extraction research documents covering cross-reference table and xref stream parsing with error recovery, PDF object model and lexer correctness (all 8 types, string escapes, stream /Length recovery), FontDescriptor fields and embedded font data (Type1/TrueType/CFF/OT), and PDF/UA-2 / PDF 2.0 structure changes (MathML, NFC normalization, new structure types, artifact classification improvements). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 16:01:34 -04:00
jedarden	6c6ec6a4ca	Add research: color management, text metrics, PDF/X, content stream operators Four new extraction research documents covering ICC profile and color space luminance estimation for text visibility, precise text state tracking and bounding box computation (Tc/Tw/Tz/TL, font units, TJ kerning, baseline clustering), PDF/X prepress handling (OutputIntent, TrimBox, spot colors, article threading), and a complete content stream operator reference (BT/ET, Tj/TJ/'/", BI/ID/EI, BX/EX, marked content). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:59:02 -04:00
jedarden	516ca154aa	Add research: page labels, government forms, book publishing, filter decoding Four new extraction research documents covering page label/PageLabels number tree and outline/bookmark tree extraction, government form PDF patterns (IRS, USCIS, court filings, classification markings), book and publishing PDF structure (running heads, footnotes, index extraction), and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global segments, CCITTFax, JPX, error boundaries). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:55:08 -04:00
jedarden	5ff918b178	Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms Four new extraction research documents covering PDF portfolio and attachment enumeration (ZUGFeRD, PDF/A-3 AFRelationship), incremental update structure and xref chaining, PDF/UA tagged PDF deep dive with all 36 structure types and MCID mechanics, and JavaScript/AcroForm/XFA field extraction without script execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:45:59 -04:00
jedarden	006dfb286c	Add research: color visibility, medical/scientific, multilingual, digital signatures Four new extraction research documents covering color space and contrast analysis for text visibility, medical/scientific document structure (ICH E3, IMRaD, FDA labeling, eCTD), multilingual mixed-script extraction with UBA bidi handling and CJK vertical text, and digital signature metadata extraction with DocMDP integrity context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:41:43 -04:00
jedarden	eac3235291	Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs Four new extraction research documents covering text rendering modes (Tr 0-7 including invisible OCR layers), legal/financial document extraction patterns, character-level confidence aggregation with output schema, and PDF/E engineering document handling (CAD, GD&T, schematics). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:35:48 -04:00
jedarden	8f8138a65e	Add research: font subsetting, LaTeX patterns, redaction detection Three new extraction research documents covering subset font Unicode recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and proper vs. improper redaction detection with output schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:30:52 -04:00


60 commits