jedarden/pdftract

Author	SHA1	Message	Date
jedarden	47df769e4b	feat(pdftract-5ls35): implement JSON-Lines output sink for grep Implement the --json output sink for pdftract grep with JSON-Lines format (one match per line). Includes MatchEvent, FileOnlyEvent, CountEvent structs and JsonSink line-buffered writer. Key features: - MatchEvent with all fields (path, page_index, bbox, match_text, span_text, span_confidence, pdf_fingerprint, crosses_spans) - crosses_spans omitted when false via skip_serializing_if - NaN/Infinity in span_confidence replaced with null - page_index is 0-based (machine convention) - FileOnlyEvent for -l mode, CountEvent for -c mode - Line-buffered writes with immediate flush - JSON schema at docs/schema/v1.0/grep-jsonl.schema.json Closes: pdftract-5ls35	2026-05-25 02:05:17 -04:00
jedarden	016c738188	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via the schemars crate. Changes: - Add stable key sorting (sort_keys_recursive) for byte-identical output - Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json - Set title to "pdftract Output v1.0" - Add cargo alias `gen-schema` for convenient invocation - Emit schema to docs/schema/v1.0/pdftract.schema.json The schema is generated from the Rust types with schemars derives, ensuring the JSON schema is always in sync with the source types. Acceptance criteria: - cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json - Generated schema validates against JSON Schema Draft 2020-12 - Schema $id is the stable URL - Title is "pdftract Output v1.0" - Stable ordering: regenerating twice produces byte-identical output - All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.) Note: page_type and confidence_source enums are not yet implemented in the Rust types (marked as TODO in schema/mod.rs). These will be added by sibling beads pdftract-1ob and pdftract-1f8we respectively. Closes: pdftract-5nv9h	2026-05-24 17:31:16 -04:00
jedarden	84b4448648	feat(pdftract-5qca): implement form_fields JSON output + schema integration Phase 7.4.5 implementation: Wire combined Vec<(String, FormFieldValue)> from combiner into document-level /form_fields JSON output with tagged union schema. - Add FormFieldJson, FormFieldTypeJson, FormFieldValueJson, ChoiceValueJson to schema - Add form_fields: Vec<FormFieldJson> to ExtractionResult (always emitted, empty when none) - Implement acro_field_to_value() converter for Phase 7.4.2 type-specific extraction - Wire form field extraction in extract_pdf(): walk AcroForm, extract XFA, combine with XFA-wins - Add convert_form_field_to_json() helper for FormFieldValue → FormFieldJson conversion - Update docs/schema/v1.0/pdftract.schema.json with form_fields $defs and required field - Add form_fields_to_markdown() to markdown module for Form Fields footer table Schema shape: /form_fields is array of {name, type, value, default?, page_index?, rect?, required, read_only, multiline?, max_length?, options?, multi_select?, selected?, state_name?, pushbutton?, radio?}. Type field is tagged enum: "text", "button", "choice", "signature". Value field varies by type (string\|boolean\|string\|array\|uint\|null). Closes: pdftract-5qca Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 14:36:03 -04:00
jedarden	67b3fde4d6	feat(pdftract-j6yd): implement signatures array output + validation_status enum + schema integration Add document-level /signatures array output per Phase 7.3 of the plan. Changes: - Add SignatureJson struct to schema module with all signature metadata fields - Update ExtractionResult to include signatures: Vec<SignatureJson> - Integrate signature extraction into extract_pdf() pipeline - Update result_to_json() to include signatures in JSON output - Update JSON schema with signatures array and SignatureJson definition - Add markdown sink signatures footer when signatures are present - Add comprehensive tests for signature JSON serialization and validation Acceptance criteria: - Schema tests: 5/5 signature JSON tests pass - Markdown sink emits Signatures footer when count > 0 - PyO3 binding automatically handles Vec<SignatureJson> via serde - docs/schema/v1.0/pdftract.schema.json updated with signatures shape Verification note: notes/pdftract-j6yd.md Closes: pdftract-j6yd	2026-05-24 04:05:34 -04:00
jedarden	92e90af0b0	feat(pdftract-zy2jx): generate JSON Schema from Rust output types - Add schemars dependency to pdftract-core (v1.2) - Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode) - Create xtask/src/bin/gen_schema.rs for schema generation - Add gen-schema command to xtask main.rs - Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12 Schema includes: - $schema: "https://json-schema.org/draft/2020-12/schema" - $defs with all output type definitions - Proper type annotations for all fields Closes: pdftract-zy2jx	2026-05-24 01:29:14 -04:00
jedarden	d14ec92fcb	feat(pdftract-3zhf): add unified TableDetector::detect entry point Add unified detect() method to TableDetector that combines both line-based and borderless table detection pipelines. This completes the coordinator bead for Phase 7.2: Table Detection and Structure Reconstruction. All child beads (7.2.1-7.2.6) are closed: - 7.2.1: Line-based detection (path segment clustering) - 7.2.2: Borderless detection (x0 alignment heuristic) - 7.2.3: Span-to-cell assignment (centroid containment) - 7.2.4: Header row detection (bold + StructTree TH) - 7.2.5: Merged cell detection (missing interior edges) - 7.2.6: Table JSON output schema integration Critical tests pass: - 5x3 bordered table (15 cells extracted) - Merged header cell colspan=3 - Borderless 3-column table detection - Two-page table continuation detection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 00:51:59 -04:00

6 commits