jedarden/pdftract

Author	SHA1	Message	Date
jedarden	90d1b9a83d	test(pdftract-4c8qu): add page_label tests and fix JSON schema - Add test_page_json_with_page_labels_roman_numerals: verifies page_label serialization with roman numeral values (i, ii, iii, etc) - Add test_page_json_without_page_labels_absent: verifies page_label is absent (null) when PDF has no /PageLabels - Add test_page_json_page_index_and_page_number_both_present: verifies both page_index and page_number are always present and page_number = page_index + 1 - Add test_page_json_roundtrip_with_all_fields: verifies full roundtrip serde preservation of all PageJson fields - Update docs/schema/v1.0/pdftract.schema.json PageResult definition: - Add page_number field (1-based, = page_index + 1) - Add page_label field (optional, from /PageLabels number tree) - Add width and height fields (page geometry in points) - Add rotation field (0, 90, 180, 270 degrees) - Add type field with enum: text, scanned, mixed, broken_vector, blank, figure_only - Update required fields to include all page-level fields Acceptance criteria: ✅ Page serializes with both page_index AND page_number ✅ PDF with /PageLabels [{S: "r"}] produces page_label "i", "ii", "iii" etc ✅ PDF without /PageLabels -> page_label absent ✅ JSON Schema enum for page_type includes all values ✅ Roundtrip serde Page test passes Closes: pdftract-4c8qu	2026-05-25 14:43:31 -04:00
jedarden	9abc386cce	feat(pdftract-3h9xo): implement threads JSON output + schema integration Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration. Changes: - Added ThreadJson and BeadJson structs to schema/mod.rs - Added thread_to_json() function to threads/mod.rs - Added build_page_ref_to_index() helper to parser/pages.rs - Added threads field to ExtractionResult in extract.rs - Implemented Phase 7.7 extraction logic with discover_threads/walk_beads - Added threads_to_markdown() and collapse_page_ranges() to markdown.rs - Updated JSON schema with ThreadJson and BeadJson definitions - Added thread_to_py() and bead_to_py() conversions in pdftract-py - Exported ThreadJson, BeadJson from lib.rs All 32 threads module tests pass. All 35 markdown tests pass. Verification: notes/pdftract-3h9xo.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 13:40:15 -04:00
jedarden	b7851b9d92	feat(pdftract-4hle): implement 7.6.4 links + annotations JSON output Add JSON conversion functions, schema integration, and extraction pipeline wiring for Phase 7.6 hyperlink and annotation extraction. Changes: - Create annotation/json.rs with conversion functions (link_to_json, annotation_to_json, fit_type_to_json, sort_links, sort_annotations) - Add 13 comprehensive tests covering all link/annotation types - Wire Phase 7.6 annotation extraction into main extract.rs pipeline - Update docs/schema/v1.0/pdftract.schema.json with LinkJson, AnnotationJson, DestArrayJson, DestTypeJson, AnnotationSpecificJson - Add links to root schema properties and required fields - Add annotations array to PageResult Schema definitions include all 8 PDF fit types (XYZ, Fit, FitH, FitV, FitR, FitB, FitBH, FitBV) and all major annotation subtypes (TextMarkup, Stamp, FreeText, Text, Ink, Line, Polygon, FileAttachment). Closes pdftract-4hle (7.6.4) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 07:44:12 -04:00
jedarden	47df769e4b	feat(pdftract-5ls35): implement JSON-Lines output sink for grep Implement the --json output sink for pdftract grep with JSON-Lines format (one match per line). Includes MatchEvent, FileOnlyEvent, CountEvent structs and JsonSink line-buffered writer. Key features: - MatchEvent with all fields (path, page_index, bbox, match_text, span_text, span_confidence, pdf_fingerprint, crosses_spans) - crosses_spans omitted when false via skip_serializing_if - NaN/Infinity in span_confidence replaced with null - page_index is 0-based (machine convention) - FileOnlyEvent for -l mode, CountEvent for -c mode - Line-buffered writes with immediate flush - JSON schema at docs/schema/v1.0/grep-jsonl.schema.json Closes: pdftract-5ls35	2026-05-25 02:05:17 -04:00
jedarden	016c738188	feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via the schemars crate. Changes: - Add stable key sorting (sort_keys_recursive) for byte-identical output - Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json - Set title to "pdftract Output v1.0" - Add cargo alias `gen-schema` for convenient invocation - Emit schema to docs/schema/v1.0/pdftract.schema.json The schema is generated from the Rust types with schemars derives, ensuring the JSON schema is always in sync with the source types. Acceptance criteria: - cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json - Generated schema validates against JSON Schema Draft 2020-12 - Schema $id is the stable URL - Title is "pdftract Output v1.0" - Stable ordering: regenerating twice produces byte-identical output - All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.) Note: page_type and confidence_source enums are not yet implemented in the Rust types (marked as TODO in schema/mod.rs). These will be added by sibling beads pdftract-1ob and pdftract-1f8we respectively. Closes: pdftract-5nv9h	2026-05-24 17:31:16 -04:00
jedarden	84b4448648	feat(pdftract-5qca): implement form_fields JSON output + schema integration Phase 7.4.5 implementation: Wire combined Vec<(String, FormFieldValue)> from combiner into document-level /form_fields JSON output with tagged union schema. - Add FormFieldJson, FormFieldTypeJson, FormFieldValueJson, ChoiceValueJson to schema - Add form_fields: Vec<FormFieldJson> to ExtractionResult (always emitted, empty when none) - Implement acro_field_to_value() converter for Phase 7.4.2 type-specific extraction - Wire form field extraction in extract_pdf(): walk AcroForm, extract XFA, combine with XFA-wins - Add convert_form_field_to_json() helper for FormFieldValue → FormFieldJson conversion - Update docs/schema/v1.0/pdftract.schema.json with form_fields $defs and required field - Add form_fields_to_markdown() to markdown module for Form Fields footer table Schema shape: /form_fields is array of {name, type, value, default?, page_index?, rect?, required, read_only, multiline?, max_length?, options?, multi_select?, selected?, state_name?, pushbutton?, radio?}. Type field is tagged enum: "text", "button", "choice", "signature". Value field varies by type (string|boolean|string|array|uint|null). Closes: pdftract-5qca Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 14:36:03 -04:00
jedarden	67b3fde4d6	feat(pdftract-j6yd): implement signatures array output + validation_status enum + schema integration Add document-level /signatures array output per Phase 7.3 of the plan. Changes: - Add SignatureJson struct to schema module with all signature metadata fields - Update ExtractionResult to include signatures: Vec<SignatureJson> - Integrate signature extraction into extract_pdf() pipeline - Update result_to_json() to include signatures in JSON output - Update JSON schema with signatures array and SignatureJson definition - Add markdown sink signatures footer when signatures are present - Add comprehensive tests for signature JSON serialization and validation Acceptance criteria: - Schema tests: 5/5 signature JSON tests pass - Markdown sink emits Signatures footer when count > 0 - PyO3 binding automatically handles Vec<SignatureJson> via serde - docs/schema/v1.0/pdftract.schema.json updated with signatures shape Verification note: notes/pdftract-j6yd.md Closes: pdftract-j6yd	2026-05-24 04:05:34 -04:00
jedarden	92e90af0b0	feat(pdftract-zy2jx): generate JSON Schema from Rust output types - Add schemars dependency to pdftract-core (v1.2) - Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode) - Create xtask/src/bin/gen_schema.rs for schema generation - Add gen-schema command to xtask main.rs - Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12 Schema includes: - $schema: "https://json-schema.org/draft/2020-12/schema" - $defs with all output type definitions - Proper type annotations for all fields Closes: pdftract-zy2jx	2026-05-24 01:29:14 -04:00
jedarden	d14ec92fcb	feat(pdftract-3zhf): add unified TableDetector::detect entry point Add unified detect() method to TableDetector that combines both line-based and borderless table detection pipelines. This completes the coordinator bead for Phase 7.2: Table Detection and Structure Reconstruction. All child beads (7.2.1-7.2.6) are closed: - 7.2.1: Line-based detection (path segment clustering) - 7.2.2: Borderless detection (x0 alignment heuristic) - 7.2.3: Span-to-cell assignment (centroid containment) - 7.2.4: Header row detection (bold + StructTree TH) - 7.2.5: Merged cell detection (missing interior edges) - 7.2.6: Table JSON output schema integration Critical tests pass: - 5x3 bordered table (15 cells extracted) - Merged header cell colspan=3 - Borderless 3-column table detection - Two-page table continuation detection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 00:51:59 -04:00
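
Commit d14ec92fcb lists span-to-cell assignment by "centroid containment" (7.2.3) without showing it. The following is a minimal sketch of that idea, not the repository's code: the `Rect` type, the `assign_span_to_cell` function, and the half-open containment rule are all assumptions made for illustration.

```rust
/// Axis-aligned rectangle in page coordinates (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl Rect {
    /// Geometric center of the rectangle.
    fn centroid(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }

    /// Half-open containment test, so a point on a shared cell edge
    /// belongs to exactly one of two adjacent cells (an assumption of this sketch).
    fn contains(&self, (x, y): (f64, f64)) -> bool {
        x >= self.x0 && x < self.x1 && y >= self.y0 && y < self.y1
    }
}

/// Return the index of the first cell whose rectangle contains the span's
/// centroid, or `None` when the span lies outside every cell.
fn assign_span_to_cell(span: &Rect, cells: &[Rect]) -> Option<usize> {
    let center = span.centroid();
    cells.iter().position(|cell| cell.contains(center))
}
```

Using the centroid rather than full bounding-box containment means a span that slightly overflows a cell border is still assigned to the cell holding most of it.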

9 commits
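
Commit 90d1b9a83d states that `page_number = page_index + 1` and that a PDF with `/PageLabels [{S: "r"}]` yields the labels "i", "ii", "iii". The sketch below shows one way those two rules could fit together. It assumes a single label range that starts at page index 0 with start value 1. `to_lower_roman` and `page_fields` are hypothetical names, not functions from pdftract.

```rust
/// Convert a positive number to lowercase roman numerals (the "r" label style).
/// Returns an empty string for 0, which has no roman representation.
fn to_lower_roman(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut out = String::new();
    // Greedily subtract the largest value that still fits.
    for (value, symbol) in TABLE {
        while n >= value {
            out.push_str(symbol);
            n -= value;
        }
    }
    out
}

/// Hypothetical helper: derive (page_number, page_label) from a 0-based index.
/// `has_roman_labels` stands in for "the PDF has a /PageLabels entry with S = r";
/// when it is false the label is absent, matching the commit's acceptance criteria.
fn page_fields(page_index: u32, has_roman_labels: bool) -> (u32, Option<String>) {
    let page_number = page_index + 1;
    let page_label = has_roman_labels.then(|| to_lower_roman(page_number));
    (page_number, page_label)
}
```

Under these assumptions, page index 2 maps to page number 3 and label "iii". A full implementation would walk the `/PageLabels` number tree and honor each range's prefix and start value, which this sketch omits.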