Commit graph

9 commits

Author SHA1 Message Date
jedarden
90d1b9a83d test(pdftract-4c8qu): add page_label tests and fix JSON schema
- Add test_page_json_with_page_labels_roman_numerals: verifies page_label
  serialization with roman numeral values (i, ii, iii, etc)
- Add test_page_json_without_page_labels_absent: verifies page_label is
  absent (null) when PDF has no /PageLabels
- Add test_page_json_page_index_and_page_number_both_present: verifies
  both page_index and page_number are always present and page_number = page_index + 1
- Add test_page_json_roundtrip_with_all_fields: verifies full roundtrip
  serde preservation of all PageJson fields

- Update docs/schema/v1.0/pdftract.schema.json PageResult definition:
  - Add page_number field (1-based, = page_index + 1)
  - Add page_label field (optional, from /PageLabels number tree)
  - Add width and height fields (page geometry in points)
  - Add rotation field (0, 90, 180, 270 degrees)
  - Add type field with enum: text, scanned, mixed, broken_vector, blank, figure_only
  - Update required fields to include all page-level fields

Acceptance criteria:
 Page serializes with both page_index AND page_number
 PDF with /PageLabels [{S: "r"}] produces page_label "i", "ii", "iii" etc
 PDF without /PageLabels -> page_label absent
 JSON Schema enum for page_type includes all values
 Roundtrip serde Page test passes

Closes: pdftract-4c8qu
2026-05-25 14:43:31 -04:00
jedarden
9abc386cce feat(pdftract-3h9xo): implement threads JSON output + schema integration
Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration.

Changes:
- Added ThreadJson and BeadJson structs to schema/mod.rs
- Added thread_to_json() function to threads/mod.rs
- Added build_page_ref_to_index() helper to parser/pages.rs
- Added threads field to ExtractionResult in extract.rs
- Implemented Phase 7.7 extraction logic with discover_threads/walk_beads
- Added threads_to_markdown() and collapse_page_ranges() to markdown.rs
- Updated JSON schema with ThreadJson and BeadJson definitions
- Added thread_to_py() and bead_to_py() conversions in pdftract-py
- Exported ThreadJson, BeadJson from lib.rs

All 32 threads module tests pass. All 35 markdown tests pass.

Verification: notes/pdftract-3h9xo.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 13:40:15 -04:00
jedarden
b7851b9d92 feat(pdftract-4hle): implement 7.6.4 links + annotations JSON output
Add JSON conversion functions, schema integration, and extraction
pipeline wiring for Phase 7.6 hyperlink and annotation extraction.

Changes:
- Create annotation/json.rs with conversion functions (link_to_json,
  annotation_to_json, fit_type_to_json, sort_links, sort_annotations)
- Add 13 comprehensive tests covering all link/annotation types
- Wire Phase 7.6 annotation extraction into main extract.rs pipeline
- Update docs/schema/v1.0/pdftract.schema.json with LinkJson,
  AnnotationJson, DestArrayJson, DestTypeJson, AnnotationSpecificJson
- Add links to root schema properties and required fields
- Add annotations array to PageResult

Schema definitions include all 8 PDF fit types (XYZ, Fit, FitH, FitV,
FitR, FitB, FitBH, FitBV) and all major annotation subtypes (TextMarkup,
Stamp, FreeText, Text, Ink, Line, Polygon, FileAttachment).

Closes pdftract-4hle (7.6.4)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 07:44:12 -04:00
jedarden
47df769e4b feat(pdftract-5ls35): implement JSON-Lines output sink for grep
Implement the --json output sink for pdftract grep with JSON-Lines
format (one match per line). Includes MatchEvent, FileOnlyEvent,
CountEvent structs and JsonSink line-buffered writer.

Key features:
- MatchEvent with all fields (path, page_index, bbox, match_text,
  span_text, span_confidence, pdf_fingerprint, crosses_spans)
- crosses_spans omitted when false via skip_serializing_if
- NaN/Infinity in span_confidence replaced with null
- page_index is 0-based (machine convention)
- FileOnlyEvent for -l mode, CountEvent for -c mode
- Line-buffered writes with immediate flush
- JSON schema at docs/schema/v1.0/grep-jsonl.schema.json

Closes: pdftract-5ls35
2026-05-25 02:05:17 -04:00
jedarden
016c738188 feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata
Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that
derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via
the schemars crate.

Changes:
- Add stable key sorting (sort_keys_recursive) for byte-identical output
- Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
- Set title to "pdftract Output v1.0"
- Add cargo alias `gen-schema` for convenient invocation
- Emit schema to docs/schema/v1.0/pdftract.schema.json

The schema is generated from the Rust types with schemars derives, ensuring
the JSON schema is always in sync with the source types.

Acceptance criteria:
- cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json
- Generated schema validates against JSON Schema Draft 2020-12
- Schema $id is the stable URL
- Title is "pdftract Output v1.0"
- Stable ordering: regenerating twice produces byte-identical output
- All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.)

Note: page_type and confidence_source enums are not yet implemented in the
Rust types (marked as TODO in schema/mod.rs). These will be added by sibling
beads pdftract-1ob and pdftract-1f8we respectively.

Closes: pdftract-5nv9h
2026-05-24 17:31:16 -04:00
jedarden
84b4448648 feat(pdftract-5qca): implement form_fields JSON output + schema integration
Phase 7.4.5 implementation: Wire combined Vec<(String, FormFieldValue)> from
combiner into document-level /form_fields JSON output with tagged union schema.

- Add FormFieldJson, FormFieldTypeJson, FormFieldValueJson, ChoiceValueJson to schema
- Add form_fields: Vec<FormFieldJson> to ExtractionResult (always emitted, empty when none)
- Implement acro_field_to_value() converter for Phase 7.4.2 type-specific extraction
- Wire form field extraction in extract_pdf(): walk AcroForm, extract XFA, combine with XFA-wins
- Add convert_form_field_to_json() helper for FormFieldValue → FormFieldJson conversion
- Update docs/schema/v1.0/pdftract.schema.json with form_fields $defs and required field
- Add form_fields_to_markdown() to markdown module for Form Fields footer table

Schema shape: /form_fields is array of {name, type, value, default?, page_index?, rect?,
required, read_only, multiline?, max_length?, options?, multi_select?, selected?,
state_name?, pushbutton?, radio?}. Type field is tagged enum: "text", "button", "choice",
"signature". Value field varies by type (string|boolean|string|array|uint|null).

Closes: pdftract-5qca

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 14:36:03 -04:00
jedarden
67b3fde4d6 feat(pdftract-j6yd): implement signatures array output + validation_status enum + schema integration
Add document-level /signatures array output per Phase 7.3 of the plan.

Changes:
- Add SignatureJson struct to schema module with all signature metadata fields
- Update ExtractionResult to include signatures: Vec<SignatureJson>
- Integrate signature extraction into extract_pdf() pipeline
- Update result_to_json() to include signatures in JSON output
- Update JSON schema with signatures array and SignatureJson definition
- Add markdown sink signatures footer when signatures are present
- Add comprehensive tests for signature JSON serialization and validation

Acceptance criteria:
- Schema tests: 5/5 signature JSON tests pass
- Markdown sink emits Signatures footer when count > 0
- PyO3 binding automatically handles Vec<SignatureJson> via serde
- docs/schema/v1.0/pdftract.schema.json updated with signatures shape

Verification note: notes/pdftract-j6yd.md

Closes: pdftract-j6yd
2026-05-24 04:05:34 -04:00
jedarden
92e90af0b0 feat(pdftract-zy2jx): generate JSON Schema from Rust output types
- Add schemars dependency to pdftract-core (v1.2)
- Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode)
- Create xtask/src/bin/gen_schema.rs for schema generation
- Add gen-schema command to xtask main.rs
- Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12

Schema includes:
- $schema: "https://json-schema.org/draft/2020-12/schema"
- $defs with all output type definitions
- Proper type annotations for all fields

Closes: pdftract-zy2jx
2026-05-24 01:29:14 -04:00
jedarden
d14ec92fcb feat(pdftract-3zhf): add unified TableDetector::detect entry point
Add unified detect() method to TableDetector that combines both
line-based and borderless table detection pipelines. This completes
the coordinator bead for Phase 7.2: Table Detection and Structure
Reconstruction.

All child beads (7.2.1-7.2.6) are closed:
- 7.2.1: Line-based detection (path segment clustering)
- 7.2.2: Borderless detection (x0 alignment heuristic)
- 7.2.3: Span-to-cell assignment (centroid containment)
- 7.2.4: Header row detection (bold + StructTree TH)
- 7.2.5: Merged cell detection (missing interior edges)
- 7.2.6: Table JSON output schema integration

Critical tests pass:
- 5x3 bordered table (15 cells extracted)
- Merged header cell colspan=3
- Borderless 3-column table detection
- Two-page table continuation detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:51:59 -04:00