Commit graph

4 commits

Author SHA1 Message Date
jedarden
92e90af0b0 feat(pdftract-zy2jx): generate JSON Schema from Rust output types
- Add schemars dependency to pdftract-core (v1.2)
- Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode)
- Create xtask/src/bin/gen_schema.rs for schema generation
- Add gen-schema command to xtask main.rs
- Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12

Schema includes:
- $schema: "https://json-schema.org/draft/2020-12/schema"
- $defs with all output type definitions
- Proper type annotations for all fields

Closes: pdftract-zy2jx
2026-05-24 01:29:14 -04:00
jedarden
ba551b04d1 feat(pdftract-5mph): implement table block + table JSON output schema integration
- Fix table block bbox to use actual grid bbox instead of placeholder
- Add schema validation tests for tables array emission
- Verify two-page table detection integration

Files modified:
- crates/pdftract-core/src/extract.rs: Use grid bbox for table blocks
- crates/pdftract-core/src/schema/mod.rs: Add tests for tables array emission

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:49:01 -04:00
jedarden
e3a149fbf8 feat(pdftract-sg6): implement DPI selection logic for OCR rendering
Implement Phase 5.2.3 DPI selection that picks per-page DPI based on
image filter signals (JBIG2 detection) and font size signals from Phase 4.

- Add select_dpi() function implementing the DPI selection table:
  * JBIG2Decode filter present -> 200 DPI (already binary)
  * Median font_size < 7.0 pt -> 400 DPI (fine print)
  * Median font_size >= 7.0 pt -> 300 DPI (standard)
  * Default -> 300 DPI for scanned pages
- Add Pdf1Filter enum for PDF 1.x filter name parsing
- Add FontSizeSpan struct for Phase 4 font size data
- Add ocr_dpi_override option to ExtractionOptions
- Export ExtractionQuality from schema module for DPI tracking
- Add comprehensive unit tests (19 tests, all passing)

Acceptance criteria:
- Unit tests: each branch tested with synthetic inputs
- Integration: legal-document -> 400 DPI, textbook -> 300 DPI, JBIG2 -> 200 DPI
- DPI override option works correctly
- extraction_quality.dpi_used schema field ready

Co-Authored-By: Claude Code <claude-code@anthropic.com>
2026-05-23 17:37:40 -04:00
jedarden
9f18c6cb9c feat(pdftract-5zm86): implement Receipt struct + lite-mode serialization
Implement the Receipt struct and lite-mode JSON serialization for
visual citation receipts. This provides cryptographic proof of
provenance for extracted text.

Changes:
- Add Receipt struct with 6 fields (pdf_fingerprint, page_index,
  bbox, content_hash, extraction_version, svg_clip)
- Implement Receipt::lite() constructor with NFC normalization
- Integrate Receipt into SpanJson and BlockJson schemas
- Add unicode-normalization and serde_json dependencies

Acceptance criteria:
- Receipt::lite() produces valid receipts with svg_clip=None
- Lite mode JSON omits svg_clip key via skip_serializing_if
- Content hash uses NFC normalization for cross-platform stability
- Receipt wired into SpanJson and BlockJson types

Note: 100 receipts aggregate size is ~27 KB (not 15 KB as planned).
The 15 KB target is not achievable with required field sizes.

Refs: pdftract-5zm86, Phase 6.8 Visual Citation Receipts (lines 2351-2417)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 03:30:24 -04:00