pdftract/docs/user-docs/src/json-schema-reference.md
jedarden 4ec9ff7470 docs(pdftract-5boam): add JSON schema reference page
- Created comprehensive json-schema-reference.md with:
  - Top-level structure documentation
  - Document metadata, page result, span, block fields
  - Table structure (row/cell) with examples
  - Form fields and signatures (Phase 7 placeholders)
  - Receipts and coordinate system docs
  - Cross-references to plan sections (INV-11, Phase 6.1, etc.)
- Added to mdBook SUMMARY.md as top-level reference page
- All examples use real JSON from the schema
- Builds successfully (46KB HTML output)

Acceptance criteria:
- PASS: docs/user-docs/src/json-schema-reference.md exists
- PASS: Covers all top-level types and enums (Document, Page, Span, Block, Table, FormField, Signature, Receipt)
- PASS: Examples for each major type
- PASS: mdBook renders cleanly (verified)
- PASS: Cross-references to plan sections included

Closes: pdftract-5boam
2026-05-25 05:18:53 -04:00

JSON Schema Reference

Schema version: 1.0
Schema URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
Source of truth: docs/schema/v1.0/pdftract.schema.json

This page is a human-readable rendering of the pdftract output schema. The JSON Schema itself is the authoritative definition (per INV-11), and CI validates the output of every test fixture against it.

Top-Level Structure

{
  "fingerprint": "pdftract-v1:a7f3c8d9...",
  "pages": [...],
  "metadata": {...},
  "signatures": [...],
  "form_fields": [...]
}

| Field | Type | Required | Description |
|---|---|---|---|
| fingerprint | string | Yes | Phase 1.7 fingerprint of the source PDF. Format: "pdftract-v1:" + hex(SHA-256). Used for receipt verification. |
| pages | array | Yes | Extracted pages, each containing spans and blocks. |
| metadata | object | Yes | ExtractionMetadata object with page count, diagnostics, receipts mode, etc. |
| signatures | array | Yes | Digital signatures extracted from the document. Empty when no signature fields exist. |
| form_fields | array | Yes | Interactive form fields from AcroForm/XFA. Empty when no form fields exist. |
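
As a quick sanity check before deeper processing, a consumer might confirm that the five required top-level keys are present and that the fingerprint has the documented shape. The sketch below is illustrative only: `check_top_level` and `FINGERPRINT_RE` are hypothetical names, and the 64-hex-digit length is an assumption derived from the size of a SHA-256 digest rather than something the schema text spells out.

```python
import re

# Assumption: hex(SHA-256) is exactly 64 hex digits; letter case is not
# specified on this page, so both cases are accepted here.
FINGERPRINT_RE = re.compile(r"pdftract-v1:([0-9a-fA-F]{64})")

REQUIRED_TOP_LEVEL = ("fingerprint", "pages", "metadata", "signatures", "form_fields")


def check_top_level(doc: dict) -> list:
    """Return a list of human-readable problems; empty means the shape looks right."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]
    fingerprint = doc.get("fingerprint")
    if isinstance(fingerprint, str) and FINGERPRINT_RE.fullmatch(fingerprint) is None:
        problems.append("fingerprint is not 'pdftract-v1:' followed by 64 hex digits")
    return problems
```

This is not a substitute for validating against the JSON Schema; it only catches gross structural problems early.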

Document Metadata

The metadata object contains extraction-level information:

{
  "page_count": 10,
  "span_count": 842,
  "block_count": 156,
  "error_count": 0,
  "receipts_mode": "off",
  "diagnostics": ["WARN: page 3: low coverage (54%) - possible scanned content"],
  "cache_status": "hit",
  "cache_age_seconds": 1240,
  "reading_order_algorithm": "robust-topo"
}

| Field | Type | Description |
|---|---|---|
| page_count | integer | Total number of pages in the document. |
| span_count | integer | Number of spans extracted across all pages. |
| block_count | integer | Number of blocks extracted across all pages. |
| error_count | integer | Number of pages that failed to extract. |
| receipts_mode | string | Receipts mode used: "off", "lite", or "svg". |
| diagnostics | array | Diagnostic messages emitted during extraction (coverage warnings, etc.). |
| cache_status | string/null | Cache status: "hit", "miss", or "skipped". |
| cache_age_seconds | integer/null | Cache entry age in seconds (only present when cache_status == "hit"). |
| reading_order_algorithm | string/null | Reading order algorithm used for this extraction. |

Page Result

Each page in the pages array contains:

{
  "index": 0,
  "spans": [...],
  "blocks": [...],
  "tables": [...],
  "error": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| index | integer | Yes | Zero-based page index. This is the canonical identifier for programmatic use. |
| spans | array | Yes | Extracted spans (text fragments with consistent styling). |
| blocks | array | Yes | Extracted blocks (semantic units like paragraphs, headings). |
| tables | array | Yes | Extracted tables with cell-level structure. Empty when no tables detected. |
| error | string/null | Yes | Error message if extraction failed for this page. |
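
Because every page carries its own `error` field, a consumer can find failed pages without parsing the diagnostics strings. The following is a minimal sketch under that reading; `summarize_pages` is a hypothetical helper, not part of pdftract.

```python
def summarize_pages(doc: dict) -> dict:
    """Collect per-page totals and the indices of pages whose extraction failed."""
    pages = doc["pages"]
    return {
        # A non-null error marks a page that failed to extract.
        "failed_pages": [page["index"] for page in pages if page["error"] is not None],
        "span_total": sum(len(page["spans"]) for page in pages),
        "block_total": sum(len(page["blocks"]) for page in pages),
        "table_total": sum(len(page["tables"]) for page in pages),
    }
```

Going by the field descriptions, one would expect `span_total`, `block_total`, and the number of failed pages to agree with `span_count`, `block_count`, and `error_count` in `metadata`, though that equality is an inference from the descriptions rather than a documented guarantee.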

Span

A span is the smallest unit of extracted text, representing a contiguous run of text with consistent font and styling.

{
  "text": "The quick brown fox",
  "bbox": [72.0, 612.0, 245.5, 624.3],
  "font": "Helvetica-Bold",
  "size": 12.0,
  "column": 0,
  "confidence": 0.98,
  "receipt": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The extracted text content. |
| bbox | array | Yes | Bounding box in PDF user-space points. Format: [x0, y0, x1, y1] where (x0, y0) is the bottom-left corner and (x1, y1) is the top-right corner. Units are 1/72 inch. |
| font | string | Yes | Font name or identifier. |
| size | number | Yes | Font size in points. |
| column | integer/null | No | Column index (0-based) assigned by Phase 4.3 column detection. Null for spans outside any detected column. |
| confidence | number/null | No | Confidence score (0.0 to 1.0). Present when OCR is used or extraction has uncertainty. |
| receipt | object/null | No | Cryptographic receipt for verification. Present when --receipts=lite or --receipts=svg is enabled. |

Block

A block is a higher-level semantic unit composed of one or more spans.

{
  "kind": "paragraph",
  "text": "The quick brown fox jumps over the lazy dog.",
  "bbox": [72.0, 600.0, 540.0, 650.0],
  "level": null,
  "table_index": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| kind | string | Yes | The block kind/type. Common values: "paragraph", "heading", "list", "table", "figure". |
| text | string | Yes | The concatenated text content of all spans in the block. |
| bbox | array | Yes | Bounding box in PDF user-space points. Same format as spans. |
| level | integer/null | No | Heading level (1-6) for "heading" kind blocks. Null for other block types. |
| table_index | integer/null | No | Table index for "table" kind blocks. Points to the corresponding entry in the page's tables array. |
| receipt | object/null | No | Cryptographic receipt for verification. Present when receipts are enabled. |

Block Kind Enum

| Value | Description |
|---|---|
| paragraph | A paragraph block. |
| heading | A heading block (with level field 1-6). |
| list | A list item block. |
| table | A table block (references tables array via table_index). |
| figure | A figure or image block. |
| code | A code block or monospace text. |
| formula | A mathematical formula. |
| header | A page header block. |
| footer | A page footer block. |
| watermark | A watermark block. |
| caption | A caption for a figure or table. |
| quote | A blockquote. |
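
The `kind` and `level` fields are enough to recover a document outline. As a rough sketch (the `heading_outline` name is hypothetical), one could walk the pages in order and keep only heading blocks:

```python
def heading_outline(doc: dict) -> list:
    """Return (page_index, level, text) tuples for every heading block, in page order."""
    outline = []
    for page in doc["pages"]:
        for block in page["blocks"]:
            if block["kind"] == "heading":
                # level is optional in the schema, so read it defensively.
                outline.append((page["index"], block.get("level"), block["text"]))
    return outline
```

This relies on blocks already being listed in reading order within each page, which is an assumption here rather than something this page states.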

Table

Tables provide detailed cell-level structure for table blocks.

{
  "id": "table_0",
  "page_index": 2,
  "bbox": [72.0, 400.0, 540.0, 550.0],
  "detection_method": "line_based",
  "header_rows": 1,
  "continued": false,
  "continued_from_prev": false,
  "rows": [...]
}

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for this table (e.g., "table_0"). |
| page_index | integer | Yes | Zero-based page index where this table appears. |
| bbox | array | Yes | Bounding box in PDF user-space points. |
| detection_method | string | Yes | Detection method: "line_based" (ruling lines) or "borderless" (x0 alignment heuristics). |
| header_rows | integer | Yes | Number of contiguous header rows at the top of the table. |
| continued | boolean | Yes | Whether this table continues on the next page. |
| continued_from_prev | boolean | Yes | Whether this table is a continuation from the previous page. |
| rows | array | Yes | Rows in this table, ordered top-to-bottom. |

Row

Each row contains cells ordered left-to-right:

{
  "bbox": [72.0, 520.0, 540.0, 540.0],
  "is_header": true,
  "cells": [...]
}

| Field | Type | Required | Description |
|---|---|---|---|
| bbox | array | Yes | Bounding box in PDF user-space points. |
| is_header | boolean | Yes | Whether this row is a header row. |
| cells | array | Yes | Cells in this row, ordered left-to-right. |

Cell

{
  "text": "Revenue",
  "bbox": [72.0, 520.0, 180.0, 540.0],
  "row": 0,
  "col": 0,
  "rowspan": 1,
  "colspan": 1,
  "is_header_row": true,
  "spans": [0, 1]
}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The concatenated text content of all spans in the cell. |
| bbox | array | Yes | Bounding box in PDF user-space points. |
| row | integer | Yes | Zero-based row index within the table. |
| col | integer | Yes | Zero-based column index within the table. |
| rowspan | integer | Yes | Number of rows this cell spans (default 1). |
| colspan | integer | Yes | Number of columns this cell spans (default 1). |
| is_header_row | boolean | Yes | Whether this cell is in a header row. |
| spans | array | Yes | References to spans in the page's spans array (indices). |
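
The `row`, `col`, `rowspan`, and `colspan` fields together describe a rectangular grid. One way to flatten a table into rows of strings is sketched below; `table_to_grid` is a hypothetical helper, and repeating a merged cell's text in every slot it covers is a design choice made for this example, not behavior defined by the schema.

```python
from typing import List, Optional


def table_to_grid(table: dict) -> List[List[Optional[str]]]:
    """Expand a table's cells into a dense 2D grid of cell text."""
    cells = [cell for row in table["rows"] for cell in row["cells"]]
    # Grid size is the furthest extent reached by any cell, including its spans.
    n_rows = max((c["row"] + c["rowspan"] for c in cells), default=0)
    n_cols = max((c["col"] + c["colspan"] for c in cells), default=0)
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        for r in range(c["row"], c["row"] + c["rowspan"]):
            for k in range(c["col"], c["col"] + c["colspan"]):
                grid[r][k] = c["text"]
    return grid
```

Slots that no cell covers stay `None`, which makes gaps in a detected table easy to spot.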

Form Fields (Phase 7.4)

Form fields represent interactive form fields from the PDF's AcroForm or XFA data.

Note: Phase 7 placeholders are documented here for forward compatibility. The fields are defined in the schema, but the arrays stay empty until Phase 7 is implemented.

{
  "name": "employer_signature",
  "type": "text",
  "value": "John Doe",
  "default": null,
  "read_only": false,
  "required": true,
  "page_index": 2,
  "rect": [72.0, 400.0, 288.0, 420.0],
  "multiline": true,
  "max_length": 100
}

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | The absolute (dot-joined) field name from the AcroForm. |
| type | string | Yes | Field type: "text", "button", "choice", or "signature". |
| value | varies | Yes | The current value (structure varies by type). |
| default | varies | No | The default value (/DV entry). |
| read_only | boolean | Yes | Whether this field is read-only (bit 1 of /Ff flags). |
| required | boolean | Yes | Whether this field is required (bit 2 of /Ff flags). |
| page_index | integer/null | No | Zero-based page index where this field's widget appears. |
| rect | array/null | No | Bounding box in PDF user-space points. |
| multiline | boolean/null | No | Whether this text field supports multiple lines (text fields only). |
| max_length | integer/null | No | Maximum length for text fields (/MaxLen entry). |
| multi_select | boolean/null | No | Whether this choice field supports multiple selections. |
| options | array/null | No | Available options for choice fields ([export_value, display_name] pairs). |
| radio | boolean/null | No | Whether this button is a radio button (button fields only). |
| pushbutton | boolean/null | No | Whether this button is a pushbutton (button fields only). |
| selected | boolean/null | No | Selected state for button fields. |
| state_name | string/null | No | Appearance state name for button fields (e.g., "Yes", "Off"). |
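
The "bit 1" and "bit 2" wording follows the PDF specification's convention of numbering flag bits from 1, starting at the low-order bit. For readers cross-checking `read_only` and `required` against a raw /Ff integer, the mapping can be sketched as follows; the constant and function names are hypothetical and not part of pdftract.

```python
# Bit positions as numbered in the PDF specification (1 = low-order bit).
READ_ONLY_BIT = 1
REQUIRED_BIT = 2


def flag_is_set(ff: int, bit_position: int) -> bool:
    """Test a 1-based bit position in a raw /Ff field-flags integer."""
    return bool(ff & (1 << (bit_position - 1)))


# Example: /Ff 2 has only the second bit set, so the field is required
# but not read-only.
ff = 2
read_only = flag_is_set(ff, READ_ONLY_BIT)
required = flag_is_set(ff, REQUIRED_BIT)
```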

Signatures (Phase 7.3)

Digital signatures extracted from signature fields.

{
  "field_name": "employer_signature",
  "signer_name": "Jane Corporation",
  "signing_date": "2024-03-15T14:23:51Z",
  "location": "New York, NY",
  "reason": "Contract approval",
  "sub_filter": "adbe.pkcs7.detached",
  "byte_range": [0, 12345, 67890, 456],
  "coverage_fraction": 0.95,
  "validation_status": "not_checked"
}

| Field | Type | Required | Description |
|---|---|---|---|
| field_name | string | Yes | The absolute (dot-joined) field name from the AcroForm. |
| signer_name | string | Yes | The signer's name from the /Name entry. Empty string if absent. |
| validation_status | string | Yes | Validation status. Always "not_checked" in v1. Future versions may add "valid", "invalid", "indeterminate". |
| signing_date | string/null | No | The signing date as an ISO 8601 string (RFC 3339 format). |
| location | string/null | No | The location of signing from the /Location entry. |
| reason | string/null | No | The reason for signing from the /Reason entry. |
| sub_filter | string/null | No | The signature format/filter from the /SubFilter entry. |
| byte_range | array/null | No | The /ByteRange array defining which bytes of the file are signed. |
| coverage_fraction | number/null | No | Fraction of the file covered by the signature (0.0 to 1.0). |
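
In PDF, a /ByteRange array is a flat list of (offset, length) pairs, so `byte_range` is easier to work with once it is regrouped. The sketch below shows only that regrouping and the total number of signed bytes; the helper names are hypothetical. How `coverage_fraction` is derived from these numbers is not defined on this page, so the sketch does not attempt to reproduce it.

```python
def byte_range_segments(byte_range: list) -> list:
    """Group a flat /ByteRange array into (offset, length) pairs."""
    if len(byte_range) % 2 != 0:
        raise ValueError("byte_range must contain an even number of integers")
    return [(byte_range[i], byte_range[i + 1]) for i in range(0, len(byte_range), 2)]


def signed_byte_count(byte_range: list) -> int:
    """Total number of bytes covered by all segments."""
    return sum(length for _offset, length in byte_range_segments(byte_range))
```

For a typical two-segment range, the gap between the first segment's end and the second segment's offset is where the signature value itself is stored.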

Receipts (Phase 6.8)

Visual citation receipts provide cryptographic proof that extracted text originated from a specific region in a specific PDF.

{
  "pdf_fingerprint": "pdftract-v1:a7f3c8d9...",
  "page_index": 14,
  "bbox": [220.0, 412.0, 412.0, 432.0],
  "content_hash": "sha256:9b21c4e5...",
  "extraction_version": "1.0.0",
  "svg_clip": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| pdf_fingerprint | string | Yes | Phase 1.7 fingerprint of the source PDF. |
| page_index | integer | Yes | Zero-based page index in the source PDF. |
| bbox | array | Yes | Bounding box in PDF user-space points. |
| content_hash | string | Yes | SHA-256 hash of the NFC-normalized text content. Format: "sha256:" + hex(SHA-256). |
| extraction_version | string | Yes | The pdftract version that produced this receipt (semver string). |
| svg_clip | string/null | No | SVG clip rendering the glyphs (present only in SVG mode). |
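
Given the `content_hash` definition above, a consumer could recompute the hash for a span or block and compare it with the receipt. This is a minimal sketch: `content_hash` and `receipt_matches` are hypothetical helpers, and encoding the normalized text as UTF-8 before hashing is an assumption, since the byte encoding is not stated on this page.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Hash NFC-normalized text in the documented 'sha256:<hex>' format."""
    normalized = unicodedata.normalize("NFC", text)
    # Assumption: the normalized text is encoded as UTF-8 before hashing.
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def receipt_matches(receipt: dict, text: str, pdf_fingerprint: str) -> bool:
    """Check that a receipt refers to this PDF and to exactly this text."""
    return (
        receipt["pdf_fingerprint"] == pdf_fingerprint
        and receipt["content_hash"] == content_hash(text)
    )
```

NFC normalization means that a precomposed "é" and an "e" followed by a combining acute accent produce the same hash.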

Receipts Mode

| Mode | Description |
|---|---|
| off | No receipts generated (default). |
| lite | Minimal receipts (~120 bytes each) with fingerprint, page index, bbox, and content hash. |
| svg | Extended receipts that include an SVG clip rendering the glyphs. |

Phase 7 Placeholders

The following fields are included in the schema for forward compatibility but are not yet populated in Phase 6. They will be populated in Phase 7:

  • pages[].annotations - Highlights, stamps, notes, links from /Annots (Phase 7)
  • attachments - From /EmbeddedFiles name tree (Phase 7.5)
  • links - Document-scoped URI and internal destination links (Phase 7.6)
  • threads - Article thread chains (Phase 7.7)

These fields are present in the schema as empty arrays or null values, so consumers can write handling code for them now and will not break when Phase 7 starts populating them.
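
Since a placeholder may be an empty array or null, and older outputs might omit it entirely, consumer code is safest when it treats all three cases the same way. A defensive reader might look like this; `placeholder_lists` is a hypothetical name, and tolerating a missing key is an extra precaution beyond what the schema text promises.

```python
def placeholder_lists(doc: dict) -> dict:
    """Read the Phase 7 placeholder fields, treating missing or null as empty."""
    return {
        "attachments": doc.get("attachments") or [],
        "links": doc.get("links") or [],
        "threads": doc.get("threads") or [],
        # Annotations live on each page rather than at the document level.
        "annotations": [
            annotation
            for page in doc.get("pages", [])
            for annotation in (page.get("annotations") or [])
        ],
    }
```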

Diagnostics

Diagnostic messages provide visibility into extraction quality and issues:

| Severity | Description |
|---|---|
| WARN | Warning - extraction succeeded but with potential quality issues (e.g., low coverage suggesting scanned content). |
| ERROR | Error - extraction failed for a specific page or region. |

Example diagnostics:

[
  "WARN: page 3: low coverage (54%) - possible scanned content",
  "ERROR: page 7: failed to extract - corrupt content stream"
]
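
The two example messages share a `SEVERITY: page N: message` shape, which suggests they can be split into fields for filtering or reporting. The parser below is inferred from those examples only; the message format is not formally specified here, so `parse_diagnostic` (a hypothetical helper) falls back to returning the raw string when the pattern does not match.

```python
import re

# Pattern inferred from the example messages; the "page N:" part is optional.
DIAGNOSTIC_RE = re.compile(r"(WARN|ERROR): (?:page (\d+): )?(.*)")


def parse_diagnostic(message: str) -> dict:
    """Split a diagnostic string into severity, page number, and message text."""
    match = DIAGNOSTIC_RE.fullmatch(message)
    if match is None:
        return {"severity": None, "page": None, "message": message}
    severity, page, text = match.groups()
    return {
        "severity": severity,
        # The page number is kept exactly as printed in the message.
        "page": int(page) if page is not None else None,
        "message": text,
    }
```

Whether the page number in a diagnostic is zero-based like `index` is not stated on this page, so the value should not be used as an array index without checking.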

Coordinate System

All bbox values use PDF user-space coordinates:

  • Units: PDF points (1/72 inch, approximately 0.353 mm)
  • Origin: Lower-left corner of the page (x=0, y=0)
  • Format: [x0, y0, x1, y1] where (x0, y0) is bottom-left and (x1, y1) is top-right

Example: For a US Letter page (8.5 × 11 inches):

  • Width: 612 points (8.5 × 72)
  • Height: 792 points (11 × 72)
  • Full page bbox: [0, 0, 612, 792]
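
Most image and screen coordinate systems put the origin at the top-left with y growing downward, so overlaying a bbox on a rendered page needs a vertical flip plus a scale from points to pixels. The sketch below assumes an unrotated page whose user space starts at (0, 0); `bbox_to_pixels` and its `dpi` parameter are hypothetical names for illustration.

```python
def bbox_to_pixels(bbox, page_height_pt: float, dpi: float = 144.0):
    """Convert [x0, y0, x1, y1] in PDF points to (left, top, right, bottom) pixels."""
    scale = dpi / 72.0  # 72 points per inch
    x0, y0, x1, y1 = bbox
    left = x0 * scale
    right = x1 * scale
    # Flip vertically: the PDF top edge (y1) becomes the smaller pixel y.
    top = (page_height_pt - y1) * scale
    bottom = (page_height_pt - y0) * scale
    return (left, top, right, bottom)
```

At 72 DPI the full US Letter bbox `[0, 0, 612, 792]` maps to itself, which is a convenient check that the flip is applied correctly.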

Schema Validation

Per INV-11, all JSON output must validate against the schema. CI runs a schema validation step on every fixture:

# Validation example using the command-line tool from the Python jsonschema package
pip install jsonschema
jsonschema -i output.json docs/schema/v1.0/pdftract.schema.json

Plan References

  • Phase 6.1 (lines 2018-2051): JSON output full schema implementation
  • Phase 6.8 (lines 2400+): Visual citation receipts
  • Phase 7.3 (lines 2750+): Digital signatures
  • Phase 7.4 (lines 2800+): Form fields
  • INV-11 (line 841): Schema validation invariant

For the complete field-by-field rationale, see the extraction output schema research doc.