JSON Schema Reference

Schema version: 1.0
Schema URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
Source of truth: docs/schema/v1.0/pdftract.schema.json

This page is a human-readable rendering of the pdftract output schema. The JSON Schema file is the authoritative definition (per INV-11), and CI validates the output of every test fixture against it.

Top-Level Structure

{
  "fingerprint": "pdftract-v1:a7f3c8d9...",
  "pages": [...],
  "metadata": {...},
  "signatures": [...],
  "form_fields": [...]
}

| Field | Type | Required | Description |
|---|---|---|---|
| fingerprint | string | Yes | Phase 1.7 fingerprint of the source PDF. Format: "pdftract-v1:" + hex(SHA-256). Used for receipt verification. |
| pages | array | Yes | Extracted pages, each containing spans and blocks. |
| metadata | object | Yes | ExtractionMetadata object with page count, diagnostics, receipts mode, etc. |
| signatures | array | Yes | Digital signatures extracted from the document. Empty when no signature fields exist. |
| form_fields | array | Yes | Interactive form fields from AcroForm/XFA. Empty when no form fields exist. |
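
As a quick sanity check before full schema validation, a consumer might confirm that the five required top-level fields are present and that the fingerprint has the documented shape. The sketch below is illustrative only: the function name `check_top_level` is hypothetical, and it assumes the digest is written as 64 lowercase hex characters, which the schema file (not this page) would confirm.

```python
import re

# Assumption: the SHA-256 digest is rendered as 64 lowercase hex characters.
FINGERPRINT_RE = re.compile(r"pdftract-v1:[0-9a-f]{64}")

# The five fields the table above marks as required.
REQUIRED_TOP_LEVEL = ("fingerprint", "pages", "metadata", "signatures", "form_fields")


def check_top_level(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the top level looks sane."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]
    fingerprint = doc.get("fingerprint")
    if isinstance(fingerprint, str) and not FINGERPRINT_RE.fullmatch(fingerprint):
        problems.append("fingerprint is not 'pdftract-v1:' followed by 64 hex characters")
    return problems
```

This does not replace validation against the JSON Schema; it only catches the most obvious structural problems early.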

Document Metadata

The metadata object contains extraction-level information:

{
  "page_count": 10,
  "span_count": 842,
  "block_count": 156,
  "error_count": 0,
  "receipts_mode": "off",
  "diagnostics": ["WARN: page 3: low coverage (54%) - possible scanned content"],
  "cache_status": "hit",
  "cache_age_seconds": 1240,
  "reading_order_algorithm": "robust-topo"
}

| Field | Type | Description |
|---|---|---|
| page_count | integer | Total number of pages in the document. |
| span_count | integer | Number of spans extracted across all pages. |
| block_count | integer | Number of blocks extracted across all pages. |
| error_count | integer | Number of pages that failed to extract. |
| receipts_mode | string | Receipts mode used: "off", "lite", or "svg". |
| diagnostics | array | Diagnostic messages emitted during extraction (coverage warnings, etc.). |
| cache_status | string/null | Cache status: "hit", "miss", or "skipped". |
| cache_age_seconds | integer/null | Cache entry age in seconds (only present when cache_status == "hit"). |
| reading_order_algorithm | string/null | Reading order algorithm used for this extraction. |

Page Result

Each page in the pages array contains:

{
  "index": 0,
  "spans": [...],
  "blocks": [...],
  "tables": [...],
  "error": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| index | integer | Yes | Zero-based page index. This is the canonical identifier for programmatic use. |
| spans | array | Yes | Extracted spans (text fragments with consistent styling). |
| blocks | array | Yes | Extracted blocks (semantic units like paragraphs, headings). |
| tables | array | Yes | Extracted tables with cell-level structure. Empty when no tables detected. |
| error | string/null | Yes | Error message if extraction failed for this page. |
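
Because a failed page is reported through its error field rather than by omitting the page, consumers usually want to collect the failures before using the spans. A minimal sketch follows, with hypothetical helper names and the assumption that metadata.error_count counts exactly the pages whose error is non-null:

```python
def failed_pages(doc: dict) -> dict[int, str]:
    """Map page index -> error message for every page that failed to extract."""
    return {
        page["index"]: page["error"]
        for page in doc["pages"]
        if page["error"] is not None
    }


def error_count_consistent(doc: dict) -> bool:
    """Check metadata.error_count against the per-page error fields.

    Assumption: error_count equals the number of pages with a non-null error.
    """
    return len(failed_pages(doc)) == doc["metadata"]["error_count"]
```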

Span

A span is the smallest unit of extracted text, representing a contiguous run of text with consistent font and styling.

{
  "text": "The quick brown fox",
  "bbox": [72.0, 612.0, 245.5, 624.3],
  "font": "Helvetica-Bold",
  "size": 12.0,
  "column": 0,
  "confidence": 0.98,
  "receipt": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The extracted text content. |
| bbox | array | Yes | Bounding box in PDF user-space points. Format: [x0, y0, x1, y1] where (x0, y0) is the bottom-left corner and (x1, y1) is the top-right corner. Units are 1/72 inch. |
| font | string | Yes | Font name or identifier. |
| size | number | Yes | Font size in points. |
| column | integer/null | No | Column index (0-based) assigned by Phase 4.3 column detection. Null for spans outside any detected column. |
| confidence | number/null | No | Confidence score (0.0 to 1.0). Present when OCR is used or extraction has uncertainty. |
| receipt | object/null | No | Cryptographic receipt for verification. Present when --receipts=lite or --receipts=svg is enabled. |

Block

A block is a higher-level semantic unit composed of one or more spans.

{
  "kind": "paragraph",
  "text": "The quick brown fox jumps over the lazy dog.",
  "bbox": [72.0, 600.0, 540.0, 650.0],
  "level": null,
  "table_index": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| kind | string | Yes | The block kind/type. Common values: "paragraph", "heading", "list", "table", "figure". |
| text | string | Yes | The concatenated text content of all spans in the block. |
| bbox | array | Yes | Bounding box in PDF user-space points. Same format as spans. |
| level | integer/null | No | Heading level (1-6) for "heading" kind blocks. Null for other block types. |
| table_index | integer/null | No | Table index for "table" kind blocks. Points to the corresponding entry in the page’s tables array. |
| receipt | object/null | No | Cryptographic receipt for verification. Present when receipts are enabled. |

Block Kind Enum

| Value | Description |
|---|---|
| paragraph | A paragraph block. |
| heading | A heading block (with level field 1-6). |
| list | A list item block. |
| table | A table block (references tables array via table_index). |
| figure | A figure or image block. |
| code | A code block or monospace text. |
| formula | A mathematical formula. |
| header | A page header block. |
| footer | A page footer block. |
| watermark | A watermark block. |
| caption | A caption for a figure or table. |
| quote | A blockquote. |
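
Two fields on a block only make sense for one kind each: level for "heading" blocks and table_index for "table" blocks. The following sketch shows one way a consumer might use them. Both helper names are hypothetical, and the outline is returned in array order, which this page does not explicitly state to be reading order.

```python
def heading_outline(page: dict) -> list[tuple[int, str]]:
    """Return (level, text) pairs for the heading blocks of a page, in array order."""
    return [
        (block["level"], block["text"])
        for block in page["blocks"]
        if block["kind"] == "heading" and block.get("level") is not None
    ]


def table_for_block(page: dict, block: dict):
    """Return the entry in page["tables"] that a "table" block points to, or None."""
    index = block.get("table_index")
    if block["kind"] != "table" or index is None:
        return None
    tables = page["tables"]
    # Guard against an out-of-range index instead of raising IndexError.
    return tables[index] if 0 <= index < len(tables) else None
```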

Table

Tables provide detailed cell-level structure for table blocks.

{
  "id": "table_0",
  "page_index": 2,
  "bbox": [72.0, 400.0, 540.0, 550.0],
  "detection_method": "line_based",
  "header_rows": 1,
  "continued": false,
  "continued_from_prev": false,
  "rows": [...]
}

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for this table (e.g., "table_0"). |
| page_index | integer | Yes | Zero-based page index where this table appears. |
| bbox | array | Yes | Bounding box in PDF user-space points. |
| detection_method | string | Yes | Detection method: "line_based" (ruling lines) or "borderless" (x0 alignment heuristics). |
| header_rows | integer | Yes | Number of contiguous header rows at the top of the table. |
| continued | boolean | Yes | Whether this table continues on the next page. |
| continued_from_prev | boolean | Yes | Whether this table is a continuation from the previous page. |
| rows | array | Yes | Rows in this table, ordered top-to-bottom. |

Row

Each row contains cells ordered left-to-right:

{
  "bbox": [72.0, 520.0, 540.0, 540.0],
  "is_header": true,
  "cells": [...]
}

| Field | Type | Required | Description |
|---|---|---|---|
| bbox | array | Yes | Bounding box in PDF user-space points. |
| is_header | boolean | Yes | Whether this row is a header row. |
| cells | array | Yes | Cells in this row, ordered left-to-right. |

Cell

{
  "text": "Revenue",
  "bbox": [72.0, 520.0, 180.0, 540.0],
  "row": 0,
  "col": 0,
  "rowspan": 1,
  "colspan": 1,
  "is_header_row": true,
  "spans": [0, 1]
}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The concatenated text content of all spans in the cell. |
| bbox | array | Yes | Bounding box in PDF user-space points. |
| row | integer | Yes | Zero-based row index within the table. |
| col | integer | Yes | Zero-based column index within the table. |
| rowspan | integer | Yes | Number of rows this cell spans (default 1). |
| colspan | integer | Yes | Number of columns this cell spans (default 1). |
| is_header_row | boolean | Yes | Whether this cell is in a header row. |
| spans | array | Yes | References to spans in the page’s spans array (indices). |
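
Since every cell carries its own row, col, rowspan, and colspan, a table can be flattened into a rectangular grid without relying on the nesting of rows. The sketch below is one possible approach, not the library's own: the name `table_to_grid` is hypothetical, and repeating a merged cell's text in every position it covers is a design choice made here for simplicity.

```python
def table_to_grid(table: dict) -> list[list[str]]:
    """Flatten a table into a rectangular grid of cell texts.

    Assumption: a cell spanning several rows or columns has its text
    repeated in every grid position it covers; uncovered positions are "".
    """
    cells = [cell for row in table["rows"] for cell in row["cells"]]
    if not cells:
        return []
    n_rows = max(cell["row"] + cell["rowspan"] for cell in cells)
    n_cols = max(cell["col"] + cell["colspan"] for cell in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell["row"], cell["row"] + cell["rowspan"]):
            for c in range(cell["col"], cell["col"] + cell["colspan"]):
                grid[r][c] = cell["text"]
    return grid
```

A consumer that prefers to keep merged cells distinct could store the text only at the cell's top-left position instead.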

Form Fields (Phase 7.4)

Form fields represent interactive form fields from the PDF’s AcroForm or XFA data.

Note: Phase 7 placeholders are documented here for forward-compatibility. Fields are present in the schema but return empty arrays until Phase 7 implementation.

{
  "name": "employer_signature",
  "type": "text",
  "value": "John Doe",
  "default": null,
  "read_only": false,
  "required": true,
  "page_index": 2,
  "rect": [72.0, 400.0, 288.0, 420.0],
  "multiline": true,
  "max_length": 100
}

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | The absolute (dot-joined) field name from the AcroForm. |
| type | string | Yes | Field type: "text", "button", "choice", or "signature". |
| value | varies | Yes | The current value (structure varies by type). |
| default | varies | No | The default value (/DV entry). |
| read_only | boolean | Yes | Whether this field is read-only (bit 1 of /Ff flags). |
| required | boolean | Yes | Whether this field is required (bit 2 of /Ff flags). |
| page_index | integer/null | No | Zero-based page index where this field’s widget appears. |
| rect | array/null | No | Bounding box in PDF user-space points. |
| multiline | boolean/null | No | Whether this text field supports multiple lines (text fields only). |
| max_length | integer/null | No | Maximum length for text fields (/MaxLen entry). |
| multi_select | boolean/null | No | Whether this choice field supports multiple selections. |
| options | array/null | No | Available options for choice fields ([export_value, display_name] pairs). |
| radio | boolean/null | No | Whether this button is a radio button (button fields only). |
| pushbutton | boolean/null | No | Whether this button is a pushbutton (button fields only). |
| selected | boolean/null | No | Selected state for button fields. |
| state_name | string/null | No | Appearance state name for button fields (e.g., "Yes", "Off"). |
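
The bit numbers quoted for read_only and required follow the PDF convention of numbering flag bits from 1 at the low-order end, so "bit 1" corresponds to the mask 0x1 and "bit 2" to 0x2. The raw /Ff integer is not part of the output; the sketch below only illustrates how the two booleans relate to it, and the helper `ff_flag` is a hypothetical name.

```python
READ_ONLY_BIT = 1  # /Ff bit position reported as read_only
REQUIRED_BIT = 2   # /Ff bit position reported as required


def ff_flag(ff: int, position: int) -> bool:
    """Test a /Ff flag by its 1-based bit position (1 = low-order bit)."""
    if position < 1:
        raise ValueError("PDF flag bit positions start at 1")
    return bool(ff & (1 << (position - 1)))
```

For example, a field whose /Ff value is 2 would be reported with read_only false and required true.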

Signatures (Phase 7.3)

Digital signatures extracted from signature fields.

{
  "field_name": "employer_signature",
  "signer_name": "Jane Corporation",
  "signing_date": "2024-03-15T14:23:51Z",
  "location": "New York, NY",
  "reason": "Contract approval",
  "sub_filter": "adbe.pkcs7.detached",
  "byte_range": [0, 12345, 67890, 456],
  "coverage_fraction": 0.95,
  "validation_status": "not_checked"
}

| Field | Type | Required | Description |
|---|---|---|---|
| field_name | string | Yes | The absolute (dot-joined) field name from the AcroForm. |
| signer_name | string | Yes | The signer’s name from the /Name entry. Empty string if absent. |
| validation_status | string | Yes | Validation status. Always "not_checked" in v1. Future versions may add "valid", "invalid", "indeterminate". |
| signing_date | string/null | No | The signing date as an ISO 8601 string (RFC 3339 format). |
| location | string/null | No | The location of signing from the /Location entry. |
| reason | string/null | No | The reason for signing from the /Reason entry. |
| sub_filter | string/null | No | The signature format/filter from the /SubFilter entry. |
| byte_range | array/null | No | The /ByteRange array defining which bytes of the file are signed. |
| coverage_fraction | number/null | No | Fraction of the file covered by the signature (0.0 to 1.0). |

Receipts (Phase 6.8)

Visual citation receipts provide cryptographic proof that extracted text originated from a specific region in a specific PDF.

{
  "pdf_fingerprint": "pdftract-v1:a7f3c8d9...",
  "page_index": 14,
  "bbox": [220.0, 412.0, 412.0, 432.0],
  "content_hash": "sha256:9b21c4e5...",
  "extraction_version": "1.0.0",
  "svg_clip": null
}

| Field | Type | Required | Description |
|---|---|---|---|
| pdf_fingerprint | string | Yes | Phase 1.7 fingerprint of the source PDF. |
| page_index | integer | Yes | Zero-based page index in the source PDF. |
| bbox | array | Yes | Bounding box in PDF user-space points. |
| content_hash | string | Yes | SHA-256 hash of the NFC-normalized text content. Format: "sha256:" + hex(SHA-256). |
| extraction_version | string | Yes | The pdftract version that produced this receipt (semver string). |
| svg_clip | string/null | No | SVG clip rendering the glyphs (present only in SVG mode). |
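
The content_hash format described above can be recomputed from a span's or block's text to check that it still matches its receipt. This is a sketch under one stated assumption: the NFC-normalized text is encoded as UTF-8 before hashing, which this page does not specify. The function names are hypothetical.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Compute "sha256:" + hex(SHA-256) over the NFC-normalized text."""
    normalized = unicodedata.normalize("NFC", text)
    # Assumption: the normalized text is hashed as UTF-8 bytes.
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return "sha256:" + digest


def matches_receipt(text: str, receipt: dict) -> bool:
    """Return True when the text hashes to the receipt's content_hash."""
    return content_hash(text) == receipt["content_hash"]
```

Normalizing first means that a precomposed "é" and an "e" followed by a combining acute accent produce the same hash.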

Receipts Mode

| Mode | Description |
|---|---|
| off | No receipts generated (default). |
| lite | Minimal receipts (~120 bytes each) with fingerprint, page index, bbox, and content hash. |
| svg | Extended receipts that include an SVG clip rendering the glyphs. |

Phase 7 Placeholders

The following fields are included in the schema for forward compatibility but are not yet populated in Phase 6. They will be populated in Phase 7:

  • pages[].annotations - Highlights, stamps, notes, links from /Annots (Phase 7)
  • attachments - From /EmbeddedFiles name tree (Phase 7.5)
  • links - Document-scoped URI and internal destination links (Phase 7.6)
  • threads - Article thread chains (Phase 7.7)

These fields are present in the schema as empty arrays or null values, so consumers can account for them now and will not break when Phase 7 features are added.
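
A consumer written today can treat a missing, null, or empty placeholder the same way, so that the code keeps working once these fields start carrying data. A minimal sketch, where `placeholder_list` is a hypothetical helper:

```python
def placeholder_list(container: dict, key: str) -> list:
    """Read a placeholder field, treating a missing key or null as an empty list."""
    value = container.get(key)
    return value if isinstance(value, list) else []
```

For instance, `placeholder_list(doc, "attachments")` or `placeholder_list(page, "annotations")` would return an empty list until the corresponding Phase 7 feature populates the field.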

Diagnostics

Diagnostic messages provide visibility into extraction quality and issues:

| Severity | Description |
|---|---|
| WARN | Warning: extraction succeeded but with potential quality issues (e.g., low coverage suggesting scanned content). |
| ERROR | Error: extraction failed for a specific page or region. |

Example diagnostics:

[
  "WARN: page 3: low coverage (54%) - possible scanned content",
  "ERROR: page 7: failed to extract - corrupt content stream"
]
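
Diagnostics are plain strings, so filtering by severity or page requires parsing them. The sketch below infers the layout "SEVERITY: page N: message" from the two examples above; that layout is an assumption, not a documented grammar, and the page number is returned exactly as written in the message. The names `Diagnostic` and `parse_diagnostic` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Diagnostic(NamedTuple):
    severity: str         # "WARN" or "ERROR"
    page: Optional[int]   # page number as written in the message, if any
    message: str


# Assumed layout, inferred from the examples: "SEVERITY: page N: message".
_DIAGNOSTIC_RE = re.compile(r"(WARN|ERROR): (?:page (\d+): )?(.*)", re.DOTALL)


def parse_diagnostic(line: str) -> Optional[Diagnostic]:
    """Split a diagnostic string into its parts, or return None if it does not match."""
    match = _DIAGNOSTIC_RE.fullmatch(line)
    if match is None:
        return None
    severity, page, message = match.groups()
    return Diagnostic(severity, int(page) if page is not None else None, message)
```

Returning None for unrecognized strings lets a consumer fall back to showing the raw message rather than failing.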

Coordinate System

All bbox values use PDF user-space coordinates:

  • Units: PDF points (1/72 inch, approximately 0.353 mm)
  • Origin: Lower-left corner of the page (x=0, y=0)
  • Format: [x0, y0, x1, y1] where (x0, y0) is bottom-left and (x1, y1) is top-right

Example: For a US Letter page (8.5 × 11 inches):

  • Width: 612 points (8.5 × 72)
  • Height: 792 points (11 × 72)
  • Full page bbox: [0, 0, 612, 792]
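
Most image and screen coordinate systems put the origin at the top-left with y increasing downward, so drawing a bbox over a rendered page requires flipping the y axis. The sketch below assumes the caller knows the page height in points (it is not part of the output described on this page) and ignores page rotation and crop-box offsets. The name `bbox_to_pixels` is hypothetical.

```python
def bbox_to_pixels(bbox, page_height_pt: float, dpi: float = 72.0):
    """Convert a PDF user-space bbox to (left, top, right, bottom) in pixels.

    Assumptions: the page is unrotated, its origin is at (0, 0), and the
    image was rendered at `dpi` dots per inch.
    """
    x0, y0, x1, y1 = bbox
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    left = x0 * scale
    right = x1 * scale
    # Flip the y axis: the top edge in PDF space (y1) becomes the smaller pixel y.
    top = (page_height_pt - y1) * scale
    bottom = (page_height_pt - y0) * scale
    return (left, top, right, bottom)
```

At 72 dpi the full US Letter bbox maps to itself, because flipping the whole page leaves its extent unchanged.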

Schema Validation

Per INV-11, all JSON output must validate against the schema. CI runs a schema validation step on every fixture:

# Python validation example
pip install jsonschema
jsonschema -i output.json docs/schema/v1.0/pdftract.schema.json

Plan References

  • Phase 6.1 (lines 2018-2051): JSON output full schema implementation
  • Phase 6.8 (lines 2400+): Visual citation receipts
  • Phase 7.3 (lines 2750+): Digital signatures
  • Phase 7.4 (lines 2800+): Form fields
  • INV-11 (line 841): Schema validation invariant

For the complete field-by-field rationale, see the extraction output schema research doc.