docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields

This commit brings docs/research/extraction-output-schema.md to v1.0 final-pass
specification, aligning with Phase 6.1 deliverables and plan requirements.

**Key additions:**
- page_number field documented with page_index relationship (1-based vs 0-based)
- page_type enum expanded with all six values: text, scanned, mixed, broken_vector,
  blank, figure_only — with broken_vector cross-referenced to Phase 5.5
- Block kind enum fully documented: paragraph, heading, list, table, figure, caption,
  code, formula, watermark, header, footer
- Attachments schema with base64 contentEncoding and 50MB truncation rule
- Profile-based classification fields (document_type, document_type_confidence,
  document_type_reasons, profile_name, profile_version, profile_fields)
- Schema Version Compatibility section with additive-evolution rules
- JSON Schema cross-reference throughout

**Format changes:**
- Restructured with ATX headings (## for sections)
- Added explicit field tables for each major schema section
- Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json
- Grew from 81 lines to 304 lines per acceptance criteria

**Plan references:**
- Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659
- INV-9 page_type taxonomy stability

Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>
jedarden 2026-05-24 00:59:04 -04:00
parent 3b91b340aa
commit bf37f0f05f


@@ -4,6 +4,8 @@
The pdftract extraction output schema is designed around a single, governing principle: every downstream use case — RAG ingestion, full-text search indexing, accessibility auditing, forensic analysis, and archival preservation — must be satisfiable from a single extraction pass without requiring re-processing. This demands a schema that is simultaneously comprehensive and layered, exposing fine-grained atomic data at lower levels while assembling semantic structure at higher levels, with clean separation between what belongs at document scope and what belongs at page scope.
**Machine-readable schema:** This document is the human-readable specification. The machine-readable JSON Schema is available at [`docs/schema/v1.0/pdftract.schema.json`](../schema/v1.0/pdftract.schema.json) and should be used for automated validation.
---
## Document-Level Structure
@@ -14,14 +16,70 @@ The `pages` array is also at document level, each entry being a self-contained p
Fields that are inherently global — metadata, signatures, the outline tree, embedded attachments — must not be duplicated inside page objects. Conversely, anything that varies per-page (geometry, content blocks, annotations) must not be flattened to the document level. This division is what keeps both the full-document JSON and per-page NDJSON frames self-consistent.
### Root Fields
| Field | Type | Description |
|-------|------|-------------|
| `schema_version` | string | Schema version identifier (e.g., `"1.0"`) |
| `fingerprint` | string | PDF fingerprint for verification (format: `pdftract-v1:<hex>`) |
| `metadata` | object | Document-level metadata (see Metadata Schema below) |
| `pages` | array | Array of page objects (see Page-Level Structure below) |
| `outline` | array | Recursive bookmark tree (empty if no bookmarks) |
| `threads` | array | Article thread chains (empty until Phase 7) |
| `attachments` | array | Embedded files (see Attachments Schema below) |
| `signatures` | array | Digital signature metadata (empty until Phase 7) |
| `form_fields` | array | AcroForm/XFA field definitions (empty until Phase 7) |
| `links` | array | Document-scoped hyperlinks (empty until Phase 7) |
| `extraction_quality` | object | Aggregate quality metrics across all pages |
| `errors` | array | Diagnostic events from extraction run |
---
## Page-Level Structure
Each page object carries `page_index` (zero-based integer, matching array position), `page_label` (the human-readable label from the PDF `/PageLabels` number tree, e.g. "iv", "A-3", "1"), `width` and `height` in points (1/72 inch), and `rotation` in degrees clockwise (0, 90, 180, or 270). The `page_type` field is a hint produced by the classifier: `text`, `scanned`, `mixed`, `blank`, or `figure_only`. This hint is informational; it does not gate access to content but signals to consumers how much confidence to assign to the extracted text.
Each page object carries both positional identifiers and classification metadata.
### Page Identification Fields
| Field | Type | Description |
|-------|------|-------------|
| `page_index` | integer | Zero-based page index, canonical for programmatic use. Used in all internal references (error diagnostics, NDJSON frame ordering, cache keys). SDK code and downstream tools MUST key on `page_index` for programmatic access. |
| `page_number` | integer | One-based page number, equal to `page_index + 1`. Emitted alongside `page_index` as a convenience for human-facing display. This field is informational only; all programmatic access should use `page_index`. |
| `page_label` | string\|null | Human-readable label from the PDF `/PageLabels` number tree (e.g., `"iv"`, `"A-3"`, `"1"`). Absent (`null`) if the PDF defines no page labels. |
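The split between the three identifiers can be illustrated with a minimal sketch. The helper name `display_name` and the sample page objects are hypothetical, made up for illustration; only the field names come from the table above.

```python
def display_name(page):
    """Return a human-facing name for a page object (hypothetical helper)."""
    label = page.get("page_label")
    if label is not None:
        return label
    # page_number is documented as page_index + 1, so it can be derived if absent.
    return str(page.get("page_number", page["page_index"] + 1))


# Illustrative page objects, not real pdftract output.
pages = [
    {"page_index": 0, "page_number": 1, "page_label": "iv"},
    {"page_index": 1, "page_number": 2, "page_label": None},
]

# Programmatic lookups key on page_index, never on the label or number.
by_index = {p["page_index"]: p for p in pages}
names = [display_name(p) for p in pages]
```

Keeping the lookup table keyed on `page_index` means a document whose labels repeat or restart (for example front matter in roman numerals) still resolves unambiguously.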
### Page Geometry and Classification
| Field | Type | Description |
|-------|------|-------------|
| `width` | number | Page width in points (1/72 inch) |
| `height` | number | Page height in points (1/72 inch) |
| `rotation` | integer | Page rotation in degrees clockwise (0, 90, 180, or 270) |
| `page_type` | string | Classification hint from the page classifier (see Page Type Enum below) |
### Page Type Enum
The `page_type` field is produced by the classifier and signals to consumers how much confidence to assign to the extracted text. This taxonomy is stable per [INV-9](../plan/plan.md#inv-9) — new values require an ADR.
| Value | Description |
|-------|-------------|
| `"text"` | Pure vector text PDF — all content extracted from font glyphs with high confidence |
| `"scanned"` | Raster image page — text extracted via OCR (or OCR-assisted for broken vector pages) |
| `"mixed"` | Hybrid page containing both vector text regions and scanned image regions |
| `"broken_vector"` | Vector page with corrupted encoding (e.g., bad ToUnicode CMaps); extraction produced low-confidence text. See Phase 5.5 for the OCR escalation path. If the binary was compiled without the `ocr` feature, `broken_vector` pages are emitted as-is with a `BROKENVECTOR_OCR_UNAVAILABLE` diagnostic. |
| `"blank"` | Page with no text and no images |
| `"figure_only"` | Page with only image XObjects, no text glyphs |
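Because `page_type` is a hint rather than a gate, each consumer decides what to do with it. The sketch below shows one possible triage policy; the three outcome names (`"trust"`, `"review"`, `"skip"`) and the grouping of types are assumptions chosen for illustration, not part of the schema.

```python
# Hypothetical policy: which page types a consumer treats with caution.
LOW_TRUST_TYPES = {"scanned", "mixed", "broken_vector"}
NO_TEXT_TYPES = {"blank", "figure_only"}


def triage(page):
    """Map a page's page_type hint to a consumer-side action (assumed policy)."""
    page_type = page["page_type"]
    if page_type in NO_TEXT_TYPES:
        return "skip"      # nothing textual to index
    if page_type in LOW_TRUST_TYPES:
        return "review"    # OCR or damaged encoding was involved
    if page_type == "text":
        return "trust"
    # Unrecognized value: fall back to the cautious path.
    return "review"


decisions = [triage({"page_type": t}) for t in ("text", "scanned", "blank")]
```

The cautious fallback for unrecognized values keeps the consumer working if the taxonomy is ever extended.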
### Content Arrays
Within a page, content is represented at two granularities: `spans` and `blocks`. Spans are the atomic unit: individual sequences of characters sharing identical rendering properties. Blocks are semantic groupings assembled from one or more spans. Both arrays coexist on the page object. This dual representation is deliberate: applications that need character-level font and position data (accessibility auditing, forensic comparison) operate on spans directly; applications that need paragraph flow (RAG chunking, search indexing) operate on blocks. Each block references its member spans by index, so block-level consumers can always descend to span-level data when needed.
| Field | Type | Description |
|-------|------|-------------|
| `spans` | array | Atomic text spans (see Span Schema below) |
| `blocks` | array | Semantic block groupings (see Block Schema below) |
| `tables` | array | Parallel table structure objects for `kind: table` blocks (see Table Output below) |
| `annotations` | array | Page-level annotations (highlights, stamps, notes, links; empty until Phase 7) |
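Descending from a block to its spans is an index lookup into the page-level `spans` array. A minimal sketch, assuming a page object shaped as described here (the helper name and the sample data are hypothetical):

```python
def spans_for_block(page, block):
    """Resolve a block's span indices against the page-level spans array."""
    return [page["spans"][i] for i in block["spans"]]


# Illustrative page object with only the fields this example needs.
page = {
    "spans": [
        {"text": "Hello ", "font": "Helvetica-Bold", "size": 12.0},
        {"text": "world", "font": "Helvetica", "size": 12.0},
    ],
    "blocks": [
        {"kind": "paragraph", "text": "Hello world", "spans": [0, 1]},
    ],
}

block = page["blocks"][0]
fonts = sorted({span["font"] for span in spans_for_block(page, block)})
```

A block-level consumer can ignore `spans` entirely and read `block["text"]`; the lookup is only needed when font or position detail matters.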
Page-level `annotations` are distinct from block content. They include highlights, stamps, sticky notes, and ink annotations, each with their own `bbox`, `subtype`, `author`, `created`, `modified`, and `contents` fields. Links (URI and internal-destination) appear in annotations as `subtype: link` with a `uri` or `dest` field rather than being mixed into the text stream.
---
@@ -30,15 +88,55 @@ Page-level `annotations` are distinct from block content. They include highlight
A span is the smallest unit of extraction output. Its fields are: `text` (the decoded Unicode string), `bbox` as a four-element array `[x0, y0, x1, y1]` in points with the coordinate origin at the lower-left of the page (PDF default), `font` (the font name as declared in the resource dictionary), `size` (the rendered glyph size in points, combining the font matrix and CTM), `color` (the fill color as a CSS hex string like `"#1a1a1a"`, or `null` if the color is not expressible as RGB, for example a spot color), `rendering_mode` (an integer 0–7 matching the PDF `Tr` operator: 0 = fill, 3 = invisible, etc.), `confidence` (a float 0.0–1.0), `confidence_source` (one of `"native"`, `"ocr"`, `"heuristic"`), `lang` (a BCP-47 language tag if detected, otherwise `null`), and `flags` (a set of strings: `"bold"`, `"italic"`, `"smallcaps"`, `"subscript"`, `"superscript"`).
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | The decoded Unicode string |
| `bbox` | array | Bounding box `[x0, y0, x1, y1]` in PDF user-space points (origin at lower-left) |
| `font` | string | Font name as declared in the resource dictionary |
| `size` | number | Rendered glyph size in points (combines font matrix and CTM) |
| `color` | string\|null | Fill color as CSS hex string (e.g., `"#1a1a1a"`), or `null` if not expressible as RGB |
| `rendering_mode` | integer | PDF `Tr` operator value (0 = fill, 3 = invisible, etc.) |
| `confidence` | number | Confidence score 0.0–1.0 |
| `confidence_source` | string | One of `"native"`, `"ocr"`, `"heuristic"` |
| `lang` | string\|null | BCP-47 language tag if detected, otherwise `null` |
| `flags` | array | Set of style flags: `"bold"`, `"italic"`, `"smallcaps"`, `"subscript"`, `"superscript"` |
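Since `bbox` uses the PDF lower-left origin, consumers that overlay boxes on rendered images usually need a top-left origin instead. The sketch below shows that flip using the page `height`; it deliberately ignores `rotation`, and the function name is hypothetical.

```python
def to_top_left(bbox, page_height):
    """Convert [x0, y0, x1, y1] from lower-left origin to top-left origin.

    Assumes an unrotated page; units stay in points.
    """
    x0, y0, x1, y1 = bbox
    # The top edge (y1) becomes the smaller y value after the flip.
    return [x0, page_height - y1, x1, page_height - y0]


# A span near the top of a US Letter page (792 pt tall).
flipped = to_top_left([72.0, 700.0, 200.0, 712.0], page_height=792.0)
```

The box height is preserved by the conversion (12 pt before and after), which is a quick sanity check when wiring this into a renderer.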
The `confidence` and `confidence_source` pair allows consumers to apply their own filtering thresholds. A span with `confidence_source: "native"` and high confidence came from decoded font mapping with no ambiguity. A span with `confidence_source: "ocr"` was produced by the raster OCR pipeline and warrants lower trust. The `rendering_mode` field is critical for invisible-text detection: text placed with `Tr 3` is present in the stream but was never intended to be visible — forensic and accessibility consumers need this distinction.
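A consumer-side filter built on these fields might look like the following sketch. The 0.8 threshold and both function names are assumptions for illustration; the schema does not prescribe a cutoff.

```python
INVISIBLE_MODE = 3  # PDF Tr 3: text is in the stream but not painted


def visible_trusted(spans, min_confidence=0.8):
    """Keep spans that are painted and meet the caller's confidence threshold."""
    return [
        s for s in spans
        if s["rendering_mode"] != INVISIBLE_MODE and s["confidence"] >= min_confidence
    ]


def invisible(spans):
    """Return spans placed with rendering mode 3, for forensic or a11y review."""
    return [s for s in spans if s["rendering_mode"] == INVISIBLE_MODE]


# Illustrative spans, not real extraction output.
spans = [
    {"text": "Invoice", "rendering_mode": 0, "confidence": 0.99, "confidence_source": "native"},
    {"text": "lnvoice", "rendering_mode": 0, "confidence": 0.55, "confidence_source": "ocr"},
    {"text": "hidden keyword", "rendering_mode": 3, "confidence": 0.99, "confidence_source": "native"},
]

kept = [s["text"] for s in visible_trusted(spans)]
hidden = [s["text"] for s in invisible(spans)]
```

Note that scanned PDFs with an OCR text layer commonly place that layer with `Tr 3`, so a search-indexing consumer may want the opposite policy and keep invisible text.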
---
## Block Schema
A block aggregates spans into a semantic unit. Its fields are: `kind` (one of `paragraph`, `heading`, `table`, `list`, `figure`, `header`, `footer`, `caption`, `code`, `formula`), `text` (the concatenated plain text of all member spans, with whitespace normalized), `bbox` (the union bounding box of all member spans), `spans` (an array of span indices referencing the page-level `spans` array), `level` (an integer 1–6 for `kind: heading`, matching h1–h6 semantics, omitted for all other kinds), and `confidence` (the minimum confidence across member spans, representing the weakest link).
A block aggregates spans into a semantic unit. The `kind` field is the primary classification signal.
The `kind` field is the primary classification signal. Consumers building a table of contents use `kind: heading` with `level`. Consumers extracting body text filter to `kind: paragraph`. The `header` and `footer` kinds identify repeated page-margin content that should typically be excluded from body-text flows. The `figure` kind marks regions where no extractable text is present but a visual element occupies the bbox — useful for flagging gaps in extraction coverage.
### Block Fields
| Field | Type | Description |
|-------|------|-------------|
| `kind` | string | Block kind/type (see Block Kind Enum below) |
| `text` | string | Concatenated plain text of all member spans, with whitespace normalized |
| `bbox` | array | Union bounding box of all member spans `[x0, y0, x1, y1]` in points |
| `spans` | array | Array of span indices referencing the page-level `spans` array |
| `level` | integer\|null | Heading level 1–6 for `kind: heading` (matches h1–h6 semantics), `null` for other kinds |
| `confidence` | number | Minimum confidence across member spans (weakest link) |
### Block Kind Enum
| Value | Description |
|-------|-------------|
| `"paragraph"` | Default body text block |
| `"heading"` | Heading or subheading (has `level` field 1–6) |
| `"list"` | List item(s) — bullet or numbered |
| `"table"` | Tabular data (see Table Output below) |
| `"figure"` | Image or graphic region with no extractable text |
| `"caption"` | Figure or table caption (small font, follows a figure/table block) |
| `"code"` | Monospace code block (indented, uses monospace font) |
| `"formula"` | Mathematical formula (detected via OpenType Math in Phase 7) |
| `"watermark"` | Watermark or background text (excluded from body text flow) |
| `"header"` | Repeated page-margin content at top (deduplicated across pages) |
| `"footer"` | Repeated page-margin content at bottom (deduplicated across pages) |
Consumers building a table of contents use `kind: heading` with `level`. Consumers extracting body text filter to `kind: paragraph`. The `header` and `footer` kinds identify repeated page-margin content that should typically be excluded from body-text flows. The `figure` kind marks regions where no extractable text is present but a visual element occupies the bbox — useful for flagging gaps in extraction coverage.
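The two consumer patterns just described can be sketched as follows. The function names and sample pages are hypothetical; the filtering relies only on the `kind`, `level`, and `text` fields documented above.

```python
def table_of_contents(pages):
    """Collect (level, text, page_index) for every heading block, in order."""
    toc = []
    for page in pages:
        for block in page["blocks"]:
            if block["kind"] == "heading":
                toc.append((block["level"], block["text"], page["page_index"]))
    return toc


def body_text(pages):
    """Join paragraph blocks only; headers, footers, and watermarks drop out."""
    return "\n\n".join(
        block["text"]
        for page in pages
        for block in page["blocks"]
        if block["kind"] == "paragraph"
    )


# Illustrative pages with only the fields this example reads.
pages = [
    {"page_index": 0, "blocks": [
        {"kind": "header", "text": "ACME Corp", "level": None},
        {"kind": "heading", "text": "Introduction", "level": 1},
        {"kind": "paragraph", "text": "First paragraph.", "level": None},
    ]},
    {"page_index": 1, "blocks": [
        {"kind": "heading", "text": "Scope", "level": 2},
        {"kind": "paragraph", "text": "Second paragraph.", "level": None},
        {"kind": "footer", "text": "Page 2", "level": None},
    ]},
]

toc = table_of_contents(pages)
text = body_text(pages)
```

Filtering by allow-list (`kind == "paragraph"`) rather than deny-list means any block kind added in a later minor version is excluded from body text by default; a consumer that prefers the opposite behavior would exclude `header`, `footer`, and `watermark` instead.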
---
@@ -46,11 +144,111 @@ The `kind` field is the primary classification signal. Consumers building a tabl
Tables are represented with two complementary structures. The block with `kind: table` gives the bounding box and concatenated text for downstream consumers that do not need cell structure. For consumers that do, a parallel `table` object at page level (keyed to the block index) provides the full nested structure: `rows` is an array of row objects, each containing a `cells` array. Each cell carries `text`, `bbox`, `rowspan` (default 1), `colspan` (default 1), and `is_header` (boolean, derived from tagged PDF structure or heuristic header-row detection). This separation ensures that table-aware consumers get machine-readable structure while table-unaware consumers still receive coherent concatenated text from the block.
### Table Object Fields
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier (e.g., `"table_0"`) |
| `bbox` | array | Bounding box `[x0, y0, x1, y1]` in points |
| `rows` | array | Array of row objects (see Row Schema below) |
| `header_rows` | integer | Number of contiguous header rows at top |
| `detection_method` | string | One of `"line_based"`, `"borderless"` |
| `continued` | boolean | Whether table continues on next page |
| `continued_from_prev` | boolean | Whether table is continuation from previous page |
| `page_index` | integer | Zero-based page index where table appears |
### Row Schema
| Field | Type | Description |
|-------|------|-------------|
| `bbox` | array | Bounding box `[x0, y0, x1, y1]` in points |
| `cells` | array | Array of cell objects (see Cell Schema below) |
| `is_header` | boolean | Whether this row is a header row |
### Cell Schema
| Field | Type | Description |
|-------|------|-------------|
| `bbox` | array | Bounding box `[x0, y0, x1, y1]` in points |
| `text` | string | Concatenated text content of all spans in the cell |
| `spans` | array | References to spans in the page's `spans` array (integer indices) |
| `row` | integer | Zero-based row index within the table |
| `col` | integer | Zero-based column index within the table |
| `rowspan` | integer | Number of rows this cell spans (default 1) |
| `colspan` | integer | Number of columns this cell spans (default 1) |
| `is_header_row` | boolean | Whether this cell is in a header row |
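Table-aware consumers often want a dense grid rather than a list of cells. The sketch below expands `rowspan` and `colspan` into repeated values. It assumes that `row` and `col` give the top-left position of a spanning cell, which the tables above do not state explicitly; the function name and sample table are hypothetical.

```python
def table_to_grid(table):
    """Expand a table object into a dense list-of-lists of cell text."""
    cells = [cell for row in table["rows"] for cell in row["cells"]]
    # Grid size is the furthest extent reached by any cell, spans included.
    n_rows = max(c["row"] + c.get("rowspan", 1) for c in cells)
    n_cols = max(c["col"] + c.get("colspan", 1) for c in cells)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        for r in range(c["row"], c["row"] + c.get("rowspan", 1)):
            for col in range(c["col"], c["col"] + c.get("colspan", 1)):
                grid[r][col] = c["text"]  # spanning cells repeat their text
    return grid


# Illustrative table: one header cell spanning two columns, then a data row.
table = {
    "id": "table_0",
    "rows": [
        {"is_header": True, "cells": [
            {"row": 0, "col": 0, "colspan": 2, "rowspan": 1, "text": "Totals"},
        ]},
        {"is_header": False, "cells": [
            {"row": 1, "col": 0, "colspan": 1, "rowspan": 1, "text": "Q1"},
            {"row": 1, "col": 1, "colspan": 1, "rowspan": 1, "text": "42"},
        ]},
    ],
}

grid = table_to_grid(table)
```

Positions not covered by any cell stay `None`, which makes gaps in detection visible instead of silently shifting columns.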
---
## Attachments Schema
Extracted embedded files from PDF portfolios and `/EmbeddedFiles` name trees.
### Attachment Fields
| Field | Type | Description |
|-------|------|-------------|
| `filename` | string | Filename from `/F` or `/UF` in the Filespec dictionary |
| `description` | string\|null | Description from `/Desc`, or `null` if absent |
| `mime_type` | string\|null | MIME type hint from `/Subtype` in the EF stream dictionary |
| `size` | integer\|null | Decoded stream size in bytes, or `null` if unavailable |
| `created` | string\|null | ISO-8601 creation date from `/Params /CreationDate`, or `null` |
| `modified` | string\|null | ISO-8601 modification date from `/Params /ModDate`, or `null` |
| `checksum` | string\|null | Checksum from `/Params /CheckSum`, or `null` |
| `data` | string\|null | Base64-encoded content of the decoded attachment stream, or `null` if truncated (see Size Limit below) |
| `truncated` | boolean | `true` if the attachment exceeded the size limit and `data` is `null` |
### Size Limit and Encoding
If an attachment's decoded stream exceeds 50 MB, only its metadata is emitted: `data` is set to `null` and `truncated` to `true`. When non-null, `data` is the base64-encoded content of the decoded attachment stream, using the standard Base64 alphabet with no line breaks and with padding preserved. The JSON Schema reflects this as `{"type": "string", "contentEncoding": "base64"}` for this field. In the Python API, `data` is returned as a Python `bytes` object (PyO3 converts from base64 automatically). In the CLI `--text` mode, attachments are not included.
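For consumers reading the raw JSON output rather than the Python API, decoding is a standard-library call. A minimal sketch, with a hypothetical helper name and made-up attachment entries:

```python
import base64


def attachment_bytes(attachment):
    """Return decoded bytes, or None when the attachment was truncated."""
    if attachment["truncated"] or attachment["data"] is None:
        return None
    # validate=True rejects characters outside the standard Base64 alphabet.
    return base64.b64decode(attachment["data"], validate=True)


# Illustrative entries from the attachments array.
small = {"filename": "note.txt", "size": 5, "data": "aGVsbG8=", "truncated": False}
large = {"filename": "video.mp4", "size": 90_000_000, "data": None, "truncated": True}

content = attachment_bytes(small)
missing = attachment_bytes(large)
```

Checking `truncated` before touching `data` keeps the metadata-only case explicit, so callers can report the missing payload instead of failing on `None`.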
---
## Metadata Schema
The document `metadata` object surfaces all standard PDF document information dictionary fields: `/Title`, `/Author`, `/Subject`, `/Keywords`, `/Creator`, `/Producer`, `/CreationDate`, and `/ModDate` (ISO-8601 strings). It also carries derived signals: `page_count`, `pdf_version` (e.g. `"1.7"`), `is_tagged` (boolean, true if a `/MarkInfo` dictionary with `Marked: true` is present), `is_encrypted` (boolean), `conformance` (one of `"none"`, `"PDF-A-1a"`, `"PDF-A-1b"`, `"PDF-A-2a"`, `"PDF-A-2b"`, `"PDF-A-2u"`, `"PDF-A-3a"`, `"PDF-A-3b"`, `"PDF-A-3u"`, `"PDF-UA-1"`, `"PDF-UA-2"`, `"PDF-X-1a"` — validated, not merely declared), `contains_javascript` (boolean), `contains_xfa` (boolean), and `generator` (a heuristic string identifying the producing application inferred from `/Creator` and `/Producer` patterns). XMP metadata is normalized into these same fields where it provides richer values than the document information dictionary.
The document `metadata` object surfaces all standard PDF document information dictionary fields, derived signals, and profile-based classification results.
### Standard PDF Fields
| Field | Type | Description |
|-------|------|-------------|
| `title` | string\|null | PDF `/Title` |
| `author` | string\|null | PDF `/Author` |
| `subject` | string\|null | PDF `/Subject` |
| `keywords` | string\|null | PDF `/Keywords` |
| `creator` | string\|null | PDF `/Creator` |
| `producer` | string\|null | PDF `/Producer` |
| `creation_date` | string\|null | ISO-8601 string from `/CreationDate` |
| `modification_date` | string\|null | ISO-8601 string from `/ModDate` |
### Derived Signals
| Field | Type | Description |
|-------|------|-------------|
| `page_count` | integer | Total number of pages |
| `pdf_version` | string | PDF version (e.g., `"1.7"`) |
| `is_tagged` | boolean | `true` if `/MarkInfo /Marked: true` is present |
| `is_encrypted` | boolean | `true` if document is encrypted |
| `conformance` | string | One of `"none"`, `"PDF-A-1a"`, `"PDF-A-1b"`, `"PDF-A-2a"`, `"PDF-A-2b"`, `"PDF-A-2u"`, `"PDF-A-3a"`, `"PDF-A-3b"`, `"PDF-A-3u"`, `"PDF-UA-1"`, `"PDF-UA-2"`, `"PDF-X-1a"` |
| `contains_javascript` | boolean | `true` if JavaScript actions are present |
| `contains_xfa` | boolean | `true` if XFA forms are present |
| `ocg_present` | boolean | `true` if optional content groups (layers) are present |
| `generator` | string | Heuristic string identifying the producing application |
### Profile-Based Classification (Phase 7.10)
When a document profile matches (via `--auto` or `--profile`), the metadata includes classification fields:
| Field | Type | Description |
|-------|------|-------------|
| `document_type` | string\|null | Matched profile type (e.g., `"invoice"`, `"receipt"`, `"form"`) |
| `document_type_confidence` | number\|null | Classification confidence 0.0–1.0 |
| `document_type_reasons` | array\|null | Array of strings explaining why this type matched (e.g., `"text_contains matched 'Invoice #'"`, `"structural.has_table = true"`) |
| `profile_name` | string\|null | Name of the matched profile (e.g., `"invoice"`) |
| `profile_version` | string\|null | Profile version string (e.g., `"1.0.0"`) |
| `profile_fields` | object\|null | Map from field name to typed value, per the matched profile's schema. Each profile defines its own field set; see `profiles/builtin/<type>/README.md` for profile-specific field documentation. |
XMP metadata is normalized into these same fields where it provides richer values than the document information dictionary.
---
@@ -64,18 +262,43 @@ When invoked with `--text`, pdftract emits a single UTF-8 string rather than JSO
For large documents, the `--stream` flag activates NDJSON output: one JSON object per line, emitted as each page completes extraction. The first line is a document header frame containing the `schema_version`, `metadata`, `outline`, and a `total_pages` count. Each subsequent page frame contains a single page object in the same schema as the `pages` array entries in full-document mode, plus a `frame: "page"` discriminator field. The final line is a document footer frame (`frame: "footer"`) carrying `extraction_quality`, `errors`, `threads`, `attachments`, `signatures`, `form_fields`, and document-scoped `links` — all the fields that can only be finalized after all pages have been processed. This design allows consumers to begin processing page one while pages two through N are still being extracted, which is essential for large documents and server-side streaming APIs.
### Frame Sequence
1. **Header frame:** `{"frame":"header","schema_version":"1.0","metadata":{...},"outline":[...],"total_pages":N}`
2. **Page frames:** `{"frame":"page","page_index":N,...}`, emitted in `page_index` order; pages that finish out of order are buffered in a window of at most 8 pages
3. **Footer frame:** `{"frame":"footer","extraction_quality":{...},"errors":[...],"threads":[],"attachments":[],"signatures":[],"form_fields":[],"links":[]}`
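A consumer of this frame sequence can dispatch on the `frame` discriminator line by line. The sketch below reads from any iterable of lines; the function name and the sample stream are hypothetical, and a real consumer would process each page as it arrives instead of collecting them.

```python
import io
import json


def read_stream(lines):
    """Split an NDJSON stream into (header, page_frames, footer)."""
    header, footer, pages = None, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between frames
        frame = json.loads(line)
        kind = frame.get("frame")
        if kind == "header":
            header = frame
        elif kind == "page":
            pages.append(frame)
        elif kind == "footer":
            footer = frame
        # Unknown frame kinds are ignored, in the spirit of additive evolution.
    return header, pages, footer


# Illustrative stream with abbreviated frames.
sample = io.StringIO(
    '{"frame":"header","schema_version":"1.0","metadata":{},"outline":[],"total_pages":2}\n'
    '{"frame":"page","page_index":0,"blocks":[]}\n'
    '{"frame":"page","page_index":1,"blocks":[]}\n'
    '{"frame":"footer","extraction_quality":{},"errors":[]}\n'
)

header, page_frames, footer = read_stream(sample)
```

Comparing `len(page_frames)` with `header["total_pages"]` after the footer arrives is a cheap way to detect a truncated stream.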
---
## Error and Diagnostic Schema
Every diagnostic event from the extraction pipeline is recorded in the `errors` array at document level. Each entry has: `code` (a stable string identifier like `"FONT_CMAP_MISSING"`, `"GLYPH_UNMAPPED"`, `"OCR_FALLBACK"`, `"XREF_REPAIRED"`, `"ENCRYPTION_UNSUPPORTED"`), `message` (a human-readable description), `page_index` (integer or `null` for document-level events), `severity` (one of `"error"`, `"warning"`, `"info"`), and `location` (an optional object with `object_number` and `generation_number` identifying the PDF indirect object where the issue originated). Error codes are namespaced by area: `FONT_*` for encoding failures, `OCR_*` for raster fallback events, `STRUCT_*` for structure tree problems, `XREF_*` for cross-reference repairs. Integration developers can key on codes programmatically rather than parsing messages, which remain subject to wording changes between releases.
### Error Entry Fields
| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Stable string identifier (e.g., `"FONT_CMAP_MISSING"`) |
| `message` | string | Human-readable description |
| `page_index` | integer\|null | Page index where error occurred, or `null` for document-level |
| `severity` | string | One of `"error"`, `"warning"`, `"info"` |
| `location` | object\|null | PDF object reference with `object_number` and `generation_number` |
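Keying on codes rather than messages can be sketched as below. The helper names and the sample entries are hypothetical; the codes themselves are the examples listed in this section.

```python
from collections import Counter


def count_by_area(errors):
    """Count diagnostics per code namespace (the prefix before the first '_')."""
    return Counter(entry["code"].split("_", 1)[0] for entry in errors)


def pages_with(errors, code):
    """Return sorted page indices on which a given code was reported."""
    return sorted(
        {e["page_index"] for e in errors if e["code"] == code and e["page_index"] is not None}
    )


# Illustrative errors array; messages are placeholders.
errors = [
    {"code": "FONT_CMAP_MISSING", "message": "...", "page_index": 0, "severity": "warning", "location": None},
    {"code": "FONT_CMAP_MISSING", "message": "...", "page_index": 2, "severity": "warning", "location": None},
    {"code": "OCR_FALLBACK", "message": "...", "page_index": 2, "severity": "info", "location": None},
    {"code": "XREF_REPAIRED", "message": "...", "page_index": None, "severity": "warning", "location": None},
]

by_area = count_by_area(errors)
font_pages = pages_with(errors, "FONT_CMAP_MISSING")
```

Document-level events carry `page_index: null`, so per-page queries need the explicit `None` check shown here.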
---
## Versioning and Stability
## Schema Version Compatibility
The root document object carries `schema_version: "1.0"`. All fields documented here are stable in the 1.x series: their names, types, and semantics will not change in a breaking way. New fields may be added to any object in minor releases; consumers must ignore unknown fields. Fields marked with `"experimental": true` in the specification are exempt from the stability guarantee and may be removed or renamed between minor versions.
The `extensions` object at the root level is reserved for non-breaking additions that have not yet graduated to stable status. Extension fields use a namespaced key format (`"pdftract.ocr.engine_version"`) to avoid collision with future stable fields. Consumers that rely on extension fields must treat them as experimental regardless of the version in which they appear.
### Additive Evolution Rules
This schema follows JSON-Schema-style additive-evolution rules (see [plan.md lines 3659-3685](../plan/plan.md#L3659-L3685)):
- `schema_version: "1.1"` SHALL be a **strict superset** of `"1.0"`: every `"1.0"`-valid document SHALL also be `"1.1"`-valid
- New fields are optional; no field is removed; no field's semantic meaning changes within a major version
- Semantic changes to an existing field require a major version bump, reflected in `schema_version` (e.g., `"2.0"`)
- Downstream consumers reading `"1.1"` output with a `"1.0"`-aware parser MUST tolerate unknown fields (the schema explicitly sets `additionalProperties: true` for the v1.x line)
This schema is designed so that a consumer written against version 1.0 will continue to function correctly when processing output from any 1.x release, receiving richer data it may ignore rather than encountering structural incompatibilities. Major version increments (2.0, 3.0) signal breaking changes and require explicit consumer migration.
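On the consumer side, these rules reduce to two habits: check only the major version, and ignore fields you do not recognize. A minimal sketch, assuming `schema_version` is always a `"major.minor"` string; the function names are hypothetical and the field set is copied from the Root Fields table.

```python
SUPPORTED_MAJOR = 1

# Root fields a 1.0-aware consumer knows about.
KNOWN_ROOT_FIELDS = {
    "schema_version", "fingerprint", "metadata", "pages", "outline", "threads",
    "attachments", "signatures", "form_fields", "links", "extraction_quality", "errors",
}


def is_compatible(doc):
    """Accept any document whose major version matches; minor is irrelevant."""
    major = int(doc["schema_version"].split(".")[0])
    return major == SUPPORTED_MAJOR


def known_view(doc):
    """Project a document onto the fields this consumer understands."""
    return {key: value for key, value in doc.items() if key in KNOWN_ROOT_FIELDS}


# A hypothetical 1.3 document carrying a field that 1.0 never defined.
newer = {"schema_version": "1.3", "pages": [], "future_field": {"x": 1}}

ok = is_compatible(newer)
view = known_view(newer)
```

Rejecting on an exact version match instead (for example requiring `"1.0"`) would break the consumer on every minor release, which is the failure mode the additive rules are meant to prevent.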