pdftract/docs/research
jedarden bf37f0f05f docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields
This commit brings docs/research/extraction-output-schema.md to its final v1.0
specification, aligned with the Phase 6.1 deliverables and plan requirements.

**Key additions:**
- page_number field documented with page_index relationship (1-based vs 0-based)
- page_type enum expanded to all six values (text, scanned, mixed, broken_vector,
  blank, figure_only), with broken_vector cross-referenced to Phase 5.5
- Block kind enum fully documented: paragraph, heading, list, table, figure, caption,
  code, formula, watermark, header, footer
- Attachments schema with base64 contentEncoding and 50MB truncation rule
- Profile-based classification fields (document_type, document_type_confidence,
  document_type_reasons, profile_name, profile_version, profile_fields)
- Schema Version Compatibility section with additive-evolution rules
- JSON Schema cross-reference throughout
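
For illustration only, the two enums and the page numbering relationship listed above could be modeled roughly as follows. This is a minimal sketch, not the schema itself: the class names `PageType` and `BlockKind` and the helper `page_number_for` are hypothetical, and the sketch assumes `page_number` is the 1-based counterpart of a 0-based `page_index`. The authoritative definitions are in the JSON Schema and in extraction-output-schema.md.

```python
from enum import Enum


class PageType(str, Enum):
    """The six page_type values named in the commit message."""
    TEXT = "text"
    SCANNED = "scanned"
    MIXED = "mixed"
    BROKEN_VECTOR = "broken_vector"
    BLANK = "blank"
    FIGURE_ONLY = "figure_only"


class BlockKind(str, Enum):
    """The eleven block kind values named in the commit message."""
    PARAGRAPH = "paragraph"
    HEADING = "heading"
    LIST = "list"
    TABLE = "table"
    FIGURE = "figure"
    CAPTION = "caption"
    CODE = "code"
    FORMULA = "formula"
    WATERMARK = "watermark"
    HEADER = "header"
    FOOTER = "footer"


def page_number_for(page_index: int) -> int:
    """Hypothetical helper: map a 0-based page_index to a 1-based page_number.

    Assumption: page_number == page_index + 1, as suggested by the
    "1-based vs 0-based" note in the commit message.
    """
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    return page_index + 1
```

Constructing an enum member from an unknown string, for example `PageType("cover")`, raises `ValueError`, which is one simple way a consumer might notice a value outside the documented taxonomy.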

**Format changes:**
- Restructured with ATX headings (## for sections)
- Added explicit field tables for each major schema section
- Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json
- Grew from 81 lines to 304 lines per acceptance criteria

**Plan references:**
- Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659
- INV-9 page_type taxonomy stability

Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>
2026-05-24 00:59:23 -04:00
.gitkeep Initial repo scaffold with README and docs structure 2026-05-16 14:26:16 -04:00
accessibility-and-tagged-pdf-deep-dive.md Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms 2026-05-16 15:45:59 -04:00
adversarial-inputs-and-parser-security.md Add research: Indic scripts, adversarial parser security 2026-05-16 16:18:03 -04:00
article-threads-and-reading-order.md Add research: article threads, resource dictionaries, catalog, hyperlinks 2026-05-16 16:04:00 -04:00
benchmark-and-test-methodology.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
book-and-publishing-pdf-patterns.md Add research: page labels, government forms, book publishing, filter decoding 2026-05-16 15:55:08 -04:00
chunking-for-llm-consumption.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
cjk-and-asian-script-encoding.md Add three research documents: CJK encoding, pipeline synthesis, linearization 2026-05-16 15:26:36 -04:00
cmap-format-and-cid-encoding.md Add three research documents on parser correctness fundamentals 2026-05-16 15:16:41 -04:00
color-management-and-icc-profiles.md Add research: color management, text metrics, PDF/X, content stream operators 2026-05-16 15:59:02 -04:00
complex-layout-reading-order.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
confidence-scoring-and-aggregation.md Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs 2026-05-16 15:35:48 -04:00
content-stream-concatenation.md Add three research documents on parser correctness fundamentals 2026-05-16 15:16:41 -04:00
content-stream-operators.md Add research: color management, text metrics, PDF/X, content stream operators 2026-05-16 15:59:02 -04:00
digital-signatures-and-certification.md Add research: color visibility, medical/scientific, multilingual, digital signatures 2026-05-16 15:41:43 -04:00
document-catalog-and-structure.md Add research: article threads, resource dictionaries, catalog, hyperlinks 2026-05-16 16:04:00 -04:00
document-classification-and-zone-labeling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
embedded-files-and-portfolios.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
engineering-document-extraction.md Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs 2026-05-16 15:35:48 -04:00
error-handling-and-robustness.md Add research: error handling, PDF/A guarantees, output schema, generator quirks 2026-05-16 16:07:13 -04:00
extraction-output-schema.md docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields 2026-05-24 00:59:23 -04:00
extraction-pipeline-overview.md Add three research documents: CJK encoding, pipeline synthesis, linearization 2026-05-16 15:26:36 -04:00
font-descriptor-and-metrics.md Add research: xref parsing, object model, font descriptors, PDF/UA-2 2026-05-16 16:01:34 -04:00
font-subsetting-and-extraction.md Add research: font subsetting, LaTeX patterns, redaction detection 2026-05-16 15:30:52 -04:00
form-fields-and-annotations.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
glyph-recognition-and-unicode-recovery.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
government-form-pdf-patterns.md Add research: page labels, government forms, book publishing, filter decoding 2026-05-16 15:55:08 -04:00
graphics-state-tracking.md Add three research documents on parser correctness fundamentals 2026-05-16 15:16:41 -04:00
historical-and-degraded-document-extraction.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
hyperlinks-and-named-destinations.md Add research: article threads, resource dictionaries, catalog, hyperlinks 2026-05-16 16:04:00 -04:00
image-and-figure-extraction.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
image-compression-and-filter-decoding.md Add research: page labels, government forms, book publishing, filter decoding 2026-05-16 15:55:08 -04:00
incremental-updates-and-versioning.md Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms 2026-05-16 15:45:59 -04:00
indic-script-extraction.md Add research: Indic scripts, adversarial parser security 2026-05-16 16:18:03 -04:00
invisible-and-hidden-text.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
javascript-and-interactive-pdf-extraction.md Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms 2026-05-16 15:45:59 -04:00
language-detection-and-script-handling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
latex-and-scientific-pdf-patterns.md Add research: font subsetting, LaTeX patterns, redaction detection 2026-05-16 15:30:52 -04:00
legal-and-financial-pdf-patterns.md Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs 2026-05-16 15:35:48 -04:00
linearized-pdf-and-streaming.md Add three research documents: CJK encoding, pipeline synthesis, linearization 2026-05-16 15:26:36 -04:00
malformed-pdf-repair-and-recovery.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
mathematical-expression-handling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
medical-and-scientific-pdf-patterns.md Add research: color visibility, medical/scientific, multilingual, digital signatures 2026-05-16 15:41:43 -04:00
multilingual-document-extraction.md Add research: color visibility, medical/scientific, multilingual, digital signatures 2026-05-16 15:41:43 -04:00
opentype-math-and-formula-extraction.md Add research: Southeast Asian scripts, OpenType MATH formula extraction 2026-05-16 16:21:48 -04:00
optional-content-groups.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
page-geometry-and-document-structure.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
page-labels-and-outline-extraction.md Add research: page labels, government forms, book publishing, filter decoding 2026-05-16 15:55:08 -04:00
parallel-extraction-architecture.md Add parallel extraction research and comprehensive research index 2026-05-16 16:30:35 -04:00
pdf-encryption-and-security.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
pdf-fonts-and-encoding.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
pdf-generator-quirks.md Add research: error handling, PDF/A guarantees, output schema, generator quirks 2026-05-16 16:07:13 -04:00
pdf-object-model-and-data-types.md Add research: xref parsing, object model, font descriptors, PDF/UA-2 2026-05-16 16:01:34 -04:00
pdf-portfolio-and-attachments.md Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms 2026-05-16 15:45:59 -04:00
pdf-specification.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
pdfa-archival-extraction-guarantees.md Add research: error handling, PDF/A guarantees, output schema, generator quirks 2026-05-16 16:07:13 -04:00
pdfa-compliance-and-extraction.md Add three research documents on routing and text reconstruction 2026-05-16 15:22:08 -04:00
pdfua2-and-accessibility-standards.md Add research: xref parsing, object model, font descriptors, PDF/UA-2 2026-05-16 16:01:34 -04:00
pdfvt-variable-transactional-printing.md Add research: Ruby/furigana typography, PDF/VT variable printing 2026-05-16 16:24:21 -04:00
pdfx-prepress-extraction.md Add research: color management, text metrics, PDF/X, content stream operators 2026-05-16 15:59:02 -04:00
performance-and-streaming-architecture.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
post-extraction-normalization.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
post-ocr-text-correction.md Add four research documents on text quality and document-type handling 2026-05-16 15:07:30 -04:00
presentation-and-spreadsheet-pdfs.md Add four research documents on text quality and document-type handling 2026-05-16 15:07:30 -04:00
raster-ocr-pipeline.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
redaction-detection-and-recovery.md Add research: font subsetting, LaTeX patterns, redaction detection 2026-05-16 15:30:52 -04:00
resource-dictionary-and-inheritance.md Add research: article threads, resource dictionaries, catalog, hyperlinks 2026-05-16 16:04:00 -04:00
ruby-text-and-east-asian-typography.md Add research: Ruby/furigana typography, PDF/VT variable printing 2026-05-16 16:24:21 -04:00
scanned-vs-vector-page-classification.md Add three research documents on routing and text reconstruction 2026-05-16 15:22:08 -04:00
semantic-text-reconstruction.md Add four research documents on text quality and document-type handling 2026-05-16 15:07:30 -04:00
shading-pattern-and-text-visibility.md Add research: color visibility, medical/scientific, multilingual, digital signatures 2026-05-16 15:41:43 -04:00
southeast-asian-script-extraction.md Add research: Southeast Asian scripts, OpenType MATH formula extraction 2026-05-16 16:21:48 -04:00
span-merging-and-text-run-assembly.md Add research: span merging, Unicode normalization, implementation plan 2026-05-16 16:15:14 -04:00
stroke-and-outlined-text.md Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs 2026-05-16 15:35:48 -04:00
table-structure-reconstruction.md feat(pdftract-ilen): implement header row detection with bold+TH support 2026-05-23 23:32:54 -04:00
tagged-pdf-structure-and-reading-order.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
text-positioning-and-font-metrics.md Add research: color management, text metrics, PDF/X, content stream operators 2026-05-16 15:59:02 -04:00
text-readability-validation.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
type3-font-extraction.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
unicode-normalization-and-text-cleanup.md Add research: span merging, Unicode normalization, implementation plan 2026-05-16 16:15:14 -04:00
watermark-and-background-separation.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
word-boundary-reconstruction.md Add three research documents on routing and text reconstruction 2026-05-16 15:22:08 -04:00
xmp-and-document-metadata.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
xref-table-parsing-and-object-lookup.md Add research: xref parsing, object model, font descriptors, PDF/UA-2 2026-05-16 16:01:34 -04:00