pdftract/docs
jedarden bf37f0f05f docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields
This commit brings docs/research/extraction-output-schema.md to its final v1.0
specification, aligning it with the Phase 6.1 deliverables and plan requirements.

**Key additions:**
- page_number field documented with page_index relationship (1-based vs 0-based)
- page_type enum expanded with all six values: text, scanned, mixed, broken_vector,
  blank, figure_only (broken_vector is cross-referenced to Phase 5.5)
- Block kind enum fully documented: paragraph, heading, list, table, figure, caption,
  code, formula, watermark, header, footer
- Attachments schema with base64 contentEncoding and 50MB truncation rule
- Profile-based classification fields (document_type, document_type_confidence,
  document_type_reasons, profile_name, profile_version, profile_fields)
- Schema Version Compatibility section with additive-evolution rules
- JSON Schema cross-reference throughout
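
The additions above can be illustrated with a minimal sketch of a page-level consistency check. It is not the project's validator. It assumes that page_number is the 1-based counterpart of a 0-based page_index, that a page carries its blocks under a `blocks` key, and that each block stores its kind under a `kind` key. Those key names and the `check_page` helper are hypothetical; only the enum values come from the list above.

```python
# Enum values taken from the commit message; everything else is illustrative.
PAGE_TYPES = frozenset({
    "text", "scanned", "mixed", "broken_vector", "blank", "figure_only",
})

BLOCK_KINDS = frozenset({
    "paragraph", "heading", "list", "table", "figure", "caption",
    "code", "formula", "watermark", "header", "footer",
})


def check_page(page):
    """Return a list of human-readable problems found in one page object.

    Assumption: page_number == page_index + 1 (1-based vs 0-based).
    Assumption: blocks live under "blocks" and each has a "kind" field.
    """
    problems = []

    index = page.get("page_index")
    number = page.get("page_number")
    if not isinstance(index, int) or not isinstance(number, int):
        problems.append("page_index and page_number must both be integers")
    elif number != index + 1:
        problems.append(f"page_number {number} != page_index {index} + 1")

    page_type = page.get("page_type")
    if page_type not in PAGE_TYPES:
        problems.append(f"unknown page_type: {page_type!r}")

    for position, block in enumerate(page.get("blocks", [])):
        kind = block.get("kind")
        if kind not in BLOCK_KINDS:
            problems.append(f"block {position}: unknown kind {kind!r}")

    return problems


if __name__ == "__main__":
    sample = {
        "page_index": 0,
        "page_number": 1,
        "page_type": "broken_vector",
        "blocks": [{"kind": "heading"}, {"kind": "table"}],
    }
    print(check_page(sample))
```

For authoritative validation, the machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json remains the source of truth; a check like this is only useful as a quick sanity pass.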

**Format changes:**
- Restructured with ATX headings (## for sections)
- Added explicit field tables for each major schema section
- Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json
- Grew from 81 lines to 304 lines per acceptance criteria

**Plan references:**
- Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659
- INV-9 page_type taxonomy stability

Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>
2026-05-24 00:59:23 -04:00
| Name | Last commit | Date |
|---|---|---|
| adr | feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction | 2026-05-23 12:30:26 -04:00 |
| conformance | feat(pdftract-5omc): implement SDK conformance test runner pattern | 2026-05-18 01:22:23 -04:00 |
| notes | feat(pdftract-3zhf): add unified TableDetector::detect entry point | 2026-05-24 00:51:59 -04:00 |
| plan | feat(pdftract-3zhf): add unified TableDetector::detect entry point | 2026-05-24 00:51:59 -04:00 |
| research | docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields | 2026-05-24 00:59:23 -04:00 |
| schema/v1.0 | feat(pdftract-3zhf): add unified TableDetector::detect entry point | 2026-05-24 00:51:59 -04:00 |
| security | docs(pdftract-58kz): add security policy documentation | 2026-05-20 19:39:24 -04:00 |
| user-docs | docs(pdftract-1g87): create mdBook scaffolding for user documentation | 2026-05-18 00:38:51 -04:00 |
| research-index.md | Add parallel extraction research and comprehensive research index | 2026-05-16 16:30:35 -04:00 |