pdftract

jedarden bf37f0f05f	2026-05-24 00:59:23 -04:00

docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields

This commit brings docs/research/extraction-output-schema.md to v1.0 final-pass specification, aligning with Phase 6.1 deliverables and plan requirements.

Key additions:
- page_number field documented with page_index relationship (1-based vs 0-based)
- page_type enum expanded with all six values: text, scanned, mixed, broken_vector, blank, figure_only, with broken_vector cross-referenced to Phase 5.5
- Block kind enum fully documented: paragraph, heading, list, table, figure, caption, code, formula, watermark, header, footer
- Attachments schema with base64 contentEncoding and 50MB truncation rule
- Profile-based classification fields (document_type, document_type_confidence, document_type_reasons, profile_name, profile_version, profile_fields)
- Schema Version Compatibility section with additive-evolution rules
- JSON Schema cross-reference throughout

Format changes:
- Restructured with ATX headings (## for sections)
- Added explicit field tables for each major schema section
- Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json
- Grew from 81 lines to 304 lines per acceptance criteria

Plan references:
- Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659
- INV-9 page_type taxonomy stability

Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>

adr	feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction	2026-05-23 12:30:26 -04:00
conformance	feat(pdftract-5omc): implement SDK conformance test runner pattern	2026-05-18 01:22:23 -04:00
notes	feat(pdftract-3zhf): add unified TableDetector::detect entry point	2026-05-24 00:51:59 -04:00
plan	feat(pdftract-3zhf): add unified TableDetector::detect entry point	2026-05-24 00:51:59 -04:00
research	docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields	2026-05-24 00:59:23 -04:00
schema/v1.0	feat(pdftract-3zhf): add unified TableDetector::detect entry point	2026-05-24 00:51:59 -04:00
security	docs(pdftract-58kz): add security policy documentation	2026-05-20 19:39:24 -04:00
user-docs	docs(pdftract-1g87): create mdBook scaffolding for user documentation	2026-05-18 00:38:51 -04:00
research-index.md	Add parallel extraction research and comprehensive research index	2026-05-16 16:30:35 -04:00