jedarden
|
bf37f0f05f
|
docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields
This commit brings docs/research/extraction-output-schema.md to v1.0 final-pass
specification, aligning with Phase 6.1 deliverables and plan requirements.
**Key additions:**
- page_number field documented with page_index relationship (1-based vs 0-based)
- page_type enum expanded with all six values: text, scanned, mixed, broken_vector,
blank, figure_only — with broken_vector cross-referenced to Phase 5.5
- Block kind enum fully documented: paragraph, heading, list, table, figure, caption,
code, formula, watermark, header, footer
- Attachments schema with base64 contentEncoding and 50MB truncation rule
- Profile-based classification fields (document_type, document_type_confidence,
document_type_reasons, profile_name, profile_version, profile_fields)
- Schema Version Compatibility section with additive-evolution rules
- JSON Schema cross-reference throughout
**Format changes:**
- Restructured with ATX headings (## for sections)
- Added explicit field tables for each major schema section
- Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json
- Grew from 81 lines to 304 lines per acceptance criteria
**Plan references:**
- Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659
- INV-9 page_type taxonomy stability
Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>
|
2026-05-24 00:59:23 -04:00 |
|
jedarden
|
6b96d8d637
|
Add research: error handling, PDF/A guarantees, output schema, generator quirks
Four new extraction research documents covering permissive error handling
with extraction quality signaling (five error classes, circular reference
detection, memory limits), PDF/A conformance level guarantees and
fast-path optimization (Level A skips OCR and layout heuristics), the
complete extraction output schema (span/block/table/NDJSON streaming/
versioning), and per-generator extraction quirks (Word/LibreOffice/
InDesign/LaTeX/Chrome/Ghostscript/scanners).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-16 16:07:13 -04:00 |
|