Phase 6.1: JSON Output (Full Schema) - Coordinator Verification
Bead ID: pdftract-5cto
Date: 2026-06-01
Model: claude-code-glm-4.7
Harness: needle
Acceptance Criteria Status
✅ All Phase 6.1 child task beads closed
All 9 child beads verified closed:
- Direct children (5): pdftract-2qw5j, pdftract-2u6q2, pdftract-3jm4n, pdftract-40oz0, pdftract-4c8qu
- Nested children (4): pdftract-16h0a, pdftract-1izx9, pdftract-35byi, pdftract-5nv9h
✅ Schema validator tests pass
Ran cargo test --test json_schema - all 6 tests passed:
test test_all_fixtures_validate_against_schema ... ok
test test_schema_has_required_document_level_fields ... ok
test test_schema_page_json_structure ... ok
test test_schema_span_json_structure ... ok
test test_synthetic_output_validates ... ok
test test_schema_itself_is_valid ... ok
✅ Blank page handling
Verified in crates/pdftract-core/src/output/json.rs (lines 111-118):
    page_type: page.page_type.clone().unwrap_or_else(|| {
        // Determine page_type from content
        if page.spans.is_empty() {
            "blank".to_string()
        } else {
            "text".to_string()
        }
    }),
- Blank pages emit spans: [], blocks: [], page_type: "blank"
- figure_only page_type is supported by the classifier (from Phase 5.1.1)
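The fallback above can be exercised in isolation. The following is a minimal sketch, not the code in json.rs: the Page struct and the resolve_page_type function are hypothetical stand-ins that model only the two fields the fallback reads.

```rust
/// Hypothetical stand-in for the real page type; only the two fields
/// read by the fallback are modelled here.
pub struct Page {
    pub page_type: Option<String>,
    pub spans: Vec<String>,
}

/// Returns the classifier's page_type when one is set. Otherwise falls
/// back to "blank" for a page with no spans and "text" for any other page.
pub fn resolve_page_type(page: &Page) -> String {
    page.page_type.clone().unwrap_or_else(|| {
        if page.spans.is_empty() {
            "blank".to_string()
        } else {
            "text".to_string()
        }
    })
}
```

Under these assumptions, a page the classifier has already labelled figure_only keeps that label even though its spans are empty; the "blank" fallback applies only when no page_type was set.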
✅ Error diagnostic structure
Verified in crates/pdftract-core/src/output/json.rs (lines 146-183):
    fn convert_diagnostics(diagnostics: &[String]) -> Vec<DiagnosticJson> {
        diagnostics.iter().map(|diag_str| {
            let (code, message) = if let Some(colon_pos) = diag_str.find(':') {
                let code_part = &diag_str[..colon_pos];
                let message_part = &diag_str[colon_pos + 1..].trim();
                (code_part.trim().to_string(), message_part.to_string())
            } else {
                ("UNKNOWN".to_string(), diag_str.clone())
            };
            let severity = if code.starts_with("ERROR_") || code.contains("ERROR") {
                "error".to_string()
            } else if code.starts_with("WARN_") || code.contains("WARN") {
                "warning".to_string()
            } else {
                "info".to_string()
            };
            DiagnosticJson {
                code,
                message,
                severity,
                page_index: None, // TODO: Extract page_index from diagnostics
                location: None,
                hint: None,
            }
        }).collect()
    }
Each diagnostic has:
- A stable code (the text before the first colon of the diagnostic string, or "UNKNOWN" when there is no colon)
- severity ("error" if the code contains ERROR, "warning" if it contains WARN, otherwise "info")
- page_index field (currently always None; extraction from diagnostics is left to a future phase)
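The parsing rules can be sketched as two small standalone functions. This is an illustration under stated assumptions rather than the repository's code: split_diagnostic and severity_for are hypothetical names, and str::split_once is used in place of the find-and-slice approach, which should give the same split at the first colon.

```rust
/// Splits a diagnostic of the form "CODE: message" at the first colon.
/// Strings without a colon get the code "UNKNOWN" and keep the full text
/// as the message.
pub fn split_diagnostic(diag: &str) -> (String, String) {
    match diag.split_once(':') {
        Some((code, message)) => (code.trim().to_string(), message.trim().to_string()),
        None => ("UNKNOWN".to_string(), diag.to_string()),
    }
}

/// Maps a code to a severity. The ERROR check runs first, so a code that
/// contains both markers is classified as "error".
pub fn severity_for(code: &str) -> &'static str {
    if code.contains("ERROR") {
        "error"
    } else if code.contains("WARN") {
        "warning"
    } else {
        "info"
    }
}
```

One consequence worth keeping in mind: a free-form diagnostic that happens to contain a colon (for example "page 3: something") would have its prefix treated as the code and would be classified as "info".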
✅ JSON Schema deliverable committed
File exists: docs/schema/v1.0/pdftract.schema.json
- Generated by xtask gen-schema using schemars
- Validated as JSON Schema Draft 2020-12
✅ CI schema-validation gate
Verified in .ci/argo-workflows/pdftract-ci.yaml:
- Template schema-validation (lines 2044-2124) runs on every PR
- Executes ci/schema-gate.sh which runs cargo test --test json_schema
- Any validation error fails the build
- Error messages guide developers to regenerate schema with cargo xtask gen-schema
Implementation Files
| File | Purpose |
|---|---|
| crates/pdftract-core/src/output/json.rs | JSON output conversion from ExtractionResult to Output schema |
| crates/pdftract-core/src/schema/mod.rs | Serde structs for Output, PageJson, SpanJson, BlockJson, etc. |
| docs/schema/v1.0/pdftract.schema.json | Published JSON Schema (auto-generated via xtask) |
| crates/pdftract-core/tests/json_schema.rs | Schema validation test suite |
| ci/schema-gate.sh | CI gate script for schema validation |
| .ci/argo-workflows/pdftract-ci.yaml | CI workflow with schema-validation template |
Schema v1.0 Fields
Document-level fields
- schema_version: "1.0" (hard-coded)
- metadata: DocumentMetadata (title, author, page_count, etc.)
- outline: Vec (empty until Phase 7.1)
- threads: Vec
- attachments: Vec
- signatures: Vec (empty until Phase 7.8)
- form_fields: Vec (empty until Phase 7.5)
- links: Vec
- pages: Vec
- extraction_quality: ExtractionQuality
- errors: Vec
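Because the Phase 7 fields are emitted as empty arrays rather than omitted, a consumer can treat every document-level key as required. A minimal sketch of such a check, assuming the caller has already collected the top-level keys of a parsed output object (DOCUMENT_FIELDS and missing_document_fields are hypothetical names, not part of pdftract):

```rust
/// The eleven document-level keys listed for schema v1.0.
pub const DOCUMENT_FIELDS: [&str; 11] = [
    "schema_version",
    "metadata",
    "outline",
    "threads",
    "attachments",
    "signatures",
    "form_fields",
    "links",
    "pages",
    "extraction_quality",
    "errors",
];

/// Returns the document-level keys that are absent from `present`,
/// in schema order. An empty result means every key was emitted.
pub fn missing_document_fields(present: &[&str]) -> Vec<&'static str> {
    DOCUMENT_FIELDS
        .iter()
        .copied()
        .filter(|field| !present.contains(field))
        .collect()
}
```

A check like this only looks at key presence; validating the values themselves remains the job of the published JSON Schema.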
Page-level fields
- page_index: 0-based canonical key
- page_number: 1-based (page_index + 1)
- page_label: Option (from /PageLabels)
- width, height: f64 (page geometry)
- rotation: i32 (0, 90, 180, 270)
- page_type: String (text, scanned, mixed, broken_vector, blank, figure_only)
- spans: Vec
- blocks: Vec
- tables: Vec
- annotations: Vec (empty until Phase 7.2)
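The page-level invariants above are small enough to state as code. The sketch below is illustrative only: the PageType enum and the helper functions are hypothetical, and the schema itself carries page_type as a string.

```rust
/// The six page_type values listed for schema v1.0.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageType {
    Text,
    Scanned,
    Mixed,
    BrokenVector,
    Blank,
    FigureOnly,
}

impl PageType {
    /// The string form used in the JSON output.
    pub fn as_str(self) -> &'static str {
        match self {
            PageType::Text => "text",
            PageType::Scanned => "scanned",
            PageType::Mixed => "mixed",
            PageType::BrokenVector => "broken_vector",
            PageType::Blank => "blank",
            PageType::FigureOnly => "figure_only",
        }
    }

    /// Parses a page_type string; returns None for values outside the list.
    pub fn parse(s: &str) -> Option<PageType> {
        const ALL: [PageType; 6] = [
            PageType::Text,
            PageType::Scanned,
            PageType::Mixed,
            PageType::BrokenVector,
            PageType::Blank,
            PageType::FigureOnly,
        ];
        ALL.iter().copied().find(|t| t.as_str() == s)
    }
}

/// page_number is the 1-based convenience form of the 0-based page_index.
pub fn page_number(page_index: usize) -> usize {
    page_index + 1
}

/// rotation is restricted to the four quarter turns.
pub fn is_valid_rotation(rotation: i32) -> bool {
    matches!(rotation, 0 | 90 | 180 | 270)
}
```

Consumers that key on page_index and derive page_number this way should stay consistent with the schema's stated relationship between the two fields.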
Critical Considerations Met
- ✅ Schema v1.0 FROZEN once 6.1 ships (INV-9 stable taxonomy)
- ✅ broken_vector is a valid page_type in the enum
- ✅ page_index is 0-based canonical; page_number is 1-based convenience
- ✅ All confidence_source enum values present (vector, ocr, ocr-assisted, ocr-fallback, repaired)
- ✅ All Phase 7 fields present as empty arrays (NOT omitted)
- ✅ Field ordering not imposed for JSON; cache uses stable ordering
Summary
Phase 6.1 is complete and meets all acceptance criteria. The JSON output schema is implemented, tested, validated, and integrated into CI. All 9 child beads are closed. The schema v1.0 is locked and ready for downstream consumption by PyO3, HTTP, NDJSON, and MCP phases.