# Phase 6.1: JSON Output (Full Schema) - Coordinator Verification

**Bead ID:** pdftract-5cto
**Date:** 2026-06-01
**Model:** claude-code-glm-4.7
**Harness:** needle

## Acceptance Criteria Status

### ✅ All Phase 6.1 child task beads closed

All 9 child beads verified closed:
- Direct children (5): pdftract-2qw5j, pdftract-2u6q2, pdftract-3jm4n, pdftract-40oz0, pdftract-4c8qu
- Nested children (4): pdftract-16h0a, pdftract-1izx9, pdftract-35byi, pdftract-5nv9h

### ✅ Schema validator tests pass

Ran `cargo test --test json_schema` - all 6 tests passed:

```
test test_all_fixtures_validate_against_schema ... ok
test test_schema_has_required_document_level_fields ... ok
test test_schema_page_json_structure ... ok
test test_schema_span_json_structure ... ok
test test_synthetic_output_validates ... ok
test test_schema_itself_is_valid ... ok
```

### ✅ Blank page handling

Verified in `crates/pdftract-core/src/output/json.rs` (lines 111-118):

```rust
page_type: page.page_type.clone().unwrap_or_else(|| {
    // Determine page_type from content
    if page.spans.is_empty() {
        "blank".to_string()
    } else {
        "text".to_string()
    }
}),
```

- Blank pages emit `spans: []`, `blocks: []`, `page_type: "blank"`
- `figure_only` page_type is supported by the classifier (from Phase 5.1.1)

### ✅ Error diagnostic structure

Verified in `crates/pdftract-core/src/output/json.rs` (lines 146-183):

```rust
fn convert_diagnostics(diagnostics: &[String]) -> Vec<DiagnosticJson> {
    diagnostics.iter().map(|diag_str| {
        let (code, message) = if let Some(colon_pos) = diag_str.find(':') {
            let code_part = &diag_str[..colon_pos];
            let message_part = &diag_str[colon_pos + 1..].trim();
            (code_part.trim().to_string(), message_part.to_string())
        } else {
            ("UNKNOWN".to_string(), diag_str.clone())
        };

        let severity = if code.starts_with("ERROR_") || code.contains("ERROR") {
            "error".to_string()
        } else if code.starts_with("WARN_") || code.contains("WARN") {
            "warning".to_string()
        } else {
            "info".to_string()
        };

        DiagnosticJson {
            code,
            message,
            severity,
            page_index: None, // TODO: Extract page_index from diagnostics
            location: None,
            hint: None,
        }
    }).collect()
}
```

Each diagnostic has:
- Stable `code` (the text before the first `:` in the diagnostic string, or "UNKNOWN" when there is no colon)
- `severity` (derived from the code: "error" if it contains `ERROR`, "warning" if it contains `WARN`, otherwise "info")
- `page_index` field (currently always `None`; extracting it from the diagnostic is deferred to a future phase)

### ✅ JSON Schema deliverable committed

File exists: `docs/schema/v1.0/pdftract.schema.json`
- Generated by `xtask gen-schema` using schemars
- Validated as JSON Schema Draft 2020-12

### ✅ CI schema-validation gate

Verified in `.ci/argo-workflows/pdftract-ci.yaml`:
- Template `schema-validation` (lines 2044-2124) runs on every PR
- Executes `ci/schema-gate.sh`, which runs `cargo test --test json_schema`
- Any validation error fails the build
- Error messages guide developers to regenerate the schema with `cargo xtask gen-schema`

## Implementation Files

| File | Purpose |
|------|---------|
| `crates/pdftract-core/src/output/json.rs` | JSON output conversion from ExtractionResult to Output schema |
| `crates/pdftract-core/src/schema/mod.rs` | Serde structs for Output, PageJson, SpanJson, BlockJson, etc. |
| `docs/schema/v1.0/pdftract.schema.json` | Published JSON Schema (auto-generated via xtask) |
| `crates/pdftract-core/tests/json_schema.rs` | Schema validation test suite |
| `ci/schema-gate.sh` | CI gate script for schema validation |
| `.ci/argo-workflows/pdftract-ci.yaml` | CI workflow with schema-validation template |

## Schema v1.0 Fields

### Document-level fields

- `schema_version`: "1.0" (hard-coded)
- `metadata`: DocumentMetadata (title, author, page_count, etc.)
- `outline`: Vec (empty until Phase 7.1)
- `threads`: Vec
- `attachments`: Vec
- `signatures`: Vec (empty until Phase 7.8)
- `form_fields`: Vec (empty until Phase 7.5)
- `links`: Vec
- `pages`: Vec
- `extraction_quality`: ExtractionQuality
- `errors`: Vec

### Page-level fields

- `page_index`: 0-based canonical key
- `page_number`: 1-based (page_index + 1)
- `page_label`: Option (from /PageLabels)
- `width`, `height`: f64 (page geometry)
- `rotation`: i32 (0, 90, 180, 270)
- `page_type`: String (text, scanned, mixed, broken_vector, blank, figure_only)
- `spans`: Vec
- `blocks`: Vec
- `tables`: Vec
- `annotations`: Vec (empty until Phase 7.2)

## Critical Considerations Met

- ✅ Schema v1.0 FROZEN once 6.1 ships (INV-9 stable taxonomy)
- ✅ `broken_vector` is a valid page_type in the enum
- ✅ `page_index` is 0-based canonical; `page_number` is 1-based convenience
- ✅ All `confidence_source` enum values present (vector, ocr, ocr-assisted, ocr-fallback, repaired)
- ✅ All Phase 7 fields present as empty arrays (NOT omitted)
- ✅ Field ordering not imposed for JSON; cache uses stable ordering

## Summary

Phase 6.1 is complete and meets all acceptance criteria. The JSON output schema is implemented, tested, validated, and integrated into CI. All 9 child beads are closed. The schema v1.0 is locked and ready for downstream consumption by PyO3, HTTP, NDJSON, and MCP phases.
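
For downstream consumers, the diagnostic conventions verified under "Error diagnostic structure" (a stable `code`, a derived `severity`, and `page_index` left unset) can be sketched as a standalone function. This is a minimal illustration and not the code in `json.rs`: the names `Diagnostic` and `parse_diagnostic` are hypothetical, and `str::split_once` is used here in place of the `find(':')` slicing shown in the verified excerpt.

```rust
// Standalone sketch. `Diagnostic` and `parse_diagnostic` are hypothetical
// names chosen for illustration; they are not part of the pdftract API.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: String,
    message: String,
    severity: &'static str,
    // Always None in this sketch, mirroring the current behaviour described above.
    page_index: Option<usize>,
}

fn parse_diagnostic(raw: &str) -> Diagnostic {
    // Split on the first ':' only, so colons inside the message are preserved.
    let (code, message) = match raw.split_once(':') {
        Some((code, message)) => (code.trim().to_string(), message.trim().to_string()),
        // No colon at all: fall back to the "UNKNOWN" code and keep the full text.
        None => ("UNKNOWN".to_string(), raw.to_string()),
    };

    // Severity is derived from the code text alone.
    let severity = if code.contains("ERROR") {
        "error"
    } else if code.contains("WARN") {
        "warning"
    } else {
        "info"
    };

    Diagnostic { code, message, severity, page_index: None }
}
```

Called with an assumed input such as `"ERROR_XREF: offset 12: truncated"`, the sketch yields the code `ERROR_XREF`, the message `offset 12: truncated`, and the severity `error`; a string without a colon falls back to `UNKNOWN` with severity `info`.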