# Phase 6.1: JSON Output (Full Schema) - Coordinator Verification

**Bead ID:** pdftract-5cto
**Date:** 2026-06-01
**Model:** claude-code-glm-4.7
**Harness:** needle

## Acceptance Criteria Status

### ✅ All Phase 6.1 child task beads closed

All 9 child beads verified closed:
- Direct children (5): pdftract-2qw5j, pdftract-2u6q2, pdftract-3jm4n, pdftract-40oz0, pdftract-4c8qu
- Nested children (4): pdftract-16h0a, pdftract-1izx9, pdftract-35byi, pdftract-5nv9h

### ✅ Schema validator tests pass

Ran `cargo test --test json_schema` - all 6 tests passed:
```
test test_all_fixtures_validate_against_schema ... ok
test test_schema_has_required_document_level_fields ... ok
test test_schema_page_json_structure ... ok
test test_schema_span_json_structure ... ok
test test_synthetic_output_validates ... ok
test test_schema_itself_is_valid ... ok
```

### ✅ Blank page handling

Verified in `crates/pdftract-core/src/output/json.rs` (lines 111-118):
```rust
page_type: page.page_type.clone().unwrap_or_else(|| {
    // Determine page_type from content
    if page.spans.is_empty() {
        "blank".to_string()
    } else {
        "text".to_string()
    }
}),
```

- Blank pages emit `spans: []`, `blocks: []`, `page_type: "blank"`
- `figure_only` page_type is supported by the classifier (from Phase 5.1.1)

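The fallback in the excerpt above only runs when the classifier left `page_type` unset, so a classifier result such as `figure_only` is passed through even for a page without spans. A minimal, std-only sketch of that behaviour follows; the `Page` struct and the `resolve_page_type` name are hypothetical stand-ins for illustration, not the types in `pdftract-core`.

```rust
/// Hypothetical, trimmed-down page model: only the two fields the fallback reads.
struct Page {
    /// Set by the classifier when it ran; `None` otherwise.
    page_type: Option<String>,
    /// Extracted text spans (plain strings here for brevity).
    spans: Vec<String>,
}

/// Returns the classifier's page_type when present; otherwise falls back to
/// "blank" for a page with no spans and "text" for anything else.
fn resolve_page_type(page: &Page) -> String {
    page.page_type.clone().unwrap_or_else(|| {
        if page.spans.is_empty() {
            "blank".to_string()
        } else {
            "text".to_string()
        }
    })
}
```

With this sketch, an unclassified page with no spans resolves to `"blank"`, an unclassified page with at least one span resolves to `"text"`, and a page classified as `figure_only` keeps that value regardless of its span count.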
### ✅ Error diagnostic structure

Verified in `crates/pdftract-core/src/output/json.rs` (lines 146-183):
```rust
fn convert_diagnostics(diagnostics: &[String]) -> Vec<DiagnosticJson> {
    diagnostics.iter().map(|diag_str| {
        let (code, message) = if let Some(colon_pos) = diag_str.find(':') {
            let code_part = &diag_str[..colon_pos];
            let message_part = &diag_str[colon_pos + 1..].trim();
            (code_part.trim().to_string(), message_part.to_string())
        } else {
            ("UNKNOWN".to_string(), diag_str.clone())
        };

        let severity = if code.starts_with("ERROR_") || code.contains("ERROR") {
            "error".to_string()
        } else if code.starts_with("WARN_") || code.contains("WARN") {
            "warning".to_string()
        } else {
            "info".to_string()
        };

        DiagnosticJson {
            code,
            message,
            severity,
            page_index: None, // TODO: Extract page_index from diagnostics
            location: None,
            hint: None,
        }
    }).collect()
}
```

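Each diagnostic has:
- Stable `code` (the text before the first colon in the diagnostic string, or "UNKNOWN" when the string has no colon)
- `severity` (derived from the code: "error" if it contains `ERROR`, "warning" if it contains `WARN`, otherwise "info")
- `page_index` field (currently always `None`; extraction from the diagnostic string is left as a TODO for future phases)
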
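To make the mapping concrete, here is a small std-only sketch of the same split-and-classify logic applied to a single string. The `Diagnostic` struct, the `parse_diagnostic` name, and the sample diagnostic strings are assumptions made for illustration; the real conversion is the `convert_diagnostics` function quoted above. Because `contains` already covers the `starts_with` case, the sketch uses `contains` alone.

```rust
/// Hypothetical, reduced diagnostic record (the real DiagnosticJson has more fields).
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: String,
    message: String,
    severity: &'static str,
}

/// Splits a raw diagnostic at the first colon and derives a severity from the code.
fn parse_diagnostic(raw: &str) -> Diagnostic {
    // Everything before the first ':' is the code, the rest is the message.
    let (code, message) = match raw.split_once(':') {
        Some((code, message)) => (code.trim().to_string(), message.trim().to_string()),
        // No colon at all: keep the whole string as the message.
        None => ("UNKNOWN".to_string(), raw.to_string()),
    };

    let severity = if code.contains("ERROR") {
        "error"
    } else if code.contains("WARN") {
        "warning"
    } else {
        "info"
    };

    Diagnostic { code, message, severity }
}
```

Under these assumptions, `"WARN_FONT_FALLBACK: glyph map missing"` yields code `WARN_FONT_FALLBACK` with severity `warning`, and a string without a colon yields code `UNKNOWN` with severity `info`. Only the first colon splits the string, so later colons stay in the message. One consequence worth noting: free text that happens to contain a colon (for example `"Page 3: odd stream"`) would produce the code `Page 3`, which is not a taxonomy code.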
### ✅ JSON Schema deliverable committed

File exists: `docs/schema/v1.0/pdftract.schema.json`
- Generated by `xtask gen-schema` using schemars
- Validated as JSON Schema Draft 2020-12

### ✅ CI schema-validation gate

Verified in `.ci/argo-workflows/pdftract-ci.yaml`:
- Template `schema-validation` (lines 2044-2124) runs on every PR
- Executes `ci/schema-gate.sh` which runs `cargo test --test json_schema`
- Any validation error fails the build
- Error messages guide developers to regenerate schema with `cargo xtask gen-schema`

## Implementation Files

| File | Purpose |
|------|---------|
| `crates/pdftract-core/src/output/json.rs` | JSON output conversion from ExtractionResult to Output schema |
| `crates/pdftract-core/src/schema/mod.rs` | Serde structs for Output, PageJson, SpanJson, BlockJson, etc. |
| `docs/schema/v1.0/pdftract.schema.json` | Published JSON Schema (auto-generated via xtask) |
| `crates/pdftract-core/tests/json_schema.rs` | Schema validation test suite |
| `ci/schema-gate.sh` | CI gate script for schema validation |
| `.ci/argo-workflows/pdftract-ci.yaml` | CI workflow with schema-validation template |

## Schema v1.0 Fields

### Document-level fields
- `schema_version`: "1.0" (hard-coded)
- `metadata`: `DocumentMetadata` (title, author, page_count, etc.)
- `outline`: `Vec<OutlineNode>` (empty until Phase 7.1)
- `threads`: `Vec<ThreadJson>`
- `attachments`: `Vec<AttachmentJson>`
- `signatures`: `Vec<SignatureJson>` (empty until Phase 7.8)
- `form_fields`: `Vec<FormFieldJson>` (empty until Phase 7.5)
- `links`: `Vec<LinkJson>`
- `pages`: `Vec<PageJson>`
- `extraction_quality`: `ExtractionQuality`
- `errors`: `Vec<DiagnosticJson>`

### Page-level fields
- `page_index`: 0-based canonical key
- `page_number`: 1-based (`page_index + 1`)
- `page_label`: `Option<String>` (from /PageLabels)
- `width`, `height`: `f64` (page geometry)
- `rotation`: `i32` (0, 90, 180, 270)
- `page_type`: `String` (text, scanned, mixed, broken_vector, blank, figure_only)
- `spans`: `Vec<SpanJson>`
- `blocks`: `Vec<BlockJson>`
- `tables`: `Vec<TableJson>`
- `annotations`: `Vec<AnnotationJson>` (empty until Phase 7.2)

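The page-level constraints listed above (1-based `page_number` derived from `page_index`, four allowed rotations, six `page_type` values) can be expressed as a few small checks. The sketch below is illustrative only: the function names, the `PAGE_TYPES` constant, and the use of `usize` for the index are assumptions, not the definitions in `crates/pdftract-core/src/schema/mod.rs`.

```rust
/// The six page_type values listed for schema v1.0.
const PAGE_TYPES: [&str; 6] = [
    "text",
    "scanned",
    "mixed",
    "broken_vector",
    "blank",
    "figure_only",
];

/// Derives the 1-based convenience number from the 0-based canonical index.
/// (`usize` is an assumption; the source does not state the integer type.)
fn page_number(page_index: usize) -> usize {
    page_index + 1
}

/// True when the rotation is one of the four values the schema lists.
fn is_valid_rotation(rotation: i32) -> bool {
    matches!(rotation, 0 | 90 | 180 | 270)
}

/// True when the string is one of the listed page_type values.
fn is_known_page_type(page_type: &str) -> bool {
    PAGE_TYPES.contains(&page_type)
}
```

Consumers that key on pages would use `page_index`; `page_number` exists only as a display convenience, which is why it is computed rather than stored independently in this sketch.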
## Critical Considerations Met

- ✅ Schema v1.0 FROZEN once 6.1 ships (INV-9 stable taxonomy)
- ✅ `broken_vector` is a valid page_type in the enum
- ✅ `page_index` is 0-based canonical; `page_number` is 1-based convenience
- ✅ All `confidence_source` enum values present (vector, ocr, ocr-assisted, ocr-fallback, repaired)
- ✅ All Phase 7 fields present as empty arrays (NOT omitted)
- ✅ Field ordering not imposed for JSON; cache uses stable ordering

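As an illustration of the `confidence_source` values listed above, the following std-only sketch maps a closed enum to the hyphenated wire strings and back. The `ConfidenceSource` type, its variant names, and the `as_str`/`parse` helpers are hypothetical; the real schema types are Serde structs and may be shaped differently.

```rust
/// Hypothetical enum covering the five confidence_source values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConfidenceSource {
    Vector,
    Ocr,
    OcrAssisted,
    OcrFallback,
    Repaired,
}

impl ConfidenceSource {
    /// Every variant, in the order the values are listed above.
    const ALL: [ConfidenceSource; 5] = [
        ConfidenceSource::Vector,
        ConfidenceSource::Ocr,
        ConfidenceSource::OcrAssisted,
        ConfidenceSource::OcrFallback,
        ConfidenceSource::Repaired,
    ];

    /// The string form used in the JSON output (hyphenated, lowercase).
    fn as_str(self) -> &'static str {
        match self {
            ConfidenceSource::Vector => "vector",
            ConfidenceSource::Ocr => "ocr",
            ConfidenceSource::OcrAssisted => "ocr-assisted",
            ConfidenceSource::OcrFallback => "ocr-fallback",
            ConfidenceSource::Repaired => "repaired",
        }
    }

    /// Parses a wire string; unknown values return None instead of a default.
    fn parse(s: &str) -> Option<ConfidenceSource> {
        Self::ALL.iter().copied().find(|v| v.as_str() == s)
    }
}
```

Returning `None` for unknown strings, rather than silently mapping them to a default, fits a frozen taxonomy: a value outside the five listed ones signals a schema mismatch rather than valid data.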
## Summary

Phase 6.1 is complete and meets all acceptance criteria. The JSON output schema is implemented, tested, validated, and integrated into CI. All 9 child beads are closed. Schema v1.0 is frozen and ready for downstream consumption by the PyO3, HTTP, NDJSON, and MCP phases.