pdftract/notes/pdftract-5cto.md

Phase 6.1: JSON Output (Full Schema) - Coordinator Verification

Bead ID: pdftract-5cto
Date: 2026-06-01
Model: claude-code-glm-4.7
Harness: needle

Acceptance Criteria Status

All Phase 6.1 child task beads closed

All 9 child beads verified closed:

  • Direct children (5): pdftract-2qw5j, pdftract-2u6q2, pdftract-3jm4n, pdftract-40oz0, pdftract-4c8qu
  • Nested children (4): pdftract-16h0a, pdftract-1izx9, pdftract-35byi, pdftract-5nv9h

Schema validator tests pass

Ran cargo test --test json_schema - all 6 tests passed:

test test_all_fixtures_validate_against_schema ... ok
test test_schema_has_required_document_level_fields ... ok
test test_schema_page_json_structure ... ok
test test_schema_span_json_structure ... ok
test test_synthetic_output_validates ... ok
test test_schema_itself_is_valid ... ok

Blank page handling

Verified in crates/pdftract-core/src/output/json.rs (lines 111-118):

page_type: page.page_type.clone().unwrap_or_else(|| {
    // Determine page_type from content
    if page.spans.is_empty() {
        "blank".to_string()
    } else {
        "text".to_string()
    }
}),
  • Blank pages emit spans: [], blocks: [], page_type: "blank"
  • figure_only page_type is supported by the classifier (from Phase 5.1.1)
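The fallback above can be exercised in isolation. The following is a minimal sketch, not the crate's code: the Page struct and the resolve_page_type helper are hypothetical stand-ins that model only the two fields the excerpt touches, and spans are reduced to plain strings.

```rust
/// Hypothetical stand-in for the real page type; only the two fields
/// used by the fallback are modelled here.
struct Page {
    page_type: Option<String>,
    spans: Vec<String>,
}

/// Mirrors the excerpt: keep a classifier-assigned page_type if present,
/// otherwise infer "blank" or "text" from whether any spans exist.
fn resolve_page_type(page: &Page) -> String {
    page.page_type.clone().unwrap_or_else(|| {
        if page.spans.is_empty() {
            "blank".to_string()
        } else {
            "text".to_string()
        }
    })
}
```

Under these assumptions, a page with no spans and no classifier result resolves to "blank", while an explicit value such as "figure_only" passes through unchanged even when spans is empty.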

Error diagnostic structure

Verified in crates/pdftract-core/src/output/json.rs (lines 146-183):

fn convert_diagnostics(diagnostics: &[String]) -> Vec<DiagnosticJson> {
    diagnostics.iter().map(|diag_str| {
        let (code, message) = if let Some(colon_pos) = diag_str.find(':') {
            let code_part = &diag_str[..colon_pos];
            let message_part = &diag_str[colon_pos + 1..].trim();
            (code_part.trim().to_string(), message_part.to_string())
        } else {
            ("UNKNOWN".to_string(), diag_str.clone())
        };

        let severity = if code.starts_with("ERROR_") || code.contains("ERROR") {
            "error".to_string()
        } else if code.starts_with("WARN_") || code.contains("WARN") {
            "warning".to_string()
        } else {
            "info".to_string()
        };

        DiagnosticJson {
            code,
            message,
            severity,
            page_index: None, // TODO: Extract page_index from diagnostics
            location: None,
            hint: None,
        }
    }).collect()
}

Each diagnostic has:

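  • Stable code (the text before the first colon of the diagnostic string, or "UNKNOWN" if the string contains no colon)
  • severity (derived from the code: "error" if it contains ERROR, "warning" if it contains WARN, otherwise "info")
  • page_index field (currently always None; extracting it from the diagnostic string is a TODO deferred to future phases)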
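To make the parsing rules concrete, here is a minimal sketch of the same code/severity split as a standalone function. The name classify_diagnostic and the tuple return type are hypothetical, chosen for illustration, and the diagnostic codes mentioned afterwards are invented examples rather than codes taken from pdftract.

```rust
/// Hypothetical helper: split a diagnostic string into
/// (code, message, severity) using the same rules as the excerpt above.
fn classify_diagnostic(diag: &str) -> (String, String, &'static str) {
    // Everything before the first ':' is the code; the rest is the message.
    let (code, message) = match diag.split_once(':') {
        Some((code, message)) => (code.trim().to_string(), message.trim().to_string()),
        None => ("UNKNOWN".to_string(), diag.to_string()),
    };

    // Severity is a substring check on the code; "error" takes precedence.
    let severity = if code.contains("ERROR") {
        "error"
    } else if code.contains("WARN") {
        "warning"
    } else {
        "info"
    };

    (code, message, severity)
}
```

With this sketch, an input such as "WARN_FONT: missing glyph: U+FFFD" splits only at the first colon, so the message keeps its own colon. A string with no colon falls back to the "UNKNOWN" code and "info" severity. A free-form message that happens to contain a colon will have its leading text treated as the code.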

JSON Schema deliverable committed

File exists: docs/schema/v1.0/pdftract.schema.json

  • Generated by xtask gen-schema using schemars
  • Validated as JSON Schema Draft 2020-12

CI schema-validation gate

Verified in .ci/argo-workflows/pdftract-ci.yaml:

  • Template schema-validation (lines 2044-2124) runs on every PR
  • Executes ci/schema-gate.sh which runs cargo test --test json_schema
  • Any validation error fails the build
  • Error messages guide developers to regenerate schema with cargo xtask gen-schema

Implementation Files

| File | Purpose |
|------|---------|
| crates/pdftract-core/src/output/json.rs | JSON output conversion from ExtractionResult to Output schema |
| crates/pdftract-core/src/schema/mod.rs | Serde structs for Output, PageJson, SpanJson, BlockJson, etc. |
| docs/schema/v1.0/pdftract.schema.json | Published JSON Schema (auto-generated via xtask) |
| crates/pdftract-core/tests/json_schema.rs | Schema validation test suite |
| ci/schema-gate.sh | CI gate script for schema validation |
| .ci/argo-workflows/pdftract-ci.yaml | CI workflow with schema-validation template |

Schema v1.0 Fields

Document-level fields

  • schema_version: "1.0" (hard-coded)
  • metadata: DocumentMetadata (title, author, page_count, etc.)
  • outline: Vec (empty until Phase 7.1)
  • threads: Vec
  • attachments: Vec
  • signatures: Vec (empty until Phase 7.8)
  • form_fields: Vec (empty until Phase 7.5)
  • links: Vec
  • pages: Vec
  • extraction_quality: ExtractionQuality
  • errors: Vec
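Because several of these fields stay empty until later phases, a consumer may want to confirm that every top-level key is still present. The following is a rough consumer-side sketch under that assumption. The DOCUMENT_FIELDS constant and the missing_document_fields helper are hypothetical and not part of pdftract; the key names are copied from the list above.

```rust
/// Top-level keys of the v1.0 document object, as listed above.
const DOCUMENT_FIELDS: [&str; 11] = [
    "schema_version",
    "metadata",
    "outline",
    "threads",
    "attachments",
    "signatures",
    "form_fields",
    "links",
    "pages",
    "extraction_quality",
    "errors",
];

/// Hypothetical consumer-side check: returns the expected keys that are
/// absent from `present`, in schema order.
fn missing_document_fields(present: &[&str]) -> Vec<&'static str> {
    DOCUMENT_FIELDS
        .iter()
        .copied()
        .filter(|field| !present.contains(field))
        .collect()
}
```

An empty result means every document-level key was found. A non-empty result names the keys that were omitted from the output.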

Page-level fields

  • page_index: 0-based canonical key
  • page_number: 1-based (page_index + 1)
  • page_label: Option (from /PageLabels)
  • width, height: f64 (page geometry)
  • rotation: i32 (0, 90, 180, 270)
  • page_type: String (text, scanned, mixed, broken_vector, blank, figure_only)
  • spans: Vec
  • blocks: Vec
  • tables: Vec
  • annotations: Vec (empty until Phase 7.2)
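The relationship between page_index and page_number, and the allowed rotation values, can be sketched as two small helpers. Both function names are hypothetical, and usize for the index is an assumption, since the note does not state the integer type.

```rust
/// Hypothetical helper: derive the 1-based page_number from the
/// 0-based page_index (usize is an assumed type).
fn page_number(page_index: usize) -> usize {
    page_index + 1
}

/// Hypothetical check that a rotation is one of the four values
/// listed above (0, 90, 180, 270).
fn is_valid_rotation(rotation: i32) -> bool {
    matches!(rotation, 0 | 90 | 180 | 270)
}
```

Keeping page_index as the canonical key and deriving page_number from it means the two values cannot drift apart.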

Critical Considerations Met

  • Schema v1.0 is FROZEN once 6.1 ships (INV-9 stable taxonomy)
  • broken_vector is a valid page_type in the enum
  • page_index is 0-based canonical; page_number is 1-based convenience
  • All confidence_source enum values present (vector, ocr, ocr-assisted, ocr-fallback, repaired)
  • All Phase 7 fields present as empty arrays (NOT omitted)
  • Field ordering not imposed for JSON; cache uses stable ordering
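As an illustration of the confidence_source taxonomy, the five values listed above could be modelled as a closed enum with an explicit string mapping. This is a sketch only: the ConfidenceSource type, its ALL constant, and the as_str and parse methods are hypothetical names, and the real crate may represent the values differently (for example via serde attributes).

```rust
/// Hypothetical enum mirroring the confidence_source values listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConfidenceSource {
    Vector,
    Ocr,
    OcrAssisted,
    OcrFallback,
    Repaired,
}

impl ConfidenceSource {
    /// Every variant, so that parsing stays in sync with as_str.
    const ALL: [ConfidenceSource; 5] = [
        ConfidenceSource::Vector,
        ConfidenceSource::Ocr,
        ConfidenceSource::OcrAssisted,
        ConfidenceSource::OcrFallback,
        ConfidenceSource::Repaired,
    ];

    /// The wire string for each variant (hyphenated, as in the list above).
    fn as_str(self) -> &'static str {
        match self {
            ConfidenceSource::Vector => "vector",
            ConfidenceSource::Ocr => "ocr",
            ConfidenceSource::OcrAssisted => "ocr-assisted",
            ConfidenceSource::OcrFallback => "ocr-fallback",
            ConfidenceSource::Repaired => "repaired",
        }
    }

    /// Returns None for any string outside the frozen set.
    fn parse(s: &str) -> Option<ConfidenceSource> {
        Self::ALL.iter().copied().find(|v| v.as_str() == s)
    }
}
```

Rejecting unknown strings instead of mapping them to a default fits a frozen taxonomy, because a new value would then surface as an explicit failure.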

Summary

Phase 6.1 is complete and meets all acceptance criteria. The JSON output schema is implemented, tested, validated, and integrated into CI, and all 9 child beads are closed. Schema v1.0 is locked and ready for downstream consumption by the PyO3, HTTP, NDJSON, and MCP phases.