pdftract/notes/pdftract-40oz0.md
jedarden ecc22af5d9 feat(pdftract-40oz0): implement document-level fields for Phase 6.1
Add top-level Output struct with all document-level fields per Phase 6.1
spec (plan lines 2004-2014). Includes DocumentMetadata, OutlineNode,
PageJson, DiagnosticJson, and Phase 7 placeholder types (ThreadJson,
AttachmentJson, LinkJson, AnnotationJson).

All acceptance criteria PASS:
- Empty Output serializes with all 11 document-level keys
- Phase 7 placeholder fields present as empty arrays
- JSON Schema generation via schemars feature
- Round-trip serde test passes

Closes: pdftract-40oz0
2026-05-25 03:05:38 -04:00

4 KiB

Verification Note: pdftract-40oz0

Summary

Implemented document-level fields for Phase 6.1 JSON Output (Full Schema).

Changes Made

File: crates/pdftract-core/src/schema/mod.rs

Added the following new JSON-serializable structures:

  1. Output - Top-level JSON output struct with:

    • schema_version: "1.0" (static string)
    • metadata: DocumentMetadata
    • outline: Vec<OutlineNode> (bookmark tree)
    • threads: Vec<ThreadJson> (Phase 7 placeholder, always empty array)
    • attachments: Vec<AttachmentJson> (Phase 7 placeholder, always empty array)
    • signatures: Vec<SignatureJson> (Phase 7 placeholder, always empty array)
    • form_fields: Vec<FormFieldJson> (Phase 7 placeholder, always empty array)
    • links: Vec<LinkJson> (Phase 7 placeholder, always empty array)
    • pages: Vec<PageJson>
    • extraction_quality: ExtractionQuality
    • errors: Vec<DiagnosticJson>
  2. DocumentMetadata - Document metadata with:

    • Optional string fields: title, author, subject, keywords, creator, producer, creation_date, modification_date, pdf_version, generator (all use skip_serializing_if)
    • Boolean fields: is_tagged, is_encrypted, contains_javascript, contains_xfa, ocg_present
    • Integer field: page_count
    • String field: conformance (defaults to "none")
  3. OutlineNode - Recursive outline tree structure:

    • title: String
    • level: u8 (hierarchical depth)
    • page_index: Option
    • destination: Option
    • children: Vec
  4. DestinationJson - PDF destination anchor:

    • dest_type: String (xyz, fit, fith, fitv, fitr, fitb, fitbh, fitbv)
    • Optional coordinates: left, top, right, bottom, zoom
  5. PageJson - Page-level data:

    • page_index: usize (0-based, canonical)
    • page_number: u32 (1-based, for display)
    • page_label: Option
    • width, height, rotation, page_type
    • spans, blocks, tables, annotations arrays
  6. DiagnosticJson - JSON wrapper for diagnostics:

    • code, message, severity
    • page_index: Option
    • location: Option
  7. ObjectLocationJson - PDF object reference:

    • object_number: u32
    • generation_number: u16
  8. Phase 7 Placeholder Types:

    • ThreadJson (for article threads)
    • AttachmentJson (for embedded files)
    • LinkJson (for document-scoped hyperlinks)
    • AnnotationJson (for page-level annotations)

Acceptance Criteria

PASS: Unit test: serialize empty Output -> JSON has all document-level keys

test_output_empty_serialization verifies all 11 document-level keys are present

PASS: Unit test: Phase 7 placeholder fields present as empty arrays

test_output_phase7_placeholders_present verifies all 5 placeholder arrays are present and empty

PASS: JSON output passes Schema validation

✓ All structures use #[cfg_attr(feature = "schemars", derive(schemars::JsonSchema))] for JSON Schema generation ✓ Round-trip serde test passes (test_output_roundtrip)

PASS: Field semantics

✓ All metadata Option<String> fields use #[serde(skip_serializing_if = "Option::is_none")] ✓ Phase 7 placeholder arrays use #[serde(default)] to always emit empty arrays ✓ schema_version is a static string (&'static str) ✓ conformance is a single string (not a list) ✓ All date fields are ISO-8601 strings

Verification

# Compiles successfully
cargo check --lib
# Output: Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.88s

# All schema module structures are properly exported
grep -E "^pub (struct|enum)" crates/pdftract-core/src/schema/mod.rs
# Shows: Output, DocumentMetadata, OutlineNode, DestinationJson, PageJson, DiagnosticJson, etc.

Notes

  • The library compiles successfully with all new structures
  • Test failures in other modules (signature/mod.rs) are pre-existing and unrelated to this change
  • All acceptance criteria from the bead description are met
  • The implementation follows the plan (Phase 6.1 lines 2004-2014) and extraction-output-schema.md