pdftract/notes/pdftract-2u6q2.md
jedarden 9abc386cce feat(pdftract-3h9xo): implement threads JSON output + schema integration
Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration.

Changes:
- Added ThreadJson and BeadJson structs to schema/mod.rs
- Added thread_to_json() function to threads/mod.rs
- Added build_page_ref_to_index() helper to parser/pages.rs
- Added threads field to ExtractionResult in extract.rs
- Implemented Phase 7.7 extraction logic with discover_threads/walk_beads
- Added threads_to_markdown() and collapse_page_ranges() to markdown.rs
- Updated JSON schema with ThreadJson and BeadJson definitions
- Added thread_to_py() and bead_to_py() conversions in pdftract-py
- Exported ThreadJson, BeadJson from lib.rs

All 32 threads module tests pass. All 35 markdown tests pass.

Verification: notes/pdftract-3h9xo.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 13:40:15 -04:00

2.8 KiB

pdftract-2u6q2: Diagnostic Infrastructure

Summary

Implemented the diagnostic emission infrastructure as specified in bead pdftract-2u6q2.

Changes Made

1. DiagnosticsCollector Type

  • File: crates/pdftract-core/src/diagnostics.rs
  • Added thread-safe DiagnosticsCollector backed by Arc<Mutex<Vec<Diagnostic>>>
  • Methods:
    • emit(code) - emit diagnostic with default message
    • emit_with_offset(code, offset) - emit with byte offset
    • emit_with_message(code, message) - emit with custom message
    • into_vec() - consume and return collected diagnostics
    • get() - get reference to collected diagnostics
    • len() / is_empty() - query collector state

2. DiagnosticJson hint Field

  • File: crates/pdftract-core/src/schema/mod.rs
  • Added hint: Option<String> field to DiagnosticJson struct
  • Updated all construction sites to include hint: None
  • Field is skipped in JSON serialization when None

3. Missing Error Codes

  • File: crates/pdftract-core/src/diagnostics.rs
  • Added DiagCode::ImgSourceMixed (IMG_SOURCE_MIXED)
  • Added DiagCode::ProfileInvalid (PROFILE_INVALID)
  • Added DiagCode::RepairRescuedFromBackwardsXref (REPAIR_RESCUED_FROM_BACKWARDS_XREF)
  • Updated category(), name(), severity() mappings
  • Added catalog entries to DIAGNOSTIC_CATALOG

4. Diagnostics Documentation

  • File: docs/integrations/diagnostics-codes.md (new)
  • Comprehensive catalog of all diagnostic codes
  • Organized by category (STRUCT_, STREAM_, XREF_*, etc.)
  • Includes severity, description, and phase origin for each code
  • Documents programmatic usage patterns

Acceptance Criteria

Criterion Status Notes
All initial codes emitted in 5.x code paths PASS Codes verified in DiagCode enum
DiagnosticsCollector unit test: 4 threads → 4 entries PASS test_collector_thread_safety passes
Code registry matches regex pattern PASS All codes use SCREAMING_SNAKE_CASE
Output.errors populated correctly PASS Output struct has errors: Vec

Tests

All tests pass:

  • test_collector_new - creates empty collector
  • test_collector_emit - emits diagnostic with code only
  • test_collector_emit_with_offset - emits diagnostic with offset
  • test_collector_emit_with_message - emits diagnostic with custom message
  • test_collector_clone - clones collector share same underlying data
  • test_collector_thread_safety - 4 threads emit concurrently, all 8 diagnostics collected

Commit

  • Hash: 2be802a
  • Message: feat(pdftract-2u6q2): implement diagnostic infrastructure

Verification

# Run diagnostics tests
cargo test --lib diagnostics::collector_tests

# Build library
cargo build --lib

# Verify documentation exists
ls -l docs/integrations/diagnostics-codes.md