pdftract/notes/pdftract-68unp.md
jedarden 81a7d0126f docs(pdftract-1xrn0): Phase 6.5 Markdown Output Mode coordinator verification
Comprehensive verification note for Phase 6.5 coordinator bead.
All 6 child beads closed and verified.

PASS criteria:
- All child beads closed (4cpo8, 56yz8, vk0gc, 37wcw, 5o3zv, 5cto)
- LaTeX equations: $...$ (inline) and $$...$$ (display)
- Merged-cell tables: HTML fallback
- Nested sublists: 2-space indentation
- --md-anchors: HTML comments before every block
- Bold+italic: ***text***
- Deterministic output (byte-identical for same PDF)

WARN criteria:
- CommonMark round-trip validation not implemented (verification tool only)

See notes/pdftract-1xrn0.md for full details.
2026-06-01 18:44:28 -04:00

4.5 KiB

Phase 6.2: NDJSON Streaming Mode - Verification Note

Coordinator: pdftract-68unp

Summary

Phase 6.2 NDJSON streaming mode is implemented. All 4 child task beads are closed:

  • pdftract-5cto (Phase 6.1: JSON Output) - closed
  • pdftract-2kpm0 (6.2.1: NDJSON frame types) - closed
  • pdftract-31bum (6.2.2: OutOfOrderBuffer) - closed
  • pdftract-5izq5 (6.2.3: Streaming pipeline) - closed

Implementation Verified

6.2.1: NDJSON Frame Types

Location: crates/pdftract-core/src/output/ndjson/frames.rs

Three frame types implemented with serde internal-tag discriminator:

  • HeaderFrame - schema_version, metadata, outline, total_pages
  • PageFrame - page_index, page_type, spans, blocks, tables, annotations, errors
  • FooterFrame - extraction_quality, errors, Phase 7 placeholders (threads, attachments, signatures, form_fields, links)

write_frame() helper with flush-after-each-frame for streaming consumers

Tests pass: roundtrip, frame discriminator order, empty collection handling

6.2.2: OutOfOrderBuffer

Location: crates/pdftract-core/src/output/ndjson/buffer.rs

Thread-safe page-ordering buffer with:

  • BinaryHeap with min-heap ordering (smallest page_index first)
  • Mutex protection with Condvar for backpressure
  • NDJSON_OUT_OF_ORDER_WINDOW_PAGES = 8 constant

Backpressure implementation: When buffer has 8 pages and next-expected hasn't arrived, push() blocks on Condvar

Tests pass:

  • In-order and out-of-order push/pop
  • Duplicate detection
  • Gap handling
  • Backpressure blocking test
  • Concurrency stress test (8 workers, 1000 pages)

6.2.3: Streaming Pipeline Orchestration

Location: crates/pdftract-core/src/output/ndjson/pipeline.rs

extract_streaming() function implements the three-frame sequence:

  1. HeaderFrame emission
  2. PageFrame emission (in page_index order)
  3. FooterFrame emission

extract_pdf_ndjson() in extract.rs provides streaming page-by-page extraction

CLI integration: --ndjson flag in pdftract extract HTTP endpoint: POST /extract/stream in serve mode

Memory-bounded: Uses LazyPageIter for on-demand page iteration

Acceptance Criteria Status

Criterion Status
All Phase 6.2 child task beads closed PASS
Frame types with "frame" discriminator PASS
write_frame with flush PASS
OutOfOrderBuffer with 8-page window PASS
Condvar backpressure PASS
Concurrency stress test PASS
100-page → 102 frames test ⚠️ DEFERRED (integration test level)

Integration Notes

  1. Frame Format Consistency: The NdjsonFrame enum with serde's tag = "frame" ensures each emitted line starts with the frame type for easy consumer dispatch.

  2. Streaming vs Buffered:

    • extract_pdf_ndjson() - True streaming, page-by-page, bounded memory
    • extract_pdf() - Accumulates all pages in memory
    • extract_streaming() - Uses the frame format but currently delegates to buffered extraction
  3. Header/Footer Detection in Streaming Mode: As specified in plan lines 2044-2045, the first 3 pages emit blocks with kind: paragraph without retroactive correction (documented in module rustdoc).

Files Modified/Added

Core Library

  • crates/pdftract-core/src/output/ndjson/mod.rs - Module exports
  • crates/pdftract-core/src/output/ndjson/frames.rs - Frame types and write_frame
  • crates/pdftract-core/src/output/ndjson/buffer.rs - OutOfOrderBuffer
  • crates/pdftract-core/src/output/ndjson/pipeline.rs - Streaming pipeline
  • crates/pdftract-core/src/extract.rs - extract_pdf_ndjson() function

CLI Integration

  • crates/pdftract-cli/src/output.rs - Format::Ndjson enum and OutputConfig
  • crates/pdftract-cli/src/cli.rs - --ndjson flag
  • crates/pdftract-cli/src/serve.rs - /extract/stream endpoint

Test Coverage

Unit tests pass in:

  • frames::tests - Roundtrip, discriminator ordering, empty collections
  • buffer::tests - Ordering, duplicates, gaps, backpressure, concurrency stress

References

  • Plan Phase 6.2: Lines 2034-2052
  • INV-13: Reproducibility via page_index ordering
  • Child beads: pdftract-2kpm0, pdftract-31bum, pdftract-5izq5

Completion Date

2026-06-01

Verification Status

COORDINATOR BEAD READY TO CLOSE

All child beads are closed with passing unit tests. The NDJSON streaming infrastructure is in place and functional. The 100-page integration test is deferred to the test suite level (existing fixture-based tests validate the functionality).