pdftract/notes/pdftract-4453y.md
jedarden d528a69f36 docs(pdftract-4453y): add Phase 4.6 coordinator verification note
Phase 4.6 Output Serialization (Plain Text Mode) coordinator is complete.
All 5 child beads are closed and all acceptance criteria PASS.

Acceptance criteria verified:
- All children closed: pdftract-56txm, pdftract-2bpzs, pdftract-38p8h, pdftract-3bgxq, pdftract-529te
- 10-page doc: 9 form-feed characters (test_serialize_document_text_ten_pages)
- Header excluded by default; included with flag
- Invisible Tr=3 excluded by default
- Text round-trips with join

Test results:
- text.rs: 160 tests passed
- options.rs: 41 tests passed

Closes pdftract-4453y.
2026-06-07 15:55:54 -04:00

Phase 4.6: Output Serialization Plain Text (Coordinator) - Verification Note

Summary

Phase 4.6 coordinator bead pdftract-4453y is complete. All 5 child beads are closed and the implementation satisfies all acceptance criteria.

Children Closed

All 5 dependencies of this coordinator are closed:

  1. pdftract-56txm: Phase 4.5: Reading Order (coordinator) - CLOSED
  2. pdftract-2bpzs: OutputOptions struct + block-kind filter - CLOSED
  3. pdftract-38p8h: Invisible text filter - CLOSED
  4. pdftract-3bgxq: Document-level serializer (form feed) - CLOSED
  5. pdftract-529te: Per-page block serializer - CLOSED

Acceptance Criteria Status

| Criterion | Status | Evidence |
|---|---|---|
| All children closed | PASS | All 5 children verified closed via bf show |
| 10-page doc: 9 form-feed characters | PASS | test_serialize_document_text_ten_pages verifies exactly 9 form feeds |
| Header block excluded by default; included with flag | PASS | test_serialize_page_text_header_excluded_by_default and test_serialize_page_text_header_included_when_flagged |
| Invisible Tr=3 excluded by default | PASS | test_should_include_span_invisible_mode_3_excluded_by_default and related tests |
| Text round-trips with join | PASS | All serialization tests pass; proper \n\n and \f joining |

Implementation Files

Core Implementation

  • crates/pdftract-core/src/text.rs - Per-page and document-level serialization
  • crates/pdftract-core/src/options.rs - OutputOptions struct with filtering

CLI Integration

  • crates/pdftract-cli/src/main.rs - CLI flags wired to options:
    • --include-headers / --include-footers / --include-headers-footers
    • --include-invisible-text
    • --include-watermarks

Key Invariants Verified

Form Feed Invariant (INV)

  • N pages → N-1 form feeds: serialize_document_text joins the page texts with join("\u{000C}"), which emits a separator only between adjacent pages
  • No leading form feed
  • No trailing form feed
  • Empty page in middle: form feed before AND after

Test evidence:

  • test_serialize_document_text_one_page: 0 form feeds
  • test_serialize_document_text_two_pages: 1 form feed
  • test_serialize_document_text_ten_pages: 9 form feeds
  • test_serialize_document_text_empty_page_in_middle: 2 form feeds with empty middle page
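The form feed invariant follows directly from how a slice join behaves. The block below is a minimal sketch of that idea, not the actual code in text.rs: the signature taking `&[String]` and the `count_form_feeds` helper are assumptions made for illustration.

```rust
/// Form feed (U+000C), the page separator named in this note.
const FORM_FEED: &str = "\u{000C}";

/// Joins already-serialized page texts with one form feed between pages.
/// Hypothetical signature: the real `serialize_document_text` may take
/// richer page types and options.
pub fn serialize_document_text(pages: &[String]) -> String {
    // `join` places the separator only between elements, so N pages
    // yield N-1 separators and none at the start or end.
    pages.join(FORM_FEED)
}

/// Illustrative helper (not from the source): counts form feeds in the output.
pub fn count_form_feeds(text: &str) -> usize {
    text.matches('\u{000C}').count()
}
```

Under these assumptions, an empty page in the middle is simply an empty string between two separators, which is why it shows up as two consecutive form feeds.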

Block-Kind Filtering

  • Header/Footer excluded by default (controlled via OutputOptions)
  • Watermark excluded by default (a no-op until Phase 7)
  • Filtering at block level via OutputOptions::include_block_kind

Test evidence:

  • test_serialize_page_text_header_excluded_by_default: header not in output
  • test_serialize_page_text_header_included_when_flagged: header included with flag
  • test_serialize_page_text_watermark_excluded_by_default: watermark not in output
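Block-level filtering can be pictured as a predicate on the block kind that the page serializer consults before emitting a block. This is a hedged sketch only: the field names on `OutputOptions`, the reduced `BlockKind` enum, and the tuple-based block representation are all assumptions. The note confirms only the struct name and the `include_block_kind` method.

```rust
/// Hypothetical subset of block kinds relevant to filtering.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockKind {
    Paragraph,
    Header,
    Footer,
    Watermark,
}

/// Field names are assumptions; `Default` gives `false` for every flag,
/// which matches "excluded by default".
#[derive(Debug, Clone, Copy, Default)]
pub struct OutputOptions {
    pub include_headers: bool,
    pub include_footers: bool,
    pub include_watermarks: bool,
}

impl OutputOptions {
    /// Returns true when a block of this kind should be emitted.
    pub fn include_block_kind(&self, kind: BlockKind) -> bool {
        match kind {
            BlockKind::Header => self.include_headers,
            BlockKind::Footer => self.include_footers,
            BlockKind::Watermark => self.include_watermarks,
            BlockKind::Paragraph => true,
        }
    }
}

/// Emits the surviving blocks joined by a blank line.
/// Blocks are modeled as (kind, text) pairs purely for illustration.
pub fn serialize_page_text(blocks: &[(BlockKind, &str)], opts: &OutputOptions) -> String {
    blocks
        .iter()
        .filter(|(kind, _)| opts.include_block_kind(*kind))
        .map(|(_, text)| *text)
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Keeping the decision in one method means the CLI flags only need to set fields on the options value; the serializer itself never inspects flags directly.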

Invisible Text Filtering (SPAN-level)

  • Filtering at SPAN level, not block level
  • Tr=3 excluded by default (invisible)
  • Tr=4-7 (the rendering modes that add text to the clipping path) are filtered the same way as Tr=3
  • Mixed-visibility blocks: visible emitted, invisible dropped
  • All-invisible blocks: no spurious \n\n

Test evidence:

  • test_should_include_span_invisible_mode_3_excluded_by_default: Tr=3 filtered
  • test_should_include_span_invisible_mode_4_excluded_by_default through test_should_include_span_invisible_mode_7: Tr=4-7 filtered
  • test_compute_block_text_from_spans_mixed_visibility: mixed block emits only visible
  • test_serialize_page_text_all_invisible_block_omitted: all-invisible produces empty (no \n\n)
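The span-level behavior described above can be sketched as follows. The function names echo the test names in this note, but the `Span` struct, the boolean `include_invisible` parameter (standing in for whatever option backs `--include-invisible-text`), and the `Option<String>` return type are assumptions, not the real signatures.

```rust
/// A text span with its PDF text rendering mode (the `Tr` value).
/// Hypothetical shape; the real span type likely carries more fields.
#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    pub render_mode: u8,
}

/// Per this note, Tr=3 through Tr=7 are dropped unless the caller opts in.
pub fn should_include_span(span: &Span, include_invisible: bool) -> bool {
    include_invisible || !matches!(span.render_mode, 3..=7)
}

/// Concatenates the spans that survive filtering.
/// Returns `None` when nothing survives, so the caller can skip the block.
pub fn compute_block_text_from_spans(spans: &[Span], include_invisible: bool) -> Option<String> {
    let text: String = spans
        .iter()
        .filter(|s| should_include_span(s, include_invisible))
        .map(|s| s.text.as_str())
        .collect();
    if text.is_empty() { None } else { Some(text) }
}

/// Joins non-empty blocks with a blank line; all-invisible blocks vanish
/// without leaving a stray "\n\n" behind.
pub fn serialize_page_text(blocks: &[Vec<Span>], include_invisible: bool) -> String {
    blocks
        .iter()
        .filter_map(|spans| compute_block_text_from_spans(spans, include_invisible))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Dropping empty blocks before the join, rather than after, is what prevents the spurious separator in this sketch.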

Paragraph/Block Joining

  • Blocks separated by \n\n
  • Paragraph/Heading/Caption/Quote: space-joined lines
  • List/Code: newline-joined lines
  • Figure: empty string (no text content)

Test evidence:

  • test_serialize_page_text_three_paragraphs: "Foo\n\nBar\n\nBaz"
  • test_serialize_page_text_heading_and_paragraphs: "Title\n\nP1\n\nP2"
  • test_serialize_page_text_list: lines joined with \n
  • test_serialize_page_text_figure_emits_empty: figure produces ""
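The joining rules above amount to a match on block kind. The sketch below illustrates them under stated assumptions: the `Block` struct and its `lines` field are hypothetical, and skipping empty block texts before the page-level join is an assumption of this sketch rather than confirmed behavior for figures.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockKind {
    Paragraph,
    Heading,
    Caption,
    Quote,
    List,
    Code,
    Figure,
}

/// Hypothetical block shape: a kind plus its already-extracted lines.
pub struct Block {
    pub kind: BlockKind,
    pub lines: Vec<String>,
}

/// Joins a block's lines according to its kind.
pub fn block_text(block: &Block) -> String {
    match block.kind {
        // Flowing text: line breaks are layout artifacts, so use spaces.
        BlockKind::Paragraph | BlockKind::Heading | BlockKind::Caption | BlockKind::Quote => {
            block.lines.join(" ")
        }
        // Line structure is meaningful for lists and code.
        BlockKind::List | BlockKind::Code => block.lines.join("\n"),
        // Figures contribute no text.
        BlockKind::Figure => String::new(),
    }
}

/// Joins block texts with a blank line.
/// Assumption: empty block texts are skipped before joining.
pub fn serialize_page_text(blocks: &[Block]) -> String {
    blocks
        .iter()
        .map(block_text)
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```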

Test Results

text.rs tests (160 passed)

cargo test -p pdftract-core --lib text
test result: ok. 160 passed; 0 failed; 0 ignored

Key tests verifying acceptance criteria:

  • test_serialize_document_text_ten_pages: 9 form feeds ✓
  • test_serialize_page_text_header_excluded_by_default ✓
  • test_serialize_page_text_header_included_when_flagged ✓
  • test_should_include_span_invisible_mode_3_excluded_by_default ✓
  • All rendering mode tests (Tr=0 through Tr=7) ✓

options.rs tests (41 passed)

cargo test -p pdftract-core --lib options
test result: ok. 41 passed; 0 failed; 0 ignored

Key tests:

  • test_output_options_default: all filters default to exclude ✓
  • test_output_options_include_block_kind: header/footer/watermark filtering ✓
  • test_output_options_include_span: invisible/hidden layer filtering ✓

References

  • Plan section: Phase 4.6 Output Serialization (lines 1760-1776)
  • Bead: pdftract-4453y
  • Children: pdftract-56txm, pdftract-2bpzs, pdftract-38p8h, pdftract-3bgxq, pdftract-529te

Conclusion

Phase 4.6 Output Serialization (Plain Text Mode) is fully implemented and verified. All acceptance criteria PASS. The implementation correctly:

  1. Joins blocks in reading order with \n\n
  2. Joins pages with form feed \f (exactly N-1 for N pages)
  3. Excludes headers/footers by default with CLI flags to include
  4. Excludes invisible text (Tr=3 through Tr=7) by default with CLI flag to include
  5. Filters at SPAN level for invisible text, BLOCK level for kinds