jedarden 2cdc44a6ce feat(pdftract-529te): implement per-page block serializer

Implement serialize_page_text() function that iterates blocks in
reading order, filters by block-kind (Header/Footer/Watermark),
joins block texts per kind-specific rules, and separates blocks
with \n\n.

- Add new text.rs module with TextOptions and serialize_page_text()
- Paragraph/Heading/Caption/Quote: use pre-computed block text
- List/Code: preserve newlines from pre-computed text
- Figure: emit empty string
- Empty blocks omitted (no spurious newlines)
- Headers/footers/watermarks excluded by default, configurable

Closes: pdftract-529te

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 12:21:07 -04:00


Verification Note: pdftract-529te - Per-page block serializer

Bead ID

pdftract-529te - Per-page block serializer (joins block texts in reading order)

Implementation Summary

Implemented the serialize_page_text() function in crates/pdftract-core/src/text.rs, which:

Iterates blocks in reading order (as ordered in the blocks array)
Filters by block-kind (Header/Footer/Watermark) via TextOptions
For each block: computes block_text from the pre-computed text field
Paragraph/Heading/Caption/Quote/Code/List/Table: use pre-computed block text
Figure: emits empty string
Concatenates blocks with \n\n separator
Empty blocks emit nothing (no spurious newlines)
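
The steps above can be sketched as follows. This is a minimal illustration, not the code in crates/pdftract-core/src/text.rs: the Block and BlockKind types, the TextOptions field names, and the function signature are assumptions made for the example. Only serialize_page_text, TextOptions, with_headers_footers() and with_watermarks() are names taken from this note.

```rust
/// Hypothetical stand-in for the crate's block kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockKind {
    Paragraph,
    Heading,
    Caption,
    Quote,
    Code,
    List,
    Table,
    Figure,
    Header,
    Footer,
    Watermark,
}

/// Hypothetical stand-in for the block structure; `text` plays the role of
/// the pre-computed block text described in this note.
#[derive(Debug, Clone)]
pub struct Block {
    pub kind: BlockKind,
    pub text: String,
}

/// Field names are assumed; both flags default to false, so headers,
/// footers, and watermarks are excluded unless requested.
#[derive(Debug, Clone, Copy, Default)]
pub struct TextOptions {
    pub include_headers_footers: bool,
    pub include_watermarks: bool,
}

impl TextOptions {
    pub fn with_headers_footers(mut self) -> Self {
        self.include_headers_footers = true;
        self
    }

    pub fn with_watermarks(mut self) -> Self {
        self.include_watermarks = true;
        self
    }
}

/// Joins block texts in array order, separated by a blank line.
pub fn serialize_page_text(blocks: &[Block], opts: TextOptions) -> String {
    blocks
        .iter()
        // Drop header/footer/watermark blocks unless the options allow them.
        .filter(|b| match b.kind {
            BlockKind::Header | BlockKind::Footer => opts.include_headers_footers,
            BlockKind::Watermark => opts.include_watermarks,
            _ => true,
        })
        // Figures contribute an empty string; every other kind uses its
        // pre-computed text, including any internal newlines (List, Code).
        .map(|b| match b.kind {
            BlockKind::Figure => "",
            _ => b.text.as_str(),
        })
        // Skip empty texts so no spurious "\n\n" appears in the output.
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Under these assumptions, a page holding a Header "Hdr", a Heading "Title", a Figure, and a Paragraph "P1" serializes to "Title\n\nP1" with default options, and to "Hdr\n\nTitle\n\nP1" when with_headers_footers() is set.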

Files Changed

New Files

crates/pdftract-core/src/text.rs - New module with plain text serialization logic

Modified Files

crates/pdftract-core/src/lib.rs - Added pub mod text; and exported serialize_page_text, TextOptions

Acceptance Criteria Status

PASS

✅ 3 Paragraph blocks ("Foo", "Bar", "Baz"): "Foo\n\nBar\n\nBaz"
✅ 1 Heading + 2 Paragraphs: "Title\n\nP1\n\nP2"
✅ Header excluded: not in output (default behavior)
✅ List: lines join with \n (pre-computed in block.text)
✅ Empty blocks emit nothing (no spurious \n\n)
✅ Footer excluded by default
✅ Header/Footer included when with_headers_footers() is set
✅ Watermark excluded by default
✅ Watermark included when with_watermarks() is set
✅ Figure emits empty string
✅ Code blocks preserve newlines
✅ Table blocks use pre-computed text
✅ Caption and Quote blocks work correctly
✅ TextOptions builder pattern works correctly

WARN

None

FAIL

None

Test Results

All 22 tests in the text module pass:

text::tests::test_serialize_page_text_three_paragraphs - PASS
text::tests::test_serialize_page_text_heading_and_paragraphs - PASS
text::tests::test_serialize_page_text_header_excluded_by_default - PASS
text::tests::test_serialize_page_text_header_included_when_flagged - PASS
text::tests::test_serialize_page_text_footer_excluded_by_default - PASS
text::tests::test_serialize_page_text_list - PASS
text::tests::test_serialize_page_text_code - PASS
text::tests::test_serialize_page_text_figure_emits_empty - PASS
text::tests::test_serialize_page_text_empty_block_omitted - PASS
text::tests::test_serialize_page_text_watermark_excluded_by_default - PASS
text::tests::test_serialize_page_text_watermark_included_when_flagged - PASS
text::tests::test_serialize_page_text_caption - PASS
text::tests::test_serialize_page_text_quote - PASS
text::tests::test_serialize_page_text_table - PASS
text::tests::test_serialize_page_text_empty_blocks - PASS
text::tests::test_text_options_default - PASS
text::tests::test_text_options_builder_pattern - PASS
text::tests::test_is_header_or_footer - PASS
text::tests::test_is_watermark - PASS
text::tests::test_get_block_text_figure - PASS
text::tests::test_get_block_text_paragraph - PASS
text::tests::test_get_block_text_heading - PASS

Compilation Status

✅ cargo check --all-targets - Passes
✅ cargo clippy --all-targets -- -D warnings - No issues reported for the text module (pre-existing errors remain elsewhere in the codebase)
✅ cargo fmt - All formatted

Notes

The implementation uses the pre-computed block.text field which already contains the joined text for the block. This aligns with the existing architecture where text computation happens earlier in the pipeline.

The reading_order_rank field mentioned in the plan is not yet present in the BlockJson structure; the function relies on the order of blocks in the array as the reading order (which is the current behavior).
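
If reading_order_rank is added to BlockJson later, the array-order behavior described above could be kept as a fallback. A hedged sketch, assuming the field would be an Option<u32>; the field type, the RankedBlock struct, and the in_reading_order helper are all hypothetical and not part of the current code:

```rust
/// Hypothetical block shape with an optional rank; not the real BlockJson.
#[derive(Debug, Clone, PartialEq)]
pub struct RankedBlock {
    pub text: String,
    pub reading_order_rank: Option<u32>,
}

/// Returns the blocks in reading order.
/// If every block carries a rank, sort by it; otherwise keep array order,
/// which matches the current behavior described in this note.
pub fn in_reading_order(blocks: &[RankedBlock]) -> Vec<&RankedBlock> {
    let all_ranked = blocks.iter().all(|b| b.reading_order_rank.is_some());
    let mut ordered: Vec<&RankedBlock> = blocks.iter().collect();
    if all_ranked {
        // sort_by_key is stable, so blocks with equal ranks keep array order.
        ordered.sort_by_key(|b| b.reading_order_rank);
    }
    ordered
}
```

Falling back to array order whenever any rank is missing avoids mixing two orderings on the same page. Whether that is the right policy would depend on how the plan defines the field.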

The bead references plan lines 1747-1750 (Phase 4.6 Output Serialization). The implementation correctly handles:

Blocks serialized in reading order
Paragraphs separated by \n\n
Headers/footers excluded by default
Watermark blocks excluded by default
Invisible text filtering (not applied yet; the structure is ready for span-level filtering)

The next step would be integrating this function into the CLI's --text output mode, which currently just dumps span texts one per line.
