pdftract/notes/pdftract-529te.md
jedarden 2cdc44a6ce feat(pdftract-529te): implement per-page block serializer
Implement serialize_page_text() function that iterates blocks in
reading order, filters by block-kind (Header/Footer/Watermark),
joins block texts per kind-specific rules, and separates blocks
with \n\n.

- Add new text.rs module with TextOptions and serialize_page_text()
- Paragraph/Heading/Caption/Quote: use pre-computed block text
- List/Code: preserve newlines from pre-computed text
- Figure: emit empty string
- Empty blocks omitted (no spurious newlines)
- Headers/footers/watermarks excluded by default, configurable

Closes: pdftract-529te

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:21:07 -04:00


# Verification Note: pdftract-529te - Per-page block serializer
## Bead ID
pdftract-529te - Per-page block serializer (joins block texts in reading order)
## Implementation Summary
Implemented `serialize_page_text()` function in `crates/pdftract-core/src/text.rs` that:
- Iterates blocks in reading order (as ordered in the blocks array)
- Filters blocks by kind: Header, Footer, and Watermark blocks are excluded by default and can be included via `TextOptions`
- For each block: computes block_text from the pre-computed `text` field
- Paragraph/Heading/Caption/Quote/Code/List/Table: use pre-computed block text
- Figure: emits empty string
- Concatenates blocks with `\n\n` separator
- Empty blocks emit nothing (no spurious newlines)
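
The rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `text.rs`: the `Block` struct is a simplified stand-in for `BlockJson`, and the `BlockKind` variants, the `TextOptions` field names, and the argument-free builder signatures are all assumptions. It also assumes "empty" means a zero-length string; the real function may treat whitespace-only text differently.

```rust
/// Hypothetical block kinds; the real enum in pdftract-core may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BlockKind {
    Paragraph,
    Heading,
    Caption,
    Quote,
    Code,
    List,
    Table,
    Figure,
    Header,
    Footer,
    Watermark,
}

/// Simplified stand-in for `BlockJson`: only the fields this sketch needs.
#[derive(Clone, Debug)]
pub struct Block {
    pub kind: BlockKind,
    /// Pre-computed text, already joined earlier in the pipeline.
    pub text: String,
}

/// Assumed shape of the options; field names are illustrative.
#[derive(Clone, Copy, Debug, Default)]
pub struct TextOptions {
    include_headers_footers: bool,
    include_watermarks: bool,
}

impl TextOptions {
    pub fn with_headers_footers(mut self) -> Self {
        self.include_headers_footers = true;
        self
    }

    pub fn with_watermarks(mut self) -> Self {
        self.include_watermarks = true;
        self
    }
}

/// Joins block texts in array order, separated by a blank line.
pub fn serialize_page_text(blocks: &[Block], opts: &TextOptions) -> String {
    blocks
        .iter()
        // Drop page furniture unless the caller opted in.
        .filter(|b| match b.kind {
            BlockKind::Header | BlockKind::Footer => opts.include_headers_footers,
            BlockKind::Watermark => opts.include_watermarks,
            _ => true,
        })
        // Figures contribute no text; every other kind uses its pre-computed text.
        .map(|b| match b.kind {
            BlockKind::Figure => "",
            _ => b.text.as_str(),
        })
        // Skip empty texts so no stray separators appear.
        .filter(|t| !t.is_empty())
        .collect::<Vec<&str>>()
        .join("\n\n")
}
```

Filtering empty texts before the join is what keeps a Figure or an empty Paragraph from producing a doubled `\n\n` in the output.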
## Files Changed
### New Files
- `crates/pdftract-core/src/text.rs` - New module with plain text serialization logic
### Modified Files
- `crates/pdftract-core/src/lib.rs` - Added `pub mod text;` and exported `serialize_page_text, TextOptions`
## Acceptance Criteria Status
### PASS
- ✅ 3 Paragraph blocks "Foo", "Bar", "Baz": "Foo\n\nBar\n\nBaz"
- ✅ 1 Heading + 2 Paragraphs: "Title\n\nP1\n\nP2"
- ✅ Header excluded: not in output (default behavior)
- ✅ List: lines joined with `\n` (pre-computed in `block.text`)
- ✅ Empty blocks emit nothing (no spurious \n\n)
- ✅ Footer excluded by default
- ✅ Header/Footer included when `with_headers_footers()` is set
- ✅ Watermark excluded by default
- ✅ Watermark included when `with_watermarks()` is set
- ✅ Figure emits empty string
- ✅ Code blocks preserve newlines
- ✅ Table blocks use pre-computed text
- ✅ Caption and Quote blocks work correctly
- ✅ TextOptions builder pattern works correctly
### WARN
- None
### FAIL
- None
## Test Results
All 22 tests in the `text` module pass:
```
text::tests::test_serialize_page_text_three_paragraphs - PASS
text::tests::test_serialize_page_text_heading_and_paragraphs - PASS
text::tests::test_serialize_page_text_header_excluded_by_default - PASS
text::tests::test_serialize_page_text_header_included_when_flagged - PASS
text::tests::test_serialize_page_text_footer_excluded_by_default - PASS
text::tests::test_serialize_page_text_list - PASS
text::tests::test_serialize_page_text_code - PASS
text::tests::test_serialize_page_text_figure_emits_empty - PASS
text::tests::test_serialize_page_text_empty_block_omitted - PASS
text::tests::test_serialize_page_text_watermark_excluded_by_default - PASS
text::tests::test_serialize_page_text_watermark_included_when_flagged - PASS
text::tests::test_serialize_page_text_caption - PASS
text::tests::test_serialize_page_text_quote - PASS
text::tests::test_serialize_page_text_table - PASS
text::tests::test_serialize_page_text_empty_blocks - PASS
text::tests::test_text_options_default - PASS
text::tests::test_text_options_builder_pattern - PASS
text::tests::test_is_header_or_footer - PASS
text::tests::test_is_watermark - PASS
text::tests::test_get_block_text_figure - PASS
text::tests::test_get_block_text_paragraph - PASS
text::tests::test_get_block_text_heading - PASS
```
## Compilation Status
- `cargo check --all-targets` - Passes
- `cargo clippy --all-targets -- -D warnings` - No issues in the `text` module (pre-existing errors remain elsewhere)
- `cargo fmt` - All formatted
## Notes
The implementation uses the pre-computed `block.text` field which already contains the joined text for the block. This aligns with the existing architecture where text computation happens earlier in the pipeline.
The `reading_order_rank` field mentioned in the plan is not yet present in the `BlockJson` structure; the function relies on the order of blocks in the array as the reading order (which is the current behavior).
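
If `reading_order_rank` is added later, the ordering step might look something like the sketch below. Everything here is hypothetical: `RankedBlock`, the `Option<u32>` type for the rank, and the helper `in_reading_order` are illustrative names, not part of the current codebase. The sketch falls back to array order whenever any block lacks a rank, which matches the current behavior.

```rust
/// Hypothetical block shape: `reading_order_rank` is NOT in `BlockJson` today.
#[derive(Clone, Debug, PartialEq)]
pub struct RankedBlock {
    pub text: String,
    pub reading_order_rank: Option<u32>,
}

/// Returns blocks in reading order: sorted by rank when every block has one,
/// otherwise left in array order (the current behavior).
pub fn in_reading_order(blocks: &[RankedBlock]) -> Vec<&RankedBlock> {
    let mut ordered: Vec<&RankedBlock> = blocks.iter().collect();
    if ordered.iter().all(|b| b.reading_order_rank.is_some()) {
        // `sort_by_key` is stable, so blocks with equal ranks keep array order.
        ordered.sort_by_key(|b| b.reading_order_rank);
    }
    ordered
}
```

A stable sort is used so that ties in rank never reorder blocks relative to the array, which keeps the output deterministic.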
The bead references plan lines 1747-1750 (Phase 4.6 Output Serialization). The implementation covers:
- Blocks serialized in reading order
- Paragraphs separated by `\n\n`
- Headers/footers excluded by default
- Watermark blocks excluded by default
- Invisible text filtering (only the structure is in place, ready for span-level filtering)
The next step is to integrate this function into the CLI's `--text` output mode, which currently dumps span texts one per line.