pdftract/crates
jedarden 2cdc44a6ce feat(pdftract-529te): implement per-page block serializer
Implement serialize_page_text() function that iterates blocks in
reading order, filters by block-kind (Header/Footer/Watermark),
joins block texts per kind-specific rules, and separates blocks
with \n\n.

- Add new text.rs module with TextOptions and serialize_page_text()
- Paragraph/Heading/Caption/Quote: use pre-computed block text
- List/Code: preserve newlines from pre-computed text
- Figure: emit empty string
- Empty blocks omitted (no spurious newlines)
- Headers/footers/watermarks excluded by default, configurable

Closes: pdftract-529te

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:21:07 -04:00
..
pdftract-cer-diff docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files 2026-05-23 10:36:28 -04:00
pdftract-cli feat(pdftract-5bzpg): implement pdftract-grep-1000 CI benchmark skeleton 2026-05-25 08:53:23 -04:00
pdftract-core feat(pdftract-529te): implement per-page block serializer 2026-05-25 12:21:07 -04:00
pdftract-libpdftract feat(pdftract-3s2i): implement Phase 5.5.2 validation filter 2026-05-24 04:57:17 -04:00
pdftract-py feat(pdftract-3j2u): implement 50 MB size limit + base64 encoding for attachments 2026-05-25 11:42:28 -04:00