Add --md-anchors flag that emits HTML comment markers before each block in Markdown output, allowing downstream tools to map excerpts back to precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc
Verification Note: pdftract-vk0gc (Markdown Anchors)
Summary
Implemented the --md-anchors flag, which emits positional HTML comment markers in Markdown output, together with a parser regex for reading them back.
Changes Made
1. Core Implementation (crates/pdftract-core/src/markdown.rs)
Created a new markdown module with:
- Anchor struct with page, block, bbox, kind fields
- parse_anchors() function with regex: r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
- block_to_markdown() - converts a single block to markdown with an optional anchor
- page_to_markdown() - converts all blocks from a page with optional anchors and page breaks
- Anchor::to_comment() - formats an anchor as an HTML comment with 1 decimal place precision
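The comment format described above can be sketched roughly as follows. This is a minimal illustration, not the code in markdown.rs: the field types and the fixed four-value bbox are assumptions (the regex itself accepts any comma-separated run of numbers), and only the 1 decimal place formatting and the one-line layout come from this note.

```rust
/// Hypothetical mirror of the `Anchor` struct; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Anchor {
    pub page: usize,
    pub block: usize,
    /// Four coordinates, written out in the order they are stored (assumed layout).
    pub bbox: [f64; 4],
    pub kind: String,
}

impl Anchor {
    /// Format the anchor as a single-line HTML comment,
    /// printing each bbox coordinate with one decimal place.
    pub fn to_comment(&self) -> String {
        let [a, b, c, d] = self.bbox;
        format!(
            "<!-- pdftract: page={} block={} bbox=[{:.1},{:.1},{:.1},{:.1}] kind={} -->",
            self.page, self.block, a, b, c, d, self.kind
        )
    }
}
```

Formatting with `{:.1}` keeps the output inside the character set the parser regex accepts for bbox (digits, dots, commas), provided the coordinates are non-negative and finite.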
2. Options (crates/pdftract-core/src/options.rs)
Added markdown_anchors: bool field to ExtractionOptions with default false.
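As a rough illustration of the opt-in default, the field could look like the sketch below. The struct here is a hypothetical, trimmed-down stand-in; the real ExtractionOptions presumably carries other fields that this note does not list.

```rust
/// Hypothetical stand-in for `ExtractionOptions`, reduced to the one
/// field this change adds. Deriving `Default` gives `false` for a bool.
#[derive(Debug, Clone, Default)]
pub struct ExtractionOptions {
    /// Emit an anchor comment before each Markdown block. Off by default.
    pub markdown_anchors: bool,
}

/// Convenience constructor used only in this sketch (not part of the real API).
pub fn options_with_anchors() -> ExtractionOptions {
    ExtractionOptions {
        markdown_anchors: true,
        ..Default::default()
    }
}
```

Keeping the default at false means existing callers see unchanged Markdown output unless they opt in.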
3. CLI Integration (crates/pdftract-cli/src/main.rs)
- Added --md-anchors flag to Extract command
- Passed flag through to ExtractionOptions
- Updated markdown output to use page_to_markdown() when anchors enabled
- Added import for page_to_markdown and block_to_markdown
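How the flag might gate the output can be sketched as below. The Block type and the function signature are assumptions made for illustration (the real signatures are not shown in this note), and page-break handling is left out. The sketch only demonstrates two properties listed under the acceptance criteria: one comment line before every block, and a 0-based block index within the page.

```rust
/// Hypothetical, simplified block type; the real one is not shown in this note.
pub struct Block {
    pub kind: String,
    pub bbox: [f64; 4],
    /// Already-rendered Markdown for the block (assumption for brevity).
    pub markdown: String,
}

/// Render one page. When `anchors` is true, each block is preceded by a
/// one-line anchor comment whose `block` value is the index within the page.
pub fn page_to_markdown(page: usize, blocks: &[Block], anchors: bool) -> String {
    let mut out = String::new();
    for (index, block) in blocks.iter().enumerate() {
        if anchors {
            let [a, b, c, d] = block.bbox;
            out.push_str(&format!(
                "<!-- pdftract: page={} block={} bbox=[{:.1},{:.1},{:.1},{:.1}] kind={} -->\n",
                page, index, a, b, c, d, block.kind
            ));
        }
        out.push_str(&block.markdown);
        out.push_str("\n\n");
    }
    out
}
```

With anchors set to false the function returns only the block text, so the anchor comments are purely additive.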
4. Documentation (docs/integrations/markdown-anchors.md)
Created comprehensive integration guide covering:
- Anchor format specification
- Regex schema
- CLI and Rust API usage
- Edge cases (code fences, empty blocks, per-page indexing)
- Integration examples for Python and JavaScript
Acceptance Criteria
PASS
- ✅ --md-anchors flag emits comment before every block
- ✅ Parser regex extracts page, block, bbox, kind from sample output
- ✅ Round-trip test: test_roundtrip_extract_and_parse passes
- ✅ Comment is ONE LINE (no embedded newline)
- ✅ bbox precision: 1 decimal place exact (verified in test_anchor_to_comment_round_bbox)
- ✅ kind matches block kind (heading, paragraph, etc.)
- ✅ Parser library parse_anchors() available
- ✅ Module exports: Anchor, parse_anchors, block_to_markdown, page_to_markdown
- ✅ 16 unit tests pass (including roundtrip, bbox parsing, multiple anchors)
- ✅ Regex is stable public API (documented in markdown-anchors.md)
- ✅ HTML comments are passthrough in major renderers (documented)
- ✅ Block index is per-page (0-based within page)
WARN (Infrastructure limitations)
- None
Testing
Unit Tests (16/16 pass)
- test_anchor_to_comment - basic comment formatting
- test_anchor_to_comment_round_bbox - 1 decimal place precision
- test_parse_anchors_single - parse single anchor
- test_parse_anchors_multiple - parse multiple anchors
- test_parse_anchors_invalid_format_skipped - invalid formats skipped
- test_parse_anchors_whitespace_tolerant - whitespace tolerance
- test_parse_bbox - bbox parsing with various formats
- test_block_to_markdown_heading_with_anchor - heading with anchor
- test_block_to_markdown_paragraph_without_anchor - paragraph without anchor
- test_block_to_markdown_list - list block
- test_block_to_markdown_table - table block
- test_block_to_markdown_figure - figure block
- test_page_to_markdown_with_page_break - page break separator
- test_page_to_markdown_without_page_break - no page break
- test_page_to_markdown_with_anchors - anchors enabled
- test_roundtrip_extract_and_parse - full roundtrip
Build Verification
- cargo build -p pdftract-core - ✅ Success
- cargo build -p pdftract-cli - ✅ Success
- cargo test -p pdftract-core --lib markdown - ✅ 16/16 tests pass
Example Output
With --md-anchors enabled:
<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Chapter 1
<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph.
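To show how a downstream tool might read this output back, here is a std-only sketch. It deliberately does not use the regex from the implementation: it is a hand-written, line-by-line approximation of the same shape, so it can differ from the real parse_anchors() at the edges (for example, it only recognises an anchor that sits alone on its line, and its number parsing is slightly more permissive). The struct layout is an assumption.

```rust
/// Hypothetical mirror of the `Anchor` struct; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Anchor {
    pub page: usize,
    pub block: usize,
    pub bbox: Vec<f64>,
    pub kind: String,
}

/// Parse one line; returns None unless the whole line is a well-formed anchor.
fn parse_anchor_line(line: &str) -> Option<Anchor> {
    let inner = line.trim().strip_prefix("<!--")?.strip_suffix("-->")?;
    let rest = inner.trim().strip_prefix("pdftract:")?;
    let mut fields = rest.split_whitespace();

    let page = fields.next()?.strip_prefix("page=")?.parse::<usize>().ok()?;
    let block = fields.next()?.strip_prefix("block=")?.parse::<usize>().ok()?;

    // bbox=[a,b,c,...] with no whitespace inside the brackets.
    let bbox_text = fields.next()?.strip_prefix("bbox=[")?.strip_suffix(']')?;
    let bbox = bbox_text
        .split(',')
        .map(|v| v.parse::<f64>().ok())
        .collect::<Option<Vec<f64>>>()?;

    let kind = fields.next()?.strip_prefix("kind=")?;
    let kind_ok = !kind.is_empty() && kind.chars().all(|c| c.is_alphanumeric() || c == '_');
    if !kind_ok || fields.next().is_some() {
        return None;
    }
    Some(Anchor { page, block, bbox, kind: kind.to_string() })
}

/// Collect every anchor in a Markdown string, skipping lines that do not match.
pub fn parse_anchors(markdown: &str) -> Vec<Anchor> {
    markdown.lines().filter_map(parse_anchor_line).collect()
}
```

Run against the four lines above, this sketch yields two anchors (a heading and a paragraph, both on page 0) and ignores the content lines. A tool that needs exact compatibility should use the documented regex instead.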
Files Modified
- crates/pdftract-core/src/markdown.rs (new)
- crates/pdftract-core/src/lib.rs (module export)
- crates/pdftract-core/src/options.rs (markdown_anchors field)
- crates/pdftract-core/Cargo.toml (regex dependency already present)
- crates/pdftract-cli/src/main.rs (CLI flag and output logic)
- docs/integrations/markdown-anchors.md (new documentation)
References
- Plan section: Phase 6.5 positional anchors (lines 2183-2197)
- Bead: pdftract-vk0gc