jedarden 28c31ba0a1 feat(pdftract-vk0gc): implement markdown anchors with parser regex

Add --md-anchors flag that emits HTML comment markers before each block
in Markdown output, allowing downstream tools to map excerpts back to
precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=\[([\d.,]+)\] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc

2026-05-24 02:49:16 -04:00


Verification Note: pdftract-vk0gc (Markdown Anchors)

Summary

Implemented the --md-anchors flag, which emits a positional HTML comment marker before each block in Markdown output, together with a parser regex that reads those markers back.

Changes Made

1. Core Implementation (crates/pdftract-core/src/markdown.rs)

Created new markdown module with:

Anchor struct with page, block, bbox, kind fields
parse_anchors() function with regex: r"<!-- pdftract: page=(\d+) block=(\d+) bbox=\[([\d.,]+)\] kind=(\w+) -->"
block_to_markdown() - converts single block to markdown with optional anchor
page_to_markdown() - converts all blocks from a page with optional anchors and page breaks
Anchor::to_comment() - formats anchor as HTML comment with 1 decimal place precision
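
As an illustration of the comment format described above, here is a minimal sketch of an anchor type and its formatter. The field types, the bbox ordering, and the exact struct layout are assumptions made for the example; the real definitions live in crates/pdftract-core/src/markdown.rs and may differ.

```rust
/// Hypothetical stand-in for the Anchor struct; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Anchor {
    pub page: usize,
    pub block: usize,
    /// Assumed order: [x0, y0, x1, y1] in PDF points.
    pub bbox: [f64; 4],
    pub kind: String,
}

impl Anchor {
    /// Formats the anchor as a single-line HTML comment,
    /// printing each bbox value with one decimal place.
    pub fn to_comment(&self) -> String {
        let [x0, y0, x1, y1] = self.bbox;
        format!(
            "<!-- pdftract: page={} block={} bbox=[{:.1},{:.1},{:.1},{:.1}] kind={} -->",
            self.page, self.block, x0, y0, x1, y1, self.kind
        )
    }
}
```

The `{:.1}` format specifier is what fixes the precision at one decimal place, so a coordinate such as 640.46 would be written as 640.5.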

2. Options (crates/pdftract-core/src/options.rs)

Added markdown_anchors: bool field to ExtractionOptions with default false.
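
A minimal sketch of how an opt-in field like this is commonly wired up in Rust follows. Only markdown_anchors comes from this note; the struct here is a reduced, hypothetical version and omits whatever other fields the real ExtractionOptions has.

```rust
/// Reduced, hypothetical version of ExtractionOptions.
/// The real struct has other fields that are omitted here.
#[derive(Debug, Clone)]
pub struct ExtractionOptions {
    /// When true, an anchor comment is emitted before each Markdown block.
    pub markdown_anchors: bool,
}

impl Default for ExtractionOptions {
    fn default() -> Self {
        // Anchors are opt-in, so existing output is unchanged by default.
        Self { markdown_anchors: false }
    }
}

/// Builds options with anchors switched on and everything else defaulted.
pub fn options_with_anchors() -> ExtractionOptions {
    ExtractionOptions {
        markdown_anchors: true,
        ..ExtractionOptions::default()
    }
}
```

Keeping the default at false means callers that never mention the field get the same Markdown as before.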

3. CLI Integration (crates/pdftract-cli/src/main.rs)

Added --md-anchors flag to Extract command
Passed flag through to ExtractionOptions
Updated markdown output to use page_to_markdown() when anchors are enabled
Added imports for page_to_markdown and block_to_markdown

4. Documentation (docs/integrations/markdown-anchors.md)

Created comprehensive integration guide covering:

Anchor format specification
Regex schema
CLI and Rust API usage
Edge cases (code fences, empty blocks, per-page indexing)
Integration examples for Python and JavaScript

Acceptance Criteria

PASS

✅ --md-anchors flag emits comment before every block
✅ Parser regex extracts page, block, bbox, kind from sample output
✅ Round-trip test: test_roundtrip_extract_and_parse passes
✅ Comment is ONE LINE (no embedded newline)
✅ bbox precision: exactly 1 decimal place (verified in test_anchor_to_comment_round_bbox)
✅ kind matches block kind (heading, paragraph, etc.)
✅ Parser library parse_anchors() available
✅ Module exports: Anchor, parse_anchors, block_to_markdown, page_to_markdown
✅ 16 unit tests pass (including roundtrip, bbox parsing, multiple anchors)
✅ Regex is stable public API (documented in markdown-anchors.md)
✅ HTML comments are passthrough in major renderers (documented)
✅ Block index is per-page (0-based within page)
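
To make the per-block and per-page criteria concrete, here is a rough sketch of a page renderer that restarts the block index at 0 for every page and writes each anchor on a single line. The Block type, its fields, and the function name page_to_markdown_sketch are hypothetical; the actual page_to_markdown() signature is not shown in this note, and page-break handling is left out.

```rust
/// Hypothetical block type: a kind, a bounding box, and rendered Markdown.
pub struct Block {
    pub kind: String,
    pub bbox: [f64; 4],
    pub markdown: String,
}

/// Renders one page. The block index comes from enumerate(), so it is
/// 0-based and restarts for every page passed in.
pub fn page_to_markdown_sketch(page: usize, blocks: &[Block], anchors: bool) -> String {
    let mut out = String::new();
    for (index, block) in blocks.iter().enumerate() {
        if anchors {
            let [x0, y0, x1, y1] = block.bbox;
            // One comment line per block, placed directly before it.
            out.push_str(&format!(
                "<!-- pdftract: page={} block={} bbox=[{:.1},{:.1},{:.1},{:.1}] kind={} -->\n",
                page, index, x0, y0, x1, y1, block.kind
            ));
        }
        out.push_str(&block.markdown);
        // Blank line between blocks.
        out.push_str("\n\n");
    }
    out
}
```

Because the index is derived from the position within the slice for a single page, a (page, block) pair is needed to identify a block uniquely across the document.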

WARN (Infrastructure limitations)

None

Testing

Unit Tests (16/16 pass)

test_anchor_to_comment - basic comment formatting
test_anchor_to_comment_round_bbox - 1 decimal place precision
test_parse_anchors_single - parse single anchor
test_parse_anchors_multiple - parse multiple anchors
test_parse_anchors_invalid_format_skipped - invalid formats skipped
test_parse_anchors_whitespace_tolerant - whitespace tolerance
test_parse_bbox - bbox parsing with various formats
test_block_to_markdown_heading_with_anchor - heading with anchor
test_block_to_markdown_paragraph_without_anchor - paragraph without anchor
test_block_to_markdown_list - list block
test_block_to_markdown_table - table block
test_block_to_markdown_figure - figure block
test_page_to_markdown_with_page_break - page break separator
test_page_to_markdown_without_page_break - no page break
test_page_to_markdown_with_anchors - anchors enabled
test_roundtrip_extract_and_parse - full roundtrip

Build Verification

cargo build -p pdftract-core - ✅ Success
cargo build -p pdftract-cli - ✅ Success
cargo test -p pdftract-core --lib markdown - ✅ 16/16 tests pass

Example Output

With --md-anchors enabled:

<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Chapter 1

<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph.
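
For downstream tools that cannot pull in a regex dependency, output like the above can also be read back with a small hand-rolled parser. The sketch below uses only the standard library; it is not the crate's regex-based parse_anchors(), and the names parse_anchor_line and parse_anchors_sketch are hypothetical. It assumes the bbox contains exactly four comma-separated numbers with no spaces, and it skips any line that does not match.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Anchor {
    pub page: usize,
    pub block: usize,
    pub bbox: [f64; 4],
    pub kind: String,
}

/// Parses one line; returns None for anything that is not a well-formed anchor.
pub fn parse_anchor_line(line: &str) -> Option<Anchor> {
    let body = line
        .trim()
        .strip_prefix("<!-- pdftract:")?
        .strip_suffix("-->")?;

    let (mut page, mut block, mut bbox, mut kind) = (None, None, None, None);
    for field in body.split_whitespace() {
        let (key, value) = field.split_once('=')?;
        match key {
            "page" => page = value.parse::<usize>().ok(),
            "block" => block = value.parse::<usize>().ok(),
            "bbox" => {
                let inner = value.strip_prefix('[')?.strip_suffix(']')?;
                let nums: Vec<f64> = inner
                    .split(',')
                    .map(|n| n.parse::<f64>())
                    .collect::<Result<_, _>>()
                    .ok()?;
                if nums.len() != 4 {
                    return None;
                }
                bbox = Some([nums[0], nums[1], nums[2], nums[3]]);
            }
            "kind" => kind = Some(value.to_string()),
            // Unknown keys make the line invalid in this sketch.
            _ => return None,
        }
    }
    Some(Anchor { page: page?, block: block?, bbox: bbox?, kind: kind? })
}

/// Collects every valid anchor in a Markdown document, skipping other lines.
pub fn parse_anchors_sketch(markdown: &str) -> Vec<Anchor> {
    markdown.lines().filter_map(parse_anchor_line).collect()
}
```

Applied to the two-block example above, this would yield two anchors, the first with page 0, block 0, and kind heading.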

Files Modified

crates/pdftract-core/src/markdown.rs (new)
crates/pdftract-core/src/lib.rs (module export)
crates/pdftract-core/src/options.rs (markdown_anchors field)
crates/pdftract-core/Cargo.toml (regex dependency already present)
crates/pdftract-cli/src/main.rs (CLI flag and output logic)
docs/integrations/markdown-anchors.md (new documentation)

References

Plan section: Phase 6.5 positional anchors (lines 2183-2197)
Bead: pdftract-vk0gc
