pdftract/notes/pdftract-vk0gc.md
jedarden 28c31ba0a1 feat(pdftract-vk0gc): implement markdown anchors with parser regex
Add --md-anchors flag that emits HTML comment markers before each block
in Markdown output, allowing downstream tools to map excerpts back to
precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc
2026-05-24 02:49:16 -04:00


Verification Note: pdftract-vk0gc (Markdown Anchors)

Summary

Implemented the --md-anchors flag, which emits a positional HTML comment marker before each block in Markdown output, together with a parser regex for reading the markers back.

Changes Made

1. Core Implementation (crates/pdftract-core/src/markdown.rs)

Created a new markdown module with:

  • Anchor struct with page, block, bbox, kind fields
  • parse_anchors() function with regex: r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
  • block_to_markdown() - converts a single block to Markdown, with an optional anchor
  • page_to_markdown() - converts all blocks on a page, with optional anchors and page breaks
  • Anchor::to_comment() - formats an anchor as a single-line HTML comment, with bbox values rounded to 1 decimal place
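
For downstream tools, the parsing side described above can be sketched in Python. This is only an illustration, not the Rust implementation: the pattern is copied from the list above (Python's re module accepts it unchanged), while the Anchor dataclass, the requirement of exactly four bbox numbers, and the choice to skip malformed bboxes are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass

# Pattern copied from the note; Python's `re` accepts the same syntax.
ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)"
    r"\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)


@dataclass
class Anchor:
    """Illustrative mirror of the four fields named in the note."""
    page: int
    block: int
    bbox: list[float]
    kind: str


def parse_anchors(markdown: str) -> list[Anchor]:
    """Return every well-formed anchor found in `markdown`, in document order."""
    anchors = []
    for match in ANCHOR_RE.finditer(markdown):
        page, block, bbox, kind = match.groups()
        try:
            coords = [float(part) for part in bbox.split(",")]
        except ValueError:
            # e.g. "1.0,,2.0": the regex accepts it, float() does not.
            continue
        if len(coords) != 4:
            # Assumption: a bbox always carries exactly four numbers.
            continue
        anchors.append(Anchor(int(page), int(block), coords, kind))
    return anchors
```

Because the character class `[\d.,]+` is deliberately loose, the numeric validation has to happen after the match. That is why the sketch converts and counts the coordinates itself.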

2. Options (crates/pdftract-core/src/options.rs)

Added a markdown_anchors: bool field to ExtractionOptions; it defaults to false.

3. CLI Integration (crates/pdftract-cli/src/main.rs)

  • Added --md-anchors flag to Extract command
  • Passed flag through to ExtractionOptions
  • Updated Markdown output to use page_to_markdown() when anchors are enabled
  • Added imports for page_to_markdown and block_to_markdown

4. Documentation (docs/integrations/markdown-anchors.md)

Created a comprehensive integration guide covering:

  • Anchor format specification
  • Regex schema
  • CLI and Rust API usage
  • Edge cases (code fences, empty blocks, per-page indexing)
  • Integration examples for Python and JavaScript

Acceptance Criteria

PASS

  • --md-anchors flag emits a comment before every block
  • Parser regex extracts page, block, bbox, kind from sample output
  • Round-trip test: test_roundtrip_extract_and_parse passes
  • Comment is ONE LINE (no embedded newline)
  • bbox precision: exactly 1 decimal place (verified in test_anchor_to_comment_round_bbox)
  • kind matches block kind (heading, paragraph, etc.)
  • Parser function parse_anchors() is available
  • Module exports: Anchor, parse_anchors, block_to_markdown, page_to_markdown
  • 16 unit tests pass (including roundtrip, bbox parsing, multiple anchors)
  • Regex is stable public API (documented in markdown-anchors.md)
  • HTML comments pass through major renderers unchanged (documented)
  • Block index is per-page (0-based within page)
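
The formatting criteria above (one line, exactly 1 decimal place, parseable by the documented regex) can be illustrated with a small Python round trip. The to_comment helper here is a hypothetical stand-in for Anchor::to_comment(), written only from the format shown in this note. The actual Rust rounding behaviour is not confirmed here, so the sample values avoid exact ties.

```python
import re

# Pattern copied from the note.
ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)"
    r"\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)


def to_comment(page, block, bbox, kind):
    """Hypothetical stand-in for Anchor::to_comment(): one line, 1 decimal place."""
    coords = ",".join(f"{value:.1f}" for value in bbox)
    return f"<!-- pdftract: page={page} block={block} bbox=[{coords}] kind={kind} -->"


comment = to_comment(0, 1, (72.04, 600.0, 540.0, 629.96), "paragraph")
print(comment)
# <!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->

# Round trip: the formatted comment must match the parser pattern in full.
match = ANCHOR_RE.fullmatch(comment)
print(match.groups())
# ('0', '1', '72.0,600.0,540.0,630.0', 'paragraph')
```

Since the block index restarts on every page, a consumer that stores parsed anchors would typically key them by the (page, block) pair rather than by block alone.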

WARN (Infrastructure limitations)

  • None

Testing

Unit Tests (16/16 pass)

  • test_anchor_to_comment - basic comment formatting
  • test_anchor_to_comment_round_bbox - 1 decimal place precision
  • test_parse_anchors_single - parse single anchor
  • test_parse_anchors_multiple - parse multiple anchors
  • test_parse_anchors_invalid_format_skipped - invalid formats skipped
  • test_parse_anchors_whitespace_tolerant - whitespace tolerance
  • test_parse_bbox - bbox parsing with various formats
  • test_block_to_markdown_heading_with_anchor - heading with anchor
  • test_block_to_markdown_paragraph_without_anchor - paragraph without anchor
  • test_block_to_markdown_list - list block
  • test_block_to_markdown_table - table block
  • test_block_to_markdown_figure - figure block
  • test_page_to_markdown_with_page_break - page break separator
  • test_page_to_markdown_without_page_break - no page break
  • test_page_to_markdown_with_anchors - anchors enabled
  • test_roundtrip_extract_and_parse - full roundtrip
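
The whitespace-tolerance and invalid-format cases can be reproduced against the documented pattern alone. The sketch below uses Python's re module with made-up sample strings. It does not reproduce the Rust tests, whose fixtures are not shown in this note.

```python
import re

# Pattern copied from the note.
ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)"
    r"\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)

# Illustrative inputs (not taken from the project's test suite).
samples = {
    "canonical": "<!-- pdftract: page=2 block=5 bbox=[10.0,20.0,30.0,40.0] kind=table -->",
    # \s* and \s+ allow missing padding, repeated spaces, and tabs.
    "odd_spacing": "<!--pdftract:page=2   block=5\tbbox=[10.0,20.0,30.0,40.0]  kind=table-->",
    # Every field is mandatory, so a missing kind= is not an anchor.
    "missing_kind": "<!-- pdftract: page=2 block=5 bbox=[10.0,20.0,30.0,40.0] -->",
    # [\d.,]+ has no minus sign, so negative coordinates do not match.
    "negative_coord": "<!-- pdftract: page=2 block=5 bbox=[-10.0,20.0,30.0,40.0] kind=table -->",
    # Ordinary HTML comments are ignored.
    "plain_comment": "<!-- TODO: fix heading -->",
}

matched = {name: ANCHOR_RE.search(text) is not None for name, text in samples.items()}
for name, ok in matched.items():
    print(f"{name}: {'match' if ok else 'skipped'}")
```

Only the first two samples match. The negative-coordinate case is a property of the pattern as written: a consumer should not expect anchors with a minus sign in the bbox to parse.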

Build Verification

  • cargo build -p pdftract-core - Success
  • cargo build -p pdftract-cli - Success
  • cargo test -p pdftract-core --lib markdown - 16/16 tests pass

Example Output

With --md-anchors enabled:

<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Chapter 1

<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph.
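
As a sketch of the intended use, the output above can be consumed in Python to map a text excerpt back to its PDF location. The helpers blocks_with_anchors and locate are hypothetical names, and the sketch assumes that each block's text runs from its anchor to the next anchor. On multi-page output, that span would also include any page-break separator.

```python
import re

# Pattern copied from the note.
ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)"
    r"\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)

# The example output shown above.
SAMPLE = """\
<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Chapter 1

<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph.
"""


def blocks_with_anchors(markdown):
    """Pair each anchor with the text between it and the next anchor."""
    matches = list(ANCHOR_RE.finditer(markdown))
    blocks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        page, block, bbox, kind = match.groups()
        blocks.append({
            "page": int(page),
            "block": int(block),
            "bbox": [float(value) for value in bbox.split(",")],
            "kind": kind,
            "text": markdown[match.end():end].strip(),
        })
    return blocks


def locate(markdown, excerpt):
    """Return the first anchored block containing `excerpt`, or None."""
    for entry in blocks_with_anchors(markdown):
        if excerpt in entry["text"]:
            return entry
    return None


hit = locate(SAMPLE, "first paragraph")
print(hit["page"], hit["block"], hit["bbox"])
# 0 1 [72.0, 600.0, 540.0, 630.0]
```

A plain substring search is the simplest possible matching strategy. An excerpt that spans two blocks would need a different approach, which is outside the scope of this note.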

Files Modified

  • crates/pdftract-core/src/markdown.rs (new)
  • crates/pdftract-core/src/lib.rs (module export)
  • crates/pdftract-core/src/options.rs (markdown_anchors field)
  • crates/pdftract-core/Cargo.toml (regex dependency already present)
  • crates/pdftract-cli/src/main.rs (CLI flag and output logic)
  • docs/integrations/markdown-anchors.md (new documentation)

References

  • Plan section: Phase 6.5 positional anchors (lines 2183-2197)
  • Bead: pdftract-vk0gc