Add --md-anchors flag that emits HTML comment markers before each block in Markdown output, allowing downstream tools to map excerpts back to precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=\[([\d.,]+)\] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc

# Verification Note: pdftract-vk0gc (Markdown Anchors)

## Summary

Implemented the `--md-anchors` flag, which emits positional HTML comment markers in Markdown output, together with a parser regex for reading them back.

## Changes Made

### 1. Core Implementation (crates/pdftract-core/src/markdown.rs)

Created new markdown module with:
- `Anchor` struct with `page`, `block`, `bbox`, `kind` fields
- `parse_anchors()` function with regex: `r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"`
- `block_to_markdown()` - converts single block to markdown with optional anchor
- `page_to_markdown()` - converts all blocks from a page with optional anchors and page breaks
- `Anchor::to_comment()` - formats anchor as HTML comment with 1 decimal place precision

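
For downstream consumers, the parsing side can be sketched in Python, since the regex above is accepted unchanged by Python's `re` module. This is a minimal sketch and not the Rust `parse_anchors()` from the crate: the `Anchor` dataclass here only mirrors the field names listed above, and the four-number bbox check is an assumption based on the example output.

```python
import re
from dataclasses import dataclass

# Pattern copied from the note; Python's `re` accepts it as written.
ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)


@dataclass
class Anchor:
    # Field names mirror the Rust struct described above (illustrative only).
    page: int
    block: int
    bbox: tuple  # floats, in the order they appear in the comment
    kind: str


def parse_anchors(markdown: str) -> list:
    """Return every well-formed anchor in document order; skip the rest."""
    anchors = []
    for m in ANCHOR_RE.finditer(markdown):
        try:
            bbox = tuple(float(v) for v in m.group(3).split(","))
        except ValueError:
            continue  # e.g. "1..2" matches [\d.,]+ but is not a number
        if len(bbox) != 4:
            continue  # assumption: a bbox always carries exactly four numbers
        anchors.append(Anchor(int(m.group(1)), int(m.group(2)), bbox, m.group(4)))
    return anchors


sample = (
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->\n"
    "# Chapter 1\n"
    "\n"
    "<!--pdftract:page=0   block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph-->\n"
    "This is the first paragraph.\n"
    "<!-- pdftract: page=x block=2 bbox=[1,2,3,4] kind=paragraph -->\n"
)
anchors = parse_anchors(sample)
```

The second sample line exercises the whitespace tolerance built into the pattern, and the third is skipped because `page=x` does not match `\d+`.
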
### 2. Options (crates/pdftract-core/src/options.rs)

Added `markdown_anchors: bool` field to `ExtractionOptions` with default `false`.

### 3. CLI Integration (crates/pdftract-cli/src/main.rs)

- Added `--md-anchors` flag to Extract command
- Passed flag through to ExtractionOptions
- Updated markdown output to use `page_to_markdown()` when anchors enabled
- Added import for `page_to_markdown` and `block_to_markdown`

### 4. Documentation (docs/integrations/markdown-anchors.md)

Created comprehensive integration guide covering:
- Anchor format specification
- Regex schema
- CLI and Rust API usage
- Edge cases (code fences, empty blocks, per-page indexing)
- Integration examples for Python and JavaScript

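
The per-page indexing edge case matters to consumers because `block` values repeat across pages. A small Python sketch of the consequence, using made-up `(page, block, kind)` triples rather than real parser output:

```python
from collections import defaultdict

# Illustrative sample data only: (page, block, kind) as a parser might return it.
parsed = [
    (0, 0, "heading"),
    (0, 1, "paragraph"),
    (1, 0, "paragraph"),  # block index restarts at 0 on the next page
    (1, 1, "table"),
]

# Key lookups by the (page, block) pair; block alone would collide across pages.
by_location = {(page, block): kind for page, block, kind in parsed}

# Group block indices per page, preserving document order.
blocks_per_page = defaultdict(list)
for page, block, _kind in parsed:
    blocks_per_page[page].append(block)

# Only two distinct block numbers exist even though there are four blocks.
distinct_block_numbers = {block for _page, block, _kind in parsed}
```

Any cache, index, or database key built from anchors should therefore combine both fields.
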
## Acceptance Criteria

### PASS

- ✅ `--md-anchors` flag emits comment before every block
- ✅ Parser regex extracts page, block, bbox, kind from sample output
- ✅ Round-trip test: `test_roundtrip_extract_and_parse` passes
- ✅ Comment is ONE LINE (no embedded newline)
- ✅ bbox precision: 1 decimal place exact (verified in `test_anchor_to_comment_round_bbox`)
- ✅ kind matches block kind (heading, paragraph, etc.)
- ✅ Parser library `parse_anchors()` available
- ✅ Module exports: `Anchor`, `parse_anchors`, `block_to_markdown`, `page_to_markdown`
- ✅ 16 unit tests pass (including roundtrip, bbox parsing, multiple anchors)
- ✅ Regex is stable public API (documented in markdown-anchors.md)
- ✅ HTML comments are passthrough in major renderers (documented)
- ✅ Block index is per-page (0-based within page)

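
The one-line and 1-decimal criteria can be illustrated from the consumer side with a short Python sketch. The `to_comment` function below is a hypothetical stand-in for `Anchor::to_comment()`, written only from the format shown in this note; the Rust implementation may round tie values differently.

```python
import re

# Pattern copied from the note.
ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)


def to_comment(page, block, bbox, kind):
    """Format one anchor as a single-line HTML comment, bbox to 1 decimal place."""
    coords = ",".join(f"{v:.1f}" for v in bbox)
    return f"<!-- pdftract: page={page} block={block} bbox=[{coords}] kind={kind} -->"


# 640.53 is shortened to 640.5 by the 1-decimal formatting.
comment = to_comment(0, 0, (72.0, 640.53, 540.0, 672.0), "heading")

# Round trip: the formatted comment must match the published regex in full.
m = ANCHOR_RE.fullmatch(comment)
```
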
### WARN (Infrastructure limitations)

- None

## Testing

### Unit Tests (16/16 pass)

- `test_anchor_to_comment` - basic comment formatting
- `test_anchor_to_comment_round_bbox` - 1 decimal place precision
- `test_parse_anchors_single` - parse single anchor
- `test_parse_anchors_multiple` - parse multiple anchors
- `test_parse_anchors_invalid_format_skipped` - invalid formats skipped
- `test_parse_anchors_whitespace_tolerant` - whitespace tolerance
- `test_parse_bbox` - bbox parsing with various formats
- `test_block_to_markdown_heading_with_anchor` - heading with anchor
- `test_block_to_markdown_paragraph_without_anchor` - paragraph without anchor
- `test_block_to_markdown_list` - list block
- `test_block_to_markdown_table` - table block
- `test_block_to_markdown_figure` - figure block
- `test_page_to_markdown_with_page_break` - page break separator
- `test_page_to_markdown_without_page_break` - no page break
- `test_page_to_markdown_with_anchors` - anchors enabled
- `test_roundtrip_extract_and_parse` - full roundtrip

### Build Verification

- `cargo build -p pdftract-core` - ✅ Success
- `cargo build -p pdftract-cli` - ✅ Success
- `cargo test -p pdftract-core --lib markdown` - ✅ 16/16 tests pass

## Example Output

With `--md-anchors` enabled:

```markdown
<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Chapter 1

<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph.
```

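
To show how a downstream tool might map an excerpt back to a PDF location, here is a hedged Python sketch that pairs each anchor with the text that follows it and looks up a quoted phrase. The helper names `blocks_with_anchors` and `locate` are hypothetical, and the sketch assumes every bbox is well-formed.

```python
import re

# Pattern copied from the note.
ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)


def blocks_with_anchors(markdown):
    """Pair each anchor with the text between it and the next anchor."""
    matches = list(ANCHOR_RE.finditer(markdown))
    pairs = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        text = markdown[m.end():end].strip()
        # Assumes a well-formed bbox; float() raises ValueError otherwise.
        bbox = [float(v) for v in m.group(3).split(",")]
        pairs.append(((int(m.group(1)), int(m.group(2)), bbox, m.group(4)), text))
    return pairs


def locate(markdown, excerpt):
    """Return (page, block, bbox, kind) of the first block containing excerpt."""
    for anchor, text in blocks_with_anchors(markdown):
        if excerpt in text:
            return anchor
    return None


doc = (
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->\n"
    "# Chapter 1\n"
    "\n"
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->\n"
    "This is the first paragraph.\n"
)
hit = locate(doc, "first paragraph")
```

A plain substring search is the simplest possible matcher; a real integration would likely normalize whitespace or use fuzzy matching before the lookup.
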
## Files Modified

- `crates/pdftract-core/src/markdown.rs` (new)
- `crates/pdftract-core/src/lib.rs` (module export)
- `crates/pdftract-core/src/options.rs` (markdown_anchors field)
- `crates/pdftract-core/Cargo.toml` (regex dependency already present)
- `crates/pdftract-cli/src/main.rs` (CLI flag and output logic)
- `docs/integrations/markdown-anchors.md` (new documentation)

## References

- Plan section: Phase 6.5 positional anchors (lines 2183-2197)
- Bead: pdftract-vk0gc