pdftract/docs/integrations/markdown-anchors.md
jedarden 28c31ba0a1 feat(pdftract-vk0gc): implement markdown anchors with parser regex
Add --md-anchors flag that emits HTML comment markers before each block
in Markdown output, allowing downstream tools to map excerpts back to
precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=\[([\d.,]+)\] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc
2026-05-24 02:49:16 -04:00


# Markdown Anchors Integration Guide
This document describes the positional HTML comment anchors feature in pdftract's Markdown output.
## Overview
When `--md-anchors` is enabled, each block in the Markdown output is preceded by a single-line HTML comment containing positional metadata. This allows downstream tools (LLM agents, audit tools, document Q&A systems) to map a Markdown excerpt back to a precise PDF location.
## Anchor Format
Each anchor is a single-line HTML comment:
```markdown
<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
## Chapter 3
```
### Fields
- `page`: Zero-based page index (0, 1, 2, ...)
- `block`: Zero-based block index within the page (0, 1, 2, ...)
- `bbox`: Bounding box in PDF points `[x0, y0, x1, y1]` with 1 decimal place precision
- `kind`: Block kind (`heading`, `paragraph`, `list`, `table`, `figure`, etc.)
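As an illustration of how these fields might be consumed, the sketch below models an anchor as a small Python dataclass and derives the block's size from `bbox`. The `Anchor` class and its `width`, `height`, and `size_inches` helpers are hypothetical names for this example, not part of pdftract's API. The only outside fact it relies on is that one PDF point is 1/72 inch.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


@dataclass(frozen=True)
class Anchor:
    """Hypothetical container for the four anchor fields."""
    page: int    # zero-based page index
    block: int   # zero-based block index within the page
    bbox: tuple  # (x0, y0, x1, y1) in PDF points
    kind: str    # e.g. "heading", "paragraph"

    @property
    def width(self) -> float:
        x0, _, x1, _ = self.bbox
        return abs(x1 - x0)

    @property
    def height(self) -> float:
        _, y0, _, y1 = self.bbox
        return abs(y1 - y0)

    def size_inches(self) -> tuple:
        """Return (width, height) converted from points to inches."""
        return (self.width / POINTS_PER_INCH, self.height / POINTS_PER_INCH)


anchor = Anchor(page=3, block=12, bbox=(72.0, 640.5, 540.0, 672.0), kind="heading")
print(anchor.width, anchor.height)  # 468.0 31.5
print(anchor.size_inches())         # (6.5, 0.4375)
```

The helpers use `abs()` so they do not depend on which corner of the box is listed first.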
### Regex Schema
The anchor format is parseable with this stable regex:
```regex
<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->
```
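The `bbox` capture group `[\d.,]+` accepts any run of digits, dots, and commas, so a match alone does not guarantee four well-formed numbers. A minimal sketch of post-match validation follows. `parse_anchor_line` is a hypothetical helper, and returning `None` for malformed input is a choice made for this example rather than documented pdftract behavior.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)


def parse_anchor_line(line):
    """Return (page, block, bbox, kind), or None if the line is not a valid anchor."""
    match = ANCHOR_RE.search(line)
    if match is None:
        return None
    try:
        bbox = tuple(float(part) for part in match.group(3).split(','))
    except ValueError:
        return None  # e.g. "1..2" or an empty component such as "1,,2"
    if len(bbox) != 4:
        return None  # a bounding box needs exactly x0, y0, x1, y1
    return int(match.group(1)), int(match.group(2)), bbox, match.group(4)


good = '<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->'
bad = '<!-- pdftract: page=3 block=12 bbox=[72.0,640.5] kind=heading -->'
print(parse_anchor_line(good))  # (3, 12, (72.0, 640.5, 540.0, 672.0), 'heading')
print(parse_anchor_line(bad))   # None
```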
## Usage
### CLI
```bash
# Enable anchors in markdown output
pdftract extract input.pdf --format markdown --md-anchors > output.md
```
### Rust API
```rust
use pdftract_core::markdown::{parse_anchors, Anchor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse anchors from Markdown text
    let md = std::fs::read_to_string("output.md")?;
    let anchors = parse_anchors(&md);
    for anchor in anchors {
        println!("Page {} Block {} at {:?}", anchor.page, anchor.block, anchor.bbox);
    }
    Ok(())
}
```
## Properties
### Stability
The anchor format is a **stable public API**. The regex schema will not change in a breaking way across minor versions. New fields may be added, but existing fields will remain compatible.
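This guide does not say where new fields would appear or how they would be written. Assuming they follow the existing whitespace-separated `key=value` form, a consumer that wants to tolerate them could read the comment as key/value pairs rather than by position. The sketch below works under that assumption. `parse_fields` and the `confidence` field in the demo are hypothetical.

```python
import re

# Capture everything between the "pdftract:" prefix and the comment terminator.
COMMENT_RE = re.compile(r'<!--\s*pdftract:\s*(.*?)\s*-->')
# One key=value pair; values are assumed to contain no whitespace.
FIELD_RE = re.compile(r'(\w+)=(\S+)')


def parse_fields(line):
    """Return a dict of key -> raw string value, or None if not a pdftract comment."""
    match = COMMENT_RE.search(line)
    if match is None:
        return None
    return dict(FIELD_RE.findall(match.group(1)))


# "confidence" is an invented field, used only to show that unknown keys are kept.
line = ('<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] '
        'kind=heading confidence=0.9 -->')
fields = parse_fields(line)
print(fields['page'], fields['kind'])  # 3 heading
print(sorted(fields))  # ['bbox', 'block', 'confidence', 'kind', 'page']
```

Callers read the known keys and ignore the rest, so an unrecognized field does not cause a parse failure.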
### Passthrough
HTML comments pass through in every major Markdown renderer:
- GitHub
- GitLab
- Obsidian
- Notion import
- pulldown-cmark
- marked
- markdown-it
Anchored output remains human-readable while machines can recover positional metadata.
### Round-trip
The output round-trips: extracting with `--md-anchors` and then parsing the anchors recovers the original block list, modulo inline styling, which is lossy in Markdown.
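One way to recover a block list on the consumer side is to pair each anchor with the text that follows it, up to the next anchor. The sketch below assumes that everything between two anchors belongs to the first one, which matches the description of anchors preceding each block. `split_blocks` is a hypothetical helper, not pdftract's own round-trip code.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)


def split_blocks(md_text):
    """Return a list of ((page, block, kind), content) pairs in document order."""
    matches = list(ANCHOR_RE.finditer(md_text))
    blocks = []
    for i, match in enumerate(matches):
        # Content runs from the end of this anchor to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(md_text)
        content = md_text[match.end():end].strip()
        key = (int(match.group(1)), int(match.group(2)), match.group(4))
        blocks.append((key, content))
    return blocks


md = (
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->\n"
    "# Introduction\n"
    "\n"
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->\n"
    "This is the first paragraph of the document.\n"
)
for key, content in split_blocks(md):
    print(key, repr(content))
```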
## Edge Cases
### Code Fences
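HTML comments inside code fences (```) are not treated as comments by Markdown renderers; they are rendered verbatim as code. This is a limitation of the Markdown spec, not pdftract. A plain regex scan, such as the Python and JavaScript examples in this guide, still matches anchor-shaped text inside a fence, so consumers that process documents containing code samples may want to skip fenced regions.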
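A consumer that wants to ignore anchor-shaped text inside fenced code can track fence state line by line. The sketch below is a simplified tracker: it handles backtick and tilde fences indented by up to three spaces, but does not claim to cover every CommonMark corner case. `anchors_outside_fences` is a hypothetical helper.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)
# A fence marker: three or more backticks or tildes, indented at most three spaces.
FENCE_RE = re.compile(r'^ {0,3}(`{3,}|~{3,})')


def anchors_outside_fences(md_text):
    """Return (page, block, kind) tuples, skipping fenced code blocks."""
    anchors = []
    open_fence = None  # marker that opened the current code block, if any
    for line in md_text.splitlines():
        fence = FENCE_RE.match(line)
        if open_fence is None:
            if fence:
                open_fence = fence.group(1)
                continue
            match = ANCHOR_RE.search(line)
            if match:
                anchors.append((int(match.group(1)), int(match.group(2)), match.group(4)))
        elif (fence and fence.group(1)[0] == open_fence[0]
              and len(fence.group(1)) >= len(open_fence)):
            open_fence = None  # closing fence: same character, at least as long
    return anchors


md = "\n".join([
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=paragraph -->",
    "Real block.",
    "```text",
    "<!-- pdftract: page=9 block=9 bbox=[0.0,0.0,1.0,1.0] kind=paragraph -->",
    "```",
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->",
    "Another real block.",
])
print(anchors_outside_fences(md))  # [(0, 0, 'paragraph'), (0, 1, 'paragraph')]
```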
### Empty Blocks
Empty blocks (e.g., blank pages) still emit an anchor, followed by empty content.
### Block Index
Block index is **per-page**, not global. Each page starts at block 0. To compute a global index, combine the `page` field with the number of blocks on each preceding page.
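A minimal sketch of deriving document-wide indices from per-page ones: sort the `(page, block)` pairs and number them in order. This assumes the anchors for the whole document are available. If some are missing, the result is a rank among the anchors seen rather than a true global position. `global_indices` is a hypothetical helper.

```python
def global_indices(anchors):
    """Map each (page, block) pair to a zero-based document-wide index."""
    # Tuples sort by page first, then by block, which is document order.
    ordered = sorted({(page, block) for page, block in anchors})
    return {key: index for index, key in enumerate(ordered)}


# Page 0 and page 1 have two blocks each; page 2 has one.
anchors = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
index = global_indices(anchors)
print(index[(1, 0)])  # 2
print(index[(2, 0)])  # 4
```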
## Examples
### Heading with Anchor
```markdown
<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Introduction
```
### Paragraph with Anchor
```markdown
<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph of the document.
```
### Table with Anchor
```markdown
<!-- pdftract: page=1 block=0 bbox=[72.0,400.0,540.0,500.0] kind=table -->
| Column 1 | Column 2 |
|----------|----------|
| Cell 1 | Cell 2 |
```
## Integration Examples
### Python: Extract Anchors
```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)


def extract_anchors(md_text):
    """Return list of (page, block, bbox, kind) tuples."""
    anchors = []
    for match in ANCHOR_RE.finditer(md_text):
        page = int(match.group(1))
        block = int(match.group(2))
        bbox = [float(x) for x in match.group(3).split(',')]
        kind = match.group(4)
        anchors.append((page, block, bbox, kind))
    return anchors
```
### JavaScript: Parse Anchors
```javascript
const ANCHOR_RE = /<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->/g;

function extractAnchors(md) {
  const anchors = [];
  ANCHOR_RE.lastIndex = 0; // the global regex keeps state between calls
  let match;
  while ((match = ANCHOR_RE.exec(md)) !== null) {
    anchors.push({
      page: parseInt(match[1], 10),
      block: parseInt(match[2], 10),
      bbox: match[3].split(',').map(Number),
      kind: match[4]
    });
  }
  return anchors;
}
```
## Version History
- **v0.1.0**: Initial release with `--md-anchors` flag and stable regex schema.