Add --md-anchors flag that emits HTML comment markers before each block in Markdown output, allowing downstream tools to map excerpts back to precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=\[([\d.,]+)\] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc
163 lines
4.2 KiB
Markdown
# Markdown Anchors Integration Guide

This document describes the positional HTML comment anchors feature in pdftract's Markdown output.

## Overview

When `--md-anchors` is enabled, each block in the Markdown output is preceded by a single-line HTML comment containing positional metadata. This allows downstream tools (LLM agents, audit tools, document Q&A systems) to map a Markdown excerpt back to a precise PDF location.

## Anchor Format

Each anchor is a single-line HTML comment:

```markdown
<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
## Chapter 3
```

### Fields

- `page`: Zero-based page index (0, 1, 2, ...)
- `block`: Zero-based block index within the page (0, 1, 2, ...)
- `bbox`: Bounding box in PDF points `[x0, y0, x1, y1]` with 1 decimal place precision
- `kind`: Block kind (`heading`, `paragraph`, `list`, `table`, `figure`, etc.)

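Since `bbox` is expressed in PDF points (1/72 of an inch), a consumer can convert it to physical dimensions with simple arithmetic. The helper below is a minimal sketch; `bbox_size_inches` is a hypothetical name used for illustration and is not part of pdftract.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 of an inch


def bbox_size_inches(bbox):
    """Return (width, height) in inches for a bbox given as [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = bbox
    # abs() keeps the result non-negative regardless of corner ordering.
    return (abs(x1 - x0) / POINTS_PER_INCH, abs(y1 - y0) / POINTS_PER_INCH)


# The heading anchor shown above spans 468 x 31.5 points.
width, height = bbox_size_inches([72.0, 640.5, 540.0, 672.0])
print(width, height)  # 6.5 0.4375
```
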
### Regex Schema

The anchor format is parseable with this stable regex:

```regex
<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->
```

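The `bbox` capture group accepts any run of digits, dots, and commas, so the regex alone does not guarantee four well-formed numbers. The sketch below shows one way a consumer might validate the capture after matching; `parse_bbox` and `parse_anchor` are hypothetical helper names, not pdftract functions.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)


def parse_bbox(raw):
    """Return (x0, y0, x1, y1) as floats, or None if the capture is malformed."""
    parts = raw.split(',')
    if len(parts) != 4:
        return None
    try:
        return tuple(float(p) for p in parts)
    except ValueError:
        # Covers captures such as "1.0,,3.0,4.0" or "1.2.3,0,0,0".
        return None


def parse_anchor(line):
    """Return a dict for the first well-formed anchor in `line`, else None."""
    m = ANCHOR_RE.search(line)
    if m is None:
        return None
    bbox = parse_bbox(m.group(3))
    if bbox is None:
        return None
    return {
        'page': int(m.group(1)),
        'block': int(m.group(2)),
        'bbox': bbox,
        'kind': m.group(4),
    }
```

Returning `None` for malformed input keeps the caller's control flow simple; a stricter consumer could raise an exception instead.
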
## Usage

### CLI

```bash
# Enable anchors in Markdown output
pdftract extract input.pdf --format markdown --md-anchors > output.md
```

### Rust API

```rust
use pdftract_core::markdown::parse_anchors;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the anchored Markdown produced by the CLI
    let md = std::fs::read_to_string("output.md")?;

    // Parse anchors from the Markdown text
    let anchors = parse_anchors(&md);

    for anchor in anchors {
        println!("Page {} Block {} at {:?}", anchor.page, anchor.block, anchor.bbox);
    }
    Ok(())
}
```

## Properties

### Stability

The anchor format is a **stable public API**. The regex schema will not change in a breaking way across minor versions. New fields may be added, but existing fields will remain compatible.

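A consumer that wants to tolerate added fields could read the comment as a set of `key=value` tokens instead of relying on a fixed field order. The sketch below assumes that any future field would appear as an extra whitespace-separated `key=value` pair with no spaces in its value; that layout is an assumption, not something this guide specifies. `parse_fields` is a hypothetical helper, and the `conf` field in the usage line is invented purely for illustration.

```python
import re

# Matches the comment wrapper only; field contents are parsed separately.
COMMENT_RE = re.compile(r'<!--\s*pdftract:\s*(.*?)\s*-->')


def parse_fields(line):
    """Return a dict of raw string fields from a pdftract comment, or None."""
    m = COMMENT_RE.search(line)
    if m is None:
        return None
    fields = {}
    for token in m.group(1).split():
        key, sep, value = token.partition('=')
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# Unknown keys are kept, so known fields can still be read by name.
fields = parse_fields(
    '<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading conf=0.9 -->'
)
print(fields['page'], fields['kind'])  # 3 heading
```
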
### Passthrough

HTML comments pass through every major Markdown renderer, including:
- GitHub
- GitLab
- Obsidian
- Notion import
- pulldown-cmark
- marked
- markdown-it

Anchored output remains human-readable while machines can recover positional metadata.

### Round-trip

A round-trip property holds: extracting a PDF to anchored Markdown and then parsing the anchors recovers the original block list (modulo inline styling, which is lossy in Markdown).

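One way to recover a block list from anchored Markdown is to treat everything between one anchor and the next as that block's content. The following is a minimal sketch under that assumption; `split_blocks` is a hypothetical helper and does not reproduce pdftract's own round-trip test.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)


def split_blocks(md_text):
    """Return (page, block, kind, content) tuples in document order."""
    matches = list(ANCHOR_RE.finditer(md_text))
    blocks = []
    for i, m in enumerate(matches):
        # Content runs from the end of this anchor to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(md_text)
        content = md_text[m.end():end].strip()
        blocks.append((int(m.group(1)), int(m.group(2)), m.group(4), content))
    return blocks
```

An anchor followed immediately by another anchor yields an empty `content` string, which matches the empty-block case described under Edge Cases.
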
## Edge Cases

### Code Fences

HTML comments inside code fences (```) are not recognized as comments by Markdown renderers; they are emitted verbatim as part of the code block. This is a limitation of the Markdown spec, not pdftract.

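A plain regex scan over the whole file will also match anchor-like text inside fenced code blocks (for example, in a document that quotes anchors as samples). The sketch below shows one way to skip fenced regions by scanning line by line. It is a simplified approximation of CommonMark fence handling (it ignores indented code blocks and fences nested in lists or block quotes), and `anchors_outside_fences` is a hypothetical helper name.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)
# A fence is three or more backticks or tildes, indented by at most 3 spaces.
FENCE_RE = re.compile(r'^ {0,3}(`{3,}|~{3,})')


def anchors_outside_fences(md_text):
    """Return (page, block, kind) for anchors that are not inside a code fence."""
    anchors = []
    open_fence = None  # (fence character, fence length) while inside a fence
    for line in md_text.splitlines():
        fence = FENCE_RE.match(line)
        if fence:
            marker = fence.group(1)
            if open_fence is None:
                open_fence = (marker[0], len(marker))
                continue
            # A closing fence uses the same character, is at least as long
            # as the opener, and carries no info string.
            if (marker[0] == open_fence[0]
                    and len(marker) >= open_fence[1]
                    and line.strip() == marker):
                open_fence = None
                continue
        if open_fence is None:
            m = ANCHOR_RE.search(line)
            if m:
                anchors.append((int(m.group(1)), int(m.group(2)), m.group(4)))
    return anchors
```
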
### Empty Blocks

Empty blocks (e.g., blank pages) still emit anchors, followed by empty content.

### Block Index

Block index is **per-page**, not global. Each page starts at block 0. To compute a global index, combine the `page` field with the number of blocks on each preceding page.

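As an illustration, a global index can be derived by ordering all `(page, block)` pairs and numbering them consecutively. This sketch assumes the input contains every anchor in the document, so that no block is missing from the count; `global_indices` is a hypothetical helper name.

```python
def global_indices(anchors):
    """Map each (page, block) pair to a zero-based document-wide index.

    `anchors` is an iterable of tuples whose first two items are the
    page and block numbers, e.g. (page, block, bbox, kind).
    """
    # Sorting by (page, block) gives document order; the set drops duplicates.
    ordered = sorted({(a[0], a[1]) for a in anchors})
    return {key: index for index, key in enumerate(ordered)}


# Page 0 has two blocks, page 1 has one, so page 2 starts at global index 3.
index = global_indices([(0, 0), (0, 1), (1, 0), (2, 0), (2, 1)])
print(index[(2, 0)])  # 3
```
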
## Examples

### Heading with Anchor

```markdown
<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Introduction
```

### Paragraph with Anchor

```markdown
<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph of the document.
```

### Table with Anchor

```markdown
<!-- pdftract: page=1 block=0 bbox=[72.0,400.0,540.0,500.0] kind=table -->
| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |
```

## Integration Examples

### Python: Extract Anchors

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)

def extract_anchors(md_text):
    """Return list of (page, block, bbox, kind) tuples."""
    anchors = []
    for match in ANCHOR_RE.finditer(md_text):
        page = int(match.group(1))
        block = int(match.group(2))
        bbox = [float(x) for x in match.group(3).split(',')]
        kind = match.group(4)
        anchors.append((page, block, bbox, kind))
    return anchors
```

### JavaScript: Parse Anchors

```javascript
const ANCHOR_RE = /<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->/g;

function extractAnchors(md) {
  const anchors = [];
  let match;
  // Reset state so repeated calls with the shared global regex start at index 0
  ANCHOR_RE.lastIndex = 0;
  while ((match = ANCHOR_RE.exec(md)) !== null) {
    anchors.push({
      page: parseInt(match[1], 10),
      block: parseInt(match[2], 10),
      bbox: match[3].split(',').map(Number),
      kind: match[4]
    });
  }
  return anchors;
}
```

## Version History

- **v0.1.0**: Initial release with `--md-anchors` flag and stable regex schema.