Add --md-anchors flag that emits HTML comment markers before each block in Markdown output, allowing downstream tools to map excerpts back to precise PDF locations.
Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=\[([\d.,]+)\] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test
Closes: pdftract-vk0gc
Markdown Anchors Integration Guide
This document describes the positional HTML comment anchors feature in pdftract's Markdown output.
Overview
When --md-anchors is enabled, each block in Markdown output is preceded by a single-line HTML comment containing positional metadata. This allows downstream tools (LLM agents, audit tools, document Q&A systems) to map a Markdown excerpt back to a precise PDF location.
Anchor Format
Each anchor is a single-line HTML comment:
<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
## Chapter 3
Fields
- page: Zero-based page index (0, 1, 2, ...)
- block: Zero-based block index within the page (0, 1, 2, ...)
- bbox: Bounding box in PDF points, [x0, y0, x1, y1], with one decimal place of precision
- kind: Block kind (heading, paragraph, list, table, figure, etc.)
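As a rough illustration of how a consumer might turn the bbox field into something typed, here is a minimal Python sketch. The `BBox` class and `parse_bbox` helper are hypothetical names, not part of pdftract. The sketch also assumes x1 >= x0 and y1 >= y0, which the guide does not state explicitly, so width and height could come out negative if the ordering differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical container for the four bbox values, in PDF points."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def parse_bbox(raw: str) -> BBox:
    """Parse the text between the brackets, e.g. '72.0,640.5,540.0,672.0'."""
    parts = [float(p) for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 bbox values, got {len(parts)}: {raw!r}")
    return BBox(*parts)


box = parse_bbox("72.0,640.5,540.0,672.0")
print(box.width, box.height)  # → 468.0 31.5
```

Rejecting a bbox with the wrong number of values is a design choice of this sketch. A more lenient consumer could skip the anchor instead.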
Regex Schema
The anchor format is parseable with this stable regex:
<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->
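One way a downstream tool might use this regex to map an excerpt back to its PDF location is to find the nearest anchor that precedes the excerpt in the Markdown text. The sketch below is an assumption about how such a lookup could work. `locate_excerpt` is a hypothetical helper, and it only handles excerpts that appear verbatim in the text.

```python
import re

ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)


def locate_excerpt(md_text, excerpt):
    """Return (page, block, bbox, kind) of the nearest anchor before the excerpt, or None."""
    pos = md_text.find(excerpt)
    if pos < 0:
        return None  # excerpt does not occur verbatim
    last = None
    for m in ANCHOR_RE.finditer(md_text):
        if m.start() > pos:
            break  # anchors are visited in document order
        last = m
    if last is None:
        return None  # excerpt appears before the first anchor
    bbox = [float(v) for v in last.group(3).split(",")]
    return int(last.group(1)), int(last.group(2)), bbox, last.group(4)


sample = (
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->\n"
    "# Introduction\n"
    "\n"
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->\n"
    "This is the first paragraph of the document.\n"
)
print(locate_excerpt(sample, "first paragraph"))
# → (0, 1, [72.0, 600.0, 540.0, 630.0], 'paragraph')
```

An excerpt that spans several blocks resolves to the block where it starts. Handling the full span would require collecting every anchor between the excerpt's start and end offsets.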
Usage
CLI
# Enable anchors in markdown output
pdftract extract input.pdf --format markdown --md-anchors > output.md
Rust API
use pdftract_core::markdown::{parse_anchors, Anchor};
// Parse anchors from markdown text
let md = std::fs::read_to_string("output.md")?;
let anchors = parse_anchors(&md);
for anchor in anchors {
    println!("Page {} Block {} at {:?}", anchor.page, anchor.block, anchor.bbox);
}
Properties
Stability
The anchor format is a stable public API. The regex schema will not change in a breaking way across minor versions. New fields may be added, but existing fields will remain compatible.
Passthrough
HTML comments pass through without being displayed in the major Markdown renderers:
- GitHub
- GitLab
- Obsidian
- Notion import
- pulldown-cmark
- marked
- markdown-it
Anchored output remains human-readable while machines can recover positional metadata.
Round-trip
A round-trip property holds: extracting a document and then parsing the anchors from the resulting Markdown recovers the original block list (modulo inline styling, which is lossy in Markdown).
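To make the consumer side of the round trip concrete, the following Python sketch splits anchored Markdown into one record per anchor, with the Markdown that follows it as the block content. `split_blocks` is a hypothetical helper rather than pdftract's own implementation. It assumes that everything between one anchor and the next belongs to the first anchor's block.

```python
import re

ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)


def split_blocks(md_text):
    """Return one dict per anchor, holding its metadata and the Markdown after it."""
    matches = list(ANCHOR_RE.finditer(md_text))
    blocks = []
    for i, m in enumerate(matches):
        # The block's content runs up to the next anchor, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(md_text)
        blocks.append({
            "page": int(m.group(1)),
            "block": int(m.group(2)),
            "bbox": [float(v) for v in m.group(3).split(",")],
            "kind": m.group(4),
            "content": md_text[m.end():end].strip(),
        })
    return blocks


sample = (
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->\n"
    "# Introduction\n"
    "\n"
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->\n"
    "\n"
    "<!-- pdftract: page=1 block=0 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->\n"
    "Second page text.\n"
)
for b in split_blocks(sample):
    print(b["page"], b["block"], b["kind"], repr(b["content"]))
```

An anchor with nothing after it produces a record whose content is the empty string, so empty blocks still appear in the recovered list.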
Edge Cases
Code Fences
HTML comments inside code fences (```) are not treated as comments by Markdown renderers; they are displayed verbatim as part of the code. This is a limitation of the Markdown spec, not pdftract.
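A parser that applies the anchor regex to the raw text will also match anchor-like lines inside a code fence. If that matters for a consumer, one possible approach is to scan line by line and ignore fenced regions, as in the sketch below. `anchors_outside_fences` is a hypothetical helper, and its fence detection is deliberately simplified compared with the CommonMark rules (it ignores fence length and indentation limits).

```python
import re

ANCHOR_RE = re.compile(
    r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->"
)

FENCE = "`" * 3          # a backtick fence marker
TILDE_FENCE = "~" * 3    # the tilde variant


def anchors_outside_fences(md_text):
    """Return (page, block, kind) for anchors that are not inside a fenced code block."""
    found = []
    open_fence = None  # marker that opened the current code block, if any
    for line in md_text.splitlines():
        marker = line.lstrip()[:3]
        if marker in (FENCE, TILDE_FENCE):
            if open_fence is None:
                open_fence = marker      # entering a fenced block
            elif marker == open_fence:
                open_fence = None        # leaving it
            continue
        if open_fence is None:
            m = ANCHOR_RE.search(line)
            if m:
                found.append((int(m.group(1)), int(m.group(2)), m.group(4)))
    return found


sample = "\n".join([
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->",
    "# Introduction",
    FENCE + "text",
    "<!-- pdftract: page=9 block=9 bbox=[0.0,0.0,1.0,1.0] kind=paragraph -->",
    FENCE,
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->",
    "Body text.",
])
print(anchors_outside_fences(sample))
# → [(0, 0, 'heading'), (0, 1, 'paragraph')]
```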
Empty Blocks
Empty blocks (e.g., blank pages) still emit anchors with empty content following.
Block Index
Block index is per-page, not global. Each page starts at block 0. Use the page field to compute global indices if needed.
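As a sketch of that computation, the Python snippet below assigns a document-wide index to each (page, block) pair by sorting the pairs in page order, then block order. `global_indices` is a hypothetical helper. It assumes the input contains an anchor for every block, since a missing anchor would shift all later indices.

```python
def global_indices(pairs):
    """Map each (page, block) pair to its position in document order."""
    ordered = sorted(set(pairs))  # tuples sort by page first, then by block
    return {pair: index for index, pair in enumerate(ordered)}


# (page, block) pairs as they might be read from parsed anchors
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 0)]
index_of = global_indices(pairs)
print(index_of[(1, 0)], index_of[(2, 0)])  # → 2 5
```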
Examples
Heading with Anchor
<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Introduction
Paragraph with Anchor
<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph of the document.
Table with Anchor
<!-- pdftract: page=1 block=0 bbox=[72.0,400.0,540.0,500.0] kind=table -->
| Column 1 | Column 2 |
|----------|----------|
| Cell 1 | Cell 2 |
Integration Examples
Python: Extract Anchors
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)

def extract_anchors(md_text):
    """Return list of (page, block, bbox, kind) tuples."""
    anchors = []
    for match in ANCHOR_RE.finditer(md_text):
        page = int(match.group(1))
        block = int(match.group(2))
        # bbox is captured as "x0,y0,x1,y1"; convert it to a list of floats
        bbox = [float(x) for x in match.group(3).split(',')]
        kind = match.group(4)
        anchors.append((page, block, bbox, kind))
    return anchors
JavaScript: Parse Anchors
const ANCHOR_RE = /<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->/g;

function extractAnchors(md) {
  const anchors = [];
  let match;
  // With the g flag, exec() advances through the string on each call
  while ((match = ANCHOR_RE.exec(md)) !== null) {
    anchors.push({
      page: parseInt(match[1], 10),
      block: parseInt(match[2], 10),
      bbox: match[3].split(',').map(Number),
      kind: match[4]
    });
  }
  return anchors;
}
Version History
- v0.1.0: Initial release with --md-anchors flag and stable regex schema.