jedarden 28c31ba0a1 feat(pdftract-vk0gc): implement markdown anchors with parser regex

Add --md-anchors flag that emits HTML comment markers before each block
in Markdown output, allowing downstream tools to map excerpts back to
precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=[([\d.,]+)] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc

2026-05-24 02:49:16 -04:00


Markdown Anchors Integration Guide

This document describes the positional HTML comment anchors feature in pdftract's Markdown output.

Overview

When --md-anchors is enabled, each block in the Markdown output is preceded by a single-line HTML comment containing positional metadata. This allows downstream tools (LLM agents, audit tools, document Q&A systems) to map a Markdown excerpt back to a precise PDF location.

Anchor Format

Each anchor is a single-line HTML comment:

<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
## Chapter 3

Fields

page: Zero-based page index (0, 1, 2, ...)
block: Zero-based block index within the page (0, 1, 2, ...)
bbox: Bounding box in PDF points [x0, y0, x1, y1], each value given to one decimal place
kind: Block kind (heading, paragraph, list, table, figure, etc.)
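As an illustration of how the bbox field might be consumed, the sketch below parses the four comma-separated values and derives the box size. The `BBox` class and `POINTS_PER_INCH` constant are hypothetical names, not part of pdftract. The conversion assumes the default PDF unit of 1/72 inch per point and no page-level unit scaling.

```python
from dataclasses import dataclass

# Assumption: default PDF user space, where 1 point = 1/72 inch.
POINTS_PER_INCH = 72.0


@dataclass(frozen=True)
class BBox:
    """Hypothetical helper mirroring the anchor's [x0, y0, x1, y1] field."""
    x0: float
    y0: float
    x1: float
    y1: float

    @classmethod
    def parse(cls, text):
        """Parse the text between the square brackets of a bbox field."""
        parts = [float(p) for p in text.split(",")]
        if len(parts) != 4:
            raise ValueError(f"expected 4 bbox values, got {len(parts)}")
        return cls(*parts)

    @property
    def width(self):
        return abs(self.x1 - self.x0)

    @property
    def height(self):
        return abs(self.y1 - self.y0)


box = BBox.parse("72.0,640.5,540.0,672.0")
print(box.width, box.height)        # 468.0 31.5
print(box.width / POINTS_PER_INCH)  # 6.5
```

Taking absolute differences keeps the size correct whichever corner is listed first.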

Regex Schema

The anchor format is parseable with this stable regex:

<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->

Usage

CLI

# Enable anchors in markdown output
pdftract extract input.pdf --format markdown --md-anchors > output.md

Rust API

use pdftract_core::markdown::{parse_anchors, Anchor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read Markdown produced with --md-anchors
    let md = std::fs::read_to_string("output.md")?;

    // Parse anchors from the Markdown text
    let anchors = parse_anchors(&md);

    for anchor in anchors {
        println!("Page {} Block {} at {:?}", anchor.page, anchor.block, anchor.bbox);
    }
    Ok(())
}

Properties

Stability

The anchor format is a stable public API. The regex schema will not change in a breaking way across minor versions. New fields may be added, but existing fields will remain compatible.
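Since new fields may be added, a consumer might prefer a parser that reads the comment as key=value pairs instead of a fixed positional pattern. The sketch below is one way to do that. It assumes that any future field would be a whitespace-separated `name=value` token inside the same comment, which this guide does not specify. The `rotation` field in the usage lines is purely hypothetical, as is the function name `parse_anchor_fields`.

```python
import re

# Capture everything between the "pdftract:" prefix and the closing "-->".
COMMENT_RE = re.compile(r'<!--\s*pdftract:\s*(.*?)\s*-->')
# Assumption: every field is a name=value token with no spaces in the value.
FIELD_RE = re.compile(r'(\w+)=(\S+)')
REQUIRED = ("page", "block", "bbox", "kind")


def parse_anchor_fields(line):
    """Parse one anchor comment into a dict; unknown fields go under 'extra'."""
    comment = COMMENT_RE.search(line)
    if comment is None:
        return None
    fields = dict(FIELD_RE.findall(comment.group(1)))
    if any(name not in fields for name in REQUIRED):
        return None
    return {
        "page": int(fields.pop("page")),
        "block": int(fields.pop("block")),
        "bbox": [float(x) for x in fields.pop("bbox").strip("[]").split(",")],
        "kind": fields.pop("kind"),
        "extra": fields,  # whatever remains is outside the documented fields
    }


current = parse_anchor_fields(
    "<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->"
)
# "rotation" is a made-up field used only to exercise the 'extra' path.
future = parse_anchor_fields(
    "<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading rotation=90 -->"
)
print(current["extra"], future["extra"])
```

The trade-off is looser validation: the documented regex rejects anything that deviates from the exact field order, while this sketch accepts fields in any order.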

Passthrough

HTML comments pass through every major Markdown renderer without appearing in the rendered output:

GitHub
GitLab
Obsidian
Notion import
pulldown-cmark
marked
markdown-it

Anchored output remains human-readable while machines can recover positional metadata.

Round-trip

A round-trip property holds: extracting a PDF with anchors enabled and then parsing those anchors recovers the original block list (modulo inline styling, which is lossy in Markdown).
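To make the recovery step concrete, the following sketch splits anchored Markdown back into a list of blocks, treating everything between one anchor and the next as that block's content. It uses the regex from the Regex Schema section. The function name `split_blocks` and the dict layout are illustrative choices, and this is not the roundtrip test that ships with pdftract.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)


def split_blocks(md_text):
    """Return one dict per anchor, with the Markdown that follows it."""
    matches = list(ANCHOR_RE.finditer(md_text))
    blocks = []
    for i, match in enumerate(matches):
        # Content runs from the end of this anchor to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(md_text)
        blocks.append({
            "page": int(match.group(1)),
            "block": int(match.group(2)),
            "bbox": [float(x) for x in match.group(3).split(",")],
            "kind": match.group(4),
            "content": md_text[match.end():end].strip(),
        })
    return blocks


sample = (
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->\n"
    "# Introduction\n\n"
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->\n"
    "This is the first paragraph of the document.\n"
)
for block in split_blocks(sample):
    print(block["page"], block["block"], block["kind"], repr(block["content"]))
```

A block with nothing after its anchor comes back with an empty `content` string, which matches the empty-block behavior described under Edge Cases.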

Edge Cases

Code Fences

HTML comments inside code fences (```) are not treated as comments by Markdown renderers; they are displayed verbatim as part of the code block. This is a consequence of the Markdown spec, not of pdftract.
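A plain regex scan over the whole text cannot tell whether a match sits inside a fence, so a consumer that cares about the distinction may want to track fences itself. The sketch below is a simplified line-based scanner. It follows the basic fence rule (three or more backticks or tildes, closed by the same character at equal or greater length) but ignores finer points of the Markdown spec such as info strings on closing fences. The function name is hypothetical.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)
# A fence line: up to 3 spaces of indent, then 3+ backticks or 3+ tildes.
FENCE_RE = re.compile(r'^ {0,3}(`{3,}|~{3,})')


def anchors_outside_fences(md_text):
    """Return (page, block, kind) for anchors that are not inside a code fence."""
    anchors = []
    open_fence = None  # marker of the fence we are currently inside, if any
    for line in md_text.splitlines():
        fence = FENCE_RE.match(line)
        if fence:
            marker = fence.group(1)
            if open_fence is None:
                open_fence = marker
            elif marker[0] == open_fence[0] and len(marker) >= len(open_fence):
                open_fence = None
            continue
        if open_fence is None:
            match = ANCHOR_RE.search(line)
            if match:
                anchors.append((int(match.group(1)), int(match.group(2)), match.group(4)))
    return anchors


FENCE = "`" * 3  # built programmatically to keep this example easy to embed
sample = "\n".join([
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->",
    "# Introduction",
    FENCE,
    "<!-- pdftract: page=9 block=9 bbox=[0.0,0.0,1.0,1.0] kind=paragraph -->",
    FENCE,
])
print(anchors_outside_fences(sample))  # [(0, 0, 'heading')]
```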

Empty Blocks

Empty blocks (e.g., on blank pages) still emit an anchor, followed by empty content.

Block Index

The block index is per-page, not global: each page starts at block 0. To compute a global index, combine the page field with the number of blocks on the preceding pages.
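One possible way to derive global indices is to order all (page, block) pairs and number them sequentially. The sketch below does that. It assumes every block in the document produced an anchor, so the parsed pairs are complete. The function name `global_indices` is illustrative.

```python
def global_indices(pairs):
    """Map each (page, block) pair to its position in document order."""
    # Tuples sort by page first, then by block within the page.
    ordered = sorted(set(pairs))
    return {pair: index for index, pair in enumerate(ordered)}


pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
index_of = global_indices(pairs)
print(index_of[(1, 0)])  # 2
print(index_of[(2, 0)])  # 4
```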

Examples

Heading with Anchor

<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->
# Introduction

Paragraph with Anchor

<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->
This is the first paragraph of the document.

Table with Anchor

<!-- pdftract: page=1 block=0 bbox=[72.0,400.0,540.0,500.0] kind=table -->
| Column 1 | Column 2 |
|----------|----------|
| Cell 1   | Cell 2   |

Integration Examples

Python: Extract Anchors

import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)

def extract_anchors(md_text):
    """Return list of (page, block, bbox, kind) tuples."""
    anchors = []
    for match in ANCHOR_RE.finditer(md_text):
        page = int(match.group(1))
        block = int(match.group(2))
        bbox = [float(x) for x in match.group(3).split(',')]
        kind = match.group(4)
        anchors.append((page, block, bbox, kind))
    return anchors
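
The same regex can serve the use case from the Overview: mapping an excerpt back to its PDF location. The sketch below finds the excerpt in the Markdown text and returns the nearest anchor that precedes it. The function name `locate_excerpt` is hypothetical, and the approach assumes the excerpt appears verbatim in the Markdown and that the first occurrence is the one wanted.

```python
import re

ANCHOR_RE = re.compile(
    r'<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->'
)


def locate_excerpt(md_text, excerpt):
    """Return the anchor fields of the block containing `excerpt`, or None."""
    pos = md_text.find(excerpt)
    if pos == -1:
        return None
    # The governing anchor is the last one that starts at or before the excerpt.
    last = None
    for match in ANCHOR_RE.finditer(md_text):
        if match.start() > pos:
            break
        last = match
    if last is None:
        return None
    return {
        "page": int(last.group(1)),
        "block": int(last.group(2)),
        "bbox": [float(x) for x in last.group(3).split(",")],
        "kind": last.group(4),
    }


sample = (
    "<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->\n"
    "# Introduction\n\n"
    "<!-- pdftract: page=0 block=1 bbox=[72.0,600.0,540.0,630.0] kind=paragraph -->\n"
    "This is the first paragraph of the document.\n"
)
hit = locate_excerpt(sample, "first paragraph")
print(hit["page"], hit["block"], hit["bbox"])
```

An excerpt that spans two blocks resolves to the block where it starts. Callers that need both blocks would have to search for the end of the excerpt as well.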

JavaScript: Parse Anchors

// The global flag lets exec() advance through every anchor in the text.
const ANCHOR_RE = /<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->/g;

function extractAnchors(md) {
    const anchors = [];
    ANCHOR_RE.lastIndex = 0; // reset shared regex state before scanning
    let match;
    while ((match = ANCHOR_RE.exec(md)) !== null) {
        anchors.push({
            page: parseInt(match[1], 10),
            block: parseInt(match[2], 10),
            bbox: match[3].split(',').map(Number),
            kind: match[4]
        });
    }
    return anchors;
}

Version History

v0.1.0: Initial release with --md-anchors flag and stable regex schema.
