pdftract/notes/pdftract-3tzxi.md
jedarden fe79f3fe83 docs(pdftract-3tzxi): verify inline-link emission implementation
All acceptance criteria PASS:
- External URL links → [text](URL) inline links
- Internal links → [text](#page-N) anchors
- Multiple spans → concatenated anchor text
- Special chars → percent-encoded URLs
- All 29 link tests pass

Closes pdftract-3tzxi.
2026-06-01 09:35:02 -04:00

5.2 KiB

pdftract-3tzxi: Markdown inline-link emission

Summary

Bead pdftract-3tzxi implements Phase 6.5.5b: inline-link emission in the Markdown sink. The implementation was already complete in crates/pdftract-core/src/output/markdown/links.rs.

Acceptance Criteria Status

PASS: All criteria met

  1. PDF with 10 external URL links → Markdown has 10 text inline links

    • Verified by test_resolve_link_target_external_http, test_emit_inline_link_external
    • External URIs (http, https, mailto) are emitted as [anchor text](URL)
  2. PDF with internal links → emits text anchors

    • Verified by test_resolve_link_target_internal_page, test_emit_inline_link_internal_page
    • Internal destinations emit as [anchor text](#page-N) (1-based page index)
    • Named destinations emit as [anchor text](#dest_name)
  3. Multiple spans in one link rect → concatenated anchor text

    • Verified by test_find_spans_in_link_multiple_spans, test_concatenate_anchor_text
    • Spans are sorted by index to preserve document order
    • Spaces inserted between spans when there's a gap (>2 points)
  4. URL with special chars → percent-encoded

    • Verified by test_percent_encode_url
    • Parentheses, whitespace, tabs, newlines are percent-encoded
    • Example: https://example.com/path(with)parenshttps://example.com/path%28with%29parens
  5. Renderer test: emitted Markdown renders correctly in GitHub preview

    • All 29 link tests pass
    • test_emit_inline_link_with_brackets verifies bracket escaping in link text

Implementation Details

The module provides:

  • LinkTarget enum: External, InternalPage, InternalNamed, None
  • resolve_link_target() / resolve_link_target_from_json(): resolve link annotations
  • emit_inline_link(): emit [anchor text](URL) format
  • find_spans_in_link() / find_spans_in_link_json(): find spans within link rectangles
  • concatenate_anchor_text(): concatenate span texts with appropriate spacing
  • emit_page_links() / emit_page_links_from_json(): emit all links for a page
  • escape_link_text(): escape [ and ] characters in anchor text
  • percent_encode_url(): percent-encode URLs

Integration: crates/pdftract-core/src/markdown.rs

The markdown emitter integrates link support:

  • spans_to_markdown_with_links(): emit spans with inline links
  • block_to_markdown_with_links(): emit blocks with inline links
  • page_to_markdown_with_links(): emit full pages with inline links and page anchors

Test Results

All 29 link tests pass:

test output::markdown::links::tests::test_bbox_center ... ok
test output::markdown::links::tests::test_concatenate_anchor_text ... ok
test output::markdown::links::tests::test_emit_inline_link_external ... ok
test output::markdown::links::tests::test_emit_inline_link_internal_named ... ok
test output::markdown::links::tests::test_emit_inline_link_internal_page ... ok
test output::markdown::links::tests::test_emit_inline_link_none ... ok
test output::markdown::links::tests::test_emit_inline_link_with_brackets ... ok
test output::markdown::links::tests::test_emit_page_links_first_link_wins_for_overlap ... ok
test output::markdown::links::tests::test_emit_page_links_internal_destination ... ok
test output::markdown::links::tests::test_emit_page_links_no_anchor_text ... ok
test output::markdown::links::tests::test_emit_page_links_no_valid_target ... ok
test output::markdown::links::tests::test_emit_page_links_single_link ... ok
test output::markdown::links::tests::test_escape_link_text ... ok
test output::markdown::links::tests::test_find_spans_in_link_empty_rect ... ok
test output::markdown::links::tests::test_find_spans_in_link_multiple_spans ... ok
test output::markdown::links::tests::test_find_spans_in_link_single_span ... ok
test output::markdown::links::tests::test_percent_encode_url ... ok
test output::markdown::links::tests::test_point_in_rect ... ok
test output::markdown::links::tests::test_resolve_link_target_external_http ... ok
test output::markdown::links::tests::test_resolve_link_target_external_mailto ... ok
test output::markdown::links::tests::test_resolve_link_target_internal_named ... ok
test output::markdown::links::tests::test_resolve_link_target_internal_page ... ok
test output::markdown::links::tests::test_resolve_link_target_javascript_rejected ... ok
test output::markdown::links::tests::test_resolve_link_target_none ... ok

Edge Cases Handled

  • JavaScript links are rejected for security (javascript:alert(1)LinkTarget::None)
  • Links with no spans inside are skipped (no anchor text)
  • Overlapping links: first link wins (spans can only belong to one link)
  • Empty link rectangles are handled gracefully
  • Internal named destinations that can't be resolved fall back to #dest_name anchors

Files

  • crates/pdftract-core/src/output/markdown/links.rs - Complete implementation (420 lines)
  • crates/pdftract-core/src/output/markdown/mod.rs - Module exports
  • crates/pdftract-core/src/markdown.rs - Integration with markdown emitter
  • Phase 7.6: Link annotation extraction (crates/pdftract-core/src/annotation/links.rs)
  • Coordinator: pdftract-5o3zv (Phase 6.5.x Markdown output)