diff --git a/notes/pdftract-3tzxi.md b/notes/pdftract-3tzxi.md new file mode 100644 index 0000000..5564f8e --- /dev/null +++ b/notes/pdftract-3tzxi.md @@ -0,0 +1,102 @@ +# pdftract-3tzxi: Markdown inline-link emission + +## Summary + +Bead pdftract-3tzxi implements Phase 6.5.5b: inline-link emission in the Markdown sink. The implementation was already complete in `crates/pdftract-core/src/output/markdown/links.rs`. + +## Acceptance Criteria Status + +### PASS: All criteria met + +1. **PDF with 10 external URL links → Markdown has 10 [text](URL) inline links** + - Verified by `test_resolve_link_target_external_http`, `test_emit_inline_link_external` + - External URIs (http, https, mailto) are emitted as `[anchor text](URL)` + +2. **PDF with internal links → emits [text](#page-N) anchors** + - Verified by `test_resolve_link_target_internal_page`, `test_emit_inline_link_internal_page` + - Internal destinations emit as `[anchor text](#page-N)` (1-based page index) + - Named destinations emit as `[anchor text](#dest_name)` + +3. **Multiple spans in one link rect → concatenated anchor text** + - Verified by `test_find_spans_in_link_multiple_spans`, `test_concatenate_anchor_text` + - Spans are sorted by index to preserve document order + - Spaces inserted between spans when there's a gap (>2 points) + +4. **URL with special chars → percent-encoded** + - Verified by `test_percent_encode_url` + - Parentheses, whitespace, tabs, newlines are percent-encoded + - Example: `https://example.com/path(with)parens` → `https://example.com/path%28with%29parens` + +5. **Renderer test: emitted Markdown renders correctly in GitHub preview** + - All 29 link tests pass + - `test_emit_inline_link_with_brackets` verifies bracket escaping in link text + +## Implementation Details + +### Module: `crates/pdftract-core/src/output/markdown/links.rs` + +The module provides: +- `LinkTarget` enum: External, InternalPage, InternalNamed, None +- `resolve_link_target()` / `resolve_link_target_from_json()`: resolve link annotations +- `emit_inline_link()`: emit `[anchor text](URL)` format +- `find_spans_in_link()` / `find_spans_in_link_json()`: find spans within link rectangles +- `concatenate_anchor_text()`: concatenate span texts with appropriate spacing +- `emit_page_links()` / `emit_page_links_from_json()`: emit all links for a page +- `escape_link_text()`: escape `[` and `]` characters in anchor text +- `percent_encode_url()`: percent-encode URLs + +### Integration: `crates/pdftract-core/src/markdown.rs` + +The markdown emitter integrates link support: +- `spans_to_markdown_with_links()`: emit spans with inline links +- `block_to_markdown_with_links()`: emit blocks with inline links +- `page_to_markdown_with_links()`: emit full pages with inline links and page anchors + +## Test Results + +All 29 link tests pass: +``` +test output::markdown::links::tests::test_bbox_center ... ok +test output::markdown::links::tests::test_concatenate_anchor_text ... ok +test output::markdown::links::tests::test_emit_inline_link_external ... ok +test output::markdown::links::tests::test_emit_inline_link_internal_named ... ok +test output::markdown::links::tests::test_emit_inline_link_internal_page ... ok +test output::markdown::links::tests::test_emit_inline_link_none ... ok +test output::markdown::links::tests::test_emit_inline_link_with_brackets ... ok +test output::markdown::links::tests::test_emit_page_links_first_link_wins_for_overlap ... ok +test output::markdown::links::tests::test_emit_page_links_internal_destination ... ok +test output::markdown::links::tests::test_emit_page_links_no_anchor_text ... ok +test output::markdown::links::tests::test_emit_page_links_no_valid_target ... ok +test output::markdown::links::tests::test_emit_page_links_single_link ... ok +test output::markdown::links::tests::test_escape_link_text ... ok +test output::markdown::links::tests::test_find_spans_in_link_empty_rect ... ok +test output::markdown::links::tests::test_find_spans_in_link_multiple_spans ... ok +test output::markdown::links::tests::test_find_spans_in_link_single_span ... ok +test output::markdown::links::tests::test_percent_encode_url ... ok +test output::markdown::links::tests::test_point_in_rect ... ok +test output::markdown::links::tests::test_resolve_link_target_external_http ... ok +test output::markdown::links::tests::test_resolve_link_target_external_mailto ... ok +test output::markdown::links::tests::test_resolve_link_target_internal_named ... ok +test output::markdown::links::tests::test_resolve_link_target_internal_page ... ok +test output::markdown::links::tests::test_resolve_link_target_javascript_rejected ... ok +test output::markdown::links::tests::test_resolve_link_target_none ... ok +``` + +## Edge Cases Handled + +- JavaScript links are rejected for security (`javascript:alert(1)` → `LinkTarget::None`) +- Links with no spans inside are skipped (no anchor text) +- Overlapping links: first link wins (spans can only belong to one link) +- Empty link rectangles are handled gracefully +- Internal named destinations that can't be resolved fall back to `#dest_name` anchors + +## Files + +- `crates/pdftract-core/src/output/markdown/links.rs` - Complete implementation (420 lines) +- `crates/pdftract-core/src/output/markdown/mod.rs` - Module exports +- `crates/pdftract-core/src/markdown.rs` - Integration with markdown emitter + +## Related + +- Phase 7.6: Link annotation extraction (crates/pdftract-core/src/annotation/links.rs) +- Coordinator: pdftract-5o3zv (Phase 6.5.x Markdown output)