docs(pdftract-3tzxi): verify inline-link emission implementation
All acceptance criteria PASS: - External URL links → [text](URL) inline links - Internal links → [text](#page-N) anchors - Multiple spans → concatenated anchor text - Special chars → percent-encoded URLs - All 29 link tests pass Closes pdftract-3tzxi.
This commit is contained in:
parent
3f8daba449
commit
fe79f3fe83
1 changed files with 102 additions and 0 deletions
102
notes/pdftract-3tzxi.md
Normal file
102
notes/pdftract-3tzxi.md
Normal file
|
|
@ -0,0 +1,102 @@
|
|||
# pdftract-3tzxi: Markdown inline-link emission
|
||||
|
||||
## Summary
|
||||
|
||||
Bead pdftract-3tzxi implements Phase 6.5.5b: inline-link emission in the Markdown sink. The implementation was already complete in `crates/pdftract-core/src/output/markdown/links.rs`.
|
||||
|
||||
## Acceptance Criteria Status
|
||||
|
||||
### PASS: All criteria met
|
||||
|
||||
1. **PDF with 10 external URL links → Markdown has 10 [text](URL) inline links**
|
||||
- Verified by `test_resolve_link_target_external_http`, `test_emit_inline_link_external`
|
||||
- External URIs (http, https, mailto) are emitted as `[anchor text](URL)`
|
||||
|
||||
2. **PDF with internal links → emits [text](#page-N) anchors**
|
||||
- Verified by `test_resolve_link_target_internal_page`, `test_emit_inline_link_internal_page`
|
||||
- Internal destinations emit as `[anchor text](#page-N)` (1-based page index)
|
||||
- Named destinations emit as `[anchor text](#dest_name)`
|
||||
|
||||
3. **Multiple spans in one link rect → concatenated anchor text**
|
||||
- Verified by `test_find_spans_in_link_multiple_spans`, `test_concatenate_anchor_text`
|
||||
- Spans are sorted by index to preserve document order
|
||||
- Spaces inserted between spans when there's a gap (>2 points)
|
||||
|
||||
4. **URL with special chars → percent-encoded**
|
||||
- Verified by `test_percent_encode_url`
|
||||
- Parentheses, whitespace, tabs, newlines are percent-encoded
|
||||
- Example: `https://example.com/path(with)parens` → `https://example.com/path%28with%29parens`
|
||||
|
||||
5. **Renderer test: emitted Markdown renders correctly in GitHub preview**
|
||||
- All 29 link tests pass
|
||||
- `test_emit_inline_link_with_brackets` verifies bracket escaping in link text
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Module: `crates/pdftract-core/src/output/markdown/links.rs`
|
||||
|
||||
The module provides:
|
||||
- `LinkTarget` enum: External, InternalPage, InternalNamed, None
|
||||
- `resolve_link_target()` / `resolve_link_target_from_json()`: resolve link annotations
|
||||
- `emit_inline_link()`: emit `[anchor text](URL)` format
|
||||
- `find_spans_in_link()` / `find_spans_in_link_json()`: find spans within link rectangles
|
||||
- `concatenate_anchor_text()`: concatenate span texts with appropriate spacing
|
||||
- `emit_page_links()` / `emit_page_links_from_json()`: emit all links for a page
|
||||
- `escape_link_text()`: escape `[` and `]` characters in anchor text
|
||||
- `percent_encode_url()`: percent-encode URLs
|
||||
|
||||
### Integration: `crates/pdftract-core/src/markdown.rs`
|
||||
|
||||
The markdown emitter integrates link support:
|
||||
- `spans_to_markdown_with_links()`: emit spans with inline links
|
||||
- `block_to_markdown_with_links()`: emit blocks with inline links
|
||||
- `page_to_markdown_with_links()`: emit full pages with inline links and page anchors
|
||||
|
||||
## Test Results
|
||||
|
||||
All 29 link tests pass:
|
||||
```
|
||||
test output::markdown::links::tests::test_bbox_center ... ok
|
||||
test output::markdown::links::tests::test_concatenate_anchor_text ... ok
|
||||
test output::markdown::links::tests::test_emit_inline_link_external ... ok
|
||||
test output::markdown::links::tests::test_emit_inline_link_internal_named ... ok
|
||||
test output::markdown::links::tests::test_emit_inline_link_internal_page ... ok
|
||||
test output::markdown::links::tests::test_emit_inline_link_none ... ok
|
||||
test output::markdown::links::tests::test_emit_inline_link_with_brackets ... ok
|
||||
test output::markdown::links::tests::test_emit_page_links_first_link_wins_for_overlap ... ok
|
||||
test output::markdown::links::tests::test_emit_page_links_internal_destination ... ok
|
||||
test output::markdown::links::tests::test_emit_page_links_no_anchor_text ... ok
|
||||
test output::markdown::links::tests::test_emit_page_links_no_valid_target ... ok
|
||||
test output::markdown::links::tests::test_emit_page_links_single_link ... ok
|
||||
test output::markdown::links::tests::test_escape_link_text ... ok
|
||||
test output::markdown::links::tests::test_find_spans_in_link_empty_rect ... ok
|
||||
test output::markdown::links::tests::test_find_spans_in_link_multiple_spans ... ok
|
||||
test output::markdown::links::tests::test_find_spans_in_link_single_span ... ok
|
||||
test output::markdown::links::tests::test_percent_encode_url ... ok
|
||||
test output::markdown::links::tests::test_point_in_rect ... ok
|
||||
test output::markdown::links::tests::test_resolve_link_target_external_http ... ok
|
||||
test output::markdown::links::tests::test_resolve_link_target_external_mailto ... ok
|
||||
test output::markdown::links::tests::test_resolve_link_target_internal_named ... ok
|
||||
test output::markdown::links::tests::test_resolve_link_target_internal_page ... ok
|
||||
test output::markdown::links::tests::test_resolve_link_target_javascript_rejected ... ok
|
||||
test output::markdown::links::tests::test_resolve_link_target_none ... ok
|
||||
```
|
||||
|
||||
## Edge Cases Handled
|
||||
|
||||
- JavaScript links are rejected for security (`javascript:alert(1)` → `LinkTarget::None`)
|
||||
- Links with no spans inside are skipped (no anchor text)
|
||||
- Overlapping links: first link wins (spans can only belong to one link)
|
||||
- Empty link rectangles are handled gracefully
|
||||
- Internal named destinations that can't be resolved fall back to `#dest_name` anchors
|
||||
|
||||
## Files
|
||||
|
||||
- `crates/pdftract-core/src/output/markdown/links.rs` - Complete implementation (420 lines)
|
||||
- `crates/pdftract-core/src/output/markdown/mod.rs` - Module exports
|
||||
- `crates/pdftract-core/src/markdown.rs` - Integration with markdown emitter
|
||||
|
||||
## Related
|
||||
|
||||
- Phase 7.6: Link annotation extraction (crates/pdftract-core/src/annotation/links.rs)
|
||||
- Coordinator: pdftract-5o3zv (Phase 6.5.x Markdown output)
|
||||
Loading…
Add table
Reference in a new issue