pdftract/notes/pdftract-56yz8.md
jedarden 3618e6fd2c feat(pdftract-56yz8): implement span_to_markdown inline span styling (Phase 6.5)
Add span_to_markdown function that translates span flags to Markdown:
- Bold (bit 0) → **text**
- Italic (bit 1) → *text*
- Bold+italic → ***text***
- Subscript (bit 3) → <sub>text</sub>
- Superscript (bit 4) → <sup>text</sup>
- Smallcaps (bit 2) → <span style="font-variant: small-caps">text</span>
- Color-only differences: no styling
- Escapes CommonMark special characters

Tests cover all acceptance criteria:
- Bold+italic combination
- Subscript/superscript emission
- Smallcaps HTML span
- Special character escaping
- Whitespace-only edge cases

Closes: pdftract-56yz8
2026-05-25 11:49:44 -04:00

Bead pdftract-56yz8: Inline Span Styling (Phase 6.5)

Summary

Implemented the span_to_markdown function, which translates span flag bitmask values to Markdown inline syntax per Phase 6.5 of the plan (lines 2188-2195).

Changes Made

File: crates/pdftract-core/src/markdown.rs

  1. Added SpanJson import to the module

  2. Implemented span_to_markdown(span: &SpanJson) -> String:

    • Reads span flags vector (Vec<String>) for style indicators
    • Emits appropriate Markdown syntax based on flags
    • Handles combinations: bold+italic → ***text***
    • Handles script nesting: **<sub>text</sub>** (scripts inside bold/italic)
    • Handles smallcaps+script: **<span><sup>text</sup></span>** (scripts inside smallcaps)
    • Skips whitespace-only spans (no point styling whitespace)
    • Color-only differences: no styling emitted
  3. Implemented escape_markdown_inline(s: &str) -> String:

    • Escapes CommonMark special characters: \ ` * _ [ ] ( ) # ! + < >
    • Does NOT escape - . = (not special in inline context per CommonMark)
  4. Added comprehensive test coverage (20+ tests):

    • Bold, italic, bold+italic combinations
    • Subscript, superscript, smallcaps individually
    • Combined styling (bold+subscript, italic+superscript, all flags)
    • Special character escaping
    • Whitespace-only edge cases
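
The nesting rules listed under item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in markdown.rs: the SpanJson struct here is a hypothetical stand-in with only two fields, the flag names ("bold", "italic", "subscript", "superscript", "smallcaps") are assumed string values, and the choice to prefer subscript when both script flags are set is arbitrary. Escaping is left out to keep the nesting order visible.

```rust
/// Hypothetical stand-in for the crate's `SpanJson`; field names are assumptions.
pub struct SpanJson {
    pub text: String,
    pub flags: Vec<String>,
}

/// Sketch of the nesting order: script tag innermost, then the smallcaps
/// span, then the bold/italic markers outermost.
pub fn span_to_markdown_sketch(span: &SpanJson) -> String {
    // Whitespace-only spans are returned unstyled, avoiding output like `** **`.
    if span.text.trim().is_empty() {
        return span.text.clone();
    }

    // Assumed flag representation: one lowercase string per style.
    let has = |name: &str| span.flags.iter().any(|f| f == name);

    // Escaping of special characters is omitted in this sketch.
    let mut out = span.text.clone();

    // Innermost wrapper: subscript or superscript (subscript wins by assumption).
    if has("subscript") {
        out = format!("<sub>{out}</sub>");
    } else if has("superscript") {
        out = format!("<sup>{out}</sup>");
    }

    // Smallcaps wraps any script tag.
    if has("smallcaps") {
        out = format!("<span style=\"font-variant: small-caps\">{out}</span>");
    }

    // Outermost wrapper: emphasis markers; no flags means no markers.
    let marker = match (has("bold"), has("italic")) {
        (true, true) => "***",
        (true, false) => "**",
        (false, true) => "*",
        (false, false) => "",
    };
    format!("{marker}{out}{marker}")
}
```

Under these assumptions, a span with text "x" and flags bold and subscript yields **<sub>x</sub>**, and a span with no recognized flags (for example, one that differs only in color) comes back as plain text.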

File: crates/pdftract-core/src/lib.rs

  • Exported span_to_markdown from the markdown module for public API

Acceptance Criteria Status

| Criterion | Test | Status |
|---|---|---|
| Bold + italic → ***text*** | test_span_to_markdown_bold_italic | PASS |
| Subscript → <sub>2</sub> | test_span_to_markdown_subscript | PASS |
| Superscript → <sup>th</sup> | test_span_to_markdown_superscript | PASS |
| Smallcaps → <span style="font-variant: small-caps">CAPS</span> | test_span_to_markdown_smallcaps | PASS |
| Color-only difference: no styling | test_span_to_markdown_no_flags | PASS |
| Special chars escaped: "1*2" → "1\*2" | test_span_to_markdown_special_chars_escaped | PASS |

Test Results

cargo test --package pdftract-core --lib markdown
test result: ok. 43 passed; 0 failed

All acceptance criteria tests pass.

Implementation Notes

  1. Nesting order: Following the plan's guidance on combined styles, script tags are placed inside bold/italic wrappers. For smallcaps+script combinations, smallcaps wraps scripts (e.g., <span><sup>text</sup></span>).

  2. Escaping: Implemented per the CommonMark spec: only characters that have special meaning in inline Markdown context are escaped. Characters like - and . are NOT escaped because they are special only at line start (lists, thematic breaks), not inline.

  3. Edge cases: Whitespace-only spans skip styling entirely to avoid emitting empty formatting like ** **.
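
The escaping rule in note 2 can be sketched as below. This is an illustrative version assuming the character set listed under Changes Made (\ ` * _ [ ] ( ) # ! + < >); the function name carries a _sketch suffix to mark it as hypothetical, and the actual escape_markdown_inline in markdown.rs may be structured differently.

```rust
/// Sketch of inline escaping: prefix each listed special character with a
/// backslash and pass everything else through unchanged.
pub fn escape_markdown_inline_sketch(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        // Character set taken from the note; `-`, `.` and `=` are deliberately absent.
        if matches!(
            c,
            '\\' | '`' | '*' | '_' | '[' | ']' | '(' | ')' | '#' | '!' | '+' | '<' | '>'
        ) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```

With this sketch, the input 1*2 becomes 1\*2, while a-b.c=d passes through unchanged.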

Commits

  • pdftract-core: Add span_to_markdown function with inline span styling (Phase 6.5)