pdftract/notes/pdftract-56yz8.md
jedarden 3618e6fd2c feat(pdftract-56yz8): implement span_to_markdown inline span styling (Phase 6.5)
Add span_to_markdown function that translates span flags to Markdown:
- Bold (bit 0) → **text**
- Italic (bit 1) → *text*
- Bold+italic → ***text***
- Subscript (bit 3) → <sub>text</sub>
- Superscript (bit 4) → <sup>text</sup>
- Smallcaps (bit 2) → <span style="font-variant: small-caps">text</span>
- Color-only differences: no styling
- Escapes CommonMark special characters

Tests cover all acceptance criteria:
- Bold+italic combination
- Subscript/superscript emission
- Smallcaps HTML span
- Special character escaping
- Whitespace-only edge cases

Closes: pdftract-56yz8
2026-05-25 11:49:44 -04:00


# Bead pdftract-56yz8: Inline Span Styling (Phase 6.5)
## Summary
Implemented `span_to_markdown`, which translates span flag bitmask values into Markdown inline syntax, per Phase 6.5 of the plan (lines 2188-2195).
## Changes Made
### File: `crates/pdftract-core/src/markdown.rs`
1. Added `SpanJson` import to the module
2. Implemented `span_to_markdown(span: &SpanJson) -> String`:
   - Reads span flags vector (`Vec<String>`) for style indicators
   - Emits appropriate Markdown syntax based on flags
   - Handles combinations: bold+italic → `***text***`
   - Handles script nesting: `**<sub>text</sub>**` (scripts inside bold/italic)
   - Handles smallcaps+script: `**<span><sup>text</sup></span>**` (scripts inside smallcaps)
   - Skips whitespace-only spans (no point styling whitespace)
   - Color-only differences: no styling emitted
3. Implemented `escape_markdown_inline(s: &str) -> String`:
   - Escapes CommonMark special characters: `\` `` ` `` `*` `_` `[` `]` `(` `)` `#` `!` `+` `<` `>`
   - Does NOT escape `-` `.` `=` (not special in inline context per CommonMark)
4. Added comprehensive test coverage (20+ tests):
   - Bold, italic, bold+italic combinations
   - Subscript, superscript, smallcaps individually
   - Combined styling (bold+subscript, italic+superscript, all flags)
   - Special character escaping
   - Whitespace-only edge cases
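
As a minimal sketch of the escaping rule in item 3 (not the code in `markdown.rs`, which these notes do not reproduce), the listed character set could be handled with a single pass that prefixes each special character with a backslash:

```rust
/// Sketch only: backslash-escape the characters these notes list as
/// escaped. `-`, `.` and `=` are deliberately left untouched.
pub fn escape_markdown_inline(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        // Character set copied from the list in these notes.
        if matches!(
            ch,
            '\\' | '`' | '*' | '_' | '[' | ']' | '(' | ')' | '#' | '!' | '+' | '<' | '>'
        ) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}
```

With this sketch, `1*2` becomes `1\*2`, while `a-b.c=d` passes through unchanged.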
### File: `crates/pdftract-core/src/lib.rs`
- Exported `span_to_markdown` from the markdown module for public API
## Acceptance Criteria Status
| Criterion | Test | Status |
|-----------|------|--------|
| Bold + italic → `***text***` | `test_span_to_markdown_bold_italic` | PASS |
| Subscript → `<sub>2</sub>` | `test_span_to_markdown_subscript` | PASS |
| Superscript → `<sup>th</sup>` | `test_span_to_markdown_superscript` | PASS |
| Smallcaps → `<span style="font-variant: small-caps">CAPS</span>` | `test_span_to_markdown_smallcaps` | PASS |
| Color-only difference: no styling | `test_span_to_markdown_no_flags` | PASS |
| Special chars escaped: `1*2` → `1\*2` | `test_span_to_markdown_special_chars_escaped` | PASS |
## Test Results
```
cargo test --package pdftract-core --lib markdown
test result: ok. 43 passed; 0 failed
```
All acceptance criteria tests pass.
## Implementation Notes
1. **Nesting order**: Following the plan guidance (emit `**<sub>text</sub>**`, not `<sub>**text**</sub>`), script tags are placed inside bold/italic wrappers. For smallcaps+script combinations, the smallcaps span wraps the script tag (e.g., `<span><sup>text</sup></span>`).
2. **Escaping**: Implemented per the CommonMark spec: only characters that have special meaning in inline Markdown context are escaped. Characters like `-` and `.` are NOT escaped because they are special only at line start (for lists/HR), not inline.
3. **Edge cases**: Whitespace-only spans skip styling entirely to avoid emitting empty formatting like `** **`.
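
The nesting order and whitespace rule in these notes can be sketched as follows. This is an illustration, not the actual implementation: `SpanJson` here is a hypothetical stand-in with only `text` and `flags` fields, the flag strings (`"bold"`, `"italic"`, `"subscript"`, `"superscript"`, `"smallcaps"`) are assumed values, `escape_inline_sketch` is a placeholder helper, and nesting subscript inside superscript when both are set is an arbitrary choice.

```rust
/// Hypothetical stand-in for the crate's `SpanJson`; the real struct's
/// fields are not shown in these notes.
pub struct SpanJson {
    pub text: String,
    pub flags: Vec<String>,
}

impl SpanJson {
    /// Convenience constructor for the sketch (an assumption, not crate API).
    pub fn new(text: &str, flags: &[&str]) -> Self {
        SpanJson {
            text: text.to_string(),
            flags: flags.iter().map(|f| f.to_string()).collect(),
        }
    }
}

/// Placeholder escaper: backslash-escapes the character set listed in the notes.
fn escape_inline_sketch(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        if "\\`*_[]()#!+<>".contains(ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Sketch of the wrapping order: script tags innermost, then the
/// smallcaps span, then bold/italic emphasis outermost.
pub fn span_to_markdown(span: &SpanJson) -> String {
    // Whitespace-only (or empty) spans are returned unstyled,
    // which avoids output such as `** **`.
    if span.text.trim().is_empty() {
        return span.text.clone();
    }
    // Assumed flag spelling; the real flag strings may differ.
    let has = |name: &str| span.flags.iter().any(|f| f == name);

    let mut out = escape_inline_sketch(&span.text);
    if has("subscript") {
        out = format!("<sub>{out}</sub>");
    }
    if has("superscript") {
        out = format!("<sup>{out}</sup>");
    }
    if has("smallcaps") {
        out = format!("<span style=\"font-variant: small-caps\">{out}</span>");
    }
    let marker = match (has("bold"), has("italic")) {
        (true, true) => "***",
        (true, false) => "**",
        (false, true) => "*",
        (false, false) => "",
    };
    format!("{marker}{out}{marker}")
}
```

Under these assumptions, `SpanJson::new("text", &["bold", "subscript"])` yields `**<sub>text</sub>**`, and a whitespace-only span is returned unchanged regardless of its flags.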
## Commits
- `pdftract-core`: Add span_to_markdown function with inline span styling (Phase 6.5)