pdftract/notes/pdftract-8n270.md
jedarden d3c4ecd268 feat(pdftract-8n270): implement code block detection
Implement Phase 4.4 code block classification for detecting indented
monospace code blocks.

Features:
- is_monospace_font_name: Check font name for monospace indicators
  (mono, courier, code, fixed, console - case-insensitive)
- is_fixed_pitch_flag: Check FontDescriptor bit 0 (FixedPitch)
- classify_code: Classify block as code if all spans monospace AND
  indented ≥ 2em from column baseline
- classify_page_code_blocks: Post-processing pass to upgrade paragraph
  blocks to code kind

Acceptance criteria:
- All-Courier, indented 24pt, font_size 12pt (2em=24): Code ✓
- All-monospace, not indented: NOT Code ✓
- Mixed serif+monospace: NOT Code ✓
- One serif span at end: NOT Code ✓
- FixedPitch flag set, no "Mono" in name: STILL Code ✓

Closes: pdftract-8n270

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 10:04:22 -04:00

# Code Block Detection (pdftract-8n270)
## Summary
Implemented code block classification (Phase 4.4) for detecting indented monospace code blocks.
## Implementation
Created new module `crates/pdftract-core/src/layout/code.rs` with:
1. **`is_monospace_font_name(font_name: &str) -> bool`**
   - Checks whether the font name (with any subset prefix stripped) contains a monospace indicator
   - Indicators: "mono", "courier", "code", "fixed", "console" (case-insensitive)
2. **`is_fixed_pitch_flag(flags: Option<u32>) -> bool`**
   - Checks whether the FixedPitch flag is set in the FontDescriptor flags
   - FixedPitch is the low-order bit (mask `0x1`): bit 0 when counting from zero, listed as bit position 1 in the PDF spec, which numbers flag bits from 1
3. **`is_monospace_span(font_name: &str, flags: Option<u32>) -> bool`**
   - Combines both checks: a span is monospace if the name OR the FixedPitch flag indicates it
4. **`classify_code<S>(block, column_baseline_x0, font_size) -> bool`**
   - Classifies a block as code if:
     - ALL spans use a monospace font
     - The block is indented ≥ 2em from the column baseline (2 × font_size)
5. **`compute_column_baseline<S>(blocks) -> f32`**
   - Computes the median x0 of non-code paragraph blocks in the column
   - Represents the typical left edge of body text for indentation comparison
6. **`classify_page_code_blocks<S>(blocks)`**
   - Post-processing pass that upgrades paragraph blocks to "code" kind
   - Uses the column baseline and monospace detection
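
The three font checks listed above could be sketched roughly as follows. This is an illustration, not the contents of `code.rs`: the `strip_subset_prefix` helper, the `MONOSPACE_INDICATORS` and `FIXED_PITCH` constants, and the exact prefix rule (six uppercase ASCII letters followed by `+`) are assumptions made for the example.

```rust
/// Assumed indicator list, taken from the note; matching is case-insensitive.
const MONOSPACE_INDICATORS: [&str; 5] = ["mono", "courier", "code", "fixed", "console"];

/// FixedPitch is the low-order bit of the FontDescriptor /Flags value.
const FIXED_PITCH: u32 = 0x1;

/// Hypothetical helper: drop a subset tag such as "ABCDEF+" if one is present.
/// Assumes the tag is exactly six uppercase ASCII letters followed by '+'.
fn strip_subset_prefix(font_name: &str) -> &str {
    match font_name.split_once('+') {
        Some((prefix, rest))
            if prefix.len() == 6 && prefix.bytes().all(|b| b.is_ascii_uppercase()) =>
        {
            rest
        }
        _ => font_name,
    }
}

/// True if the (prefix-stripped, lowercased) name contains any indicator.
pub fn is_monospace_font_name(font_name: &str) -> bool {
    let name = strip_subset_prefix(font_name).to_ascii_lowercase();
    MONOSPACE_INDICATORS.iter().any(|ind| name.contains(*ind))
}

/// True if flags are present and the FixedPitch bit is set.
pub fn is_fixed_pitch_flag(flags: Option<u32>) -> bool {
    flags.map_or(false, |f| (f & FIXED_PITCH) != 0)
}

/// A span counts as monospace if either signal fires.
pub fn is_monospace_span(font_name: &str, flags: Option<u32>) -> bool {
    is_monospace_font_name(font_name) || is_fixed_pitch_flag(flags)
}
```

Stripping the subset tag before matching matters because the tag itself is arbitrary uppercase letters: under these assumptions, a name like `MONOAB+Times-Roman` would otherwise match "mono" by accident.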
## Acceptance Criteria
| Criterion | Status | Notes |
|-----------|--------|-------|
| All-Courier, indented 24pt, font_size 12pt (2em=24) | ✅ PASS | `classify_code` returns true |
| All-monospace, not indented | ✅ PASS | `classify_code` returns false |
| Mixed serif+monospace | ✅ PASS | `classify_code` returns false |
| One serif span at end | ✅ PASS | `classify_code` returns false |
| FixedPitch flag set, no "Mono" in name | ✅ PASS | Still classified as code |
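
To make the criteria in the table concrete, here is a minimal sketch of the classification rule. The note names a `MonospaceSpan` trait but does not show its methods, so the `is_monospace` method, the `Block` struct, and the `DemoSpan` / `demo_block` helpers are all hypothetical. Treating a block with no spans as non-code is also an assumption.

```rust
/// Hypothetical trait shape; only the trait name comes from the note.
pub trait MonospaceSpan {
    fn is_monospace(&self) -> bool;
}

/// Hypothetical block: left edge in points plus its spans.
pub struct Block<S> {
    pub x0: f32,
    pub spans: Vec<S>,
}

/// Code if every span is monospace AND the block is indented by at least
/// 2em (2 x font_size) from the column baseline.
pub fn classify_code<S: MonospaceSpan>(
    block: &Block<S>,
    column_baseline_x0: f32,
    font_size: f32,
) -> bool {
    // Assumption: an empty block is never code (`all` alone would return true).
    let all_monospace =
        !block.spans.is_empty() && block.spans.iter().all(|s| s.is_monospace());
    let indent = block.x0 - column_baseline_x0;
    all_monospace && indent >= 2.0 * font_size
}

/// Demo span whose monospace status is decided up front.
pub struct DemoSpan {
    pub monospace: bool,
}

impl MonospaceSpan for DemoSpan {
    fn is_monospace(&self) -> bool {
        self.monospace
    }
}

/// Build a demo block from a left edge and one monospace flag per span.
pub fn demo_block(x0: f32, monospace_flags: &[bool]) -> Block<DemoSpan> {
    Block {
        x0,
        spans: monospace_flags
            .iter()
            .map(|&monospace| DemoSpan { monospace })
            .collect(),
    }
}
```

With a baseline of 72pt and a 12pt font, `demo_block(96.0, &[true, true])` is indented exactly 24pt and classifies as code, while the same spans at `x0 = 72.0`, or any block containing a single non-monospace span, do not.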
## Files Modified
- `crates/pdftract-core/src/layout/code.rs` (new)
- `crates/pdftract-core/src/layout/mod.rs` (exported code module)
## Testing
All unit tests pass (107 passed, 0 failed):
```bash
cargo test --package pdftract-core --lib code
```
Test coverage includes:
- Font name matching (Courier, Mono, Code, Fixed, Console)
- FixedPitch flag detection
- Monospace span detection
- Code block classification
- Column baseline computation
- Page-level code block upgrade
## Design Notes
1. **MonospaceSpan trait**: Allows code detection to work with different span representations
2. **Font subset prefixes**: Correctly strips "ABCDEF+" prefixes before checking font names
3. **2em threshold**: As specified in the plan, uses 2 × font_size as the indentation requirement
4. **Post-processing approach**: Code detection runs after block formation (Phase 4.4)
5. **Median baseline**: Uses median (not mean) for robustness against outliers
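
The choice of median over mean in note 5 can be illustrated with a small sketch. The function names `median_x0` and `mean_x0`, the `Option` return type, and averaging the two middle values for an even count are assumptions for the example. The note only states that `compute_column_baseline` returns the median as an `f32`.

```rust
/// Median of the given left edges, or None for an empty slice.
/// Assumption: an even count averages the two middle values.
pub fn median_x0(x0s: &[f32]) -> Option<f32> {
    if x0s.is_empty() {
        return None;
    }
    let mut sorted = x0s.to_vec();
    // total_cmp gives a total order on f32, so sorting cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}

/// Arithmetic mean, shown only for comparison with the median.
pub fn mean_x0(x0s: &[f32]) -> Option<f32> {
    if x0s.is_empty() {
        return None;
    }
    Some(x0s.iter().sum::<f32>() / x0s.len() as f32)
}
```

For left edges of 72, 72, 300, 72, 72 (four body paragraphs and one far-right outlier such as a pull quote), the median stays at 72 while the mean is pulled past 100, which would push the 2em threshold well to the right of the real text edge.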
## Integration
The code module is now exported from `layout/mod.rs` and is ready for integration into the extraction pipeline. The post-processing pass `classify_page_code_blocks` can be called after `group_lines_into_blocks` to upgrade paragraph blocks to code blocks.
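
As a rough picture of what the page-level pass does once blocks exist, the sketch below upgrades paragraph blocks in place. It assumes a single column, and the `PageBlock` struct, its fields, the `BlockKind` enum, and the `column_baseline` and `demo_page` helpers are hypothetical stand-ins rather than the crate's actual types. The real function is generic over a span type and takes the output of `group_lines_into_blocks`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BlockKind {
    Paragraph,
    Code,
}

/// Hypothetical, flattened view of a block for this sketch.
#[derive(Debug, Clone)]
pub struct PageBlock {
    pub kind: BlockKind,
    pub x0: f32,
    pub font_size: f32,
    pub all_monospace: bool,
}

/// Median x0 over blocks that are currently paragraphs (assumed single column).
fn column_baseline(blocks: &[PageBlock]) -> Option<f32> {
    let mut x0s: Vec<f32> = blocks
        .iter()
        .filter(|b| b.kind == BlockKind::Paragraph)
        .map(|b| b.x0)
        .collect();
    if x0s.is_empty() {
        return None;
    }
    x0s.sort_by(|a, b| a.total_cmp(b));
    let mid = x0s.len() / 2;
    Some(if x0s.len() % 2 == 1 {
        x0s[mid]
    } else {
        (x0s[mid - 1] + x0s[mid]) / 2.0
    })
}

/// Upgrade paragraph blocks that are all-monospace and indented >= 2em.
pub fn classify_page_code_blocks(blocks: &mut [PageBlock]) {
    let baseline = match column_baseline(blocks) {
        Some(x) => x,
        None => return, // no paragraph blocks, nothing to compare against
    };
    for block in blocks.iter_mut() {
        let indented = block.x0 - baseline >= 2.0 * block.font_size;
        if block.kind == BlockKind::Paragraph && block.all_monospace && indented {
            block.kind = BlockKind::Code;
        }
    }
}

/// Demo page: body text at 72pt, one indented and one flush-left monospace block.
pub fn demo_page() -> Vec<PageBlock> {
    let para = |x0: f32, all_monospace: bool| PageBlock {
        kind: BlockKind::Paragraph,
        x0,
        font_size: 12.0,
        all_monospace,
    };
    vec![
        para(72.0, false),
        para(72.0, false),
        para(96.0, true), // indented monospace
        para(72.0, true), // flush-left monospace
        para(72.0, false),
    ]
}
```

Running the pass over `demo_page()` upgrades only the third block. The flush-left monospace block stays a paragraph, which is the limitation recorded under TODO.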
## TODO
Per plan line 1726: "Indent threshold may miss flush-left code; add TODO."
- Flush-left code blocks (no indentation) are currently NOT detected as code
- This is intentional per the acceptance criteria ("not indented: NOT Code")
- Future enhancement could detect flush-left code via additional heuristics
## References
- Plan section: Phase 4.4 (line 1708)
- Bead: pdftract-8n270
- ISO 32000-1 Table 123 (font flags; FixedPitch is bit position 1 in the spec's 1-based numbering, i.e. the low-order bit, mask `0x1`)