pdftract/docs
jedarden 26bdd255c8 feat(pdftract-ilen): implement header row detection with bold+TH support
Implement header row detection for tables using two signals:
1. Bold font detection (fully implemented)
2. StructTree TH detection (stub pending MCID tracking)

Bold detection:
- is_bold_font(): detects bold fonts from PostScript name patterns
- is_cell_bold(): checks if all non-whitespace content in a cell is bold
- is_bold_header_row(): validates rows with >=2 bold cells
- count_header_rows(): counts contiguous bold headers from top
- Cell::mark_header_rows(): sets is_header_row flag on cells

TH detection (stub):
- is_th_header_row(): placeholder for StructTree TH detection
  Requires MCID tracking on TableSpan (future work)
  Will use ParentTree to map MCIDs to StructElems
  Will verify TR > TH chain structure

Combined detection:
- is_header_row(): combines bold and TH signals
- Bold wins on conflict per body data design principle

Documentation:
- Updated table-structure-reconstruction.md with full header detection spec
- Documented implemented vs pending signals
- Added implementation notes for TH detection

Tests:
- 45 tests covering all bold detection scenarios
- Tests for multi-row headers (contiguous from top)
- Tests for single-cell row exclusion
- Tests for empty/whitespace cell handling
- Placeholder tests for TH detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:32:54 -04:00
..
adr feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction 2026-05-23 12:30:26 -04:00
conformance feat(pdftract-5omc): implement SDK conformance test runner pattern 2026-05-18 01:22:23 -04:00
notes feat(pdftract-5omc): implement per-language conformance test runner pattern 2026-05-18 01:32:24 -04:00
plan docs(plan): SDKs are monorepo members, not separate repos 2026-05-22 07:21:45 -04:00
research feat(pdftract-ilen): implement header row detection with bold+TH support 2026-05-23 23:32:54 -04:00
security docs(pdftract-58kz): add security policy documentation 2026-05-20 19:39:24 -04:00
user-docs docs(pdftract-1g87): create mdBook scaffolding for user documentation 2026-05-18 00:38:51 -04:00
research-index.md Add parallel extraction research and comprehensive research index 2026-05-16 16:30:35 -04:00